How to filter out duplicate data when inserting large amounts of data into MySQL

1. Discover the problem
2. Delete all duplicate data without leaving any
3. Delete duplicate data from the delete table and keep only one
4. Start deleting duplicate data and keep only one

Preface:

The reason for working overtime is to go online and solve the problem of duplicate data in the online database. bug in the program was discovered, which was easily solved. The problem was a bit severe in correcting the duplicate data online.

There are duplicate data in 6 tables in the online library, 2 of which are relatively large, one with more than 960,000 and the other with more than 300,000. Because I have dealt with the same problem before, I directly used the Python deduplication script from last time. The script is very simple, which is to connect to the database, find out the duplicate data, and delete it in a loop.

Emmmm, but the efficiency is too low, one piece per second, about 20,000+ duplicate data, and the estimated time is about 8 hours. . .

There is a problem in blindly relying on the things of predecessors without thinking for yourself! Always wondering why it worked before and why it doesn’t work now is also a problem! I found that I have been in a bad state recently and have lost the desire to explore and seek knowledge. Today is a wake-up call for me and I feel like I have finally found my way back to the right path.

Now, let's get back to the topic. The deduplication steps are introduced in detail below.

1. Discover the problem

CREATE TABLE `animal` (  
  `id` int(11) NOT NULL AUTO_INCREMENT,  
  `name` varchar(20) DEFAULT NULL,  
  `age` int(11) DEFAULT NULL,  
  PRIMARY KEY (`id`)  
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;  
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('1', 'cat', '12');  
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('2', 'dog', '13');  
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('3', 'camel', '25');  
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('4', 'cat', '32');  
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES ('5', 'dog', '42');

Goal: We want to remove data with the same name .

First, see which data is repeated.

SELECT name,count( 1 )   
FROM  
 student   
GROUP BY  
NAME   
HAVING  
 count( 1 ) > 1;

Output:

name count(1) cat 2 dog 2

The data name cat and dog are repeated, and each repeated data has two entries;

Select * From Table Where Repeating Field In (Select Repeating Field From Table Group By Repeating Field Having Count(1)>1)

2. Delete all duplicate data without leaving any

Direct deletion will result in an error.

DELETE   
FROM  
 student   
WHERE  
 NAME IN (  
 SELECT NAME   
 FROM  
  student   
 GROUP BY  
 NAME   
HAVING  
 count( 1 ) > 1)

Error:

1093 - You can't specify target table 'student' for update in FROM clause, Time: 0.016000s

The reason is: while updating this table, this table is queried, while querying this table, this table is updated again, which can be understood as a deadlock. MySQL does not support this operation of updating and querying the same table.

Solution: Query the columns of data to be updated as a third-party table, and then filter and update them.

DELETE   
FROM  
 student   
WHERE  
 NAME IN (  
 SELECT  
  t.NAME   
FROM  
 ( SELECT NAME FROM student GROUP BY NAME HAVING count( 1 ) > 1 ) t)

3. Delete duplicate data from the delete table and keep only one

Before deleting, we can check what kind of duplicate data we want to delete.

SELECT  
 *   
FROM  
 student   
WHERE  
 id NOT IN (  
 SELECT  
  t.id   
 FROM  
 ( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t   
 )

What does this mean? First, we group by name and find the data with the smallest id. These data are the fire seeds we want to keep. Then, we query the data with ids that are not in it, which are the duplicate data we want to delete.

4. Start deleting duplicate data and keep only one

It's very simple. Just replace select delete .

DELETE   
FROM  
 student   
WHERE  
 id NOT IN (  
 SELECT  
  t.id   
 FROM  
 ( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t   
 )

The execution of a table with 900,000+ records is super fast.

This is the end of this article about how to filter out duplicate data during MySQL bulk inserts. For more information about how to filter out duplicate data in MySQL, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope you will support 123WORDPRESS.COM in the future!

You may also be interested in:

MYSQL Must Know Reading Notes Chapter 6 Filtering Data
MYSQL uses regular expressions to filter data
Comparison of storage engines supported by MySQL database
MySQL helps you understand index pushdown in seconds
Python numpy implements multiple loops to read files and filter data at equal intervals
Explain in detail how to remove or filter certain values or rows in a data set in pandas?
How to filter data using orderBy and filter in angularJs

<<: Responsive Web Design Learning (3) - How to improve the performance of web pages on mobile devices

>>: base target="" controls the link's target open frame

CSS description of the implementation code for displaying text at the end of the horizontal progress bar

How to filter out duplicate data when inserting large amounts of data into MySQL

Table of contents

1. Discover the problem

2. Delete all duplicate data without leaving any

3. Delete duplicate data from the delete table and keep only one

4. Start deleting duplicate data and keep only one

CSS description of the implementation code for displaying text at the end of the horizontal progress bar

jQuery clicks on the love effect

Detailed explanation of CSS image splicing technology (sprite image)

Detailed explanation of Vue's calculated properties

In-depth discussion of memory principles: Are variables stored in the heap or stack in JS?

How to prevent the scroll bar from affecting the page width when the scroll bar appears on the page

Introduction to using data URI scheme to embed images in web pages

A brief introduction to Linux environment variable files

How to add custom system services to CentOS7 systemd

Detailed tutorial on installing mysql 5.7.26 on centOS7.4

Recommend

Teach you to quickly build a web cluster project based on nginx

Detailed explanation of the solution to the problem that the content pointed to by the iframe's src does not refresh

Summarize the common application problems of XHTML code

Detailed explanation of Bind mounts for Docker data storage

Alibaba Cloud Ubuntu 16.04 builds IPSec service

How to use nginx to access local static resources on Linux server

MySQL turns off password strength verification

Detailed explanation of the role and principle of key in Vue

Mysql NULL caused the pit

In-depth analysis of Nginx virtual host

mysql installer community 8.0.12.0 installation graphic tutorial

Detailed explanation of redo log and undo log in MySQL

How to smoothly upgrade and rollback Nginx version in 1 minute

How to use MySQL 5.7 temporary tablespace to avoid pitfalls

Problems and solutions when installing MySQL8.0.13 on Win10 system