How to filter out duplicate data when inserting large amounts of data into MySQL

Preface:

I worked overtime to fix a duplicate-data problem in the production database. The bug in the program that caused it was found and easily fixed. Cleaning up the duplicate rows already in production was the harder part.

Six tables in the production database contained duplicate data. Two of them are relatively large, one with more than 960,000 rows and the other with more than 300,000. Because I had dealt with the same problem before, I simply reused the Python deduplication script from last time. The script is very simple: it connects to the database, finds the duplicate rows, and deletes them one at a time in a loop.

Hmm, but it was far too slow: about one row per second. With 20,000+ duplicate rows, the estimated run time was about 8 hours...

Blindly reusing what worked before without thinking for yourself is a problem. Endlessly wondering why it worked then and doesn't work now is also a problem. I realized I have been in a bad state recently and had lost the desire to explore and learn. Today was a wake-up call, and I feel I have finally found my way back to the right path.

Now, let's get back to the topic. The deduplication steps are introduced in detail below.

1. Discover the problem

CREATE TABLE `student` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(20) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES (1, 'cat', 12);
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES (2, 'dog', 13);
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES (3, 'camel', 25);
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES (4, 'cat', 32);
INSERT INTO `pilipa_dds`.`student` (`id`, `name`, `age`) VALUES (5, 'dog', 42);


Goal: we want to remove rows that share the same name.

First, see which data is repeated.

SELECT name, COUNT(1)
FROM student
GROUP BY name
HAVING COUNT(1) > 1;


Output:

name    count(1)
cat     2
dog     2

The names cat and dog are repeated, and each of them appears twice.

To list the full duplicate rows rather than just the counts, the general pattern is:

SELECT * FROM table_name WHERE dup_field IN (SELECT dup_field FROM table_name GROUP BY dup_field HAVING COUNT(1) > 1);
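The two queries above can be tried without a MySQL server. The following is only a minimal sketch: it assumes Python's built-in sqlite3 module with an in-memory table as a stand-in for the MySQL `student` table, so the engine and connection setup are not the ones from this article, only the SQL is.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for the MySQL `student` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student (id, name, age) VALUES (?, ?, ?)",
    [(1, "cat", 12), (2, "dog", 13), (3, "camel", 25), (4, "cat", 32), (5, "dog", 42)],
)

# Names that occur more than once, with how often they occur.
dup_counts = conn.execute(
    "SELECT name, COUNT(1) FROM student GROUP BY name HAVING COUNT(1) > 1 ORDER BY name"
).fetchall()

# Full rows whose name is one of the repeated names.
dup_rows = conn.execute(
    """
    SELECT * FROM student
    WHERE name IN (SELECT name FROM student GROUP BY name HAVING COUNT(1) > 1)
    ORDER BY id
    """
).fetchall()

print(dup_counts)  # [('cat', 2), ('dog', 2)]
print(dup_rows)    # [(1, 'cat', 12), (2, 'dog', 13), (4, 'cat', 32), (5, 'dog', 42)]
```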

2. Delete all duplicate data without leaving any

Direct deletion will result in an error.

DELETE
FROM student
WHERE name IN (
  SELECT name
  FROM student
  GROUP BY name
  HAVING COUNT(1) > 1
);


Error:

1093 - You can't specify target table 'student' for update in FROM clause, Time: 0.016000s 


The reason: the statement deletes from student while its subquery reads from the same table. MySQL does not allow a statement to modify a table and select from that same table in a subquery. It is not a real deadlock, although it can be loosely pictured that way, since the table would have to be read and changed at the same time.

Solution: select the values to be deleted into an intermediate derived table (a subquery in the FROM clause with its own alias, t below), then filter against that result and delete.

DELETE
FROM student
WHERE name IN (
  SELECT t.name
  FROM ( SELECT name FROM student GROUP BY name HAVING COUNT(1) > 1 ) t
);
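As a rough illustration of what this statement leaves behind, here is a minimal sketch that runs the same derived-table DELETE from Python. It assumes the built-in sqlite3 module as a stand-in for MySQL, and the helper name delete_all_duplicates is made up for this example. Note that SQLite does not raise error 1093 for the direct form, so the sketch only shows the result of the delete, not the MySQL restriction itself.

```python
import sqlite3

def delete_all_duplicates(conn):
    """Delete every row whose name occurs more than once; return the rows deleted."""
    cur = conn.execute(
        """
        DELETE FROM student
        WHERE name IN (
            SELECT t.name
            FROM (SELECT name FROM student GROUP BY name HAVING COUNT(1) > 1) AS t
        )
        """
    )
    conn.commit()
    return cur.rowcount

# Assumption: an in-memory SQLite table stands in for the MySQL `student` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student (id, name, age) VALUES (?, ?, ?)",
    [(1, "cat", 12), (2, "dog", 13), (3, "camel", 25), (4, "cat", 32), (5, "dog", 42)],
)

deleted = delete_all_duplicates(conn)
remaining = conn.execute("SELECT id, name, age FROM student ORDER BY id").fetchall()
print(deleted)    # 4
print(remaining)  # [(3, 'camel', 25)]
```

Only camel survives, because every row of a repeated name is removed, including the first one.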

3. Find the duplicate data to delete, keeping only one row per name

Before deleting anything, we can first query exactly which duplicate rows would be deleted.

SELECT *
FROM student
WHERE id NOT IN (
  SELECT t.id
  FROM ( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t
);


What does this mean? First, we group by name and take the smallest id in each group. These are the "seed" rows we want to keep, one per name. Then we query the rows whose id is not among them, and those are the duplicate rows we want to delete.
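The two halves of that explanation can be checked separately on the sample data. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for MySQL. The alias keep_id and the variable names are introduced here for illustration only.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for the MySQL `student` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO student (id, name, age) VALUES (?, ?, ?)",
    [(1, "cat", 12), (2, "dog", 13), (3, "camel", 25), (4, "cat", 32), (5, "dog", 42)],
)

# Step 1: the smallest id of each name group, i.e. the rows to keep.
keep_ids = [
    row[0]
    for row in conn.execute(
        "SELECT MIN(id) AS keep_id FROM student GROUP BY name ORDER BY keep_id"
    )
]

# Step 2: every row whose id is not in that set, i.e. the rows to delete.
rows_to_delete = conn.execute(
    """
    SELECT * FROM student
    WHERE id NOT IN (
        SELECT t.id FROM (SELECT MIN(id) AS id FROM student GROUP BY name) AS t
    )
    ORDER BY id
    """
).fetchall()

print(keep_ids)        # [1, 2, 3]
print(rows_to_delete)  # [(4, 'cat', 32), (5, 'dog', 42)]
```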

4. Start deleting duplicate data and keep only one

It's very simple: just replace SELECT * with DELETE.

DELETE
FROM student
WHERE id NOT IN (
  SELECT t.id
  FROM ( SELECT MIN( id ) AS id FROM student GROUP BY `name` ) t
);


Run against the table with 900,000+ rows, this single statement finished very quickly, unlike the row-by-row script.
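Since this delete runs against production data, one cautious way to apply it is inside a transaction that is only committed once the duplicate check from step 1 comes back empty. The following is a minimal sketch of that idea, not the procedure used in this article. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, and the helper names make_demo_db and dedupe_keep_min_id are hypothetical.

```python
import sqlite3

def make_demo_db():
    """Assumption: an in-memory SQLite table stands in for the MySQL `student` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO student (id, name, age) VALUES (?, ?, ?)",
        [(1, "cat", 12), (2, "dog", 13), (3, "camel", 25), (4, "cat", 32), (5, "dog", 42)],
    )
    conn.commit()
    return conn

def dedupe_keep_min_id(conn):
    """Keep the smallest id per name, delete the rest, and commit only if no duplicates remain."""
    cur = conn.execute(
        """
        DELETE FROM student
        WHERE id NOT IN (
            SELECT t.id FROM (SELECT MIN(id) AS id FROM student GROUP BY name) AS t
        )
        """
    )
    deleted = cur.rowcount
    # Re-run the duplicate check from step 1 before making the delete permanent.
    leftover = conn.execute(
        "SELECT name FROM student GROUP BY name HAVING COUNT(1) > 1"
    ).fetchall()
    if leftover:
        conn.rollback()
        raise RuntimeError(f"duplicates still present: {leftover}")
    conn.commit()
    return deleted

conn = make_demo_db()
deleted = dedupe_keep_min_id(conn)
remaining = conn.execute("SELECT id, name, age FROM student ORDER BY id").fetchall()
print(deleted)    # 2
print(remaining)  # [(1, 'cat', 12), (2, 'dog', 13), (3, 'camel', 25)]
```

The delete is still one set-based statement, so the check adds only a single extra GROUP BY query.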
