One sql statement completes MySQL deduplication and keeps one

A few days ago, when I was working on a requirement, I needed to clean up duplicate records in MySQL. At that time, I thought of writing it out through code traversal, but then I felt it was too complicated, and I thought that the problem should be solved through a SQL statement. After checking the information and consulting the experts, I came up with a very convenient SQL statement. Here I share this SQL statement and ideas.

Demand Analysis

There are duplicate records in the database, delete one and keep one (the basis for determining whether it is a duplicate is multiple fields)

Solution

When I encountered this requirement, I probably had some ideas in mind. The first thing that comes to my mind is that it can be solved with a SQL statement, but I am not very experienced in complex SQL statements, so I want to ask an expert for help.

Find someone to help

Because this requirement is a bit urgent, the first thing I thought of was to find a colleague in this field to solve it, and then share this problem with my colleague. As a result, this guy just Baidu and gave me a SQL statement that I had never used before, and asked me to try it myself. A lot of things rushed through my heart...

Baidu

Found a sql statement:

DELETE
FROM
 vitamin a
WHERE
 (a.peopleId, a.seq) IN (
  SELECT
   peopleId,
   seq
  FROM
   vitae
  GROUP BY
   peopleId,
   seq
  HAVING
   count(*) > 1
 )
AND rowid NOT IN (
 SELECT
  min(rowid)
 FROM
  vitae
 GROUP BY
  peopleId,
  seq
 HAVING
  count(*) > 1
)

This statement is found in the article [Delete duplicate data in MySQL and keep only one]. The idea of this SQL statement is very clear, and there are three steps:

SELECT peopleId, seq FROM vitae GROUP BY peopleId, seq HAVING count(*) > 1 to find duplicate records in the table as a condition

SELECT min(rowid) FROM vitae GROUP BY peopleId, seq HAVING count(*) > 1 Query the smallest ID value in the duplicate records in the table as the second condition

Finally, according to the above two conditions, delete the remaining duplicate records except the smallest ID in the duplicate records

But unfortunately, an error occurred when running this statement. The general meaning of the error is that the table cannot be updated at the same time as the query.

Code Solution

Based on the above SQL statement, it is possible to achieve the same purpose in two steps through code:

Remove the duplicate data sets first

According to the queried data set, loop to delete the remaining duplicate data

I had the idea and wrote it quickly, but I was shocked when I ran it. It took about 116 seconds . Then I thought I must find a SQL statement that can be used. I posted the code and the running results:

Perfect [Remove Duplicates and Keep One] SQL

Finally, I got the perfect answer in a technical group. Look at this SQL statement:

DELETE consum_record
FROM
 consum_record, 
 (
  SELECT
   min(id) id,
   user_id,
   monetary,
   consume_time
  FROM
   consum_record
  GROUP BY
   user_id,
   monetary,
   consume_time
  HAVING
   count(*) > 1
 ) t2
WHERE
 consum_record.user_id = t2.user_id 
 and consum_record.monetary = t2.monetary
 and consum_record.consume_time = t2.consume_time
AND consum_record.id > t2.id;

If you look carefully at the above SQL statement, it is not difficult to figure out the idea. It can be understood in three steps:

(SELECT min(id) id, user_id, monetary, consume_time FROM consum_record GROUP BY user_id, monetary, consume_time HAVING count(*) > 1 ) t2 (temporary table t2), which contains the minimum ID of each duplicate record

consum_record.user_id = t2.user_id and consum_record.monetary = t2.monetary and consum_record.consume_time = t2.consume_time are the fields that are used to determine the duplication basis.

According to the conditions, delete the records in the original table whose id is greater than the id in t2

When I saw this sentence, I thought it was amazing. Such a simple SQL statement can actually solve such a complex problem. It's really interesting!

It also runs super fast. The original code loop execution takes about 116 seconds , but here it only takes 0.3 seconds . Amazing~

Summarize

As a PHP programmer, it stands to reason that SQL should not lag behind. However, in reality, there are too many things to do and my current SQL level is only at an average level. I will find an opportunity to improve my knowledge in this area in the future.

That’s all for today.

You may also be interested in: