Summary of SQL deduplication methods

When extracting data with SQL, we often run into duplicate values in a table. For example, to compute UV (unique visitors), we need to deduplicate.

In MySQL, distinct or group by is usually used. In SQL dialects that support window functions (such as Hive SQL, Oracle, and MySQL 8.0 and later), the ROW_NUMBER window function can also be used for deduplication.

For example, suppose there is a table Task with the following fields:

  • task_id : task id
  • order_id : order id
  • start_time : start time

Note: one task corresponds to multiple orders, so the same task_id appears in multiple rows.

We need to find the total number of tasks. Since task_id is not unique, we need to remove duplicates:

distinct

-- List all unique values of task_id (after deduplication)
select distinct task_id
from Task;

-- Total number of tasks
select count(distinct task_id) task_num
from Task;


distinct can be inefficient on large tables. It is typically used together with count to compute the number of distinct entries, rather than to display the deduplicated rows themselves.
distinct is placed directly after select and applies to the combination of all the fields that follow it. For example, if there are two fields after distinct, then the two records 1,1 and 1,2 are not duplicates of each other.
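
The behaviour described above can be checked with a small sketch. It uses Python's built-in sqlite3 module purely as a convenient way to run the SQL; the sample rows are hypothetical, since the article does not show the contents of Task.

```python
import sqlite3

# Hypothetical sample rows; the real contents of Task are not shown in the article.
conn = sqlite3.connect(":memory:")
conn.execute("create table Task (task_id integer, order_id integer, start_time text)")
conn.executemany(
    "insert into Task values (?, ?, ?)",
    [
        (1, 101, "2021-01-01 09:00"),
        (1, 102, "2021-01-01 10:00"),
        (2, 201, "2021-01-02 09:00"),
        (2, 201, "2021-01-02 09:00"),  # exact duplicate of the previous row
    ],
)

# distinct on one field: one row per task_id
task_ids = conn.execute(
    "select distinct task_id from Task order by task_id"
).fetchall()

# distinct on two fields: rows are duplicates only if BOTH fields match
pairs = conn.execute(
    "select distinct task_id, order_id from Task order by task_id, order_id"
).fetchall()

# count(distinct ...) returns the number of tasks directly
task_num = conn.execute("select count(distinct task_id) from Task").fetchone()[0]

print(task_ids)  # [(1,), (2,)]
print(pairs)     # [(1, 101), (1, 102), (2, 201)]
print(task_num)  # 2
```

Note that (1, 101) and (1, 102) both survive the two-field distinct, while the repeated (2, 201) row collapses into one.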

group by

-- List all unique values of task_id (after deduplication; null also counts as a value)
select task_id
from Task
group by task_id;

-- Total number of tasks (count(task_id) does not count the null group)
select count(task_id) task_num
from (select task_id
      from Task
      group by task_id) tmp;
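
The null handling is easy to overlook, so here is a minimal sketch of it, again run through Python's sqlite3 module with hypothetical rows (two of them with a null task_id). It shows that group by keeps null as a group of its own, while count(task_id) on the grouped result skips it.

```python
import sqlite3

# Hypothetical sample rows, including two with a null task_id.
conn = sqlite3.connect(":memory:")
conn.execute("create table Task (task_id integer, order_id integer)")
conn.executemany(
    "insert into Task values (?, ?)",
    [(1, 101), (1, 102), (2, 201), (None, 301), (None, 302)],
)

# group by collapses all nulls into a single group
groups = conn.execute("select task_id from Task group by task_id").fetchall()

# count(task_id) ignores the null group; count(*) counts every group
non_null_tasks, all_groups = conn.execute(
    "select count(task_id), count(*) "
    "from (select task_id from Task group by task_id) tmp"
).fetchone()

print(len(groups))     # 3 groups: 1, 2 and null
print(non_null_tasks)  # 2
print(all_groups)      # 3
```

Whether the null group should count as a task depends on the data; choose count(task_id) or count(*) accordingly.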

row_number

row_number is a window function with the following syntax:

row_number() over (partition by <field to group by> order by <field to sort by within each group>)

The partition by part can be omitted.

-- For use in SQL that supports window functions
select count(case when rn=1 then task_id else null end) task_num
from (select task_id
    , row_number() over (partition by task_id order by start_time) rn
  from Task) tmp;
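
Besides counting, the same rn = 1 filter can keep one full row per task, which distinct cannot do. The sketch below assumes hypothetical sample rows and runs the SQL through Python's sqlite3 module; window functions need SQLite 3.25 or later, which recent Python builds normally bundle.

```python
import sqlite3

# Hypothetical sample rows; each task has two orders.
conn = sqlite3.connect(":memory:")
conn.execute("create table Task (task_id integer, order_id integer, start_time text)")
conn.executemany(
    "insert into Task values (?, ?, ?)",
    [
        (1, 101, "2021-01-01 09:00"),
        (1, 102, "2021-01-01 10:00"),
        (2, 202, "2021-01-02 11:00"),
        (2, 201, "2021-01-02 09:00"),
    ],
)

# Number the rows inside each task_id by start_time, then keep only the first
first_rows = conn.execute(
    """
    select task_id, order_id, start_time
    from (select task_id, order_id, start_time
               , row_number() over (partition by task_id order by start_time) rn
          from Task) tmp
    where rn = 1
    order by task_id
    """
).fetchall()

# The counting form from the article gives the number of tasks
task_num = conn.execute(
    """
    select count(case when rn = 1 then task_id else null end)
    from (select task_id
               , row_number() over (partition by task_id order by start_time) rn
          from Task) tmp
    """
).fetchone()[0]

print(first_rows)  # [(1, 101, '2021-01-01 09:00'), (2, 201, '2021-01-02 09:00')]
print(task_num)    # 2
```

Changing the order by inside over (...) changes which row is kept, for example order by start_time desc keeps the latest order of each task.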

In addition, let's use a table Test to illustrate how distinct and group by behave when deduplicating:

-- In the comments below, semicolons separate the returned rows
select distinct user_id
from Test; -- returns 1; 2

select distinct user_id, user_type
from Test; -- returns 1, 1; 1, 2; 2, 1

select user_id
from Test
group by user_id; -- returns 1; 2

select user_id, user_type
from Test
group by user_id, user_type; -- returns 1, 1; 1, 2; 2, 1

select user_id, user_type
from Test
group by user_id;
-- Hive, Oracle, etc. report an error here. MySQL accepts this only when the
-- ONLY_FULL_GROUP_BY SQL mode is disabled (it is enabled by default since MySQL 5.7.5).
-- Returns either 1, 1 or 1, 2 for user 1, plus 2, 1 (two rows in total).
-- Only the fields listed after group by are deduplicated, so the number of rows
-- returned is the same as for the group by user_id query above, that is, 2.
-- For a field that appears in select but not in group by, only one value is returned
-- per group (often the first one encountered, but this is not guaranteed).
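
The sketch below reproduces these queries through Python's sqlite3 module. The rows of Test are not shown in the article, so the sample rows here are an assumption chosen to match the results in the comments. SQLite, like MySQL with ONLY_FULL_GROUP_BY disabled, accepts the non-grouped field, which makes it a convenient stand-in. The last query is one possible way to make the result predictable, by choosing the value explicitly with an aggregate.

```python
import sqlite3

# Assumed sample rows, chosen to match the results quoted in the article.
conn = sqlite3.connect(":memory:")
conn.execute("create table Test (user_id integer, user_type integer)")
conn.executemany(
    "insert into Test values (?, ?)",
    [(1, 1), (1, 2), (1, 1), (2, 1)],
)

# distinct over both fields: three distinct combinations
distinct_pairs = conn.execute(
    "select distinct user_id, user_type from Test order by user_id, user_type"
).fetchall()

# user_type is not in group by: one row per user_id, but which
# user_type is returned for user 1 is not something to rely on
loose_rows = conn.execute(
    "select user_id, user_type from Test group by user_id"
).fetchall()

# Picking the value with an aggregate makes the result well defined
pinned_rows = conn.execute(
    "select user_id, min(user_type) from Test group by user_id order by user_id"
).fetchall()

print(distinct_pairs)   # [(1, 1), (1, 2), (2, 1)]
print(len(loose_rows))  # 2
print(pinned_rows)      # [(1, 1), (2, 1)]
```

The aggregate form also runs on engines that reject the non-grouped field, since every selected field is then either grouped or aggregated.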
