Comparative Analysis of IN and Exists in MySQL Statements

Background

Recently, when writing SQL statements, I was hesitant about whether to choose IN or Exists, so I wrote out the SQL of both methods to compare the execution efficiency. I found that the query efficiency of IN is much higher than that of Exists, so I naturally assumed that the efficiency of IN is better than Exists. However, in line with the principle of getting to the bottom of things, I want to know whether this conclusion applies to all scenarios and why this result occurs.
I checked the relevant information online and it can be roughly summarized as follows: when the external table is small and the internal table is large, Exists is applicable; when the external table is large and the internal table is small, IN is applicable. Then I am confused, because in my SQL statement, the outer table has only 10,000 data, and the inner table has 300,000 data. According to the Internet, the efficiency of Exists should be higher than IN, but my result is just the opposite! !
"No investigation, no right to speak"! So I started to study the actual execution process of IN and Exists, and from a practical perspective, I tried to find the root cause, which led to this blog post.

Experimental data

My experimental data includes two tables: t_author table and t_poetry table.
The amount of data in the corresponding table:

t_author table, 13355 records;
t_poetry table, 289,917 records.

The corresponding table structure is as follows:

CREATE TABLE t_poetry (
id bigint(20) NOT NULL AUTO_INCREMENT,
poetry_id bigint(20) NOT NULL COMMENT 'Poetry id',
poetry_name varchar(200) NOT NULL COMMENT 'Poetry name',
<font color=red> author_id bigint(20) NOT NULL COMMENT 'Author id'</font>
PRIMARY KEY ( id ),
UNIQUE KEY pid_idx ( poetry_id ) USING BTREE,
KEY aid_idx ( author_id ) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=291270 DEFAULT CHARSET=utf8mb4

CREATE TABLE t_author (
id int(15) NOT NULL AUTO_INCREMENT,
author_id bigint(20) NOT NULL,</font>
author_name varchar(32) NOT NULL,
dynasty varchar(16) NOT NULL,
poetry_num int(8) NOT NULL DEFAULT '0'
PRIMARY KEY ( id ),
UNIQUE KEY authorid_idx ( author_id ) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=13339 DEFAULT CHARSET=utf8mb4

Execution plan analysis IN execution process

SQL example: select * from tabA where tabA.x in (select x from tabB where y>0 );

Its execution plan:
(1) Execute the subquery of the tabB table to obtain the result set B, and use the index y of the tabB table;
(2) Execute a query on the tabA table. The query condition is that tabA.x is in result set B. The index x of the tabA table can be used.

Exists execution process

SQL example: select from tabA where exists (select from tabB where y>0);

Its execution plan:

(1) First retrieve all records from the tabA table.
(2) For each record of table tabA, associate table tabB row by row to determine whether the subquery of table tabB returns any data. Versions after 5.5 use Block Nested Loop.
(3) If the subquery returns data, the current record of tabA is returned to the result set.
tabA is equivalent to traversing the entire table data, and tabB can use the index.

Experimental procedures

The experiment analyzes the IN and Exists SQL statements for the same result set.
SQL statement containing IN:

select from t_author ta where author_id in
(select author_id from t_poetry tp where tp.poetry_id>3650 );

SQL statement containing Exists:

select from t_author ta where exists
(select * from t_poetry tp where tp.poetry_id>3650 and tp.author_id=ta.author_id);

Data from the first experiment

t_author table, 13355 records; t_poetry table, subquery filter result set where poetry_id>293650, 121 records;

Execution Results

Using exists takes 0.94 seconds, and using in takes 0.03 seconds. IN is more efficient than Exists.

Cause Analysis

The result set of the subquery on the t_poetry table is very small, and both can use indexes on the t_poetry table, and the consumption of the t_poetry subquery is basically the same. The difference between the two is that when using in, the t_author table can use the index:

When using exists, the t_author table is fully scanned:

When the subquery result set is small, the query time is mainly reflected in the traversal of the t_author table.

Second experimental data

t_author table, 13,355 records; t_poetry table, subquery filter result set where poetry_id>3650, 287,838 records;

Execution time

Using exists takes 0.12 seconds, and using in takes 0.48 seconds. Exists is more efficient than IN .

Cause Analysis

The index usage of the two experiments is consistent with that of the first experiment. The only difference is the size of the subquery filtering result set. However, the experimental results are different from the first one. In this case, the subquery result set is very large. Let's look at the MySQL query plan:
When using in, since the subquery result set is very large, both t_author and t_poetry tables are close to full table scans. At this time, the difference in the time consumption of traversing the t_author table can be ignored on the overall efficiency. There is one more line <auto_key> in the execution plan. In the case of a close full table scan, the MySQL optimizer chooses auto_key to traverse the t_author table:

When using exists, the change in data volume does not change the execution plan. However, due to the large subquery result set, MySQL versions after 5.5 use Block Nested-Loop (Block nested loop, introducing join buffer, similar to cache function) when matching the query results of exists, which begins to have a significant impact on query efficiency, especially when the subquery result set is large, it can significantly improve the query matching efficiency:

Experimental Conclusion

Based on the above two experiments and their results, we can clearly understand the execution process of IN and Exists, and summarize the applicable scenarios of IN and Exists:

IN queries can use indexes on both inner and outer tables; Exists queries can use indexes only on inner tables. When the subquery result set is large and the outer table is small, the Block Nested Loop of Exists begins to play a role and makes up for the defect that the outer table cannot use indexes, and the query efficiency will be better than IN. When the subquery result set is small and the external table is large, the optimization effect of the Block nested loop of Exists is not obvious, and the advantage of the external table index of IN plays a major role. In this case, the query efficiency of IN is better than that of Exists. What is said online is inaccurate. In fact, the "size of the table" does not refer to the internal table and the external table, but the external table and the subquery result set . The last point is also the most important one: there is no absolute truth in the world. Grasping the essence of things and conducting practical verification for different scenarios is the most reliable and effective method. Supplementary information on problems found during the experiment

When analyzing the above exists statements under different data sets, I found that the larger the data set, the shorter the time consumed, which is very strange.
The specific query conditions are:

where tp.poetry_id>3650, takes 0.13S
where tp.poetry_id>293650, takes 0.46S

Possible reasons: The larger the condition value is, the later the query is, the more records need to be traversed, resulting in more time consumption. This explanation needs to be further verified.

You may also be interested in:

Comparison of the usage of EXISTS and IN in MySQL
Basic usage of exists, in and any in MySQL
Introduction to the use and difference between in and exists in MySQL
MySQL exists and in detailed explanation and difference
Summary of the differences between in query and exists query in mySQL
MYSQL IN and EXISTS optimization examples
mysql not in, left join, IS NULL, NOT EXISTS efficiency problem record
Detailed explanation of the difference between in and exists in MySQL

<<: How to view and terminate running background programs in Linux

>>: Detailed explanation of Vue custom instructions and their use

js learning notes: class, super and extends keywords

Comparative Analysis of IN and Exists in MySQL Statements

js learning notes: class, super and extends keywords

TypeScript installation and use and basic data types

How to package the project into docker through idea

Python 3.7 installation tutorial for MacBook

Solve the problem that docker installation is completed and reported: bridge-nf-call-iptables is disabled

Detailed explanation of keepAlive use cases in Vue

MYSQL development performance research: optimization method for batch inserting data

How to handle token expiration in WeChat Mini Programs

How to build and deploy Node project with Docker

Detailed explanation of MySQL sql99 syntax inner join and non-equivalent join

Recommend

The implementation process of Linux process network traffic statistics

Summary of the advantages of Vue3 vs. Vue2

Summary of the minesweeping project implemented in JS

Native JS to achieve digital table special effects

Introduction to the use of em in elastic layout in CSS3: How many pixels is 1em?

Summary of pitfalls in importing ova files into vmware

XHTML language default CSS style

How to add, delete and modify columns in MySQL database

WeChat Mini Program User Authorization Best Practices Guide

MySQL 5.7.23 decompression version installation tutorial with pictures and text

JavaScript functional programming basics

Use CSS3 to implement button hover flash dynamic special effects code

HTML table only displays the outer border of the table

Complete MySQL Learning Notes

Detailed explanation of Linux server status and performance related commands