An example of quickly inserting tens of millions of rows into MySQL

In data analysis, the database is a good helper: it not only answers our queries in acceptable time, but also serves as the foundation for further analysis. To use it, though, we first have to get the data in. In practice we often face data sets of tens of millions of rows or more, and without a fast insertion method the loading step alone wastes an enormous amount of time.

I ran into exactly this problem when I took part in Alibaba's Tianchi Big Data Algorithm Competition (popular music trend prediction). Before I optimized the database queries and inserts, I wasted a great deal of time: inserting just 15 million rows took an incredible 12 hours using the most basic row-by-row insertion. That experience pushed me to think about how to optimize insert and query operations to improve efficiency.

Continuous optimization improved performance dramatically. Querying and aggregating the download, play, and favorite counts of more than 26,000 songs as time series dropped from an estimated 40 hours to a little over one hour. Insertion improved just as much: on a new data set, more than 54.9 million rows were inserted in 20 minutes. Below I share my approach.

The optimization has two steps. First, a single static reader reads the data from the CSV file, and whenever the accumulated data reaches a certain size, a new thread is started to insert that batch into the database. Second, the inserts themselves use MySQL batch operations.

The first step is to read the file and start the insert threads

Here, "a certain size" is a question that needs thought. In my experiments I started with 1,000,000 rows per batch, but a new problem appeared: the Java heap overflowed. I finally settled on 100,000 rows as the standard.

Of course, other batch sizes work too; pick whatever suits your setup.
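
As a rough sanity check on those numbers (my own back-of-envelope estimate, not a measurement from the competition): if a CSV line averages around 100 bytes, a 1,000,000-line batch holds roughly 100 MB of raw text, and Java Strings typically double or triple that in heap memory, so a few batches alive at once can easily exceed a default-sized JVM heap; a 100,000-line batch is only about 10 MB and leaves plenty of headroom for the worker threads.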

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import preprocess.ImportDataBase;

public class MultiThreadImportDB {

    /**
     * Reads a large CSV file and hands it off to insert threads in batches.
     */
    private static int m_record = 99999; // 100,000 lines per batch
    private static BufferedReader br = null;
    private ArrayList<String> list;
    private static int m_thread = 0; // running counter, used as the thread name

    static {
        try {
            br = new BufferedReader(
                    new FileReader("E:/tianci/IJCAI15 Data/data_format1/user_log_format1.csv"), 8192);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        try {
            br.readLine(); // skip the CSV header line
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void start() {
        String line;
        int count = 0;
        list = new ArrayList<String>(m_record + 1);
        synchronized (br) {
            try {
                while ((line = br.readLine()) != null) {
                    if (count < m_record) {
                        list.add(line);
                        count++;
                    } else {
                        // batch is full: add the last line, hand the batch to a
                        // new insert thread, and start collecting the next batch
                        list.add(line);
                        count = 0;
                        Thread t1 = new Thread(new MultiThread(list), Integer.toString(m_thread++));
                        t1.start();
                        list = new ArrayList<String>(m_record + 1);
                    }
                }
                // flush the final, partially filled batch
                if (!list.isEmpty()) {
                    Thread t1 = new Thread(new MultiThread(list), Integer.toString(m_thread++));
                    t1.start();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) {
        new MultiThreadImportDB().start();
    }
}

The second step is to insert the data in batches from multiple threads

class MultiThread implements Runnable {
    private ArrayList<String> list;

    public MultiThread(ArrayList<String> list) {
        this.list = list;
    }

    public void run() {
        try {
            // hand the whole batch to the database layer (step two)
            ImportDataBase insert = new ImportDataBase(list);
            insert.start();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        display(this.list);
    }

    public void display(List<String> list) {
        // uncomment to dump every line of the batch:
        // for (String str : list) {
        //     System.out.println(str);
        // }
        System.out.print(Thread.currentThread().getName() + " :");
        System.out.println(list.size()); // report how many rows this thread handled
    }
}
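
The ImportDataBase class that each worker thread calls is imported from preprocess but never listed in the article. Below is a minimal skeleton consistent with how it is used; the field names are my assumptions, and the guess that the constructor reads connection settings from a file is only suggested by the FileNotFoundException it declares:

import java.io.FileNotFoundException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;

public class ImportDataBase {
    private ArrayList<String> list;
    private Connection conn;       // assumed to be opened with auto-commit disabled
    private PreparedStatement preStmt;

    public ImportDataBase(ArrayList<String> list) throws FileNotFoundException {
        this.list = list;
        // the declared FileNotFoundException suggests the original code
        // reads connection settings from a file here
    }

    public void start() {
        try {
            insertUserBehaviour(list);
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    public int insertUserBehaviour(ArrayList<String> sqls) throws SQLException {
        // full implementation listed in the next section
        return 1;
    }
}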

The batch operations use MySQL's PreparedStatement class. I also tried the batch operations of the plain Statement class, but its performance was far worse: PreparedStatement batching reached an insert rate of 10,000+ rows per second, while Statement batching only managed about 2,000.

public int insertUserBehaviour(ArrayList<String> sqls) throws SQLException {
    // conn must have auto-commit disabled, since we commit manually below
    String sql = "insert into user_behaviour_log (user_id,item_id,cat_id,merchant_id,brand_id,time_stamp,action_type)"
            + " values(?,?,?,?,?,?,?)";
    preStmt = conn.prepareStatement(sql);
    for (int i = 0; i < sqls.size(); i++) {
        UserLog log = new UserLog(sqls.get(i)); // parse one CSV line into a bean
        preStmt.setString(1, log.getUser_id());
        preStmt.setString(2, log.getItem_id());
        preStmt.setString(3, log.getCat_id());
        preStmt.setString(4, log.getMerchant_id());
        preStmt.setString(5, log.getBrand_id());
        preStmt.setString(6, log.getTimeStamp());
        preStmt.setString(7, log.getActionType());
        preStmt.addBatch();
        if ((i + 1) % 10000 == 0) {
            // flush and commit every 10,000 rows to keep each batch manageable
            preStmt.executeBatch();
            conn.commit();
            preStmt.clearBatch();
        }
    }
    // flush and commit whatever is left after the loop
    preStmt.executeBatch();
    conn.commit();
    return 1;
}
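
One detail the listing above leaves out is how conn is opened. Here is a minimal sketch of such setup under my own assumptions (the URL, database name, and credentials are placeholders): the manual commits above require setAutoCommit(false), and Connector/J's rewriteBatchedStatements=true property lets the driver rewrite an addBatch() sequence into a single multi-row INSERT, which usually speeds up batched inserts substantially.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionSetup {
    // Hypothetical helper: opens a connection tuned for bulk inserts.
    public static Connection openConnection() throws SQLException {
        // rewriteBatchedStatements=true lets Connector/J send one multi-row
        // INSERT instead of 10,000 individual statements per batch.
        String url = "jdbc:mysql://localhost:3306/tianchi"
                + "?rewriteBatchedStatements=true&useSSL=false";
        Connection conn = DriverManager.getConnection(url, "user", "password");
        conn.setAutoCommit(false); // required for the manual commit() calls above
        return conn;
    }
}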

Of course, we also experimented with the two common MySQL storage engines, InnoDB and MyISAM. In my tests InnoDB was faster (by roughly 3x), which may be related to the newer MySQL version; I was running MySQL 5.6.

Finally, a summary of the techniques that speed up insertion of very large data sets.

On the Java side, insert from multiple threads and submit in batches; a bounded-pool variant is sketched below.
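
The listing above starts a raw Thread per 100,000-row batch, so a 54.9-million-row file would spawn several hundred threads. A common refinement (my suggestion, not part of the original article) is to cap concurrency with a fixed-size ExecutorService while keeping the same batch structure:

import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PooledImport {
    public static void main(String[] args) throws InterruptedException {
        // Beyond a handful of writers, extra threads mostly contend on MySQL,
        // so a small fixed pool is usually enough.
        ExecutorService pool = Executors.newFixedThreadPool(8);

        // In the real loader each batch would be filled from the CSV reader
        // exactly as in MultiThreadImportDB before being submitted.
        ArrayList<String> batch = new ArrayList<String>();
        pool.execute(new MultiThread(batch)); // reuse the Runnable defined above

        pool.shutdown(); // stop accepting new batches
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for in-flight inserts
    }
}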

On the database side, create the table without secondary indexes; otherwise every insert also has to maintain the index B+ trees. Add the indexes once the load is finished. Also check the storage engine: newer MySQL versions default to InnoDB, which was the faster engine in my tests, while older versions default to MyISAM and may need an explicit change. A sketch of both points follows.
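
As a concrete illustration (a sketch only; the column types and the choice of index column are my assumptions, based on the insert statement above):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class SchemaSetup {
    // Before the bulk load: create the table with no secondary indexes, and
    // say ENGINE=InnoDB explicitly in case the server default is MyISAM.
    public static void createTable(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS user_behaviour_log ("
                    + " user_id VARCHAR(16), item_id VARCHAR(16), cat_id VARCHAR(16),"
                    + " merchant_id VARCHAR(16), brand_id VARCHAR(16),"
                    + " time_stamp VARCHAR(8), action_type VARCHAR(4)"
                    + ") ENGINE=InnoDB");
        }
    }

    // After the bulk load: build the index once, instead of row by row.
    public static void addIndexes(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("ALTER TABLE user_behaviour_log ADD INDEX idx_user (user_id)");
        }
    }
}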

The above is the full content of this article. I hope it will be helpful for everyone’s study. I also hope that everyone will support 123WORDPRESS.COM.

You may also be interested in:
  • Optimizing query speed of MySQL with tens of millions of data using indexes
  • MySQL loop inserts tens of millions of data
  • How to quickly paginate MySQL data volumes of tens of millions
  • Detailed explanation of several practical solutions for quickly deleting large amounts of data (tens of millions) in MySQL
  • Summary of SQL query optimization knowledge points for MySQL tens of millions of big data
  • How to quickly create tens of millions of test data in MySQL
  • Detailed explanation of 30 SQL query optimization techniques for MySQL tens of millions of large data
  • Optimizing the performance of paging query for MySQL with tens of millions of data
  • How to optimize MySQL fast paging for tens of millions of pages
  • Detailed explanation of MySQL database tens of millions of data query and storage
