Implementing file content deduplication, intersection, union, and difference in Linux

1. Data Deduplication

In day-to-day work, results queried and exported from Hive or Impala sometimes contain duplicate rows. Re-running the query is unattractive when it takes a long time and the exported file is large, so it is simpler to remove the duplicates from the file itself with Linux commands.

The following is an example:

Suppose aaa.txt contains 3 duplicate lines.

We want to remove the redundant copies and keep only one.

sort aaa.txt | uniq > bbb.txt

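This removes the duplicate lines from aaa.txt and writes the result to bbb.txt. The sort step is required because uniq only collapses identical lines that are adjacent.

After running it, bbb.txt retains only one copy of each duplicated line.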
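As a rough illustration of the deduplication step, the sketch below builds a small sample file and deduplicates it in two ways. The file contents (row1, row2, row3) and the extra output names ccc.txt and counts.txt are made up for this example and are not from the original export. It assumes a POSIX shell with sort, uniq, and awk available.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: "row2" appears three times.
printf '%s\n' row3 row2 row1 row2 row2 > aaa.txt

# Show how often each line occurs, most frequent first.
sort aaa.txt | uniq -c | sort -rn > counts.txt

# sort -u is a shorthand that normally gives the same result as
# "sort aaa.txt | uniq > bbb.txt".
sort -u aaa.txt > bbb.txt

# Alternative that keeps the original line order: print a line only
# the first time it is seen.
awk '!seen[$0]++' aaa.txt > ccc.txt
```

With this sample, bbb.txt comes out sorted, while ccc.txt keeps the order in which lines first appeared. The awk variant holds every distinct line in memory, so it may not suit very large exports.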

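2. Data intersection, union, and difference

The commands below assume that neither file contains duplicate lines of its own. If one might, deduplicate it first (for example with sort file | uniq), otherwise uniq -d and uniq -u will give misleading results.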

1) Intersection (equivalent to user_2019 inner join user_2020 on user_2019.user_no=user_2020.user_no)

sort user_2019.txt user_2020.txt | uniq -d
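A small worked example of the intersection may help. The user numbers (u001 to u004) and the output file names are hypothetical, and the sketch assumes one user number per line with no repeats inside either file. It also shows comm as an alternative, which is not part of the original article.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical user numbers, one per line, no duplicates within a file.
printf '%s\n' u001 u002 u003 > user_2019.txt
printf '%s\n' u002 u003 u004 > user_2020.txt

# The article's approach: a line present in both files occurs twice
# after sorting, and uniq -d prints one copy of each repeated line.
sort user_2019.txt user_2020.txt | uniq -d > both.txt

# Alternative with comm, which requires sorted inputs. -12 suppresses
# the "only in file 1" and "only in file 2" columns.
sort user_2019.txt > a.sorted
sort user_2020.txt > b.sorted
comm -12 a.sorted b.sorted > both_comm.txt
```

Here both output files contain u002 and u003. The comm version stays correct even when a file repeats a line internally, because it compares the two inputs rather than counting occurrences in a merged stream.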

2) Union (equivalent to select user_no from user_2019 union select user_no from user_2020)

sort user_2019.txt user_2020.txt | uniq
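The union can be checked on a tiny sample like this. The user numbers, the union.txt file name, and the union_count variable are illustrative assumptions only.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical user numbers, one per line.
printf '%s\n' u001 u002 u003 > user_2019.txt
printf '%s\n' u002 u003 u004 > user_2020.txt

# Merge both files, sort, and keep one copy of every line.
sort user_2019.txt user_2020.txt | uniq > union.txt

# Count the distinct user numbers; tr strips the padding that some
# wc implementations add.
union_count=$(wc -l < union.txt | tr -d ' ')
echo "distinct users: $union_count"
```

Like SQL union (as opposed to union all), the result contains each user number once, even though u002 and u003 appear in both inputs.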

3) Difference

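user_2019.txt - user_2020.txt (lines that appear only in user_2019.txt):
sort user_2019.txt user_2020.txt user_2020.txt | uniq -u
user_2020.txt - user_2019.txt (lines that appear only in user_2020.txt):
sort user_2020.txt user_2019.txt user_2019.txt | uniq -u

Listing the subtracted file twice guarantees that each of its lines occurs at least twice in the sorted stream, so uniq -u, which prints only lines that occur exactly once, drops them.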
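The two difference commands can be tried on sample data as follows. The user numbers and output file names are hypothetical, and the grep variant is an alternative added here for comparison, not something the original article uses.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical user numbers, one per line, no duplicates within a file.
printf '%s\n' u001 u002 u003 > user_2019.txt
printf '%s\n' u002 u003 u004 > user_2020.txt

# Users only in 2019: user_2020.txt is listed twice, so none of its
# lines can survive uniq -u.
sort user_2019.txt user_2020.txt user_2020.txt | uniq -u > only_2019.txt

# Users only in 2020: the mirror image.
sort user_2020.txt user_2019.txt user_2019.txt | uniq -u > only_2020.txt

# Alternative with grep: -F fixed strings, -x whole-line match,
# -v invert, -f read the patterns from a file.
grep -Fxv -f user_2020.txt user_2019.txt > only_2019_grep.txt
```

With this sample, only_2019.txt holds u001 and only_2020.txt holds u004. The grep form keeps the original order of user_2019.txt and does not need sorting, though grep exits with a non-zero status when the difference is empty.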

That is the full content of this article. I hope it is helpful for your study.
