Deduplicating file contents and computing intersection, union, and difference in Linux

1. Data Deduplication

In day-to-day work, a result set exported from Hive or Impala may contain duplicate rows. Re-running the query is unattractive when it takes a long time and the exported file is large, so it is easier to remove the duplicates from the exported file with Linux commands.

The following is an example. Here aaa.txt contains 3 duplicate records, and the goal is to remove the redundant copies and keep only one:

sort aaa.txt | uniq > bbb.txt

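This removes the duplicate lines from aaa.txt and writes the result to bbb.txt. The sort step is required because uniq only compares adjacent lines, so identical lines must be next to each other before they can be collapsed.

After the command runs, bbb.txt retains only one copy of each record.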
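The deduplication step can be tried on a small sample. The sketch below uses made-up file contents in a temporary directory (only the aaa.txt and bbb.txt names come from the text above; the other file names are hypothetical). It also shows two common alternatives: sort -u, which does the same job in one command, and an awk one-liner for cases where the original line order should be preserved.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: "apple" appears three times.
printf '%s\n' banana apple cherry apple apple > aaa.txt

# The approach from the article: sort so duplicates become adjacent, then collapse them.
sort aaa.txt | uniq > bbb.txt

# Equivalent single command.
sort -u aaa.txt > bbb_sort_u.txt

# Order-preserving alternative: print a line only the first time it is seen.
awk '!seen[$0]++' aaa.txt > bbb_ordered.txt

cat bbb.txt
```

Note that the sort-based variants reorder the file, which usually does not matter for exported query results but does matter if the file has a header row.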

2. Data Intersection, Union, and Difference

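1) Intersection (equivalent to user_2019 inner join user_2020 on user_2019.user_no = user_2020.user_no). uniq -d prints only lines that occur more than once, so this assumes that neither file contains duplicate lines of its own.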

sort user_2019.txt user_2020.txt | uniq -d
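The following sketch illustrates the intersection on made-up user numbers and shows why duplicates inside one file matter. The sample values and the .sorted file names are assumptions for illustration. It also includes comm -12 as an alternative, which works on two sorted files and prints only the lines common to both.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical user numbers; u003 is listed twice in 2019 but is absent from 2020.
printf '%s\n' u001 u002 u003 u003 > user_2019.txt
printf '%s\n' u002 u004 u001 > user_2020.txt

# Naive pipeline: u003 is reported even though it is not in user_2020.txt.
naive=$(sort user_2019.txt user_2020.txt | uniq -d)

# Deduplicate each file first so a repeated line can only come from both files.
sort -u user_2019.txt > user_2019.sorted
sort -u user_2020.txt > user_2020.sorted
common=$(sort user_2019.sorted user_2020.sorted | uniq -d)

# comm -12 suppresses lines unique to either file, leaving the common lines.
common_comm=$(comm -12 user_2019.sorted user_2020.sorted)

printf '%s\n' "$common"
```

If the exported files are already known to be duplicate-free, for example after the deduplication step in section 1, the original one-line pipeline is sufficient.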

2) Union (equivalent to select user_no from user_2019 union select user_no from user_2020)

sort user_2019.txt user_2020.txt | uniq
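As a quick check of the union, the sketch below runs the pipeline on made-up user numbers (the sample values are assumptions). It also shows sort -u with both files as arguments, which should give the same result in a single command.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical user numbers; u002 appears in both years.
printf '%s\n' u001 u002 u003 > user_2019.txt
printf '%s\n' u002 u004 > user_2020.txt

# The pipeline from the article: merge, sort, and collapse repeated lines.
union=$(sort user_2019.txt user_2020.txt | uniq)

# Shorter equivalent using sort's own deduplication flag.
union_short=$(sort -u user_2019.txt user_2020.txt)

printf '%s\n' "$union"
```

Unlike the intersection and difference, the union is not affected by duplicate lines inside either input file.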

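3) Difference. uniq -u prints only lines that occur exactly once. Listing the file to be subtracted twice guarantees that each of its lines occurs at least twice, so only the lines unique to the other file remain. This assumes the file being subtracted from has no duplicate lines of its own.

user_2019.txt - user_2020.txt (lines in user_2019.txt that are not in user_2020.txt):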

sort user_2019.txt user_2020.txt user_2020.txt | uniq -u

user_2020.txt - user_2019.txt (lines in user_2020.txt that are not in user_2019.txt):

sort user_2020.txt user_2019.txt user_2019.txt | uniq -u
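Both directions of the difference can be checked on a small sample. The sketch below uses made-up user numbers, and the .sorted file names are hypothetical. It also shows comm as an alternative: on two sorted files, comm -23 keeps lines found only in the first file and comm -13 keeps lines found only in the second.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical user numbers; u002 appears in both years.
printf '%s\n' u001 u002 u003 > user_2019.txt
printf '%s\n' u002 u004 > user_2020.txt

# 2019 minus 2020: every 2020 line occurs at least twice, so uniq -u drops it.
only_2019=$(sort user_2019.txt user_2020.txt user_2020.txt | uniq -u)

# 2020 minus 2019: same idea with the roles swapped.
only_2020=$(sort user_2020.txt user_2019.txt user_2019.txt | uniq -u)

# Alternative with comm, which requires sorted input files.
sort -u user_2019.txt > user_2019.sorted
sort -u user_2020.txt > user_2020.sorted
only_2019_comm=$(comm -23 user_2019.sorted user_2020.sorted)
only_2020_comm=$(comm -13 user_2019.sorted user_2020.sorted)

printf '2019 only:\n%s\n' "$only_2019"
printf '2020 only:\n%s\n' "$only_2020"
```

The comm variant reads each file only once, which may be preferable for large exports, while the sort and uniq -u variant needs no intermediate files.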
