GNU Parallel: a 15-Minute Getting-Started Guide

GNU Parallel is a shell tool for executing computational tasks in parallel on one or more computers. This article briefly introduces the use of GNU Parallel.

Your CPU is multi-core. (The original post illustrated this with a series of joke images showing how two, four, and sixteen cores "work", ending with a quip about not criticizing Intel; the images are not reproduced here.)

One bored weekend morning, I spent half a day working through the GNU Parallel man page and tutorial. I have to say that half day was well spent, because I expect it to save me far more than half a day in the future.

This article does not attempt to translate the GNU Parallel man page or tutorial, because ready-made translations already exist (you can find them here or here).

The first few times I saw parallel's weird ::: syntax and the strange {}, {#}, {.}, {/} placeholders, I backed off; such ugly syntax is unattractive. Fortunately, a few examples calmed me down, and after trying it myself I found it really is a magical tool.

The main purpose of this article is to lure you into using this tool and tell you why and how to use it.

why

There is only one purpose for using gnu parallel, and that is to be fast!

Fast installation

(wget -O - pi.dk/3 || curl pi.dk/3/) | bash

The author says installation takes 10 seconds. In China it may take a bit longer due to network conditions, but not much. GNU Parallel is in fact a single Perl script of more than 10,000 lines (yes, you read that right: all modules live in that one file; this is a feature~). After installing, I wrote a Fabric script to copy it directly to each node and chmod it executable.

Next comes fast execution: parallel runs your program in parallel across the system's cores.
As a benchmark (the original post showed a screenshot): grepping a 1 GB log with parallel versus plain grep. The result is obvious, a 20x difference. This is far more effective than switching to ack or ag.

Note: this was measured on a 48-core server.

how

The easiest starting point is xargs, whose -P option can take advantage of multiple cores.

For example:

$ time echo {1..5} |xargs -n 1 sleep

real 0m15.005s
user 0m0.000s
sys 0m0.000s

Here xargs passes each number from echo as an argument to sleep, so the total run time is 1+2+3+4+5 = 15 seconds.

With -P 5, the five sleeps run as five concurrent processes (sleeping 1, 2, 3, 4 and 5 seconds respectively), so the whole pipeline finishes in about 5 seconds: the length of the longest job.

$ time echo {1..5} |xargs -n 1 -P 5 sleep

real 0m5.003s
user 0m0.000s
sys 0m0.000s

That's the warm-up. The first mode of parallel is essentially a drop-in replacement for xargs -P.

For example, compress all HTML files.

find . -name '*.html' | parallel gzip --best

Argument-passing mode

In the first mode, parallel passes arguments: each line arriving from the pipe becomes an argument to the command that follows, and the resulting commands are executed in parallel.

For example:

huang$ seq 5 | parallel echo pre_placeholder_{}
pre_placeholder_1
pre_placeholder_2
pre_placeholder_3
pre_placeholder_4
pre_placeholder_5

{} is a placeholder that holds the incoming argument.

In cloud operations work, batch actions are common, such as creating 10 cloud disks:

seq 10 | parallel cinder create 10 --display-name test_{}

Create 50 cloud hosts

seq 50 | parallel nova boot --image image_id --flavor 1 --availability-zone az_id --nic vnetwork=private --vnc-password 000000 vm-test_{}

Deleting cloud hosts in batches

nova list | grep some_pattern | awk '{print $2}' | parallel nova delete

Rewrite the for loop

As you can see, I have replaced many places where I used to write loops with parallel, and enjoyed the convenience that parallelism brings.
This works because the iterations of a for loop are usually context-independent, which makes them ideal candidates for parallelization.

The general pattern, a shell loop:

 (for x in `cat list`; do
 do_something $x
 done) | process_output

can be rewritten directly as

 cat list | parallel do_something | process_output

If the loop body is long

 (for x in `cat list`; do
 do_something $x
 [... 100 lines that do something with $x ...]
 done) | process_output

it is better to wrap the body in a function

 doit() {
 x=$1
 do_something $x
 [... 100 lines that do something with $x ...]
 }
 export -f doit
 cat list | parallel doit

This also avoids a lot of troublesome escaping.

--pipe mode

The other mode is parallel --pipe.

Here the output of the preceding command is not turned into arguments; it is fed as standard input to the following command.

For example:

cat my_large_log |parallel --pipe grep pattern 

Without --pipe, each line of my_large_log would be expanded into its own grep pattern <line> command. With --pipe, the pipeline behaves just like cat my_large_log | grep pattern, except that the input is split into chunks that are distributed across cores.

Okay, that's the basic concept! The rest is just specific parameters: how many jobs to run, placeholder substitution, the various ways of passing arguments, running in parallel while preserving output order (-k), and the magical cross-node parallel execution. The man page has it all.

bonus

Having a quick way to parallelize at hand not only speeds up daily work; it is also handy for testing concurrency.

Many interfaces have bugs that only show up under concurrency. A typical case: a limit check done in application code without a database lock. Each concurrent request passes the check when it reaches the server, and together they blow past the limit when written. A serial for loop never triggers these problems, and really testing concurrency used to mean writing a script or wrapping things in Python's multiprocessing. With parallel at hand, I just added two aliases to my bashrc:

alias p='parallel'
alias pp='parallel --pipe -k' 

This makes generating concurrency very convenient: I just insert p into a pipeline and can observe the response under load at any time.

For example

seq 50 | p -n0 -q curl 'example.com'

This makes concurrent requests, with the job count defaulting to the number of cores. -n0 means the seq output is not passed as an argument to the command that follows.

Gossip time: the Xianglin Sao of GNU

(Xianglin Sao, a character from Lu Xun, is a Chinese byword for someone who repeats the same complaint endlessly.)

As a lover of free-software gossip, whenever I discover an interesting new piece of software, I google its name with site:https://news.ycombinator.com and with site:http://www.reddit.com/. The discussions often turn up unexpected things.

Then I saw a complaint on Hacker News, which basically said that every time you run parallel, a notice pops up telling you that if you use the tool for academic work (many people in the life sciences use it), you must cite the author's paper, or else pay him 10,000 euros. I learned a word from this: Nagware, software that nags you into paying, the way Tang Seng (the endlessly preachy monk of Journey to the West) nags. Although I do think the paper should be cited if the tool is really used in research, as this commenter put it:

I agree it's a great tool, except for the nagware messages and their content. Imagine if the author of cd or ls had the same attitude...

In addition, the author really does like having his software cited, so much so that it even shows up in the NEWS file.

Principle time

Quoting the author's answer on Stack Overflow directly:

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU (the answer illustrates this with a diagram).

GNU Parallel instead spawns a new job whenever one finishes, keeping the CPUs active and thus saving time (again illustrated with a diagram).
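A small experiment makes this slot-filling behaviour visible (the job lengths and -j2 are chosen arbitrarily): four 1-second jobs on 2 slots finish in about 2 seconds, not 4, because a new job starts the moment a slot frees up.

```shell
start=$(date +%s)
parallel -j2 sleep ::: 1 1 1 1   # 4 jobs, at most 2 running at once
end=$(date +%s)
echo "elapsed: $((end - start))s"   # ~2s on an idle machine
```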

in conclusion

This article introduced a genuinely useful parallelization tool, explained its two main modes, offered a small tip, and gossiped about a lesser-known side of the GNU world. I hope it is useful to you.
