author | František Kučera <franta-hg@frantovo.cz> |
Mon, 21 Feb 2022 00:43:11 +0100 | |
branch | v_0 |
changeset 329 | 5bc2bb8b7946 |
parent 316 | d7ae02390fac |
permissions | -rw-r--r-- |
294
abbc9bcfbcc4
Release v0.15 – streamlets, parallel processing
František Kučera <franta-hg@frantovo.cz>
parents:
diff
changeset
|
1 |
<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

	<nadpis>Computing hashes in parallel</nadpis>
	<perex>utilize all CPU cores while computing SHA-256 and other file hashes</perex>
	<m:pořadí-příkladu>03800</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">

		<p>
			Using <code>relpipe-in-filesystem</code> we can gather various file attributes:
			basic ones (name, size, type, …), extended ones (<em>xattr</em>, e.g. the original URL), metadata embedded in files (JPEG Exif, PNG, PDF etc.), XPath values from XML and JAR/ZIP metadata.
			We can also compute hashes of the file content (SHA-256, SHA-512 etc.).
		</p>

		<p>This example shows how we can compute various file content hashes and how to do it efficiently on a machine with multiple CPU cores.</p>

		<p>
			Background:
			Contemporary storage (especially SSD or even RAM) is usually fast enough that the bottleneck is the CPU and not the storage.
			This means that computing hashes of multiple files sequentially takes much more time than necessary.
			So it is better to compute the hashes in parallel and utilize multiple cores of our CPU.
			On the other hand, we are collecting several file attributes and working with structured data, so we have to preserve the structure and in the end merge all the pieces together without corrupting it.
			This is a perfect task for <m:name/> and especially for <code>relpipe-in-filesystem</code>, which is the first tool in our collection that implements streamlets and parallel processing.
		</p>
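		<p>
			The idea from the background paragraph (hash the files in parallel, then put the pieces back together as records) can be sketched outside of our tools as follows.
			This is only a minimal illustration in Python, not how <code>relpipe-in-filesystem</code> is implemented: it uses threads instead of separate processes, the names <code>sha256_of</code> and <code>hash_records</code> are made up, and the temporary files exist only to make the sketch self-contained.
			It relies on the fact that Python's <code>hashlib</code> releases the interpreter lock while hashing larger chunks, so the threads can run on several cores.
		</p>

```python
import hashlib
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash one file in chunks so that large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_records(paths: list[Path], workers: int) -> list[dict]:
    """Compute the hashes concurrently and return one record per file."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order of its input, so each hash is
        # paired with the right path no matter which worker finished first.
        hashes = list(pool.map(sha256_of, paths))
    return [
        {"path": str(path), "size": path.stat().st_size, "sha256": value}
        for path, value in zip(paths, hashes)
    ]


with tempfile.TemporaryDirectory() as directory:
    # Illustrative data: eight files, every fourth one has the same content.
    paths = []
    for number in range(8):
        path = Path(directory) / f"file-{number}.bin"
        path.write_bytes(bytes([number % 4]) * 100_000)
        paths.append(path)

    records = hash_records(paths, workers=4)
    for record in records:
        print(record["sha256"][:16], record["size"], Path(record["path"]).name)
```

		<p>
			The records come out in the same order as the input list, which is the property we need in order to merge the attributes without mixing up the files.
		</p>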

		<p>
			The following script prints a list of files in our <code>/bin</code> directory together with their SHA-256 hashes and also tells us how many identical files (i.e. files with exactly the same content) we have:
		</p>

		<m:pre src="examples/parallel-hashes-1.sh" jazyk="bash"/>

		<p>
			The output looks like this:
		</p>

		<m:pre src="examples/parallel-hashes-1.txt" jazyk="text"/>

		<p>
			This pipeline consists of four steps:
		</p>

		<ul>
			<li>
				<code>findFiles</code>
				– prepares the list of files, separated by the <code>\0</code> byte;
				we can also do some basic filtering here
			</li>
			<li>
				<code>fetchAttributes</code>
				– does the heavy work: it computes the SHA-256 hash of each file;
				thanks to the <code>--parallel N</code> option, it utilizes N cores of our CPU;
				we can experiment with the value of N and watch how the total time decreases
			</li>
			<li>
				<code>aggregate</code>
				– uses SQL to order the records and an SQL window function to show how many files have the same content;
316
d7ae02390fac
relpipe-tr-guile.cpp → relpipe-tr-scheme.cpp
František Kučera <franta-hg@frantovo.cz>
parents:
294
diff
changeset
|
59 |
				in this step we could also use <code>relpipe-tr-awk</code> or <code>relpipe-tr-scheme</code> if we prefer AWK or Scheme to SQL
			</li>
			<li>
				<code>relpipe-out-tabular</code>
				– formats the results as a table in the terminal (we could use e.g. <code>relpipe-out-gui</code> to call a GUI viewer, or format the results as XML, CSV or another format)
			</li>
		</ul>
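		<p>
			To make the first and third steps above more tangible, here is a small sketch in Python: it splits a <code>\0</code>-separated list of paths and then uses an SQL window function to count the files that share a hash.
			It is an illustration only, not the script from this example: the table name <code>file_hashes</code>, the column <code>same_content</code>, the paths and the shortened hash values are all made up, and it assumes an SQLite library of version 3.25 or newer (older versions lack window functions).
		</p>

```python
import sqlite3

# Hypothetical input: a NUL-separated list of paths, as "find -print0" would emit.
raw_list = b"/bin/a\0/bin/b\0/bin/c\0"
paths = [item.decode() for item in raw_list.split(b"\0") if item]

# Made-up short values standing in for real SHA-256 hashes.
fake_hashes = {"/bin/a": "1111", "/bin/b": "2222", "/bin/c": "1111"}

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE file_hashes (path TEXT, sha256 TEXT)")
connection.executemany(
    "INSERT INTO file_hashes VALUES (?, ?)",
    [(path, fake_hashes[path]) for path in paths],
)

# The window function adds the size of each group to every row of that group
# without collapsing the rows, unlike a plain GROUP BY.
rows = connection.execute(
    """
    SELECT path, sha256, count(*) OVER (PARTITION BY sha256) AS same_content
    FROM file_hashes
    ORDER BY same_content DESC, sha256, path
    """
).fetchall()

for path, sha256, same_content in rows:
    print(f"{path:10} {sha256:6} {same_content}")
```

		<p>
			Splitting on the <code>\0</code> byte rather than on newlines keeps file names with spaces or line breaks intact, because <code>\0</code> cannot occur in a path.
		</p>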

		<p>
			In the case of the <code>/bin</code> directory, the results are not so exciting: we see that the files with the same content are just symlinks to the same binary.
			But we can run this pipeline on a different directory and discover real duplicates that occupy precious space on our hard drives,
			or we can build an index for fast searching (even of offline media) and for checking whether we already have a file with the given content.
		</p>
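		<p>
			Such an index can be as simple as a mapping from a hash to the list of files that have it.
			The following Python sketch shows the principle on made-up in-memory data; the file names and the helper <code>known_content</code> are hypothetical, and a real index would be built from the output of the pipeline and stored somewhere persistent.
		</p>

```python
import hashlib
from collections import defaultdict

# Hypothetical "files": name -> content, kept in memory for brevity.
files = {
    "photo.jpg": b"same bytes",
    "backup/photo-copy.jpg": b"same bytes",
    "notes.txt": b"something else",
}

# The index maps each content hash to all names that have this content.
index = defaultdict(list)
for name, content in files.items():
    index[hashlib.sha256(content).hexdigest()].append(name)

# Real duplicates are the hashes that belong to more than one name.
duplicates = {
    digest: sorted(names) for digest, names in index.items() if len(names) > 1
}


def known_content(content: bytes) -> list:
    """Return the names of indexed files that have exactly this content."""
    return index.get(hashlib.sha256(content).hexdigest(), [])


print(duplicates)
print(known_content(b"something else"))
print(known_content(b"never seen before"))
```

		<p>
			Once the index exists, the lookup needs only the hash of the content in question, so the indexed media do not have to be online.
		</p>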

		<p>
			The following script shows how we can compute hashes using multiple algorithms:
		</p>

		<m:pre src="examples/parallel-hashes-2.sh" jazyk="bash"/>

		<p>
			There are two variants:
			In <code>fetchAttributes1</code> we compute the MD5 hash and then the SHA-1 hash for each record (file), and we have parallelism (<code>--parallel 4</code>) over records.
			In <code>fetchAttributes2</code> we compute the MD5 and SHA-1 hashes in parallel for each record (file), and we also have parallelism (<code>--parallel 4</code>) over records.
			This is how streamlets commonly work:
			If we ask a single streamlet instance to compute multiple attributes, it is done sequentially (usually; it depends on the particular streamlet implementation).
			But if we create multiple instances of a streamlet, we automatically get multiple processes that work in parallel on each record.
			The advantage of this kind of parallelism is that we can utilize multiple CPU cores even with one or a few records.
			The disadvantage is that if there is some common initialization phase (like parsing an XML file or another format), this work is repeated in each process.
			It is up to the user to choose the optimal (or good enough) way; there is no <em>automagic</em> mechanism.
		</p>
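		<p>
			The difference between the two variants can be sketched in Python as follows.
			This is an analogy under stated assumptions, not the streamlet mechanism itself: threads stand in for the streamlet processes, the records are made-up byte strings held in memory, the function names <code>variant_1</code> and <code>variant_2</code> are hypothetical, and doubling the number of workers in the second variant is our assumption about how two instances would combine with the parallelism over records.
		</p>

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records: file name -> file content (kept in memory for brevity).
records = {f"file-{n}": bytes([n]) * 50_000 for n in range(6)}


def md5(content: bytes) -> str:
    return hashlib.md5(content, usedforsecurity=False).hexdigest()


def sha1(content: bytes) -> str:
    return hashlib.sha1(content, usedforsecurity=False).hexdigest()


def variant_1(records: dict, workers: int) -> dict:
    """Parallel over records; MD5 and then SHA-1 sequentially within a record."""

    def both(content: bytes) -> tuple:
        return md5(content), sha1(content)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(records, pool.map(both, records.values())))


def variant_2(records: dict, workers: int) -> dict:
    """Parallel over records and also over the two hashes of each record."""
    with ThreadPoolExecutor(max_workers=2 * workers) as pool:
        md5_jobs = {name: pool.submit(md5, data) for name, data in records.items()}
        sha1_jobs = {name: pool.submit(sha1, data) for name, data in records.items()}
        # Merge the two independent results back into one record per file.
        return {
            name: (md5_jobs[name].result(), sha1_jobs[name].result())
            for name in records
        }


result_1 = variant_1(records, workers=4)
result_2 = variant_2(records, workers=4)
print(result_1 == result_2)
```

		<p>
			Both variants produce the same records; they differ only in how the work is split, which is why the choice between them is a matter of measuring on the data at hand.
		</p>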

	</text>

</stránka>