relpipe-data/examples-parallel-hashes.xml
author František Kučera <franta-hg@frantovo.cz>
Mon, 21 Feb 2022 01:21:22 +0100
branchv_0
changeset 330 70e7eb578cfa
parent 316 d7ae02390fac
permissions -rw-r--r--
Added tag relpipe-v0.18 for changeset 5bc2bb8b7946
294
abbc9bcfbcc4 Release v0.15 – streamlets, parallel processing
František Kučera <franta-hg@frantovo.cz>
<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Computing hashes in parallel</nadpis>
	<perex>utilize all CPU cores while computing SHA-256 and other file hashes</perex>
	<m:pořadí-příkladu>03800</m:pořadí-příkladu>
	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			Using <code>relpipe-in-filesystem</code> we can gather various file attributes
			– basic (name, size, type, …), extended (<em>xattr</em>, e.g. the original URL), metadata embedded in files (JPEG Exif, PNG, PDF etc.), XPath values from XML, JAR/ZIP metadata…
			or compute hashes of the file content (SHA-256, SHA-512 etc.).
		</p>
		
		<p>This example shows how we can compute various file content hashes and how to do it efficiently on a machine with multiple CPU cores.</p>
		
		<p>
			Background:
			Contemporary storage (especially SSD or even RAM) is usually fast enough that the bottleneck is the CPU and not the storage.
			This means that computing hashes of multiple files sequentially takes much longer than necessary.
			So it is better to compute the hashes in parallel and utilize multiple cores of our CPU.
			On the other hand, we are going to collect several file attributes and we are working with structured data, which means that we have to preserve the structure and in the end merge all pieces together without corrupting the structures.
			And this is a perfect task for <m:name/> and especially <code>relpipe-in-filesystem</code> which is the first tool in our collection that implements streamlets and parallel processing.
		</p>
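		<p>
			The idea of hashing many files in parallel while keeping one structured record per file can be sketched in plain Python.
			This is only an illustration under stated assumptions, not how <code>relpipe-in-filesystem</code> is implemented:
			the function names <code>file_record</code> and <code>hash_files</code> and the record layout are hypothetical,
			and worker threads are used here instead of separate processes.
		</p>

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def file_record(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return one structured record (a dict) describing a single file."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as stream:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "path": str(path),
        "size": Path(path).stat().st_size,
        algorithm: digest.hexdigest(),
    }


def hash_files(paths, workers=4, algorithm="sha256"):
    """Hash the files using several workers; records keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order of the inputs, so the pieces
        # are merged back without mixing up the records.
        return list(pool.map(lambda p: file_record(p, algorithm), paths))
```

		<p>
			Calling <code>hash_files(paths, workers=4)</code> with a list of file paths returns one dictionary per file;
			changing <code>workers</code> is the rough counterpart of experimenting with the degree of parallelism.
		</p>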
		
		<p>
			The following script prints a list of the files in our <code>/bin</code> directory with their SHA-256 hashes and also tells us how many identical files (i.e. files with exactly the same content) we have:
		</p>
		
		<m:pre src="examples/parallel-hashes-1.sh" jazyk="bash"/>
		
		<p>
			The output looks like this:
		</p>
		
		<m:pre src="examples/parallel-hashes-1.txt" jazyk="text"/>
		
		<p>
			This pipeline consists of four steps:
		</p>
		
		<ul>
			<li>
				<code>findFiles</code>
				– prepares the list of files, separated by the <code>\0</code> byte;
				we can also do some basic filtering here
			</li>
			<li>
				<code>fetchAttributes</code>
				– does the heavy work – computes the SHA-256 hash of each file;
				thanks to the <code>--parallel N</code> option, it utilizes N cores of our CPU;
				we can experiment with the N value and watch how the total time decreases
			</li>
			<li>
				<code>aggregate</code>
				– uses SQL to order the records and an SQL window function to show how many files have the same content;
				in this step we could also use <code>relpipe-tr-awk</code> or <code>relpipe-tr-scheme</code> if we prefer AWK or Scheme to SQL
			</li>
			<li>
				<code>relpipe-out-tabular</code>
				– formats the results as a table in the terminal (we could use e.g. <code>relpipe-out-gui</code> to call a GUI viewer or format the results as XML, CSV or other format)
			</li>
		</ul>
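		<p>
			The effect of the window function in the <code>aggregate</code> step can be tried on a tiny data set.
			The following sketch uses Python with an in-memory SQLite database (window functions need SQLite 3.25 or newer);
			the table name <code>file</code>, the column name <code>same_content</code> and the function <code>count_identical</code>
			are assumptions made for this illustration and are not taken from the script above.
		</p>

```python
import sqlite3


def count_identical(records):
    """Attach to each (path, sha256) pair the number of files sharing its hash."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE file (path TEXT, sha256 TEXT)")
    db.executemany("INSERT INTO file VALUES (?, ?)", records)
    rows = db.execute(
        # The window function counts rows per hash without collapsing them,
        # so every file stays visible in the result.
        "SELECT path, sha256, count(*) OVER (PARTITION BY sha256) AS same_content "
        "FROM file ORDER BY same_content DESC, sha256, path"
    ).fetchall()
    db.close()
    return rows
```

		<p>
			Unlike <code>GROUP BY</code>, the window function keeps one output row per file, which is why the listing can show both the individual paths and the number of identical files.
		</p>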
		
		<p>
			In the case of the <code>/bin</code> directory, the results are not very exciting – we see that the files with the same content are just symlinks to the same binary.
			But we can run this pipeline on a different directory and discover real duplicates that occupy precious space on our hard drives
			or we can build an index for fast searching (even of offline media) and for checking whether we have a file with the given content or not.
		</p>
		
		<p>
			The following script shows how we can compute hashes using multiple algorithms:
		</p>
		
		<m:pre src="examples/parallel-hashes-2.sh" jazyk="bash"/>
		
		<p>
			There are two variants:
			In <code>fetchAttributes1</code> we compute the MD5 hash and then the SHA-1 hash for each record (file), with parallelism (<code>--parallel 4</code>) over records.
			In <code>fetchAttributes2</code> we compute the MD5 and SHA-1 hashes in parallel for each record (file), and we also have parallelism (<code>--parallel 4</code>) over records.
			This is how streamlets commonly work:
			If we ask a single streamlet instance to compute multiple attributes, it is usually done sequentially (this depends on the particular streamlet implementation).
			But if we create multiple instances of a streamlet, we automatically have multiple processes that work in parallel on each record.
			The advantage of this kind of parallelism is that we can utilize multiple CPU cores even with one or a few records.
			The disadvantage is that if there is some common initialization phase (like parsing an XML file or another format), this work is repeated in each process.
			It is up to the user to choose the optimal (or good enough) way – there is no <em>automagic</em> mechanism.
		</p>
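		<p>
			The difference between the two variants can be modelled in a few lines of Python.
			This is a simplified sketch and not the streamlet mechanism itself: it uses threads inside one process instead of separate streamlet processes,
			it hashes in-memory byte strings instead of files, and the names <code>digest</code>, <code>variant1</code> and <code>variant2</code> are hypothetical.
		</p>

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(data, algorithm):
    """Hex digest of the given bytes using the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def variant1(records, workers=4):
    """Parallel over records; MD5 and then SHA-1 sequentially within a record."""
    def both(data):
        return {"md5": digest(data, "md5"), "sha1": digest(data, "sha1")}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(both, records))


def variant2(records, workers=4):
    """Parallel over records and, in addition, over the two attributes."""
    with ThreadPoolExecutor(max_workers=2 * workers) as pool:
        # Both map() calls submit their tasks immediately, so the MD5 and
        # SHA-1 computations of the same record can run at the same time.
        md5 = pool.map(lambda d: digest(d, "md5"), records)
        sha1 = pool.map(lambda d: digest(d, "sha1"), records)
        return [{"md5": m, "sha1": s} for m, s in zip(md5, sha1)]
```

		<p>
			Both functions return the same records; they differ only in how the work is scheduled.
			With a single large record, <code>variant1</code> keeps one worker busy while <code>variant2</code> can keep two busy, which mirrors the trade-off described above.
		</p>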
		
	</text>
</stránka>