
<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Computing hashes in parallel</nadpis>
	<perex>utilize all CPU cores while computing SHA-256 and other file hashes</perex>
	<m:pořadí-příkladu>03800</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			Using <code>relpipe-in-filesystem</code> we can gather various file attributes
			– basic (name, size, type, …), extended (<em>xattr</em>, e.g. the original URL), metadata embedded in files (JPEG Exif, PNG, PDF etc.), XPath values from XML documents, JAR/ZIP metadata…
			or compute hashes of the file content (SHA-256, SHA-512 etc.).
		</p>
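		
		<p>
			For illustration, a minimal invocation might look like the sketch below.
			The option and streamlet names used here (<code>--file</code>, <code>--streamlet hash</code>, <code>--option algorithm</code>) are assumptions;
			see <code>relpipe-in-filesystem --help</code> and the complete scripts later on this page for the authoritative syntax.
		</p>
		
		<pre>
# Illustrative sketch only – the option and streamlet names are assumptions;
# the complete, tested scripts are included below on this page.
find /bin -type f -print0 \
	| relpipe-in-filesystem \
		--file path \
		--file size \
		--streamlet hash --option algorithm sha-256 \
	| relpipe-out-tabular
</pre>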
		
		<p>This example shows how we can compute various file content hashes and how to do it efficiently on a machine with multiple CPU cores.</p>
		
		<p>
			Background:
			Contemporary storage (especially an SSD or even RAM) is usually fast enough that the bottleneck is the CPU rather than the storage.
			This means that computing hashes of multiple files sequentially takes much more time than necessary.
			So it is better to compute the hashes in parallel and utilize multiple cores of our CPU.
			On the other hand, we are collecting several file attributes and working with structured data, which means that we have to preserve the structure and, in the end, merge all the pieces together without corrupting it.
			And this is a perfect task for <m:name/> and especially <code>relpipe-in-filesystem</code>, which is the first tool in our collection that implements streamlets and parallel processing.
		</p>
		
		<p>
			The following script prints a list of files in our <code>/bin</code> directory together with their SHA-256 hashes and also tells us how many identical files (i.e. files with exactly the same content) we have:
		</p>
		
		<m:pre src="examples/parallel-hashes-1.sh" jazyk="bash"/>
		
		<p>
			The output looks like this:
		</p>
		
		<m:pre src="examples/parallel-hashes-1.txt" jazyk="text"/>
		
		<p>
			This pipeline consists of four steps:
		</p>
		
		<ul>
			<li>
				<code>findFiles</code>
				– prepares the list of files separated by the <code>\0</code> byte;
				we can also do some basic filtering here
			</li>
			<li>
				<code>fetchAttributes</code>
				– does the heavy work – computes the SHA-256 hash of each file;
				thanks to the <code>--parallel N</code> option it utilizes N cores of our CPU;
				we can experiment with the value of N and watch how the total time decreases
			</li>
			<li>
				<code>aggregate</code>
				– uses SQL to order the records and an SQL window function to show how many files have the same content;
				in this step we could also use <code>relpipe-tr-awk</code> or <code>relpipe-tr-scheme</code> if we prefer AWK or Scheme to SQL
			</li>
			<li>
				<code>relpipe-out-tabular</code>
				– formats the results as a table in the terminal (we could use e.g. <code>relpipe-out-gui</code> to call a GUI viewer, or format the results as XML, CSV or another format);
				a rough sketch of how these four steps fit together follows below
			</li>
		</ul>
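		
		<p>
			Put together, the pipeline might be structured roughly like the following outline.
			This is an illustrative sketch only – the exact options of <code>relpipe-in-filesystem</code> and <code>relpipe-tr-sql</code>, the relation name and the attribute (column) names are assumptions here; the authoritative version is the script included above.
		</p>
		
		<pre>
# Rough outline of the four steps described above (assumed option names and SQL):
findFiles()       { find /bin -print0; }
fetchAttributes() { relpipe-in-filesystem --parallel 4 --file path --streamlet hash --option algorithm sha-256; }
aggregate()       { relpipe-tr-sql --relation "files" "SELECT *, count(*) OVER (PARTITION BY sha256) AS same_content_count FROM filesystem ORDER BY sha256"; }

findFiles | fetchAttributes | aggregate | relpipe-out-tabular
</pre>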
		
		<p>
			In the case of the <code>/bin</code> directory, the results are not so exciting – we see that the files with the same content are just symlinks to the same binary.
			But we can run this pipeline on a different directory and discover real duplicates that occupy precious space on our hard drives,
			or we can build an index for fast searching (even of offline media) and for checking whether we already have a file with a given content.
		</p>
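		
		<p>
			For instance, to keep only the groups of files that really share the same content, we could aggregate by the hash and filter the groups.
			The following is just a sketch – the relation and attribute names (<code>filesystem</code>, <code>sha256</code>, <code>path</code>) as well as the exact <code>relpipe-tr-sql</code> syntax are assumptions; <code>findFiles</code> and <code>fetchAttributes</code> refer to the functions from the script above.
		</p>
		
		<pre>
# Sketch: list only duplicated content, biggest groups first (assumed names and SQL).
findFiles | fetchAttributes \
	| relpipe-tr-sql --relation "duplicates" \
		"SELECT sha256, count(*) AS n, group_concat(path) AS paths
		 FROM filesystem
		 GROUP BY sha256
		 HAVING count(*) > 1
		 ORDER BY n DESC" \
	| relpipe-out-tabular
</pre>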
		
		<p>
			The following script shows how we can compute hashes using multiple algorithms:
		</p>
		
		<m:pre src="examples/parallel-hashes-2.sh" jazyk="bash"/>
		
		<p>
			There are two variants:
			In <code>fetchAttributes1</code> we compute the MD5 hash and then the SHA-1 hash for each record (file), one after the other, and we have parallelism (<code>--parallel 4</code>) over records.
			In <code>fetchAttributes2</code> we compute the MD5 and SHA-1 hashes in parallel for each record (file), and we also have parallelism (<code>--parallel 4</code>) over records.
			This is a common way streamlets work:
			If we ask a single streamlet instance to compute multiple attributes, it does so sequentially (usually – it depends on the particular streamlet implementation).
			But if we create multiple instances of a streamlet, we automatically get multiple processes that work in parallel on each record.
			The advantage of this kind of parallelism is that we can utilize multiple CPU cores even with one or a few records.
			The disadvantage is that if there is some common initialization phase (like parsing an XML file or another format), this work is duplicated in each process.
			It is up to the user to choose the optimal (or good enough) way – there is no <em>automagic</em> mechanism.
		</p>
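		
		<p>
			The difference between the two variants could be expressed roughly like this
			(a sketch only – whether the <code>hash</code> streamlet accepts the options exactly in this form is an assumption; the script included above is authoritative):
		</p>
		
		<pre>
# fetchAttributes1 – a single hash streamlet instance computes both hashes,
# one after the other; parallelism applies only over records.
# (assumed syntax: one --streamlet with two --option values)
fetchAttributes1() {
	relpipe-in-filesystem --parallel 4 --file path \
		--streamlet hash --option algorithm md5 --option algorithm sha-1;
}

# fetchAttributes2 – two streamlet instances: each hash is computed in its own
# process, so both hashes are computed in parallel even for a single file.
fetchAttributes2() {
	relpipe-in-filesystem --parallel 4 --file path \
		--streamlet hash --option algorithm md5 \
		--streamlet hash --option algorithm sha-1;
}
</pre>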
		
	</text>

</stránka>