diff -r b862d16a2e9f -r abbc9bcfbcc4 relpipe-data/examples-parallel-hashes.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-parallel-hashes.xml	Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,93 @@

Computing hashes in parallel
utilize all CPU cores while computing SHA-256 and other file hashes

Using relpipe-in-filesystem we can gather various file attributes: basic ones (name, size, type, …), extended attributes (xattr, e.g. the original URL of a downloaded file), metadata embedded in files (JPEG Exif, PNG, PDF etc.), XPath values from XML, JAR/ZIP metadata… or we can compute hashes of the file content (SHA-256, SHA-512 etc.).
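For instance, fetching a few of the basic attributes and printing them as a table could look like this (a minimal sketch; the --file attribute names are assumptions based on the relpipe-in-filesystem documentation):

    # print the name, size and type of everything under /tmp as a table
    find /tmp -print0 \
        | relpipe-in-filesystem \
            --file name \
            --file size \
            --file type \
        | relpipe-out-tabular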


This example shows how we can compute various file content hashes and how to do it efficiently on a machine with multiple CPU cores.


Background: contemporary storage (especially an SSD, or even RAM) is usually fast enough that the bottleneck is the CPU, not the storage. This means that computing hashes of multiple files sequentially takes much more time than necessary, so it is better to compute the hashes in parallel and utilize multiple cores of our CPU. On the other hand, we are going to collect several file attributes and we are working with structured data, which means that we have to preserve the structure and, in the end, merge all the pieces together without corrupting them. This is a perfect task for Relational pipes, and especially for relpipe-in-filesystem, which is the first tool in the collection that implements streamlets and parallel processing.


The following script prints a list of the files in our /bin directory together with their SHA-256 hashes, and also tells us how many identical files (i.e. files with exactly the same content) we have:

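A sketch of such a pipeline is shown below. It is a reconstruction, not the verbatim script from this page: the hash streamlet emitting a sha-256 attribute by default, the input relation being named filesystem, and relpipe-tr-sql taking a --relation NAME QUERY pair are assumptions based on the relpipe documentation.

    #!/bin/bash

    findFiles() {
        # list of file names separated by the \0 byte;
        # basic filtering could be added here
        find /bin -print0;
    }

    fetchAttributes() {
        # the heavy work: hash each file, using four worker processes
        relpipe-in-filesystem \
            --parallel 4 \
            --file path \
            --streamlet hash;
    }

    aggregate() {
        # order the records and count the files sharing the same content
        # (the relation name "filesystem" is an assumption)
        relpipe-tr-sql \
            --relation 'files' '
                SELECT
                    path,
                    "sha-256" AS hash,
                    count(*) OVER (PARTITION BY "sha-256") AS same_content_count
                FROM filesystem
                ORDER BY same_content_count DESC, hash, path';
    }

    findFiles | fetchAttributes | aggregate | relpipe-out-tabular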

The output is a table with one row per file, showing the path, its SHA-256 hash and the number of files that share the same content.


This pipeline consists of four steps:

  • findFiles – prepares the list of files separated by the \0 byte; we can also do some basic filtering here
  • fetchAttributes – does the heavy work: it computes the SHA-256 hash of each file; thanks to the --parallel N option it utilizes N cores of our CPU; we can experiment with the N value and watch how the total time decreases
  • aggregate – uses SQL to order the records and an SQL window function to show how many files have the same content; in this step we could also use relpipe-tr-awk or relpipe-tr-guile if we prefer AWK or Guile/Scheme to SQL
  • relpipe-out-tabular – formats the results as a table in the terminal (we could use e.g. relpipe-out-gui to invoke a GUI viewer, or format the results as XML, CSV or another format)

In the case of the /bin directory, the results are not so exciting – we see that the files with the same content are just symlinks to the same binary. But we can run this pipeline on a different directory and discover real duplicates that occupy precious space on our hard drives, or we can build an index for fast searching (even of offline media) and for checking whether we already have a file with the given content.


The following script shows how we can compute hashes using multiple algorithms:

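A sketch of the two variants discussed below (again an assumption-laden reconstruction: the algorithm option of the hash streamlet and the semantics of a repeated --streamlet flag follow the relpipe documentation):

    #!/bin/bash

    # Variant 1: a single hash streamlet instance computes MD5 and then
    # SHA-1 sequentially for each record; parallelism exists only across
    # records (4 worker processes).
    fetchAttributes1() {
        relpipe-in-filesystem \
            --parallel 4 \
            --file path \
            --streamlet hash \
                --option algorithm md5 \
                --option algorithm sha-1;
    }

    # Variant 2: two hash streamlet instances run as separate processes,
    # so MD5 and SHA-1 are computed in parallel even within a single
    # record, in addition to the 4-way parallelism across records.
    fetchAttributes2() {
        relpipe-in-filesystem \
            --parallel 4 \
            --file path \
            --streamlet hash --option algorithm md5 \
            --streamlet hash --option algorithm sha-1;
    }

    find /bin -print0 | fetchAttributes2 | relpipe-out-tabular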

There are two variants: in fetchAttributes1 we compute the MD5 hash and then the SHA-1 hash for each record (file), and we have parallelism (--parallel 4) over the records. In fetchAttributes2 we compute the MD5 and SHA-1 hashes in parallel for each record (file), and we also have parallelism (--parallel 4) over the records.

This is a common way streamlets work: if we ask a single streamlet instance to compute multiple attributes, it is usually done sequentially (it depends on the particular streamlet implementation). But if we create multiple instances of a streamlet, we automatically get multiple processes that work in parallel on each record. The advantage of this kind of parallelism is that we can utilize multiple CPU cores even with one or only a few records. The disadvantage is that if there is some common initialization phase (like parsing an XML file or another format), this work is duplicated in each process. It is up to the user to choose the optimal (or good enough) way – there is no automagic mechanism.
