--- a/relpipe-data/download.xml Thu Jan 30 14:57:09 2020 +0100
+++ b/relpipe-data/download.xml Mon Feb 03 22:10:07 2020 +0100
@@ -22,7 +22,7 @@
hg clone https://hg.globalcode.info/relpipe/relpipe-in-xml.cpp;
hg clone https://hg.globalcode.info/relpipe/relpipe-in-xmltable.cpp;
hg clone https://hg.globalcode.info/relpipe/relpipe-lib-cli.cpp;
-hg clone https://hg.globalcode.info/relpipe/relpipe-lib-protocol.cpp;
+hg clone https://hg.globalcode.info/relpipe/relpipe-lib-common.cpp;
hg clone https://hg.globalcode.info/relpipe/relpipe-lib-reader.cpp;
hg clone https://hg.globalcode.info/relpipe/relpipe-lib-writer.cpp;
hg clone https://hg.globalcode.info/relpipe/relpipe-lib-xmlwriter.cpp;
@@ -57,6 +57,7 @@
<h2>Released versions</h2>
<ul>
+ <li>2020-01-31: <m:a href="release-v0.15">v0.15</m:a></li>
<li>2019-10-30: <m:a href="release-v0.14">v0.14</m:a></li>
<li>2019-07-30: <m:a href="release-v0.13">v0.13</m:a></li>
<li>2019-05-28: <m:a href="release-v0.12">v0.12</m:a></li>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-parallel-hashes.xml Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,93 @@
+<stránka
+ xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+ xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+
+ <nadpis>Computing hashes in parallel</nadpis>
+ <perex>utilize all CPU cores while computing SHA-256 and other file hashes</perex>
+ <m:pořadí-příkladu>03800</m:pořadí-příkladu>
+
+ <text xmlns="http://www.w3.org/1999/xhtml">
+
+ <p>
+ Using <code>relpipe-in-filesystem</code> we can gather various file attributes
+			– basic (name, size, type, …), extended (<em>xattr</em>, e.g. the original URL), metadata embedded in files (JPEG Exif, PNG, PDF etc.), XPath values from XML, JAR/ZIP metadata…
+ or compute hashes of the file content (SHA-256, SHA-512 etc.).
+ </p>
+
+ <p>This example shows how we can compute various file content hashes and how to do it efficiently on a machine with multiple CPU cores.</p>
+
+ <p>
+ Background:
+ Contemporary storage (especially SSD or even RAM) is usually fast enough that the bottleneck is the CPU and not the storage.
+			This means that computing hashes of multiple files sequentially takes much more time than it has to.
+			So it is better to compute the hashes in parallel and utilize multiple cores of our CPU.
+			On the other hand, we are collecting several file attributes and working with structured data, which means that we have to preserve the structure and, in the end, merge all the pieces together without corrupting it.
+			This is a perfect task for <m:name/> and especially <code>relpipe-in-filesystem</code>, which is the first tool in our collection that implements streamlets and parallel processing.
+ </p>
+
+ <p>
+			The following script prints a list of files in our <code>/bin</code> directory together with their SHA-256 hashes and also tells us how many files with identical content (i.e. exactly the same bytes) we have:
+ </p>
+
+ <m:pre src="examples/parallel-hashes-1.sh" jazyk="bash"/>
+
+ <p>
+			The output looks like this:
+ </p>
+
+ <m:pre src="examples/parallel-hashes-1.txt" jazyk="text"/>
+
+ <p>
+ This pipeline consists of four steps:
+ </p>
+
+ <ul>
+ <li>
+ <code>findFiles</code>
+				– prepares the list of files separated by the <code>\0</code> byte;
+				we can also do some basic filtering here
+ </li>
+ <li>
+ <code>fetchAttributes</code>
+ – does the heavy work – computes SHA-256 hash of each file;
+				thanks to the <code>--parallel N</code> option, it utilizes N cores of our CPU;
+				we can experiment with the value of N and watch how the total time decreases
+ </li>
+ <li>
+ <code>aggregate</code>
+				– uses SQL to order the records and an SQL window function to show how many files have the same content;
+				in this step we could also use <code>relpipe-tr-awk</code> or <code>relpipe-tr-guile</code> if we prefer AWK or Guile/Scheme to SQL
+ </li>
+ <li>
+ <code>relpipe-out-tabular</code>
+				– formats the results as a table in the terminal (we could use e.g. <code>relpipe-out-gui</code> to call a GUI viewer or format the results as XML, CSV or another format)
+ </li>
+ </ul>
+
+ <p>
+			In the case of the <code>/bin</code> directory, the results are not so exciting – we see that the files with the same content are just symlinks to the same binary.
+			But we can run this pipeline on a different directory and discover real duplicates that occupy precious space on our hard drives,
+			or we can build an index for fast searching (even of offline media) and for checking whether we have a file with given content or not.
+ </p>
+
+ <p>
+			The following script shows how we can compute hashes using multiple algorithms:
+ </p>
+
+ <m:pre src="examples/parallel-hashes-2.sh" jazyk="bash"/>
+
+ <p>
+ There are two variants:
+			In <code>fetchAttributes1</code> we compute the MD5 hash and then the SHA-1 hash for each record (file), and we have parallelism (<code>--parallel 4</code>) over records.
+			In <code>fetchAttributes2</code> we compute the MD5 and SHA-1 hashes in parallel for each record (file), and we also have parallelism (<code>--parallel 4</code>) over records.
+			This is how streamlets commonly work:
+			if we ask a single streamlet instance to compute multiple attributes, it is usually done sequentially (it depends on the particular streamlet implementation).
+			But if we create multiple instances of a streamlet, we automatically have multiple processes that work in parallel on each record.
+			The advantage of this kind of parallelism is that we can utilize multiple CPU cores even with one or only a few records.
+			The disadvantage is that if there is some common initialization phase (like parsing an XML file or another format), this work is repeated in each process.
+			It is up to the user to choose the optimal (or good enough) way – there is no <em>automagic</em> mechanism.
+ </p>
+
+ </text>
+
+</stránka>
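The duplicate-detection idea from the page above can also be sketched with plain coreutils, independently of relpipe – a loose analogy only, not the actual pipeline (the worker count `4` and the batch size are arbitrary choices):

```shell
# Analogy to the pipeline above: hash files in parallel with xargs -P,
# then prefix every line with how many files share that SHA-256 hash.
count_same_hash() {  # usage: count_same_hash <directory>
  find "$1" -type f -print0 \
    | xargs -0 -P 4 -n 16 sha256sum \
    | sort \
    | awk '{ c[$1]++; line[NR] = $0; h[NR] = $1 }
           END { for (i = 1; i <= NR; i++) print c[h[i]] "\t" line[i] }' \
    | sort -n
}
```

Running e.g. `count_same_hash /bin` sorts the duplicates (count > 1) to the bottom, similar to the `same_hash_count` column in the relpipe output.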
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-runnable-jars.xml Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,68 @@
+<stránka
+ xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+ xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+
+ <nadpis>Finding runnable JARs</nadpis>
+ <perex>look for Java archives with a main class</perex>
+ <m:pořadí-příkladu>03900</m:pořadí-příkladu>
+
+ <text xmlns="http://www.w3.org/1999/xhtml">
+
+ <p>
+			Java archives (<code>*.jar</code> files) that have a main class set can be started using the <code>java -jar program.jar</code> command.
+			Let us find all JARs under a certain directory that are runnable.
+ </p>
+
+ <m:pre src="examples/runnable-jars.sh" jazyk="bash"/>
+
+ <p>
+			The script above will print output like this:
+ </p>
+
+ <m:pre src="examples/runnable-jars.txt" jazyk="text"/>
+
+ <p>
+ This pipeline consists of five steps:
+ </p>
+
+ <ul>
+ <li>
+ <code>findFiles</code>
+				– prepares the list of files separated by the <code>\0</code> byte;
+				if we omit the <code>-iname '*.jar'</code>, the result will be the same,
+				just more files will be examined
+ </li>
+ <li>
+ <code>fetchAttributes</code>
+				– does the heavy work – tries to open each given file as a JAR (which is the same as the ZIP format)
+				and looks for the <code>Main-Class</code> field in the <code>META-INF/MANIFEST.MF</code> file (if any);
+				because the <code>jar_info</code> streamlet itself is written in Java, it simply uses the existing Java functionality for main-class lookup instead of reimplementing it in custom code;
+				thanks to the <code>--parallel N</code> option, it utilizes N cores of our CPU;
+				we can experiment with the value of N and watch how the total time decreases
+ </li>
+ <li>
+ <code>filterRunable</code>
+				– uses AWK to skip the records (files) that do not have a main class;
+				in this step we could also use <code>relpipe-tr-sql</code> or <code>relpipe-tr-guile</code> if we prefer SQL or Guile/Scheme to AWK
+ </li>
+ <li>
+ <code>shortenPath</code>
+ – replaces part of the absolute path with the <code>~</code> shortcut
+ (just to make it shorter and hide our username)
+ </li>
+ <li>
+ <code>relpipe-out-tabular</code>
+				– formats the results as a table in the terminal (we could use e.g. <code>relpipe-out-gui</code> to call a GUI viewer or format the results as XML, CSV or another format)
+ </li>
+ </ul>
+
+ <p>
+			We can omit the <code>-iname '*.jar'</code> and run this pipeline on another directory
+			in order to find all valid JAR and ZIP files regardless of their extension.
+			We will also get the number of entries (files and directories) in each archive.
+			In future versions, this streamlet might be extended to optionally provide the files from the archive, or a list of them, e.g. in the form of XML.
+ </p>
+
+ </text>
+
+</stránka>
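The filtering step described above can be mimicked on plain tab-separated text – a rough sketch of what `relpipe-tr-awk --where 'main_class'` does, using ordinary awk (the sample records are invented for illustration):

```shell
# Keep only records whose second column (main_class) is non-empty –
# a plain-awk analogy of the filterRunable step (sample data is made up).
records() {
  printf 'demo1.jar\torg.springframework.boot.loader.JarLauncher\t116\n'
  printf 'library.jar\t\t42\n'
  printf 'ant.jar\torg.apache.tools.ant.Main\t801\n'
}
records | awk -F '\t' '$2 != ""'
```

The record without a main class (`library.jar`) is dropped; the other two pass through unchanged.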
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-xhtml-filesystem-xpath.xml Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,60 @@
+<stránka
+ xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+ xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+
+ <nadpis>Collecting statistics from XHTML pages</nadpis>
+ <perex>use XPath to get titles and headlines counts from web pages and then show a bar chart and statistics</perex>
+ <m:pořadí-příkladu>04000</m:pořadí-příkladu>
+
+ <text xmlns="http://www.w3.org/1999/xhtml">
+
+ <p>
+			The <code>relpipe-in-filesystem</code> tool and the <code>xpath</code> streamlet allow us to extract multiple values (attributes) from XML files.
+ We can use this feature to collect data from e.g. XHTML pages.
+ </p>
+
+ <m:pre src="examples/xhtml-filesystem-xpath.sh" jazyk="bash"/>
+
+ <p>
+			The script above will show this bar chart and statistics:
+ </p>
+
+ <m:img src="img/xhtml-filesystem-xpath-1.png"/>
+
+ <p>
+ This pipeline consists of four steps:
+ </p>
+
+ <ul>
+ <li>
+ <code>findFiles</code>
+				– prepares the list of files separated by the <code>\0</code> byte;
+				we can add <code>-iname '*.xhtml'</code> if we know the extension, to make the pipeline more efficient
+ </li>
+ <li>
+ <code>fetchAttributes</code>
+				– does the heavy work – tries to parse each given file as XML
+				and, if it is valid, extracts several values specified by the XPath expressions;
+				thanks to the <code>--parallel N</code> option, it utilizes N cores of our CPU;
+				we can experiment with the value of N and watch how the total time decreases
+ </li>
+ <li>
+ <code>filterAndOrder</code>
+				– uses SQL to skip the records (files) that are not XHTML
+				and takes the five valid files with the highest number of headlines
+ </li>
+ <li>
+ <code>relpipe-out-gui</code>
+				– displays the data in a GUI window and generates a bar chart from the numeric values
+				(we could use e.g. <code>relpipe-out-tabular</code> to display the data in the text terminal or format the results as XML, CSV or another format)
+ </li>
+ </ul>
+
+ <p>
+ We can use a similar pipeline to extract any values from any set of XML files (e.g. Maven POM files or WSDL definitions).
+			Using <code>--option mode raw-xml</code> we can even extract sub-trees (XML fragments) from the XML files, so we can also collect arbitrarily structured data, not only simple values like strings or booleans.
+ </p>
+
+ </text>
+
+</stránka>
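Outside of relpipe, a crude approximation of the headline counts can be obtained with grep – unlike the `xpath` streamlet this is not namespace-aware and does no real XML parsing, so treat it only as an illustration:

```shell
# Count <h1>/<h2>/<h3> opening tags in each *.xhtml file – crude
# (no XML parsing, unlike the xpath streamlet described above).
headline_counts() {  # usage: headline_counts <directory>
  for f in "$1"/*.xhtml; do
    [ -e "$f" ] || continue
    printf '%s\th1=%d\th2=%d\th3=%d\n' "$f" \
      "$(grep -o '<h1[ >]' "$f" | wc -l)" \
      "$(grep -o '<h2[ >]' "$f" | wc -l)" \
      "$(grep -o '<h3[ >]' "$f" | wc -l)"
  done
}
```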
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples/parallel-hashes-1.sh Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,29 @@
+#!/bin/bash
+
+findFiles() {
+ find /bin/ -print0;
+}
+
+fetchAttributes() {
+ relpipe-in-filesystem \
+ --parallel 4 \
+ --file path \
+ --file type \
+ --file size \
+ --streamlet hash;
+}
+
+aggregate() {
+ relpipe-tr-sql \
+ --relation "file_hashes" \
+ "SELECT
+ path,
+ type,
+ size,
+ sha256,
+ count(*) OVER (PARTITION BY sha256) AS same_hash_count
+ FROM filesystem
+ ORDER BY same_hash_count, sha256, path, type";
+}
+
+findFiles | fetchAttributes | aggregate | relpipe-out-tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples/parallel-hashes-1.txt Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,40 @@
+file_hashes:
+ ╭─────────────────────────────────────────┬───────────────┬────────────────┬──────────────────────────────────────────────────────────────────┬──────────────────────────╮
+ │ path (string) │ type (string) │ size (integer) │ sha256 (string) │ same_hash_count (string) │
+ ├─────────────────────────────────────────┼───────────────┼────────────────┼──────────────────────────────────────────────────────────────────┼──────────────────────────┤
+ │ /bin/expiry │ f │ 31000 │ 006c97d68fbddf175f326e554693ceaea984d6406bb5f837f1a00a7c6008218d │ 1 │
+ │ /bin/mapscrn │ f │ 27216 │ 00941d8eb6dc9ddf4b7d0651bd21ea1df6e325259f1d3ba9f7916d1e29ec5977 │ 1 │
+ │ /bin/stdbuf │ f │ 51904 │ 00a5270c7b0262754886e4d26ebc1a5a03911c46fa3c02e2b8d2b346be1f924a │ 1 │
+ │ /bin/ps2ps2 │ f │ 669 │ 00d9eb918871124f72c14404158d08db63c24c38a9f426fbc0a556b4d7febab2 │ 1 │
+ │ /bin/kernel-install │ f │ 4639 │ 00e85383894393a0cf3a851839a57eb96056788bea2553c8c166fc4b814daa55 │ 1 │
+ │ /bin/ionice │ f │ 30800 │ 020a4770df648af0e608425a1dba3df35a14dad7bb4d3f17dde3e3142a35f820 │ 1 │
+ │ /bin/dh_python2 │ f │ 1056 │ 02d870b729b8c14e0fdf287a3dbfc161570d04ab75c242ff368801eaeb4dd742 │ 1 │
+ │ /bin/lavadecode │ f │ 18760 │ 03a751439b0be2b65827c0e54fd569dbc0cd6dc6fd561dc8afdd7df04bb0414c │ 1 │
+ │ /bin/libnetcfg │ f │ 15775 │ 03ea004e8921626bdfecbc5d4b200fca2185da59ce4b4bd5407109064525defa │ 1 │
+ │ /bin/opldecode │ f │ 18752 │ 040517423bce47a55d1b6ef6b8232226fe0dce90039447e2a1ab4e0838162128 │ 1 │
+ │ /bin/openssl │ f │ 736776 │ 04997b88144b719a6e71b5e206d2c8b067dd827f0bdcad1c0e5e7a395bcf54f0 │ 1 │
+ │ … │ … │ … │ … │ … │
+ │ /bin/i386 │ l │ 22880 │ b704c7eae64ebde4d9a989aa35e1573a310241e16266596dc049513fbfdc1bf3 │ 5 │
+ │ /bin/linux32 │ l │ 22880 │ b704c7eae64ebde4d9a989aa35e1573a310241e16266596dc049513fbfdc1bf3 │ 5 │
+ │ /bin/linux64 │ l │ 22880 │ b704c7eae64ebde4d9a989aa35e1573a310241e16266596dc049513fbfdc1bf3 │ 5 │
+ │ /bin/setarch │ f │ 22880 │ b704c7eae64ebde4d9a989aa35e1573a310241e16266596dc049513fbfdc1bf3 │ 5 │
+ │ /bin/x86_64 │ l │ 22880 │ b704c7eae64ebde4d9a989aa35e1573a310241e16266596dc049513fbfdc1bf3 │ 5 │
+ │ /bin/cc │ l │ 1100664 │ ee2ce583149c835d967a612e3dce22a4b46ca660c4b41783ac9a1cbcba202e0d │ 5 │
+ │ /bin/gcc │ l │ 1100664 │ ee2ce583149c835d967a612e3dce22a4b46ca660c4b41783ac9a1cbcba202e0d │ 5 │
+ │ /bin/gcc-8 │ l │ 1100664 │ ee2ce583149c835d967a612e3dce22a4b46ca660c4b41783ac9a1cbcba202e0d │ 5 │
+ │ /bin/x86_64-linux-gnu-gcc │ l │ 1100664 │ ee2ce583149c835d967a612e3dce22a4b46ca660c4b41783ac9a1cbcba202e0d │ 5 │
+ │ /bin/x86_64-linux-gnu-gcc-8 │ f │ 1100664 │ ee2ce583149c835d967a612e3dce22a4b46ca660c4b41783ac9a1cbcba202e0d │ 5 │
+ │ /bin/lzcat │ l │ 81192 │ a9769e8c10b2f19a9ccbe449a64fbeaa5bbca06465b98802b7d5785918897ba8 │ 6 │
+ │ /bin/lzma │ l │ 81192 │ a9769e8c10b2f19a9ccbe449a64fbeaa5bbca06465b98802b7d5785918897ba8 │ 6 │
+ │ /bin/unlzma │ l │ 81192 │ a9769e8c10b2f19a9ccbe449a64fbeaa5bbca06465b98802b7d5785918897ba8 │ 6 │
+ │ /bin/unxz │ l │ 81192 │ a9769e8c10b2f19a9ccbe449a64fbeaa5bbca06465b98802b7d5785918897ba8 │ 6 │
+ │ /bin/xz │ f │ 81192 │ a9769e8c10b2f19a9ccbe449a64fbeaa5bbca06465b98802b7d5785918897ba8 │ 6 │
+ │ /bin/xzcat │ l │ 81192 │ a9769e8c10b2f19a9ccbe449a64fbeaa5bbca06465b98802b7d5785918897ba8 │ 6 │
+ │ /bin/lzegrep │ l │ 5628 │ fbb4431fbf461d43c8a8473d8afd461a3a64c5dc6d3a35dd0b15dca2253ec4e9 │ 6 │
+ │ /bin/lzfgrep │ l │ 5628 │ fbb4431fbf461d43c8a8473d8afd461a3a64c5dc6d3a35dd0b15dca2253ec4e9 │ 6 │
+ │ /bin/lzgrep │ l │ 5628 │ fbb4431fbf461d43c8a8473d8afd461a3a64c5dc6d3a35dd0b15dca2253ec4e9 │ 6 │
+ │ /bin/xzegrep │ l │ 5628 │ fbb4431fbf461d43c8a8473d8afd461a3a64c5dc6d3a35dd0b15dca2253ec4e9 │ 6 │
+ │ /bin/xzfgrep │ l │ 5628 │ fbb4431fbf461d43c8a8473d8afd461a3a64c5dc6d3a35dd0b15dca2253ec4e9 │ 6 │
+ │ /bin/xzgrep │ f │ 5628 │ fbb4431fbf461d43c8a8473d8afd461a3a64c5dc6d3a35dd0b15dca2253ec4e9 │ 6 │
+ ╰─────────────────────────────────────────┴───────────────┴────────────────┴──────────────────────────────────────────────────────────────────┴──────────────────────────╯
+Record count: 1001
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples/parallel-hashes-2.sh Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,30 @@
+#!/bin/bash
+
+findFiles() {
+ find /bin/ -print0;
+}
+
+fetchAttributes1() {
+ relpipe-in-filesystem \
+ --parallel 4 \
+ --file path \
+ --file type \
+ --file size \
+ --streamlet hash \
+ --option attribute md5 \
+ --option attribute sha1;
+}
+
+fetchAttributes2() {
+ relpipe-in-filesystem \
+ --parallel 4 \
+ --file path \
+ --file type \
+ --file size \
+ --streamlet hash \
+ --option attribute md5 \
+ --streamlet hash \
+ --option attribute sha1;
+}
+
+findFiles | fetchAttributes2 | relpipe-out-tabular
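The per-record parallelism of `fetchAttributes2` can be imitated with plain shell: the two digests of one file are computed as two concurrent background processes – a loose coreutils analogy, not how streamlets are actually implemented:

```shell
# Loose analogy of fetchAttributes2: compute the MD5 and SHA-1 digests
# of one file as two concurrent processes and wait for both.
hash_file() {
  md5sum  "$1" | cut -d ' ' -f 1 > "$1.md5"  &
  sha1sum "$1" | cut -d ' ' -f 1 > "$1.sha1" &
  wait  # both digests run in parallel; join before printing the record
  printf '%s\t%s\t%s\n' "$1" "$(cat "$1.md5")" "$(cat "$1.sha1")"
  rm -f "$1.md5" "$1.sha1"
}
```

For example, `hash_file /etc/hostname` prints the path, the MD5 and the SHA-1 separated by tabs.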
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples/release-v0.15.sh Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,50 @@
+# Install dependencies as root:
+su -c "apt install g++ make cmake mercurial pkg-config"
+su -c "apt install libxerces-c-dev" # needed only for relpipe-in-xml module
+su -c "apt install guile-2.2-dev" # needed only for relpipe-tr-guile module; guile-2.0-dev also works but requires a patch (see below)
+su -c "apt install gawk" # needed only for relpipe-tr-awk module
+su -c "apt install libxml++2.6-dev" # needed only for relpipe-in-xmltable module
+su -c "apt install libsqlite3-dev" # needed only for relpipe-tr-sql module
+
+# Run rest of installation as a non-root user:
+export RELPIPE_VERSION="v0.15"
+export RELPIPE_SRC=~/src
+export RELPIPE_BUILD=~/build
+export RELPIPE_INSTALL=~/install
+export PKG_CONFIG_PATH="$RELPIPE_INSTALL/lib/pkgconfig/:$PKG_CONFIG_PATH"
+export PATH="$RELPIPE_INSTALL/bin:$PATH"
+
+rm -rf "$RELPIPE_BUILD"/relpipe-*
+mkdir -p "$RELPIPE_SRC" "$RELPIPE_BUILD" "$RELPIPE_INSTALL"
+
+# Helper functions:
+relpipe_download() { for m in "$@"; do cd "$RELPIPE_SRC" && ([[ -d "relpipe-$m.cpp" ]] && hg pull -R "relpipe-$m.cpp" && hg update -R "relpipe-$m.cpp" "$RELPIPE_VERSION" || hg clone -u "$RELPIPE_VERSION" https://hg.globalcode.info/relpipe/relpipe-$m.cpp) || break; done; }
+relpipe_install() { for m in "$@"; do cd "$RELPIPE_BUILD" && mkdir -p relpipe-$m.cpp && cd relpipe-$m.cpp && cmake -DCMAKE_INSTALL_PREFIX:PATH="$RELPIPE_INSTALL" "$RELPIPE_SRC/relpipe-$m.cpp" && make && make install || break; done; }
+
+# Download all sources:
+relpipe_download lib-common lib-reader lib-writer lib-cli lib-xmlwriter in-cli in-fstab in-xml in-xmltable in-csv in-filesystem in-recfile out-gui.qt out-nullbyte out-ods out-tabular out-xml out-csv out-asn1 out-recfile tr-cut tr-grep tr-python tr-sed tr-validator tr-guile tr-awk tr-sql
+
+# Optional: At this point, we have all dependencies and sources downloaded, so we can disconnect this computer from the internet in order to verify that our build process is sane, deterministic and does not depend on any external resources.
+
+# Build and install libraries:
+relpipe_install lib-common lib-reader lib-writer lib-cli lib-xmlwriter
+
+# Build and install tools:
+relpipe_install in-cli in-fstab in-xml in-xmltable in-csv in-recfile tr-cut tr-grep tr-sed tr-guile tr-awk tr-sql out-nullbyte out-ods out-tabular out-xml out-csv out-asn1 out-recfile in-filesystem
+
+# Load Bash completion scripts:
+for c in "$RELPIPE_SRC"/relpipe-*/bash-completion.sh ; do . "$c"; done
+
+# Enable streamlet examples:
+export RELPIPE_IN_FILESYSTEM_STREAMLET_PATH="$RELPIPE_SRC"/relpipe-in-filesystem.cpp/streamlet-examples/
+
+# Clean-up:
+unset -f relpipe_install
+unset -f relpipe_download
+unset -v RELPIPE_VERSION
+unset -v RELPIPE_SRC
+unset -v RELPIPE_BUILD
+unset -v RELPIPE_INSTALL
+
+# Compute hashes of your binaries in parallel:
+find /bin/ -print0 | relpipe-in-filesystem --parallel 8 --file path --file size --streamlet hash | relpipe-out-csv
\ No newline at end of file
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples/runnable-jars.sh Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+findFiles() {
+ find ~/.m2/ -iname '*.jar' -printf '%p\0';
+}
+
+fetchAttributes() {
+ relpipe-in-filesystem \
+ --parallel 4 \
+ --file path \
+ --streamlet jar_info;
+}
+
+filterRunable() {
+ relpipe-tr-awk --relation '.*' --where 'main_class';
+}
+
+shortenPath() {
+ relpipe-tr-sed '.*' 'path' "^$HOME/?" "~/";
+}
+
+findFiles | fetchAttributes | filterRunable | shortenPath | relpipe-out-tabular
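The `shortenPath` step has a direct counterpart in plain sed: rewriting a leading `$HOME/` prefix to `~/` on a plain list of paths (the sample paths are hypothetical):

```shell
# Replace a leading $HOME/ with ~/ in a list of paths, like the
# shortenPath step above, but with ordinary sed on plain text.
shorten_path() {
  sed "s|^$HOME/\{0,1\}|~/|"   # \{0,1\} = portable BRE form of /?
}
printf '%s\n' "$HOME/.m2/repository/demo1.jar" /usr/share/java/x.jar \
  | shorten_path
```

Paths outside the home directory pass through unchanged.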
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples/runnable-jars.txt Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,20 @@
+filesystem:
+ ╭─────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────┬───────────────────╮
+ │ path (string) │ main_class (string) │ entries (integer) │
+ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┼───────────────────┤
+ │ ~/.m2/repository/com/example/demo1/0.0.1-SNAPSHOT/demo1-0.0.1-SNAPSHOT.jar │ org.springframework.boot.loader.JarLauncher │ 116 │
+ │ ~/.m2/repository/net/java/dev/jna/jna/5.2.0/jna-5.2.0.jar │ com.sun.jna.Native │ 165 │
+ │ ~/.m2/repository/org/sonatype/sisu/sisu-inject-bean/1.4.2/sisu-inject-bean-1.4.2.jar │ org.sonatype.guice.bean.containers.Main │ 165 │
+ │ ~/.m2/repository/org/apache/maven/surefire/surefire-grouper/2.12.4/surefire-grouper-2.12.4.jar │ org.apache.maven.surefire.group.parse.GroupMatcherParser │ 34 │
+ │ ~/.m2/repository/org/apache/commons/commons-compress/1.11/commons-compress-1.11.jar │ org.apache.commons.compress.archivers.Lister │ 267 │
+ │ ~/.m2/repository/org/apache/commons/commons-compress/1.14/commons-compress-1.14.jar │ org.apache.commons.compress.archivers.Lister │ 328 │
+ │ ~/.m2/repository/org/apache/commons/commons-compress/1.18/commons-compress-1.18.jar │ org.apache.commons.compress.archivers.Lister │ 377 │
+ │ ~/.m2/repository/org/apache/commons/commons-compress/1.16.1/commons-compress-1.16.1.jar │ org.apache.commons.compress.archivers.Lister │ 350 │
+ │ ~/.m2/repository/org/apache/ant/ant-launcher/1.7.0/ant-launcher-1.7.0.jar │ org.apache.tools.ant.launch.Launcher │ 12 │
+ │ ~/.m2/repository/org/apache/ant/ant/1.7.0/ant-1.7.0.jar │ org.apache.tools.ant.Main │ 801 │
+ │ ~/.m2/repository/org/eclipse/sisu/org.eclipse.sisu.inject/0.0.0.M5/org.eclipse.sisu.inject-0.0.0.M5.jar │ org.eclipse.sisu.launch.Main │ 236 │
+ │ ~/.m2/repository/org/projectlombok/lombok/1.16.22/lombok-1.16.22.jar │ lombok.launch.Main │ 932 │
+ │ ~/.m2/repository/org/beanshell/bsh/2.0b4/bsh-2.0b4.jar │ bsh.Console │ 238 │
+ │ ~/.m2/repository/cz/frantovo/PigLatin/1.0-SNAPSHOT/PigLatin-1.0-SNAPSHOT.jar │ cz.frantovo.piglatin.CLIStarter │ 14 │
+ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────┴───────────────────╯
+Record count: 14
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples/xhtml-filesystem-xpath.sh Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,45 @@
+#!/bin/bash
+
+XMLNS_H="http://www.w3.org/1999/xhtml"
+
+# If we set xmlns_h="…", we can omit: --option xmlns_h "$XMLNS_H"
+# because XML namespaces can be provided either as an option or as an environment variable.
+# Options have precedence.
+
+findFiles() {
+ find -print0;
+}
+
+fetchAttributes() {
+ relpipe-in-filesystem \
+ --parallel 8 \
+ --file name \
+ --streamlet xpath \
+ --option xmlns_h "$XMLNS_H" \
+ --option attribute '.' --option mode boolean --as 'valid_xml' \
+ --option attribute 'namespace-uri()' --as 'root_xmlns' \
+ --option attribute '/h:html/h:head/h:title' --as 'title' \
+ --option attribute 'count(//h:h1)' --as 'h1_count' \
+ --option attribute 'count(//h:h2)' --as 'h2_count' \
+ --option attribute 'count(//h:h3)' --as 'h3_count'
+}
+
+filterAndOrder() {
+ relpipe-tr-sql \
+ --relation "pages" \
+ "SELECT
+ name,
+ title,
+ h1_count,
+ h2_count,
+ h3_count
+ FROM filesystem WHERE root_xmlns = ?
+ ORDER BY h1_count + h2_count + h3_count DESC
+ LIMIT 5" \
+ --type-cast 'h1_count' integer \
+ --type-cast 'h2_count' integer \
+ --type-cast 'h3_count' integer \
+ --parameter "$XMLNS_H";
+}
+
+findFiles | fetchAttributes | filterAndOrder | relpipe-out-gui -title "Pages and titles"
Binary file relpipe-data/img/streamlet-release-v0.15.png has changed
Binary file relpipe-data/img/xhtml-filesystem-xpath-1.png has changed
--- a/relpipe-data/implementation.xml Thu Jan 30 14:57:09 2020 +0100
+++ b/relpipe-data/implementation.xml Mon Feb 03 22:10:07 2020 +0100
@@ -22,7 +22,7 @@
relpipe-in-xml.cpp executable input c++ GNU GPLv3
relpipe-in-xmltable.cpp executable input c++ GNU GPLv3
relpipe-lib-cli.cpp library header-only c++ GNU GPLv3
- relpipe-lib-protocol.cpp library header-only c++ GNU LGPLv3 or GPLv2
+ relpipe-lib-common.cpp library shared c++ GNU LGPLv3 or GPLv2
relpipe-lib-reader.cpp library shared c++ GNU LGPLv3 or GPLv2
relpipe-lib-writer.cpp library shared c++ GNU LGPLv3 or GPLv2
relpipe-lib-xmlwriter.cpp library header-only c++ GNU GPLv3
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/release-v0.15.xml Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,260 @@
+<stránka
+ xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+ xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+
+ <nadpis>Release v0.15</nadpis>
+ <perex>new public release of Relational pipes</perex>
+ <m:release>v0.15</m:release>
+
+ <text xmlns="http://www.w3.org/1999/xhtml">
+ <p>
+			We are pleased to introduce the new development version of <m:name/>.
+			This release brings two big new features – streamlets and parallel processing – plus several smaller improvements.
+ </p>
+
+ <ul>
+ <li>
+				<strong>SLEB128</strong>: variable-length integers are now signed (i.e. they can even be negative!) and encoded as SLEB128</li>
+ <li>
+ <strong>streamlets in relpipe-in-filesystem</strong>: see details below</li>
+ <li>
+ <strong>parallel processing in relpipe-in-filesystem</strong>: see details below</li>
+ <li>
+ <strong>multiple modes in relpipe-in-xmltable</strong>: see details below</li>
+ <li>
+ <strong>XInclude in relpipe-in-xmltable</strong>: use <code>--xinclude true</code> to process XIncludes before converting XML to relations</li>
+ <li>
+				<strong>relpipe-lib-protocol → relpipe-lib-common</strong>: this module was renamed and converted to a shared library; it will contain some common functions instead of just header files</li>
+ </ul>
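As an illustration of the SLEB128 item above, here is a tiny encoder sketch in Bash – not taken from the relpipe sources, just a demonstration of the format, assuming 64-bit shell arithmetic with arithmetic right shift:

```shell
# SLEB128-encode a signed integer as hex bytes (demonstration only).
sleb128() {
  local n=$1 byte
  while :; do
    byte=$(( n & 0x7f ))
    n=$(( n >> 7 ))
    if { [ "$n" -eq 0 ] && [ $(( byte & 0x40 )) -eq 0 ]; } \
       || { [ "$n" -eq -1 ] && [ $(( byte & 0x40 )) -ne 0 ]; }; then
      printf '%02x\n' "$byte"           # last byte: high bit clear
      break
    fi
    printf '%02x' $(( byte | 0x80 ))    # more bytes follow: high bit set
  done
}
```

For example, `sleb128 -128` prints `807f` – the sign lives in bit 6 of the last byte, which is why negative values need no separate sign marker.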
+
+ <p>
+ See the <m:a href="examples">examples</m:a> and <m:a href="screenshots">screenshots</m:a> pages for details.
+ </p>
+
+ <p>
+			Please note that this is still a development release and thus the API (libraries, CLI arguments, formats) might and will change.
+ Any suggestions, ideas and bug reports are welcome in our <m:a href="contact">mailing list</m:a>.
+ </p>
+
+ <h2>Streamlets</h2>
+
+ <p>
+			A <em>streamlet</em> is a small stream that flows into the main stream, fuses with it and (typically) brings new attributes.
+ </p>
+ <p>
+			From the technical point of view, streamlets are something between classic <m:a href="classic-example">filters</m:a> and functions.
+			Unlike a function, a streamlet can be written in any programming language and runs as a separate process.
+			Unlike a filter, a streamlet does not replace the whole stream with a new one, but reads certain attributes from the original stream and adds some new ones back.
+			A common feature of filters and streamlets is that both continually read the input and continually deliver output, so the memory requirements are usually constant and „infinite“ streams might be processed.
+			And unlike ordinary commands (executed e.g. using <code>xargs</code> or a shell loop over a set of files), a streamlet does not <code>fork()</code> and <code>exec()</code> for each input file – the single streamlet process is reused for all records in the stream, which is much more efficient (especially if there is some expensive initialization phase).
+ </p>
+
+ <p>
+ Because streamlets are small scripts or compiled programs, they can be used for extending <m:name/> with minimal effort.
+			A streamlet can be e.g. a few-line Bash script – or, on the other hand, a more powerful C++ or Java program.
+			Currently we have templates/examples written in Bash, C++ and Java, but it is possible to use any scripting or programming language.
+ The streamlet communicates with its parent (who manages the whole stream) through a simple <a href="img/streamlet-release-v0.15.png">message-based protocol</a>.
+ Full documentation will be published when stable (before v1.0.0) as a part of the public API.
+ </p>
+
+ <p>
+ The first module where streamlets have been implemented is <code>relpipe-in-filesystem</code>.
+ Streamlets in this module get a single input attribute (the file path) and add various file metadata to the stream.
+			We have e.g. streamlets that compute hashes (SHA-256 etc.), extract metadata from image files (PNG, JPEG etc.) or PDF documents (title, author… or even the full content in plain text), extract <m:a href="streamlets-preview">OCR</m:a>-recognized text from images,
+			count lines of code, extract portions of XML files using XPath, or read metadata from JAR/ZIP files.
+			Streamlets are a way to keep <code>relpipe-in-filesystem</code> simple, with a small code footprint, while making it extensible and thus powerful.
+			We are not going to face the question: „Should we add this nice feature (+) and thus also this library dependency (-)? Would it be bloatware or not?“
+			We (or the users) can add any feature through streamlets while the core <code>relpipe-in-filesystem</code> stays simple and nobody who does not need a given feature suffers from the growing complexity.
+ </p>
+ <p>
+			But streamlets are not limited to <code>relpipe-in-filesystem</code> – they are a general concept and there will be a <code>relpipe-tr-streamlet</code> module.
+			Such streamlets will get any set of input attributes (not only file names) defined by the user and compute values based on them.
+			Such a streamlet can e.g. modify a text attribute, compute the sum of numeric attributes, encrypt or decrypt values or interact with some external systems.
+ Writing a streamlet is easier than writing a transformation (like <code>relpipe-tr-*</code>) and it is more than OK to write simple single-purpose ad-hoc streamlets.
+ It is like writing simple shell scripts or functions.
+ Examples of really simple streamlets are: <code>inode</code> (Bash) and <code>pid</code> (C++).
+			It requires implementing only two functions: the first one returns the names and types of the output attributes and the second one returns those attributes for each record.
+			However, streamlets might be parametrized through options, might return a dynamic number of output attributes and might provide complex logic.
+			Some streamlets will become a stable part of the <m:name/> specification and API (<code>xpath</code> and <code>hash</code> seem to be such candidates).
+ </p>
+ <p>
+			One of the open questions is whether to keep streamlets in <code>relpipe-in-filesystem</code> once we have <code>relpipe-tr-streamlet</code>.
+			<em>One tool should do one thing</em> and <em>we should not duplicate the effort</em>…
+			But it still makes some sense, because file streamlets are a specific kind of streamlets and e.g. Bash completion should suggest them when we work with files but not with other data.
+			It is also nice to have all metadata collecting at the same level in a single command (i.e. <code>--streamlet</code> beside <code>--file</code> and <code>--xattr</code>)
+			rather than having to collect basic and extended file attributes using one command and other file metadata using a different one.
+ </p>
+
+ <h2>Parallel processing</h2>
+
+ <p>
+ There are two kinds of parallelism: over attributes and over records.
+ </p>
+
+ <p>
+ Because streamlets are forked processes, they are quite naturally parallelized over attributes.
+ We can e.g. compute a SHA-1 hash in one streamlet and a SHA-256 hash in another and thus utilize two CPU cores (or we can ask a single streamlet to compute both hashes, which utilizes only one core).
+ The <code>relpipe-in-filesystem</code> tool simply 1) feeds all streamlet instances with the current file name, 2) the streamlets work in parallel, and then 3) the tool collects the results from all of them.
+ </p>
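+ <p>
+ For illustration, a hedged sketch of such an invocation (the streamlet option name <code>algorithm</code> and the attribute names are assumptions, not taken from the manual):
+ </p>
+ ```shell
+ # compute two hashes per file; each streamlet runs as its own process
+ find . -type f -print0 \
+   | relpipe-in-filesystem \
+       --file name \
+       --streamlet hash --option algorithm sha-1   --as sha1 \
+       --streamlet hash --option algorithm sha-256 --as sha256 \
+   | relpipe-out-tabular
+ ```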
+
+ <p>
+ But that alone is not enough. Today we usually have more CPU cores than heavy attributes (like hashes).
+ So we need to process multiple records in parallel.
+ The first design proposal (not implemented) was that the tool would simply distribute the file names to the STDINs of particular streamlet processes in a round-robin fashion and the processes would write to the common STDOUT (just with a lock for synchronization to keep the records atomic – the <m:name/> data format is specifically designed for such use).
+ This would be really simple and somewhat helpful (better than nothing).
+ But this design has a significant flaw: the tool is not aware of how busy particular streamlet processes are and feeds them with tasks (file names) equally.
+ So it would work satisfactorily only if all tasks were of similar difficulty.
+ This is unfortunately not the usual case because e.g. computing the hash of a big file takes much more time than computing the hash of a small one.
+ Thus some streamlet processes would be overloaded while others would be idle, and in the end the whole group would be waiting for the overloaded ones (so only one or a few CPU cores would be utilized).
+ So this is not a good way to go.
+ </p>
+
+ <p>
+ The solution is a queue. The tool feeds the tasks (file names in the <code>relpipe-in-filesystem</code> case) to the queue
+ and the streamlet processes fetch them from the queue as soon as they are idle.
+ So we utilize all CPU cores all the time (provided we have more records than CPU cores, which is usually true).
+ Because our target platforms are POSIX operating systems (the primary one being GNU/Linux), we chose POSIX MQ as the queue.
+ POSIX MQ is a nice and simple technology; it is standardized and really classic. It does not require any broker process or any third-party library, so it brings no additional dependencies – it is provided directly by the OS.
+ However, a fallback is still possible:
+ a) if we set <code>--parallel 1</code> (which is the default behavior), the tool runs directly in a single process without the queue;
+ b) POSIX MQ has a quite simple API, so it is possible to write an adapter and port the tool to another system that lacks POSIX MQ and still enjoy the parallelism (or simply reimplement this API using shared memory and a semaphore).
+ </p>
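+ <p>
+ A sketch of such a parallel invocation (only <code>--parallel</code> itself is mentioned above; the remaining options are assumptions):
+ </p>
+ ```shell
+ # let up to 8 worker processes fetch file names from the queue
+ find /data -type f -print0 \
+   | relpipe-in-filesystem --parallel 8 \
+       --file name \
+       --streamlet hash --option algorithm sha-256 \
+   | relpipe-out-tabular
+ ```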
+
+ <p>
+ We could add another queue on the output side and use it for serializing the stream (which flows to the single STDOUT/FD).
+ But that is not necessary (thanks to the <m:name/> format design) and would just add overhead.
+ So on the output side we use just a POSIX semaphore (and a lock/guard based on it).
+ Thus the tool still has no dependencies other than the standard library and the operating system.
+ </p>
+
+ <p>
+ If we still have idle CPU cores or machines and need even more parallelism, streamlets can fork their own sub-processes, use threads or a technology like MPI or OpenMP.
+ However, the simple parallel processing of records (<code>--parallel N</code>) is usually more than sufficient and utilizes our hardware efficiently.
+ </p>
+
+ <h2>XPath modes</h2>
+
+ <p>
+ Both <code>relpipe-in-xmltable</code> and the <code>xpath</code> streamlet use the XPath language to extract values from XML documents.
+ There are several modes of value extraction:
+ </p>
+ <ul>
+ <li>
+ <code>string</code>: the default mode; simply the text content
+ </li>
+ <li>
+ <code>boolean</code>: the value converted to a boolean in the XPath fashion;
+ <!--
+ can be used also to check whether given file is valid XML:
+ <code>- -streamlet xpath - -option attribute . - -option mode boolean - -as valid_xml</code>
+ -->
+ </li>
+ <li>
+ <code>raw-xml</code>: a portion of the original XML document;
+ this is a way to put multiple values or any structured data in a single attribute;
+ if the XPath points to multiple nodes, they can still be returned as a valid XML document using a configurable wrapper node (so we can return e.g. all headlines from a document)
+ </li>
+ <li>
+ <code>line-number</code>: the number of the line where the given node was found;
+ this can be used for referencing a particular place in the document</li>
+ <li>
+ <code>xpath</code>: an XPath expression pointing to the particular node;
+ it is usually a different XPath expression than the original one (which might point to a set of nodes);
+ this can also be used for referencing a particular place in the document</li>
+ </ul>
+
+ <p>
+ Both tools share the naming convention and are configured in a similar way – using e.g. <code>relpipe-in-xmltable --mode raw-xml</code> or <code>--streamlet xpath --option mode raw-xml</code>.
+ </p>
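+ <p>
+ A hedged sketch of both spellings; only <code>--mode raw-xml</code> and <code>--option mode raw-xml</code> appear above, the rest is assumed and details are elided:
+ </p>
+ ```shell
+ # standalone tool: extract raw XML fragments
+ cat page.xhtml | relpipe-in-xmltable --mode raw-xml ...
+ 
+ # the same mode through the xpath streamlet
+ find . -name '*.xhtml' -print0 \
+   | relpipe-in-filesystem \
+       --streamlet xpath --option attribute //h2 --option mode raw-xml
+ ```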
+
+ <h2>Feature overview</h2>
+
+ <h3>Data types</h3>
+ <ul>
+ <li m:since="v0.8">boolean</li>
+ <li m:since="v0.15">variable-length signed integer (SLEB128)</li>
+ <li m:since="v0.8">string in UTF-8</li>
+ </ul>
+ <h3>Inputs</h3>
+ <ul>
+ <li m:since="v0.11">Recfile</li>
+ <li m:since="v0.9">XML</li>
+ <li m:since="v0.13">XMLTable</li>
+ <li m:since="v0.9">CSV</li>
+ <li m:since="v0.9">file system</li>
+ <li m:since="v0.8">CLI</li>
+ <li m:since="v0.8">fstab</li>
+ <li m:since="v0.14">SQL script</li>
+ </ul>
+ <h3>Transformations</h3>
+ <ul>
+ <li m:since="v0.13">sql: filtering and transformations using the SQL language</li>
+ <li m:since="v0.12">awk: filtering and transformations using the classic AWK tool and language</li>
+ <li m:since="v0.10">guile: filtering and transformations defined in the Scheme language using GNU Guile</li>
+ <li m:since="v0.8">grep: regular expression filter, removes unwanted records from the relation</li>
+ <li m:since="v0.8">cut: regular expression attribute cutter (removes or duplicates attributes and can also DROP whole relation)</li>
+ <li m:since="v0.8">sed: regular expression replacer</li>
+ <li m:since="v0.8">validator: just a pass-through filter that crashes on invalid data</li>
+ <li m:since="v0.8">python: highly experimental</li>
+ </ul>
+ <h3>Streamlets</h3>
+ <ul>
+ <li m:since="v0.15">xpath (example, unstable)</li>
+ <li m:since="v0.15">hash (example, unstable)</li>
+ <li m:since="v0.15">jar_info (example, unstable)</li>
+ <li m:since="v0.15">mime_type (example, unstable)</li>
+ <li m:since="v0.15">exiftool (example, unstable)</li>
+ <li m:since="v0.15">pid (example, unstable)</li>
+ <li m:since="v0.15">cloc (example, unstable)</li>
+ <li m:since="v0.15">exiv2 (example, unstable)</li>
+ <li m:since="v0.15">inode (example, unstable)</li>
+ <li m:since="v0.15">lines_count (example, unstable)</li>
+ <li m:since="v0.15">pdftotext (example, unstable)</li>
+ <li m:since="v0.15">pdfinfo (example, unstable)</li>
+ <li m:since="v0.15">tesseract (example, unstable)</li>
+ </ul>
+ <h3>Outputs</h3>
+ <ul>
+ <li m:since="v0.11">ASN.1 BER</li>
+ <li m:since="v0.11">Recfile</li>
+ <li m:since="v0.9">CSV</li>
+ <li m:since="v0.8">tabular</li>
+ <li m:since="v0.8">XML</li>
+ <li m:since="v0.8">nullbyte</li>
+ <li m:since="v0.8">GUI in Qt</li>
+ <li m:since="v0.8">ODS (LibreOffice)</li>
+ </ul>
+
+ <h2>New examples</h2>
+ <ul>
+ <li><m:a href="examples-parallel-hashes">Computing hashes in parallel</m:a></li>
+ <li><m:a href="examples-runnable-jars">Finding runnable JARs</m:a></li>
+ <li><m:a href="examples-xhtml-filesystem-xpath">Collecting statistics from XHTML pages</m:a></li>
+ </ul>
+
+ <h2>Backward incompatible changes</h2>
+
+ <p>
+ The data format has changed: SLEB128 is now used for encoding numbers.
+ If the data format was used only on-the-fly, no additional steps are required during the upgrade.
+ If the data format was used for persistence (streams redirected to files), the recommended upgrade procedure is:
+ convert the files to XML using the old version of <code>relpipe-out-xml</code> and then convert them from XML back using the new version of <code>relpipe-in-xml</code>.
+ </p>
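+ <p>
+ A sketch of that procedure, assuming <code>old/</code> and <code>new/</code> contain the previous and the current installation:
+ </p>
+ ```shell
+ # decode persisted data with the OLD tools…
+ cat data.rp | old/relpipe-out-xml > data.xml
+ # …and re-encode it with the NEW ones
+ cat data.xml | new/relpipe-in-xml > data-new.rp
+ ```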
+
+ <h2>Installation</h2>
+
+ <p>
+ Installation was tested on Debian GNU/Linux 10.2.
+ The process should be similar on other distributions.
+ </p>
+
+ <m:pre src="examples/release-v0.15.sh" jazyk="bash" odkaz="ano"/>
+
+ <p>
+ <m:name/> are modular, thus you can download and install only the parts you need (the libraries are always needed).
+ The tools <code>out-gui.qt</code> and <code>tr-python</code> require additional libraries and are not built by default.
+ </p>
+
+ </text>
+
+</stránka>
\ No newline at end of file
--- a/relpipe-data/roadmap.xml Thu Jan 30 14:57:09 2020 +0100
+++ b/relpipe-data/roadmap.xml Mon Feb 03 22:10:07 2020 +0100
@@ -62,6 +62,7 @@
<li>verify the format from the performance point of view</li>
<li>improve parsing (corrupted input may currently lead to huge memory allocations), more fuzzing</li>
<li>code clean-up and refactoring, move some reusable parts to common libraries</li>
+ <li>test the build with another compiler and tune the code</li>
<li>pkg-config: version numbers, debug vs. release</li>
<li>packaging for Guix SD and .deb and .rpm distributions, Snapcraft, Flatpak etc.</li>
</ul>