--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-in-filesystem-tr-sql-indexing.xml Wed Oct 30 00:34:58 2019 +0100
@@ -0,0 +1,143 @@
+<stránka
+ xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+ xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+
+ <nadpis>Indexing and searching the filesystem</nadpis>
+ <perex>build an index of the filesystem and search it faster or offline using SQL</perex>
+ <m:pořadí-příkladu>03500</m:pořadí-příkladu>
+
+ <text xmlns="http://www.w3.org/1999/xhtml">
+
+ <p>
+ Thanks to the <code>relpipe-in-filesystem</code> we can collect metadata (or even the file contents)
+ and store them for later use in an index file.
+ Such index is useful for faster access and for offline work (we can index e.g. an optical disc or external or network HDD).
+ </p>
+
+ <p>
+ We can simply pipe the relational data into a file and use this file as the index.
+ Or we can use some other format. In this example, we will use an SQLite file as the index.
+ </p>
+
+ <p>
+ First step is to collect the file metadata. We will index just a subset of our filesystem,
+ the <code>/bin/</code> and <code>/usr/bin/</code> directories:
+ </p>
+
+ <m:pre jazyk="bash"><![CDATA[find /bin/ /usr/bin/ -print0 \
+ | relpipe-in-filesystem --relation "program" \
+ | relpipe-tr-sql --file bin.sqlite --file-keep true]]></m:pre>
+
+ <p>
+ This index allows us to do fast searches and various analysis.
+ We can e.g. find 20 largest binaries:
+ </p>
+
+ <m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
+ --file bin.sqlite \
+ --relation "largest" \
+ "SELECT path, size FROM program WHERE type = 'f' ORDER BY size DESC LIMIT 20" \
+ | relpipe-out-tabular]]></m:pre>
+
+ <p>How very:</p>
+
+ <m:pre jazyk="text"><![CDATA[largest:
+ ╭──────────────────────────────┬───────────────╮
+ │ path (string) │ size (string) │
+ ├──────────────────────────────┼───────────────┤
+ │ /usr/bin/blender │ 76975440 │
+ │ /usr/bin/blenderplayer │ 32199344 │
+ │ /usr/bin/mscore │ 24252992 │
+ │ /usr/bin/mysql_embedded │ 23004600 │
+ │ /usr/bin/node │ 18369616 │
+ │ /usr/bin/galax-parse │ 18365264 │
+ │ /usr/bin/galax-run │ 18360496 │
+ │ /usr/bin/clementine │ 16818328 │
+ │ /usr/bin/emacs25-nox │ 15055112 │
+ │ /usr/bin/doxygen │ 14924104 │
+ │ /usr/bin/rosegarden │ 14416952 │
+ │ /usr/bin/snap │ 13472520 │
+ │ /usr/bin/audacity │ 13257064 │
+ │ /usr/bin/pgadmin3 │ 13098800 │
+ │ /usr/bin/qemu-system-aarch64 │ 12564688 │
+ │ /usr/bin/qemu-system-arm │ 12370192 │
+ │ /usr/bin/qemu-system-ppc64 │ 12280864 │
+ │ /usr/bin/qemu-system-ppc │ 11738208 │
+ │ /usr/bin/qemu-system-x86_64 │ 11658464 │
+ │ /usr/bin/qemu-system-i386 │ 11623776 │
+ ╰──────────────────────────────┴───────────────╯
+Record count: 20]]></m:pre>
+
+ <p>
+ And we can collect additional metadata and append them to our index file.
+ In this example, we get lists of dynamically linked libraries using the <code>ldd</code> tool
+ for each binary and store the lists in our index:
+ </p>
+
+ <m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
+ --file bin.sqlite \
+ --relation bin "SELECT path FROM program WHERE type = 'f'" \
+ | relpipe-out-nullbyte \
+ | while read_nullbyte f; do
+ ldd "$f" | perl -ne 'if (/ => (.*) \(/) { print "$ENV{f},$1\n"; }';
+ done \
+ | relpipe-in-csv \
+ "dependency" \
+ "program" string \
+ "library" string \
+ | relpipe-tr-sql --file bin.sqlite]]></m:pre>
+
+ <p>And then we can make a „popularity contest“ and find 20 most often used libraries:</p>
+
+ <m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
+ --file bin.sqlite \
+ --relation "popular_libraries" "
+ SELECT
+ d.library,
+ count(*) AS count
+ FROM dependency AS d
+ JOIN program AS p ON (d.program = p.path)
+ GROUP BY library
+ ORDER BY count DESC
+ LIMIT 20" \
+ | relpipe-out-tabular]]></m:pre>
+
+ <p>Well, well… here we are:</p>
+
+
+ <m:pre jazyk="bash"><![CDATA[popular_libraries:
+ ╭────────────────────────────────────────────┬────────────────╮
+ │ library (string) │ count (string) │
+ ├────────────────────────────────────────────┼────────────────┤
+ │ /lib/x86_64-linux-gnu/libc.so.6 │ 2508 │
+ │ /lib/x86_64-linux-gnu/libpthread.so.0 │ 1487 │
+ │ /lib/x86_64-linux-gnu/libdl.so.2 │ 1364 │
+ │ /lib/x86_64-linux-gnu/libm.so.6 │ 1271 │
+ │ /lib/x86_64-linux-gnu/librt.so.1 │ 1057 │
+ │ /lib/x86_64-linux-gnu/libz.so.1 │ 1019 │
+ │ /lib/x86_64-linux-gnu/libgcc_s.so.1 │ 811 │
+ │ /lib/x86_64-linux-gnu/libpcre.so.3 │ 788 │
+ │ /lib/x86_64-linux-gnu/liblzma.so.5 │ 749 │
+ │ /usr/lib/x86_64-linux-gnu/libstdc++.so.6 │ 742 │
+ │ /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0 │ 681 │
+ │ /lib/x86_64-linux-gnu/libbsd.so.0 │ 658 │
+ │ /usr/lib/x86_64-linux-gnu/libXau.so.6 │ 648 │
+ │ /usr/lib/x86_64-linux-gnu/libXdmcp.so.6 │ 648 │
+ │ /usr/lib/x86_64-linux-gnu/libxcb.so.1 │ 648 │
+ │ /usr/lib/x86_64-linux-gnu/libX11.so.6 │ 638 │
+ │ /usr/lib/x86_64-linux-gnu/libpng16.so.16 │ 622 │
+ │ /lib/x86_64-linux-gnu/libgpg-error.so.0 │ 616 │
+ │ /lib/x86_64-linux-gnu/libgcrypt.so.20 │ 613 │
+ │ /usr/lib/x86_64-linux-gnu/liblz4.so.1 │ 575 │
+ ╰────────────────────────────────────────────┴────────────────╯
+Record count: 20]]></m:pre>
+
+ <p>
+ In future versions there might be an option to gather more file metadata like hashes, Exif etc.
+ But even in the current version, it is possible to gather any literally metadata using a custom script (as we have shown with <code>ldd</code> above).
+ Extended attributes are already supported (the <code>--xattr</code> option).
+ </p>
+
+ </text>
+
+</stránka>