examples: Indexing and searching the filesystem v_0
authorFrantišek Kučera <franta-hg@frantovo.cz>
Wed, 30 Oct 2019 00:34:58 +0100
branchv_0
changeset 281 0b6b1781a0a5
parent 280 eccf2de78284
child 282 ec02133045a3
examples: Indexing and searching the filesystem
relpipe-data/examples-in-filesystem-tr-sql-indexing.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-in-filesystem-tr-sql-indexing.xml	Wed Oct 30 00:34:58 2019 +0100
@@ -0,0 +1,143 @@
+<stránka
+	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+	
+	<nadpis>Indexing and searching the filesystem</nadpis>
+	<perex>build an index of the filesystem and search it faster or offline using SQL</perex>
+	<m:pořadí-příkladu>03500</m:pořadí-příkladu>
+
+	<text xmlns="http://www.w3.org/1999/xhtml">
+		
+		<p>
+			Thanks to the <code>relpipe-in-filesystem</code> we can collect metadata (or even the file contents)
+			and store them for later use in an index file.
+			Such index is useful for faster access and for offline work (we can index e.g. an optical disc or external or network HDD).
+		</p>
+		
+		<p>
+			We can simply pipe the relational data into a file and use this file as the index.
+			Or we can use some other format. In this example, we will use an SQLite file as the index.
+		</p>
+		
+		<p>
+			First step is to collect the file metadata. We will index just a subset of our filesystem,
+			the <code>/bin/</code> and <code>/usr/bin/</code> directories:
+		</p>
+		
+		<m:pre jazyk="bash"><![CDATA[find /bin/ /usr/bin/ -print0 \
+	| relpipe-in-filesystem --relation "program" \
+	| relpipe-tr-sql --file bin.sqlite --file-keep true]]></m:pre>
+	
+		<p>
+			This index allows us to do fast searches and various analysis.
+			We can e.g. find 20 largest binaries:
+		</p>
+		
+		<m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
+	--file bin.sqlite \
+	--relation "largest" \
+		"SELECT path, size FROM program WHERE type = 'f' ORDER BY size DESC LIMIT 20" \
+	| relpipe-out-tabular]]></m:pre>
+	
+		<p>How very:</p>
+		
+		<m:pre jazyk="text"><![CDATA[largest:
+ ╭──────────────────────────────┬───────────────╮
+ │ path                (string) │ size (string) │
+ ├──────────────────────────────┼───────────────┤
+ │ /usr/bin/blender             │ 76975440      │
+ │ /usr/bin/blenderplayer       │ 32199344      │
+ │ /usr/bin/mscore              │ 24252992      │
+ │ /usr/bin/mysql_embedded      │ 23004600      │
+ │ /usr/bin/node                │ 18369616      │
+ │ /usr/bin/galax-parse         │ 18365264      │
+ │ /usr/bin/galax-run           │ 18360496      │
+ │ /usr/bin/clementine          │ 16818328      │
+ │ /usr/bin/emacs25-nox         │ 15055112      │
+ │ /usr/bin/doxygen             │ 14924104      │
+ │ /usr/bin/rosegarden          │ 14416952      │
+ │ /usr/bin/snap                │ 13472520      │
+ │ /usr/bin/audacity            │ 13257064      │
+ │ /usr/bin/pgadmin3            │ 13098800      │
+ │ /usr/bin/qemu-system-aarch64 │ 12564688      │
+ │ /usr/bin/qemu-system-arm     │ 12370192      │
+ │ /usr/bin/qemu-system-ppc64   │ 12280864      │
+ │ /usr/bin/qemu-system-ppc     │ 11738208      │
+ │ /usr/bin/qemu-system-x86_64  │ 11658464      │
+ │ /usr/bin/qemu-system-i386    │ 11623776      │
+ ╰──────────────────────────────┴───────────────╯
+Record count: 20]]></m:pre>
+
+		<p>
+			And we can collect additional metadata and append them to our index file.
+			In this example, we get lists of dynamically linked libraries using the <code>ldd</code> tool
+			for each binary and store the lists in our index:
+		</p>
+
+		<m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
+		--file bin.sqlite \
+		--relation bin "SELECT path FROM program WHERE type = 'f'" \
+	| relpipe-out-nullbyte \
+	| while read_nullbyte f; do 
+		ldd "$f" | perl -ne 'if (/ => (.*) \(/) { print "$ENV{f},$1\n"; }';
+	  done \
+	| relpipe-in-csv \
+		"dependency" \
+			"program" string \
+			"library" string \
+	| relpipe-tr-sql --file bin.sqlite]]></m:pre>
+	
+		<p>And then we can make a „popularity contest“ and find 20 most often used libraries:</p>
+		
+		<m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
+	--file bin.sqlite \
+	--relation "popular_libraries" "
+		SELECT 
+			d.library, 
+			count(*) AS count 
+		FROM dependency AS d 
+		JOIN program AS p ON (d.program = p.path) 
+		GROUP BY library
+		ORDER BY count DESC
+		LIMIT 20" \
+	| relpipe-out-tabular]]></m:pre>
+		
+		<p>Well, well… here we are:</p>
+		
+		
+		<m:pre jazyk="bash"><![CDATA[popular_libraries:
+ ╭────────────────────────────────────────────┬────────────────╮
+ │ library                           (string) │ count (string) │
+ ├────────────────────────────────────────────┼────────────────┤
+ │ /lib/x86_64-linux-gnu/libc.so.6            │ 2508           │
+ │ /lib/x86_64-linux-gnu/libpthread.so.0      │ 1487           │
+ │ /lib/x86_64-linux-gnu/libdl.so.2           │ 1364           │
+ │ /lib/x86_64-linux-gnu/libm.so.6            │ 1271           │
+ │ /lib/x86_64-linux-gnu/librt.so.1           │ 1057           │
+ │ /lib/x86_64-linux-gnu/libz.so.1            │ 1019           │
+ │ /lib/x86_64-linux-gnu/libgcc_s.so.1        │ 811            │
+ │ /lib/x86_64-linux-gnu/libpcre.so.3         │ 788            │
+ │ /lib/x86_64-linux-gnu/liblzma.so.5         │ 749            │
+ │ /usr/lib/x86_64-linux-gnu/libstdc++.so.6   │ 742            │
+ │ /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0 │ 681            │
+ │ /lib/x86_64-linux-gnu/libbsd.so.0          │ 658            │
+ │ /usr/lib/x86_64-linux-gnu/libXau.so.6      │ 648            │
+ │ /usr/lib/x86_64-linux-gnu/libXdmcp.so.6    │ 648            │
+ │ /usr/lib/x86_64-linux-gnu/libxcb.so.1      │ 648            │
+ │ /usr/lib/x86_64-linux-gnu/libX11.so.6      │ 638            │
+ │ /usr/lib/x86_64-linux-gnu/libpng16.so.16   │ 622            │
+ │ /lib/x86_64-linux-gnu/libgpg-error.so.0    │ 616            │
+ │ /lib/x86_64-linux-gnu/libgcrypt.so.20      │ 613            │
+ │ /usr/lib/x86_64-linux-gnu/liblz4.so.1      │ 575            │
+ ╰────────────────────────────────────────────┴────────────────╯
+Record count: 20]]></m:pre>
+
+		<p>
+			In future versions there might be an option to gather more file metadata like hashes, Exif etc.
+			But even in the current version, it is possible to gather any literally metadata using a custom script (as we have shown with <code>ldd</code> above).
+			Extended attributes are already supported (the <code>--xattr</code> option).
+		</p>
+		
+	</text>
+
+</stránka>