relpipe-data/examples-in-filesystem-tr-sql-indexing.xml
author František Kučera <franta-hg@frantovo.cz>
Wed, 30 Oct 2019 00:34:58 +0100
branchv_0
changeset 281 0b6b1781a0a5
child 297 192b0059a6c4
permissions -rw-r--r--
examples: Indexing and searching the filesystem

<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Indexing and searching the filesystem</nadpis>
	<perex>build an index of the filesystem and search it faster or offline using SQL</perex>
	<m:pořadí-příkladu>03500</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			Thanks to the <code>relpipe-in-filesystem</code> we can collect metadata (or even the file contents)
			and store them for later use in an index file.
			Such index is useful for faster access and for offline work (we can index e.g. an optical disc or external or network HDD).
		</p>
		
		<p>
			We can simply pipe the relational data into a file and use this file as the index.
			Or we can use some other format. In this example, we will use an SQLite file as the index.
		</p>
		
		<p>
			First step is to collect the file metadata. We will index just a subset of our filesystem,
			the <code>/bin/</code> and <code>/usr/bin/</code> directories:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[find /bin/ /usr/bin/ -print0 \
	| relpipe-in-filesystem --relation "program" \
	| relpipe-tr-sql --file bin.sqlite --file-keep true]]></m:pre>
	
		<p>
			This index allows us to do fast searches and various analysis.
			We can e.g. find 20 largest binaries:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
	--file bin.sqlite \
	--relation "largest" \
		"SELECT path, size FROM program WHERE type = 'f' ORDER BY size DESC LIMIT 20" \
	| relpipe-out-tabular]]></m:pre>
	
		<p>How very:</p>
		
		<m:pre jazyk="text"><![CDATA[largest:
 ╭──────────────────────────────┬───────────────╮
 │ path                (string) │ size (string) │
 ├──────────────────────────────┼───────────────┤
 │ /usr/bin/blender             │ 76975440      │
 │ /usr/bin/blenderplayer       │ 32199344      │
 │ /usr/bin/mscore              │ 24252992      │
 │ /usr/bin/mysql_embedded      │ 23004600      │
 │ /usr/bin/node                │ 18369616      │
 │ /usr/bin/galax-parse         │ 18365264      │
 │ /usr/bin/galax-run           │ 18360496      │
 │ /usr/bin/clementine          │ 16818328      │
 │ /usr/bin/emacs25-nox         │ 15055112      │
 │ /usr/bin/doxygen             │ 14924104      │
 │ /usr/bin/rosegarden          │ 14416952      │
 │ /usr/bin/snap                │ 13472520      │
 │ /usr/bin/audacity            │ 13257064      │
 │ /usr/bin/pgadmin3            │ 13098800      │
 │ /usr/bin/qemu-system-aarch64 │ 12564688      │
 │ /usr/bin/qemu-system-arm     │ 12370192      │
 │ /usr/bin/qemu-system-ppc64   │ 12280864      │
 │ /usr/bin/qemu-system-ppc     │ 11738208      │
 │ /usr/bin/qemu-system-x86_64  │ 11658464      │
 │ /usr/bin/qemu-system-i386    │ 11623776      │
 ╰──────────────────────────────┴───────────────╯
Record count: 20]]></m:pre>

		<p>
			And we can collect additional metadata and append them to our index file.
			In this example, we get lists of dynamically linked libraries using the <code>ldd</code> tool
			for each binary and store the lists in our index:
		</p>

		<m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
		--file bin.sqlite \
		--relation bin "SELECT path FROM program WHERE type = 'f'" \
	| relpipe-out-nullbyte \
	| while read_nullbyte f; do 
		ldd "$f" | perl -ne 'if (/ => (.*) \(/) { print "$ENV{f},$1\n"; }';
	  done \
	| relpipe-in-csv \
		"dependency" \
			"program" string \
			"library" string \
	| relpipe-tr-sql --file bin.sqlite]]></m:pre>
	
		<p>And then we can make a „popularity contest“ and find 20 most often used libraries:</p>
		
		<m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
	--file bin.sqlite \
	--relation "popular_libraries" "
		SELECT 
			d.library, 
			count(*) AS count 
		FROM dependency AS d 
		JOIN program AS p ON (d.program = p.path) 
		GROUP BY library
		ORDER BY count DESC
		LIMIT 20" \
	| relpipe-out-tabular]]></m:pre>
		
		<p>Well, well… here we are:</p>
		
		
		<m:pre jazyk="bash"><![CDATA[popular_libraries:
 ╭────────────────────────────────────────────┬────────────────╮
 │ library                           (string) │ count (string) │
 ├────────────────────────────────────────────┼────────────────┤
 │ /lib/x86_64-linux-gnu/libc.so.6            │ 2508           │
 │ /lib/x86_64-linux-gnu/libpthread.so.0      │ 1487           │
 │ /lib/x86_64-linux-gnu/libdl.so.2           │ 1364           │
 │ /lib/x86_64-linux-gnu/libm.so.6            │ 1271           │
 │ /lib/x86_64-linux-gnu/librt.so.1           │ 1057           │
 │ /lib/x86_64-linux-gnu/libz.so.1            │ 1019           │
 │ /lib/x86_64-linux-gnu/libgcc_s.so.1        │ 811            │
 │ /lib/x86_64-linux-gnu/libpcre.so.3         │ 788            │
 │ /lib/x86_64-linux-gnu/liblzma.so.5         │ 749            │
 │ /usr/lib/x86_64-linux-gnu/libstdc++.so.6   │ 742            │
 │ /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0 │ 681            │
 │ /lib/x86_64-linux-gnu/libbsd.so.0          │ 658            │
 │ /usr/lib/x86_64-linux-gnu/libXau.so.6      │ 648            │
 │ /usr/lib/x86_64-linux-gnu/libXdmcp.so.6    │ 648            │
 │ /usr/lib/x86_64-linux-gnu/libxcb.so.1      │ 648            │
 │ /usr/lib/x86_64-linux-gnu/libX11.so.6      │ 638            │
 │ /usr/lib/x86_64-linux-gnu/libpng16.so.16   │ 622            │
 │ /lib/x86_64-linux-gnu/libgpg-error.so.0    │ 616            │
 │ /lib/x86_64-linux-gnu/libgcrypt.so.20      │ 613            │
 │ /usr/lib/x86_64-linux-gnu/liblz4.so.1      │ 575            │
 ╰────────────────────────────────────────────┴────────────────╯
Record count: 20]]></m:pre>

		<p>
			In future versions there might be an option to gather more file metadata like hashes, Exif etc.
			But even in the current version, it is possible to gather any literally metadata using a custom script (as we have shown with <code>ldd</code> above).
			Extended attributes are already supported (the <code>--xattr</code> option).
		</p>
		
	</text>

</stránka>