relpipe/relpipe-web: relpipe-data/examples-in-filesystem-tr-sql-indexing.xml@192b0059a6c4


<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Indexing and searching the filesystem</nadpis>
	<perex>build an index of the filesystem and search it faster or offline using SQL</perex>
	<m:pořadí-příkladu>03500</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			Thanks to the <code>relpipe-in-filesystem</code> we can collect metadata (or even the file contents)
			and store them for later use in an index file.
			Such index is useful for faster access and for offline work (we can index e.g. an optical disc or external or network HDD).
		</p>
		
		<p>
			We can simply pipe the relational data into a file and use this file as the index.
			Or we can use some other format. In this example, we will use an SQLite file as the index.
		</p>
		
		<p>
			First step is to collect the file metadata. We will index just a subset of our filesystem,
			the <code>/bin/</code> and <code>/usr/bin/</code> directories:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[find /bin/ /usr/bin/ -print0 \
	| relpipe-in-filesystem --relation "program" \
	| relpipe-tr-sql --data-source-string 'Driver=SQLite3;Database=file:bin.sqlite']]></m:pre>
	
		<p>
			This index allows us to do fast searches and various analysis.
			We can e.g. find 20 largest binaries:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
	--data-source-string 'Driver=SQLite3;Database=file:bin.sqlite' \
	--relation "largest" \
		"SELECT path, size FROM program WHERE type = 'f' ORDER BY size DESC LIMIT 20" \
	| relpipe-out-tabular]]></m:pre>
	
		<p>How very:</p>
		
		<m:pre jazyk="text"><![CDATA[largest:
 ╭──────────────────────────────┬───────────────╮
 │ path                (string) │ size (string) │
 ├──────────────────────────────┼───────────────┤
 │ /usr/bin/blender             │ 76975440      │
 │ /usr/bin/blenderplayer       │ 32199344      │
 │ /usr/bin/mscore              │ 24252992      │
 │ /usr/bin/mysql_embedded      │ 23004600      │
 │ /usr/bin/node                │ 18369616      │
 │ /usr/bin/galax-parse         │ 18365264      │
 │ /usr/bin/galax-run           │ 18360496      │
 │ /usr/bin/clementine          │ 16818328      │
 │ /usr/bin/emacs25-nox         │ 15055112      │
 │ /usr/bin/doxygen             │ 14924104      │
 │ /usr/bin/rosegarden          │ 14416952      │
 │ /usr/bin/snap                │ 13472520      │
 │ /usr/bin/audacity            │ 13257064      │
 │ /usr/bin/pgadmin3            │ 13098800      │
 │ /usr/bin/qemu-system-aarch64 │ 12564688      │
 │ /usr/bin/qemu-system-arm     │ 12370192      │
 │ /usr/bin/qemu-system-ppc64   │ 12280864      │
 │ /usr/bin/qemu-system-ppc     │ 11738208      │
 │ /usr/bin/qemu-system-x86_64  │ 11658464      │
 │ /usr/bin/qemu-system-i386    │ 11623776      │
 ╰──────────────────────────────┴───────────────╯
Record count: 20]]></m:pre>

		<p>
			And we can collect additional metadata and append them to our index file.
			In this example, we get lists of dynamically linked libraries using the <code>ldd</code> tool
			for each binary and store the lists in our index:
		</p>

		<m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
		--data-source-string 'Driver=SQLite3;Database=file:bin.sqlite' \
		--relation bin "SELECT path FROM program WHERE type = 'f'" \
	| relpipe-out-nullbyte \
	| while read_nullbyte f; do 
		ldd "$f" | perl -ne 'if (/ => (.*) \(/) { print "$ENV{f},$1\n"; }';
	  done \
	| relpipe-in-csv \
		"dependency" \
			"program" string \
			"library" string \
	| relpipe-tr-sql --data-source-string 'Driver=SQLite3;Database=file:bin.sqlite']]></m:pre>
	
		<p>And then we can make a „popularity contest“ and find 20 most often used libraries:</p>
		
		<m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
	--data-source-string 'Driver=SQLite3;Database=file:bin.sqlite' \
	--relation "popular_libraries" "
		SELECT 
			d.library, 
			count(*) AS count 
		FROM dependency AS d 
		JOIN program AS p ON (d.program = p.path) 
		GROUP BY library
		ORDER BY count DESC
		LIMIT 20" \
	| relpipe-out-tabular]]></m:pre>
		
		<p>Well, well… here we are:</p>
		
		
		<m:pre jazyk="bash"><![CDATA[popular_libraries:
 ╭────────────────────────────────────────────┬────────────────╮
 │ library                           (string) │ count (string) │
 ├────────────────────────────────────────────┼────────────────┤
 │ /lib/x86_64-linux-gnu/libc.so.6            │ 2508           │
 │ /lib/x86_64-linux-gnu/libpthread.so.0      │ 1487           │
 │ /lib/x86_64-linux-gnu/libdl.so.2           │ 1364           │
 │ /lib/x86_64-linux-gnu/libm.so.6            │ 1271           │
 │ /lib/x86_64-linux-gnu/librt.so.1           │ 1057           │
 │ /lib/x86_64-linux-gnu/libz.so.1            │ 1019           │
 │ /lib/x86_64-linux-gnu/libgcc_s.so.1        │ 811            │
 │ /lib/x86_64-linux-gnu/libpcre.so.3         │ 788            │
 │ /lib/x86_64-linux-gnu/liblzma.so.5         │ 749            │
 │ /usr/lib/x86_64-linux-gnu/libstdc++.so.6   │ 742            │
 │ /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0 │ 681            │
 │ /lib/x86_64-linux-gnu/libbsd.so.0          │ 658            │
 │ /usr/lib/x86_64-linux-gnu/libXau.so.6      │ 648            │
 │ /usr/lib/x86_64-linux-gnu/libXdmcp.so.6    │ 648            │
 │ /usr/lib/x86_64-linux-gnu/libxcb.so.1      │ 648            │
 │ /usr/lib/x86_64-linux-gnu/libX11.so.6      │ 638            │
 │ /usr/lib/x86_64-linux-gnu/libpng16.so.16   │ 622            │
 │ /lib/x86_64-linux-gnu/libgpg-error.so.0    │ 616            │
 │ /lib/x86_64-linux-gnu/libgcrypt.so.20      │ 613            │
 │ /usr/lib/x86_64-linux-gnu/liblz4.so.1      │ 575            │
 ╰────────────────────────────────────────────┴────────────────╯
Record count: 20]]></m:pre>

		<p>
			In future versions there might be an option to gather more file metadata like hashes, Exif etc.
			But even in the current version, it is possible to gather any literally metadata using a custom script (as we have shown with <code>ldd</code> above).
			Extended attributes are already supported (the <code>--xattr</code> option).
		</p>
		
		<p>
			n.b. if we use a database frequently it is convenient to configure it as a data source in the <code>~/.odbc.ini</code> file
			– and then connect to it using the <code>--data-source-name</code> option and its name.
		</p>
		
	</text>

</stránka>
author	František Kučera <franta-hg@frantovo.cz>
	Sat, 06 Jun 2020 01:57:24 +0200
branch	v_0
changeset 297	192b0059a6c4
parent 281	0b6b1781a0a5
permissions	-rw-r--r--