<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

	<nadpis>Indexing and searching the filesystem</nadpis>
	<perex>build an index of the filesystem and search it faster or offline using SQL</perex>
	<m:pořadí-příkladu>03500</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">

		<p>
			Thanks to <code>relpipe-in-filesystem</code>, we can collect file metadata (or even the file contents)
			and store it in an index file for later use.
			Such an index is useful for faster access and for offline work (we can index e.g. an optical disc or an external or network HDD).
		</p>

		<p>
			We can simply pipe the relational data into a file and use this file as the index.
			Or we can use some other format. In this example, we will use an SQLite file as the index.
		</p>

		<p>
			The first step is to collect the file metadata. We will index just a subset of our filesystem,
			the <code>/bin/</code> and <code>/usr/bin/</code> directories:
		</p>

		<m:pre jazyk="bash"><![CDATA[find /bin/ /usr/bin/ -print0 \
	| relpipe-in-filesystem --relation "program" \
	| relpipe-tr-sql --file bin.sqlite --file-keep true]]></m:pre>

		<p>
			This index allows us to do fast searches and various analyses.
			For example, we can find the 20 largest binaries:
		</p>

		<m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
	--file bin.sqlite \
	--relation "largest" \
		"SELECT path, size FROM program WHERE type = 'f' ORDER BY size DESC LIMIT 20" \
	| relpipe-out-tabular]]></m:pre>

		<p>The result:</p>

		<m:pre jazyk="text"><![CDATA[largest:
 ╭──────────────────────────────┬───────────────╮
 │ path                (string) │ size (string) │
 ├──────────────────────────────┼───────────────┤
 │ /usr/bin/blender             │ 76975440      │
 │ /usr/bin/blenderplayer       │ 32199344      │
 │ /usr/bin/mscore              │ 24252992      │
 │ /usr/bin/mysql_embedded      │ 23004600      │
 │ /usr/bin/node                │ 18369616      │
 │ /usr/bin/galax-parse         │ 18365264      │
 │ /usr/bin/galax-run           │ 18360496      │
 │ /usr/bin/clementine          │ 16818328      │
 │ /usr/bin/emacs25-nox         │ 15055112      │
 │ /usr/bin/doxygen             │ 14924104      │
 │ /usr/bin/rosegarden          │ 14416952      │
 │ /usr/bin/snap                │ 13472520      │
 │ /usr/bin/audacity            │ 13257064      │
 │ /usr/bin/pgadmin3            │ 13098800      │
 │ /usr/bin/qemu-system-aarch64 │ 12564688      │
 │ /usr/bin/qemu-system-arm     │ 12370192      │
 │ /usr/bin/qemu-system-ppc64   │ 12280864      │
 │ /usr/bin/qemu-system-ppc     │ 11738208      │
 │ /usr/bin/qemu-system-x86_64  │ 11658464      │
 │ /usr/bin/qemu-system-i386    │ 11623776      │
 ╰──────────────────────────────┴───────────────╯
Record count: 20]]></m:pre>

		<p>
			We can also collect additional metadata and append it to our index file.
			In this example, we use the <code>ldd</code> tool to list the dynamically linked libraries
			of each binary and store these lists in our index:
		</p>

		<m:pre jazyk="bash"><![CDATA[# read_nullbyte is not a shell builtin: it is assumed to be a helper function
# that reads one NUL-terminated value into $f and exports it (perl reads it via $ENV{f})
relpipe-in-sql \
	--file bin.sqlite \
	--relation bin "SELECT path FROM program WHERE type = 'f'" \
	| relpipe-out-nullbyte \
	| while read_nullbyte f; do
		ldd "$f" | perl -ne 'if (/ => (.*) \(/) { print "$ENV{f},$1\n"; }';
	done \
	| relpipe-in-csv \
		"dependency" \
		"program" string \
		"library" string \
	| relpipe-tr-sql --file bin.sqlite]]></m:pre>
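
		<p>
			The loop above relies on <code>read_nullbyte</code>, which is not a shell builtin, and on a Perl one-liner that filters the <code>ldd</code> output.
			The following is only a minimal sketch of both, written under assumptions:
			the <code>read_nullbyte</code> definition is one possible implementation of such a helper (it has to export the variable, because Perl reads it through <code>$ENV{f}</code>),
			<code>ldd_lines_to_csv</code> is a hypothetical pure-Bash equivalent of the Perl filter,
			and <code>sample_ldd_output</code> prints hand-written lines in the usual <code>ldd</code> format instead of real output.
		</p>

		<m:pre jazyk="bash"><![CDATA[#!/usr/bin/env bash

# Assumed helper: read one NUL-terminated value from standard input
# into each named variable and export it, so that child processes
# (such as perl) can see it in their environment.
read_nullbyte() {
	local IFS=
	local v
	for v in "$@"; do
		export "$v"
		read -r -d '' "$v" || return 1
	done
}

# Hypothetical pure-Bash equivalent of the perl filter: for each line of the
# form "name => /path/to/lib (0xADDRESS)" print "program,/path/to/lib".
# Lines without " => " (the vDSO, the dynamic loader) are skipped.
ldd_lines_to_csv() {
	local program=$1 line
	local re=' => (.*) \('
	while IFS= read -r line; do
		if [[ $line =~ $re ]]; then
			printf '%s,%s\n' "$program" "${BASH_REMATCH[1]}"
		fi
	done
}

# Hand-written sample input in the usual ldd format (not real output).
sample_ldd_output() {
	printf '\tlinux-vdso.so.1 (0x00007ffd00000000)\n'
	printf '\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0000000000)\n'
	printf '\t/lib64/ld-linux-x86-64.so.2 (0x00007f0000200000)\n'
}

# Demo: two NUL-terminated paths, one CSV row per resolved library.
printf '/bin/true\0/bin/false\0' | while read_nullbyte f; do
	sample_ldd_output | ldd_lines_to_csv "$f"
done]]></m:pre>

		<p>
			Only the lines that contain <code> => </code> followed by a resolved path produce a CSV row,
			so the demo prints one <code>program,library</code> pair for each of the two sample paths.
		</p>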

		<p>Then we can run a “popularity contest” and find the 20 most frequently used libraries:</p>

		<m:pre jazyk="bash"><![CDATA[relpipe-in-sql \
	--file bin.sqlite \
	--relation "popular_libraries" "
		SELECT
			d.library,
			count(*) AS count
		FROM dependency AS d
			JOIN program AS p ON (d.program = p.path)
		GROUP BY library
		ORDER BY count DESC
		LIMIT 20" \
	| relpipe-out-tabular]]></m:pre>

		<p>Well, well… here we are:</p>

		<m:pre jazyk="text"><![CDATA[popular_libraries:
 ╭────────────────────────────────────────────┬────────────────╮
 │ library                           (string) │ count (string) │
 ├────────────────────────────────────────────┼────────────────┤
 │ /lib/x86_64-linux-gnu/libc.so.6            │ 2508           │
 │ /lib/x86_64-linux-gnu/libpthread.so.0      │ 1487           │
 │ /lib/x86_64-linux-gnu/libdl.so.2           │ 1364           │
 │ /lib/x86_64-linux-gnu/libm.so.6            │ 1271           │
 │ /lib/x86_64-linux-gnu/librt.so.1           │ 1057           │
 │ /lib/x86_64-linux-gnu/libz.so.1            │ 1019           │
 │ /lib/x86_64-linux-gnu/libgcc_s.so.1        │ 811            │
 │ /lib/x86_64-linux-gnu/libpcre.so.3         │ 788            │
 │ /lib/x86_64-linux-gnu/liblzma.so.5         │ 749            │
 │ /usr/lib/x86_64-linux-gnu/libstdc++.so.6   │ 742            │
 │ /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0 │ 681            │
 │ /lib/x86_64-linux-gnu/libbsd.so.0          │ 658            │
 │ /usr/lib/x86_64-linux-gnu/libXau.so.6      │ 648            │
 │ /usr/lib/x86_64-linux-gnu/libXdmcp.so.6    │ 648            │
 │ /usr/lib/x86_64-linux-gnu/libxcb.so.1      │ 648            │
 │ /usr/lib/x86_64-linux-gnu/libX11.so.6      │ 638            │
 │ /usr/lib/x86_64-linux-gnu/libpng16.so.16   │ 622            │
 │ /lib/x86_64-linux-gnu/libgpg-error.so.0    │ 616            │
 │ /lib/x86_64-linux-gnu/libgcrypt.so.20      │ 613            │
 │ /usr/lib/x86_64-linux-gnu/liblz4.so.1      │ 575            │
 ╰────────────────────────────────────────────┴────────────────╯
Record count: 20]]></m:pre>

		<p>
			Future versions might offer an option to gather more file metadata such as hashes, Exif data, etc.
			But even in the current version, it is possible to gather literally any metadata using a custom script (as we have shown with <code>ldd</code> above).
			Extended attributes are already supported (the <code>--xattr</code> option).
		</p>
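
		<p>
			As an illustration of the custom-script approach, the following sketch computes SHA-256 hashes with <code>sha256sum</code> (GNU coreutils)
			and emits CSV rows that could be appended to the index in the same way as the <code>ldd</code> data.
			The function name <code>sha256_csv</code> and the relation and attribute names in the commented pipeline are our assumptions, not something the tools prescribe.
			Like the <code>ldd</code> example, it assumes that the paths contain no commas, quotes, backslashes or newlines.
		</p>

		<m:pre jazyk="bash"><![CDATA[#!/usr/bin/env bash

# Hypothetical helper: read NUL-terminated file paths from standard input
# and print one CSV row "path,sha256" per readable regular file.
sha256_csv() {
	local path hash
	while IFS= read -r -d '' path; do
		[[ -f $path && -r $path ]] || continue
		# sha256sum prints "<hash>  <path>"; keep only the hash
		hash=$(sha256sum -- "$path") || continue
		printf '%s,%s\n' "$path" "${hash%% *}"
	done
}

# Possible use with the index from this example
# (the "hash" relation and its attribute names are assumptions):
#
#   relpipe-in-sql --file bin.sqlite \
#       --relation bin "SELECT path FROM program WHERE type = 'f'" \
#     | relpipe-out-nullbyte \
#     | sha256_csv \
#     | relpipe-in-csv "hash" "program" string "sha256" string \
#     | relpipe-tr-sql --file bin.sqlite

# Small self-contained demo on a temporary file:
demo_file=$(mktemp)
printf 'hello\n' > "$demo_file"
printf '%s\0' "$demo_file" | sha256_csv
rm -f -- "$demo_file"]]></m:pre>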

	</text>

</stránka>