<stránka
xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
<nadpis>Streamlets preview</nadpis>
<perex>an early example of streamlets in relpipe-in-filesystem</perex>
<text xmlns="http://www.w3.org/1999/xhtml">
<p>
<em>This is an early preview, published on 2020-01-17, before the v0.15 release.</em>
</p>
<p>
First prepare some files:
</p>
<m:pre jazyk="shell"><![CDATA[$ wget --xattr https://upload.wikimedia.org/wikipedia/commons/d/d4/HURD_Live_CD.png
$ wget --xattr https://sane-software.globalcode.info/v_0/ssm.en.pdf
$ wget --xattr https://alt2xml.globalcode.info/sql-api_alt2xml_talk_2014.pdf

$ ls -1
HURD_Live_CD.png
search.sh
sql-api_alt2xml_talk_2014.pdf
ssm.en.pdf]]></m:pre>
<p>
Collect metadata (file path, extended attributes, image size, number of PDF pages, number of text lines, text recognized in images by OCR, and plain text extracted from PDF files),
filter the results (i.e. perform a restriction), select only certain attributes (a projection),
and format the result as a table:
</p>
<m:pre jazyk="shell"><![CDATA[find -print0 \
| relpipe-in-filesystem \
--file path \
--xattr xdg.origin.url --as 'url' \
--streamlet exiftool \
--option 'attribute' 'PNG:ImageWidth' --as 'width' \
--option 'attribute' 'PNG:ImageHeight' --as 'height' \
--option 'attribute' 'PDF:PageCount' --as 'page_count' \
--streamlet lines_count \
--streamlet tesseract \
--option 'language' 'eng' \
--as 'ocr_text' \
--streamlet pdftotext --as 'pdf_text' \
| relpipe-tr-awk \
--relation filesystem \
--where 'path ~ /\.sh$/ || url ~ /alt2xml\.globalcode\.info/ || ocr_text ~ /GNU/ || pdf_text ~ /Sane/' \
| relpipe-tr-cut filesystem 'path|url|width|height|page_count|lines_count' \
| relpipe-out-tabular

# if too wide, add: | less -RSi]]></m:pre>
<p>
This will print:
</p>
<m:pre jazyk="text"><![CDATA[filesystem:
╭─────────────────────────────────┬──────────────────────────────────────────────────────────────────────┬────────────────┬─────────────────┬─────────────────────┬───────────────────────╮
│ path (string) │ url (string) │ width (string) │ height (string) │ page_count (string) │ lines_count (integer) │
├─────────────────────────────────┼──────────────────────────────────────────────────────────────────────┼────────────────┼─────────────────┼─────────────────────┼───────────────────────┤
│ ./HURD_Live_CD.png │ https://upload.wikimedia.org/wikipedia/commons/d/d4/HURD_Live_CD.png │ 720 │ 400 │ │ 8 │
│ ./ssm.en.pdf │ https://sane-software.globalcode.info/v_0/ssm.en.pdf │ │ │ 6 │ 568 │
│ ./sql-api_alt2xml_talk_2014.pdf │ https://alt2xml.globalcode.info/sql-api_alt2xml_talk_2014.pdf │ │ │ 21 │ 696 │
│ ./search.sh │ │ │ │ │ 21 │
╰─────────────────────────────────┴──────────────────────────────────────────────────────────────────────┴────────────────┴─────────────────┴─────────────────────┴───────────────────────╯
Record count: 4]]></m:pre>
<p>
How it looks in the terminal:
</p>
<m:img src="img/streamlets-preview.png"/>
<p>
OCR and PDF text extraction (as well as the other metadata extraction) is done on-the-fly in the pipeline.
OCR in particular may take some time, so in such cases it is usually better to break the pipe in the middle,
redirect the intermediate result to a file (which serves as an index or cache) and then use it multiple times
(just <code>cat</code> the file and continue with the original pipeline; BTW multiple such files can simply be concatenated – the format is designed for such use).
In most cases, however, this is not necessary and we work with live data.
</p>
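<p>
A sketch of that caching workflow, built from the commands shown above
(the file names <code>files.rp</code> and <code>more-files.rp</code> are just illustrative):
</p>

<m:pre jazyk="shell"><![CDATA[# run the slow extraction once and store the intermediate result in a file:
find -print0 \
	| relpipe-in-filesystem \
		--file path \
		--streamlet tesseract --option 'language' 'eng' --as 'ocr_text' \
	> files.rp

# then query the cached data as many times as needed:
cat files.rp \
	| relpipe-tr-awk --relation filesystem --where 'ocr_text ~ /GNU/' \
	| relpipe-out-tabular

# several such files can simply be concatenated and processed together:
cat files.rp more-files.rp | relpipe-out-tabular]]></m:pre>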
<p>
Please note that this is really fresh: it has not been released yet and can be seen only in the Mercurial repository.
The streamlets used can be seen here: <a href="https://hg.globalcode.info/relpipe/relpipe-in-filesystem.cpp/file/tip/streamlet-examples">streamlet-examples</a>.
And even the upcoming v0.15 release is still a development version (it will work, but the API might change in the future – until we release v1.0, which will be stable and production-ready).
</p>
<p>
Regarding performance:
currently the work is parallelized only over attributes (each streamlet instance runs in a separate process).
In v0.15 it will also be parallelized over records (files in this case).
</p>
</text>
</stránka>