<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

	<nadpis>Streamlets preview</nadpis>
	<perex>an early example of streamlets in relpipe-in-filesystem</perex>

	<text xmlns="http://www.w3.org/1999/xhtml">

		<p>
			<em>This is an early preview published on 2020-01-17, before the v0.15 release.</em>
		</p>

		<p>
			First prepare some files:
		</p>

		<m:pre jazyk="shell"><![CDATA[$ wget --xattr https://upload.wikimedia.org/wikipedia/commons/d/d4/HURD_Live_CD.png
$ wget --xattr https://sane-software.globalcode.info/v_0/ssm.en.pdf
$ wget --xattr https://alt2xml.globalcode.info/sql-api_alt2xml_talk_2014.pdf

$ ls -1
HURD_Live_CD.png
search.sh
sql-api_alt2xml_talk_2014.pdf
ssm.en.pdf]]></m:pre>

		<p>
			Collect metadata (file path, extended attributes, image size, number of PDF pages, number of text lines, text recognized by OCR in images and plain text extracted from PDF files),
			filter the results (restriction), select only certain attributes (projection)
			and format the result as a table:
		</p>

		<m:pre jazyk="shell"><![CDATA[find -print0 \
	| relpipe-in-filesystem \
		--file path \
		--xattr xdg.origin.url --as 'url' \
		--streamlet exiftool \
			--option 'attribute' 'PNG:ImageWidth' --as 'width' \
			--option 'attribute' 'PNG:ImageHeight' --as 'height' \
			--option 'attribute' 'PDF:PageCount' --as 'page_count' \
		--streamlet lines_count \
		--streamlet tesseract \
			--option 'language' 'eng' \
			--as 'ocr_text' \
		--streamlet pdftotext --as 'pdf_text' \
	| relpipe-tr-awk \
		--relation filesystem \
		--where 'path ~ /\.sh$/ || url ~ /alt2xml\.globalcode\.info/ || ocr_text ~ /GNU/ || pdf_text ~ /Sane/' \
	| relpipe-tr-cut filesystem 'path|url|width|height|page_count|lines_count' \
	| relpipe-out-tabular

# if too wide, add: | less -RSi]]></m:pre>

		<p>
			This prints:
		</p>

		<m:pre jazyk="text"><![CDATA[filesystem:
 ╭─────────────────────────────────┬──────────────────────────────────────────────────────────────────────┬────────────────┬─────────────────┬─────────────────────┬───────────────────────╮
 │ path (string)                   │ url (string)                                                         │ width (string) │ height (string) │ page_count (string) │ lines_count (integer) │
 ├─────────────────────────────────┼──────────────────────────────────────────────────────────────────────┼────────────────┼─────────────────┼─────────────────────┼───────────────────────┤
 │ ./HURD_Live_CD.png              │ https://upload.wikimedia.org/wikipedia/commons/d/d4/HURD_Live_CD.png │ 720            │ 400             │                     │                     8 │
 │ ./ssm.en.pdf                    │ https://sane-software.globalcode.info/v_0/ssm.en.pdf                 │                │                 │ 6                   │                   568 │
 │ ./sql-api_alt2xml_talk_2014.pdf │ https://alt2xml.globalcode.info/sql-api_alt2xml_talk_2014.pdf        │                │                 │ 21                  │                   696 │
 │ ./search.sh                     │                                                                      │                │                 │                     │                    21 │
 ╰─────────────────────────────────┴──────────────────────────────────────────────────────────────────────┴────────────────┴─────────────────┴─────────────────────┴───────────────────────╯
Record count: 4]]></m:pre>

		<p>
			How it looks in the terminal:
		</p>

		<m:img src="img/streamlets-preview.png"/>

		<p>
			OCR, PDF text extraction and the other metadata extractions are done on the fly in the pipeline.
			OCR in particular may take some time, so in such cases it is usually better to break the pipeline in the middle,
			redirect the intermediate result to a file (which then serves as an index or cache) and reuse it multiple times
			(just <code>cat</code> the file and continue with the rest of the original pipeline; multiple such files can simply be concatenated, the format is designed for this use).
			In most cases, however, this is not necessary and we work with live data.
		</p>
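
		<p>
			A minimal sketch of this caching approach, reusing only commands and options from the example above.
			The cache path <code>/tmp/filesystem-index.rp</code> and the function names <code>build_index</code> and <code>query_index</code> are hypothetical, chosen just for this illustration;
			the cache is kept outside the scanned directory so that <code>find</code> does not pick it up while it is being written.
		</p>

		<m:pre jazyk="shell"><![CDATA[# Hypothetical cache location (an assumption of this sketch); override via the INDEX variable.
INDEX="${INDEX:-/tmp/filesystem-index.rp}"

# Slow part: collect the metadata (including OCR) once and store the relational stream.
build_index() {
	find . -print0 \
		| relpipe-in-filesystem \
			--file path \
			--xattr xdg.origin.url --as 'url' \
			--streamlet tesseract \
				--option 'language' 'eng' \
				--as 'ocr_text' \
			--streamlet pdftotext --as 'pdf_text' \
		> "$INDEX"
}

# Fast part: query the stored stream as often as needed.
# $1 is the condition passed to --where.
query_index() {
	cat "$INDEX" \
		| relpipe-tr-awk \
			--relation filesystem \
			--where "$1" \
		| relpipe-tr-cut filesystem 'path|url' \
		| relpipe-out-tabular
}

# Usage:
#   build_index
#   query_index 'ocr_text ~ /GNU/'
#   query_index 'pdf_text ~ /Sane/'
# Several stored streams could be combined by listing more files after cat.]]></m:pre>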

		<p>
			Please note that this is really fresh: it has not been released yet and is available only in the Mercurial repository.
			The streamlets used can be seen here: <a href="https://hg.globalcode.info/relpipe/relpipe-in-filesystem.cpp/file/tip/streamlet-examples">streamlet-examples</a>.
			Even the upcoming v0.15 release is still a development version (it will work, but the API might change in the future, until we release v1.0, which will be stable and production-ready).
		</p>

		<p>
			Regarding performance:
			currently, processing is parallelized only over attributes (each streamlet instance runs in a separate process).
			In v0.15 it will also be parallelized over records (files in this case).
		</p>

	</text>

</stránka>