Binary file relpipe-data/img/streamlets-preview.png has changed
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/streamlets-preview.xml Fri Jan 17 19:56:22 2020 +0100
@@ -0,0 +1,98 @@
+<stránka
+ xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+ xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+
+ <nadpis>Streamlets preview</nadpis>
+ <perex>an early example of streamlets in relpipe-in-filesystem</perex>
+
+ <text xmlns="http://www.w3.org/1999/xhtml">
+
+ <p>
+ <em>This is an early preview published on 2020-01-17, before the v0.15 release.</em>
+ </p>
+
+ <p>
+ First prepare some files:
+ </p>
+
+ <m:pre jazyk="shell"><![CDATA[$ wget --xattr https://upload.wikimedia.org/wikipedia/commons/d/d4/HURD_Live_CD.png
+$ wget --xattr https://sane-software.globalcode.info/v_0/ssm.en.pdf
+$ wget --xattr https://alt2xml.globalcode.info/sql-api_alt2xml_talk_2014.pdf
+
+$ ls -1
+HURD_Live_CD.png
+search.sh
+sql-api_alt2xml_talk_2014.pdf
+ssm.en.pdf]]></m:pre>
+
+ <p>
+ Collect metadata (file path, extended attributes, image size, number of PDF pages, number of text lines, text recognized by OCR in images and plain text extracted from PDF files),
+ filter the results (restriction), select only certain attributes (projection)
+ and format the result as a table:
+ </p>
+
+ <m:pre jazyk="shell"><![CDATA[find -print0 \
+ | relpipe-in-filesystem \
+ --file path \
+ --xattr xdg.origin.url --as 'url' \
+ --streamlet exiftool \
+ --option 'attribute' 'PNG:ImageWidth' --as 'width' \
+ --option 'attribute' 'PNG:ImageHeight' --as 'height' \
+ --option 'attribute' 'PDF:PageCount' --as 'page_count' \
+ --streamlet lines_count \
+ --streamlet tesseract \
+ --option 'language' 'eng' \
+ --as 'ocr_text' \
+ --streamlet pdftotext --as 'pdf_text' \
+ | relpipe-tr-awk \
+ --relation filesystem \
+ --where 'path ~ /\.sh$/ || url ~ /alt2xml\.globalcode\.info/ || ocr_text ~ /GNU/ || pdf_text ~ /Sane/' \
+ | relpipe-tr-cut filesystem 'path|url|width|height|page_count|lines_count' \
+ | relpipe-out-tabular
+
+# if too wide, add: | less -RSi]]></m:pre>
+
+ <p>
+ This will print:
+ </p>
+
+ <m:pre jazyk="text"><![CDATA[filesystem:
+ ╭─────────────────────────────────┬──────────────────────────────────────────────────────────────────────┬────────────────┬─────────────────┬─────────────────────┬───────────────────────╮
+ │ path (string) │ url (string) │ width (string) │ height (string) │ page_count (string) │ lines_count (integer) │
+ ├─────────────────────────────────┼──────────────────────────────────────────────────────────────────────┼────────────────┼─────────────────┼─────────────────────┼───────────────────────┤
+ │ ./HURD_Live_CD.png │ https://upload.wikimedia.org/wikipedia/commons/d/d4/HURD_Live_CD.png │ 720 │ 400 │ │ 8 │
+ │ ./ssm.en.pdf │ https://sane-software.globalcode.info/v_0/ssm.en.pdf │ │ │ 6 │ 568 │
+ │ ./sql-api_alt2xml_talk_2014.pdf │ https://alt2xml.globalcode.info/sql-api_alt2xml_talk_2014.pdf │ │ │ 21 │ 696 │
+ │ ./search.sh │ │ │ │ │ 21 │
+ ╰─────────────────────────────────┴──────────────────────────────────────────────────────────────────────┴────────────────┴─────────────────┴─────────────────────┴───────────────────────╯
+Record count: 4]]></m:pre>
+
+ <p>
+ How it looks in the terminal:
+ </p>
+
+ <m:img src="img/streamlets-preview.png"/>
+
+ <p>
+ OCR, PDF text extraction and the other metadata extractions are done on the fly in the pipeline.
+ The OCR in particular may take some time, so in such cases it is usually better to break the pipeline in the middle,
+ redirect the intermediate result to a file (which serves as an index or cache) and then reuse it multiple times
+ (just <code>cat</code> the file and continue with the rest of the original pipeline; multiple files can simply be concatenated, because the format is designed for such use).
+ In most cases, however, this is not necessary and we work with live data.
+ </p>
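+
+ <p>
+ As a rough sketch of that caching approach, the pipeline could be split as follows.
+ The file name <code>../streamlets-index.rp</code> is only an assumed example; it is placed outside the scanned directory so that <code>find</code> does not pick it up.
+ The options are the same as in the pipeline above.
+ </p>
+
+ <m:pre jazyk="shell"><![CDATA[# step 1: run the slow extraction once and store the relational data in a file
+# (hypothetical file name, kept outside the directory being scanned)
+find -print0 \
+	| relpipe-in-filesystem \
+		--file path \
+		--xattr xdg.origin.url --as 'url' \
+		--streamlet tesseract \
+			--option 'language' 'eng' \
+			--as 'ocr_text' \
+		--streamlet pdftotext --as 'pdf_text' \
+	> ../streamlets-index.rp
+
+# step 2: reuse the stored data as often as needed, without repeating the OCR
+cat ../streamlets-index.rp \
+	| relpipe-tr-awk \
+		--relation filesystem \
+		--where 'ocr_text ~ /GNU/ || pdf_text ~ /Sane/' \
+	| relpipe-tr-cut filesystem 'path|url' \
+	| relpipe-out-tabular]]></m:pre>
+
+ <p>
+ The second step can then be repeated with different <code>--where</code> conditions while the OCR runs only once.
+ </p>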
+
+ <p>
+ Please note that this feature is very new: it has not been released yet and is available only in the Mercurial repository.
+ The streamlets used can be seen here: <a href="https://hg.globalcode.info/relpipe/relpipe-in-filesystem.cpp/file/tip/streamlet-examples">streamlet-examples</a>.
+ Even the upcoming v0.15 release is still a development version (it will work, but the API might change in the future, until we release v1.0, which will be stable and production-ready).
+ </p>
+
+ <p>
+ Regarding performance:
+ currently, processing is parallelized only over attributes (each streamlet instance runs in a separate process).
+ In v0.15, it will also be parallelized over records (files in this case).
+ </p>
+
+ </text>
+
+</stránka>
\ No newline at end of file