relpipe-data/streamlets-preview.xml
branch v_0
changeset 292 c4b4864225de
child 326 ab7f333f1225
       
     1 <stránka
       
     2 	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
       
     3 	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
       
     4 	
       
     5 	<nadpis>Streamlets preview</nadpis>
       
     6 	<perex>an early example of streamlets in relpipe-in-filesystem</perex>
       
     7 
       
     8 	<text xmlns="http://www.w3.org/1999/xhtml">
       
     9 		
       
    10 		<p>
       
     11 			<em>This is an early preview published on 2020-01-17, before the v0.15 release.</em>
       
    12 		</p>
       
    13 		
       
    14 		<p>
       
    15 			First prepare some files:
       
    16 		</p>
       
    17 		
       
    18 		<m:pre jazyk="shell"><![CDATA[$ wget --xattr https://upload.wikimedia.org/wikipedia/commons/d/d4/HURD_Live_CD.png
       
    19 $ wget --xattr https://sane-software.globalcode.info/v_0/ssm.en.pdf
       
    20 $ wget --xattr https://alt2xml.globalcode.info/sql-api_alt2xml_talk_2014.pdf
       
    21 
       
    22 $ ls -1
       
    23 HURD_Live_CD.png
       
    24 search.sh
       
    25 sql-api_alt2xml_talk_2014.pdf
       
    26 ssm.en.pdf]]></m:pre>
       
    27 
       
    28 		<p>
       
     29 			Collect metadata (file path, extended attributes, image size, number of PDF pages, number of text lines, OCR-recognized text extracted from images and plain text extracted from PDF files),
       
     30 			filter the results (restriction), select only certain attributes (projection)
       
     31 			and format the result as a table:
       
    32 		</p>
       
    33 		
       
    34 		<m:pre jazyk="shell"><![CDATA[find -print0 \
       
    35 	| relpipe-in-filesystem \
       
    36 		--file path \
       
    37 		--xattr xdg.origin.url --as 'url' \
       
    38 		--streamlet exiftool \
       
    39 			--option 'attribute' 'PNG:ImageWidth'  --as 'width' \
       
    40 			--option 'attribute' 'PNG:ImageHeight' --as 'height' \
       
    41 			--option 'attribute' 'PDF:PageCount'   --as 'page_count' \
       
    42 		--streamlet lines_count \
       
    43 		--streamlet tesseract \
       
    44 			--option 'language' 'eng' \
       
    45 			--as 'ocr_text' \
       
    46 		--streamlet pdftotext --as 'pdf_text' \
       
    47 	| relpipe-tr-awk \
       
    48 		--relation filesystem \
       
    49 		--where 'path ~ /\.sh$/ || url ~ /alt2xml\.globalcode\.info/ || ocr_text ~ /GNU/ || pdf_text ~ /Sane/' \
       
    50 	| relpipe-tr-cut filesystem 'path|url|width|height|page_count|lines_count' \
       
    51 	| relpipe-out-tabular
       
    52 
       
    53 # if too wide, add: | less -RSi]]></m:pre>
       
    54 
       
    55 		<p>
       
     56 			The pipeline prints:
       
    57 		</p>
       
    58 		
       
    59 		<m:pre jazyk="text"><![CDATA[filesystem:
       
    60  ╭─────────────────────────────────┬──────────────────────────────────────────────────────────────────────┬────────────────┬─────────────────┬─────────────────────┬───────────────────────╮
       
    61  │ path                   (string) │ url                                                         (string) │ width (string) │ height (string) │ page_count (string) │ lines_count (integer) │
       
    62  ├─────────────────────────────────┼──────────────────────────────────────────────────────────────────────┼────────────────┼─────────────────┼─────────────────────┼───────────────────────┤
       
    63  │ ./HURD_Live_CD.png              │ https://upload.wikimedia.org/wikipedia/commons/d/d4/HURD_Live_CD.png │ 720            │ 400             │                     │                     8 │
       
    64  │ ./ssm.en.pdf                    │ https://sane-software.globalcode.info/v_0/ssm.en.pdf                 │                │                 │ 6                   │                   568 │
       
    65  │ ./sql-api_alt2xml_talk_2014.pdf │ https://alt2xml.globalcode.info/sql-api_alt2xml_talk_2014.pdf        │                │                 │ 21                  │                   696 │
       
    66  │ ./search.sh                     │                                                                      │                │                 │                     │                    21 │
       
    67  ╰─────────────────────────────────┴──────────────────────────────────────────────────────────────────────┴────────────────┴─────────────────┴─────────────────────┴───────────────────────╯
       
    68 Record count: 4]]></m:pre>
       
    69 		
       
    70 		<p>
       
    71 			How it looks in the terminal:
       
    72 		</p>
       
    73 		
       
    74 		<m:img src="img/streamlets-preview.png"/>
       
    75 		
       
    76 		<p>
       
     77 			OCR and PDF text extraction (as well as the other metadata extraction) is done on the fly in the pipeline.
       
     78 			OCR in particular may take some time, so in such cases it is usually better to break the pipe in the middle,
       
     79 			redirect the intermediate result to a file (which then serves as an index or cache) and use it multiple times
       
     80 			(just <code>cat</code> the file and continue the original pipeline; by the way, multiple files can simply be concatenated, the format is designed for such use).
       
     81 			In most cases, however, this is not necessary and we work with live data.
       
    82 		</p>
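		
		<p>
			The caching approach described above might look roughly like the following sketch.
			It reuses only the commands and options from the pipeline shown earlier;
			the file name <code>/tmp/filesystem-index.rp</code> and the reduced set of streamlets are assumptions made for illustration, not part of the original example.
		</p>
		
		<m:pre jazyk="shell"><![CDATA[# 1) Run the expensive extraction (OCR, PDF text) once and store the result.
#    The file name is only an example; it is stored outside the scanned
#    directory so that find does not pick it up.
find -print0 \
	| relpipe-in-filesystem \
		--file path \
		--streamlet tesseract \
			--option 'language' 'eng' \
			--as 'ocr_text' \
		--streamlet pdftotext --as 'pdf_text' \
	> /tmp/filesystem-index.rp

# 2) Continue the original pipeline from the stored file,
#    as many times as needed, without repeating the OCR.
cat /tmp/filesystem-index.rp \
	| relpipe-tr-awk \
		--relation filesystem \
		--where 'ocr_text ~ /GNU/ || pdf_text ~ /Sane/' \
	| relpipe-tr-cut filesystem 'path' \
	| relpipe-out-tabular]]></m:pre>
		
		<p>
			The stored file keeps everything that <code>relpipe-in-filesystem</code> emitted,
			so different restrictions and projections can be tried in step 2 without touching the original files again.
		</p>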
       
    83 		
       
    84 		<p>
       
     85 			Please note that this is really fresh: it has not been released yet and is available only in the Mercurial repository.
       
     86 			The streamlets used here can be found in <a href="https://hg.globalcode.info/relpipe/relpipe-in-filesystem.cpp/file/tip/streamlet-examples">streamlet-examples</a>.
       
     87 			Even the upcoming v0.15 release is still a development version: it will work, but the API might change until we release v1.0, which will be stable and production-ready.
       
    88 		</p>
       
    89 		
       
    90 		<p>
       
     91 			Regarding performance:
       
     92 			currently, processing is parallelized only over attributes (each streamlet instance runs in a separate process).
       
     93 			In v0.15 it will also be parallelized over records (files in this case).
       
    94 		</p>
       
    95 		
       
    96 	</text>
       
    97 
       
    98 </stránka>