# HG changeset patch
# User František Kučera
# Date 1579287382 -3600
# Node ID c4b4864225de421f203c65e194bb54220c1bc54f
# Parent 2fab532bda090c58a820fa197d7f2c80160accc6
streamlets preview

diff -r 2fab532bda09 -r c4b4864225de relpipe-data/img/streamlets-preview.png
Binary file relpipe-data/img/streamlets-preview.png has changed
diff -r 2fab532bda09 -r c4b4864225de relpipe-data/streamlets-preview.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/streamlets-preview.xml Fri Jan 17 19:56:22 2020 +0100
@@ -0,0 +1,98 @@
+
+
+ Streamlets preview
+ an early example of streamlets in relpipe-in-filesystem
+
+
+
+

+ This is an early preview published on 2020-01-17, before the v0.15 release. +

+ +

+ First prepare some files: +

+ + + +

+ Collect metadata (file path, extended attributes, image size, number of PDF pages, number of text lines, OCR-recognized text extracted from images and plain text extracted from PDF files), + filter the results (do restriction), select only certain attributes (do projection) + and format the result as a table: +

+ + + +

+ This will print: +

+ + + +

+ How it looks in the terminal: +

+ + + +

+ OCR and PDF text extraction (and other metadata extraction) is done on-the-fly in the pipeline. + OCR in particular may take some time, so in such cases it is usually better to break the pipeline in the middle, + redirect the intermediate result to a file (which serves as an index or cache) and then reuse it multiple times + (just cat the file and continue the original pipeline; by the way, multiple files can simply be concatenated, the format is designed for such use). + But in most cases this is not necessary and we work with live data. +

+ +

+ Please note that this is really fresh: it has not been released yet and can be seen only in the Mercurial repository. + The streamlets used can be seen here: streamlet-examples. + Even the upcoming release v0.15 is still a development version (it will work, but the API might change in the future – until we release v1.0, which will be stable and production-ready). +

+ +

+ Regarding performance: + currently, processing is parallelized only over attributes (each streamlet instance runs in a separate process). + In v0.15 it will also be parallelized over records (files in this case). +

+ +
+ +
\ No newline at end of file