relpipe-data/release-v0.15.xml
branch v_0
changeset 294 abbc9bcfbcc4
parent 282 ec02133045a3
child 299 dd7aeff5ef0c
293:b862d16a2e9f 294:abbc9bcfbcc4
       
     1 <stránka
       
     2 	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
       
     3 	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
       
     4 	
       
     5 	<nadpis>Release v0.15</nadpis>
       
     6 	<perex>new public release of Relational pipes</perex>
       
     7 	<m:release>v0.15</m:release>
       
     8 
       
     9 	<text xmlns="http://www.w3.org/1999/xhtml">
       
    10 		<p>
       
    11 			We are pleased to introduce the new development version of <m:name/>.
       
    12 			This release brings two big new features, streamlets and parallel processing, plus several smaller improvements.
       
    13 		</p>
       
    14 		
       
    15 		<ul>
       
    16 			<li>
       
    17 				<strong>SLEB128</strong>: variable-length integers are now signed (i.e. they can even be negative!) and encoded as SLEB128</li>
       
    18 			<li>
       
    19 				<strong>streamlets in relpipe-in-filesystem</strong>: see details below</li>
       
    20 			<li>
       
    21 				<strong>parallel processing in relpipe-in-filesystem</strong>: see details below</li>
       
    22 			<li>
       
    23 				<strong>multiple modes in relpipe-in-xmltable</strong>: see details below</li>
       
    24 			<li>
       
    25 				<strong>XInclude in relpipe-in-xmltable</strong>: use <code>--xinclude true</code> to process XIncludes before converting XML to relations</li>
       
    26 			<li>
       
    27 				<strong>relpipe-lib-protocol → relpipe-lib-common</strong>: this module was renamed and converted to a shared library; it will contain some common functions instead of just the header files</li>
       
    28 		</ul>
       
    29 		
       
    30 		<p>
       
    31 			See the <m:a href="examples">examples</m:a> and <m:a href="screenshots">screenshots</m:a> pages for details.
       
    32 		</p>
       
    33 		
       
    34 		<p>
       
    35 			Please note that this is still a development release and thus the API (libraries, CLI arguments, formats) might and will change.
       
    36 			Any suggestions, ideas and bug reports are welcome in our <m:a href="contact">mailing list</m:a>.
       
    37 		</p>
       
    38 		
       
    39 		<h2>Streamlets</h2>
       
    40 		
       
    41 		<p>
       
    42 			A <em>streamlet</em> is a small stream that flows into the main stream, fuses with it and (typically) brings new attributes.
       
    43 		</p>
       
    44 		<p>
       
    45 			From the technical point of view, streamlets are something between classic <m:a href="classic-example">filters</m:a> and functions.
       
    46 			Unlike a function, the streamlet can be written in any programming language and runs as a separate process.
       
    47 			Unlike a filter, the streamlet does not replace the whole stream with a new one, but reads certain attributes from the original stream and adds some new ones back.
       
    48 			A common feature of filters and streamlets is that both continually read the input and continually deliver output, so the memory requirements are usually constant and „infinite“ streams can be processed.
       
    49 			And unlike ordinary commands (executed e.g. using <code>xargs</code> or a shell loop over a set of files), the streamlet does not <code>fork()</code> and <code>exec()</code> for each input file – the single streamlet process is reused for all records in the stream, which is much more efficient (especially if there is some expensive initialization phase).
       
    50 		</p>
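		<p>
			The difference from a per-file <code>fork()</code> and <code>exec()</code> can be illustrated with a minimal Python sketch: one child process is started once and then reused for all records.
			The line-based exchange, the <code>CHILD_SOURCE</code> program and the function name are assumptions made for this illustration only; the real streamlet protocol is message-based and is not reproduced here.
		</p>

```python
import subprocess
import sys

# Hypothetical stand-in for a streamlet: it reads one file path per line
# and answers with one line containing the length of that path.
CHILD_SOURCE = r"""
import sys
for line in sys.stdin:
    print(len(line.rstrip("\n")), flush=True)
"""


def add_attribute_with_single_process(paths):
    """Start the child once and reuse it for every record (assumed line protocol)."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    records = []
    for path in paths:
        child.stdin.write(path + "\n")  # send the single input attribute
        child.stdin.flush()
        new_attribute = int(child.stdout.readline())  # read the added attribute
        records.append((path, new_attribute))
    child.stdin.close()  # end of the stream: the loop in the child terminates
    child.wait()
    child.stdout.close()
    return records
```

		<p>
			Any expensive initialization in the child would be paid once, no matter how many paths are sent to it.
		</p>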
       
    51 		
       
    52 		<p>
       
    53 			Because streamlets are small scripts or compiled programs, they can be used for extending <m:name/> with minimal effort.
       
    54 			A streamlet can be e.g. a Bash script of a few lines or, on the other hand, a more powerful C++ or Java program.
       
    55 			Currently we have templates/examples written in Bash, C++ and Java. But it is possible to use any scripting or programming language.
       
    56 			The streamlet communicates with its parent (which manages the whole stream) through a simple <a href="img/streamlet-release-v0.15.png">message-based protocol</a>.
       
    57 			Full documentation will be published when stable (before v1.0.0) as a part of the public API.
       
    58 		</p>
       
    59 		
       
    60 		<p>
       
    61 			The first module where streamlets have been implemented is <code>relpipe-in-filesystem</code>.
       
    62 			Streamlets in this module get a single input attribute (the file path) and add various file metadata to the stream.
       
    63 			We have e.g. streamlets that compute hashes (SHA-256 etc.), extract metadata from image files (PNG, JPEG etc.) or PDF documents (title, author… or even the full content in plain text), recognize text in images using <m:a href="streamlets-preview">OCR</m:a>,
       
    64 			count lines of code, extract portions of XML files using XPath or some metadata from JAR/ZIP files.
       
    65 			Streamlets are a way to keep <code>relpipe-in-filesystem</code> simple, with a small code footprint, while making it extensible and thus powerful.
       
    66 			We are not going to face the question: „Should we add this nice feature (+) and thus also this library dependency (-)? Would it be bloatware or not?“.
       
    67 			We (or the users) can add any feature through streamlets while the core <code>relpipe-in-filesystem</code> will stay simple and nobody who does not need that feature will suffer from the growing complexity.
       
    68 		</p>
       
    69 		<p>
       
    70 			But streamlets are not limited to <code>relpipe-in-filesystem</code> – they are a general concept and there will be a <code>relpipe-tr-streamlet</code> module.
       
    71 			Such streamlets will get any set of input attributes (not only file names) defined by the user and compute values based on them.
       
    72 			Such a streamlet can e.g. modify a text attribute, compute a sum of numeric attributes, encrypt or decrypt values or interact with some external systems.
       
    73 			Writing a streamlet is easier than writing a transformation (like <code>relpipe-tr-*</code>) and it is more than OK to write simple single-purpose ad-hoc streamlets.
       
    74 			It is like writing simple shell scripts or functions.
       
    75 			Examples of really simple streamlets are: <code>inode</code> (Bash) and <code>pid</code> (C++).
       
    76 			Such a streamlet requires implementing only two functions: the first one returns the names and types of the output attributes and the second one returns the values of those attributes for a record.
       
    77 			However, streamlets might be parametrized through options, might return a dynamic number of output attributes and might provide complex logic.
       
    78 			Some streamlets will become a stable part of the <m:name/> specification and API (<code>xpath</code> and <code>hash</code> seem to be such candidates).
       
    79 		</p>
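		<p>
			The two-function shape of a simple streamlet can be sketched in Python as follows.
			The function names, the type labels and the <code>fuse()</code> helper are hypothetical and chosen only for illustration; they are not the actual streamlet templates or the real protocol.
			The sketch is loosely modelled on the idea of the <code>inode</code> streamlet: one input attribute (the file path) and one added attribute.
		</p>

```python
import os


def output_attributes():
    """First function: names and types of the attributes the streamlet adds."""
    return [("inode", "integer")]


def process_record(path):
    """Second function: values of those attributes for one record."""
    return [os.stat(path).st_ino]


def fuse(paths, attributes_function, record_function):
    """Hypothetical parent side: merge the streamlet output into each record."""
    names = [name for name, _type in attributes_function()]
    for path in paths:
        record = {"path": path}  # the original attribute stays in the stream
        record.update(zip(names, record_function(path)))  # new ones are added
        yield record
```

		<p>
			For example, <code>list(fuse(["."], output_attributes, process_record))</code> would return one record with the original <code>path</code> and the added <code>inode</code> attribute.
		</p>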
       
    80 		<p>
       
    81 			One of the open questions is whether to have streamlets in <code>relpipe-in-filesystem</code> when we have <code>relpipe-tr-streamlet</code>.
       
    82 			<em>One tool should do one thing</em> and <em>we should not duplicate the effort</em>…
       
    83 			But it still makes some sense because the file streamlets are a specific kind of streamlets and e.g. Bash completion should suggest them if we work with files but not with other data.
       
    84 			It is also nicer to have all metadata collection on the same level in a single command (i.e. <code>--streamlet</code> beside <code>--file</code> and <code>--xattr</code>)
       
    85 			than having to collect basic and extended file attributes using one command and other file metadata using a different command.
       
    86 		</p>
       
    87 		
       
    88 		<h2>Parallel processing</h2>
       
    89 		
       
    90 		<p>
       
    91 			There are two kinds of parallelism: over attributes and over records.
       
    92 		</p>
       
    93 		
       
    94 		<p>
       
    95 			Because streamlets are forked processes, they are quite naturally parallelized over attributes.
       
    96 			We can e.g. compute a SHA-1 hash in one streamlet and a SHA-256 hash in another streamlet and we will utilize two CPU cores (or we can ask one streamlet to compute both SHA-1 and SHA-256 hashes and then we will utilize only one CPU core).
       
    97 			The <code>relpipe-in-filesystem</code> tool simply 1) feeds all streamlet instances with the current file name, 2) streamlets work in parallel and then 3) the tool collects results from all streamlets.
       
    98 		</p>
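		<p>
			The three steps above (feed, work in parallel, collect) can be sketched in Python.
			This is only an illustration: threads stand in for the forked streamlet processes, and the names <code>file_hash()</code> and <code>collect_attributes()</code> are hypothetical.
		</p>

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def file_hash(path, algorithm):
    """Stand-in for one hash streamlet: digest of the file content."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collect_attributes(path, algorithms=("sha1", "sha256")):
    """Feed the same file name to every worker, then collect all results."""
    with ThreadPoolExecutor(max_workers=len(algorithms)) as pool:
        # 1) feed: every worker gets the current file name
        futures = {name: pool.submit(file_hash, path, name) for name in algorithms}
        # 2) the workers run in parallel; 3) collect the results into one record
        record = {"path": path}
        for name, future in futures.items():
            record[name] = future.result()
    return record
```

		<p>
			With two algorithms, two workers are busy for each record; additional CPU cores stay idle, which is the limitation of parallelism over attributes.
		</p>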
       
    99 		
       
   100 		<p>
       
   101 			But that would not be enough. Today, we usually have more CPU cores than heavy attributes (like hashes).
       
   102 			So we need to process multiple records in parallel.
       
   103 			The first design proposal (not implemented) was that the tool would simply distribute the file names to the STDINs of particular streamlet processes in a round-robin fashion and the processes would write to the common STDOUT (just with a lock for synchronization to keep the records atomic – the <m:name/> data format is specifically designed for such use).
       
   104 			This would be really simple and somewhat helpful (better than nothing).
       
   105 			But this design has a significant flaw: the tool is not aware of how busy particular streamlet processes are and would feed them with tasks (file names) equally.
       
   106 			So it would work satisfactorily only if all tasks were of similar difficulty.
       
   107 			This is unfortunately not the usual case because e.g. computing a hash of a big file takes much more time than computing a hash of a small file.
       
   108 			Thus some streamlet processes would be overloaded while others would be idle, and in the end the whole group would be waiting for the overloaded ones (and only one or a few CPU cores would be utilized).
       
   109 			So this is not a good way to go.
       
   110 		</p>
       
   111 		
       
   112 		<p>
       
   113 			The solution is using a queue. The tool will feed the tasks (file names in the <code>relpipe-in-filesystem</code> case) to the queue
       
   114 			and the streamlet processes will fetch them from the queue as soon as they are idle.
       
   115 			So we will utilize all the CPU cores all the time (obviously if we have more records than CPU cores, which is usually true).
       
   116 			Because our target platforms are POSIX operating systems (the primary one being GNU/Linux), we chose POSIX MQ as the queue.
       
   117 			POSIX MQ is a nice and simple technology, it is standardized and really classic. It does not require any broker process or any third-party library so it does not bring additional dependencies – it is provided directly by the OS.
       
   118 			However, fallback is still possible:
       
   119 			a) if we set <code>--parallel 1</code> (which is the default behavior), it will run directly in a single process without the queue;
       
   120 			b) POSIX MQ has quite a simple API, so it is possible to write an adapter and port the tool to another system that does not have POSIX MQ and still enjoy the parallelism (or simply reimplement this API using shared memory and a semaphore).
       
   121 		</p>
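		<p>
			The difference between the rejected round-robin design and the queue can be shown with a small simulation in Python.
			It does not start any processes; it only computes when the last worker would finish.
			The function names and the task durations are made up for this illustration.
		</p>

```python
import heapq


def round_robin_makespan(durations, workers):
    """Tasks are assigned in turn, regardless of how busy each worker is."""
    load = [0] * workers
    for index, duration in enumerate(durations):
        load[index % workers] += duration
    return max(load)  # everybody waits for the most loaded worker


def queue_makespan(durations, workers):
    """Each worker fetches the next task from the queue as soon as it is idle."""
    free_at = [0] * workers  # min-heap of times when the workers become idle
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # the worker that becomes idle first
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Alternating big and small files (hypothetical durations), two workers:
durations = [10, 1, 10, 1, 10, 1, 10, 1]
print(round_robin_makespan(durations, 2))  # 40: one worker gets all big files
print(queue_makespan(durations, 2))        # 22: the work is spread evenly
```

		<p>
			In this made-up case the round-robin distribution sends all the big files to the same worker, while the queue keeps both workers busy until the end.
		</p>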
       
   122 		
       
   123 		<p>
       
   124 			We could add another queue to the output side and use it for serialization of the stream (which flows to the single STDOUT/FD).
       
   125 			But it is not necessary (thanks to the <m:name/> format design) and would add just some overhead.
       
   126 			So on the output side, we use just a POSIX semaphore (and a lock/guard based on it).
       
   127 			Thus the tool still has no other dependencies than the standard library and the operating system.
       
   128 		</p>
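		<p>
			The combination of an input queue and a lock on the output side can be sketched in Python.
			Here <code>queue.Queue</code> is only a stand-in for POSIX MQ, <code>threading.Lock</code> is a stand-in for the POSIX semaphore, and threads replace the worker processes; the function name, the placeholder attribute and the tab-separated output are assumptions of this sketch, not the <m:name/> format.
		</p>

```python
import io
import queue
import threading


def process_in_parallel(paths, workers=4):
    tasks = queue.Queue()           # stand-in for the POSIX message queue
    output = io.StringIO()          # stand-in for the single shared STDOUT
    output_lock = threading.Lock()  # stand-in for the semaphore-based guard

    def worker():
        while True:
            path = tasks.get()  # fetch a task as soon as this worker is idle
            if path is None:    # sentinel: no more tasks
                return
            attributes = [path, str(len(path))]  # placeholder for the real work
            with output_lock:   # one whole record is written atomically
                for value in attributes:
                    output.write(value + "\t")
                output.write("\n")

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for path in paths:
        tasks.put(path)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return output.getvalue().splitlines()
```

		<p>
			The records may arrive in any order, but thanks to the lock the attributes of two records are never interleaved.
		</p>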
       
   129 		
       
   130 		<p>
       
   131 			If we still have idle CPU cores or machines and need even more parallelism, streamlets can fork their own sub-processes, use threads or some technology like MPI or OpenMP.
       
   132 			However, simple parallel processing of records (<code>--parallel N</code>) is usually more than sufficient and utilizes our hardware efficiently.
       
   133 		</p>
       
   134 		
       
   135 		<h2>XPath modes</h2>
       
   136 		
       
   137 		<p>
       
   138 			Both <code>relpipe-in-xmltable</code> and the <code>xpath</code> streamlet use the XPath language to extract values from XML documents.
       
   139 			There are several modes of value extraction:
       
   140 		</p>
       
   141 		<ul>
       
   142 			<li>
       
   143 				<code>string</code>: this is the default option, simply the text content
       
   144 			</li>
       
   145 			<li>
       
   146 				<code>boolean</code>: the value converted to a boolean in the XPath fashion;
       
   147 				<!--
       
   148 				can be used also to check whether given file is valid XML:
       
   149 				<code>- -streamlet xpath - -option attribute . - -option mode boolean - -as valid_xml</code>
       
   150 				-->
       
   151 			</li>
       
   152 			<li>
       
   153 				<code>raw-xml</code>: a portion of original XML document; 
       
   154 				this is a way to put multiple values or any structured data in a single attribute;
       
   155 				if the XPath points to multiple nodes, it can still be returned as a valid XML document using a configurable wrapper node (so we can return e.g. all headlines from a document)
       
   156 			</li>
       
   157 			<li>
       
   158 				<code>line-number</code>: number of the line where the given node was found;
       
   159 				this can be used for referencing a particular place in the document</li>
       
   160 			<li>
       
   161 				<code>xpath</code>: XPath pointing to a particular node;
       
   162 				it would be a different XPath expression than the original one (which might point to a set of nodes);
       
   163 				this can also be used for referencing a particular place in the document</li>
       
   164 		</ul>
       
   165 		
       
   166 		<p>
       
   167 			Both tools share the naming convention and are configured in a similar way – using e.g. <code>relpipe-in-xmltable --mode raw-xml</code> or <code>--streamlet xpath --option mode raw-xml</code>.
       
   168 		</p>
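		<p>
			The first three modes can be approximated in Python with the standard <code>xml.etree.ElementTree</code> module.
			This is a rough sketch under several assumptions: ElementTree supports only a subset of XPath, the <code>extract()</code> function and its parameters are hypothetical, the wrapper is applied only when the expression does not match exactly one node, and the <code>line-number</code> and <code>xpath</code> modes are omitted.
			The sample document is built programmatically and is made up for this example.
		</p>

```python
import xml.etree.ElementTree as ET


def extract(root, path, mode="string", wrapper="wrapper"):
    """Evaluate a (limited) XPath and return the value in the given mode."""
    nodes = root.findall(path)
    if mode == "string":
        # text content of the first matching node, empty string if none
        return "".join(nodes[0].itertext()) if nodes else ""
    if mode == "boolean":
        # a node set converts to true if it is not empty
        return bool(nodes)
    if mode == "raw-xml":
        if len(nodes) == 1:
            return ET.tostring(nodes[0], encoding="unicode")
        # several nodes: put them under a wrapper node to get a valid document
        box = ET.Element(wrapper)
        box.extend(nodes)
        return ET.tostring(box, encoding="unicode")
    raise ValueError("unsupported mode: " + mode)


# Sample document with two headlines
page = ET.Element("page")
for title in ("Streamlets", "Parallel processing"):
    ET.SubElement(page, "h2").text = title

print(extract(page, "h2"))                                      # first headline as text
print(extract(page, "h3", mode="boolean"))                      # False: no such node
print(extract(page, "h2", mode="raw-xml", wrapper="headlines"))  # both headlines wrapped
```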
       
   169 		
       
   170 		<h2>Feature overview</h2>
       
   171 		
       
   172 		<h3>Data types</h3>
       
   173 		<ul>
       
   174 			<li m:since="v0.8">boolean</li>
       
   175 			<li m:since="v0.15">variable-length signed integer (SLEB128)</li>
       
   176 			<li m:since="v0.8">string in UTF-8</li>
       
   177 		</ul>
       
   178 		<h3>Inputs</h3>
       
   179 		<ul>
       
   180 			<li m:since="v0.11">Recfile</li>
       
   181 			<li m:since="v0.9">XML</li>
       
   182 			<li m:since="v0.13">XMLTable</li>
       
   183 			<li m:since="v0.9">CSV</li>
       
   184 			<li m:since="v0.9">file system</li>
       
   185 			<li m:since="v0.8">CLI</li>
       
   186 			<li m:since="v0.8">fstab</li>
       
   187 			<li m:since="v0.14">SQL script</li>
       
   188 		</ul>
       
   189 		<h3>Transformations</h3>
       
   190 		<ul>
       
   191 			<li m:since="v0.13">sql: filtering and transformations using the SQL language</li>
       
   192 			<li m:since="v0.12">awk: filtering and transformations using the classic AWK tool and language</li>
       
   193 			<li m:since="v0.10">guile: filtering and transformations defined in the Scheme language using GNU Guile</li>
       
   194 			<li m:since="v0.8">grep: regular expression filter, removes unwanted records from the relation</li>
       
   195 			<li m:since="v0.8">cut: regular expression attribute cutter (removes or duplicates attributes and can also DROP whole relation)</li>
       
   196 			<li m:since="v0.8">sed: regular expression replacer</li>
       
   197 			<li m:since="v0.8">validator: just a pass-through filter that crashes on invalid data</li>
       
   198 			<li m:since="v0.8">python: highly experimental</li>
       
   199 		</ul>
       
   200 		<h3>Streamlets</h3>
       
   201 		<ul>
       
   202 			<li m:since="v0.15">xpath (example, unstable)</li>
       
   203 			<li m:since="v0.15">hash (example, unstable)</li>
       
   204 			<li m:since="v0.15">jar_info (example, unstable)</li>
       
   205 			<li m:since="v0.15">mime_type (example, unstable)</li>
       
   206 			<li m:since="v0.15">exiftool (example, unstable)</li>
       
   207 			<li m:since="v0.15">pid (example, unstable)</li>
       
   208 			<li m:since="v0.15">cloc (example, unstable)</li>
       
   209 			<li m:since="v0.15">exiv2 (example, unstable)</li>
       
   210 			<li m:since="v0.15">inode (example, unstable)</li>
       
   211 			<li m:since="v0.15">lines_count (example, unstable)</li>
       
   212 			<li m:since="v0.15">pdftotext (example, unstable)</li>
       
   213 			<li m:since="v0.15">pdfinfo (example, unstable)</li>
       
   214 			<li m:since="v0.15">tesseract (example, unstable)</li>
       
   215 		</ul>
       
   216 		<h3>Outputs</h3>
       
   217 		<ul>
       
   218 			<li m:since="v0.11">ASN.1 BER</li>
       
   219 			<li m:since="v0.11">Recfile</li>
       
   220 			<li m:since="v0.9">CSV</li>
       
   221 			<li m:since="v0.8">tabular</li>
       
   222 			<li m:since="v0.8">XML</li>
       
   223 			<li m:since="v0.8">nullbyte</li>
       
   224 			<li m:since="v0.8">GUI in Qt</li>
       
   225 			<li m:since="v0.8">ODS (LibreOffice)</li>
       
   226 		</ul>
       
   227 		
       
   228 		<h2>New examples</h2>
       
   229 		<ul>
       
   230 			<li><m:a href="examples-parallel-hashes">Computing hashes in parallel</m:a></li>
       
   231 			<li><m:a href="examples-runnable-jars">Finding runnable JARs</m:a></li>
       
   232 			<li><m:a href="examples-xhtml-filesystem-xpath">Collecting statistics from XHTML pages</m:a></li>
       
   233 		</ul>
       
   234 		
       
   235 		<h2>Backward incompatible changes</h2>
       
   236 		
       
   237 		<p>
       
   238 			The data format has changed: SLEB128 is now used for encoding numbers.
       
   239 			If the data format was used only on the fly, no additional steps are required during the upgrade.
       
   240 			If the data format was used for persistence (streams redirected to files), the recommended upgrade procedure is:
       
   241 			convert the files to XML using the old version of <code>relpipe-out-xml</code> and then convert them back from XML using the new version of <code>relpipe-in-xml</code>.
       
   242 		</p>
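		<p>
			For readers not familiar with SLEB128, the following Python sketch encodes and decodes a signed integer in the standard SLEB128 way (as known e.g. from the DWARF format): seven bits per byte, least significant group first, with the highest bit marking that another byte follows.
			It is written from the general definition of the encoding, not taken from the <m:name/> sources, and uses plain arithmetic where bit operations would be equivalent.
		</p>

```python
def sleb128_encode(value):
    """Encode a signed integer as SLEB128 bytes."""
    out = bytearray()
    while True:
        # floor division by 128 is the same as an arithmetic shift right by 7
        value, low7 = divmod(value, 128)
        sign_bit_set = low7 >= 64
        # stop when the rest is pure sign extension of the last group
        done = (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set)
        out.append(low7 if done else low7 + 128)  # +128 sets the continuation bit
        if done:
            return bytes(out)


def sleb128_decode(data):
    """Decode one SLEB128 value from the beginning of the given bytes."""
    result = 0
    shift = 0
    for byte in data:
        result += (byte % 128) * 2 ** shift
        shift += 7
        if byte >= 128:
            continue  # continuation bit set: more bytes follow
        if byte >= 64:
            result -= 2 ** shift  # sign bit of the last group: negative number
        return result
    raise ValueError("truncated SLEB128 value")


print(sleb128_encode(-123456).hex())               # c0bb78
print(sleb128_decode(bytes([0xC0, 0xBB, 0x78])))   # -123456
```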
       
   243 		
       
   244 		<h2>Installation</h2>
       
   245 		
       
   246 		<p>
       
   247 			Installation was tested on Debian GNU/Linux 10.2.
       
   248 			The process should be similar on other distributions.
       
   249 		</p>
       
   250 		
       
   251 		<m:pre src="examples/release-v0.15.sh" jazyk="bash" odkaz="ano"/>
       
   252 		
       
   253 		<p>
       
   254 			<m:name/> are modular, so you can download and install only the parts you need (the libraries are always needed).
       
   255 			Tools <code>out-gui.qt</code> and <code>tr-python</code> require additional libraries and are not built by default.
       
   256 		</p>
       
   257 		
       
   258 	</text>
       
   259 
       
   260 </stránka>