classic pipeline example v_0
author František Kučera <franta-hg@frantovo.cz>
Sun, 25 Nov 2018 19:58:06 +0100
branch v_0
changeset 144 ee7e96151673
parent 143 297da74fcab2
child 145 42bbbccd87f3
classic pipeline example
relpipe-data/animals.txt
relpipe-data/classic-example.xml
relpipe-data/index.xml
relpipe-data/makra/classic-example.xsl
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/animals.txt	Sun Nov 25 19:58:06 2018 +0100
@@ -0,0 +1,6 @@
+large white cat
+medium black cat
+big yellow dog
+small yellow cat
+small white dog
+medium green turtle
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/classic-example.xml	Sun Nov 25 19:58:06 2018 +0100
@@ -0,0 +1,119 @@
+<stránka
+	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+	
+	<nadpis>Classic pipeline example</nadpis>
+	<perex>An explained example of a classic pipeline</perex>
+
+	<text xmlns="http://www.w3.org/1999/xhtml">
+		<p>
+			Assume that we have a text file containing a list of animals and their properties:
+		</p>
+		
+		<m:pre src="animals.txt"/>
+				
+		<p>
+			We can pass this file through a pipeline:
+		</p>
+		
+		<m:classic-example/>
+		
+		<p>
+			The individual steps of the pipeline are separated by the | pipe symbol.
+			In the first step, we just read the file and print it on STDOUT.<m:podČarou>Of course, this is a <a href="http://porkmail.org/era/unix/award.html" title="Useless Use of Cat">UUoC</a>, but in examples, keeping the steps in order makes them easier to read than using &lt; file redirections.</m:podČarou>
+			In the second step, we keep only the dogs and get:
+		</p>
+		
+		<pre><![CDATA[big yellow dog
+small white dog]]></pre>
+
+		<p>
+			In the third step, we select the second <em>field</em> (fields are separated by spaces) and get the colors of our dogs:
+		</p>
+		
+		<pre><![CDATA[yellow
+white]]></pre>
+
+		<p>
+			In the fourth step, we translate the values to uppercase and get:
+		</p>
+		
+		<pre><![CDATA[YELLOW
+WHITE]]></pre>
+
+		<p>
+			So we have a list of the colors of our dogs, printed in upper case.
+			If we have several dogs of the same color, we could avoid duplicates simply by adding <code>| sort -u</code> to the pipeline (after the <code>cut</code> part).
+		</p>
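+		
+		<p>
+			A minimal sketch of that variant, assuming the same <code>animals.txt</code> as above:
+		</p>
+		
+		<pre><![CDATA[cat animals.txt | grep dog | cut -d " " -f 2 | sort -u | tr a-z A-Z]]></pre>
+		
+		<p>
+			With our file this prints WHITE and then YELLOW: each color appears only once, but the values now come out in sorted order rather than in the order of the file.
+		</p>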
+
+		<h2>The great parts</h2>
+		
+		<p>
+			The authors of the <code>cat</code>, <code>grep</code>, <code>cut</code> or <code>tr</code> programs don't have to know anything about cats<m:podČarou>n.b. the cat in the command name is a different cat than the one in our text file</m:podČarou>, dogs or our business domain.
+			They can focus on their own tasks: reading files, filtering by regular expressions, extracting substrings and converting text. And they do it well without being distracted by any animals.
+		</p>
+		
+		<p>
+			And we don't have to know anything about low-level programming in the C language or compile anything.
+			We simply build a pipeline in a shell (e.g. GNU Bash) from existing programs and focus on our business logic.
+			And we do it well without being distracted by any low-level issues.
+		</p>
+		
+		<h2>The pitfalls</h2>
+		
+		<p>
+			This simple example looks quite flawless.
+			But actually it is very brittle.
+		</p>
+		
+		<p>
+			What if we have a very big cat that can be described by this line in our file?
+		</p>
+		
+		<pre>dog-sized red cat</pre>
+		
+		<p>In the second step of the pipeline (<code>grep</code>) we will include this record and the final result will be:</p>
+		
+		<pre><![CDATA[RED
+YELLOW
+WHITE]]></pre>
+
+		<p>This is an unexpected and unwanted result. We don't have a RED dog; this is just an accident. The same would happen if we had a monkey of a <em>doggish</em> color.</p>
+		
+		<p>
+			This problem is caused by the fact that <code>grep dog</code> keeps every line containing the word <em>dog</em>, regardless of its position (first, second or third field).
+			Sometimes we could avoid such problems by using a slightly more complicated regular expression and/or by using Perl, but our pipeline wouldn't be as simple and legible as before.
+		</p>
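+		
+		<p>
+			As a minimal sketch of such a workaround (just one possible variant, assuming that the species is always the last word on the line and that there is no trailing whitespace), we could anchor the expression to the end of the line:
+		</p>
+		
+		<pre><![CDATA[cat animals.txt | grep ' dog$' | cut -d " " -f 2 | tr a-z A-Z]]></pre>
+		
+		<p>
+			This no longer matches <em>dog-sized red cat</em>, but the filter now silently depends on the position of the field and on the exact whitespace in the file.
+		</p>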
+		
+		<p>
+			What if we have a turtle that has a lighter color than the other turtles?
+		</p>
+		
+		<pre>small light green turtle</pre>
+		
+		<p>
+			If we do <code>grep turtle</code>, the filter works well in this case, but our pipeline fails in the third step, where <code>cut</code> selects only <em>light</em> (instead of <em>light green</em>).
+			And the final result will be:
+		</p>
+		
+		<pre><![CDATA[GREEN
+LIGHT]]></pre>
+		
+		<p>
+			This is definitely wrong, because the second turtle is not LIGHT, it is LIGHT GREEN.
+			This problem is caused by the fact that we don't have well-defined separators between fields.
+			Sometimes we could avoid such problems by restrictions or assumptions, e.g. <em>the color must not contain a space character</em> (we could replace spaces by hyphens).
+			Or we could use some other field delimiter, e.g. ; or | or ,. But we still would not be able to use that character in the field values.
+			So we must invent some kind of escaping (e.g. <code>\;</code> is not a separator but a part of the field value)
+			or add some quotes/apostrophes (which still requires escaping, because what if we have e.g. a name field containing an apostrophe?).
+			And parsing such inputs with classic tools and regular expressions is not easy and sometimes not even possible.
+		</p>
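+		
+		<p>
+			A minimal sketch of the delimiter variant, assuming a hypothetical semicolon-separated file <code>animals-semicolon.txt</code> (this file name and format are made up for illustration and are not part of the example above):
+		</p>
+		
+		<pre><![CDATA[# hypothetical input: the same two turtles, with fields separated by semicolons
+printf '%s\n' 'medium;green;turtle' 'small;light green;turtle' > animals-semicolon.txt
+cat animals-semicolon.txt | grep ';turtle$' | cut -d ";" -f 2 | tr a-z A-Z]]></pre>
+		
+		<p>
+			Here <code>cut</code> keeps <em>light green</em> together and the result is GREEN and LIGHT GREEN, but only as long as no field value contains a semicolon.
+		</p>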
+		
+		<p>
+			There are also other problems, like character encoding, missing meta-data (e.g. field names and types), joining multiple files (Is there always a new-line character at the end of the file? Or is there a BOM at the beginning of the file?)
+			or passing several types of data in a single stream (we have a list of animals and we could also have e.g. a list of foods or a list of our staff, where each list has different fields).
+		</p>
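+		
+		<p>
+			The new-line problem can be sketched with two hypothetical files (<code>first.txt</code> and <code>second.txt</code> are made-up names), where the first one lacks the final new-line character:
+		</p>
+		
+		<pre><![CDATA[printf 'big yellow dog' > first.txt        # no new-line at the end of the file
+printf 'small white dog\n' > second.txt
+cat first.txt second.txt | grep -c dog    # counts matching lines]]></pre>
+		
+		<p>
+			This prints 1 instead of 2, because the two records were silently merged into a single line.
+		</p>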
+
+	</text>
+
+</stránka>
--- a/relpipe-data/index.xml	Sun Nov 25 01:03:26 2018 +0100
+++ b/relpipe-data/index.xml	Sun Nov 25 19:58:06 2018 +0100
@@ -20,10 +20,14 @@
 			Each running program (process) has one input stream (called standard input or STDIN) and one output stream (called standard output or STDOUT) and also one additional output stream for logging/errors/warnings (STDERR).
 			We can connect programs and pass the STDOUT of first one to the STDIN of the second one (etc.) using pipes.
 		</p>
+		
+		<p>
+			A classic pipeline example (<m:a href="classic-example">explained</m:a>):
+		</p>
+		
+		<m:classic-example/>
 
 		<!--		
-		<pre>cat /etc/fstab | dd 2>/tmp/dd.log | grep tmpfs</pre>
-		<p></p>
 		<m:diagram orientace="vodorovně">
 			node[shape=box];
 			
@@ -42,7 +46,7 @@
 			According to this principle we can build complex and powerful programs (pipelines) by composing several simple, single-purpose and reusable programs.
 			Such single-purpose programs (often called <em>filters</em>) are much easier to create, test and optimize and their authors don't have to bother about the complexity of the final pipeline.
 			They even don't have to know, how their programs will be used in the future by others.
-			This is a great design principle that brings us advanced flexibility, reusability, efficiency and reliability. Simply: awesome.
+			This is a great design principle that brings us advanced flexibility, reusability, efficiency and reliability.
 			Being in any role (author of a filter, builder of a pipeline etc.), we can always focus on our task only and do it well.
 			And we can collaborate with others even if we don't know about them and we don't know that we are collaborating.
 			Now think about putting this together with the free software ideas...  How very!
@@ -79,6 +83,19 @@
 		</m:diagram>
 		-->
 		
+		
+		<p>Bytes, text, structured data? XML, YAML, JSON, ASN.1</p>
+		
+		<p>Rules:</p>
+		
+		<ul>
+			<li>a stream contains zero or more relations</li>
+			<li>a relation has a name</li>
+			<li>a relation has one or more attributes</li>
+			<li>a relation contains zero or more records</li>
+		</ul>
+		
+		
 		<h2>What <m:name/> are?</h2>
 		
 		<p>
@@ -101,12 +118,12 @@
 		
 		<ul>
 			<li>Shell – we use existing shells (e.g. GNU Bash), work with any shell and even without a shell (e.g. as a stream format passed through a network or stored in a file).</li>
-			<li>Terminal emulator – same as with shells, we use existing terminals and we can use <m:name/> also outside any terminal; if we interact with any terminal, we use standard means as Unicode, ANSI escape sequences etc.</li>
-			<li>IDE – we use standard <m:unix/> tools as an IDE (GNU Screen, Make etc.) or any other IDE.</li>
+			<li>Terminal emulator – same as with shells, we use existing terminals and we can use <m:name/> also outside any terminal; if we interact with the terminal, we use standard means such as Unicode, ANSI escape sequences etc.</li>
+			<li>IDE – we can use standard <m:unix/> tools as an IDE (GNU Screen, Make etc.) or any other IDE.</li>
 			<li>Programming language – <m:name/> are language-independent data format and can be produced or consumed in any programming language.</li>
 			<li>Query language – although some of our tools are doing queries, filtering or transformations, we are not inventing a new query language – instead, we use existing languages like SQL, XPath or regular expressions.</li>
 			<!--<li>Text editor – </li>-->
-			<li>Database system, DBMS – we focus on the stream processing rather than data storage. Although sometimes it makes sense to pipe data to a file and continue with the processing later.</li>
+			<li>Database system, DBMS – we focus on stream processing rather than data storage, although sometimes it makes sense to redirect data to a file and continue with the processing later.</li>
 		</ul>
 		
 		
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/makra/classic-example.xsl	Sun Nov 25 19:58:06 2018 +0100
@@ -0,0 +1,17 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<xsl:stylesheet version="2.0"
+	xmlns="http://www.w3.org/1999/xhtml"
+	xmlns:h="http://www.w3.org/1999/xhtml"
+	xmlns:s="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+	xmlns:k="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/konfigurace"
+	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro"
+	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+	xmlns:fn="http://www.w3.org/2005/xpath-functions"
+	xmlns:svg="http://www.w3.org/2000/svg"
+	xmlns:xs="http://www.w3.org/2001/XMLSchema"
+	exclude-result-prefixes="fn h s k m xs">
+
+	<xsl:template match="m:classic-example"><pre>cat animals.txt | grep dog | cut -d " " -f 2 | tr a-z A-Z</pre></xsl:template>
+
+</xsl:stylesheet>
+