relpipe-data/index.xml
author František Kučera <franta-hg@frantovo.cz>
Thu, 13 Dec 2018 13:58:06 +0100
branch v_0
changeset 212 bf9a704dc916
parent 183 82897ccc01ce
child 217 3e2fd4ce9f02
permissions -rw-r--r--
examples: relpipe-tr-cut

<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Relational pipes</nadpis>
	<perex>Official homepage of Relational pipes.</perex>
	<pořadí>10</pořadí>

	<text xmlns="http://www.w3.org/1999/xhtml">
		<p>
			One of the great parts of the <m:unix/>
			<m:podČarou><m:unix tvar="vysvětlivka"/></m:podČarou>
			culture is the invention<m:podČarou>which is attributed to Doug McIlroy, see <a href="http://www.catb.org/~esr/writings/taoup/html/ch07s02.html#plumbing">The Art of Unix Programming: Pipes, Redirection, and Filters</a></m:podČarou>
			of <em>pipes</em> and the idea<m:podČarou>see <a href="http://www.catb.org/~esr/writings/taoup/html/ch01s06.html">The Art of Unix Programming: Basics of the Unix Philosophy</a></m:podČarou> 
			that <em>one program should do one thing and do it well</em>.
		</p>
		
		<p>
			Each running program (process) has one input stream (called standard input or STDIN), one output stream (called standard output or STDOUT) and one additional output stream for logging, errors and warnings (STDERR).
			Using pipes, we can connect programs and pass the STDOUT of the first one to the STDIN of the second one, and so on.
		</p>
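		<p>
			For illustration, a minimal pipeline in a POSIX shell (generic commands, not relpipe tools): the STDOUT of the first process becomes the STDIN of the second one.
		</p>

```shell
# printf writes three lines to its STDOUT;
# the pipe connects it to the STDIN of grep, which keeps only matching lines.
printf 'alpha\nbeta\ngamma\n' | grep 'mm'
# prints: gamma
```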
		
		<p>
			A classic pipeline example (<m:a href="classic-example">explained</m:a>):
		</p>
		
		<m:classic-example/>

		<!--		
		<m:diagram orientace="vodorovně">
			node[shape=box];
			
			cat  [label="cat /etc/fstab"];
			dd   [];
			grep [label="grep tmpfs"];
			log  [label="/tmp/dd.log"];
			
			cat -> dd  [label="STDOUT → STDIN"];
			dd -> grep [label="STDOUT → STDIN"];
			dd -> log  [label="STDERR → file"];
		</m:diagram>
		-->
		
		<p>
			According to this principle we can build complex and powerful programs (pipelines) by composing several simple, single-purpose and reusable programs.
			Such single-purpose programs (often called <em>filters</em>) are much easier to create, test and optimize, and their authors don't have to bother with the complexity of the final pipeline.
			They don't even have to know how their programs will be used in the future by others.
			This is a great design principle that brings flexibility, reusability, efficiency and reliability.
			Whatever our role (author of a filter, builder of a pipeline etc.), we can always focus on our own task and do it well.<m:podČarou>see <a href="http://wiki.apidesign.org/wiki/Cluelessness">cluelessness</a> by Jaroslav Tulach in his <em>Practical API Design. Confessions of a Java Framework Architect</em></m:podČarou>
			And we can collaborate with others even when we don't know about them and don't know that we are collaborating.
			Now think about putting this together with the free software ideas...  How very!
		</p>
		
		<!--
		<m:diagram orientace="vodorovně">
			compound=true;
			node[shape=box];
			
			subgraph cluster_in {
			label = "Inputs:";
			cli;
			fstab;
			}
			
			subgraph cluster_tr {
			label = "Transformations:";
			grep;
			sed;
			}
			
			subgraph cluster_out {
			label = "Outputs:";
			xml;
			tabular;
			gui;
			}
			
			cli -> grep  [ltail=cluster_in, lhead=cluster_tr];
			grep -> xml [ltail=cluster_tr, lhead=cluster_out];
			// cli -> xml [ltail=cluster_in, lhead=cluster_out];
			
		</m:diagram>
		-->
		
		
		<p>
			But the question is: how should the data passed through pipes be formatted and structured?
			There is a wide spectrum of options, from simple unstructured text files (just arrays of lines)
			through various <abbr title="delimiter-separated values, e.g. CSV separated by commas">DSV</abbr>
			to formats like XML (YAML, JSON, ASN.1, Diameter, S-expressions etc.).
			Simpler formats look tempting but have many problems and limitations (see the Pitfalls section in the <m:a href="classic-example">Classic pipeline example</m:a>).
			On the other hand, the advanced formats are capable of representing arbitrary object tree structures or even arbitrary graphs.
			They offer unlimited possibilities – and this is their strength and their weakness at the same time.
		</p>
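		<p>
			For example, a naïve comma-separated format breaks as soon as a value contains the delimiter itself – a hypothetical illustration using standard tools:
		</p>

```shell
# The comma inside the value "hello, world" shifts the columns:
# cut sees three fields instead of two and silently truncates the value.
printf 'name,comment\nAlice,hello, world\n' | cut -d, -f2
# prints: comment
# prints: hello
```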
		
		<!--
		<blockquote>Everything should be made as simple as possible, but not simpler.</blockquote>
		-->
		
		<p>
			It is not about the shape of the brackets, apostrophes or quotes, nor about text vs. binary.
			It is not a technical question – it lies in the semantic layer and the human brain.
			Generic formats and their <em>arbitrary object trees/graphs</em> are (for humans, not for computers) difficult to understand and work with
			– compared to simpler structures like arrays, maps or matrices.
		</p>
		
		<p>
			This is the reason why we have chosen the relational model as our logical model.
			This model comes from 1969<m:podČarou>invented and described by Edgar F. Codd, 
				see <em>Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks, Research Report, IBM</em> from 1969 
				and <em>A Relational Model of Data for Large Shared Data Banks</em> from 1970, 
				see also <a href="https://en.wikipedia.org/wiki/Relational_model">Relational model</a>
			</m:podČarou>
			and over the decades it has proven its qualities and viability.
			This logical model is powerful enough to describe almost any data and – at the same time – it is simple enough to be easily understood by humans.
		</p>
		
		<p>
			Thus the <m:name/> are streams containing zero or more relations.
			Each relation has a name, one or more attributes and zero or more records (tuples).
			Each attribute has a name and a data-type.
			Records contain attribute values.
			We can imagine such a stream as a sequence of tables (but a table is only one of many possible visual representations of such relational data).
		</p>
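		<p>
			A sketch of such a stream in practice – assuming the relpipe-in-fstab and relpipe-out-tabular tools from our reference implementation (their names and options may still change before v1.0):
		</p>

```shell
# Parse /etc/fstab into a relation and render it as a table.
# The stream flowing between the two processes is the relational data format itself.
cat /etc/fstab | relpipe-in-fstab | relpipe-out-tabular
```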
		
		<h2>What are <m:name/>?</h2>
		
		<p>
			<m:name/> are an open <em>data format</em> designed for streaming structured data between two processes. 
			Simultaneously with the format specification, we are also developing a <em>reference implementation</em> (libraries and tools) as free software.
			Although we believe in the specification-first (or contract-first) approach, we always check whether the theoretical concepts are feasible and whether they can be reasonably and reliably implemented.
			So before publishing any new specification or version of it, we verify it by creating a reference implementation in at least one programming language.
		</p>
		<p>
			More generally, <m:name/> are a philosophical continuation of the classic <m:unix/> pipelines and the relational model.
		</p>
		
		
		<h2>What are <m:name/> not?</h2>
			
		<p>
			<m:name/> respect the existing ecosystem and are rather an improvement or supplement than a replacement.
			So the <m:name/> are not a:
		</p>
		
		<ul>
			<li>Shell – we use existing shells (e.g. GNU Bash), work with any shell and even without a shell (e.g. as a stream format passed through a network or stored in a file).</li>
			<li>Terminal emulator – same as with shells, we use existing terminals and we can use <m:name/> also outside any terminal; if we interact with the terminal, we use standard means like Unicode, ANSI escape sequences etc.</li>
			<li>IDE – we can use standard <m:unix/> tools as an IDE (GNU Screen, Emacs, Make etc.) or any other IDE.</li>
			<li>Programming language – <m:name/> are language-independent data format and can be produced or consumed in any programming language.</li>
			<li>Query language – although some of our tools are doing queries, filtering or transformations, we are not inventing a new query language – instead, we use existing languages like SQL, XPath or regular expressions.</li>
			<!--<li>Text editor – </li>-->
			<li>Database system, DBMS – we focus on stream processing rather than data storage, although it sometimes makes sense to redirect the data to a file and continue processing later.</li>
		</ul>
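		<p>
			For instance, filtering with an existing language (regular expressions) rather than a new one – assuming the relpipe-tr-grep transformation from our reference implementation; the exact arguments shown here are an assumption and may differ between versions:
		</p>

```shell
# Keep only the fstab records whose "type" attribute matches the regular expression "tmpfs",
# then render the filtered relation as a table.
cat /etc/fstab | relpipe-in-fstab | relpipe-tr-grep fstab type 'tmpfs' | relpipe-out-tabular
```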
		
		
		<h2>Project status</h2>
		
		<p>
			The main ideas and the roadmap are quite clear, but many things will change (including the format internals and interfaces of the libraries and tools).
			Because we understand how important API and ABI stability is, we are not ready to publish version 1.0 yet.
		</p>
		<p>
			On the other hand, the already published tools (tagged as v0.x in the v_0 branch) should work quite well (they should compile, run, not segfault often, not wipe your hard drive or kill your cat),
			so they might be useful for someone who likes our ideas and is prepared to update their own programs and scripts when a new version is ready.
		</p>

		
	</text>

</stránka>