relpipe-data/principles.xml
author František Kučera <franta-hg@frantovo.cz>
Thu, 13 Dec 2018 13:58:06 +0100
branch v_0
changeset 212 bf9a704dc916
parent 210 f0a2916368e2
child 231 ea49ee7a73c9
permissions -rw-r--r--
examples: relpipe-tr-cut

<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Principles</nadpis>
	<perex>Basic ideas, principles and rules behind the Relational pipes</perex>
	<pořadí>12</pořadí>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<h2>Sane software</h2>
		<p>
			<m:name/> (both the specification and the reference implementation) should be developed according to the <a href="https://sane-software.globalcode.info/">Sane software manifesto</a> (draft).
			Many of the principles mentioned below are part of <em>being sane</em>.
		</p>
		
		<h2>Free software and open specification</h2>
		
		<p>
			<m:name/> is and always will be <a href="https://www.gnu.org/philosophy/free-sw.html">free software</a>, and the specification of the format, tools and libraries will be open.
			It must not be impaired by software patents or other similar restrictions.
			In our country, we do not accept the existence of patents at all.
		</p>
		
		<h2>Divide and conquer</h2>
		<p>
			Each program should do one thing and do it well. We should separate these three tasks:
		</p>
		
		<ul>
			<li>data acquisition / creation</li>
			<li>data transformation</li>
			<li>data presentation</li>
		</ul>
		
		<p>
			A single program should not combine two or more of these tasks, or it should at least offer a mode in which it performs only one of them.
			Thus we should be able to combine various programs and get various presentations of the same data regardless of the presentation features of the program that created the data.
			We should be able to add another transformation on the path between the data origin and the data destination, for example to filter out some unwanted data or to modify or enhance the values.
			Or we should be able to generate some mock/testing data and pass it through the original pipeline (sequence of transformations and the output filter) instead of the live data.
			We should be free in how we combine the tools together.
			We should even be able to build pipelines that were not anticipated by the authors of the particular tools we used.
		</p>
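		
		<p>
			The separation of the three tasks can be sketched with plain shell functions standing in for real tools.
			This is only an illustration: the function names (<code>acquire_data</code>, <code>mock_data</code>, <code>transform_data</code>, <code>present_data</code>) are hypothetical
			and the tab-separated text is an assumed stand-in for the <m:name/> format.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[
```shell
# Acquisition: only produces raw records (tab-separated: name, size).
acquire_data() {
    printf 'alpha\t10\n'
    printf 'beta\t200\n'
    printf 'gamma\t30\n'
}

# Mock source with the same record layout, usable instead of the live data.
mock_data() {
    printf 'test\t999\n'
}

# Transformation: keeps only the records whose size is below 100.
transform_data() {
    awk -F '\t' '$2 < 100'
}

# Presentation: formats the records for a human reader.
present_data() {
    awk -F '\t' '{ printf "%s: %s bytes\n", $1, $2 }'
}

# Each stage knows nothing about the others; they meet only in the pipeline.
acquire_data | transform_data | present_data
```
]]></m:pre>
		
		<p>
			Replacing <code>acquire_data</code> with <code>mock_data</code>, or dropping <code>transform_data</code>, requires no change in the remaining stages.
		</p>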
		
		<p>
			Authors should focus only on their own task (e.g. <em>interaction with the kernel and capturing the inotify events</em>) and should not have to worry about the presentation of the captured data.
			There might be many output formats that make sense (CSV, XML, table, YAML, \0 separated values etc.),
			but we should keep it <abbr title="Don't repeat yourself">DRY</abbr> and not implement every format in every tool.
			It would be a waste of time and also a source of errors, because when developing some additional format (which is not our core business) merely <em>on the side</em>, we would probably do it wrong.
		</p>
		
		
		<h2>Inputs, outputs and transformations as reusable libraries</h2>
		
		<p>
			Parts of the <m:name/> implementation might be used as a library instead of as a filter in a pipeline.
			This is not a primary purpose of our software, but sometimes it might be useful.
			In such a scenario, the data are never serialized in the <m:name/> format but flow through a single process and its method/function calls.
			For instance, if we need a tabular or CSV output in our program, we could adopt the code from the <m:name/> implementation as a library and call it internally without generating data in the <m:name/> format.
			This might bring some performance benefits.
		</p>
		
		<p>
			This is not a recommended approach, but should be possible.
		</p>
		
		<p>
			However, in any case, we should also provide an option of producing <em>raw</em> data in the <m:name/> format and allow others to convert it to any other format according to their needs.
		</p>
		
		<h2>Specification-first, contract-first</h2>
		
		<p>
			The starting point for any developer should be the <m:a href="specification">specification</m:a> that defines the contract and the interface between the system components.
			It should cover the data format and also the tools (inputs, transformers and outputs).
			The specification must be verified by creating a reference implementation in at least one programming language.
		</p>
		
		<h2>Small code footprint and modular design</h2>
		
		<p>
			The length of the program measured in source lines of code (SLOC) should be as small as possible.
			Of course, the goal is not putting multiple statements on a single line.
			We should avoid unnecessary complexity (see <a href="https://en.wikipedia.org/wiki/Cyclomatic_complexity">Cyclomatic complexity</a>; SLOC are easier to count, however, and still give quite relevant information).
		</p>
		
		<p>
			Modular design allows users to include (download, compile, run) only the portions of software they need.
			If the user needs e.g. only regular expressions and XML output, they should not be forced to also include the code for CSV, YAML, JSON and PDF.
		</p>
		
		<p>
			Sane software is minimalistic in this way, which means that it is easy to audit, debug or modify.
			Looking for a bug (or even a backdoor) or looking for the place to add a new feature
			is much easier in software that has hundreds or thousands of SLOC than in software consisting of hundreds of thousands or even millions of SLOC.
		</p>
		
		<p>
			A developer who wants to generate (or, on the other side, consume) relational data should need to include only a few hundred SLOC.
			This is the amount of code that could be read through in an hour or two.
			<!--
			Thus implementing the relational output to an existing program should be matter of few hours.
			-->
		</p>
		
		
		<h2>Sane dependencies</h2>
		
		<p>
			In the best case, the libraries and the tools should not depend on any libraries other than the standard library of the given programming language.
			This might be in conflict with the previous rule, and then the question is which is the lesser harm.
			It definitely makes no sense to write e.g. an XML or YAML parser ourselves as a part of our tool.
			Using a high-quality and well-tested library is the only sane option.
			But what about XML output? We can develop a reliable XML generator in a few lines of code because we can implement only the subset of the standard that we need.
			Writing such code is much more sane than including some bulky library that has several orders of magnitude more lines of code than our program.
		</p>
		
		<h2>Concise data serialization</h2>
		
		<p>
			The <m:name/> data format should be concise: the data should be represented by a reasonably small number of bytes.
			The format should support large amounts of small values and also sparse data (structures with many NULL/missing values) without wasting too much space.
			The data that are not written don't need to be compressed and thus have the best compression ratio.
		</p>
		
		<h2>Streaming</h2>
		
		<p>
			Relational tools should process streams of data and should hold only the necessary data in memory,
			i.e. the tool should produce the output (the first record) as soon as possible while still reading the input (following records).
			Thus the memory usage does not depend on the volume of processed data.
		</p>
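		
		<p>
			A minimal sketch of record-by-record streaming in the shell.
			The functions <code>stream_records</code> and <code>slow_producer</code> are hypothetical and work on plain text lines, not on the <m:name/> format;
			they only illustrate that each record is written as soon as it has been read.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[
```shell
# Streams records: each input line is transformed and written immediately,
# so only one record is held in memory at a time.
stream_records() {
    while IFS= read -r record; do
        printf 'processed: %s\n' "$record"
    done
}

# Hypothetical producer: emits three records, pausing for the given
# number of seconds (default 0) after each of them.
slow_producer() {
    delay="${1:-0}"
    for i in 1 2 3; do
        printf 'record %s\n' "$i"
        sleep "$delay"
    done
}

slow_producer 0 | stream_records
```
]]></m:pre>
		
		<p>
			Running <code>slow_producer 1 | stream_records</code> in a terminal should show the lines appearing one by one instead of all at the end.
			A tool like <code>sort</code>, on the other hand, has to read its whole input before it can print the first line.
		</p>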
		
		<p>
			However, there are cases where such streaming is not feasible, e.g. if we need to compute some statistics or the column widths while printing a table in the terminal.
			In such a situation, we must read the whole relation and only then generate the output.
			But we should still be able to do streaming on the relation level, i.e. if there are multiple relations, we always hold only one of them in memory.
		</p>
		
		<p>
			This rule is important not only from the performance point of view but also for user experience.
			The user should see the output as soon as possible, i.e. long-running processes should produce results continuously instead of flushing everything at the end.
			This is also good for debugging and <em>looking inside the things</em>. 
		</p>
		
		<h2>Unambiguity</h2>
		
		<p>
			There should be only one way to represent a single value.
			For example, booleans can be written as <code>00</code> (false) or <code>01</code> (true), and every other value (<code>02..FF</code>) should be invalid/unsupported.
			Exceptions might occur if there are relevant reasons, but they should be rare.
		</p>
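		
		<p>
			The boolean rule can be illustrated by a small decoder.
			This is a sketch only: <code>decode_boolean</code> is a hypothetical helper that takes the byte as two hexadecimal digits, it is not part of the reference implementation.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[
```shell
# Decodes a boolean given as two hexadecimal digits (one byte).
# Only 00 and 01 are accepted; every other value is rejected.
decode_boolean() {
    case "$1" in
        00) printf 'false\n' ;;
        01) printf 'true\n' ;;
        *)
            printf 'invalid boolean value: %s\n' "$1" >&2
            return 1 ;;
    esac
}

decode_boolean 01
```
]]></m:pre>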
		
		
		<h2>Multiple files concatenation</h2>
		
		<p>
			It should be possible to concatenate multiple files or streams of relational data as easily as we can concatenate multiple text files
			(given that such text files have the same character encoding, have no BOM at the beginning and have a newline at the end).
			If we can do:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[
(cat file-1.txt; echo "some additional middle data"; cat file-2.txt) | wc -l
]]></m:pre>
		
		<p>
			We should also be able to do:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[
(cat file-1.rp; relpipe-in-fstab; cat file-2.rp) | relpipe-out-xml
]]></m:pre>

		<p>
			Also, it should be possible to append (<code>&gt;&gt;</code>) new records to the last relation without modifying the already written data.
		</p>
		
		<h2>Work primarily with STDIO</h2>
		
		<p>
			The tools should work primarily and by default with the standard input and standard output (STDIN and STDOUT).
			Reading/writing from/to files or network should be (if present) a secondary and optional scenario.
		</p>
		
		<p>
			Standard error output (STDERR) should be used for errors/warnings/logs. If everything goes well, nothing should be written to it by default.
		</p>
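		
		<p>
			A sketch of a tool that follows these rules, written as a shell function.
			The name <code>to_upper</code> and its optional file argument are assumptions made for this illustration.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[
```shell
# Converts text to upper case. By default it works as a filter
# (STDIN to STDOUT); a file name is an optional, secondary input.
# Errors go to STDERR; nothing is written there on success.
to_upper() {
    if [ "$#" -eq 0 ]; then
        tr '[:lower:]' '[:upper:]'
    elif [ -r "$1" ]; then
        tr '[:lower:]' '[:upper:]' < "$1"
    else
        printf 'error: cannot read file: %s\n' "$1" >&2
        return 1
    fi
}

printf 'hello\n' | to_upper
```
]]></m:pre>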
		
		<h2>Tools might be TTY-aware</h2>
		
		<p>
			The input and output tools processing relational data might adapt their behaviour depending on whether their input or output, respectively, is a terminal (TTY).
		</p>
		<p>
			If the output is a TTY, it means that the output is displayed to the user, 
			so the tool might e.g. colorize its output or do some other human-friendly formatting,
			which makes no sense if the output is directed to a file or piped to another program.
			Example:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[
# This would print a table with fancy colors using ANSI sequences:
relpipe-in-fstab | relpipe-out-tabular
			
# This would store the same table in a file but without any colors:
relpipe-in-fstab | relpipe-out-tabular > table.txt]]></m:pre>
		
		<p>
			If the input is a TTY, it means that the user is typing the values.
			In such a situation, the tool might accept another input format (text, human-friendly) or use some default file location instead.
			Example:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[
# This would read the /etc/fstab (which is the default location):
relpipe-in-fstab | relpipe-out-tabular

# Those would read the /etc/mtab instead:
cat /etc/mtab | relpipe-in-fstab | relpipe-out-tabular
relpipe-in-fstab < /etc/mtab | relpipe-out-tabular]]></m:pre>

		<p>
			However, the behaviour should be modified only in a visual and predictable manner.
			For example, it should not switch from XML to YAML.
		</p>
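		
		<p>
			In a shell script, the TTY check itself can be done with the POSIX <code>test -t</code> operator.
			The following sketch uses two hypothetical functions, <code>print_status</code> and <code>read_input</code>; it does not describe how the <m:name/> tools are actually implemented.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[
```shell
# Prints a message, in green only if STDOUT (file descriptor 1) is a terminal.
print_status() {
    if [ -t 1 ]; then
        printf '\033[32m%s\033[0m\n' "$1"
    else
        printf '%s\n' "$1"
    fi
}

# Reads the default file (first argument) if STDIN (file descriptor 0)
# is a terminal; otherwise reads the data arriving on STDIN.
read_input() {
    if [ -t 0 ]; then
        cat "$1"
    else
        cat
    fi
}

print_status "done"
```
]]></m:pre>
		
		<p>
			The content stays the same in both cases; only the colours or the source of the data change.
		</p>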
		
		<h2>Use --long-options</h2>
		
		<p>
			Tools should accept arguments (if any) as <code>--long-options</code>.
			When looking at a script, it should be clear at first sight what it does.
			That would not be the case if cryptic short options like <code>-a -x -Z</code> were used.
			To save our keyboards, there are features like <em>Bash completion</em>.
		</p>
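		
		<p>
			A minimal sketch of parsing long options in a shell script.
			The option names <code>--relation</code> and <code>--attribute</code> and the function <code>parse_options</code> are made up for this example and do not belong to any particular tool.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[
```shell
# Parses two hypothetical long options, each taking one value.
# Anything else (including short options) is rejected.
parse_options() {
    relation=""
    attribute=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --relation)
                if [ "$#" -lt 2 ]; then
                    printf 'missing value for %s\n' "$1" >&2
                    return 1
                fi
                relation="$2"
                shift 2 ;;
            --attribute)
                if [ "$#" -lt 2 ]; then
                    printf 'missing value for %s\n' "$1" >&2
                    return 1
                fi
                attribute="$2"
                shift 2 ;;
            *)
                printf 'unsupported option: %s\n' "$1" >&2
                return 1 ;;
        esac
    done
    printf 'relation=%s attribute=%s\n' "$relation" "$attribute"
}

parse_options --relation fstab --attribute device
```
]]></m:pre>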
		
		
		<h2>Be exact and reliable</h2>
		
		<p>
			<m:name/> should convey data without corrupting or arbitrarily modifying them.
			Implementation details (e.g. how values are encoded in the stream) should not affect transferred data and the user.
		</p>
		
		<h2>Fail-fast, be strict</h2>
		
		<p>
			Because the relational data will be created by machines instead of being manually typed by erring humans,
			we should fail-fast on an error. We should be strict and require valid inputs only.
			Any error should be revealed as soon as possible and fixed.
		</p>
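		
		<p>
			Fail-fast behaviour can be sketched as a strict filter that stops at the first invalid record instead of guessing.
			The function <code>validate_integers</code> and its one-integer-per-line input are assumptions made only for this illustration.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[
```shell
# Passes through lines that are non-negative decimal integers.
# On the first invalid line it reports the error and stops.
validate_integers() {
    line_number=0
    while IFS= read -r value; do
        line_number=$((line_number + 1))
        case "$value" in
            ''|*[!0-9]*)
                printf 'invalid integer on line %s: %s\n' "$line_number" "$value" >&2
                return 1 ;;
        esac
        printf '%s\n' "$value"
    done
}

printf '10\n20\n' | validate_integers
```
]]></m:pre>
		
		<p>
			The non-zero exit status lets the caller detect the failure immediately.
		</p>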
		
		<p>
			There might be tools or options for recovering corrupted data (caused e.g. by a failing HDD, a faulty network or buggy software).
			But the recovery mode is not the default one.
		</p>
		
		<p>
			We demand reliable systems, not random and accidental behaviour caused by software guessing <em>What might these bytes probably mean?</em>
		</p>
		
		
		
		
		
		
	</text>

</stránka>