relpipe-data/examples-grep-cut-fstab.xml
author František Kučera <franta-hg@frantovo.cz>
Mon, 21 Feb 2022 01:21:22 +0100
branch v_0
changeset 330 70e7eb578cfa
parent 326 ab7f333f1225
permissions -rw-r--r--
Added tag relpipe-v0.18 for changeset 5bc2bb8b7946

<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Doing projection and restriction using cut and grep</nadpis>
	<perex>SELECT mount_point FROM fstab WHERE type IN ('btrfs', 'xfs')</perex>
	<m:pořadí-příkladu>01000</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			While reading classic pipelines involving the <code>grep</code> and <code>cut</code> commands,
			we may notice that they resemble simple SQL queries such as:
		</p>
		
		<m:pre jazyk="SQL">SELECT "some", "cut", "fields" FROM stdin WHERE grep_matches(whole_line);</m:pre>
		
		<p>
			And that is true: <code>grep</code> does restriction<m:podČarou>
				<a href="https://en.wikipedia.org/wiki/Selection_(relational_algebra)">selecting</a> only certain records from the original relation according to their match with given conditions</m:podČarou>
			and <code>cut</code> does projection<m:podČarou>a limited subset of what <a href="https://en.wikipedia.org/wiki/Projection_(relational_algebra)">projection</a> means</m:podČarou>.
			Now we can do these relational operations using our relational tools called <code>relpipe-tr-grep</code> and <code>relpipe-tr-cut</code>.
		</p>
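		
		<p>
			For comparison, the classic (non-relational) form of such a query can be sketched with plain <code>grep</code> and <code>cut</code>.
			The sample data and the <code>sample_fstab</code> helper are hypothetical, and the fields are separated by single spaces only to keep the classic tools simple:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Hypothetical sample data: device, mount point, type, options, dump, pass
sample_fstab() {
	printf '%s\n' \
		'/dev/sda1 / btrfs defaults 0 1' \
		'/dev/sda2 /home xfs defaults 0 2' \
		'/dev/sda3 /boot ext4 defaults 0 2'
}

# Restriction: grep keeps only lines whose third field is btrfs or xfs.
# Projection: cut keeps only the second field (the mount point).
sample_fstab \
	| grep -E '^[^ ]+ [^ ]+ (btrfs|xfs) ' \
	| cut -d ' ' -f 2]]></m:pre>
		
		<p>
			This prints <code>/</code> and <code>/home</code>.
			Note that the restriction has to describe the position of the field inside the whole line and the projection has to refer to the field by its number.
		</p>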
		
		<p>
			Assume that we need only the <code>mount_point</code> field from our <code>fstab</code> where <code>type</code> is <code>btrfs</code> or <code>xfs</code>
			and we want to do something (run a block of shell commands) with these directory paths.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[relpipe-in-fstab \
	| relpipe-tr-grep --relation 'fstab' --attribute 'type' --value '^(btrfs|xfs)$' \
	| relpipe-tr-cut --relation 'fstab' --attribute 'mount_point' \
	| relpipe-out-nullbyte \
	| while read -r -d '' m; do
		echo "$m";
	done]]></m:pre>
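	
		<p>
			When anchors are combined with alternation in the <code>--value</code> pattern, it is worth grouping the alternatives.
			In common regex dialects, <code>^btrfs|xfs$</code> means <q>starts with btrfs</q> or <q>ends with xfs</q>, while <code>^(btrfs|xfs)$</code> requires the whole value to be one of the two.
			The following sketch shows the difference with <code>grep -E</code>, which searches for a match anywhere in the line;
			whether <code>relpipe-tr-grep</code> searches or matches the whole value is not stated here, so treat this as an illustration of the regex itself.
			The sample values and the <code>types</code> helper are hypothetical:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Hypothetical type values; only the first two are wanted.
types() { printf '%s\n' 'btrfs' 'xfs' 'btrfs-legacy' 'myxfs'; }

# Ungrouped: "starts with btrfs" OR "ends with xfs", so all four lines match.
types | grep -E -c '^btrfs|xfs$'

# Grouped: the whole value must be btrfs or xfs, so only two lines match.
types | grep -E -c '^(btrfs|xfs)$']]></m:pre>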
	
		<p>
			The <code>relpipe-tr-cut</code> tool has a syntax similar to its <em>grep</em> and <em>sed</em> siblings and also uses the power of regular expressions.
			In this case it modifies the <code>fstab</code> relation on the fly and drops all its attributes except <code>mount_point</code>.
		</p>
		
		<p>
			Then we pass the data to the Bash <code>while</code> loop.
			In such a simple scenario (just <code>echo</code>), we could use <code>xargs</code> as in the previous examples,
			but with this syntax we can write a whole block of shell commands for each record/value and do more complex actions with them.
		</p>
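		
		<p>
			The loop can be tried without any relational data.
			The sketch below assumes, as the <code>read -d ''</code> usage suggests, that each value arrives terminated by a null byte;
			<code>printf</code> stands in for the pipeline, and the paths and the <code>print_each</code> helper are hypothetical.
			Setting <code>IFS=</code> for <code>read</code> additionally preserves leading and trailing whitespace in the values:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Reads NUL-terminated values from stdin and runs a block of commands for each.
print_each() {
	while IFS= read -r -d '' m; do
		# any block of commands can go here; "$m" holds one value
		printf 'mount point: [%s]\n' "$m"
	done
}

# Stand-in input: three values, one of them containing a space.
printf '%s\0' '/' '/home' '/mnt/my data' | print_each]]></m:pre>
		
		<p>
			Because the separator is a null byte, values containing spaces or even line breaks reach the loop body intact.
		</p>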
		
		<h2>More projections with relpipe-tr-cut</h2>
		
		<p>
			Assume that we have a simple relation containing numbers:
		</p>
	
		<m:pre jazyk="bash"><![CDATA[seq 0 8 \
	| tr \\n \\0 \
	| relpipe-in-cli generate-from-stdin numbers 3 a integer b integer c integer \
	> numbers.rp]]></m:pre>

		<p>and a second one containing letters:</p>

		<m:pre jazyk="bash"><![CDATA[relpipe-in-cli generate letters 2 a string b string A B C D > letters.rp]]></m:pre>

		<p>We saved them into two files and then combine them into a single file. We will work with them as if they were a single stream of relations:</p>
		
		<m:pre jazyk="bash"><![CDATA[cat numbers.rp letters.rp > both.rp;
cat both.rp | relpipe-out-tabular]]></m:pre>
		
		<p>This will print:</p>
		
		<pre><![CDATA[numbers:
 ╭─────────────┬─────────────┬─────────────╮
 │ a (integer) │ b (integer) │ c (integer) │
 ├─────────────┼─────────────┼─────────────┤
 │           0 │           1 │           2 │
 │           3 │           4 │           5 │
 │           6 │           7 │           8 │
 ╰─────────────┴─────────────┴─────────────╯
Record count: 3
letters:
 ╭─────────────┬─────────────╮
 │ a  (string) │ b  (string) │
 ├─────────────┼─────────────┤
 │ A           │ B           │
 │ C           │ D           │
 ╰─────────────┴─────────────╯
Record count: 2]]></pre>
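
		<p>
			To see how the nine values from <code>seq 0 8</code> end up in three records of three attributes each,
			the same null-separated stream can be grouped by three using only standard tools.
			This is just an illustration of the grouping, with <code>xargs</code> standing in for the relational input filter:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Nine NUL-separated values, printed three per line (one line per record).
seq 0 8 \
	| tr '\n' '\0' \
	| xargs -0 -n 3 echo]]></m:pre>
		
		<p>
			It prints the lines <code>0 1 2</code>, <code>3 4 5</code> and <code>6 7 8</code>, which correspond to the rows of the <code>numbers</code> relation.
		</p>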

		<p>We can drop the <code>a</code> attribute from the <code>numbers</code> relation:</p>
		
		<m:pre jazyk="bash">cat both.rp | relpipe-tr-cut --relation 'numbers' --attribute 'b|c' | relpipe-out-tabular</m:pre>
		
		<p>and leave the <code>letters</code> relation unaffected:</p>
		
		<pre><![CDATA[numbers:
 ╭─────────────┬─────────────╮
 │ b (integer) │ c (integer) │
 ├─────────────┼─────────────┤
 │           1 │           2 │
 │           4 │           5 │
 │           7 │           8 │
 ╰─────────────┴─────────────╯
Record count: 3
letters:
 ╭─────────────┬─────────────╮
 │ a  (string) │ b  (string) │
 ├─────────────┼─────────────┤
 │ A           │ B           │
 │ C           │ D           │
 ╰─────────────┴─────────────╯
Record count: 2]]></pre>

		<p>Or we can remove <code>a</code> from both relations, i.e. keep only the attributes whose names match the <code>'b|c'</code> regex:</p>

		<m:pre jazyk="bash">cat both.rp | relpipe-tr-cut --relation '.*' --attribute 'b|c' | relpipe-out-tabular</m:pre>
		
		<p>Instead of <code>'.*'</code> we could use <code>'numbers|letters'</code>, which in this case gives the same result:</p>
		
		<pre><![CDATA[numbers:
 ╭─────────────┬─────────────╮
 │ b (integer) │ c (integer) │
 ├─────────────┼─────────────┤
 │           1 │           2 │
 │           4 │           5 │
 │           7 │           8 │
 ╰─────────────┴─────────────╯
Record count: 3
letters:
 ╭─────────────╮
 │ b  (string) │
 ├─────────────┤
 │ B           │
 │ D           │
 ╰─────────────╯
Record count: 2]]></pre>

		<p>So far, we have only been reducing the attributes. But we can also duplicate them or change their order:</p>
		
		<m:pre jazyk="bash">cat both.rp \
	| relpipe-tr-cut --relation 'numbers' --attribute 'b|a|c' --attribute 'b' --attribute 'a' --attribute 'a' \
	| relpipe-out-tabular</m:pre>
		
		<p>
			Note that the order of alternatives inside <code>'b|a|c'</code> does not matter: the attributes matched by a single regex keep their original order.
			But if we use multiple regexes to specify attributes, their order and count do matter:
		</p>
		
		<pre><![CDATA[numbers:
 ╭─────────────┬─────────────┬─────────────┬─────────────┬─────────────┬─────────────╮
 │ a (integer) │ b (integer) │ c (integer) │ b (integer) │ a (integer) │ a (integer) │
 ├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤
 │           0 │           1 │           2 │           1 │           0 │           0 │
 │           3 │           4 │           5 │           4 │           3 │           3 │
 │           6 │           7 │           8 │           7 │           6 │           6 │
 ╰─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────╯
Record count: 3
letters:
 ╭─────────────┬─────────────╮
 │ a  (string) │ b  (string) │
 ├─────────────┼─────────────┤
 │ A           │ B           │
 │ C           │ D           │
 ╰─────────────┴─────────────╯
Record count: 2]]></pre>

		<p>
			The <code>letters</code> relation stays rock steady and <code>relpipe-tr-cut --relation 'numbers'</code> does not affect it in any way.
		</p>
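		
		<p>
			The selection rule described above can be sketched in a few lines of Bash.
			This is only a model of the described behaviour, not the actual implementation of <code>relpipe-tr-cut</code>;
			the variable names are hypothetical and matching against the whole attribute name is an assumption:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[attributes=(a b c)                 # original attribute order of the relation
patterns=('b|a|c' 'b' 'a' 'a')     # one entry per --attribute option, in the given order
selected=()

for pattern in "${patterns[@]}"; do
	re="^(${pattern})$"            # assumption: the regex must match the whole name
	for name in "${attributes[@]}"; do
		# within one pattern, matching attributes keep their original order
		if [[ "$name" =~ $re ]]; then
			selected+=("$name")
		fi
	done
done

echo "${selected[*]}"]]></m:pre>
		
		<p>
			It prints <code>a b c b a a</code>, the same sequence of attribute names as in the header of the <code>numbers</code> table above.
		</p>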
		
		<h2>Process CSV files</h2>
		
		<p>
			There are various input filters (<code>relpipe-in-*</code>); one of them is <code>relpipe-in-csv</code>,
			which converts CSV files to the relational format.
			Thus we can process standard CSV files in our relational pipelines
			and e.g. filter records that have a certain value in a certain column (<code>relpipe-tr-grep</code>)
			or keep only certain columns (<code>relpipe-tr-cut</code>).
		</p>
		
		<p>
			We may have a <code>tasks.csv</code> file containing TODOs and FIXMEs:
		</p>
		
		<pre><![CDATA["file","line","type","description"
".hg/shelve-backup/posix_mq.patch","97","TODO","support also other encodings."
".hg/shelve-backup/posix_mq.patch","163","TODO","support also other encodings."
"src/FileAttributeFinder.h","79","TODO","optional whitespace trimming or substring"
"src/FileAttributeFinder.h","80","TODO","custom encoding + read encoding from xattr"
"src/FileAttributeFinder.h","83","TODO","allow custom error value or fallback to HEX/Base64"
"streamlet-examples/streamlet-common.h","286","FIXME","correct error codes"
…]]></pre>

		<p>
			And we can process it using this pipeline:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[cat tasks.csv \
	| relpipe-in-csv \
	| relpipe-tr-grep --relation 'csv' --attribute 'type' --value 'FIXME' \
	| relpipe-tr-cut  --relation 'csv' --attribute 'file|description' \
	| relpipe-out-tabular]]></m:pre>
	
		<p>and get a result like this:</p>
	
		<pre><![CDATA[csv:
 ╭───────────────────────────────────────┬──────────────────────╮
 │ file                         (string) │ description (string) │
 ├───────────────────────────────────────┼──────────────────────┤
 │ streamlet-examples/streamlet-common.h │ correct error codes  │
 │ streamlet-examples/streamlet-common.h │ correct error codes  │
 │ streamlet-examples/Streamlet.java     │ correct error codes  │
 ╰───────────────────────────────────────┴──────────────────────╯
Record count: 3]]></pre>

	
		<p>
			We work with attribute (column) names, so there is no need to remember column numbers.
			And thanks to regular expressions we can write elegant and powerful filters.
		</p>
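		
		<p>
			For contrast, a naive approach with plain <code>cut</code> needs the column number and splits on every comma, including those inside quoted values.
			A small sketch with a hypothetical record whose description contains a comma:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Hypothetical CSV data; the description in the second line contains a comma.
printf '%s\n' \
	'"file","line","type","description"' \
	'"src/example.h","1","FIXME","fix parsing, then add tests"' \
	| cut -d ',' -f 4]]></m:pre>
		
		<p>
			The second output line is only <code>"fix parsing</code>: the description is truncated at the inner comma and the quotes are left in place.
		</p>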


		
	</text>

</stránka>