<stránka
xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
<nadpis>Doing projection and restriction using cut and grep</nadpis>
<perex>SELECT mount_point FROM fstab WHERE type IN ('btrfs', 'xfs')</perex>
<m:pořadí-příkladu>01000</m:pořadí-příkladu>
<text xmlns="http://www.w3.org/1999/xhtml">
<p>
While reading classic pipelines involving the <code>grep</code> and <code>cut</code> commands,
we may notice a similarity with simple SQL queries such as:
</p>
<m:pre jazyk="SQL">SELECT "some", "cut", "fields" FROM stdin WHERE grep_matches(whole_line);</m:pre>
<p>
And that is true: <code>grep</code> does restriction<m:podČarou>
<a href="https://en.wikipedia.org/wiki/Selection_(relational_algebra)">selecting</a> only certain records from the original relation according to their match with given conditions</m:podČarou>
and <code>cut</code> does projection<m:podČarou>a limited subset of what <a href="https://en.wikipedia.org/wiki/Projection_(relational_algebra)">projection</a> means</m:podČarou>.
Now we can do these relational operations using our relational tools called <code>relpipe-tr-grep</code> and <code>relpipe-tr-cut</code>.
</p>
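<p>
For comparison, the classic approach can be sketched as follows.
This is only an illustration: the colon-separated sample records and the function names are made up,
and the fields have to be addressed by position rather than by name.
</p>
<m:pre jazyk="bash"><![CDATA[# Hypothetical sample: name:type:size records, one per line.
sample_records() {
	printf '%s\n' 'root:btrfs:100' 'home:xfs:200' 'boot:ext4:1'
}

# Restriction: grep keeps only records whose second field is btrfs or xfs.
# Projection: cut keeps only the first and the third field.
classic_select() {
	sample_records \
		| grep -E '^[^:]*:(btrfs|xfs):' \
		| cut -d ':' -f 1,3
}

classic_select]]></m:pre>
<p>
Note that the <code>grep</code> pattern has to encode the position of the field inside the whole line,
which is exactly the bookkeeping that named attributes are meant to remove.
</p>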
<p>
Assume that we need only <code>mount_point</code> fields from our <code>fstab</code> where <code>type</code> is <code>btrfs</code> or <code>xfs</code>
and we want to do something (a shell script block) with these directory paths.
</p>
<m:pre jazyk="bash"><![CDATA[relpipe-in-fstab \
| relpipe-tr-grep --relation 'fstab' --attribute 'type' --value '^(btrfs|xfs)$' \
| relpipe-tr-cut --relation 'fstab' --attribute 'mount_point' \
| relpipe-out-nullbyte \
| while read -r -d '' m; do
echo "$m";
done]]></m:pre>
<p>
The <code>relpipe-tr-cut</code> tool has a syntax similar to its <em>grep</em> and <em>sed</em> siblings and also uses the power of regular expressions.
In this case, it modifies the <code>fstab</code> relation on the fly and drops all its attributes except <code>mount_point</code>.
</p>
<p>
Then we pass the data to a Bash <code>while</code> loop.
In such a simple scenario (just <code>echo</code>), we could use <code>xargs</code> as in the examples above,
but with this syntax, we can write a whole block of shell commands for each record/value and do more complex actions with them.
</p>
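<p>
The loop itself can be tried without any relational data.
In the following minimal sketch, <code>printf</code> merely stands in for the null-byte separated output;
the mount points and the function names are hypothetical.
The <code>IFS=</code> prefix is an optional addition that keeps leading and trailing whitespace in each value.
</p>
<m:pre jazyk="bash"><![CDATA[# Simulated input: three NUL-terminated values (hypothetical mount points).
list_mount_points() {
	printf '%s\0' '/' '/home' '/mnt/my data'
}

# Run a whole block of commands for each value.
describe_mount_points() {
	list_mount_points | while IFS= read -r -d '' m; do
		case "$m" in
			*' '*) note='contains a space' ;;
			*)     note='plain' ;;
		esac
		printf '[%s] %s\n' "$m" "$note"
	done
}

describe_mount_points]]></m:pre>
<p>
Because the values are separated by null bytes, a path containing spaces (or even newlines) still arrives in the loop as a single value.
</p>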
<h2>More projections with relpipe-tr-cut</h2>
<p>
Assume that we have a simple relation containing numbers:
</p>
<m:pre jazyk="bash"><![CDATA[seq 0 8 \
| tr \\n \\0 \
| relpipe-in-cli generate-from-stdin numbers 3 a integer b integer c integer \
> numbers.rp]]></m:pre>
<p>and a second one containing letters:</p>
<m:pre jazyk="bash"><![CDATA[relpipe-in-cli generate letters 2 a string b string A B C D > letters.rp]]></m:pre>
<p>We saved them into two files and then combined them into a single file. We will work with them as if they were a single stream of relations:</p>
<m:pre jazyk="bash"><![CDATA[cat numbers.rp letters.rp > both.rp;
cat both.rp | relpipe-out-tabular]]></m:pre>
<p>This prints:</p>
<pre><![CDATA[numbers:
╭─────────────┬─────────────┬─────────────╮
│ a (integer) │ b (integer) │ c (integer) │
├─────────────┼─────────────┼─────────────┤
│ 0 │ 1 │ 2 │
│ 3 │ 4 │ 5 │
│ 6 │ 7 │ 8 │
╰─────────────┴─────────────┴─────────────╯
Record count: 3
letters:
╭─────────────┬─────────────╮
│ a (string) │ b (string) │
├─────────────┼─────────────┤
│ A │ B │
│ C │ D │
╰─────────────┴─────────────╯
Record count: 2]]></pre>
<p>We can drop the <code>a</code> attribute from the <code>numbers</code> relation:</p>
<m:pre jazyk="bash">cat both.rp | relpipe-tr-cut --relation 'numbers' --attribute 'b|c' | relpipe-out-tabular</m:pre>
<p>and leave the <code>letters</code> relation unaffected:</p>
<pre><![CDATA[numbers:
╭─────────────┬─────────────╮
│ b (integer) │ c (integer) │
├─────────────┼─────────────┤
│ 1 │ 2 │
│ 4 │ 5 │
│ 7 │ 8 │
╰─────────────┴─────────────╯
Record count: 3
letters:
╭─────────────┬─────────────╮
│ a (string) │ b (string) │
├─────────────┼─────────────┤
│ A │ B │
│ C │ D │
╰─────────────┴─────────────╯
Record count: 2]]></pre>
<p>Or we can remove <code>a</code> from both relations, i.e. keep only the attributes whose names match the <code>'b|c'</code> regex:</p>
<m:pre jazyk="bash">cat both.rp | relpipe-tr-cut --relation '.*' --attribute 'b|c' | relpipe-out-tabular</m:pre>
<p>Instead of <code>'.*'</code> we could use <code>'numbers|letters'</code>; in this case, it gives the same result:</p>
<pre><![CDATA[numbers:
╭─────────────┬─────────────╮
│ b (integer) │ c (integer) │
├─────────────┼─────────────┤
│ 1 │ 2 │
│ 4 │ 5 │
│ 7 │ 8 │
╰─────────────┴─────────────╯
Record count: 3
letters:
╭─────────────╮
│ b (string) │
├─────────────┤
│ B │
│ D │
╰─────────────╯
Record count: 2]]></pre>
<p>So far, we have only been reducing the set of attributes. But we can also duplicate them or change their order:</p>
<m:pre jazyk="bash">cat both.rp \
| relpipe-tr-cut --relation 'numbers' --attribute 'b|a|c' --attribute 'b' --attribute 'a' --attribute 'a' \
| relpipe-out-tabular</m:pre>
<p>
Note that the order inside <code>'b|a|c'</code> does not matter: when such a regex matches, the original order of the attributes is preserved.
But if we use multiple regexes to specify the attributes, their order and count do matter:
</p>
<pre><![CDATA[numbers:
╭─────────────┬─────────────┬─────────────┬─────────────┬─────────────┬─────────────╮
│ a (integer) │ b (integer) │ c (integer) │ b (integer) │ a (integer) │ a (integer) │
├─────────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤
│ 0 │ 1 │ 2 │ 1 │ 0 │ 0 │
│ 3 │ 4 │ 5 │ 4 │ 3 │ 3 │
│ 6 │ 7 │ 8 │ 7 │ 6 │ 6 │
╰─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────╯
Record count: 3
letters:
╭─────────────┬─────────────╮
│ a (string) │ b (string) │
├─────────────┼─────────────┤
│ A │ B │
│ C │ D │
╰─────────────┴─────────────╯
Record count: 2]]></pre>
<p>
The <code>letters</code> relation stays rock steady and <code>relpipe-tr-cut --relation 'numbers'</code> does not affect it in any way.
</p>
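<p>
For comparison with the classic tools: <code>cut</code> writes fields in their input order regardless of how the field list is ordered,
so reordering or repeating columns is usually delegated to <code>awk</code>.
A minimal sketch on a tab-separated copy of the numbers data (the function names are hypothetical):
</p>
<m:pre jazyk="bash"><![CDATA[# Hypothetical sample: the numbers data as tab-separated text, three values per line.
numbers_tsv() {
	printf '%s\t%s\t%s\n' 0 1 2 3 4 5 6 7 8
}

# cut ignores the order of the list: fields 1 and 2 come out in input order.
cut_fields() {
	numbers_tsv | cut -f 2,1
}

# awk can reorder and repeat fields: second, first, first.
awk_fields() {
	numbers_tsv | awk -F '\t' -v OFS='\t' '{ print $2, $1, $1 }'
}

cut_fields
awk_fields]]></m:pre>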
<h2>Process CSV files</h2>
<p>
There are various input filters (<code>relpipe-in-*</code>); one of them is <code>relpipe-in-csv</code>,
which converts CSV files to the relational format.
Thus we can process standard CSV files in our relational pipelines
and e.g. keep only the records that have a certain value in a certain column (<code>relpipe-tr-grep</code>)
or keep only certain columns (<code>relpipe-tr-cut</code>).
</p>
<p>
We may have a <code>tasks.csv</code> file containing TODOs and FIXMEs:
</p>
<pre><![CDATA["file","line","type","description"
".hg/shelve-backup/posix_mq.patch","97","TODO","support also other encodings."
".hg/shelve-backup/posix_mq.patch","163","TODO","support also other encodings."
"src/FileAttributeFinder.h","79","TODO","optional whitespace trimming or substring"
"src/FileAttributeFinder.h","80","TODO","custom encoding + read encoding from xattr"
"src/FileAttributeFinder.h","83","TODO","allow custom error value or fallback to HEX/Base64"
"streamlet-examples/streamlet-common.h","286","FIXME","correct error codes"
…]]></pre>
<p>
And we can process it using this pipeline:
</p>
<m:pre jazyk="bash"><![CDATA[cat tasks.csv \
| relpipe-in-csv \
| relpipe-tr-grep --relation 'csv' --attribute 'type' --value 'FIXME' \
| relpipe-tr-cut --relation 'csv' --attribute 'file|description' \
| relpipe-out-tabular]]></m:pre>
<p>and get a result like this:</p>
<pre><![CDATA[csv:
╭───────────────────────────────────────┬──────────────────────╮
│ file (string) │ description (string) │
├───────────────────────────────────────┼──────────────────────┤
│ streamlet-examples/streamlet-common.h │ correct error codes │
│ streamlet-examples/streamlet-common.h │ correct error codes │
│ streamlet-examples/Streamlet.java │ correct error codes │
╰───────────────────────────────────────┴──────────────────────╯
Record count: 3]]></pre>
<p>
We work with attribute (column) names, so there is no need to remember column numbers.
And thanks to regular expressions we can write elegant and powerful filters.
</p>
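<p>
To experiment with this, the six sample rows listed above can be saved to a scratch file.
The following sketch is an assumption-laden illustration: the scratch directory and the variable name are arbitrary,
the filter was changed to select TODOs and the <code>file</code> and <code>line</code> columns,
and the pipeline runs only if the <code>relpipe-in-csv</code> command is found.
</p>
<m:pre jazyk="bash"><![CDATA[# Recreate the sample rows shown above in a scratch directory.
workdir="$(mktemp -d)"
cat > "$workdir/tasks.csv" <<'EOF'
"file","line","type","description"
".hg/shelve-backup/posix_mq.patch","97","TODO","support also other encodings."
".hg/shelve-backup/posix_mq.patch","163","TODO","support also other encodings."
"src/FileAttributeFinder.h","79","TODO","optional whitespace trimming or substring"
"src/FileAttributeFinder.h","80","TODO","custom encoding + read encoding from xattr"
"src/FileAttributeFinder.h","83","TODO","allow custom error value or fallback to HEX/Base64"
"streamlet-examples/streamlet-common.h","286","FIXME","correct error codes"
EOF

# Run the pipeline only if the relpipe tools appear to be installed.
if command -v relpipe-in-csv >/dev/null 2>&1; then
	cat "$workdir/tasks.csv" \
		| relpipe-in-csv \
		| relpipe-tr-grep --relation 'csv' --attribute 'type' --value 'TODO' \
		| relpipe-tr-cut --relation 'csv' --attribute 'file|line' \
		| relpipe-out-tabular
else
	echo "relpipe tools not found; sample saved in $workdir/tasks.csv" >&2
fi]]></m:pre>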
</text>
</stránka>