relpipe-data/examples-awk-filtering.xml
author František Kučera <franta-hg@frantovo.cz>
Thu, 01 Aug 2019 11:59:39 +0200
branch v_0
changeset 266 862a1d97e74b
parent 258 2868d772c27e
permissions -rw-r--r--
add the Big picture diagram

<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Complex filtering with AWK</nadpis>
	<perex>filtering records with AND, OR and functions</perex>
	<m:pořadí-příkladu>02100</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">

		<p>
			If we need more complex filtering than <code>relpipe-tr-grep</code> can offer, we can write an AWK transformation.
			Then we can use AND and OR operators and functions like regular expression matching or numerical formulas.
		</p>
		
		<p>
			The <code>relpipe-tr-awk</code> tool calls a real AWK program (usually GNU AWK) installed on our system and passes the data of the given relation to it.
			Thus we can use any AWK feature in our pipeline while processing relational data.
			Relational attributes are mapped to AWK variables, so we can reference them by their names instead of mere field numbers.
		</p>
		
		<p>
			The <code>--for-each</code> option is used for both filtering (instead of <code>--where</code>) 
			and arbitrary code execution (for data modifications, adding records, computations or intentional side effects).
			In AWK, filtering conditions are surrounded by <code>(…)</code> and actions by <code>{…}</code>.
			Both can be combined, and multiple expressions can be separated by a semicolon (<code>;</code>).
			The <code>record()</code> function should be called instead of the AWK <code>print</code> statement (which should never be used directly).
			Calling <code>record()</code> is not necessary when only filtering is done (and there are no data modifications).
		</p>
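
		<p>
			The condition and action rules are standard AWK, so they can be tried on plain text before being used in a relational pipeline.
			The following is only a sketch: it assumes a POSIX <code>awk</code> and made-up colon-separated sample data,
			and it uses <code>print</code> where code written for <code>relpipe-tr-awk</code> would call <code>record()</code>.
		</p>

		<m:pre jazyk="bash"><![CDATA[# Hypothetical sample data in the form name:size (not produced by any relpipe tool)
data='a.txt:100
b.cpp:2880
c.h:5264'

# Condition only: matching lines pass through unchanged, no explicit output call is needed.
filtered=$(printf '%s\n' "$data" | awk -F: '($2 > 2000)')
echo "$filtered"

# Condition and action: matching lines are modified and must be emitted explicitly.
# The two statements inside the action are separated by a semicolon.
modified=$(printf '%s\n' "$data" | awk -F: -v OFS=: '($2 > 2000) { $1 = toupper($1); print }')
echo "$modified"]]></m:pre>

		<p>
			The first command keeps the two entries whose size exceeds 2000;
			the second emits the same entries with their names converted to upper case.
		</p>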
		
		<h2>Filtering numbers</h2>
		
		<p>With AWK we can filter records using standard numeric comparison operators such as <code>==</code>, <code>&lt;</code>, <code>&gt;</code> and <code>&gt;=</code>.</p>
		
		<m:pre jazyk="bash"><![CDATA[find -print0 | relpipe-in-filesystem \
	| relpipe-tr-awk \
		--relation '.*' \
			--for-each '(size > 2000)' \
	| relpipe-out-tabular]]></m:pre>
	
		<p>The output then lists only the files whose <code>size</code> is greater than 2000:</p>
		
		<pre><![CDATA[filesystem:
 ╭──────────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
 │ path        (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
 ├──────────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
 │ ./relpipe-tr-awk.cpp │ f             │           2880 │ hacker         │ hacker         │
 │ ./CLIParser.h        │ f             │           5264 │ hacker         │ hacker         │
 │ ./AwkHandler.h       │ f             │          17382 │ hacker         │ hacker         │
 ╰──────────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 3]]></pre>
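
		<p>
			Since the condition is an ordinary AWK expression, it may also contain arithmetic.
			One way to check such an expression in isolation is to evaluate it in plain <code>awk</code> with the attribute supplied through <code>-v</code>.
			This is a sketch under assumptions: the <code>matches</code> helper and the threshold are hypothetical, and <code>size</code> is assumed to be in bytes.
		</p>

		<m:pre jazyk="bash"><![CDATA[# Condition as it might be passed to --for-each (assumption: size is in bytes)
condition='(size / 1024 >= 5)'

# Hypothetical helper: exits with 0 if the condition holds for the given size.
matches() {
	awk -v size="$1" "BEGIN { exit ($condition) ? 0 : 1 }"
}

# Sizes taken from the listing above
result=""
for s in 2880 5264 17382; do
	if matches "$s"; then result="$result $s"; fi
done
echo "matching sizes:$result"]]></m:pre>

		<p>Of the three sizes, only 5264 and 17382 satisfy the condition.</p>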


		<h2>Filtering strings</h2>
		
		<p>String values can be matched against a regular expression:</p>
		
		<m:pre jazyk="bash"><![CDATA[relpipe-in-fstab \
	| relpipe-tr-awk \
		--relation '.*' \
			--for-each '(mount_point ~ /cdrom/)' \
	| relpipe-out-tabular]]></m:pre>
	
		<p>For example, this selects the <code>fstab</code> records having <code>cdrom</code> in the <code>mount_point</code>:</p>
	
		<pre><![CDATA[fstab:
 ╭─────────────────┬─────────────────┬──────────────────────┬───────────────┬──────────────────┬────────────────┬────────────────╮
 │ scheme (string) │ device (string) │ mount_point (string) │ type (string) │ options (string) │ dump (integer) │ pass (integer) │
 ├─────────────────┼─────────────────┼──────────────────────┼───────────────┼──────────────────┼────────────────┼────────────────┤
 │                 │ /dev/sr0        │ /media/cdrom0        │ udf,iso9660   │ user,noauto      │              0 │              0 │
 ╰─────────────────┴─────────────────┴──────────────────────┴───────────────┴──────────────────┴────────────────┴────────────────╯
Record count: 1]]></pre>

		<p>Case-insensitive search can be switched on by adding:</p>
		
		<pre>--define IGNORECASE integer 1</pre>
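
		<p>
			<code>IGNORECASE</code> is specific to GNU AWK.
			Where another AWK implementation might be in use, a portable alternative is to normalize the value with <code>tolower()</code> inside the condition.
			The sketch below tries such a condition in plain <code>awk</code>; the sample mount points are hypothetical.
		</p>

		<m:pre jazyk="bash"><![CDATA[# Hypothetical mount points, one per line
points='/media/CDROM0
/home
/mnt/cdrom1'

# Assign each line to a named variable, then apply the portable condition.
# tolower() is part of POSIX AWK, so no GNU extension is needed.
found=$(printf '%s\n' "$points" \
	| awk '{ mount_point = $0 } (tolower(mount_point) ~ /cdrom/)')
echo "$found"]]></m:pre>

		<p>Both <code>/media/CDROM0</code> and <code>/mnt/cdrom1</code> match, regardless of case.</p>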
		
		<h2>AND and OR</h2>
		
		<p>We can combine multiple conditions using <code>||</code> and <code>&amp;&amp;</code> logical operators:</p>
		
		<m:pre jazyk="bash"><![CDATA[relpipe-in-fstab \
	| relpipe-tr-awk \
		--relation '.*' \
			--for-each '(type == "btrfs" || pass == 1)' \
	| relpipe-out-tabular]]></m:pre>
	
		<p>and build arbitrarily complex filters:</p>
	
		<pre><![CDATA[fstab:
 ╭─────────────────┬──────────────────────────────────────┬──────────────────────┬───────────────┬───────────────────────────────────────┬────────────────┬────────────────╮
 │ scheme (string) │ device                      (string) │ mount_point (string) │ type (string) │ options                      (string) │ dump (integer) │ pass (integer) │
 ├─────────────────┼──────────────────────────────────────┼──────────────────────┼───────────────┼───────────────────────────────────────┼────────────────┼────────────────┤
 │ UUID            │ 29758270-fd25-4a6c-a7bb-9a18302816af │ /                    │ ext4          │ relatime,user_xattr,errors=remount-ro │              0 │              1 │
 │ UUID            │ a2b5f230-a795-4f6f-a39b-9b57686c86d5 │ /home                │ btrfs         │ relatime                              │              0 │              2 │
 ╰─────────────────┴──────────────────────────────────────┴──────────────────────┴───────────────┴───────────────────────────────────────┴────────────────┴────────────────╯
Record count: 2]]></pre>

		<p>Nested <code>(…)</code> work as expected.</p>
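
		<p>
			Grouping matters because in AWK <code>&amp;&amp;</code> binds more tightly than <code>||</code>.
			The following sketch compares a grouped and an ungrouped condition in plain <code>awk</code>;
			the sample rows are hypothetical (the <code>btrfs 0</code> row is made up to show the difference).
		</p>

		<m:pre jazyk="bash"><![CDATA[# Hypothetical rows in the form "type pass"
rows='ext4 1
btrfs 2
btrfs 0
udf,iso9660 0'

# Nested parentheses: either type is accepted, but pass must always be positive.
grouped=$(printf '%s\n' "$rows" \
	| awk '{ type = $1; pass = $2 } ((type == "btrfs" || type == "ext4") && pass > 0)')

# Without the inner parentheses, "pass > 0" applies to ext4 only.
ungrouped=$(printf '%s\n' "$rows" \
	| awk '{ type = $1; pass = $2 } (type == "btrfs" || type == "ext4" && pass > 0)')

echo "grouped:";   echo "$grouped"
echo "ungrouped:"; echo "$ungrouped"]]></m:pre>

		<p>The grouped condition rejects the <code>btrfs 0</code> row, while the ungrouped one lets it through.</p>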

		<p>
			And AWK can do much more – it offers plenty of functions and language constructs that we can use in our transformations.
			Comprehensive documentation can be found in <a href="https://www.gnu.org/software/gawk/manual/">Gawk: Effective AWK Programming</a>.
		</p>
		
	</text>

</stránka>