relpipe-data/examples-guile-projections.xml
author František Kučera <franta-hg@frantovo.cz>
Mon, 21 Feb 2022 01:21:22 +0100
branchv_0
changeset 330 70e7eb578cfa
parent 316 d7ae02390fac
permissions -rw-r--r--
Added tag relpipe-v0.18 for changeset 5bc2bb8b7946

<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Doing projections with Scheme</nadpis>
	<perex>modifying attribute values and adding new attributes or removing them</perex>
	<m:pořadí-příkladu>01500</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			The <code>relpipe-tr-scheme</code> can not only filter records,
			but can also modify them and even modify the structure of the relation – add or remove attributes.
			
		</p>
		
		<h2>Sample data</h2>
		
		<p>We have some CSV file:</p>
		
		<m:pre jazyk="text" src="examples/guile-1.csv"/>
		
		<p>and we convert it to a relation called <code>n</code>:</p>
		
		<m:pre jazyk="bash"><![CDATA[cat guile-1.csv \
	| relpipe-in-csv n id integer name string a integer b integer c integer \
	| relpipe-out-tabular]]></m:pre>
		
		<p>which printed as a table looks like this:</p>

		<m:pre jazyk="text"><![CDATA[n:
 ╭──────────────┬───────────────┬─────────────┬─────────────┬─────────────╮
 │ id (integer) │ name (string) │ a (integer) │ b (integer) │ c (integer) │
 ├──────────────┼───────────────┼─────────────┼─────────────┼─────────────┤
 │            1 │ first         │           1 │           2 │           3 │
 │            2 │ second        │           2 │          10 │        1024 │
 │            3 │ third         │           4 │           4 │          16 │
 ╰──────────────┴───────────────┴─────────────┴─────────────┴─────────────╯
Record count: 3]]></m:pre>

		<p>
			Because it is annoying to write some code again and again, we will create a shell function and (re)use it later:
		</p>

		<m:pre jazyk="bash"><![CDATA[sample-data() {
	cat guile-1.csv \
		| relpipe-in-csv n id integer name string a integer b integer c integer;
}]]></m:pre>

		<p>
			Another option is storing the relational data in a file and then reading this file.
			Files are better option, if the transformation is costly and we do not need live/fresh data.
		</p>

		<h2>Modifying attribute values</h2>

		<p>
			Then, we can modify such relation using Scheme – e.g. we can make the <code>name</code> uppercase and increase <code>id</code> by 1000:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[sample-data \
	| relpipe-tr-scheme \
		--relation n \
		--for-each '(set! $name (string-upcase $name) ) (set! $id (+ $id 1000) )' \
	| relpipe-out-tabular]]></m:pre>
	
		<p>So we have:</p>

		<m:pre jazyk="text"><![CDATA[n:
 ╭──────────────┬───────────────┬─────────────┬─────────────┬─────────────╮
 │ id (integer) │ name (string) │ a (integer) │ b (integer) │ c (integer) │
 ├──────────────┼───────────────┼─────────────┼─────────────┼─────────────┤
 │         1001 │ FIRST         │           1 │           2 │           3 │
 │         1002 │ SECOND        │           2 │          10 │        1024 │
 │         1003 │ THIRD         │           4 │           4 │          16 │
 ╰──────────────┴───────────────┴─────────────┴─────────────┴─────────────╯
Record count: 3]]></m:pre>


		<h2>Removing attributes</h2>
	
		<p>
			The relation on the output might have different structure that the relation on the input.
			We can keep only some of the original attributes:
		</p>
	
		<m:pre jazyk="bash"><![CDATA[sample-data \
	| relpipe-tr-scheme \
		--relation n \
		--for-each '(set! $name (string-upcase $name) ) (set! $id (+ $id 1000) )' \
		--output-attribute 'id' integer \
		--output-attribute 'name' string \
	| relpipe-out-tabular]]></m:pre>
	
		<p>and have:</p>
	
		<m:pre jazyk="text"><![CDATA[n:
 ╭──────────────┬───────────────╮
 │ id (integer) │ name (string) │
 ├──────────────┼───────────────┤
 │         1001 │ FIRST         │
 │         1002 │ SECOND        │
 │         1003 │ THIRD         │
 ╰──────────────┴───────────────╯
Record count: 3]]></m:pre>

		<h2>Adding attributes</h2>
		
		<p>
			If we do not want to completely redefine the structure of the relation,
			we can keep all original attributes and just add definitions of some others:			
		</p>
		
		<m:pre jazyk="bash"><![CDATA[sample-data \
	| relpipe-tr-scheme \
		--relation n \
		--for-each '(define $sum (+ $a $b $c) )' \
		--output-attribute 'sum' integer \
		--input-attributes-prepend \
	| relpipe-out-tabular]]></m:pre>

		<p>so we have a completely new attribute containing the sum of <code>a</code>, <code>b</code> and <code>c</code>:</p>

		<m:pre jazyk="text"><![CDATA[n:
 ╭──────────────┬───────────────┬─────────────┬─────────────┬─────────────┬───────────────╮
 │ id (integer) │ name (string) │ a (integer) │ b (integer) │ c (integer) │ sum (integer) │
 ├──────────────┼───────────────┼─────────────┼─────────────┼─────────────┼───────────────┤
 │            1 │ first         │           1 │           2 │           3 │             6 │
 │            2 │ second        │           2 │          10 │        1024 │          1036 │
 │            3 │ third         │           4 │           4 │          16 │            24 │
 ╰──────────────┴───────────────┴─────────────┴─────────────┴─────────────┴───────────────╯
Record count: 3]]></m:pre>

		<p>
			We can change the attribute order by using <code>--input-attributes-append</code>
			instead of <code>--input-attributes-prepend</code>.
		</p>
		
		<h2>Changing the attribute type</h2>
		
		<p>
			Each attribute has a data type (integer, string…).
			And we can change the type. Of course we have to modify the data, because we can not put e.g. string value into an integer attribute.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[sample-data \
	| relpipe-tr-scheme \
		--relation n \
		--for-each '(define $id (string-upcase $name) )' \
		--output-attribute 'id' string \
		--output-attribute 'a' integer \
		--output-attribute 'b' integer \
		--output-attribute 'c' integer \
	| relpipe-out-tabular]]></m:pre>
	
		<p>
			The code above changed the type of <code>id</code> attribute from integer to string
			and put uppercase <code>name</code> into it:
		</p>
	
		<m:pre jazyk="text"><![CDATA[n:
 ╭─────────────┬─────────────┬─────────────┬─────────────╮
 │ id (string) │ a (integer) │ b (integer) │ c (integer) │
 ├─────────────┼─────────────┼─────────────┼─────────────┤
 │ FIRST       │           1 │           2 │           3 │
 │ SECOND      │           2 │          10 │        1024 │
 │ THIRD       │           4 │           4 │          16 │
 ╰─────────────┴─────────────┴─────────────┴─────────────╯
Record count: 3]]></m:pre>


		<h2>Projection and restriction</h2>
		
		<p>
			We can do projection and restriction at the same time, during the same transformation:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[sample-data \
	| relpipe-tr-scheme \
		--relation n \
		--for-each '(set! $name (string-upcase $name) ) (set! $id (+ $id 1000) )' \
		--output-attribute 'id' integer \
		--output-attribute 'name' string \
		--where '(= $c (* $a $b) )' \
	| relpipe-out-tabular]]></m:pre>

		<p>and have:</p>

		<m:pre jazyk="bash"><![CDATA[n:
 ╭──────────────┬───────────────╮
 │ id (integer) │ name (string) │
 ├──────────────┼───────────────┤
 │         1003 │ THIRD         │
 ╰──────────────┴───────────────╯
Record count: 1]]></m:pre>

		<p>
			And if we use <code>expt</code> instead of <code>*</code>, we will get SECOND instead of THIRD.
		</p>
		
		<p>The example above has its SQL equivalent:</p>
		
		<m:pre jazyk="sql"><![CDATA[SELECT
	id + 1000 AS id,
	upper(name) AS name
FROM n
WHERE c = (a * b);]]></m:pre>

		<p>
			The difference is that <m:name/> do not require data to be stored anywhere,
			because we (by default) process streams on the fly.
			Thus one process can generate data, second one can transform them and the third one can convert them to some output format.
			All processes are running at the same time and without need to cache all data at once.
		</p>
		
	</text>

</stránka>