<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

	<nadpis>Running SQL JOINs on multiple CSV files</nadpis>
	<perex>query a collection of (not only) CSV files using SQL</perex>
	<m:pořadí-příkladu>05100</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">

		<p>
			CSV (<m:a href="4180" typ="rfc">RFC 4180</m:a>) is quite a good solution when we want to store or share relational data in a simple text format –
			both human-readable and well supported by many existing applications and libraries.
			There are even ready-to-use GUI editors, so-called spreadsheets, e.g. LibreOffice Calc.
			(On the other hand, such simple formats usually have some drawbacks…)
		</p>
		<p>
			In this example, we will show how to query a set of CSV files as if it were a relational database.
		</p>

		<p>Suppose we have a CSV file describing our network interfaces:</p>
		<m:pre jazyk="text"><![CDATA[address,name
00:00:00:00:00:00,lo
00:D0:D8:00:26:00,eth0
00:01:02:01:33:70,eth1]]></m:pre>


		<p>and another CSV file with IP addresses assigned to them:</p>
		<m:pre jazyk="text"><![CDATA[address,mask,version,interface
127.0.0.1,8,4,lo
::1,128,6,lo
192.168.1.2,24,4,eth0
192.168.1.8,24,4,eth0
10.21.12.24,24,4,eth0
75.748.86.91,95,4,eth1
23.75.345.200,95,4,eth1
2a01:430:2e::cafe:babe,64,6,eth1]]></m:pre>


		<h2>Loading a CSV file and running basic queries</h2>

		<p>
			The simplest task is to parse the file and print it as a table in our terminal, or to convert it to another format (XML, Recfile, ODS, YAML, XHTML, ASN.1 etc.).
			We can also add <code>relpipe-tr-sql</code> in the middle of our pipeline and run some SQL queries –
			transform the data on-the-fly and send the query result to <code>relpipe-out-tabular</code> (or another output filter) in place of the original data.
			For now, we will filter just the IPv6 addresses:
		</p>
		<m:pre jazyk="bash"><![CDATA[cat ip.csv \
	| relpipe-in-csv --relation 'ip' \
	| relpipe-tr-sql \
		--relation 'ipv6' "SELECT * FROM ip WHERE version = 6" \
	| relpipe-out-tabular]]></m:pre>
		<p>and get them printed:</p>
		<m:pre jazyk="text"><![CDATA[ipv6:
 ╭────────────────────────┬───────────────┬──────────────────┬────────────────────╮
 │ address       (string) │ mask (string) │ version (string) │ interface (string) │
 ├────────────────────────┼───────────────┼──────────────────┼────────────────────┤
 │ ::1                    │ 128           │ 6                │ lo                 │
 │ 2a01:430:2e::cafe:babe │ 64            │ 6                │ eth1               │
 ╰────────────────────────┴───────────────┴──────────────────┴────────────────────╯
Record count: 2]]></m:pre>

		<p>
			It is also possible to run several queries at once,
			and thanks to the <m:name/> format, the result sets are not mixed together – their boundaries are retained and everything is safely passed to the next stage of the pipeline:
		</p>
		<m:pre jazyk="bash"><![CDATA[cat ip.csv \
	| relpipe-in-csv --relation 'ip' \
	| relpipe-tr-sql \
		--relation 'ipv4' "SELECT * FROM ip WHERE version = 4" \
		--relation 'ipv6' "SELECT * FROM ip WHERE version = 6" \
	| relpipe-out-tabular]]></m:pre>
		<p>resulting in two nice tables:</p>
		<m:pre jazyk="text"><![CDATA[ipv4:
 ╭──────────────────┬───────────────┬──────────────────┬────────────────────╮
 │ address (string) │ mask (string) │ version (string) │ interface (string) │
 ├──────────────────┼───────────────┼──────────────────┼────────────────────┤
 │ 127.0.0.1        │ 8             │ 4                │ lo                 │
 │ 192.168.1.2      │ 24            │ 4                │ eth0               │
 │ 192.168.1.8      │ 24            │ 4                │ eth0               │
 │ 10.21.12.24      │ 24            │ 4                │ eth0               │
 │ 75.748.86.91     │ 95            │ 4                │ eth1               │
 │ 23.75.345.200    │ 95            │ 4                │ eth1               │
 ╰──────────────────┴───────────────┴──────────────────┴────────────────────╯
Record count: 6

ipv6:
 ╭────────────────────────┬───────────────┬──────────────────┬────────────────────╮
 │ address       (string) │ mask (string) │ version (string) │ interface (string) │
 ├────────────────────────┼───────────────┼──────────────────┼────────────────────┤
 │ ::1                    │ 128           │ 6                │ lo                 │
 │ 2a01:430:2e::cafe:babe │ 64            │ 6                │ eth1               │
 ╰────────────────────────┴───────────────┴──────────────────┴────────────────────╯
Record count: 2]]></m:pre>

		<h2>Using parametrized queries to avoid SQL injection</h2>
		<p>
			When <code>"4"</code> and <code>"6"</code> are not fixed values, we should not glue them into the query string like <code>version = $version</code>,
			because that is a dangerous practice which may lead to SQL injection.
			Parametrized queries exist for exactly such tasks:
		</p>
		<m:pre jazyk="bash"><![CDATA[--relation 'ipv6' "SELECT * FROM ip WHERE version = ?" --parameter "6"]]></m:pre>
		
		
		<h2>Running SQL JOINs, UNIONs etc. on multiple CSV files</h2>
		
		<p>
			To load multiple CSV files into our <i>in-memory database</i>, we just concatenate the relational streams
			using the means of our shell – semicolons and parentheses:
		</p>
		<m:pre jazyk="bash"><![CDATA[(relpipe-in-csv --relation 'ip' < ip.csv; relpipe-in-csv --relation 'nic' < nic.csv) \
	| relpipe-tr-sql \
		--relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface" \
	| relpipe-out-tabular]]></m:pre>

		<p>A generic version that loads all <code>*.csv</code> files:</p>
		<m:pre jazyk="bash"><![CDATA[for csv in *.csv; do relpipe-in-csv --relation "$(basename "$csv" .csv)" < "$csv"; done \
	| relpipe-tr-sql \
		--relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface" \
	| relpipe-out-tabular]]></m:pre>
		
		<p>Then we can JOIN data from multiple CSV files or do UNIONs, INTERSECTs etc. – see the UNION sketch after the result table below.</p>
		<m:pre jazyk="text"><![CDATA[ip_nic:
 ╭────────────────────────┬───────────────┬──────────────────┬────────────────────┬───────────────────┬───────────────╮
 │ address       (string) │ mask (string) │ version (string) │ interface (string) │ address  (string) │ name (string) │
 ├────────────────────────┼───────────────┼──────────────────┼────────────────────┼───────────────────┼───────────────┤
 │ 127.0.0.1              │ 8             │ 4                │ lo                 │ 00:00:00:00:00:00 │ lo            │
 │ ::1                    │ 128           │ 6                │ lo                 │ 00:00:00:00:00:00 │ lo            │
 │ 192.168.1.2            │ 24            │ 4                │ eth0               │ 00:D0:D8:00:26:00 │ eth0          │
 │ 192.168.1.8            │ 24            │ 4                │ eth0               │ 00:D0:D8:00:26:00 │ eth0          │
 │ 10.21.12.24            │ 24            │ 4                │ eth0               │ 00:D0:D8:00:26:00 │ eth0          │
 │ 75.748.86.91           │ 95            │ 4                │ eth1               │ 00:01:02:01:33:70 │ eth1          │
 │ 23.75.345.200          │ 95            │ 4                │ eth1               │ 00:01:02:01:33:70 │ eth1          │
 │ 2a01:430:2e::cafe:babe │ 64            │ 6                │ eth1               │ 00:01:02:01:33:70 │ eth1          │
 ╰────────────────────────┴───────────────┴──────────────────┴────────────────────┴───────────────────┴───────────────╯
Record count: 8]]></m:pre>
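
		<p>
			And as a sketch of a set operation on the same data: the following UNION collects all interface names
			occurring in either of the two files (relation and column names as in the examples above):
		</p>
		<m:pre jazyk="bash"><![CDATA[for csv in *.csv; do relpipe-in-csv --relation "$(basename "$csv" .csv)" < "$csv"; done \
	| relpipe-tr-sql \
		--relation 'all_names' "SELECT name FROM nic UNION SELECT interface FROM ip" \
	| relpipe-out-tabular]]></m:pre>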


		<h2>Leveraging shell functions</h2>
		
		<p>
			It is good practice to wrap common code blocks into functions and thus make them reusable.
			In shell, functions still work with input and output streams, so we can use them when building our pipelines.
			Shell functions can be seen as named, reusable parts of a pipeline.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[csv2relation()  { for file; do relpipe-in-csv --relation "$(basename "$file" .csv)" < "$file"; done }
do_query()      { relpipe-tr-sql --relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface"; }
format_result() { [[ -t 1 ]] && relpipe-out-tabular || cat; }

csv2relation *.csv | do_query | format_result]]></m:pre>

		<p>
			The <code>format_result()</code> function checks whether the STDOUT is a terminal or not:
			when printing to the terminal, it formats the data as a table;
			when writing to a regular file or to the STDIN of another process, it passes the original relational data through.
			Thus <code>./our-script.sh</code> will print a nice table in the terminal, while <code>./our-script.sh > data.rp</code> will create a file containing machine-readable data,
			<code>./our-script.sh | relpipe-out-xhtml > report.xhtml</code> will create an XHTML report and <code>./our-script.sh | relpipe-out-gui</code> will show a GUI window full of tables and maybe also charts.
		</p>
		
		<m:img src="img/csv-sql-gui-ip-address-counts.png"/>
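
		<p>
			The chart above summarizes how many addresses of each IP version are assigned to each network interface.
			Such a summary can be computed by an aggregation query like this one:
		</p>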
		
		<m:pre jazyk="sql"><![CDATA[SELECT
	nic.name || ' IPv' || ip.version AS label,
	nic.name AS interface,
	ip.version AS ip_version,
	count(*) AS address_count
FROM nic 
	LEFT JOIN ip ON (ip.interface = nic.name)
GROUP BY nic.name, ip.version
ORDER BY count(*) DESC]]></m:pre>
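
		<p>
			A sketch of a pipeline that feeds this query to <code>relpipe-out-gui</code> –
			reusing the <code>csv2relation</code> function from above and assuming the query is stored
			in a (hypothetical) <code>address-counts.sql</code> file:
		</p>
		<m:pre jazyk="bash"><![CDATA[csv2relation *.csv \
	| relpipe-tr-sql \
		--relation 'address_counts' "$(cat address-counts.sql)" \
	| relpipe-out-gui]]></m:pre>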

		
		<h2>Makefile version</h2>

		<p>
			Shell scripts are not the only way to structure and organize our pipelines or, generally, our data-processing code.
			We can also use Make (the tool intended mainly for building software), write a <i>Makefile</i> and organize our code around temporary files and other targets instead of functions.
		</p>

		<m:pre jazyk="Makefile"><![CDATA[all: print_summary

.PHONY: clean print_summary run_commands

clean:
	rm -rf *.rp

%.rp: %.csv
	relpipe-in-csv --relation "$(basename $(<))" < $(<) > $(@)

define SQL_IP_NIC
	SELECT
		ip.address AS ip_address,
		nic.name AS interface,
		nic.address AS mac_address
	FROM ip
		JOIN nic ON (nic.name = ip.interface)
endef
export SQL_IP_NIC

define SQL_COUNT_VERSIONS
	SELECT
		interface,
		count(CASE WHEN version=4 THEN 1 ELSE NULL END) AS ipv4_count,
		count(CASE WHEN version=6 THEN 1 ELSE NULL END) AS ipv6_count
	FROM ip
	GROUP BY interface
	ORDER BY interface
endef
export SQL_COUNT_VERSIONS

# Longer SQL queries are better kept in separate .sql files,
# because we can enjoy syntax highlighting and other support in our editors.
# Then we use it like this: --relation "ip_nic" "$$(cat ip_nic.sql)"

summary.rp: nic.rp ip.rp
	cat $(^) \
		| relpipe-tr-sql \
			--relation "ip_nic" "$$SQL_IP_NIC" \
			--relation "counts" "$$SQL_COUNT_VERSIONS" \
		> $(@)

print_summary: summary.rp
	cat $(<) | relpipe-out-tabular
]]></m:pre>

	<p>
		We can even combine the advantages of Make and Bash together (without calling or including Bash scripts from Make)
		and have reusable shell functions available in the Makefile:
	</p>

<m:pre jazyk="text"><![CDATA[
SHELL=bash
BASH_FUNC_read_nullbyte%%=() { local IFS=; for v in "$$@"; do export "$$v"; read -r -d '' "$$v"; done }
export BASH_FUNC_read_nullbyte%%]]></m:pre>
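
	<p>
		This trick works because Bash (since the post-Shellshock releases) encodes exported functions
		in environment variables named <code>BASH_FUNC_name%%</code> whose value starts with <code>() {</code> –
		so Make just has to export a variable with such a name.
	</p>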

	<p>usage example:</p>

<m:pre jazyk="Makefile"><![CDATA[
run_commands: summary.rp
	cat $(<) \
		| relpipe-tr-cut --relation 'ip_nic' --invert-match relation true \
		| relpipe-out-nullbyte \
		| while read_nullbyte ip_address interface mac_address; do\
			echo "network interface $$interface ($$mac_address) has IP address $$ip_address"; \
		done;
]]></m:pre>

		<p>
			Both approaches – the shell script and the Makefile – have their pros and cons.
			With a Makefile, we usually create temporary files containing intermediate results.
			That prevents streaming, but on the other hand, we only process (parse, transform, filter, format etc.) the data that have changed.
		</p>
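
		<p>
			For illustration (with the Makefile above): after a change to a single source file,
			only the affected targets are rebuilt on the next run:
		</p>
		<m:pre jazyk="bash"><![CDATA[touch ip.csv        # simulate a change in one of the source files
make print_summary  # re-creates only ip.rp and summary.rp; nic.rp is reused]]></m:pre>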


	</text>

</stránka>