--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-csv-sql-join.xml Mon Feb 21 00:43:11 2022 +0100
@@ -0,0 +1,258 @@
+<stránka
+ xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+ xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+
+ <nadpis>Running SQL JOINs on multiple CSV files</nadpis>
+ <perex>query a collection of (not only) CSV files using SQL</perex>
+ <m:pořadí-příkladu>05100</m:pořadí-příkladu>
+
+ <text xmlns="http://www.w3.org/1999/xhtml">
+
+ <p>
+			CSV (<m:a href="4180" typ="rfc">RFC 4180</m:a>) is quite a good solution when we want to store or share relational data in a simple text format –
+			it is both human-readable and well supported by many existing applications and libraries.
+			There are even ready-to-use GUI editors, the so-called spreadsheets, e.g. LibreOffice Calc.
+			(On the other hand, such simple formats usually have some drawbacks…)
+ </p>
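+		<p>
+			Quoting is one such drawback: according to RFC 4180, a field containing a comma, a double quote or a line break
+			must be enclosed in double quotes, and embedded quotes must be doubled.
+			A small illustration (the field values are invented):
+		</p>
+		<m:pre jazyk="bash"><![CDATA[# RFC 4180 quoting: commas, quotes and newlines require the field
+# to be enclosed in double quotes; a literal " is written as "".
+printf '%s\n' 'name,comment' 'eth0,"primary, ""trusted"" link"']]></m:pre>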
+ <p>
+			In this example, we will show how to query a set of CSV files as if it were a relational database.
+ </p>
+
+ <p>Suppose we have a CSV file describing our network interfaces:</p>
+ <m:pre jazyk="text"><![CDATA[address,name
+00:00:00:00:00:00,lo
+00:D0:D8:00:26:00,eth0
+00:01:02:01:33:70,eth1]]></m:pre>
+
+
+ <p>and another CSV file with IP addresses assigned to them:</p>
+ <m:pre jazyk="text"><![CDATA[address,mask,version,interface
+127.0.0.1,8,4,lo
+::1,128,6,lo
+192.168.1.2,24,4,eth0
+192.168.1.8,24,4,eth0
+10.21.12.24,24,4,eth0
+75.748.86.91,95,4,eth1
+23.75.345.200,95,4,eth1
+2a01:430:2e::cafe:babe,64,6,eth1]]></m:pre>
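+
+		<p>
+			To follow along, the two sample files can be created e.g. with here-documents
+			(the file names <code>nic.csv</code> and <code>ip.csv</code> match those used in the commands below):
+		</p>
+		<m:pre jazyk="bash"><![CDATA[# Save the two sample data sets as CSV files in the current directory:
+cat > nic.csv <<'EOF'
+address,name
+00:00:00:00:00:00,lo
+00:D0:D8:00:26:00,eth0
+00:01:02:01:33:70,eth1
+EOF
+cat > ip.csv <<'EOF'
+address,mask,version,interface
+127.0.0.1,8,4,lo
+::1,128,6,lo
+192.168.1.2,24,4,eth0
+192.168.1.8,24,4,eth0
+10.21.12.24,24,4,eth0
+75.748.86.91,95,4,eth1
+23.75.345.200,95,4,eth1
+2a01:430:2e::cafe:babe,64,6,eth1
+EOF]]></m:pre>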
+
+
+ <h2>Loading a CSV file and running basic queries</h2>
+
+		<p>
+			The simplest task is to parse the file and print it as a table in our terminal, or to convert it to another format (XML, Recfile, ODS, YAML, XHTML, ASN.1 etc.).
+			We can also add <code>relpipe-tr-sql</code> in the middle of our pipeline and run some SQL queries –
+			transform the data on-the-fly and send the query result to <code>relpipe-out-tabular</code> (or another output filter) in place of the original data.
+			For now, we will select just the IPv6 addresses:
+		</p>
+ <m:pre jazyk="bash"><![CDATA[cat ip.csv \
+ | relpipe-in-csv --relation 'ip' \
+ | relpipe-tr-sql \
+ --relation 'ipv6' "SELECT * FROM ip WHERE version = 6" \
+ | relpipe-out-tabular]]></m:pre>
+ <p>and get them printed:</p>
+ <m:pre jazyk="text"><![CDATA[ipv6:
+ ╭────────────────────────┬───────────────┬──────────────────┬────────────────────╮
+ │ address (string) │ mask (string) │ version (string) │ interface (string) │
+ ├────────────────────────┼───────────────┼──────────────────┼────────────────────┤
+ │ ::1 │ 128 │ 6 │ lo │
+ │ 2a01:430:2e::cafe:babe │ 64 │ 6 │ eth1 │
+ ╰────────────────────────┴───────────────┴──────────────────┴────────────────────╯
+Record count: 2]]></m:pre>
+
+		<p>
+			It is also possible to run several queries at once,
+			and thanks to the <m:name/> format, the result sets are not mixed together: their boundaries are retained and everything is safely passed to the next stage of the pipeline:
+		</p>
+ <m:pre jazyk="bash"><![CDATA[cat ip.csv \
+ | relpipe-in-csv --relation 'ip' \
+ | relpipe-tr-sql \
+ --relation 'ipv4' "SELECT * FROM ip WHERE version = 4" \
+ --relation 'ipv6' "SELECT * FROM ip WHERE version = 6" \
+ | relpipe-out-tabular]]></m:pre>
+ <p>resulting in two nice tables:</p>
+ <m:pre jazyk="text"><![CDATA[ipv4:
+ ╭──────────────────┬───────────────┬──────────────────┬────────────────────╮
+ │ address (string) │ mask (string) │ version (string) │ interface (string) │
+ ├──────────────────┼───────────────┼──────────────────┼────────────────────┤
+ │ 127.0.0.1 │ 8 │ 4 │ lo │
+ │ 192.168.1.2 │ 24 │ 4 │ eth0 │
+ │ 192.168.1.8 │ 24 │ 4 │ eth0 │
+ │ 10.21.12.24 │ 24 │ 4 │ eth0 │
+ │ 75.748.86.91 │ 95 │ 4 │ eth1 │
+ │ 23.75.345.200 │ 95 │ 4 │ eth1 │
+ ╰──────────────────┴───────────────┴──────────────────┴────────────────────╯
+Record count: 6
+
+ipv6:
+ ╭────────────────────────┬───────────────┬──────────────────┬────────────────────╮
+ │ address (string) │ mask (string) │ version (string) │ interface (string) │
+ ├────────────────────────┼───────────────┼──────────────────┼────────────────────┤
+ │ ::1 │ 128 │ 6 │ lo │
+ │ 2a01:430:2e::cafe:babe │ 64 │ 6 │ eth1 │
+ ╰────────────────────────┴───────────────┴──────────────────┴────────────────────╯
+Record count: 2]]></m:pre>
+
+ <h2>Using parametrized queries to avoid SQL injection</h2>
+		<p>
+			When <code>"4"</code> and <code>"6"</code> are not fixed values, we should not glue them into the query string like <code>version = $version</code>,
+			because this dangerous practice may lead to SQL injection.
+			Parametrized queries exist for exactly such tasks:
+		</p>
+ <m:pre jazyk="bash"><![CDATA[--relation 'ipv6' "SELECT * FROM ip WHERE version = ?" --parameter "6"]]></m:pre>
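+		<p>
+			To see why interpolation is dangerous: if the value is attacker-controlled, it becomes part of the SQL text itself,
+			while a parametrized query keeps the text fixed and sends the value separately.
+			A sketch of the difference (the <code>$version</code> value is invented):
+		</p>
+		<m:pre jazyk="bash"><![CDATA[version='6 OR 1=1'                               # hypothetical malicious input
+echo "SELECT * FROM ip WHERE version = $version" # the condition is now always true
+echo "SELECT * FROM ip WHERE version = ?"        # the query text never changes]]></m:pre>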
+
+
+ <h2>Running SQL JOINs, UNIONs etc. on multiple CSV files</h2>
+
+		<p>
+			To load multiple CSV files into our <i>in-memory database</i>, we just concatenate the relational streams
+			using the means of our shell – semicolons and parentheses:
+		</p>
+ <m:pre jazyk="bash"><![CDATA[(relpipe-in-csv --relation 'ip' < ip.csv; relpipe-in-csv --relation 'nic' < nic.csv) \
+ | relpipe-tr-sql \
+ --relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface" \
+ | relpipe-out-tabular]]></m:pre>
+
+ <p>Generic version that loads all <code>*.csv</code> files:</p>
+ <m:pre jazyk="bash"><![CDATA[for csv in *.csv; do relpipe-in-csv --relation "$(basename "$csv" .csv)" < "$csv"; done \
+ | relpipe-tr-sql \
+ --relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface" \
+ | relpipe-out-tabular]]></m:pre>
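+		<p>
+			The relation names are derived from the file names – <code>basename</code> strips the directory part
+			and the <code>.csv</code> suffix, so e.g. <code>ip.csv</code> becomes the relation <code>ip</code>:
+		</p>
+		<m:pre jazyk="bash"><![CDATA[# basename "$file" .csv removes both the directory and the suffix
+for csv in ./ip.csv ./nic.csv; do
+	echo "$(basename "$csv" .csv)"
+done]]></m:pre>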
+
+		<p>Then we can JOIN data from multiple CSV files or do UNIONs, INTERSECTions etc. The JOIN above produces:</p>
+ <m:pre jazyk="text"><![CDATA[ip_nic:
+ ╭────────────────────────┬───────────────┬──────────────────┬────────────────────┬───────────────────┬───────────────╮
+ │ address (string) │ mask (string) │ version (string) │ interface (string) │ address (string) │ name (string) │
+ ├────────────────────────┼───────────────┼──────────────────┼────────────────────┼───────────────────┼───────────────┤
+ │ 127.0.0.1 │ 8 │ 4 │ lo │ 00:00:00:00:00:00 │ lo │
+ │ ::1 │ 128 │ 6 │ lo │ 00:00:00:00:00:00 │ lo │
+ │ 192.168.1.2 │ 24 │ 4 │ eth0 │ 00:D0:D8:00:26:00 │ eth0 │
+ │ 192.168.1.8 │ 24 │ 4 │ eth0 │ 00:D0:D8:00:26:00 │ eth0 │
+ │ 10.21.12.24 │ 24 │ 4 │ eth0 │ 00:D0:D8:00:26:00 │ eth0 │
+ │ 75.748.86.91 │ 95 │ 4 │ eth1 │ 00:01:02:01:33:70 │ eth1 │
+ │ 23.75.345.200 │ 95 │ 4 │ eth1 │ 00:01:02:01:33:70 │ eth1 │
+ │ 2a01:430:2e::cafe:babe │ 64 │ 6 │ eth1 │ 00:01:02:01:33:70 │ eth1 │
+ ╰────────────────────────┴───────────────┴──────────────────┴────────────────────┴───────────────────┴───────────────╯
+Record count: 8]]></m:pre>
+
+
+ <h2>Leveraging shell functions</h2>
+
+		<p>
+			It is good practice to wrap common code blocks in functions and thus make them reusable.
+			In shell, a function still works with input and output streams, so we can use functions when building our pipelines.
+			Shell functions can be seen as named, reusable parts of a pipeline.
+		</p>
+
+ <m:pre jazyk="bash"><![CDATA[csv2relation() { for file; do relpipe-in-csv --relation "$(basename "$file" .csv)" < "$file"; done }
+do_query() { relpipe-tr-sql --relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface"; }
+format_result() { [[ -t 1 ]] && relpipe-out-tabular || cat; }
+
+csv2relation *.csv | do_query | format_result]]></m:pre>
+
+		<p>
+			The <code>format_result()</code> function checks whether STDOUT is a terminal:
+			when printing to a terminal, it renders a table;
+			when writing to a regular file or to the STDIN of another process, it passes the original relational data through.
+			Thus <code>./our-script.sh</code> will print a nice table in the terminal, while <code>./our-script.sh > data.rp</code> will create a file containing machine-readable data,
+			<code>./our-script.sh | relpipe-out-xhtml > report.xhtml</code> will create an XHTML report and <code>./our-script.sh | relpipe-out-gui</code> will show a GUI window full of tables and maybe also charts.
+		</p>
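+		<p>
+			The terminal check itself can be tried without any relpipe tools –
+			in this sketch the table rendering is replaced by a plain message:
+		</p>
+		<m:pre jazyk="bash"><![CDATA[# [[ -t 1 ]] succeeds only when file descriptor 1 is a terminal
+format_result() { [[ -t 1 ]] && echo "(a table would be rendered here)" || cat; }
+
+# When this snippet is piped or redirected, the data pass through unchanged:
+echo 'raw relational data' | format_result]]></m:pre>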
+
+		<m:img src="img/csv-sql-gui-ip-address-counts.png"/>
+
+		<p>A chart like the one above can be fed by a query such as:</p>
+		<m:pre jazyk="sql"><![CDATA[SELECT
+ nic.name || ' IPv' || ip.version AS label,
+ nic.name AS interface,
+ ip.version AS ip_version,
+ count(*) AS address_count
+FROM nic
+ LEFT JOIN ip ON (ip.interface = nic.name)
+GROUP BY nic.name, ip.version
+ORDER BY count(*) DESC]]></m:pre>
+
+
+ <h2>Makefile version</h2>
+
+		<p>
+			Shell scripts are not the only way to structure and organize our pipelines or, generally, our data-processing code.
+			We can also use Make (a tool intended mainly for building software), write a <i>Makefile</i> and organize our code around temporary files and other targets instead of functions.
+		</p>
+
+ <m:pre jazyk="Makefile"><![CDATA[all: print_summary
+
+.PHONY: clean print_summary run_commands
+
+clean:
+ rm -rf *.rp
+
+%.rp: %.csv
+ relpipe-in-csv --relation "$(basename $(<))" < $(<) > $(@)
+
+define SQL_IP_NIC
+ SELECT
+ ip.address AS ip_address,
+ nic.name AS interface,
+ nic.address AS mac_address
+ FROM ip
+ JOIN nic ON (nic.name = ip.interface)
+endef
+export SQL_IP_NIC
+
+define SQL_COUNT_VERSIONS
+ SELECT
+ interface,
+ count(CASE WHEN version=4 THEN 1 ELSE NULL END) AS ipv4_count,
+ count(CASE WHEN version=6 THEN 1 ELSE NULL END) AS ipv6_count
+ FROM ip
+ GROUP BY interface
+ ORDER BY interface
+endef
+export SQL_COUNT_VERSIONS
+
+# Longer SQL queries are better kept in separate .sql files,
+# because we can enjoy syntax highlighting and other support in our editors.
+# Then we use it like this: --relation "ip_nic" "$$(cat ip_nic.sql)"
+
+summary.rp: nic.rp ip.rp
+ cat $(^) \
+ | relpipe-tr-sql \
+ --relation "ip_nic" "$$SQL_IP_NIC" \
+ --relation "counts" "$$SQL_COUNT_VERSIONS" \
+ > $(@)
+
+print_summary: summary.rp
+ cat $(<) | relpipe-out-tabular
+]]></m:pre>
+
+ <p>
+ We can even combine advantages of Make and Bash together (without calling or including Bash scripts from Make)
+ and have reusable shell functions available in the Makefile:
+ </p>
+
+<m:pre jazyk="text"><![CDATA[
+SHELL=bash
+BASH_FUNC_read_nullbyte%%=() { local IFS=; for v in "$$@"; do export "$$v"; read -r -d '' "$$v"; done }
+export BASH_FUNC_read_nullbyte%%]]></m:pre>
+
+ <p>usage example:</p>
+
+<m:pre jazyk="Makefile"><![CDATA[
+run_commands: summary.rp
+ cat $(<) \
+ | relpipe-tr-cut --relation 'ip_nic' --invert-match relation true \
+ | relpipe-out-nullbyte \
+ | while read_nullbyte ip_address interface mac_address; do\
+ echo "network interface $$interface ($$mac_address) has IP address $$ip_address"; \
+ done;
+]]></m:pre>
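+
+		<p>
+			The <code>read_nullbyte</code> function can also be tried directly in the shell, outside Make
+			(the <code>export</code> is omitted in this simplified sketch; the sample values are taken from the tables above):
+		</p>
+		<m:pre jazyk="bash"><![CDATA[# Reads null-delimited fields into the named variables, one field per variable
+read_nullbyte() { local IFS=; for v in "$@"; do read -r -d '' "$v"; done; }
+
+printf '%s\0' '192.168.1.2' 'eth0' '00:D0:D8:00:26:00' \
+	| { read_nullbyte ip_address interface mac_address;
+	    echo "network interface $interface ($mac_address) has IP address $ip_address"; }
+# prints: network interface eth0 (00:D0:D8:00:26:00) has IP address 192.168.1.2]]></m:pre>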
+
+		<p>
+			Both approaches – the shell script and the Makefile – have their pros and cons.
+			With a Makefile, we usually create temporary files containing intermediate results,
+			which prevents streaming. On the other hand, we then process (parse, transform, filter, format etc.) only the data that have changed.
+		</p>
+
+
+ </text>
+
+</stránka>