diff -r cc60c8dd7924 -r 5bc2bb8b7946 relpipe-data/examples-csv-sql-join.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-csv-sql-join.xml	Mon Feb 21 00:43:11 2022 +0100
@@ -0,0 +1,258 @@

Running SQL JOINs on multiple CSV files
query a collection of (not only) CSV files using SQL
05100

CSV (RFC 4180) is quite a good solution when we want to store or share relational data in a simple text format that is both human-readable and well supported by many existing applications and libraries. There are even ready-to-use GUI editors, the so-called spreadsheets, e.g. LibreOffice Calc. (On the other hand, such simple formats usually have some drawbacks…)

+

In this example, we will show how to query a set of CSV files as if they were a relational database.

+ +

Suppose we have a CSV file describing our network interfaces:

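For instance a file like this (illustrative data – the name and address columns are the ones referenced by the JOIN query used later):

name,address
eth0,00:1b:44:11:3a:b7
wlan0,00:1b:44:11:3a:b8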

and another CSV file with IP addresses assigned to them:

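Again illustrative – the address, interface and version columns are the ones referenced by the queries in this example:

address,interface,version
192.168.1.2,eth0,4
fe80::1b44:11ff:fe3a:b70,eth0,6
10.0.0.15,wlan0,4
fe80::1b44:11ff:fe3a:b80,wlan0,6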

Loading a CSV file and running basic queries


The simplest task is to parse the file and print it as a table in our terminal, or to convert it to another format (XML, Recfile, ODS, YAML, XHTML, ASN.1 etc.). We can also add relpipe-tr-sql in the middle of our pipeline and run some SQL queries – transform the data on-the-fly and send the query result to relpipe-out-tabular (or another output filter) in place of the original data. For now, we will filter just the IPv6 addresses:

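A sketch of such a pipeline (we assume here that relpipe-in-csv takes the relation name as an argument; relpipe-tr-sql takes --relation name/query pairs, as also seen in the Makefile version below):

cat ip.csv \
	| relpipe-in-csv "ip" \
	| relpipe-tr-sql --relation "ipv6" "SELECT * FROM ip WHERE version = 6" \
	| relpipe-out-tabular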

and get them printed as a nicely formatted table.


It is also possible to run several queries at once, and thanks to the format, the result sets are not mixed together – their boundaries are retained and everything is safely passed to the next stage of the pipeline:

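For example (each --relation option adds one more result set to the output stream):

cat ip.csv \
	| relpipe-in-csv "ip" \
	| relpipe-tr-sql \
		--relation "ipv4" "SELECT * FROM ip WHERE version = 4" \
		--relation "ipv6" "SELECT * FROM ip WHERE version = 6" \
	| relpipe-out-tabular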

resulting in two nice tables printed one after another.


Using parametrized queries to avoid SQL injection

+

When "4" and "6" are not fixed values, we should not glue them into the query string like version = $version, because that is a dangerous practice which may lead to SQL injection. We have parametrized queries for such tasks:

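A sketch – we assume here that relpipe-tr-sql binds positional ? placeholders from --parameter options (please check relpipe-tr-sql --help for the exact syntax in your version):

version="6"

cat ip.csv \
	| relpipe-in-csv "ip" \
	| relpipe-tr-sql \
		--relation "filtered" "SELECT * FROM ip WHERE version = ?" \
		--parameter "$version" \
	| relpipe-out-tabular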

Running SQL JOINs, UNIONs etc. on multiple CSV files


To load multiple CSV files into our in-memory database, we just concatenate the relational streams using the means of our shell – semicolons and parentheses:

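With the two files from above, it may look like this:

(relpipe-in-csv "nic" < nic.csv; relpipe-in-csv "ip" < ip.csv) \
	| relpipe-out-tabular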

Generic version that loads all *.csv files:

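One possible loop (assuming the file names without the .csv suffix are valid relation names):

for csv in *.csv; do
	relpipe-in-csv "$(basename "$csv" .csv)" < "$csv"
done | relpipe-out-tabular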

Then we can JOIN data from multiple CSV files or do UNIONs, INTERSECTions etc.

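For example a JOIN – the query below is the same one that appears in the Makefile version later in this example:

(relpipe-in-csv "nic" < nic.csv; relpipe-in-csv "ip" < ip.csv) \
	| relpipe-tr-sql \
		--relation "ip_nic" "
			SELECT
				ip.address  AS ip_address,
				nic.name    AS interface,
				nic.address AS mac_address
			FROM ip
				JOIN nic ON (nic.name = ip.interface)" \
	| relpipe-out-tabular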

Leveraging shell functions


It is good practice to wrap common blocks of code in functions and thus make them reusable. In shell, a function still works with input and output streams, so we can use functions when building our pipelines. Shell functions can be seen as named, reusable parts of a pipeline.

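A sketch of such a script – format_result() is described below; the other function names are just illustrative:

#!/bin/bash

read_data() {
	relpipe-in-csv "nic" < nic.csv
	relpipe-in-csv "ip" < ip.csv
}

filter_data() {
	relpipe-tr-sql --relation "ipv6" "SELECT * FROM ip WHERE version = 6"
}

format_result() {
	# a table for humans on a terminal, raw relational data otherwise
	if [[ -t 1 ]]; then relpipe-out-tabular; else cat; fi
}

read_data | filter_data | format_result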

The format_result() function checks whether STDOUT is a terminal or not: when printing to the terminal, it generates a table; when writing to a regular file or to the STDIN of another process, it passes the original relational data through. Thus ./our-script.sh will print a nice table in the terminal, while ./our-script.sh > data.rp will create a file containing machine-readable data, ./our-script.sh | relpipe-out-xhtml > report.xhtml will create an XHTML report and ./our-script.sh | relpipe-out-gui will show a GUI window full of tables and maybe also charts.

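The invocations from the previous paragraph, side by side:

./our-script.sh                                      # a nice table in the terminal
./our-script.sh > data.rp                            # machine-readable data in a file
./our-script.sh | relpipe-out-xhtml > report.xhtml   # an XHTML report
./our-script.sh | relpipe-out-gui                    # a GUI full of tables and charts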

Makefile version


Shell scripts are not the only way to structure and organize our pipelines or, generally, our data-processing code. We can also use Make (the tool intended mainly for building software), write a Makefile and organize our code around temporary files and other targets instead of functions.

# read the CSV files and store them as relational data
nic.rp: nic.csv
	relpipe-in-csv "nic" < $(<) > $(@)

ip.rp: ip.csv
	relpipe-in-csv "ip" < $(<) > $(@)

define SQL_IP_NIC
	SELECT
		ip.address  AS ip_address,
		nic.name    AS interface,
		nic.address AS mac_address
	FROM ip
		JOIN nic ON (nic.name = ip.interface)
endef
export SQL_IP_NIC

define SQL_COUNT_VERSIONS
	SELECT
		interface,
		count(CASE WHEN version=4 THEN 1 ELSE NULL END) AS ipv4_count,
		count(CASE WHEN version=6 THEN 1 ELSE NULL END) AS ipv6_count
	FROM ip
	GROUP BY interface
	ORDER BY interface
endef
export SQL_COUNT_VERSIONS

# Longer SQL queries are better kept in separate .sql files,
# because we can enjoy syntax highlighting and other support in our editors.
# Then we use it like this: --relation "ip_nic" "$$(cat ip_nic.sql)"

summary.rp: nic.rp ip.rp
	cat $(^) \
		| relpipe-tr-sql \
			--relation "ip_nic" "$$SQL_IP_NIC" \
			--relation "counts" "$$SQL_COUNT_VERSIONS" \
		> $(@)

print_summary: summary.rp
	cat $(<) | relpipe-out-tabular

We can even combine the advantages of Make and Bash together (without calling or including Bash scripts from Make) and have reusable shell functions available in the Makefile:

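One possible way (a sketch): export the shell functions as an ordinary multi-line variable – the same mechanism used for the SQL queries above – and eval it at the beginning of a recipe. The print_summary target then reuses format_result() from the shell-function section:

SHELL=bash

define SHELL_FUNCTIONS
format_result() {
	if [[ -t 1 ]]; then relpipe-out-tabular; else cat; fi
}
endef
export SHELL_FUNCTIONS

print_summary: summary.rp
	eval "$$SHELL_FUNCTIONS"; cat $(<) | format_result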

usage example:

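Assuming the Makefile sits next to our CSV files:

make print_summary        # builds nic.rp, ip.rp and summary.rp, prints the table
touch ip.csv
make print_summary        # only ip.rp and summary.rp are rebuilt, nic.rp is reused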

Both approaches – the shell script and the Makefile – have pros and cons. With a Makefile, we usually create temporary files containing intermediate results, which prevents streaming. On the other hand, we then process (parse, transform, filter, format etc.) only the data that have changed.

+ +