<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

	<nadpis>Running SQL JOINs on multiple CSV files</nadpis>
	<perex>query a collection of (not only) CSV files using SQL</perex>
	<m:pořadí-příkladu>05100</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">

		<p>
			CSV (<m:a href="4180" typ="rfc">RFC 4180</m:a>) is quite a good solution when we want to store or share relational data in a simple text format –
			both human-readable and well supported by many existing applications and libraries.
			We even have ready-to-use GUI editors, the so-called spreadsheets, e.g. LibreOffice Calc.
			(On the other hand, such simple formats usually have some drawbacks…)
		</p>
		<p>
			In this example, we will show how to query a set of CSV files as if it were a relational database.
		</p>

		<p>Suppose we have a CSV file describing our network interfaces:</p>
		<m:pre jazyk="text"><![CDATA[address,name
00:00:00:00:00:00,lo
00:D0:D8:00:26:00,eth0
00:01:02:01:33:70,eth1]]></m:pre>


		<p>and another CSV file with IP addresses assigned to them:</p>
		<m:pre jazyk="text"><![CDATA[address,mask,version,interface
127.0.0.1,8,4,lo
::1,128,6,lo
192.168.1.2,24,4,eth0
192.168.1.8,24,4,eth0
10.21.12.24,24,4,eth0
75.748.86.91,95,4,eth1
23.75.345.200,95,4,eth1
2a01:430:2e::cafe:babe,64,6,eth1]]></m:pre>


		<h2>Loading a CSV file and running basic queries</h2>

		<p>
			The simplest task is to parse the file and print it as a table in our terminal, or to convert it to another format (XML, Recfile, ODS, YAML, XHTML, ASN.1 etc.).
		</p>
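		<p>
			For instance, a minimal pipeline that just reads <code>ip.csv</code> and prints it as a table in the terminal (no SQL involved yet) might look like this:
		</p>
		<m:pre jazyk="bash"><![CDATA[cat ip.csv \
	| relpipe-in-csv --relation 'ip' \
	| relpipe-out-tabular]]></m:pre>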
       
		<p>
			We can also add <code>relpipe-tr-sql</code> in the middle of our pipeline and run some SQL queries –
			transform the data on-the-fly and send the query result to <code>relpipe-out-tabular</code> (or another output filter) in place of the original data.
			For now, we will filter just the IPv6 addresses:
		</p>
		<m:pre jazyk="bash"><![CDATA[cat ip.csv \
	| relpipe-in-csv --relation 'ip' \
	| relpipe-tr-sql \
		--relation 'ipv6' "SELECT * FROM ip WHERE version = 6" \
	| relpipe-out-tabular]]></m:pre>
		<p>and get them printed:</p>
		<m:pre jazyk="text"><![CDATA[ipv6:
 ╭────────────────────────┬───────────────┬──────────────────┬────────────────────╮
 │ address       (string) │ mask (string) │ version (string) │ interface (string) │
 ├────────────────────────┼───────────────┼──────────────────┼────────────────────┤
 │ ::1                    │ 128           │ 6                │ lo                 │
 │ 2a01:430:2e::cafe:babe │ 64            │ 6                │ eth1               │
 ╰────────────────────────┴───────────────┴──────────────────┴────────────────────╯
Record count: 2]]></m:pre>

		<p>
			It is also possible to run several queries at once,
			and thanks to the <m:name/> format, the result sets are not mixed together – their boundaries are retained and everything is safely passed to the next stage of the pipeline:
		</p>
		<m:pre jazyk="bash"><![CDATA[cat ip.csv \
	| relpipe-in-csv --relation 'ip' \
	| relpipe-tr-sql \
		--relation 'ipv4' "SELECT * FROM ip WHERE version = 4" \
		--relation 'ipv6' "SELECT * FROM ip WHERE version = 6" \
	| relpipe-out-tabular]]></m:pre>
		<p>resulting in two nice tables:</p>
		<m:pre jazyk="text"><![CDATA[ipv4:
 ╭──────────────────┬───────────────┬──────────────────┬────────────────────╮
 │ address (string) │ mask (string) │ version (string) │ interface (string) │
 ├──────────────────┼───────────────┼──────────────────┼────────────────────┤
 │ 127.0.0.1        │ 8             │ 4                │ lo                 │
 │ 192.168.1.2      │ 24            │ 4                │ eth0               │
 │ 192.168.1.8      │ 24            │ 4                │ eth0               │
 │ 10.21.12.24      │ 24            │ 4                │ eth0               │
 │ 75.748.86.91     │ 95            │ 4                │ eth1               │
 │ 23.75.345.200    │ 95            │ 4                │ eth1               │
 ╰──────────────────┴───────────────┴──────────────────┴────────────────────╯
Record count: 6

ipv6:
 ╭────────────────────────┬───────────────┬──────────────────┬────────────────────╮
 │ address       (string) │ mask (string) │ version (string) │ interface (string) │
 ├────────────────────────┼───────────────┼──────────────────┼────────────────────┤
 │ ::1                    │ 128           │ 6                │ lo                 │
 │ 2a01:430:2e::cafe:babe │ 64            │ 6                │ eth1               │
 ╰────────────────────────┴───────────────┴──────────────────┴────────────────────╯
Record count: 2]]></m:pre>

		<h2>Using parametrized queries to avoid SQL injection</h2>
		<p>
			When <code>"4"</code> and <code>"6"</code> are not fixed values, we should not glue them into the query string like <code>version = $version</code>,
			because that is a dangerous practice which may lead to SQL injection.
			We have parametrized queries for such tasks:
		</p>
		<m:pre jazyk="bash"><![CDATA[--relation 'ipv6' "SELECT * FROM ip WHERE version = ?" --parameter "6"]]></m:pre>
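		<p>
			For illustration, the first example could be rewritten with a parametrized query like this
			(just a sketch – the shell variable <code>$version</code> and the relation name <code>filtered</code> are ours):
		</p>
		<m:pre jazyk="bash"><![CDATA[version=6
cat ip.csv \
	| relpipe-in-csv --relation 'ip' \
	| relpipe-tr-sql \
		--relation 'filtered' "SELECT * FROM ip WHERE version = ?" --parameter "$version" \
	| relpipe-out-tabular]]></m:pre>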
       


		<h2>Running SQL JOINs, UNIONs etc. on multiple CSV files</h2>

		<p>
			To load multiple CSV files into our <i>in-memory database</i>, we just concatenate the relational streams
			using the means of our shell – the semicolons and parentheses:
		</p>
		<m:pre jazyk="bash"><![CDATA[(relpipe-in-csv --relation 'ip' < ip.csv; relpipe-in-csv --relation 'nic' < nic.csv) \
	| relpipe-tr-sql \
		--relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface" \
	| relpipe-out-tabular]]></m:pre>

		<p>A generic version that loads all <code>*.csv</code> files:</p>
		<m:pre jazyk="bash"><![CDATA[for csv in *.csv; do relpipe-in-csv --relation "$(basename "$csv" .csv)" < "$csv"; done \
	| relpipe-tr-sql \
		--relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface" \
	| relpipe-out-tabular]]></m:pre>

		<p>Then we can JOIN data from multiple CSV files or do UNIONs, INTERSECTions etc.</p>
		<m:pre jazyk="text"><![CDATA[ip_nic:
 ╭────────────────────────┬───────────────┬──────────────────┬────────────────────┬───────────────────┬───────────────╮
 │ address       (string) │ mask (string) │ version (string) │ interface (string) │ address  (string) │ name (string) │
 ├────────────────────────┼───────────────┼──────────────────┼────────────────────┼───────────────────┼───────────────┤
 │ 127.0.0.1              │ 8             │ 4                │ lo                 │ 00:00:00:00:00:00 │ lo            │
 │ ::1                    │ 128           │ 6                │ lo                 │ 00:00:00:00:00:00 │ lo            │
 │ 192.168.1.2            │ 24            │ 4                │ eth0               │ 00:D0:D8:00:26:00 │ eth0          │
 │ 192.168.1.8            │ 24            │ 4                │ eth0               │ 00:D0:D8:00:26:00 │ eth0          │
 │ 10.21.12.24            │ 24            │ 4                │ eth0               │ 00:D0:D8:00:26:00 │ eth0          │
 │ 75.748.86.91           │ 95            │ 4                │ eth1               │ 00:01:02:01:33:70 │ eth1          │
 │ 23.75.345.200          │ 95            │ 4                │ eth1               │ 00:01:02:01:33:70 │ eth1          │
 │ 2a01:430:2e::cafe:babe │ 64            │ 6                │ eth1               │ 00:01:02:01:33:70 │ eth1          │
 ╰────────────────────────┴───────────────┴──────────────────┴────────────────────┴───────────────────┴───────────────╯
Record count: 8]]></m:pre>
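		<p>
			Since all loaded relations end up in the same in-memory database, other set operations work the same way.
			For example, a query like this one (just an illustrative sketch – the relation name <code>all_addresses</code> is ours) would list the MAC and IP addresses in a single column:
		</p>
		<m:pre jazyk="bash"><![CDATA[for csv in *.csv; do relpipe-in-csv --relation "$(basename "$csv" .csv)" < "$csv"; done \
	| relpipe-tr-sql \
		--relation 'all_addresses' "SELECT address FROM nic UNION SELECT address FROM ip" \
	| relpipe-out-tabular]]></m:pre>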
       


		<h2>Leveraging shell functions</h2>

		<p>
			A good practice is to wrap common code blocks in functions and thus make them reusable.
			In shell, such a function still works with input and output streams, so we can use it when building our pipelines.
			Shell functions can be seen as named reusable parts of a pipeline.
		</p>

		<m:pre jazyk="bash"><![CDATA[csv2relation()  { for file; do relpipe-in-csv --relation "$(basename "$file" .csv)" < "$file"; done }
do_query()      { relpipe-tr-sql --relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface"; }
format_result() { [[ -t 1 ]] && relpipe-out-tabular || cat; }

csv2relation *.csv | do_query | format_result]]></m:pre>

		<p>
			The <code>format_result()</code> function checks whether the STDOUT is a terminal or not:
			when printing to the terminal, it generates a table;
			when writing to a regular file or to the STDIN of another process, it passes the original relational data through.
			Thus <code>./our-script.sh</code> will print a nice table in the terminal, while <code>./our-script.sh > data.rp</code> will create a file containing machine-readable data,
			<code>./our-script.sh | relpipe-out-xhtml > report.xhtml</code> will create an XHTML report, and <code>./our-script.sh | relpipe-out-gui</code> will show a GUI window full of tables and maybe also charts.
		</p>

		<m:img src="img/csv-sql-gui-ip-address-counts.png"/>

		<m:pre jazyk="sql"><![CDATA[SELECT
	nic.name || ' IPv' || ip.version AS label,
	nic.name AS interface,
	ip.version AS ip_version,
	count(*) AS address_count
FROM nic
	LEFT JOIN ip ON (ip.interface = nic.name)
GROUP BY nic.name, ip.version
ORDER BY count(*) DESC]]></m:pre>
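		<p>
			A query like the one above can be plugged into the pipeline in the usual way.
			One possible sketch – it reuses the <code>csv2relation</code> function defined above; the file name <code>summary.sql</code> and the relation name <code>summary</code> are just illustrative:
		</p>
		<m:pre jazyk="bash"><![CDATA[csv2relation *.csv \
	| relpipe-tr-sql --relation 'summary' "$(cat summary.sql)" \
	| relpipe-out-gui]]></m:pre>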
       


		<h2>Makefile version</h2>

		<p>
			Shell scripts are not the only way to structure and organize our pipelines or, more generally, our data-processing code.
			We can also use Make (the tool intended mainly for building software), write a <i>Makefile</i> and organize our code around some temporary files and other targets instead of functions.
		</p>

		<m:pre jazyk="Makefile"><![CDATA[all: print_summary

.PHONY: clean print_summary run_commands

clean:
	rm -rf *.rp

%.rp: %.csv
	relpipe-in-csv --relation "$(basename $(<))" < $(<) > $(@)

define SQL_IP_NIC
	SELECT
		ip.address AS ip_address,
		nic.name AS interface,
		nic.address AS mac_address
	FROM ip
		JOIN nic ON (nic.name = ip.interface)
endef
export SQL_IP_NIC

define SQL_COUNT_VERSIONS
	SELECT
		interface,
		count(CASE WHEN version=4 THEN 1 ELSE NULL END) AS ipv4_count,
		count(CASE WHEN version=6 THEN 1 ELSE NULL END) AS ipv6_count
	FROM ip
	GROUP BY interface
	ORDER BY interface
endef
export SQL_COUNT_VERSIONS

# Longer SQL queries are better kept in separate .sql files,
# because we can enjoy syntax highlighting and other support in our editors.
# Then we use it like this: --relation "ip_nic" "$$(cat ip_nic.sql)"

summary.rp: nic.rp ip.rp
	cat $(^) \
		| relpipe-tr-sql \
			--relation "ip_nic" "$$SQL_IP_NIC" \
			--relation "counts" "$$SQL_COUNT_VERSIONS" \
		> $(@)

print_summary: summary.rp
	cat $(<) | relpipe-out-tabular
]]></m:pre>

	<p>
		We can even combine the advantages of Make and Bash (without calling or including Bash scripts from Make)
		and have reusable shell functions available in the Makefile:
	</p>

<m:pre jazyk="text"><![CDATA[
SHELL=bash
BASH_FUNC_read_nullbyte%%=() { local IFS=; for v in "$$@"; do export "$$v"; read -r -d '' "$$v"; done }
export BASH_FUNC_read_nullbyte%%]]></m:pre>

	<p>usage example:</p>

<m:pre jazyk="Makefile"><![CDATA[
run_commands: summary.rp
	cat $(<) \
		| relpipe-tr-cut --relation 'ip_nic' --invert-match relation true \
		| relpipe-out-nullbyte \
		| while read_nullbyte ip_address interface mac_address; do \
			echo "network interface $$interface ($$mac_address) has IP address $$ip_address"; \
		done;
]]></m:pre>


		<p>
			Both approaches – the shell script and the Makefile – have their pros and cons.
			With the Makefile, we usually create temporary files containing intermediate results, which prevents streaming.
			On the other hand, we then process (parse, transform, filter, format etc.) only the data that have changed.
		</p>
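		<p>
			For example, a session like this one (hypothetical, just to illustrate the incremental behaviour) rebuilds only what is affected by a change:
		</p>
		<m:pre jazyk="bash"><![CDATA[make            # first run: converts both CSV files and builds summary.rp
touch ip.csv    # simulate a change of the IP list
make            # only ip.rp and summary.rp are rebuilt; nic.rp is reused]]></m:pre>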
       


	</text>

</stránka>