|
1 <stránka |
|
2 xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana" |
|
3 xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro"> |
|
4 |
|
5 <nadpis>Running SQL JOINs on multiple CSV files</nadpis> |
|
6 <perex>query a collection of (not only) CSV files using SQL</perex> |
|
7 <m:pořadí-příkladu>05100</m:pořadí-příkladu> |
|
8 |
|
9 <text xmlns="http://www.w3.org/1999/xhtml"> |
|
10 |
|
11 <p> |
|
12 CSV (<m:a href="4180" typ="rfc">RFC 4180</m:a>) is quite good solution when we want to store or share relational data in a simple text format – |
|
13 both, human-readable and well supported by many existing applications and libraries. |
|
14 We have even ready-to-use GUI editors, so called spreadsheets e.g. LibreOffice Calc. |
|
15 (on the other hand, such simple formats have usually some drawbacks…) |
|
16 </p> |
|
17 <p> |
|
18 In this example, we will show how to query a set of CSV files like it was a relational database. |
|
19 </p> |
|
20 |
|
21 <p>Suppose we have a CSV file describing our network interfaces:</p> |
|
22 <m:pre jazyk="text"><![CDATA[address,name |
|
23 00:00:00:00:00:00,lo |
|
24 00:D0:D8:00:26:00,eth0 |
|
25 00:01:02:01:33:70,eth1]]></m:pre> |
|
26 |
|
27 |
|
28 <p>and another CSV file with IP addresses assigned to them:</p> |
|
29 <m:pre jazyk="text"><![CDATA[address,mask,version,interface |
|
30 127.0.0.1,8,4,lo |
|
31 ::1,128,6,lo |
|
32 192.168.1.2,24,4,eth0 |
|
33 192.168.1.8,24,4,eth0 |
|
34 10.21.12.24,24,4,eth0 |
|
35 75.748.86.91,95,4,eth1 |
|
36 23.75.345.200,95,4,eth1 |
|
37 2a01:430:2e::cafe:babe,64,6,eth1]]></m:pre> |
|
38 |
|
39 |
|
40 <h2>Loading a CSV file and running basic queries</h2> |
|
41 |
|
42 <p> |
|
43 Simplest task is to parse the file and print it as a table in our terminal or convert it to another format (XML, Recfile, ODS, YAML, XHTML, ASN.1 etc.) |
|
44 We can also add <code>relpipe-tr-sql</code> in the middle of our pipeline and run some SQL queries – |
|
45 transform data on-the-fly and send the query result to the <code>relpipe-out-tabular</code> (or other output filter) in place of the original data. |
|
46 For now, we will filter just the IPv6 addresses: |
|
47 </p> |
|
48 <m:pre jazyk="bash"><![CDATA[cat ip.csv \ |
|
49 | relpipe-in-csv --relation 'ip' \ |
|
50 | relpipe-tr-sql \ |
|
51 --relation 'ipv6' "SELECT * FROM ip WHERE version = 6" \ |
|
52 | relpipe-out-tabular]]></m:pre> |
|
53 <p>and get them printed:</p> |
|
54 <m:pre jazyk="text"><![CDATA[ipv6: |
|
55 ╭────────────────────────┬───────────────┬──────────────────┬────────────────────╮ |
|
56 │ address (string) │ mask (string) │ version (string) │ interface (string) │ |
|
57 ├────────────────────────┼───────────────┼──────────────────┼────────────────────┤ |
|
58 │ ::1 │ 128 │ 6 │ lo │ |
|
59 │ 2a01:430:2e::cafe:babe │ 64 │ 6 │ eth1 │ |
|
60 ╰────────────────────────┴───────────────┴──────────────────┴────────────────────╯ |
|
61 Record count: 2]]></m:pre> |
|
62 |
|
63 <p> |
|
64 It is alo possible to run several queries at once |
|
65 and thanks to the <m:name/> format, the result sets are not mixed together, their boundaries are retained and everything is safely passed to the next stage of the pipeline: |
|
66 </p> |
|
67 <m:pre jazyk="bash"><![CDATA[cat ip.csv \ |
|
68 | relpipe-in-csv --relation 'ip' \ |
|
69 | relpipe-tr-sql \ |
|
70 --relation 'ipv4' "SELECT * FROM ip WHERE version = 4" \ |
|
71 --relation 'ipv6' "SELECT * FROM ip WHERE version = 6" \ |
|
72 | relpipe-out-tabular]]></m:pre> |
|
73 <p>resulting in two nice tables:</p> |
|
74 <m:pre jazyk="text"><![CDATA[ipv4: |
|
75 ╭──────────────────┬───────────────┬──────────────────┬────────────────────╮ |
|
76 │ address (string) │ mask (string) │ version (string) │ interface (string) │ |
|
77 ├──────────────────┼───────────────┼──────────────────┼────────────────────┤ |
|
78 │ 127.0.0.1 │ 8 │ 4 │ lo │ |
|
79 │ 192.168.1.2 │ 24 │ 4 │ eth0 │ |
|
80 │ 192.168.1.8 │ 24 │ 4 │ eth0 │ |
|
81 │ 10.21.12.24 │ 24 │ 4 │ eth0 │ |
|
82 │ 75.748.86.91 │ 95 │ 4 │ eth1 │ |
|
83 │ 23.75.345.200 │ 95 │ 4 │ eth1 │ |
|
84 ╰──────────────────┴───────────────┴──────────────────┴────────────────────╯ |
|
85 Record count: 6 |
|
86 |
|
87 ipv6: |
|
88 ╭────────────────────────┬───────────────┬──────────────────┬────────────────────╮ |
|
89 │ address (string) │ mask (string) │ version (string) │ interface (string) │ |
|
90 ├────────────────────────┼───────────────┼──────────────────┼────────────────────┤ |
|
91 │ ::1 │ 128 │ 6 │ lo │ |
|
92 │ 2a01:430:2e::cafe:babe │ 64 │ 6 │ eth1 │ |
|
93 ╰────────────────────────┴───────────────┴──────────────────┴────────────────────╯ |
|
94 Record count: 2]]></m:pre> |
|
95 |
|
96 <h2>Using parametrized queries to avoid SQL injection</h2> |
|
97 <p> |
|
98 When <code>"4"</code> and <code>"6"</code> are not fixed values, we should not glue them to the query string like <code>version = $version</code>, |
|
99 because it is a dangerous practice that may lead to SQL injection. |
|
100 We have parametrized queries for such tasks: |
|
101 </p> |
|
102 <m:pre jazyk="bash"><![CDATA[--relation 'ipv6' "SELECT * FROM ip WHERE version = ?" --parameter "6"]]></m:pre> |
|
103 |
|
104 |
|
105 <h2>Running SQL JOINs, UNIONs etc. on multiple CSV files</h2> |
|
106 |
|
107 <p> |
|
108 To load multiple CSV files into our <i>in-memory database</i>, we just concatenate the relational streams |
|
109 using the means of our shell – the semicolons and parenthesis: |
|
110 </p> |
|
111 <m:pre jazyk="bash"><![CDATA[(relpipe-in-csv --relation 'ip' < ip.csv; relpipe-in-csv --relation 'nic' < nic.csv) \ |
|
112 | relpipe-tr-sql \ |
|
113 --relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface" \ |
|
114 | relpipe-out-tabular]]></m:pre> |
|
115 |
|
116 <p>Generic version that loads all <code>*.csv</code> files:</p> |
|
117 <m:pre jazyk="bash"><![CDATA[for csv in *.csv; do relpipe-in-csv --relation "$(basename "$csv" .csv)" < "$csv"; done \ |
|
118 | relpipe-tr-sql \ |
|
119 --relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface" \ |
|
120 | relpipe-out-tabular]]></m:pre> |
|
121 |
|
122 <p>Then we can JOIN data from multiple CSV files or do UNIONs, INTERSECTions etc.</p> |
|
123 <m:pre jazyk="text"><![CDATA[ip_nic: |
|
124 ╭────────────────────────┬───────────────┬──────────────────┬────────────────────┬───────────────────┬───────────────╮ |
|
125 │ address (string) │ mask (string) │ version (string) │ interface (string) │ address (string) │ name (string) │ |
|
126 ├────────────────────────┼───────────────┼──────────────────┼────────────────────┼───────────────────┼───────────────┤ |
|
127 │ 127.0.0.1 │ 8 │ 4 │ lo │ 00:00:00:00:00:00 │ lo │ |
|
128 │ ::1 │ 128 │ 6 │ lo │ 00:00:00:00:00:00 │ lo │ |
|
129 │ 192.168.1.2 │ 24 │ 4 │ eth0 │ 00:D0:D8:00:26:00 │ eth0 │ |
|
130 │ 192.168.1.8 │ 24 │ 4 │ eth0 │ 00:D0:D8:00:26:00 │ eth0 │ |
|
131 │ 10.21.12.24 │ 24 │ 4 │ eth0 │ 00:D0:D8:00:26:00 │ eth0 │ |
|
132 │ 75.748.86.91 │ 95 │ 4 │ eth1 │ 00:01:02:01:33:70 │ eth1 │ |
|
133 │ 23.75.345.200 │ 95 │ 4 │ eth1 │ 00:01:02:01:33:70 │ eth1 │ |
|
134 │ 2a01:430:2e::cafe:babe │ 64 │ 6 │ eth1 │ 00:01:02:01:33:70 │ eth1 │ |
|
135 ╰────────────────────────┴───────────────┴──────────────────┴────────────────────┴───────────────────┴───────────────╯ |
|
136 Record count: 8]]></m:pre> |
|
137 |
|
138 |
|
139 <h2>Leveraging shell functions</h2> |
|
140 |
|
141 <p> |
|
142 Good practice is to wrap common code blocks into functions and thus make them reusable. |
|
143 In shell, the function still works with input and output streams and we can use them when building our pipelines. |
|
144 Shell functions can be seen as named reusable parts of a pipeline. |
|
145 </p> |
|
146 |
|
147 <m:pre jazyk="bash"><![CDATA[csv2relation() { for file; do relpipe-in-csv --relation "$(basename "$file" .csv)" < "$file"; done } |
|
148 do_query() { relpipe-tr-sql --relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface"; } |
|
149 format_result() { [[ -t 1 ]] && relpipe-out-tabular || cat; } |
|
150 |
|
151 csv2relation *.csv | do_query | format_result]]></m:pre> |
|
152 |
|
153 <p> |
|
154 The <code>format_result()</code> function checks whether the STDOUT is a terminal or not. |
|
155 and when printing to the terminal, it generates a table. |
|
156 When writing to a regular file or STDIN of another process, it passes through original relational data. |
|
157 Thus <code>./our-script.sh</code> will print a nice table in the terminal, while <code>./our-script.sh > data.rp</code> will create a file containing machine-readable data |
|
158 and <code>./our-script.sh | relpipe-out-xhtml > report.xhtml</code> will create an XHTML report and <code>./our-script.sh | relpipe-out-gui</code> will show a GUI window full of tables and maybe also charts. |
|
159 </p> |
|
160 |
|
161 <m:img src="img/csv-sql-gui-ip-address-counts.png"/> |
|
162 |
|
163 <m:pre jazyk="sql"><![CDATA[SELECT |
|
164 nic.name || ' IPv' || ip.version AS label, |
|
165 nic.name AS interface, |
|
166 ip.version AS ip_version, |
|
167 count(*) AS address_count |
|
168 FROM nic |
|
169 LEFT JOIN ip ON (ip.interface = nic.name) |
|
170 GROUP BY nic.name, ip.version |
|
171 ORDER BY count(*) DESC]]></m:pre> |
|
172 |
|
173 |
|
174 <h2>Makefile version</h2> |
|
175 |
|
176 <p> |
|
177 Shell scripts are not the only way to structure and organize our pipelines or generally our data-processing code. |
|
178 We can also use Make (the tool intended mainly for building sofware), write a <i>Makefile</i> and organize our code around some temporary files and other targets instead of functions. |
|
179 </p> |
|
180 |
|
181 <m:pre jazyk="Makefile"><![CDATA[all: print_summary |
|
182 |
|
183 .PHONY: clean print_summary run_commands |
|
184 |
|
185 clean: |
|
186 rm -rf *.rp |
|
187 |
|
188 %.rp: %.csv |
|
189 relpipe-in-csv --relation "$(basename $(<))" < $(<) > $(@) |
|
190 |
|
191 define SQL_IP_NIC |
|
192 SELECT |
|
193 ip.address AS ip_address, |
|
194 nic.name AS interface, |
|
195 nic.address AS mac_address |
|
196 FROM ip |
|
197 JOIN nic ON (nic.name = ip.interface) |
|
198 endef |
|
199 export SQL_IP_NIC |
|
200 |
|
201 define SQL_COUNT_VERSIONS |
|
202 SELECT |
|
203 interface, |
|
204 count(CASE WHEN version=4 THEN 1 ELSE NULL END) AS ipv4_count, |
|
205 count(CASE WHEN version=6 THEN 1 ELSE NULL END) AS ipv6_count |
|
206 FROM ip |
|
207 GROUP BY interface |
|
208 ORDER BY interface |
|
209 endef |
|
210 export SQL_COUNT_VERSIONS |
|
211 |
|
212 # Longer SQL queries are better kept in separate .sql files, |
|
213 # because we can enjoy syntax highlighting and other support in our editors. |
|
214 # Then we use it like this: --relation "ip_nic" "$$(cat ip_nic.sql)" |
|
215 |
|
216 summary.rp: nic.rp ip.rp |
|
217 cat $(^) \ |
|
218 | relpipe-tr-sql \ |
|
219 --relation "ip_nic" "$$SQL_IP_NIC" \ |
|
220 --relation "counts" "$$SQL_COUNT_VERSIONS" \ |
|
221 > $(@) |
|
222 |
|
223 print_summary: summary.rp |
|
224 cat $(<) | relpipe-out-tabular |
|
225 ]]></m:pre> |
|
226 |
|
227 <p> |
|
228 We can even combine advantages of Make and Bash together (without calling or including Bash scripts from Make) |
|
229 and have reusable shell functions available in the Makefile: |
|
230 </p> |
|
231 |
|
232 <m:pre jazyk="text"><![CDATA[ |
|
233 SHELL=bash |
|
234 BASH_FUNC_read_nullbyte%%=() { local IFS=; for v in "$$@"; do export "$$v"; read -r -d '' "$$v"; done } |
|
235 export BASH_FUNC_read_nullbyte%%]]></m:pre> |
|
236 |
|
237 <p>usage example:</p> |
|
238 |
|
239 <m:pre jazyk="Makefile"><![CDATA[ |
|
240 run_commands: summary.rp |
|
241 cat $(<) \ |
|
242 | relpipe-tr-cut --relation 'ip_nic' --invert-match relation true \ |
|
243 | relpipe-out-nullbyte \ |
|
244 | while read_nullbyte ip_address interface mac_address; do\ |
|
245 echo "network interface $$interface ($$mac_address) has IP address $$ip_address"; \ |
|
246 done; |
|
247 ]]></m:pre> |
|
248 |
|
249 <p> |
|
250 Both approaches – the shell script and the Makefile – have pros and cons. |
|
251 With Makefile, we usually create some temporary files containing intermediate results. |
|
252 That avoids streaming. But on the other hand, we process (parse, transform, filter, format etc.) only data that have been changed. |
|
253 </p> |
|
254 |
|
255 |
|
256 </text> |
|
257 |
|
258 </stránka> |