329
|
1 |
<stránka
|
|
2 |
xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
|
|
3 |
xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
|
|
4 |
|
|
5 |
<nadpis>Running SQL JOINs on multiple CSV files</nadpis>
|
|
6 |
<perex>query a collection of (not only) CSV files using SQL</perex>
|
|
7 |
<m:pořadí-příkladu>05100</m:pořadí-příkladu>
|
|
8 |
|
|
9 |
<text xmlns="http://www.w3.org/1999/xhtml">
|
|
10 |
|
|
11 |
<p>
|
|
12 |
CSV (<m:a href="4180" typ="rfc">RFC 4180</m:a>) is quite good solution when we want to store or share relational data in a simple text format –
|
|
13 |
both, human-readable and well supported by many existing applications and libraries.
|
|
14 |
We have even ready-to-use GUI editors, so called spreadsheets e.g. LibreOffice Calc.
|
|
15 |
(on the other hand, such simple formats have usually some drawbacks…)
|
|
16 |
</p>
|
|
17 |
<p>
|
|
18 |
In this example, we will show how to query a set of CSV files like it was a relational database.
|
|
19 |
</p>
|
|
20 |
|
|
21 |
<p>Suppose we have a CSV file describing our network interfaces:</p>
|
|
22 |
<m:pre jazyk="text"><![CDATA[address,name
|
|
23 |
00:00:00:00:00:00,lo
|
|
24 |
00:D0:D8:00:26:00,eth0
|
|
25 |
00:01:02:01:33:70,eth1]]></m:pre>
|
|
26 |
|
|
27 |
|
|
28 |
<p>and another CSV file with IP addresses assigned to them:</p>
|
|
29 |
<m:pre jazyk="text"><![CDATA[address,mask,version,interface
|
|
30 |
127.0.0.1,8,4,lo
|
|
31 |
::1,128,6,lo
|
|
32 |
192.168.1.2,24,4,eth0
|
|
33 |
192.168.1.8,24,4,eth0
|
|
34 |
10.21.12.24,24,4,eth0
|
|
35 |
75.748.86.91,95,4,eth1
|
|
36 |
23.75.345.200,95,4,eth1
|
|
37 |
2a01:430:2e::cafe:babe,64,6,eth1]]></m:pre>
|
|
38 |
|
|
39 |
|
|
40 |
<h2>Loading a CSV file and running basic queries</h2>
|
|
41 |
|
|
42 |
<p>
|
|
43 |
Simplest task is to parse the file and print it as a table in our terminal or convert it to another format (XML, Recfile, ODS, YAML, XHTML, ASN.1 etc.)
|
|
44 |
We can also add <code>relpipe-tr-sql</code> in the middle of our pipeline and run some SQL queries –
|
|
45 |
transform data on-the-fly and send the query result to the <code>relpipe-out-tabular</code> (or other output filter) in place of the original data.
|
|
46 |
For now, we will filter just the IPv6 addresses:
|
|
47 |
</p>
|
|
48 |
<m:pre jazyk="bash"><![CDATA[cat ip.csv \
|
|
49 |
| relpipe-in-csv --relation 'ip' \
|
|
50 |
| relpipe-tr-sql \
|
|
51 |
--relation 'ipv6' "SELECT * FROM ip WHERE version = 6" \
|
|
52 |
| relpipe-out-tabular]]></m:pre>
|
|
53 |
<p>and get them printed:</p>
|
|
54 |
<m:pre jazyk="text"><![CDATA[ipv6:
|
|
55 |
╭────────────────────────┬───────────────┬──────────────────┬────────────────────╮
|
|
56 |
│ address (string) │ mask (string) │ version (string) │ interface (string) │
|
|
57 |
├────────────────────────┼───────────────┼──────────────────┼────────────────────┤
|
|
58 |
│ ::1 │ 128 │ 6 │ lo │
|
|
59 |
│ 2a01:430:2e::cafe:babe │ 64 │ 6 │ eth1 │
|
|
60 |
╰────────────────────────┴───────────────┴──────────────────┴────────────────────╯
|
|
61 |
Record count: 2]]></m:pre>
|
|
62 |
|
|
63 |
<p>
|
|
64 |
It is alo possible to run several queries at once
|
|
65 |
and thanks to the <m:name/> format, the result sets are not mixed together, their boundaries are retained and everything is safely passed to the next stage of the pipeline:
|
|
66 |
</p>
|
|
67 |
<m:pre jazyk="bash"><![CDATA[cat ip.csv \
|
|
68 |
| relpipe-in-csv --relation 'ip' \
|
|
69 |
| relpipe-tr-sql \
|
|
70 |
--relation 'ipv4' "SELECT * FROM ip WHERE version = 4" \
|
|
71 |
--relation 'ipv6' "SELECT * FROM ip WHERE version = 6" \
|
|
72 |
| relpipe-out-tabular]]></m:pre>
|
|
73 |
<p>resulting in two nice tables:</p>
|
|
74 |
<m:pre jazyk="text"><![CDATA[ipv4:
|
|
75 |
╭──────────────────┬───────────────┬──────────────────┬────────────────────╮
|
|
76 |
│ address (string) │ mask (string) │ version (string) │ interface (string) │
|
|
77 |
├──────────────────┼───────────────┼──────────────────┼────────────────────┤
|
|
78 |
│ 127.0.0.1 │ 8 │ 4 │ lo │
|
|
79 |
│ 192.168.1.2 │ 24 │ 4 │ eth0 │
|
|
80 |
│ 192.168.1.8 │ 24 │ 4 │ eth0 │
|
|
81 |
│ 10.21.12.24 │ 24 │ 4 │ eth0 │
|
|
82 |
│ 75.748.86.91 │ 95 │ 4 │ eth1 │
|
|
83 |
│ 23.75.345.200 │ 95 │ 4 │ eth1 │
|
|
84 |
╰──────────────────┴───────────────┴──────────────────┴────────────────────╯
|
|
85 |
Record count: 6
|
|
86 |
|
|
87 |
ipv6:
|
|
88 |
╭────────────────────────┬───────────────┬──────────────────┬────────────────────╮
|
|
89 |
│ address (string) │ mask (string) │ version (string) │ interface (string) │
|
|
90 |
├────────────────────────┼───────────────┼──────────────────┼────────────────────┤
|
|
91 |
│ ::1 │ 128 │ 6 │ lo │
|
|
92 |
│ 2a01:430:2e::cafe:babe │ 64 │ 6 │ eth1 │
|
|
93 |
╰────────────────────────┴───────────────┴──────────────────┴────────────────────╯
|
|
94 |
Record count: 2]]></m:pre>
|
|
95 |
|
|
96 |
<h2>Using parametrized queries to avoid SQL injection</h2>
|
|
97 |
<p>
|
|
98 |
When <code>"4"</code> and <code>"6"</code> are not fixed values, we should not glue them to the query string like <code>version = $version</code>,
|
|
99 |
because it is a dangerous practice that may lead to SQL injection.
|
|
100 |
We have parametrized queries for such tasks:
|
|
101 |
</p>
|
|
102 |
<m:pre jazyk="bash"><![CDATA[--relation 'ipv6' "SELECT * FROM ip WHERE version = ?" --parameter "6"]]></m:pre>
|
|
103 |
|
|
104 |
|
|
105 |
<h2>Running SQL JOINs, UNIONs etc. on multiple CSV files</h2>
|
|
106 |
|
|
107 |
<p>
|
|
108 |
To load multiple CSV files into our <i>in-memory database</i>, we just concatenate the relational streams
|
|
109 |
using the means of our shell – the semicolons and parenthesis:
|
|
110 |
</p>
|
|
111 |
<m:pre jazyk="bash"><![CDATA[(relpipe-in-csv --relation 'ip' < ip.csv; relpipe-in-csv --relation 'nic' < nic.csv) \
|
|
112 |
| relpipe-tr-sql \
|
|
113 |
--relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface" \
|
|
114 |
| relpipe-out-tabular]]></m:pre>
|
|
115 |
|
|
116 |
<p>Generic version that loads all <code>*.csv</code> files:</p>
|
|
117 |
<m:pre jazyk="bash"><![CDATA[for csv in *.csv; do relpipe-in-csv --relation "$(basename "$csv" .csv)" < "$csv"; done \
|
|
118 |
| relpipe-tr-sql \
|
|
119 |
--relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface" \
|
|
120 |
| relpipe-out-tabular]]></m:pre>
|
|
121 |
|
|
122 |
<p>Then we can JOIN data from multiple CSV files or do UNIONs, INTERSECTions etc.</p>
|
|
123 |
<m:pre jazyk="text"><![CDATA[ip_nic:
|
|
124 |
╭────────────────────────┬───────────────┬──────────────────┬────────────────────┬───────────────────┬───────────────╮
|
|
125 |
│ address (string) │ mask (string) │ version (string) │ interface (string) │ address (string) │ name (string) │
|
|
126 |
├────────────────────────┼───────────────┼──────────────────┼────────────────────┼───────────────────┼───────────────┤
|
|
127 |
│ 127.0.0.1 │ 8 │ 4 │ lo │ 00:00:00:00:00:00 │ lo │
|
|
128 |
│ ::1 │ 128 │ 6 │ lo │ 00:00:00:00:00:00 │ lo │
|
|
129 |
│ 192.168.1.2 │ 24 │ 4 │ eth0 │ 00:D0:D8:00:26:00 │ eth0 │
|
|
130 |
│ 192.168.1.8 │ 24 │ 4 │ eth0 │ 00:D0:D8:00:26:00 │ eth0 │
|
|
131 |
│ 10.21.12.24 │ 24 │ 4 │ eth0 │ 00:D0:D8:00:26:00 │ eth0 │
|
|
132 |
│ 75.748.86.91 │ 95 │ 4 │ eth1 │ 00:01:02:01:33:70 │ eth1 │
|
|
133 |
│ 23.75.345.200 │ 95 │ 4 │ eth1 │ 00:01:02:01:33:70 │ eth1 │
|
|
134 |
│ 2a01:430:2e::cafe:babe │ 64 │ 6 │ eth1 │ 00:01:02:01:33:70 │ eth1 │
|
|
135 |
╰────────────────────────┴───────────────┴──────────────────┴────────────────────┴───────────────────┴───────────────╯
|
|
136 |
Record count: 8]]></m:pre>
|
|
137 |
|
|
138 |
|
|
139 |
<h2>Leveraging shell functions</h2>
|
|
140 |
|
|
141 |
<p>
|
|
142 |
Good practice is to wrap common code blocks into functions and thus make them reusable.
|
|
143 |
In shell, the function still works with input and output streams and we can use them when building our pipelines.
|
|
144 |
Shell functions can be seen as named reusable parts of a pipeline.
|
|
145 |
</p>
|
|
146 |
|
|
147 |
<m:pre jazyk="bash"><![CDATA[csv2relation() { for file; do relpipe-in-csv --relation "$(basename "$file" .csv)" < "$file"; done }
|
|
148 |
do_query() { relpipe-tr-sql --relation 'ip_nic' "SELECT * FROM ip JOIN nic ON nic.name = ip.interface"; }
|
|
149 |
format_result() { [[ -t 1 ]] && relpipe-out-tabular || cat; }
|
|
150 |
|
|
151 |
csv2relation *.csv | do_query | format_result]]></m:pre>
|
|
152 |
|
|
153 |
<p>
|
|
154 |
The <code>format_result()</code> function checks whether the STDOUT is a terminal or not.
|
|
155 |
and when printing to the terminal, it generates a table.
|
|
156 |
When writing to a regular file or STDIN of another process, it passes through original relational data.
|
|
157 |
Thus <code>./our-script.sh</code> will print a nice table in the terminal, while <code>./our-script.sh > data.rp</code> will create a file containing machine-readable data
|
|
158 |
and <code>./our-script.sh | relpipe-out-xhtml > report.xhtml</code> will create an XHTML report and <code>./our-script.sh | relpipe-out-gui</code> will show a GUI window full of tables and maybe also charts.
|
|
159 |
</p>
|
|
160 |
|
|
161 |
<m:img src="img/csv-sql-gui-ip-address-counts.png"/>
|
|
162 |
|
|
163 |
<m:pre jazyk="sql"><![CDATA[SELECT
|
|
164 |
nic.name || ' IPv' || ip.version AS label,
|
|
165 |
nic.name AS interface,
|
|
166 |
ip.version AS ip_version,
|
|
167 |
count(*) AS address_count
|
|
168 |
FROM nic
|
|
169 |
LEFT JOIN ip ON (ip.interface = nic.name)
|
|
170 |
GROUP BY nic.name, ip.version
|
|
171 |
ORDER BY count(*) DESC]]></m:pre>
|
|
172 |
|
|
173 |
|
|
174 |
<h2>Makefile version</h2>
|
|
175 |
|
|
176 |
<p>
|
|
177 |
Shell scripts are not the only way to structure and organize our pipelines or generally our data-processing code.
|
|
178 |
We can also use Make (the tool intended mainly for building sofware), write a <i>Makefile</i> and organize our code around some temporary files and other targets instead of functions.
|
|
179 |
</p>
|
|
180 |
|
|
181 |
<m:pre jazyk="Makefile"><![CDATA[all: print_summary
|
|
182 |
|
|
183 |
.PHONY: clean print_summary run_commands
|
|
184 |
|
|
185 |
clean:
|
|
186 |
rm -rf *.rp
|
|
187 |
|
|
188 |
%.rp: %.csv
|
|
189 |
relpipe-in-csv --relation "$(basename $(<))" < $(<) > $(@)
|
|
190 |
|
|
191 |
define SQL_IP_NIC
|
|
192 |
SELECT
|
|
193 |
ip.address AS ip_address,
|
|
194 |
nic.name AS interface,
|
|
195 |
nic.address AS mac_address
|
|
196 |
FROM ip
|
|
197 |
JOIN nic ON (nic.name = ip.interface)
|
|
198 |
endef
|
|
199 |
export SQL_IP_NIC
|
|
200 |
|
|
201 |
define SQL_COUNT_VERSIONS
|
|
202 |
SELECT
|
|
203 |
interface,
|
|
204 |
count(CASE WHEN version=4 THEN 1 ELSE NULL END) AS ipv4_count,
|
|
205 |
count(CASE WHEN version=6 THEN 1 ELSE NULL END) AS ipv6_count
|
|
206 |
FROM ip
|
|
207 |
GROUP BY interface
|
|
208 |
ORDER BY interface
|
|
209 |
endef
|
|
210 |
export SQL_COUNT_VERSIONS
|
|
211 |
|
|
212 |
# Longer SQL queries are better kept in separate .sql files,
|
|
213 |
# because we can enjoy syntax highlighting and other support in our editors.
|
|
214 |
# Then we use it like this: --relation "ip_nic" "$$(cat ip_nic.sql)"
|
|
215 |
|
|
216 |
summary.rp: nic.rp ip.rp
|
|
217 |
cat $(^) \
|
|
218 |
| relpipe-tr-sql \
|
|
219 |
--relation "ip_nic" "$$SQL_IP_NIC" \
|
|
220 |
--relation "counts" "$$SQL_COUNT_VERSIONS" \
|
|
221 |
> $(@)
|
|
222 |
|
|
223 |
print_summary: summary.rp
|
|
224 |
cat $(<) | relpipe-out-tabular
|
|
225 |
]]></m:pre>
|
|
226 |
|
|
227 |
<p>
|
|
228 |
We can even combine advantages of Make and Bash together (without calling or including Bash scripts from Make)
|
|
229 |
and have reusable shell functions available in the Makefile:
|
|
230 |
</p>
|
|
231 |
|
|
232 |
<m:pre jazyk="text"><![CDATA[
|
|
233 |
SHELL=bash
|
|
234 |
BASH_FUNC_read_nullbyte%%=() { local IFS=; for v in "$$@"; do export "$$v"; read -r -d '' "$$v"; done }
|
|
235 |
export BASH_FUNC_read_nullbyte%%]]></m:pre>
|
|
236 |
|
|
237 |
<p>usage example:</p>
|
|
238 |
|
|
239 |
<m:pre jazyk="Makefile"><![CDATA[
|
|
240 |
run_commands: summary.rp
|
|
241 |
cat $(<) \
|
|
242 |
| relpipe-tr-cut --relation 'ip_nic' --invert-match relation true \
|
|
243 |
| relpipe-out-nullbyte \
|
|
244 |
| while read_nullbyte ip_address interface mac_address; do\
|
|
245 |
echo "network interface $$interface ($$mac_address) has IP address $$ip_address"; \
|
|
246 |
done;
|
|
247 |
]]></m:pre>
|
|
248 |
|
|
249 |
<p>
|
|
250 |
Both approaches – the shell script and the Makefile – have pros and cons.
|
|
251 |
With Makefile, we usually create some temporary files containing intermediate results.
|
|
252 |
That avoids streaming. But on the other hand, we process (parse, transform, filter, format etc.) only data that have been changed.
|
|
253 |
</p>
|
|
254 |
|
|
255 |
|
|
256 |
</text>
|
|
257 |
|
|
258 |
</stránka>
|