<stránka
xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

<nadpis>CSV and data types</nadpis>
<perex>declare or recognize integers and booleans in a typeless format</perex>
<m:pořadí-příkladu>04800</m:pořadí-příkladu>

<text xmlns="http://www.w3.org/1999/xhtml">

<p>
CSV (<m:a href="4180" typ="rfc">RFC 4180</m:a>) is quite a good solution when we want to store or share relational data in a simple text format –
both human-readable and well supported by many existing applications and libraries.
We even have ready-to-use GUI editors, the so-called spreadsheets (e.g. LibreOffice Calc).
However, such simple formats usually have some drawbacks.
CSV may contain only a single relation (<i>table</i>, <i>sheet</i>). This is not a big issue – we can use several files.
A more serious problem is the absence of data types – in CSV, everything is just a text string.
Thus it was impossible to have a lossless conversion to CSV and back.
</p>

<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-tabular
filesystem:
╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
│ path (string)   │ type (string) │ size (integer) │ owner (string) │ group (string) │
├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
│ license/        │ d             │              0 │ hacker         │ hacker         │
│ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2]]></m:pre>

<p>Data types are missing in CSV by default:</p>
<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv
"path","type","size","owner","group"
"license/","d","0","hacker","hacker"
"license/gpl.txt","f","35147","hacker","hacker"]]></m:pre>

<p>The <code>size</code> attribute was an integer and is now a mere string:</p>
<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-out-tabular
csv:
╭─────────────────┬───────────────┬───────────────┬────────────────┬────────────────╮
│ path (string)   │ type (string) │ size (string) │ owner (string) │ group (string) │
├─────────────────┼───────────────┼───────────────┼────────────────┼────────────────┤
│ license/        │ d             │ 0             │ hacker         │ hacker         │
│ license/gpl.txt │ f             │ 35147         │ hacker         │ hacker         │
╰─────────────────┴───────────────┴───────────────┴────────────────┴────────────────╯
Record count: 2]]></m:pre>
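
<p>
The same loss shows up in any generic CSV reader, not only in <code>relpipe-in-csv</code>.
As a small illustration outside of <m:name/>, the following sketch feeds the untyped CSV from above to the <code>csv</code> module of the Python standard library;
the variable names are made up for this example.
</p>

<m:pre jazyk="text"><![CDATA[
```python
import csv
import io

# The untyped CSV produced above, embedded as a string for the example.
data = '''"path","type","size","owner","group"
"license/","d","0","hacker","hacker"
"license/gpl.txt","f","35147","hacker","hacker"
'''

rows = list(csv.DictReader(io.StringIO(data)))

# Every value comes back as str, including the numeric-looking size.
sizes = [row["size"] for row in rows]
print(sizes)                    # ['0', '35147']
print(type(sizes[0]).__name__)  # str
```
]]></m:pre>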


<h2>Declare data types in the CSV header</h2>

<p>
Since <m:name/> <m:a href="release-v0.18">v0.18</m:a>, we can encode the data types (currently strings, integers and booleans) in the CSV header and then recover them while reading.
Such “CSV with data types” is still valid CSV according to the RFC and can be viewed or edited in any CSV-capable software.
</p>

<p>
The attribute name and data type are separated by the <code>::</code> symbol, e.g. <code>name::string,age::integer,member::boolean</code>.
Attribute names may contain <code>::</code> (unlike the data type names).
</p>
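
<p>
Because attribute names may contain <code>::</code> while type names may not, one plausible reading is that the <i>last</i> <code>::</code> in a header field is the separator.
The following Python sketch works under that assumption; the function name and the handling of unknown types are hypothetical and do not describe how <code>relpipe-in-csv</code> is implemented.
</p>

<m:pre jazyk="text"><![CDATA[
```python
# The three types named in the text above.
KNOWN_TYPES = {"string", "integer", "boolean"}

def split_declaration(field):
    """Split 'name::type' at the last '::'; return (name, type or None)."""
    name, sep, type_name = field.rpartition("::")
    if sep and type_name in KNOWN_TYPES:
        return name, type_name
    # No valid declaration: the whole field is the attribute name.
    return field, None

print(split_declaration("age::integer"))         # ('age', 'integer')
print(split_declaration("std::vector::string"))  # ('std::vector', 'string')
print(split_declaration("ratio::float"))         # ('ratio::float', None)
```
]]></m:pre>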

<p>The data type declarations may be added simply by hand or automatically using <code>relpipe-out-csv</code>.</p>

<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true
"path::string","type::string","size::integer","owner::string","group::string"
"license/","d","0","hacker","hacker"
"license/gpl.txt","f","35147","hacker","hacker"]]></m:pre>

<p>The <code>relpipe-out-csv</code> + <code>relpipe-in-csv</code> round-trip no longer degrades the data quality:</p>
<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true | relpipe-in-csv | relpipe-out-tabular
csv:
╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
│ path (string)   │ type (string) │ size (integer) │ owner (string) │ group (string) │
├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
│ license/        │ d             │              0 │ hacker         │ hacker         │
│ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2]]></m:pre>


<p>
So we can put e.g. a CSV editor between them while storing and versioning the data in a different format (like XML or Recfile).
Such a workflow can be effectively managed by <code>make</code> –
<code>make edit</code> will convert the versioned data to CSV and launch the editor,
<code>make commit</code> will convert the data back from CSV and commit them in Mercurial, Git or another version control system (VCS).
</p>

<p>
Why put data into the VCS in a format other than CSV?
Formats like XML or Recfile may have each attribute on a separate line, which leads to more readable diffs.
At a glance we can see which attributes have been changed.
In CSV, by contrast, we see just one long changed line, and even with better tools we need to count the commas to know which attribute it was.
</p>

<p>
The <code>relpipe-out-csv</code> tool generates data types only when explicitly asked for: <code>--write-types true</code>.
The <code>relpipe-in-csv</code> tool automatically looks for these type declarations.
If all attributes have valid type declarations, they are used; otherwise they are considered to be a part of the attribute names.
This behavior can be disabled by <code>--read-types false</code> (<code>true</code> will require valid type declarations).
</p>
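
<p>
The all-or-nothing rule described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the code of <code>relpipe-in-csv</code>:
the function name is hypothetical, <code>"auto"</code> is merely a label used here for the default behavior, and untyped attributes are assumed to be strings.
</p>

<m:pre jazyk="text"><![CDATA[
```python
KNOWN_TYPES = {"string", "integer", "boolean"}

def read_header(fields, read_types="auto"):
    """Return a list of (name, type) pairs for a CSV header.

    read_types: "auto"  - use declarations only if every field has a valid one
                "true"  - require valid declarations, otherwise raise ValueError
                "false" - never interpret declarations
    """
    if read_types == "false":
        return [(f, "string") for f in fields]

    # Assumption: the last '::' separates the name from the type.
    parsed = [f.rpartition("::") for f in fields]
    all_valid = all(sep and t in KNOWN_TYPES for _, sep, t in parsed)

    if all_valid:
        return [(name, t) for name, _, t in parsed]
    if read_types == "true":
        raise ValueError("header lacks valid type declarations")
    # Fallback: the declarations stay part of the attribute names.
    return [(f, "string") for f in fields]

print(read_header(["path::string", "size::integer"]))
# [('path', 'string'), ('size', 'integer')]
print(read_header(["path::string", "size"]))
# [('path::string', 'string'), ('size', 'string')]
```
]]></m:pre>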


<h2>Recognize data types using relpipe-tr-infertypes</h2>

<p>
Sometimes we may also want to infer data types from the values automatically, without any explicit declaration.
Then we put the <code>relpipe-tr-infertypes</code> tool in our pipeline.
It buffers whole relations and checks all values of each attribute.
If all of them are integers or all of them are booleans, the attribute is converted to that type.
</p>

<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-tr-infertypes | relpipe-out-tabular
csv:
╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
│ path (string)   │ type (string) │ size (integer) │ owner (string) │ group (string) │
├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
│ license/        │ d             │              0 │ hacker         │ hacker         │
│ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2]]></m:pre>
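
<p>
The inference rule can be sketched roughly as follows.
This is a minimal Python illustration, not the actual <code>relpipe-tr-infertypes</code> code:
the accepted integer syntax (optional minus sign and decimal digits), the boolean literals <code>true</code>/<code>false</code>
and the treatment of an empty relation as all strings are assumptions made for the example.
</p>

<m:pre jazyk="text"><![CDATA[
```python
import re

# Assumed value syntax; the real tool may accept other forms.
INTEGER = re.compile(r"-?[0-9]+")
BOOLEANS = {"true", "false"}

def infer_type(values):
    """Infer the type of one attribute from all of its buffered values."""
    if values and all(INTEGER.fullmatch(v) for v in values):
        return "integer"
    if values and all(v in BOOLEANS for v in values):
        return "boolean"
    return "string"

def infer_types(names, records):
    """Buffer the whole relation, then decide the type column by column."""
    records = list(records)
    columns = list(zip(*records)) if records else [()] * len(names)
    return [(name, infer_type(col)) for name, col in zip(names, columns)]

names = ["path", "type", "size", "owner", "group"]
records = [
    ["license/", "d", "0", "hacker", "hacker"],
    ["license/gpl.txt", "f", "35147", "hacker", "hacker"],
]
print(infer_types(names, records))
```
]]></m:pre>

<p>
Only <code>size</code> comes out as an integer here. The need to hold the whole relation in memory before the first record can be emitted is visible in the <code>list(records)</code> call.
</p>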

<p>
This approach is inefficient and goes against streaming; however, it is sometimes useful and convenient for small data coming from external sources.
We can e.g. download some data set from the network, pipe it through <code>relpipe-in-csv</code> + <code>relpipe-tr-infertypes</code> and improve the data quality a bit.
</p>

<p>
We may apply the type inference only to certain relations: <code>--relation "my_relation"</code>
or choose a different mode: <code>--mode data</code> or <code>metadata</code> or <code>auto</code>.
The <code>data</code> mode is the one described above.
In the <code>metadata</code> mode, <code>relpipe-tr-infertypes</code> works similarly to <code>relpipe-in-csv --read-types true</code>.
The <code>auto</code> mode first checks for metadata in the attribute names and, if none is found, falls back to the <code>data</code> mode.
This tool works with any relational data regardless of their original format or source (not only with CSV).
</p>


<h2>No header? Specify types as CLI parameters</h2>

<p>
Some CSV files contain just data – they have no header line containing the column names.
Then we specify the attribute names and data types as CLI parameters of <code>relpipe-in-csv</code>:
</p>

<m:pre jazyk="text"><![CDATA[$ echo -e "a,b,c\nA,B,C" \
    | relpipe-in-csv \
        --relation 'just_data' \
        --attribute 'x' string \
        --attribute 'y' string \
        --attribute 'z' string \
    | relpipe-out-tabular

just_data:
╭────────────┬────────────┬────────────╮
│ x (string) │ y (string) │ z (string) │
├────────────┼────────────┼────────────┤
│ a          │ b          │ c          │
│ A          │ B          │ C          │
╰────────────┴────────────┴────────────╯
Record count: 2]]></m:pre>

<p>
We may also skip an existing header line with <code>tail -n +2</code> and force our own names and types.
However, this will not work if there are multiline values in the header (which is not common);
in such cases we should use some <code>relpipe-tr-*</code> tool to rewrite the names or types
(these tools work with relational data instead of plain text).
</p>
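
<p>
To see why a line-based tool fails on a multiline header, consider the following Python sketch.
It uses only the standard <code>csv</code> module and a made-up two-column input; it illustrates the problem and is not a replacement for the <code>relpipe-tr-*</code> tools.
</p>

<m:pre jazyk="text"><![CDATA[
```python
import csv
import io

# A header whose second column name contains a line break inside quotes.
data = '"id","long\nname"\n"1","x"\n'

# Dropping the first physical line (what tail -n +2 does) cuts the header in half.
naive = data.split("\n", 1)[1]
print(repr(naive))  # 'name"\n"1","x"\n'

# A CSV-aware reader treats the quoted line break as part of one record.
records = list(csv.reader(io.StringIO(data, newline="")))
header, body = records[0], records[1:]
print(header)       # ['id', 'long\nname']
print(body)         # [['1', 'x']]
```
]]></m:pre>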

</text>

</stránka>