<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

	<nadpis>CSV and data types</nadpis>
	<perex>declare or recognize integers and booleans in a typeless format</perex>
	<m:pořadí-příkladu>04800</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">

		<p>
			CSV (<m:a href="4180" typ="rfc">RFC 4180</m:a>) is quite a good solution when we want to store or share relational data in a simple text format –
			both human-readable and well supported by many existing applications and libraries.
			We even have ready-to-use GUI editors, so-called spreadsheets (e.g. LibreOffice Calc).
			However, such simple formats usually have some drawbacks.
			CSV may contain only a single relation (<i>table</i>, <i>sheet</i>). This is not a big issue – we can use several files.
			A more serious problem is the absence of data types – in CSV, everything is just a text string.
			Thus a lossless conversion to CSV and back was not possible.
		</p>

		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-tabular
filesystem:
╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
│ path (string)   │ type (string) │ size (integer) │ owner (string) │ group (string) │
├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
│ license/        │ d             │              0 │ hacker         │ hacker         │
│ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2]]></m:pre>

		<p>Data types are missing in CSV by default:</p>
		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv
"path","type","size","owner","group"
"license/","d","0","hacker","hacker"
"license/gpl.txt","f","35147","hacker","hacker"]]></m:pre>

		<p>The <code>size</code> attribute was an integer and now it is a mere string:</p>
		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-out-tabular
csv:
╭─────────────────┬───────────────┬───────────────┬────────────────┬────────────────╮
│ path (string)   │ type (string) │ size (string) │ owner (string) │ group (string) │
├─────────────────┼───────────────┼───────────────┼────────────────┼────────────────┤
│ license/        │ d             │ 0             │ hacker         │ hacker         │
│ license/gpl.txt │ f             │ 35147         │ hacker         │ hacker         │
╰─────────────────┴───────────────┴───────────────┴────────────────┴────────────────╯
Record count: 2]]></m:pre>


		<h2>Declare data types in the CSV header</h2>

		<p>
			Since <m:name/> <m:a href="release-v0.18">v0.18</m:a>, we can encode the data types (currently strings, integers and booleans) in the CSV header and then recover them while reading.
			Such “CSV with data types” is valid CSV according to the RFC specification and can be viewed or edited in any CSV-capable software.
		</p>

		<p>
			The attribute name and the data type are separated by the <code>::</code> symbol, e.g. <code>name::string,age::integer,member::boolean</code>.
			Attribute names may themselves contain <code>::</code> (unlike the data type names).
		</p>
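
		<p>
			To illustrate the rule, here is a minimal shell sketch of how one header field could be split.
			It assumes that the data type is whatever follows the last <code>::</code>, which is our reading of the statement that names may contain the separator;
			the helper <code>split_field</code> is hypothetical and not part of <m:name/>.
		</p>
		<m:pre jazyk="text"><![CDATA[# Hypothetical helper (not part of relpipe): split one header field
# into attribute name and data type at the LAST "::".
split_field() {
	field=$1
	name=${field%::*}    # drop the shortest suffix that starts with "::"
	type=${field##*::}   # drop the longest prefix that ends with "::"
	printf '%s\t%s\n' "$name" "$type"
}

split_field 'age::integer'     # prints: age<TAB>integer
split_field 'a::b::boolean'    # prints: a::b<TAB>boolean]]></m:pre>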

		<p>The data type declarations can be added by hand or generated automatically by <code>relpipe-out-csv</code>:</p>

		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true
"path::string","type::string","size::integer","owner::string","group::string"
"license/","d","0","hacker","hacker"
"license/gpl.txt","f","35147","hacker","hacker"]]></m:pre>

		<p>The <code>relpipe-out-csv</code> + <code>relpipe-in-csv</code> round trip no longer degrades the data quality:</p>
		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true | relpipe-in-csv | relpipe-out-tabular
csv:
╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
│ path (string)   │ type (string) │ size (integer) │ owner (string) │ group (string) │
├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
│ license/        │ d             │              0 │ hacker         │ hacker         │
│ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2]]></m:pre>


		<p>
			So we can put, for example, a CSV editor between them while storing and versioning the data in a different format (like XML or Recfile).
			Such a workflow can be effectively managed by <code>make</code> –
			<code>make edit</code> converts the versioned data to CSV and launches the editor,
			<code>make commit</code> converts the data back from CSV and commits it to Mercurial, Git or another version control system (VCS).
		</p>
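
		<p>
			The two steps might look roughly like the following shell functions, which could serve as the recipes of the <code>edit</code> and <code>commit</code> targets.
			This is only a sketch under several assumptions: the file names, the choice of Recfile as the versioned format (with <code>relpipe-in-recfile</code> and <code>relpipe-out-recfile</code> as converters),
			the editor command and the Mercurial call are all ours and should be adapted.
		</p>
		<m:pre jazyk="text"><![CDATA[# Sketch only: file names, converters, editor and VCS command are assumptions.
DATA=${DATA:-data.rec}                          # versioned file (Recfile assumed)
CSV=${CSV:-data.csv}                            # temporary file for the editor
CSV_EDITOR=${CSV_EDITOR:-libreoffice --calc}    # any CSV-capable editor

to_csv()     { relpipe-in-recfile | relpipe-out-csv --write-types true; }
from_csv()   { relpipe-in-csv --read-types true | relpipe-out-recfile; }
vcs_commit() { hg commit -m 'data edited as CSV' "$@"; }

# What a "make edit" recipe might do: convert to typed CSV, open the editor.
edit() {
	to_csv < "$DATA" > "$CSV" && $CSV_EDITOR "$CSV"
}

# What a "make commit" recipe might do: convert back, then commit.
commit() {
	from_csv < "$CSV" > "$DATA" && vcs_commit "$DATA"
}]]></m:pre>
		<p>
			Writing the types into the header on the way out and requiring them on the way back keeps the round trip lossless.
		</p>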

		<p>
			Why store the data in the VCS in a format other than CSV?
			Formats like XML or Recfile may have each attribute on a separate line, which leads to more readable diffs.
			At a glance we can see which attributes have been changed.
			In CSV, on the other hand, we see just one long changed line, and even with better tools we need to count the commas to find out which attribute it was.
		</p>

		<p>
			The <code>relpipe-out-csv</code> tool generates data types only when explicitly asked to: <code>--write-types true</code>.
			The <code>relpipe-in-csv</code> tool automatically looks for these type declarations:
			if all attributes have valid type declarations, they are used; otherwise they are considered to be a part of the attribute names.
			This automatic detection can be disabled by <code>--read-types false</code>, while <code>--read-types true</code> requires valid type declarations.
		</p>
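
		<p>
			The all-or-nothing rule can be sketched in plain shell as follows.
			The helper <code>header_is_typed</code> is hypothetical and only handles simple headers (no commas or line breaks inside the names);
			it is meant to illustrate the decision, not to reproduce the actual parser of <code>relpipe-in-csv</code>.
		</p>
		<m:pre jazyk="text"><![CDATA[# Hypothetical helper: succeed only if EVERY header field ends with
# a known type declaration; a single untyped field makes it fail.
header_is_typed() {
	! printf '%s\n' "$1" | tr -d '"' | tr ',' '\n' \
		| grep -qvE '::(string|integer|boolean)$'
}

header_is_typed '"path::string","size::integer"' \
	&& echo 'declarations are used as types'

header_is_typed '"path::string","size"' \
	|| echo 'declarations stay part of the attribute names']]></m:pre>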


		<h2>Recognize data types using relpipe-tr-infertypes</h2>

		<p>
			Sometimes we may also want to infer data types from the values automatically, without any explicit declaration.
			Then we put the <code>relpipe-tr-infertypes</code> tool in our pipeline.
			It buffers whole relations and checks all values of each attribute.
			If all of them are integers or all of them are booleans, the attribute is converted to the given type.
		</p>

		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-tr-infertypes | relpipe-out-tabular
csv:
╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
│ path (string)   │ type (string) │ size (integer) │ owner (string) │ group (string) │
├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
│ license/        │ d             │              0 │ hacker         │ hacker         │
│ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2]]></m:pre>

		<p>
			This approach is inefficient and contradicts streaming; however, it is sometimes useful and convenient for small data coming from external sources.
			We can, for example, download a data set from the network, pipe it through <code>relpipe-in-csv</code> + <code>relpipe-tr-infertypes</code> and improve the data quality a bit.
		</p>
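
		<p>
			The need for buffering becomes obvious when we sketch the inference for a single attribute: the type is known only after the last value has been seen.
			The helper <code>infer_type</code> below is hypothetical, and the accepted literals (decimal digits with an optional minus sign, <code>true</code> and <code>false</code>) are our assumption rather than the documented behavior of <code>relpipe-tr-infertypes</code>.
		</p>
		<m:pre jazyk="text"><![CDATA[# Hypothetical helper: read all values of one attribute (one per line)
# and print the narrowest type that fits every one of them.
infer_type() {
	awk '
		{ n++ }
		/^-?[0-9]+$/     { ints++ }
		/^(true|false)$/ { bools++ }
		END {
			if (n > 0 && ints == n)       print "integer"
			else if (n > 0 && bools == n) print "boolean"
			else                          print "string"
		}
	'
}

# Third column of a simple CSV without quoted commas:
printf 'path,type,size\nlicense/,d,0\nlicense/gpl.txt,f,35147\n' \
	| tail -n +2 | cut -d, -f3 | infer_type    # prints: integer]]></m:pre>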

		<p>
			We may apply the type inference only to certain relations: <code>--relation "my_relation"</code>
			or choose a different mode: <code>--mode data</code> or <code>metadata</code> or <code>auto</code>.
			The <code>data</code> mode is described above.
			In the <code>metadata</code> mode, <code>relpipe-tr-infertypes</code> works similarly to <code>relpipe-in-csv --read-types true</code>.
			The <code>auto</code> mode checks for the metadata in attribute names first and if none is found, it falls back to the <code>data</code> mode.
			This tool works with any relational data regardless of their original format or source (not only with CSV).
		</p>


		<h2>No header? Specify types as CLI parameters</h2>

		<p>
			Some CSV files contain just data and have no header line with the column names.
			Then we specify the attribute names and data types as CLI parameters of <code>relpipe-in-csv</code>:
		</p>

		<m:pre jazyk="text"><![CDATA[$ echo -e "a,b,c\nA,B,C" \
	| relpipe-in-csv \
		--relation 'just_data' \
		--attribute 'x' string \
		--attribute 'y' string \
		--attribute 'z' string \
	| relpipe-out-tabular

just_data:
╭────────────┬────────────┬────────────╮
│ x (string) │ y (string) │ z (string) │
├────────────┼────────────┼────────────┤
│ a          │ b          │ c          │
│ A          │ B          │ C          │
╰────────────┴────────────┴────────────╯
Record count: 2]]></m:pre>

		<p>
			We may also skip an existing header line with <code>tail -n +2</code> and force our own names and types.
			However, this will not work if there are multiline values in the header (which is not common).
			In such cases we should use some <code>relpipe-tr-*</code> tool to rewrite the names or types
			(these tools work with relational data instead of plain text).
		</p>
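
		<p>
			A small sketch shows both the common case and the multiline pitfall.
			The helper <code>skip_header</code> and the sample data are made up for illustration; the last step runs only if <code>relpipe-in-csv</code> is installed,
			and using <code>integer</code> as the type of an <code>--attribute</code> is our assumption.
		</p>
		<m:pre jazyk="text"><![CDATA[# Hypothetical helper: drop the first physical line of the input.
skip_header() {
	tail -n +2
}

# Single-line header: exactly the header record is removed.
printf 'name,age\nAlice,30\n' | skip_header
# prints: Alice,30

# Header with a quoted line break: the header record spans two physical
# lines, so only its first half is removed and a broken fragment remains.
printf '"first\nname",age\nAlice,30\n' | skip_header
# prints: name",age
#         Alice,30

# With relpipe installed, our own names and types can then be forced:
if command -v relpipe-in-csv >/dev/null 2>&1; then
	printf 'name,age\nAlice,30\n' | skip_header \
		| relpipe-in-csv --relation 'people' --attribute 'name' string --attribute 'age' integer \
		| relpipe-out-tabular
fi]]></m:pre>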

	</text>

</stránka>