--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-csv-data-types.xml Mon Feb 21 00:43:11 2022 +0100
@@ -0,0 +1,169 @@
+<stránka
+ xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+ xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+
+ <nadpis>CSV and data types</nadpis>
+ <perex>declare or recognize integers and booleans in a typeless format</perex>
+ <m:pořadí-příkladu>04800</m:pořadí-příkladu>
+
+ <text xmlns="http://www.w3.org/1999/xhtml">
+
+ <p>
+ CSV (<m:a href="4180" typ="rfc">RFC 4180</m:a>) is quite a good solution when we want to store or share relational data in a simple text format –
+ both human-readable and well supported by many existing applications and libraries.
+ We even have ready-to-use GUI editors, the so-called spreadsheets (e.g. LibreOffice Calc).
+ However, such simple formats usually have some drawbacks.
+ CSV may contain only a single relation (<i>table</i>, <i>sheet</i>). This is not a big issue – we can use several files.
+ A more serious problem is the absence of data types – in CSV, everything is just a text string.
+ Thus a lossless conversion to CSV and back used to be impossible.
+ </p>
+
+ <m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-tabular
+filesystem:
+ ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
+ │ path (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
+ ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
+ │ license/ │ d │ 0 │ hacker │ hacker │
+ │ license/gpl.txt │ f │ 35147 │ hacker │ hacker │
+ ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
+Record count: 2]]></m:pre>
+
+ <p>Data types are missing in CSV by default:</p>
+ <m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv
+"path","type","size","owner","group"
+"license/","d","0","hacker","hacker"
+"license/gpl.txt","f","35147","hacker","hacker"]]></m:pre>
+
+ <p>The <code>size</code> attribute was an integer and now it is a mere string:</p>
+ <m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-out-tabular
+csv:
+ ╭─────────────────┬───────────────┬───────────────┬────────────────┬────────────────╮
+ │ path (string) │ type (string) │ size (string) │ owner (string) │ group (string) │
+ ├─────────────────┼───────────────┼───────────────┼────────────────┼────────────────┤
+ │ license/ │ d │ 0 │ hacker │ hacker │
+ │ license/gpl.txt │ f │ 35147 │ hacker │ hacker │
+ ╰─────────────────┴───────────────┴───────────────┴────────────────┴────────────────╯
+Record count: 2]]></m:pre>
+
+
+ <h2>Declare data types in the CSV header</h2>
+
+ <p>
+ Since <m:name/> <m:a href="release-v0.18">v0.18</m:a> we can encode the data types (currently strings, integers and booleans) in the CSV header and then recover them while reading.
+ Such a „CSV with data types“ file is still valid CSV according to the RFC specification and can be viewed or edited in any CSV-capable software.
+ </p>
+
+ <p>
+ The attribute name and data type are separated by the <code>::</code> symbol, e.g. <code>name::string,age::integer,member::boolean</code>.
+ Attribute names may contain <code>::</code> (unlike the data type names).
+ </p>
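+
+ <p>
+ A minimal sketch of how such a header cell can be split, assuming that the data type is whatever follows the last <code>::</code>
+ (the rule above implies this, but the actual parser in <code>relpipe-in-csv</code> may differ in details).
+ The <code>split_decl</code> function is hypothetical and uses only POSIX shell parameter expansion:
+ </p>
+
+ <m:pre jazyk="bash"><![CDATA[# Hypothetical helper, not part of relpipe.
+# Assumption: the type name is the text after the LAST "::" in the cell.
+split_decl() {
+	cell=$1
+	case $cell in
+		*::*) ;;  # the cell contains a separator, continue
+		*) echo "no type declaration: $cell" >&2; return 1 ;;
+	esac
+	# ${cell%::*}  removes the shortest suffix matching "::*" (keeps the name)
+	# ${cell##*::} removes the longest prefix matching "*::" (keeps the type)
+	printf '%s | %s\n' "${cell%::*}" "${cell##*::}"
+}
+
+split_decl 'size::integer'        # prints: size | integer
+split_decl 'std::vector::string'  # prints: std::vector | string]]></m:pre>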
+
+ <p>The data type declarations may be added simply by hand or automatically using <code>relpipe-out-csv</code>.</p>
+
+ <m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true
+"path::string","type::string","size::integer","owner::string","group::string"
+"license/","d","0","hacker","hacker"
+"license/gpl.txt","f","35147","hacker","hacker"]]></m:pre>
+
+ <p>The <code>relpipe-out-csv</code> + <code>relpipe-in-csv</code> round-trip no longer degrades the data quality:</p>
+ <m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true | relpipe-in-csv | relpipe-out-tabular
+csv:
+ ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
+ │ path (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
+ ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
+ │ license/ │ d │ 0 │ hacker │ hacker │
+ │ license/gpl.txt │ f │ 35147 │ hacker │ hacker │
+ ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
+Record count: 2]]></m:pre>
+
+
+ <p>
+ So we can put e.g. a CSV editor between them while storing and versioning the data in a different format (like XML or Recfile).
+ Such a workflow can be effectively managed by <code>make</code> –
+ <code>make edit</code> will convert the versioned data to CSV and launch the editor,
+ <code>make commit</code> will convert the data back from CSV and commit it to Mercurial, Git or another version control system (VCS).
+ </p>
+
+ <p>
+ Why store the data in the VCS in a format other than CSV?
+ Formats like XML or Recfile may have each attribute on a separate line, which leads to more readable diffs.
+ At a glance we can see which attributes have been changed.
+ In CSV, on the other hand, we see just one long changed line, and even with better tools we need to count the commas to know which attribute it was.
+ </p>
+
+ <p>
+ The <code>relpipe-out-csv</code> tool generates data types only when explicitly asked for: <code>--write-types true</code>.
+ The <code>relpipe-in-csv</code> tool automatically looks for these type declarations
+ and uses them if all attributes have valid type declarations; otherwise it treats them as a part of the attribute name.
+ This behavior can be disabled by <code>--read-types false</code> (<code>true</code> will require valid type declarations).
+ </p>
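+
+ <p>
+ The all-or-nothing rule can be sketched in POSIX shell as follows.
+ This is only an illustration: the <code>header_mode</code> function is hypothetical,
+ the list of type names is taken from the three types mentioned above
+ and the real <code>relpipe-in-csv</code> may validate the declarations differently.
+ </p>
+
+ <m:pre jazyk="bash"><![CDATA[# Hypothetical helper, not part of relpipe.
+# Takes the header cells as arguments and reports how the header would be read.
+header_mode() {
+	for cell in "$@"; do
+		case $cell in
+			*::string|*::integer|*::boolean) ;;  # valid declaration, check the next cell
+			*) echo untyped; return ;;           # one invalid cell disables types for all
+		esac
+	done
+	echo typed
+}
+
+header_mode 'path::string' 'size::integer'  # prints: typed
+header_mode 'path::string' 'size'           # prints: untyped]]></m:pre>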
+
+
+ <h2>Recognize data types using relpipe-tr-infertypes</h2>
+
+ <p>
+ Sometimes we may also want to infer data types from the values automatically without any explicit declaration.
+ Then we put the <code>relpipe-tr-infertypes</code> tool in our pipeline.
+ It buffers whole relations and checks all values of each attribute.
+ If all of them are integers or all of them are booleans, the attribute is converted to that type.
+ </p>
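+
+ <p>
+ The idea can be sketched for a single attribute in POSIX shell.
+ The <code>infer_type</code> function is hypothetical and makes several assumptions that may not match the real tool:
+ values arrive one per line, an integer is an optional minus sign followed by digits
+ and booleans are written literally as <code>true</code> or <code>false</code>.
+ Like the real tool, it has to read all the values before it can decide:
+ </p>
+
+ <m:pre jazyk="bash"><![CDATA[# Hypothetical helper, not part of relpipe.
+# Reads all values of one attribute from standard input (one per line).
+infer_type() {
+	values=$(cat)  # buffer everything, no streaming is possible here
+	# grep -v selects the lines that do NOT match, so a failing grep
+	# means that every value matched the pattern
+	if ! printf '%s\n' "$values" | grep -qvE '^-?[0-9]+$'; then
+		echo integer
+	elif ! printf '%s\n' "$values" | grep -qvE '^(true|false)$'; then
+		echo boolean
+	else
+		echo string
+	fi
+}
+
+printf '0\n35147\n' | infer_type     # prints: integer
+printf 'true\nfalse\n' | infer_type  # prints: boolean
+printf '0\ntrue\n' | infer_type      # prints: string]]></m:pre>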
+
+ <m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-tr-infertypes | relpipe-out-tabular
+csv:
+ ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
+ │ path (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
+ ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
+ │ license/ │ d │ 0 │ hacker │ hacker │
+ │ license/gpl.txt │ f │ 35147 │ hacker │ hacker │
+ ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
+Record count: 2]]></m:pre>
+
+ <p>
+ This approach is inefficient and contradicts streaming; however, it is sometimes useful and convenient for small data coming from external sources.
+ We can e.g. download a data set from the network, pipe it through <code>relpipe-in-csv</code> + <code>relpipe-tr-infertypes</code> and improve the data quality a bit.
+ </p>
+
+ <p>
+ We may apply the type inference only to certain relations: <code>--relation "my_relation"</code>
+ or choose a different mode: <code>--mode data</code> or <code>metadata</code> or <code>auto</code>.
+ The <code>data</code> mode is described above.
+ In the <code>metadata</code> mode, <code>relpipe-tr-infertypes</code> works similarly to <code>relpipe-in-csv --read-types true</code>.
+ The <code>auto</code> mode checks for the metadata in attribute names first and, if none is found, falls back to the <code>data</code> mode.
+ This tool works with any relational data regardless of their original format or source (not only with CSV).
+ </p>
+
+
+ <h2>No header? Specify types as CLI parameters</h2>
+
+ <p>
+ Some CSV files contain just data – they have no header line containing the column names.
+ Then we specify the attribute names and data types as CLI parameters of <code>relpipe-in-csv</code>:
+ </p>
+
+ <m:pre jazyk="text"><![CDATA[$ echo -e "a,b,c\nA,B,C" \
+ | relpipe-in-csv \
+ --relation 'just_data' \
+ --attribute 'x' string \
+ --attribute 'y' string \
+ --attribute 'z' string \
+ | relpipe-out-tabular
+
+just_data:
+ ╭────────────┬────────────┬────────────╮
+ │ x (string) │ y (string) │ z (string) │
+ ├────────────┼────────────┼────────────┤
+ │ a │ b │ c │
+ │ A │ B │ C │
+ ╰────────────┴────────────┴────────────╯
+Record count: 2]]></m:pre>
+
+ <p>
+ We may also skip an existing header line with <code>tail -n +2</code> and force our own names and types.
+ However, this will not work if the header contains multiline values (which is uncommon).
+ In such cases we should use some <code>relpipe-tr-*</code> tool to rewrite the names or types
+ (these tools work with relational data instead of plain text).
+ </p>
+
+ </text>
+
+</stránka>