relpipe-data/examples-csv-data-types.xml
branch v_0
changeset 329 5bc2bb8b7946
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-csv-data-types.xml	Mon Feb 21 00:43:11 2022 +0100
@@ -0,0 +1,169 @@
+<stránka
+	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+	
+	<nadpis>CSV and data types</nadpis>
+	<perex>declare or recognize integers and booleans in a typeless format</perex>
+	<m:pořadí-příkladu>04800</m:pořadí-příkladu>
+
+	<text xmlns="http://www.w3.org/1999/xhtml">
+		
+		<p>
+			CSV (<m:a href="4180" typ="rfc">RFC 4180</m:a>) is quite a good solution when we want to store or share relational data in a simple text format –
+			one that is both human-readable and well supported by many existing applications and libraries.
+			We even have ready-to-use GUI editors, the so-called spreadsheets (e.g. LibreOffice Calc).
+			However, such simple formats usually have some drawbacks.
+			CSV may contain only a single relation (<i>table</i>, <i>sheet</i>). This is not a big issue – we can use several files.
+			A more serious problem is the absence of data types – in CSV, everything is just a text string.
+			Thus a lossless conversion to CSV and back used to be impossible.
+		</p>
+		
+		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-tabular
+filesystem:
+ ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
+ │ path   (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
+ ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
+ │ license/        │ d             │              0 │ hacker         │ hacker         │
+ │ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
+ ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
+Record count: 2]]></m:pre>
+
+		<p>Data types are missing in CSV by default:</p>
+		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv 
+"path","type","size","owner","group"
+"license/","d","0","hacker","hacker"
+"license/gpl.txt","f","35147","hacker","hacker"]]></m:pre>
+		
+		<p>The <code>size</code> attribute was an integer and now it is a mere string:</p>
+		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-out-tabular 
+csv:
+ ╭─────────────────┬───────────────┬───────────────┬────────────────┬────────────────╮
+ │ path   (string) │ type (string) │ size (string) │ owner (string) │ group (string) │
+ ├─────────────────┼───────────────┼───────────────┼────────────────┼────────────────┤
+ │ license/        │ d             │ 0             │ hacker         │ hacker         │
+ │ license/gpl.txt │ f             │ 35147         │ hacker         │ hacker         │
+ ╰─────────────────┴───────────────┴───────────────┴────────────────┴────────────────╯
+Record count: 2]]></m:pre>
+
+		
+		<h2>Declare data types in the CSV header</h2>
+		
+		<p>
+			Since <m:name/> <m:a href="release-v0.18">v0.18</m:a> we can encode the data types (currently strings, integers and booleans) in the CSV header and then recover them while reading.
+			Such “CSV with data types” is still valid CSV according to the RFC specification and can be viewed or edited in any CSV-capable software.
+		</p>
+		
+		<p>
+			The attribute name and the data type are separated by the <code>::</code> symbol, e.g. <code>name::string,age::integer,member::boolean</code>.
+			Attribute names may themselves contain <code>::</code> (unlike the data type names).
+		</p>
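+		
+		<p>
+			Because the data type names never contain <code>::</code>, a header field can presumably be split at its last <code>::</code>.
+			The following plain-shell sketch only illustrates that rule; it is not how <code>relpipe-in-csv</code> is implemented, and the attribute name <code>net::address</code> is made up:
+		</p>
+		
+		<m:pre jazyk="text"><![CDATA[$ header='net::address::integer'
+$ echo "name=${header%::*} type=${header##*::}"
+name=net::address type=integer]]></m:pre>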
+		
+		<p>The data type declarations may be added by hand or generated automatically using <code>relpipe-out-csv</code>.</p>
+		
+		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true 
+"path::string","type::string","size::integer","owner::string","group::string"
+"license/","d","0","hacker","hacker"
+"license/gpl.txt","f","35147","hacker","hacker"]]></m:pre>
+
+		<p>The <code>relpipe-out-csv</code> + <code>relpipe-in-csv</code> round-trip no longer degrades the data quality:</p>
+		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true | relpipe-in-csv | relpipe-out-tabular 
+csv:
+ ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
+ │ path   (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
+ ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
+ │ license/        │ d             │              0 │ hacker         │ hacker         │
+ │ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
+ ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
+Record count: 2]]></m:pre>
+
+
+		<p>
+			So we can put e.g. a CSV editor between them while storing and versioning the data in a different format (like XML or Recfile).
+			Such a workflow can be effectively managed by <code>make</code> –
+			<code>make edit</code> will convert the versioned data to CSV and launch the editor,
+			<code>make commit</code> will convert the data back from CSV and commit it in Mercurial, Git or another version control system (VCS).
+		</p>
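+		
+		<p>
+			The recipes behind such targets could boil down to commands like these.
+			This is only a sketch: the file names <code>data.xml</code> and <code>data.csv</code> are hypothetical,
+			and the choice of <code>relpipe-in-xml</code>, <code>relpipe-out-xml</code>, LibreOffice and Mercurial is an assumption, not something prescribed here.
+			Note that <code>relpipe-in-csv</code> names the relation <code>csv</code> unless told otherwise, so a real recipe would probably also set the relation name.
+		</p>
+		
+		<m:pre jazyk="text"><![CDATA[$ # make edit: versioned XML → CSV with data types → spreadsheet
+$ relpipe-in-xml < data.xml | relpipe-out-csv --write-types true > data.csv
+$ libreoffice --calc data.csv
+
+$ # make commit: edited CSV → XML → VCS
+$ relpipe-in-csv < data.csv | relpipe-out-xml > data.xml
+$ hg commit -m 'data updated' data.xml]]></m:pre>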
+		
+		<p>
+			Why put data into the VCS in a format other than CSV?
+			Formats like XML or Recfile may have each attribute on a separate line, which leads to more readable diffs.
+			At a glance we can see which attributes have been changed.
+			In CSV, by contrast, we see just one long changed line, and even with better tools we need to count the commas to know which attribute it was.
+		</p>
+		
+		<p>
+			The <code>relpipe-out-csv</code> tool generates data types only when explicitly asked to: <code>--write-types true</code>.
+			The <code>relpipe-in-csv</code> tool automatically looks for these type declarations.
+			If all attributes have valid type declarations, they are used; otherwise they are considered to be a part of the attribute name.
+			This automatic recognition can be disabled by <code>--read-types false</code>, while <code>--read-types true</code> requires valid type declarations.
+		</p>
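+		
+		<p>
+			For comparison, the round-trip from above can be run with the recognition switched off (the output is not shown here).
+			According to the rule just described, the declarations should then stay in the attribute names, e.g. <code>size::integer</code>, and all attributes should be strings:
+		</p>
+		
+		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true | relpipe-in-csv --read-types false | relpipe-out-tabular]]></m:pre>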
+		
+		
+		<h2>Recognize data types using relpipe-tr-infertypes</h2>
+		
+		<p>
+			Sometimes we may also want to infer data types from the values automatically, without any explicit declaration.
+			Then we put the <code>relpipe-tr-infertypes</code> tool in our pipeline.
+			It buffers whole relations and checks all values of each attribute.
+			If all values of an attribute are integers or all are booleans, the attribute is converted to that type.
+		</p>
+
+		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-tr-infertypes | relpipe-out-tabular
+csv:
+ ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
+ │ path   (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
+ ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
+ │ license/        │ d             │              0 │ hacker         │ hacker         │
+ │ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
+ ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
+Record count: 2]]></m:pre>
+
+		<p>
+			This approach is inefficient and contradicts streaming; however, it is sometimes useful and convenient for small data coming from external sources.
+			We can e.g. download a data set from the network, pipe it through <code>relpipe-in-csv</code> + <code>relpipe-tr-infertypes</code> and improve the data quality a bit.
+		</p>
+		
+		<p>
+			We may apply the type inference only to certain relations: <code>--relation "my_relation"</code>
+			or choose a different mode: <code>--mode data</code>, <code>metadata</code> or <code>auto</code>.
+			The <code>data</code> mode is described above.
+			In the <code>metadata</code> mode, <code>relpipe-tr-infertypes</code> works similarly to <code>relpipe-in-csv --read-types true</code>.
+			The <code>auto</code> mode checks for the metadata in attribute names first and, if none is found, falls back to the <code>data</code> mode.
+			This tool works with any relational data regardless of their original format or source (not only with CSV).
+		</p>
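+
+		<p>
+			A sketch of how these options might be combined (the order of the options is an assumption and the output is omitted).
+			The first pipeline leaves the declarations in the attribute names and lets <code>relpipe-tr-infertypes</code> interpret them;
+			the second one inspects only the values of the relation <code>csv</code>:
+		</p>
+
+		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true | relpipe-in-csv --read-types false | relpipe-tr-infertypes --mode metadata | relpipe-out-tabular
+
+$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-tr-infertypes --relation "csv" --mode data | relpipe-out-tabular]]></m:pre>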
+
+				
+		<h2>No header? Specify types as CLI parameters</h2>
+		
+		<p>
+			Some CSV files contain just data – they have no header line with the column names.
+			Then we specify the attribute names and data types as CLI parameters of <code>relpipe-in-csv</code>:
+		</p>
+
+		<m:pre jazyk="text"><![CDATA[$ echo -e "a,b,c\nA,B,C" \
+	| relpipe-in-csv \
+		--relation 'just_data' \
+			--attribute 'x' string \
+			--attribute 'y' string \
+			--attribute 'z' string \
+	| relpipe-out-tabular
+
+just_data:
+ ╭────────────┬────────────┬────────────╮
+ │ x (string) │ y (string) │ z (string) │
+ ├────────────┼────────────┼────────────┤
+ │ a          │ b          │ c          │
+ │ A          │ B          │ C          │
+ ╰────────────┴────────────┴────────────╯
+Record count: 2]]></m:pre>
+
+		<p>
+			We may also skip an existing header line using <code>tail -n +2</code> and force our own names and types.
+			However, this will not work if there are multiline values in the header (which is not common).
+			In such cases we should use some <code>relpipe-tr-*</code> tool to rewrite the names or types
+			(these tools work with relational data instead of plain text).
+		</p>
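+
+		<p>
+			Skipping the header can be sketched like this.
+			The sample data and the names <code>renamed</code>, <code>a</code>, <code>b</code> and <code>c</code> are made up,
+			the use of <code>integer</code> as a type name on the command line is assumed to mirror the header declarations, and the output is omitted:
+		</p>
+
+		<m:pre jazyk="text"><![CDATA[$ echo -e "x,y,z\n1,2,3\n4,5,6" \
+	| tail -n +2 \
+	| relpipe-in-csv \
+		--relation 'renamed' \
+			--attribute 'a' integer \
+			--attribute 'b' integer \
+			--attribute 'c' integer \
+	| relpipe-out-tabular]]></m:pre>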
+		
+	</text>
+
+</stránka>