relpipe-data/examples-csv-data-types.xml
author František Kučera <franta-hg@frantovo.cz>
Mon, 21 Feb 2022 01:21:22 +0100
branch v_0
changeset 330 70e7eb578cfa
parent 329 5bc2bb8b7946
permissions -rw-r--r--
Added tag relpipe-v0.18 for changeset 5bc2bb8b7946

<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>CSV and data types</nadpis>
	<perex>declare or recognize integers and booleans in a typeless format</perex>
	<m:pořadí-příkladu>04800</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			CSV (<m:a href="4180" typ="rfc">RFC 4180</m:a>) is quite a good solution when we want to store or share relational data in a simple text format –
			one that is both human-readable and well supported by many existing applications and libraries.
			We even have ready-to-use GUI editors, the so-called spreadsheets (e.g. LibreOffice Calc).
			However, such simple formats usually have some drawbacks.
			CSV may contain only a single relation (<i>table</i>, <i>sheet</i>). This is not a big issue – we can use several files.
			A more serious problem is the absence of data types – in CSV, everything is just a text string.
			Thus a lossless conversion to CSV and back was impossible.
		</p>
		
		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-tabular
filesystem:
 ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
 │ path   (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
 ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
 │ license/        │ d             │              0 │ hacker         │ hacker         │
 │ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
 ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2]]></m:pre>

		<p>Data types are missing in CSV by default:</p>
		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv 
"path","type","size","owner","group"
"license/","d","0","hacker","hacker"
"license/gpl.txt","f","35147","hacker","hacker"]]></m:pre>
		
		<p>The <code>size</code> attribute was an integer and now it is a mere string:</p>
		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-out-tabular 
csv:
 ╭─────────────────┬───────────────┬───────────────┬────────────────┬────────────────╮
 │ path   (string) │ type (string) │ size (string) │ owner (string) │ group (string) │
 ├─────────────────┼───────────────┼───────────────┼────────────────┼────────────────┤
 │ license/        │ d             │ 0             │ hacker         │ hacker         │
 │ license/gpl.txt │ f             │ 35147         │ hacker         │ hacker         │
 ╰─────────────────┴───────────────┴───────────────┴────────────────┴────────────────╯
Record count: 2]]></m:pre>

		
		<h2>Declare data types in the CSV header</h2>
		
		<p>
			Since <m:name/> <m:a href="release-v0.18">v0.18</m:a> we can encode the data types (currently strings, integers and booleans) in the CSV header and then recover them while reading.
			Such “CSV with data types” is valid CSV according to the RFC specification and can be viewed or edited in any CSV-capable software.
		</p>
		
		<p>
			The attribute name and data type are separated by the <code>::</code> symbol, e.g. <code>name::string,age::integer,member::boolean</code>.
			Attribute names may contain <code>::</code> (unlike the data type names).
		</p>
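		
		<p>
			Because only the attribute names may contain <code>::</code>, a reader of such a header presumably has to split each field at the last occurrence of the separator.
			The following Bash sketch illustrates that rule; the function name <code>split_header_field</code> is hypothetical and this is not the actual <code>relpipe-in-csv</code> code:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Split one header field into attribute name and data type at the LAST "::".
# Illustration only; assumes the field really contains "::".
split_header_field() {
	local field="$1"
	local name="${field%::*}"   # drop the shortest suffix starting with "::"
	local type="${field##*::}"  # drop the longest prefix ending with "::"
	printf '%s\t%s\n' "$name" "$type"
}

split_header_field 'age::integer'         # prints: age<TAB>integer
split_header_field 'ns::member::boolean'  # prints: ns::member<TAB>boolean]]></m:pre>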
		
		<p>The data type declarations may be added simply by hand or automatically using <code>relpipe-out-csv</code>.</p>
		
		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true 
"path::string","type::string","size::integer","owner::string","group::string"
"license/","d","0","hacker","hacker"
"license/gpl.txt","f","35147","hacker","hacker"]]></m:pre>

		<p>The <code>relpipe-out-csv</code> + <code>relpipe-in-csv</code> round-trip no longer degrades the data quality:</p>
		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true | relpipe-in-csv | relpipe-out-tabular 
csv:
 ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
 │ path   (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
 ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
 │ license/        │ d             │              0 │ hacker         │ hacker         │
 │ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
 ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2]]></m:pre>


		<p>
			So we can, for example, put a CSV editor between them while storing and versioning the data in a different format (like XML or Recfile).
			Such a workflow can be effectively managed by <code>make</code> –
			<code>make edit</code> will convert versioned data to CSV and launch the editor,
			<code>make commit</code> will convert data back from the CSV and commit them in Mercurial, Git or other version control system (VCS).
		</p>
		
		<p>
			Why put the data into the VCS in a format other than CSV?
			Formats like XML or Recfile may have each attribute on a separate line, which leads to more readable diffs.
			At a glance we can see which attributes have been changed.
			In CSV, on the other hand, we see just one long changed line, and even with better tools we need to count the commas to know which attribute it was.
		</p>
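		
		<p>
			The difference can be tried out with plain <code>diff</code>.
			In the sketch below, the hypothetical helper <code>one_per_line</code> only imitates a one-attribute-per-line layout (it is not a real Recfile writer)
			and assumes a simple CSV with a header line and no quoted commas; the sample records are made up for the illustration:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Print each record of a simple CSV as "name: value" lines followed by an empty line.
one_per_line() {
	awk -F, '
		NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
		{ for (i = 1; i <= NF; i++) print name[i] ": " $i; print "" }'
}

old=$'path,type,size,owner\nlicense/gpl.txt,f,35147,hacker'
new=$'path,type,size,owner\nlicense/gpl.txt,f,35150,hacker'

# CSV: the whole record line is reported as changed
# (diff exits with status 1 when the inputs differ, hence "|| true")
diff <(echo "$old") <(echo "$new") || true

# one attribute per line: only the "size" line is reported
diff <(echo "$old" | one_per_line) <(echo "$new" | one_per_line) || true]]></m:pre>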
		
		<p>
			The <code>relpipe-out-csv</code> tool writes the type declarations only when explicitly asked to: <code>--write-types true</code>.
			The <code>relpipe-in-csv</code> tool automatically looks for these type declarations.
			If all attributes have valid type declarations, they are used; otherwise they are considered to be a part of the attribute name.
			This behavior can be disabled by <code>--read-types false</code> (<code>true</code> will require valid type declarations).
		</p>
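		
		<p>
			The all-or-nothing rule of the automatic mode can be sketched in Bash as follows.
			This is only an illustration under simplifying assumptions (no commas inside quoted values, only the three type names mentioned above);
			the function name <code>header_has_types</code> is hypothetical and the real tool may differ in details:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Succeed only if EVERY field of the header line ends with a known "::type".
header_has_types() {
	local -a fields
	local field
	IFS=',' read -r -a fields <<< "$1"
	for field in "${fields[@]}"; do
		field="${field#\"}"; field="${field%\"}"  # strip optional surrounding quotes
		case "$field" in
			*::string|*::integer|*::boolean) ;;  # valid declaration, check the next field
			*) return 1 ;;                        # a single miss disables types for the whole header
		esac
	done
	return 0
}

header_has_types '"path::string","size::integer"' && echo 'declarations are used'
header_has_types '"path::string","size"' || echo 'everything stays part of the attribute names']]></m:pre>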
		
		
		<h2>Recognize data types using relpipe-tr-infertypes</h2>
		
		<p>
			Sometimes we may also want to infer data types from the values automatically without any explicit declaration.
			Then we put the <code>relpipe-tr-infertypes</code> tool in our pipeline.
			It buffers whole relations and checks all values of each attribute.
			If all values of an attribute are integers, or all of them are booleans, the attribute is converted to that type.
		</p>

		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-tr-infertypes | relpipe-out-tabular
csv:
 ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
 │ path   (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
 ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
 │ license/        │ d             │              0 │ hacker         │ hacker         │
 │ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
 ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
Record count: 2]]></m:pre>

		<p>
			This approach is inefficient and contradicts streaming; however, it is sometimes useful and convenient for small data coming from external sources.
			We can, for example, download a data set from the network, pipe it through <code>relpipe-in-csv</code> + <code>relpipe-tr-infertypes</code> and improve the data quality a bit.
		</p>
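		
		<p>
			Why the inference cannot be streamed is visible in a small <code>awk</code> sketch: the type of an attribute is known only after its last value has been seen.
			The sketch prints just the resulting typed header and is not the <code>relpipe-tr-infertypes</code> implementation.
			The rules used here (integer = optional minus sign and digits, boolean = <code>true</code> or <code>false</code>), the function name <code>infer_types</code> and the sample data are assumptions made for the illustration:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Read a simple CSV (first line = header, no quoted commas) and print a typed header.
infer_types() {
	awk -F, '
		NR == 1 {
			n = NF
			for (i = 1; i <= n; i++) { name[i] = $i; int_ok[i] = 1; bool_ok[i] = 1 }
			next
		}
		{
			# one non-matching value is enough to rule a type out
			for (i = 1; i <= n; i++) {
				if ($i !~ /^-?[0-9]+$/)     int_ok[i] = 0
				if ($i !~ /^(true|false)$/) bool_ok[i] = 0
			}
			rows++
		}
		END {
			# only now, after the last record, are the types known
			for (i = 1; i <= n; i++) {
				type = "string"
				if (rows > 0 && int_ok[i]) type = "integer"
				else if (rows > 0 && bool_ok[i]) type = "boolean"
				printf "%s%s::%s", (i > 1 ? "," : ""), name[i], type
			}
			print ""
		}'
}

printf 'path,size,hidden\nlicense/,0,false\nlicense/gpl.txt,35147,false\n' | infer_types
# prints: path::string,size::integer,hidden::boolean]]></m:pre>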
		
		<p>
			We may apply the type inference only to certain relations: <code>--relation "my_relation"</code>
			or choose a different mode: <code>--mode data</code> or <code>metadata</code> or <code>auto</code>.
			The <code>data</code> mode is described above.
			In the <code>metadata</code> mode, <code>relpipe-tr-infertypes</code> works similarly to <code>relpipe-in-csv --read-types true</code>.
			The <code>auto</code> mode checks for the metadata in attribute names first and, if none are found, falls back to the <code>data</code> mode.
			This tool works with any relational data regardless of their original format or source (not only with CSV).
		</p>

				
		<h2>No header? Specify types as CLI parameters</h2>
		
		<p>
			Some CSV files contain just data; they have no header line containing the column names.
			Then we specify the attribute names and data types as CLI parameters of <code>relpipe-in-csv</code>:
		</p>

		<m:pre jazyk="text"><![CDATA[$ echo -e "a,b,c\nA,B,C" \
	| relpipe-in-csv \
		--relation 'just_data' \
			--attribute 'x' string \
			--attribute 'y' string \
			--attribute 'z' string \
	| relpipe-out-tabular

just_data:
 ╭────────────┬────────────┬────────────╮
 │ x (string) │ y (string) │ z (string) │
 ├────────────┼────────────┼────────────┤
 │ a          │ b          │ c          │
 │ A          │ B          │ C          │
 ╰────────────┴────────────┴────────────╯
Record count: 2]]></m:pre>

		<p>
			We may also skip an existing header line with <code>tail -n +2</code> and force our own names and types.
			However, this will not work if there are multiline values in the header (which is not common).
			In such cases we should use some <code>relpipe-tr-*</code> tool to rewrite the names or types
			(these tools work with relational data instead of plain text).
		</p>
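		
		<p>
			Skipping the header might look like the following sketch.
			The sample data, the relation name <code>people</code> and the attribute names are made up for the illustration;
			the <code>relpipe-in-csv</code> options are the same as in the example above, and that part is skipped if the tools are not installed:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[csv=$'name,age\nAlice,30\nBob,25'

# tail -n +2 starts the output at line 2, i.e. drops the original single-line header
body=$(printf '%s\n' "$csv" | tail -n +2)
printf '%s\n' "$body"

# feed the header-less data to relpipe-in-csv with our own names and types
if command -v relpipe-in-csv >/dev/null && command -v relpipe-out-tabular >/dev/null; then
	printf '%s\n' "$body" \
		| relpipe-in-csv \
			--relation 'people' \
				--attribute 'name' string \
				--attribute 'age' integer \
		| relpipe-out-tabular
fi]]></m:pre>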
		
	</text>

</stránka>