relpipe-data/examples-csv-data-types.xml
branch v_0
changeset 329 5bc2bb8b7946
328:cc60c8dd7924 329:5bc2bb8b7946
       
     1 <stránka
       
     2 	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
       
     3 	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
       
     4 	
       
     5 	<nadpis>CSV and data types</nadpis>
       
     6 	<perex>declare or recognize integers and booleans in a typeless format</perex>
       
     7 	<m:pořadí-příkladu>04800</m:pořadí-příkladu>
       
     8 
       
     9 	<text xmlns="http://www.w3.org/1999/xhtml">
       
    10 		
       
    11 		<p>
       
    12 			CSV (<m:a href="4180" typ="rfc">RFC 4180</m:a>) is quite a good solution when we want to store or share relational data in a simple text format –
       
    13 			both human-readable and well supported by many existing applications and libraries.
       
    14 			We even have ready-to-use GUI editors, so-called spreadsheets (e.g. LibreOffice Calc).
       
    15 			However, such simple formats usually have some drawbacks.
       
    16 			CSV may contain only a single relation (<i>table</i>, <i>sheet</i>). This is not a big issue – we can use several files.
       
    17 			A more serious problem is the absence of data types – in CSV, everything is just a text string.
       
    18 			Thus it was impossible to convert data to CSV and back without loss.
       
    19 		</p>
       
    20 		
       
    21 		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-tabular
       
    22 filesystem:
       
    23  ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
       
    24  │ path   (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
       
    25  ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
       
    26  │ license/        │ d             │              0 │ hacker         │ hacker         │
       
    27  │ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
       
    28  ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
       
    29 Record count: 2]]></m:pre>
       
    30 
       
    31 		<p>Data types are missing in CSV by default:</p>
       
    32 		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv 
       
    33 "path","type","size","owner","group"
       
    34 "license/","d","0","hacker","hacker"
       
    35 "license/gpl.txt","f","35147","hacker","hacker"]]></m:pre>
       
    36 		
       
    37 		<p>The <code>size</code> attribute was an integer and now it is a mere string:</p>
       
    38 		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-out-tabular 
       
    39 csv:
       
    40  ╭─────────────────┬───────────────┬───────────────┬────────────────┬────────────────╮
       
    41  │ path   (string) │ type (string) │ size (string) │ owner (string) │ group (string) │
       
    42  ├─────────────────┼───────────────┼───────────────┼────────────────┼────────────────┤
       
    43  │ license/        │ d             │ 0             │ hacker         │ hacker         │
       
    44  │ license/gpl.txt │ f             │ 35147         │ hacker         │ hacker         │
       
    45  ╰─────────────────┴───────────────┴───────────────┴────────────────┴────────────────╯
       
    46 Record count: 2]]></m:pre>
       
    47 
       
    48 		
       
    49 		<h2>Declare data types in the CSV header</h2>
       
    50 		
       
    51 		<p>
       
    52 			Since <m:name/> <m:a href="release-v0.18">v0.18</m:a> we can encode the data types (currently strings, integers and booleans) in the CSV header and then recover them while reading.
       
    53 			Such “CSV with data types” is valid CSV according to the RFC specification and can be viewed or edited in any CSV-capable software.
       
    54 		</p>
       
    55 		
       
    56 		<p>
       
    57 			The attribute name and data type are separated by the <code>::</code> symbol, e.g. <code>name::string,age::integer,member::boolean</code>.
       
    58 			Attribute names may contain <code>::</code> (unlike the data type names).
       
    59 		</p>
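		<p>
			The header convention can be illustrated by a minimal Python sketch.
			It is not the code used by <m:name/>; the helper name <code>split_header_cell</code> is hypothetical
			and splitting at the last <code>::</code> is an assumption derived from the rule that attribute names may contain <code>::</code> while type names may not.
		</p>

```python
# Hypothetical helper, not part of relpipe: split a CSV header cell such as
# "size::integer" into an attribute name and a declared data type.
KNOWN_TYPES = {"string", "integer", "boolean"}  # the types named in the text


def split_header_cell(cell):
    """Return (name, type), or (cell, None) when no valid type is declared."""
    # Split at the last "::" because attribute names may contain "::" themselves.
    name, separator, type_name = cell.rpartition("::")
    if separator and type_name in KNOWN_TYPES:
        return name, type_name
    return cell, None


if __name__ == "__main__":
    for cell in ["name::string", "age::integer", "a::b::boolean", "plain"]:
        print(cell, "->", split_header_cell(cell))
```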
       
    60 		
       
    61 		<p>The data type declarations may be added simply by hand or automatically using <code>relpipe-out-csv</code>.</p>
       
    62 		
       
    63 		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true 
       
    64 "path::string","type::string","size::integer","owner::string","group::string"
       
    65 "license/","d","0","hacker","hacker"
       
    66 "license/gpl.txt","f","35147","hacker","hacker"]]></m:pre>
       
    67 
       
    68 		<p>The <code>relpipe-out-csv</code> + <code>relpipe-in-csv</code> round-trip no longer degrades the data quality:</p>
       
    69 		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv --write-types true | relpipe-in-csv | relpipe-out-tabular 
       
    70 csv:
       
    71  ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
       
    72  │ path   (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
       
    73  ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
       
    74  │ license/        │ d             │              0 │ hacker         │ hacker         │
       
    75  │ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
       
    76  ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
       
    77 Record count: 2]]></m:pre>
       
    78 
       
    79 
       
    80 		<p>
       
    81 			So we can put e.g. a CSV editor between them while storing and versioning the data in a different format (like XML or Recfile).
       
    82 			Such a workflow can be effectively managed by <code>make</code> –
       
    83 			<code>make edit</code> will convert versioned data to CSV and launch the editor,
       
    84 			<code>make commit</code> will convert the data back from CSV and commit it to Mercurial, Git or another version control system (VCS).
       
    85 		</p>
       
    86 		
       
    87 		<p>
       
    88 			Why keep the data in the VCS in a format other than CSV?
       
    89 			Formats like XML or Recfile may have each attribute on a separate line which leads to more readable diffs.
       
    90 			At a glance we can see which attributes have been changed.
       
    91 			In CSV, by contrast, we see just one long changed line, and even with better tools we need to count the commas to know which attribute it was.
       
    92 		</p>
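		<p>
			The difference in diff readability can be demonstrated with a small Python sketch based on the standard <code>difflib</code> module.
			The <code>name: value</code> layout below is only a simplified stand-in for formats like Recfile,
			and the changed <code>size</code> value is made up for illustration.
		</p>

```python
import difflib

# One record from the example relation and a copy with a single changed value.
old = {"path": "license/gpl.txt", "type": "f", "size": "35147",
       "owner": "hacker", "group": "hacker"}
new = dict(old, size="35149")  # illustrative change, not real data


def as_csv_line(record):
    """Whole record on one line, as in CSV."""
    return [",".join('"%s"' % value for value in record.values())]


def as_attribute_lines(record):
    """One attribute per line (simplified, Recfile-like layout)."""
    return ["%s: %s" % (name, value) for name, value in record.items()]


def changed_lines(before, after):
    """Return only the removed and added lines of a unified diff."""
    diff = difflib.unified_diff(before, after, lineterm="", n=0)
    return [line for line in diff
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]


if __name__ == "__main__":
    print("\n".join(changed_lines(as_csv_line(old), as_csv_line(new))))
    print("\n".join(changed_lines(as_attribute_lines(old), as_attribute_lines(new))))
```

		<p>
			In the one-attribute-per-line layout the diff consists of the two <code>size</code> lines only,
			while the CSV diff repeats the whole record.
		</p>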
       
    93 		
       
    94 		<p>
       
    95 			The <code>relpipe-out-csv</code> tool generates data types only when explicitly asked for: <code>--write-types true</code>.
       
    96 			The <code>relpipe-in-csv</code> tool automatically looks for these type declarations
       
    97 			and if all attributes have valid type declarations, they are used; otherwise they are considered to be a part of the attribute name.
       
    98 			This behavior can be disabled by <code>--read-types false</code> (<code>true</code> will require valid type declarations).
       
    99 		</p>
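		<p>
			The all-or-nothing rule might be sketched in Python as follows.
			This is only an illustration of the described behavior, not the actual implementation:
			the function name <code>resolve_header</code>, the label <code>"auto"</code> for the default behavior
			and the exception raised in the <code>"true"</code> case are assumptions.
		</p>

```python
# Hypothetical sketch of the all-or-nothing rule; not the actual relpipe code.
KNOWN_TYPES = {"string", "integer", "boolean"}


def resolve_header(cells, read_types="auto"):
    """Turn raw header cells into a list of (attribute name, data type).

    read_types mirrors the three behaviours described in the text:
    "auto" (assumed name of the default), "true" (declarations required)
    and "false" (declarations ignored).
    """
    untyped = [(cell, "string") for cell in cells]
    if read_types == "false":
        return untyped

    typed = []
    for cell in cells:
        name, separator, type_name = cell.rpartition("::")
        if not separator or type_name not in KNOWN_TYPES:
            if read_types == "true":
                # Assumption: a required but invalid declaration is an error.
                raise ValueError("missing or invalid type declaration: " + cell)
            # One invalid declaration is enough: keep every cell as a plain name.
            return untyped
        typed.append((name, type_name))
    return typed


if __name__ == "__main__":
    print(resolve_header(["path::string", "size::integer"]))
    print(resolve_header(["path::string", "size"]))
```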
       
   100 		
       
   101 		
       
   102 		<h2>Recognize data types using relpipe-tr-infertypes</h2>
       
   103 		
       
   104 		<p>
       
   105 			Sometimes we may also want to infer data types from the values automatically without any explicit declaration.
       
   106 			Then we put the <code>relpipe-tr-infertypes</code> tool in our pipeline.
       
   107 			It buffers whole relations and checks all values of each attribute.
       
   108 			If all of them are integers or all of them are booleans, the attribute is converted to that type.
       
   109 		</p>
       
   110 
       
   111 		<m:pre jazyk="text"><![CDATA[$ find license/ -print0 | relpipe-in-filesystem | relpipe-out-csv | relpipe-in-csv | relpipe-tr-infertypes | relpipe-out-tabular
       
   112 csv:
       
   113  ╭─────────────────┬───────────────┬────────────────┬────────────────┬────────────────╮
       
   114  │ path   (string) │ type (string) │ size (integer) │ owner (string) │ group (string) │
       
   115  ├─────────────────┼───────────────┼────────────────┼────────────────┼────────────────┤
       
   116  │ license/        │ d             │              0 │ hacker         │ hacker         │
       
   117  │ license/gpl.txt │ f             │          35147 │ hacker         │ hacker         │
       
   118  ╰─────────────────┴───────────────┴────────────────┴────────────────┴────────────────╯
       
   119 Record count: 2]]></m:pre>
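		<p>
			The idea behind the value-based inference can be outlined in Python.
			This is a rough sketch under stated assumptions, not the algorithm of <code>relpipe-tr-infertypes</code>:
			the accepted integer syntax (optional minus sign followed by digits) and the boolean literals <code>true</code> and <code>false</code> are guesses,
			and the function name <code>infer_types</code> is hypothetical.
		</p>

```python
import re

# Assumed value patterns; the real tool may accept a different syntax.
INTEGER = re.compile(r"-?[0-9]+")
BOOLEANS = {"true", "false"}


def infer_types(names, records):
    """Return one inferred type per attribute after scanning every record."""
    records = list(records)  # buffer the whole relation before deciding
    types = []
    for index, _name in enumerate(names):
        column = [record[index] for record in records]
        if column and all(INTEGER.fullmatch(value) for value in column):
            types.append("integer")
        elif column and all(value in BOOLEANS for value in column):
            types.append("boolean")
        else:
            types.append("string")  # mixed or empty columns stay strings
    return types


if __name__ == "__main__":
    names = ["path", "type", "size", "owner", "group"]
    records = [
        ["license/", "d", "0", "hacker", "hacker"],
        ["license/gpl.txt", "f", "35147", "hacker", "hacker"],
    ]
    print(list(zip(names, infer_types(names, records))))
```

		<p>
			The need to see every value before choosing a type is what forces the buffering of whole relations.
		</p>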
       
   120 
       
   121 		<p>
       
   122 			This approach is inefficient and contradicts streaming; however, it is sometimes useful and convenient for small data coming from external sources.
       
   123 			We can e.g. download some data set from the network and pipe it through <code>relpipe-in-csv</code> + <code>relpipe-tr-infertypes</code> and improve the data quality a bit.
       
   124 		</p>
       
   125 		
       
   126 		<p>
       
   127 			We may apply the type inference only to certain relations: <code>--relation "my_relation"</code>
       
   128 			or choose a different mode: <code>--mode data</code> or <code>metadata</code> or <code>auto</code>.
       
   129 			The <code>data</code> mode is described above.
       
   130 			In the <code>metadata</code> mode the <code>relpipe-tr-infertypes</code> tool works similarly to <code>relpipe-in-csv --read-types true</code>.
       
   131 			The <code>auto</code> mode checks for the metadata in attribute names first and, if none is found, falls back to the <code>data</code> mode.
       
   132 			This tool works with any relational data regardless of their original format or source (not only with CSV).
       
   133 		</p>
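		<p>
			How the three modes relate to each other can be summarized by a short Python sketch.
			All function names are hypothetical, the value patterns are assumed,
			and raising an error in the <code>metadata</code> mode when declarations are missing is a guess based on the analogy with <code>--read-types true</code>.
		</p>

```python
import re

KNOWN_TYPES = {"string", "integer", "boolean"}
INTEGER = re.compile(r"-?[0-9]+")  # assumed integer syntax


def from_metadata(names):
    """Return [(name, type)] if every name ends with a valid ::type, else None."""
    result = []
    for cell in names:
        name, separator, type_name = cell.rpartition("::")
        if not separator or type_name not in KNOWN_TYPES:
            return None
        result.append((name, type_name))
    return result


def from_data(names, records):
    """Return [(name, type)] by checking all values of each attribute."""
    result = []
    for index, name in enumerate(names):
        column = [record[index] for record in records]
        if column and all(INTEGER.fullmatch(value) for value in column):
            result.append((name, "integer"))
        elif column and all(value in ("true", "false") for value in column):
            result.append((name, "boolean"))
        else:
            result.append((name, "string"))
    return result


def infer(names, records, mode="auto"):
    """Dispatch on the mode: "data", "metadata" or "auto"."""
    if mode not in ("data", "metadata", "auto"):
        raise ValueError("unknown mode: " + mode)
    if mode == "data":
        return from_data(names, records)
    declared = from_metadata(names)
    if declared is not None:
        return declared
    if mode == "metadata":
        # Assumption: missing declarations are an error in this mode.
        raise ValueError("type declarations required in metadata mode")
    return from_data(names, records)  # "auto" falls back to the data mode
```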
       
   134 
       
   135 				
       
   136 		<h2>No header? Specify types as CLI parameters</h2>
       
   137 		
       
   138 		<p>
       
   139 			Some CSV files contain just data – they have no header line containing the column names.
       
   140 			Then we specify the attribute names and data types as CLI parameters of <code>relpipe-in-csv</code>:
       
   141 		</p>
       
   142 
       
   143 		<m:pre jazyk="text"><![CDATA[$ echo -e "a,b,c\nA,B,C" \
       
   144 	| relpipe-in-csv \
       
   145 		--relation 'just_data' \
       
   146 			--attribute 'x' string \
       
   147 			--attribute 'y' string \
       
   148 			--attribute 'z' string \
       
   149 	| relpipe-out-tabular
       
   150 
       
   151 just_data:
       
   152  ╭────────────┬────────────┬────────────╮
       
   153  │ x (string) │ y (string) │ z (string) │
       
   154  ├────────────┼────────────┼────────────┤
       
   155  │ a          │ b          │ c          │
       
   156  │ A          │ B          │ C          │
       
   157  ╰────────────┴────────────┴────────────╯
       
   158 Record count: 2]]></m:pre>
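		<p>
			For comparison, the same idea (names and types supplied from outside instead of a header line) can be sketched in Python with the standard <code>csv</code> module.
			The function <code>read_headerless_csv</code> is hypothetical and the conversion rules are simplified assumptions; it only illustrates the principle.
		</p>

```python
import csv
import io


def read_headerless_csv(text, attributes):
    """Parse CSV that has no header line, using externally supplied attributes.

    attributes is a list of (name, type) pairs and plays the role of the
    --attribute parameters in the shell example.
    """
    # Simplified, assumed conversions for the three data types.
    converters = {
        "string": str,
        "integer": int,
        "boolean": lambda value: value == "true",
    }
    records = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(attributes):
            raise ValueError("unexpected number of values: " + repr(row))
        records.append(
            {name: converters[type_name](value)
             for (name, type_name), value in zip(attributes, row)}
        )
    return records


if __name__ == "__main__":
    attributes = [("x", "string"), ("y", "string"), ("z", "string")]
    for record in read_headerless_csv("a,b,c\nA,B,C\n", attributes):
        print(record)
```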
       
   159 
       
   160 		<p>
       
   161 			We may also skip an existing header line using <code>tail -n +2</code> and force our own names and types.
       
   162 			However, this will not work if there are multiline values in the header – which is not common – 
       
   163 			in such cases we should use some <code>relpipe-tr-*</code> tool to rewrite the names or types
       
   164 			(these tools work with relational data instead of plain text).
       
   165 		</p>
       
   166 		
       
   167 	</text>
       
   168 
       
   169 </stránka>