diff -r cc60c8dd7924 -r 5bc2bb8b7946 relpipe-data/examples-csv-data-types.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/relpipe-data/examples-csv-data-types.xml Mon Feb 21 00:43:11 2022 +0100 @@ -0,0 +1,169 @@ + + + CSV and data types + declare or recognize integers and booleans in a typeless format + 04800 + + + +

+ CSV (RFC 4180) is quite good solution when we want to store or share relational data in a simple text format – + both, human-readable and well supported by many existing applications and libraries. + We have even ready-to-use GUI editors, so called spreadsheets (e.g. LibreOffice Calc). + However, such simple formats have usually some drawbacks. + CSV may contain only a single relation (table, sheet). This is not a big issue – we can use several files. + A more serious problem is the absence of data types – in CSV, everything is just a text string. + Thus it was impossible to have loss-less conversion to CSV and back. +

+ + + +

Data types are missing in CSV by default:

+ + +

The size attribute was integer and now it is mere string:

+ + + +

Declare data types in the CSV header

+ +

+ Since v0.18 we can encode the data types (currently strings, integers and booleans) in the CSV header and then recover them while reading. + Such „CSV with data types“ is valid CSV according to the RFC specification and can be viewed or edited in any CSV-capable software. +

+ +

+ The attribute name and data type are separated by the :: symbol e.g. name::string,age::integer,member::boolean. + Attribute names may contain :: (unlike the data type names). +

+ +

The data type declarations may be added simply by hand or automatically using relpipe-out-csv.

+ + + +

The relpipe-out-csv + relpipe-in-csv round-trip now does not degrade the data quality:

+ + + +

+ So we can put e.g. a CSV editor between them while storing and versioning the data in a different format (like XML or Recfile). + Such workflow can be effectively managed by make – + make edit will convert versioned data to CSV and launch the editor, + make commit will convert data back from the CSV and commit them in Mercurial, Git or other version control system (VCS). +

+ +

+ Why put into VCS data in different format than CSV? + Formats like XML or Recfile may have each attribute on a separate line which leads to more readable diffs. + At a glance we can see which attributes have been changed. + While in CSV we see just a changed long line and even with a better tools we need to count the comas to know which attribute it was. +

+ +

+ The relpipe-out-csv tool generates data types only when explicitly asked for: --write-types true. + The relpipe-in-csv tool automatically looks for these type declarations + and if all attributes have valid type declarations, they are used, otherwise they are considered to be a part of the attribute name. + This behavior can be disabled by --read-types false (true will require valid type declarations). +

+ + +

Recognize data types using relpipe-tr-infertypes

+ +

+ Sometimes we may also want to infer data types from the values automatically without any explicit declaration. + Then we put the relpipe-tr-infertypes tool in our pipeline. + It buffers whole relations and checks all values of each attribute. + If they contain all integers or all booleans they are converted to given type. +

+ + + +

+ This approach is inefficient and contradicts streaming, however it is sometimes useful and convenient for small data coming from external sources. + We can e.g. download some data set from network and pipe it through relpipe-in-csv + relpipe-tr-infertypes and improve the data quality a bit. +

+ +

+ We may apply the type inference only on certain relations: --relation "my_relation" + or chose different mode: --mode data or metadata or auto. + The data mode is described above. + In the metadata mode the relpipe-tr-infertypes works similar to relpipe-in-csv --read-types true. + The auto mode checks for the metadata in attribute names first and if not found, it fallbacks to the data mode. + This tool works with any relational data regardless their original format or source (not only with CSV). +

+ + +

No header? Specify types as CLI parameters

+ +

+ Some CSV files contain just data – have no header line containing the column names. + Then we specify the attribute names and data types as CLI parameters of relpipe-in-csv: +

+ + + +

+ We may also skip existing header line: tail -n +2 and force our own names and types. + However this will not work if there are multiline values in the header – which is not common – + in such cases we should use some relpipe-tr-* tool to rewrite the names or types + (these tools work with relational data instead of plain text). +

+ + + +