diff -r cc60c8dd7924 -r 5bc2bb8b7946 relpipe-data/examples-reading-querying-uniform-way.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/relpipe-data/examples-reading-querying-uniform-way.xml Mon Feb 21 00:43:11 2022 +0100 @@ -0,0 +1,170 @@ + + + Reading and querying JSON, YAML, CBOR, HTML, MIME, INI, ASN.1 and XML in a uniform way + run XPath queries and turn data from various sources to relations + 04600 + + + +

+ Data come in different shapes and formats. + We can distinguish several main logical models: + relational, + tree + and graph + (a tree is a connected undirected graph with no cycles). + Arbitrary trees or even graphs are more flexible, but they are also harder to comprehend and work with. + The relational model is somewhat limited and easier to grasp, yet still flexible enough to describe almost anything + (actually it can describe anything; it is just a question of how nice and native it should look). + Unsurprisingly, Relational pipes are built around the relational model. + However, sometimes we have to interact with the tree/graph world and deal with data that have a shape other than relational. + So we need to bridge the gap between trees/graphs and relations. +

+ +

+ While we have just a few logical models, there is an abundance of serialization formats, i.e. mappings of a given logical model to a sequence of octets (bytes). + Relations might be serialized as CSV, ODS, tables in a database, Recfiles etc. + Trees might be serialized as XML, YAML, ASN.1, CBOR, JSON etc. +

+ +

+ Why reinvent the wheel and repeat the same work for each format? +

+ +

+ We already have reusable code for relational data. This follows from the design of Relational pipes, which separates inputs, transformations and outputs. + Once the data (e.g. CSV) pass through the input filter, they become relational data and can be processed in a uniform way by any transformation(s) or output filter. +

+ +

+ But what about the tree data? We have created a set of tools (input filters) that support various serialization formats, in v0.18: +

+ +

XML: relpipe-in-xmltable
ASN.1: relpipe-in-asn1table
CBOR: relpipe-in-cbortable
HTML: relpipe-in-htmltable
INI: relpipe-in-initable
MIME: relpipe-in-mimetable
YAML: relpipe-in-yamltable

+ +

+ These tools follow the same design principle and offer the same user interface. + So once users learn one tool, they can apply this knowledge when working with other formats. + The principle is: +

+ +

We are converting the tree structure to one or more relations.
For each relation, define the expression that selects record nodes from the tree.
For each attribute, define the expression (relative to the record node) that selects the attribute value.
If anything cannot (or should not) be mapped to relations, keep it as a tree, so we can process it later; these (sub)trees might be embedded in normal records or reside in a separate relation.
We may do a full (lossless) conversion, but we may also extract just a single value from the whole tree (generate a single relation with a single record and a single attribute), or anything in between. Either way, the tool and the logic used are still the same.
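
The principle above can be sketched in a few lines of Python. This is only an illustration of the idea (record expression plus attribute expressions relative to each record node), not the implementation of the relpipe-in-*table tools. The sample tree, the element names and the `tree_to_relation` helper are made up for this example, and the standard `xml.etree.ElementTree` module supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample tree; the element names are made up for this illustration.
SAMPLE = """
<config>
  <server><name>alpha</name><port>8080</port></server>
  <server><name>beta</name><port>8081</port></server>
</config>
"""


def tree_to_relation(root, records_expr, attributes):
    """Select record nodes, then evaluate each attribute expression
    relative to the record node. Returns a list of dicts (records)."""
    relation = []
    for record_node in root.findall(records_expr):
        record = {}
        for attr_name, attr_expr in attributes.items():
            node = record_node.find(attr_expr)
            # A missing node yields None; otherwise we take the text content.
            record[attr_name] = None if node is None else "".join(node.itertext())
        relation.append(record)
    return relation


root = ET.fromstring(SAMPLE)

# One relation with one record per <server> node.
servers = tree_to_relation(root, "./server", {"name": "name", "port": "port"})

# The other extreme: one relation, one record, one attribute.
single = tree_to_relation(root, ".", {"first_port": "server/port"})

for row in servers:
    print(row)
print(single)
```

The same function covers both the full conversion and the single-value extraction; only the expressions differ.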

+ +

+ This is nothing new, and experienced SQL users should already know where the inspiration comes from: + the XMLTable() SQL function that converts an XML tree to a result set (relation). + We just implemented the same functionality as a separate CLI tool, without a dependency on any SQL engine and with support not only for XML but also for alternative serialization formats. + And for all of them, we use the same query language: XPath. +

+ +

+ Although this sounds very XML-ish, we do not translate the alternative formats to XML markup. There is no text full of angle brackets and ampersands in the middle of the process. + In our case, we should see XML not as a markup text (meta)format, but rather as an in-memory model: a generic tree of node objects stored in RAM that allows us to perform various tree operations (queries, modifications). +
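
To illustrate the "in-memory model" idea, here is a minimal Python sketch that parses JSON and builds a tree of node objects directly, without ever producing XML markup, and then queries that tree with path expressions. The mapping used here (object keys become element names, list members become `item` elements) and the sample data are assumptions made for this example; the actual mapping used by the relpipe tools may differ.

```python
import json
import xml.etree.ElementTree as ET


def build(parent, name, value):
    """Attach a node for `value` under `parent`.
    Hypothetical mapping: dict keys become element names,
    list members become <item> nodes, scalars are stored as text."""
    node = ET.SubElement(parent, name)
    if isinstance(value, dict):
        for key, child in value.items():
            build(node, key, child)
    elif isinstance(value, list):
        for child in value:
            build(node, "item", child)
    else:
        node.text = str(value)
    return node


# Made-up sample data.
data = json.loads(
    '{"network": {"gateway": "192.168.1.1", "dns": ["192.168.1.1", "192.168.1.2"]}}'
)

# The tree exists only as objects in memory; no markup text is generated.
root = ET.Element("root")
for key, value in data.items():
    build(root, key, value)

gateway = root.findtext("network/gateway")
dns = [n.text for n in root.findall("network/dns/item")]
print(gateway, dns)
```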

+ + +

Converting a YAML tree to a set of relations

+ +

+ Flat key-value lists sooner or later become insufficient for software configuration, and it becomes necessary to somehow manage trees of configuration items (or relations, of course). + YAML is quite a good tree-serialization format. + It is used e.g. for configuring Java Spring applications or for Netplan network configuration in the Ubuntu GNU/Linux distribution: +

+ + + +

We can use the following command to convert the tree to a set of relations:

+ + + +

+ So we can do a full relational conversion of the original tree structure or extract just a few desired values (e.g. the gateway IP address). + We can also pipe a relation to a shell loop and execute some command for each record (e.g. DNS server or IP address). +

+ + + +

+ N.B. YAML is considered to be a superset of JSON, thus tools that can read YAML can also read JSON. + In the current version (v0.18) of Relational pipes, relpipe-in-json and relpipe-in-jsontable are just symbolic links to their YAML counterparts. +

+ +

+ There is also a similar example: Reading Libvirt XML files using XMLTable + where we build relations from an XML tree. + The principles are the same for all input formats. +

+ +

Dealing with the HTML tagsoup

+ +

+ With relpipe-in-htmltable we can extract structured information from poorly written HTML pages. + And unlike relpipe-in-xmltable, this tool does not require valid XML/XHTML, so it is good for the dirty work. + Processing such invalid data is always a bit unreliable, but still better than nothing. +

+ + + +

Although Mr. Ryszczyks is unable to create a valid document, this script will print:

+ + + +

+ And thanks to the terminal autodetection in the format_result() function, + we can even pipe the result of this script to any relpipe-tr-* or relpipe-out-* + and get machine-readable data instead of the ANSI-colored tables – + so we can do some further processing or conversion to a different format (XHTML, GUI, ODS, Recfile etc.). +

+ +

The `2xml` helper script: `yaml2xml`, `json2xml`, `asn12xml`, `mime2xml` etc.

+ +

+ Mapping from the original syntax to the tree structure is usually quite intuitive and straightforward. + However, sometimes it is useful to see the XML serialization of this in-memory model. + In the relpipe-in-xmltable.cpp repository we have a helper script called + 2xml + + – this script is not intended to be called directly – instead, the user should create a symlink, e.g. ini2xml, yaml2xml, asn12xml etc. + The 2xml script chooses the right input filter according to the symlink name and uses it for conversion from the source tree-serialization format to the XML tree-serialization format. +
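
The dispatch on the symlink name can be sketched as follows. This is not the code of the 2xml script; it is a hypothetical Python illustration that assumes the input filter name is derived as `relpipe-in-<format>table`, which matches the tool names listed earlier. The `input_filter_for` function and the example paths are made up.

```python
import os

SUFFIX = "2xml"


def input_filter_for(invoked_as):
    """Derive the input filter name from the name the script was invoked as,
    e.g. 'yaml2xml' -> 'relpipe-in-yamltable' (assumed naming scheme)."""
    name = os.path.basename(invoked_as)
    if not name.endswith(SUFFIX) or name == SUFFIX:
        # Called directly (as '2xml') or under an unrelated name.
        raise ValueError(f"call through a symlink like yaml2xml, not {name!r}")
    source_format = name[: -len(SUFFIX)]
    return f"relpipe-in-{source_format}table"


if __name__ == "__main__":
    for example in ("yaml2xml", "/usr/local/bin/ini2xml", "asn12xml"):
        print(example, "->", input_filter_for(example))
```

A real script would pass its own invocation name (the equivalent of `sys.argv[0]`) instead of the example strings.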

+ +

+ If we want to do the same thing without the helper script, it is quite simple. + We use the appropriate relpipe-in-*table tool and extract a single relation with a single attribute and a single record. + The --records expression is '/' i.e. the root node. + The --attribute expression is '.' i.e. still the root node. + And then we just add the --mode raw-xml to this attribute, so we get the XML serialization of the given node (root) instead of the text content. +
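
The difference between the text content of a node and its XML serialization can be shown with a small Python sketch. It only mirrors the idea of the raw-xml mode using the standard `xml.etree.ElementTree` module; the sample document is made up, and because ElementTree does not accept the absolute path '/' on an element, the expression '.' on the root element stands in for it here.

```python
import xml.etree.ElementTree as ET

# Made-up sample tree.
root = ET.fromstring("<a><b>x</b><c>y</c></a>")

# Records expression: select the root node (stand-in for '/').
record_nodes = root.findall(".")

# Attribute expression '.': still the same node, relative to the record.
node = record_nodes[0].find(".")

# Default behaviour in this sketch: the text content of the node.
text_value = "".join(node.itertext())

# Analogue of the raw-xml mode: the XML serialization of the node.
raw_xml_value = ET.tostring(node, encoding="unicode")

print(text_value)
print(raw_xml_value)
```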

+ +

+ In addition to this, the 2xml script also does formatting/indentation and syntax highlighting, + if the required tools (xmllint and pygmentize) are available and STDOUT is a terminal. +

+ +

+ This script is useful when writing the expressions for relpipe-in-*table, + but also as a pipeline filter that allows us to use the whole XML ecosystem for other formats as well. + We can read YAML, JSON, INI, MIME or even some binary formats and apply an XSLT transformation to such data and generate e.g. an XHTML report or a DocBook document, + or validate such structures using an XSD or Relax NG schema, or process such data using the XQuery functional language. +

+ + + + +
