diff -r cc60c8dd7924 -r 5bc2bb8b7946 relpipe-data/examples-reading-querying-uniform-way.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/relpipe-data/examples-reading-querying-uniform-way.xml Mon Feb 21 00:43:11 2022 +0100 @@ -0,0 +1,170 @@ + + + Reading and querying JSON, YAML, CBOR, HTML, MIME, INI, ASN.1 and XML in a uniform way + run XPath queries and turn data from various sources to relations + 04600 + + + +

+ Data come in different shapes and formats. + We can distinguish several main logical models: + relational, + tree + and graph + (a tree is a connected undirected graph with no cycles). + Arbitrary trees or even graphs are more flexible, but they are also harder to comprehend and work with. + The relational model is somewhat limited and easier to grasp, yet still flexible enough to describe almost anything + (actually it can describe anything; it is just a question of how nice and native it should look). + Unsurprisingly, Relational pipes are built around the relational model. + However, sometimes we have to interact with the tree/graph world and deal with data that have a shape other than relational. + So we need to bridge the gap between trees/graphs and relations. +

+ +

+ While we have just a few logical models, there is an abundance of serialization formats, i.e. mappings of a given logical model to a sequence of octets (bytes). + Relations might be serialized as CSV, ODS, tables in a database, Recfiles etc. + Trees might be serialized as XML, YAML, ASN.1, CBOR, JSON etc. +

+ +

+ Why reinvent the wheel and repeat the same work for each format? +

+ +

+ We already have reusable code for relational data – this follows from the design of Relational pipes, which separates inputs, transformations and outputs. + Once the data (e.g. CSV) pass through the input filter, they become relational data and can be processed in a uniform way by any transformation(s) or output filter. +

+ +

+ But what about the tree data? We have created a set of tools (input filters) that support various serialization formats, in v0.18: +

+ + + +

+ These tools follow the same design principle and offer the same user interface. + So once users learn one tool, they can apply that knowledge when working with other formats. + The principle is: +

+ + + +

+ This is nothing new – and experienced SQL users should already know where the inspiration comes from: + the XMLTable() SQL function that converts an XML tree to a result set (relation). + We just implemented the same functionality as a separate CLI tool, without a dependency on any SQL engine and with support not only for XML but also for alternative serialization formats. + And for all of them, we use the same query language: XPath. +
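The XMLTable idea can be sketched in a few lines of Python: one path expression selects the nodes that become records, and further expressions, evaluated relative to each record node, become the attributes. This is only an illustration, not the implementation of the relpipe tools; the function name `xml_table` and the sample document are made up for this sketch, and Python's `xml.etree.ElementTree` supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET


def xml_table(root, records, attributes):
    """Return one row (dict) per node matched by `records`.

    `attributes` maps an attribute name to a path that is evaluated
    relative to each record node (hypothetical helper for this sketch).
    """
    rows = []
    for node in root.findall(records):
        row = {name: node.findtext(path, default="")
               for name, path in attributes.items()}
        rows.append(row)
    return rows


# Sample tree invented for the illustration.
doc = ET.fromstring("""
<machines>
  <machine><name>alpha</name><memory>2048</memory></machine>
  <machine><name>beta</name><memory>4096</memory></machine>
</machines>
""")

rows = xml_table(doc, "./machine", {"name": "name", "memory": "memory"})
for row in rows:
    print(row["name"], row["memory"])
```

Each matched `machine` node yields one record, which mirrors how a records expression and a list of attribute expressions define a relation.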

+ +

+ Although this sounds very XML-ish, we do not translate the alternative formats to XML markup. There is no text full of angle brackets and ampersands in the middle of the process. + In our case, we should see XML not as a markup text (meta)format, but rather as an in-memory model – a generic tree of node objects stored in RAM that allows us to do various tree operations (queries, modifications). +
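To make the "in-memory model" point concrete, here is a minimal Python sketch that builds a tree of node objects directly from parsed JSON and queries it, without ever producing XML text. The mapping used here (object keys become element names, list entries become `item` elements) is an assumption chosen for the illustration and is not necessarily the mapping used by the relpipe tools.

```python
import json
import xml.etree.ElementTree as ET


def to_node(name, value):
    """Build an in-memory element for a JSON value; no markup text is created."""
    node = ET.Element(name)
    if isinstance(value, dict):
        for key, child in value.items():
            node.append(to_node(key, child))
    elif isinstance(value, list):
        for item in value:
            # "item" is a hypothetical name for list entries.
            node.append(to_node("item", item))
    else:
        node.text = str(value)
    return node


data = json.loads('{"server": {"host": "example.org", "ports": [80, 443]}}')
tree = to_node("root", data)

# Query the node objects with path expressions.
host = tree.findtext("./server/host")
ports = [n.text for n in tree.findall("./server/ports/item")]
print(host, ports)
```

The angle brackets would appear only if we explicitly asked for a serialization of the tree; the query itself works on objects in memory.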

+ + +

Converting a YAML tree to a set of relations

+ +

+ Flat key-value lists sooner or later become insufficient for software configuration, and it is necessary to somehow manage trees of configuration items (or relations, of course). + YAML is quite a good tree-serialization format. + It is used e.g. for configuring Java Spring applications or for Netplan network configuration in the Ubuntu GNU/Linux distribution: +

+ + + +

We can use the following command to convert the tree to a set of relations:

+ + + +

+ So we can do a full relational conversion of the original tree structure or extract just a few desired values (e.g. the gateway IP address). + We can also pipe a relation to a shell loop and execute some command for each record (e.g. DNS server or IP address). +
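The conversion from a configuration tree to relations can be illustrated with a short Python sketch. The Netplan-like configuration below is invented for this example (addresses come from the 192.0.2.0/24 documentation range), and it is written as JSON because the Python standard library has no YAML parser. The sketch shows the idea of the conversion, not the output of the relpipe tools.

```python
import json

# Hypothetical Netplan-like configuration, written as JSON.
config = json.loads("""
{"network": {"version": 2, "ethernets": {
  "eth0": {"addresses": ["192.0.2.10/24"],
           "gateway4": "192.0.2.1",
           "nameservers": {"addresses": ["192.0.2.53", "192.0.2.54"]}}}}}
""")

ethernets = config["network"]["ethernets"]

# Relation 1: one record per interface (name, gateway).
interfaces = [(name, eth.get("gateway4", ""))
              for name, eth in ethernets.items()]

# Relation 2: one record per (interface, DNS server) pair.
dns = [(name, server)
       for name, eth in ethernets.items()
       for server in eth.get("nameservers", {}).get("addresses", [])]

# Per-record processing, analogous to piping a relation to a shell loop.
for name, server in dns:
    print(f"would check DNS server {server} configured on {name}")
```

Nested lists such as the DNS servers naturally become a second relation that refers to the first one through the interface name.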

+ + + +

+ n.b. YAML is considered to be a superset of JSON, thus tools that can read YAML can also read JSON. + In the current version (v0.18), relpipe-in-json and relpipe-in-jsontable are just symbolic links to their YAML counterparts. +

+ +

+ There is also a similar example: Reading Libvirt XML files using XMLTable + where we build relations from an XML tree. + The principles are the same for all input formats. +

+ +

Dealing with the HTML tagsoup

+ +

+ With relpipe-in-htmltable we can extract structured information from poorly formed HTML pages. + And unlike relpipe-in-xmltable, this tool does not require valid XML/XHTML, so it is good for the dirty work. + Processing such invalid data is always a bit unreliable, but still better than nothing. +
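What "tolerant" parsing means can be sketched with Python's standard `html.parser`, which reports tags as they come and does not insist on well-formed input. This is only an analogy: the class `CellCollector` and the sample markup are invented for this sketch, and nothing here describes how relpipe-in-htmltable works internally.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of <td> cells, one list per <tr>; tolerates unclosed tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self.in_cell = False
        elif tag == "td":
            if not self.rows:  # a cell without any preceding <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


# Tag soup: missing </td>, </tr> and </p>.
soup = "<table><tr><td>alpha<td>1<tr><td>beta</td><td>2</table><p>unclosed"

parser = CellCollector()
parser.feed(soup)
parser.close()
rows = [[cell.strip() for cell in row] for row in parser.rows]
print(rows)
```

An XML parser would reject this input at the first unclosed element, while the tolerant approach still recovers two records with two attributes each.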

+ + + +

Although Mr. Ryszczyks is unable to create a valid document, this script will print:

+ + + +

+ And thanks to the terminal autodetection in the format_result() function, + we can even pipe the result of this script to any relpipe-tr-* or relpipe-out-* + and get machine-readable data instead of the ANSI-colored tables – + so we can do some further processing or conversion to a different format (XHTML, GUI, ODS, Recfile etc.). +
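The terminal autodetection idea can be sketched as follows. This is a Python analogue written for illustration; the name `format_rows` and both output formats are assumptions, and the sketch does not reproduce the format_result() function from the script above.

```python
import io
import sys


def format_rows(rows, stream=sys.stdout):
    """Return an aligned table for a terminal, tab-separated lines otherwise."""
    if stream.isatty():
        # Human-readable output: pad every cell to the widest one.
        width = max((len(str(cell)) for row in rows for cell in row), default=0)
        return "\n".join(" | ".join(str(cell).ljust(width) for cell in row)
                         for row in rows)
    # Machine-readable output for pipes and files.
    return "\n".join("\t".join(str(cell) for cell in row) for row in rows)


# A StringIO buffer is not a terminal, so we get the machine-readable form.
buffer = io.StringIO()
plain = format_rows([("a", 1), ("b", 22)], buffer)
buffer.write(plain)
```

The same program thus serves both interactive use and further processing in a pipeline, without any extra command-line option.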

+ +

The 2xml helper script: yaml2xml, json2xml, asn12xml, mime2xml etc.

+ +

+ Mapping from the original syntax to the tree structure is usually quite intuitive and straightforward. + However, sometimes it is useful to see the XML serialization of this in-memory model. + In the relpipe-in-xmltable.cpp repository we have a helper script called + 2xml + + – this script is not intended to be called directly; instead, the user should create a symlink, e.g. ini2xml, yaml2xml, asn12xml etc. + The 2xml script chooses the right input filter according to the symlink name and uses it for conversion from the source tree-serialization format to the XML tree-serialization format. +
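Dispatching on the name under which a program was invoked is a common technique, and a minimal Python sketch of it looks like this. The function `input_filter_for` is hypothetical, and the tool names it produces are an assumption derived from the relpipe-in-*table naming pattern; the real 2xml script may resolve the names differently.

```python
import os


def input_filter_for(program_name):
    """Derive the input filter name from a symlink name such as 'yaml2xml'."""
    base = os.path.basename(program_name)
    suffix = "2xml"
    if not base.endswith(suffix) or base == suffix:
        raise ValueError(
            f"call this through a symlink such as yaml2xml, not {base!r}")
    source_format = base[:-len(suffix)]
    # Assumption: the filter is named after the relpipe-in-*table pattern.
    return f"relpipe-in-{source_format}table"


# In a real script the name would come from sys.argv[0].
print(input_filter_for("/usr/local/bin/yaml2xml"))
print(input_filter_for("asn12xml"))
```

Adding support for another format then requires only a new symlink, provided that an input filter with the matching name exists.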

+ +

+ If we want to do the same thing without the helper script, it is quite simple. + We use the appropriate relpipe-in-*table tool and extract a single relation with a single attribute and a single record. + The --records expression is '/' i.e. the root node. + The --attribute expression is '.' i.e. still the root node. + And then we just add the --mode raw-xml to this attribute, so we get the XML serialization of the given node (root) instead of the text content. +
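The difference between the text content of a node and its XML serialization can be shown on a small in-memory tree. This Python sketch uses `xml.etree.ElementTree` purely as an illustration of the two views of the same root node; the sample document is invented and no relpipe tool is involved.

```python
import xml.etree.ElementTree as ET

# Sample tree invented for the illustration.
root = ET.fromstring("<config><name>demo</name><port>8080</port></config>")

# Text content: all text nodes under the root, concatenated.
text_content = "".join(root.itertext())

# XML serialization of the same node, as a string.
raw_xml = ET.tostring(root, encoding="unicode")

print(text_content)
print(raw_xml)
```

The text content loses the structure, whereas the serialization keeps element names and nesting, which is what we want when inspecting how a source format was mapped to the tree.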

+ +

+ In addition to this, the 2xml script also does formatting/indentation and syntax highlighting, + if the required tools (xmllint and pygmentize) are available and the STDOUT is a terminal. +

+ +

+ This script is useful when writing the expressions for relpipe-in-*table, + but also as a pipeline filter that allows us to use the whole XML ecosystem for other formats as well. + We can read YAML, JSON, INI, MIME or even some binary formats and apply an XSLT transformation to such data to generate e.g. an XHTML report or a DocBook document, + validate such structures using an XSD or Relax NG schema, or process such data using the XQuery functional language. +

+ + +
+ +