relpipe-data/examples-reading-querying-uniform-way.xml
author František Kučera <franta-hg@frantovo.cz>
Mon, 21 Feb 2022 00:43:11 +0100
branch v_0
changeset 329 5bc2bb8b7946
Release v0.18

<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Reading and querying JSON, YAML, CBOR, HTML, MIME, INI, ASN.1 and XML in a uniform way</nadpis>
	<perex>run XPath queries and turn data from various sources to relations</perex>
	<m:pořadí-příkladu>04600</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			Data come in different shapes and formats.
			We can distinguish several main logical models:
			relational,
			tree
			and graph
			(a tree is a connected undirected graph with no cycles).
			Arbitrary trees or even graphs are more flexible, but they are also harder to comprehend and work with.
			The relational model is somewhat more limited and easier to grasp, yet still flexible enough to describe almost anything
			(actually, it can describe anything; it is just a question of how nice and native the result should look).
			Unsurprisingly, <m:name/> are built around the relational model.
			However, sometimes we have to interact with the tree/graph world and deal with data that have a non-relational shape.
			So we need to bridge the gap between trees/graphs and relations.
		</p>
		
		<p>
			While we have just a few logical models, there is an abundance of serialization formats, i.e. mappings of a given logical model to a sequence of octets (bytes).
			Relations might be serialized as CSV, ODS, tables in a database, Recfiles etc.
			Trees might be serialized as XML, YAML, ASN.1, CBOR, JSON etc.
		</p>
		
		<p>
			Why reinvent the wheel and repeat the same work for each format?
		</p>
		
		<p>
			We already have reusable code for relational data – this is given by the design of <m:name/>, because it separates: <em>inputs</em>, <em>transformations</em> and <em>outputs</em>.
			Once the data (e.g. CSV) passes through the input filter, it becomes relational data and can be processed in a uniform way by any transformation(s) or output filter.
		</p>
		
		<p>
			But what about the tree data? As of <m:a href="release-v0.18">v0.18</m:a>, we have a set of tools (input filters) that support various serialization formats:
		</p>
		
		<ul>
			<li>XML: <code>relpipe-in-xmltable</code></li>
			<li>ASN.1: <code>relpipe-in-asn1table</code></li>
			<li>CBOR: <code>relpipe-in-cbortable</code></li>
			<li>HTML: <code>relpipe-in-htmltable</code></li>
			<li>INI: <code>relpipe-in-initable</code></li>
			<li>MIME: <code>relpipe-in-mimetable</code></li>
			<li>YAML: <code>relpipe-in-yamltable</code></li>
		</ul>
		
		<p>
			These tools follow the same design principle and offer the same user interface.
			So once users learn one tool, they can apply the same knowledge when working with other formats.
			The principle is:
		</p>
		
		<ul>
			<li>We are converting the tree structure to one or more relations.</li>
			<li>For each relation, define the expression that selects record nodes from the tree.</li>
			<li>For each attribute, define the expression (relative to the record node) that selects the attribute value.</li>
			<li>If anything cannot (or should not) be mapped to relations, keep it as a tree, so we can process it later – these (sub)trees might be embedded in normal records or reside in a separate relation.</li>
			<li>We may do a full (lossless) conversion, but we may also extract just a single value from the whole tree (generate a single relation with a single record and a single attribute). Or anything in between. Either way, the tool and the logic used are still the same.</li>
		</ul>
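		
		<p>
			As a rough illustration of this principle, the sketch below maps a small hypothetical XML document (a <code>hosts</code> element with <code>host</code> children) to a single relation.
			The <code>--records</code> expression selects the record nodes and each <code>--attribute</code> expression is evaluated relative to such a node.
			The <code>--relation</code> option, the <code>string</code> type argument and the argument order are assumptions about the command-line syntax and should be checked against the documentation of the tool; the function name and the element names are made up for this example.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Sketch only: reads a hypothetical document like
#   <hosts><host name="alpha"><ip>192.0.2.1</ip></host></hosts>
# from STDIN and emits one relation called "host".
# Assumed syntax: --attribute NAME TYPE XPATH
hosts_to_relation() {
	relpipe-in-xmltable \
		--relation 'host' \
		--records '/hosts/host' \
		--attribute 'name' string '@name' \
		--attribute 'ip' string 'ip'
}

# Possible usage (requires the relpipe tools to be installed):
#   hosts_to_relation < hosts.xml | relpipe-out-tabular]]></m:pre>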
		
		<p>
			This is nothing new – and experienced SQL users should already know where the inspiration comes from:
			the <code>XMLTable()</code> SQL function that converts an XML tree to a result set (relation).
			We just implemented the same functionality as a separate CLI tool, without dependency on any SQL engine and with support for not only XML but also for alternative serialization formats.
			And for all of them, we use the same query language: XPath.
		</p>
		
		<p>
			Although this sounds quite <i>XML-ish</i>, we do not translate the alternative formats to XML markup. There is no <i>text full of angle brackets and ampersands</i> in the middle of the process.
			In our case, we should see XML not as a markup text (meta)format, but rather as an in-memory model – a generic tree of node objects stored in RAM that allows us to perform various tree operations (queries, modifications).
		</p>
		
		
		<h2 id="yamlToRelations">Converting a YAML tree to a set of relations</h2>
		
		<p>
			Flat key-value lists sooner or later become insufficient for software configuration, and it becomes necessary to manage trees of configuration items (or relations, of course).
			YAML is quite a good tree-serialization format.
			It is used e.g. for configuring Java Spring applications or for Netplan network configuration in the Ubuntu GNU/Linux distribution:
		</p>
		
		<m:pre jazyk="yaml" src="examples/netplan-1.yaml"/>
		
		<p>We can use the following command to convert the tree to a set of relations:</p>
		
		<m:pre jazyk="bash" src="examples/netplan-1.sh"/>
		
		<p>
			So we can do a full relational conversion of the original tree structure or extract just a few desired values (e.g. the gateway IP address).
			We can also pipe a relation to a shell loop and execute some command for each record (e.g. DNS server or IP address).
		</p>
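		
		<p>
			The shell loop mentioned above might be sketched as follows.
			It assumes that the relation has a single attribute and that its values arrive separated by zero bytes (for example from <code>relpipe-out-nullbyte</code>); the function name is hypothetical and the <code>printf</code> line only stands in for a real command.
			Values that contain line breaks would need a different approach.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Sketch only: reads NUL-separated values from STDIN
# and runs one action per value.
for_each_value() {
	tr '\0' '\n' | while IFS= read -r value; do
		# Placeholder action; replace it with the real command.
		printf 'processing: %s\n' "$value"
	done
}

# Possible usage (requires the relpipe tools to be installed):
#   some-relpipe-pipeline | relpipe-out-nullbyte | for_each_value]]></m:pre>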
		
		<m:img src="img/wmaker-yaml-xml-tabular-1.png"/>
		
		<p>
			N.B. YAML is considered to be a superset of JSON, so tools that can read YAML can also read JSON.
			In the current version (v0.18) of <m:name/>, <code>relpipe-in-json</code> and <code>relpipe-in-jsontable</code> are just symbolic links to their YAML counterparts.
		</p>
		
		<p>
			There is also a similar example, <m:a href="examples-in-xmltable-libvirt">Reading Libvirt XML files using XMLTable</m:a>,
			where we build relations from an XML tree.
			The principles are the same for all input formats.
		</p>
		
		<h2 id="htmlTagSoup">Dealing with the HTML tagsoup</h2>
		
		<p>
			With <code>relpipe-in-htmltable</code> we can extract structured information from poor HTML pages.
			And unlike <code>relpipe-in-xmltable</code>, this tool does not require valid XML/XHTML, so it is good for the dirty work.
			Processing such invalid data is always a bit unreliable, but still better than nothing.
		</p>
		
		<m:pre jazyk="bash" src="examples/html-tagsoup-1.sh"/>
		
		<p>Although Mr. Ryszczyks is unable to create a valid document, this script will print:</p>
		
		<m:pre jazyk="text" src="examples/html-tagsoup-1.txt"/>

		<p>
			And thanks to the terminal autodetection in the <code>format_result()</code> function,
			we can even pipe the result of this script to any <code>relpipe-tr-*</code> or <code>relpipe-out-*</code>
			and get machine-readable data instead of the ANSI-colored tables – 
			so we can do some further processing or conversion to a different format (XHTML, GUI, ODS, Recfile etc.).
		</p>
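		
		<p>
			The real <code>format_result()</code> is part of the script included above; a function of this kind might look roughly like the following sketch.
			It is an assumption about the technique, not a copy of the original code: if STDOUT is a terminal, the relational data are rendered as a table, otherwise they pass through unchanged.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Sketch only: format relational data from STDIN for humans,
# or pass them on untouched when the output is piped elsewhere.
format_result() {
	if [ -t 1 ]; then
		# STDOUT is a terminal: print a table.
		relpipe-out-tabular
	else
		# STDOUT is a pipe or a file: keep the machine-readable stream.
		cat
	fi
}]]></m:pre>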
				
		<h2 id="the2xmlTool">The <code>2xml</code> helper script: <code>yaml2xml</code>, <code>json2xml</code>, <code>asn12xml</code>, <code>mime2xml</code> etc.</h2>
		
		<p>
			Mapping from the original syntax to the tree structure is usually quite intuitive and straightforward.
			However, sometimes it is useful to see the XML serialization of this in-memory model.
			In the <code>relpipe-in-xmltable.cpp</code> repository we have a helper script called <code>
				<a href="http://hg.globalcode.info/relpipe/relpipe-in-xmltable.cpp/file/tip/examples/2xml.sh">2xml</a>
			</code>
			– this script is not intended to be called directly – instead the user should create a symlink, e.g. <code>ini2xml</code>, <code>yaml2xml</code>, <code>asn12xml</code> etc.
			The <code>2xml</code> script chooses the right input filter according to the symlink name and uses it for conversion from the source tree-serialization format to the XML tree-serialization format.
		</p>
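		
		<p>
			Choosing the filter by the symlink name can be sketched like this.
			It is a simplified illustration, not the code of the actual <code>2xml</code> script; the function name <code>filter_for_name</code> is hypothetical.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Sketch only: derive the input filter from the name
# under which the script was called, e.g. yaml2xml.
filter_for_name() {
	name=$(basename "$1")
	format=${name%2xml}   # yaml2xml -> yaml, asn12xml -> asn1
	if [ -z "$format" ] || [ "$format" = "$name" ]; then
		echo "Please call this script through a symlink like yaml2xml" >&2
		return 1
	fi
	printf 'relpipe-in-%stable\n' "$format"
}

# Inside a script, the name would typically come from "$0":
#   filter=$(filter_for_name "$0") || exit 1]]></m:pre>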
		
		<p>
			If we want to do the same thing without the helper script, it is quite simple.
			We use the appropriate <code>relpipe-in-*table</code> tool and extract a single relation with a single attribute and a single record.
			The <code>--records</code> expression is <code>'/'</code>, i.e. the root node.
			The <code>--attribute</code> expression is <code>'.'</code>, i.e. still the root node.
			And then we just add <code>--mode raw-xml</code> to this attribute, so we get the XML serialization of the given node (root) instead of its text content.
		</p>
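		
		<p>
			Put together, the command described above might look like the following sketch.
			The <code>--records</code>, <code>--attribute</code> and <code>--mode raw-xml</code> parts come from the description; the <code>--relation</code> option, the <code>string</code> type argument, the relation and attribute name <code>xml</code> and the function name are assumptions.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Sketch only: convert a tree in the given format (yaml, ini, ...)
# from STDIN to a relation holding the XML serialization of the root.
# Assumed syntax: --attribute NAME TYPE XPATH --mode MODE
tree_to_xml() {
	"relpipe-in-${1}table" \
		--relation 'xml' \
		--records '/' \
		--attribute 'xml' string '.' --mode raw-xml
}

# Possible usage (requires the relpipe tools to be installed);
# the output is still relational data, so an output filter is needed
# to get the bare XML text:
#   tree_to_xml yaml < config.yaml | relpipe-out-nullbyte | tr -d '\0']]></m:pre>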
		
		<p>
			In addition to this, the <code>2xml</code> script also does formatting/indentation and syntax highlighting,
			if the required tools (<code>xmllint</code> and <code>pygmentize</code>) are available and STDOUT is a terminal.
		</p>
		
		<p>
			This script is useful when writing the expressions for <code>relpipe-in-*table</code>,
			but also as a pipeline filter that allows us to use the whole XML ecosystem also for other formats.
			We can read YAML, JSON, INI, MIME or even some binary formats, apply an XSLT transformation to such data and generate e.g. an XHTML report or a DocBook document,
			validate such structures using an XSD or Relax NG schema, or process such data using the XQuery functional language.
		</p>
		

	</text>

</stránka>