<stránka
xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

<nadpis>Reading and querying JSON, YAML, CBOR, HTML, MIME, INI, ASN.1 and XML in a uniform way</nadpis>
<perex>run XPath queries and turn data from various sources into relations</perex>
<m:pořadí-příkladu>04600</m:pořadí-příkladu>

<text xmlns="http://www.w3.org/1999/xhtml">

<p>
Data come in different shapes and formats.
We can distinguish several main logical models:
relational,
tree
and graph
(a tree is a connected undirected graph with no cycles).
Arbitrary trees or even graphs are more flexible, but they are also harder to comprehend and work with.
The relational model is somewhat more constrained and easier to grasp, yet still flexible enough to describe almost anything
(actually it can describe anything; it is just a question of how nice and native it should look).
Unsurprisingly, <m:name/> are built around the relational model.
However, sometimes we have to interact with the tree/graph world and deal with data that have a shape other than relational.
So we need to bridge the gap between trees/graphs and relations.
</p>

<p>
While we have just a few logical models, there is an abundance of serialization formats, i.e. mappings of a given logical model to a sequence of octets (bytes).
Relations might be serialized as CSV, ODS, tables in a database, Recfiles etc.
Trees might be serialized as XML, YAML, ASN.1, CBOR, JSON etc.
</p>

<p>
Why reinvent the wheel and repeat the same work for each format?
</p>

<p>
We already have reusable code for relational data: this follows from the design of <m:name/>, which separates <em>inputs</em>, <em>transformations</em> and <em>outputs</em>.
Once the data (e.g. CSV) pass through the input filter, they become relational data and can be processed in a uniform way by any transformation(s) or output filter.
</p>

<p>
But what about tree data? In <m:a href="release-v0.18">v0.18</m:a> we have created a set of tools (input filters) that support various serialization formats:
</p>

<ul>
<li>XML: <code>relpipe-in-xmltable</code></li>
<li>ASN.1: <code>relpipe-in-asn1table</code></li>
<li>CBOR: <code>relpipe-in-cbortable</code></li>
<li>HTML: <code>relpipe-in-htmltable</code></li>
<li>INI: <code>relpipe-in-initable</code></li>
<li>MIME: <code>relpipe-in-mimetable</code></li>
<li>YAML: <code>relpipe-in-yamltable</code></li>
</ul>

<p>
These tools follow the same design principle and offer the same user interface.
So once users learn one tool, they can apply this knowledge to the other formats as well.
The principle is:
</p>

<ul>
<li>We are converting the tree structure to one or more relations.</li>
<li>For each relation, define the expression that selects record nodes from the tree.</li>
<li>For each attribute, define the expression (relative to the record node) that selects the attribute value.</li>
<li>If anything cannot (or should not) be mapped to relations, keep it as a tree, so we can process it later; these (sub)trees might be embedded in normal records or reside in a separate relation.</li>
<li>We may do a full (lossless) conversion, but we may also extract just a single value from the whole tree (generate a single relation with a single record and a single attribute). Or anything in between. Either way, the tool and the logic used remain the same.</li>
</ul>

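<p>
The principle above can be sketched as a single command.
This is only an illustration: the input document, the relation name <code>book</code> and the attribute names are made up,
and the <code>--relation</code> option, the <code>name type expression</code> order of the <code>--attribute</code> arguments and the <code>relpipe-out-tabular</code> output are assumptions recalled from other examples of this project, so please check them against the documentation of the tools.
</p>

<m:pre jazyk="bash"><![CDATA[
# Sketch only: relation name, attribute names and the input shape are hypothetical.
# Assumed (verify first): --relation, and --attribute taking "name type expression".
books_to_relation() {
	relpipe-in-xmltable \
		--relation "book" \
			--records "/library/book" \
			--attribute "title" string "title" \
			--attribute "year" integer "@year" \
		| relpipe-out-tabular
}

# The --records expression selects the record nodes (one record per book element),
# each --attribute expression is evaluated relative to such a record node.
# Usage with some hypothetical file:
# books_to_relation < library.xml
]]></m:pre>
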
<p>
This is nothing new, and experienced SQL users should already know where the inspiration comes from:
the <code>XMLTable()</code> SQL function that converts an XML tree to a result set (relation).
We just implemented the same functionality as a separate CLI tool, without a dependency on any SQL engine and with support not only for XML but also for alternative serialization formats.
And for all of them, we use the same query language: XPath.
</p>

<p>
Although this sounds quite <i>XML-ish</i>, we do not translate the alternative formats to the XML markup. There is no <i>text full of angle brackets and ampersands</i> in the middle of the process.
In our case, we should see XML not as a markup text (meta)format, but rather as an in-memory model: a generic tree of node objects stored in RAM that allows us to perform various tree operations (queries, modifications).
</p>


<h2 id="yamlToRelations">Converting a YAML tree to a set of relations</h2>

<p>
Sooner or later, flat key-value lists become insufficient for software configuration and it becomes necessary to manage trees of configuration items (or relations, of course).
YAML is quite a good tree-serialization format.
It is used e.g. for configuring Java Spring applications or for Netplan network configuration in the Ubuntu GNU/Linux distribution:
</p>

<m:pre jazyk="yaml" src="examples/netplan-1.yaml"/>

<p>We can use the following command to convert the tree to a set of relations:</p>

<m:pre jazyk="bash" src="examples/netplan-1.sh"/>

<p>
So we can do a full relational conversion of the original tree structure or extract just a few desired values (e.g. the gateway IP address).
We can also pipe a relation to a shell loop and execute some command for each record (e.g. DNS server or IP address).
</p>

<m:img src="img/wmaker-yaml-xml-tabular-1.png"/>

<p>
N.B. YAML is considered to be a superset of JSON, so tools that can read YAML can also read JSON.
In the current version (v0.18) of <m:name/>, the <code>relpipe-in-json</code> and <code>relpipe-in-jsontable</code> tools are just symbolic links to their YAML counterparts.
</p>

<p>
There is also a similar example, <m:a href="examples-in-xmltable-libvirt">Reading Libvirt XML files using XMLTable</m:a>,
where we build relations from an XML tree.
The principles are the same for all input formats.
</p>

<h2 id="htmlTagSoup">Dealing with the HTML tagsoup</h2>

<p>
With <code>relpipe-in-htmltable</code> we can extract structured information from poorly written HTML pages.
Unlike <code>relpipe-in-xmltable</code>, this tool does not require valid XML/XHTML, so it is good for the dirty work.
Processing such invalid data is always a bit unreliable, but still better than nothing.
</p>

<m:pre jazyk="bash" src="examples/html-tagsoup-1.sh"/>

<p>Although Mr. Ryszczyks is unable to create a valid document, this script will print:</p>

<m:pre jazyk="text" src="examples/html-tagsoup-1.txt"/>

<p>
And thanks to the terminal autodetection in the <code>format_result()</code> function,
we can even pipe the result of this script to any <code>relpipe-tr-*</code> or <code>relpipe-out-*</code> tool
and get machine-readable data instead of the ANSI-colored tables,
so we can do some further processing or conversion to a different format (XHTML, GUI, ODS, Recfile etc.).
</p>

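<p>
The real <code>format_result()</code> function lives in the example script and is not reproduced here.
The following is only a minimal sketch of how such terminal autodetection might be written,
assuming that the human-readable branch uses <code>relpipe-out-tabular</code> and that the other branch simply passes the relational data through.
</p>

<m:pre jazyk="bash"><![CDATA[
# Hypothetical re-creation, not the code from examples/html-tagsoup-1.sh.
format_result() {
	if [ -t 1 ]; then
		# STDOUT is a terminal: render a table for the human reader.
		relpipe-out-tabular
	else
		# STDOUT is a pipe or a file: pass the relational data through unchanged,
		# so that another relpipe-tr-* or relpipe-out-* tool can consume them.
		cat
	fi
}
]]></m:pre>

<p>
The <code>[ -t 1 ]</code> test asks whether file descriptor 1 (STDOUT) is attached to a terminal, so the same script can serve both interactive use and pipelines.
</p>
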
<h2 id="the2xmlTool">The <code>2xml</code> helper script: <code>yaml2xml</code>, <code>json2xml</code>, <code>asn12xml</code>, <code>mime2xml</code> etc.</h2>

<p>
Mapping from the original syntax to the tree structure is usually quite intuitive and straightforward.
However, sometimes it is useful to see the XML serialization of this in-memory model.
In the <code>relpipe-in-xmltable.cpp</code> repository we have a helper script called
<code><a href="http://hg.globalcode.info/relpipe/relpipe-in-xmltable.cpp/file/tip/examples/2xml.sh">2xml</a></code>.
This script is not intended to be called directly; instead, the user should create a symlink named e.g. <code>ini2xml</code>, <code>yaml2xml</code>, <code>asn12xml</code> etc.
The <code>2xml</code> script chooses the right input filter according to the symlink name and uses it to convert the source tree-serialization format to the XML tree-serialization format.
</p>

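<p>
The selection by symlink name might look roughly like the sketch below.
It is not the code of the real <code>2xml</code> script; the function name <code>filter_for_name</code> and the error message are hypothetical,
and the sketch only assumes that every filter follows the <code>relpipe-in-*table</code> naming listed above.
</p>

<m:pre jazyk="bash"><![CDATA[
# Hypothetical sketch, not the real 2xml.sh: derive the input filter name
# from the name under which the script was called (e.g. the symlink yaml2xml).
filter_for_name() {
	name=$(basename "$1")
	format=${name%2xml}    # yaml2xml -> yaml, asn12xml -> asn1
	if [ -z "$format" ] || [ "$format" = "$name" ]; then
		echo "Unsupported name: $name (use a symlink like yaml2xml)" >&2
		return 1
	fi
	echo "relpipe-in-${format}table"
}

# In a real script we would pass "$0":
# filter=$(filter_for_name "$0") || exit 1
]]></m:pre>

<p>
Calling the function with <code>2xml</code> itself fails, which matches the rule that the script should be used only through a symlink.
</p>
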
<p>
If we want to do the same thing without the helper script, it is quite simple.
We use the appropriate <code>relpipe-in-*table</code> tool and extract a single relation with a single attribute and a single record.
The <code>--records</code> expression is <code>'/'</code> i.e. the root node.
The <code>--attribute</code> expression is <code>'.'</code> i.e. still the root node.
And then we just add <code>--mode raw-xml</code> to this attribute, so we get the XML serialization of the given node (the root) instead of its text content.
</p>

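<p>
Put together, such a command might look like the following sketch for YAML input.
The <code>--records</code>, <code>--attribute</code> and <code>--mode raw-xml</code> parts come from the description above;
the relation and attribute name <code>xml</code>, the <code>--relation</code> option, the <code>string</code> type argument and the output stage (<code>relpipe-out-nullbyte</code> followed by <code>tr</code>) are assumptions and should be verified against the actual tools.
</p>

<m:pre jazyk="bash"><![CDATA[
# Sketch under assumptions: the names "xml" are arbitrary; --relation, the
# "string" type argument and the output stage are not described in this text.
yaml_to_xml() {
	relpipe-in-yamltable \
		--relation "xml" \
			--records "/" \
			--attribute "xml" string "." --mode raw-xml \
		| relpipe-out-nullbyte \
		| tr -d '\0'
}

# Usage:
# yaml_to_xml < examples/netplan-1.yaml
]]></m:pre>
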
<p>
In addition to this, the <code>2xml</code> script also does formatting/indentation and syntax highlighting,
if the required tools (<code>xmllint</code> and <code>pygmentize</code>) are available and STDOUT is a terminal.
</p>

<p>
This script is useful when writing the expressions for <code>relpipe-in-*table</code>,
but also as a pipeline filter that allows us to use the whole XML ecosystem for other formats as well.
We can read YAML, JSON, INI, MIME or even some binary formats and apply an XSLT transformation to such data to generate e.g. an XHTML report or a DocBook document,
validate such structures using an XSD or RELAX NG schema, or process such data using the XQuery functional language.
</p>


</text>

</stránka>