relpipe-data/examples-reading-querying-uniform-way.xml
branchv_0
changeset 329 5bc2bb8b7946
328:cc60c8dd7924 329:5bc2bb8b7946
       
     1 <stránka
       
     2 	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
       
     3 	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
       
     4 	
       
     5 	<nadpis>Reading and querying JSON, YAML, CBOR, HTML, MIME, INI, ASN.1 and XML in a uniform way</nadpis>
       
     6 	<perex>run XPath queries and turn data from various sources to relations</perex>
       
     7 	<m:pořadí-příkladu>04600</m:pořadí-příkladu>
       
     8 
       
     9 	<text xmlns="http://www.w3.org/1999/xhtml">
       
    10 		
       
    11 		<p>
       
    12 			Data come in different shapes and formats.
       
    13 			We can distinguish several main logical models:
       
    14 			relational,
       
    15 			tree
       
    16 			and graph
       
    17 			(a tree is a connected undirected graph with no cycles).
       
    18 			Arbitrary trees or even graphs are more flexible, but they are also harder to comprehend and work with.
       
    19 			The relational model is somewhat limited and easier to grasp, yet still flexible enough to describe almost anything
       
    20 			(actually, it can describe anything; it is just a question of how nice and native it should look).
       
    21 			Unsurprisingly, <m:name/> are built around the relational model.
       
    22 			However, sometimes we have to interact with the tree/graph world and deal with data that have a shape other than relational.
       
    23 			So we need to bridge the gap between trees/graphs and relations.
       
    24 		</p>
       
    25 		
       
    26 		<p>
       
    27 			While we have just a few logical models, there is an abundance of serialization formats, i.e. mappings of a given logical model to a sequence of octets (bytes).
       
    28 			Relations might be serialized as CSV, ODS, tables in a database, Recfiles etc.
       
    29 			Trees might be serialized as XML, YAML, ASN.1, CBOR, JSON etc.
       
    30 		</p>
       
    31 		
       
    32 		<p>
       
    33 			Why reinvent the wheel and repeat the same work for each format?
       
    34 		</p>
       
    35 		
       
    36 		<p>
       
    37 			We already have reusable code for relational data – this is given by the design of <m:name/>, because it separates: <em>inputs</em>, <em>transformations</em> and <em>outputs</em>.
       
    38 			Once the data (e.g. CSV) passes through the input filter, it becomes relational data and can be processed in a uniform way by any transformation(s) or output filter.
       
    39 		</p>
       
    40 		
       
    41 		<p>
       
    42 			But what about the tree data? We have created a set of tools (input filters) that support various serialization formats, in <m:a href="release-v0.18">v0.18</m:a>:
       
    43 		</p>
       
    44 		
       
    45 		<ul>
       
    46 			<li>XML: <code>relpipe-in-xmltable</code></li>
       
    47 			<li>ASN.1: <code>relpipe-in-asn1table</code></li>
       
    48 			<li>CBOR: <code>relpipe-in-cbortable</code></li>
       
    49 			<li>HTML: <code>relpipe-in-htmltable</code></li>
       
    50 			<li>INI: <code>relpipe-in-initable</code></li>
       
    51 			<li>MIME: <code>relpipe-in-mimetable</code></li>
       
    52 			<li>YAML: <code>relpipe-in-yamltable</code></li>
       
    53 		</ul>
       
    54 		
       
    55 		<p>
       
    56 			These tools follow the same design principle and offer the same user interface.
       
    57 			So once users learn one tool, they can apply this knowledge when working with other formats.
       
    58 			The principle is:
       
    59 		</p>
       
    60 		
       
    61 		<ul>
       
    62 			<li>We are converting the tree structure to one or more relations.</li>
       
    63 			<li>For each relation, define the expression that selects record nodes from the tree.</li>
       
    64 			<li>For each attribute, define the expression (relative to the record node) that selects the attribute value.</li>
       
    65 			<li>If anything cannot (or should not) be mapped to relations, keep it as a tree, so we can process it later – these (sub)trees might be embedded in normal records or reside in a separate relation.</li>
       
    66 			<li>We may do a full (lossless) conversion, but we may also extract just a single value from the whole tree (generate a single relation with a single record and a single attribute). Or anything in between. Either way, the tool and the logic used stay the same.</li>
       
    67 		</ul>
       
    68 		
       
    69 		<p>
       
    70 			This is nothing new – and experienced SQL users should already know where the inspiration comes from:
       
    71 			the <code>XMLTable()</code> SQL function that converts an XML tree to a result set (relation).
       
    72 			We just implemented the same functionality as a separate CLI tool, without a dependency on any SQL engine and with support not only for XML but also for alternative serialization formats.
       
    73 			And for all of them, we use the same query language: XPath.
       
    74 		</p>
       
    75 		
       
    76 		<p>
       
    77 			Although this sounds so <i>XML-ish</i>, we do not translate the alternative formats to the XML markup. There is no <i>text full of angle brackets and ampersands</i> in the middle of the process.
       
    78 			In our case, we should see XML not as a markup text (meta)format, but rather as an in-memory model – a generic tree of node objects stored in RAM that allows us to perform various tree operations (queries, modifications).
       
    79 		</p>
       
    80 		
       
    81 		
       
    82 		<h2 id="yamlToRelations">Converting a YAML tree to a set of relations</h2>
       
    83 		
       
    84 		<p>
       
    85 			Sooner or later, flat key-value lists become insufficient for software configuration, and it becomes necessary to somehow manage trees of configuration items (or relations, of course).
       
    86 			YAML is quite a good tree-serialization format.
       
    87 			It is used e.g. for configuring Java Spring applications or for Netplan network configuration in the Ubuntu GNU/Linux distribution:
       
    88 		</p>
       
    89 		
       
    90 		<m:pre jazyk="yaml" src="examples/netplan-1.yaml"/>
       
    91 		
       
    92 		<p>We can use the following command to convert the tree to a set of relations:</p>
       
    93 		
       
    94 		<m:pre jazyk="bash" src="examples/netplan-1.sh"/>
       
    95 		
       
    96 		<p>
       
    97 			So we can do a full relational conversion of the original tree structure or extract just a few desired values (e.g. the gateway IP address).
       
    98 			We can also pipe a relation to a shell loop and execute some command for each record (e.g. DNS server or IP address).
       
    99 		</p>
       
   100 		
       
   101 		<m:img src="img/wmaker-yaml-xml-tabular-1.png"/>
       
   102 		
       
   103 		<p>
       
   104 			N.b. YAML is considered to be a superset of JSON, thus tools that can read YAML can also read JSON.
       
   105 			In the current version (v0.18) of <m:name/>, the <code>relpipe-in-json</code> and <code>relpipe-in-jsontable</code> tools are just symbolic links to their YAML counterparts.
       
   106 		</p>
       
   107 		
       
   108 		<p>
       
   109 			There is also a similar example: <m:a href="examples-in-xmltable-libvirt">Reading Libvirt XML files using XMLTable</m:a>
       
   110 			where we build relations from an XML tree.
       
   111 			The principles are the same for all input formats.
       
   112 		</p>
       
   113 		
       
   114 		<h2 id="htmlTagSoup">Dealing with the HTML tagsoup</h2>
       
   115 		
       
   116 		<p>
       
   117 			With <code>relpipe-in-htmltable</code> we can extract structured information from poorly written HTML pages.
       
   118 			And unlike <code>relpipe-in-xmltable</code>, this tool does not require valid XML/XHTML, so it is good for the dirty work.
       
   119 			Processing such invalid data is always a bit unreliable, but still better than nothing.
       
   120 		</p>
       
   121 		
       
   122 		<m:pre jazyk="bash" src="examples/html-tagsoup-1.sh"/>
       
   123 		
       
   124 		<p>Although Mr. Ryszczyks is unable to create a valid document, this script will print:</p>
       
   125 		
       
   126 		<m:pre jazyk="text" src="examples/html-tagsoup-1.txt"/>
       
   127 
       
   128 		<p>
       
   129 			And thanks to the terminal autodetection in the <code>format_result()</code> function,
       
   130 			we can even pipe the result of this script to any <code>relpipe-tr-*</code> or <code>relpipe-out-*</code>
       
   131 			and get machine-readable data instead of the ANSI-colored tables – 
       
   132 			so we can do some further processing or conversion to a different format (XHTML, GUI, ODS, Recfile etc.).
       
   133 		</p>
       
   134 				
       
   135 		<h2 id="the2xmlTool">The <code>2xml</code> helper script: <code>yaml2xml</code>, <code>json2xml</code>, <code>asn12xml</code>, <code>mime2xml</code> etc.</h2>
       
   136 		
       
   137 		<p>
       
   138 			Mapping from the original syntax to the tree structure is usually quite intuitive and straightforward.
       
   139 			However, sometimes it is useful to see the XML serialization of this in-memory model.
       
   140 			In the <code>relpipe-in-xmltable.cpp</code> repository we have a helper script called <code>
       
   141 				<a href="http://hg.globalcode.info/relpipe/relpipe-in-xmltable.cpp/file/tip/examples/2xml.sh">2xml</a>
       
   142 			</code>
       
   143 			– this script is not intended to be called directly – instead the user should create a symlink e.g. <code>ini2xml</code>, <code>yaml2xml</code>, <code>asn12xml</code> etc.
       
   144 			The <code>2xml</code> script chooses the right input filter according to the symlink name and uses it for conversion from the source tree-serialization format to the XML tree-serialization format.
       
   145 		</p>
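		<p>
			The name-based selection described above can be sketched as follows.
			This is only an illustration, not the actual content of <code>2xml.sh</code>:
			the function name <code>input_filter_for</code> is hypothetical and the sketch merely assumes that the tool names follow the <code>relpipe-in-*table</code> pattern listed earlier.
		</p>

```sh
#!/bin/sh
# Hypothetical sketch: derive the input filter name from the name under which
# the script was called. The real 2xml.sh may be implemented differently.
input_filter_for() {
	name=$(basename "$1")
	case "$name" in
		# e.g. yaml2xml gives relpipe-in-yamltable
		?*2xml) printf 'relpipe-in-%stable\n' "${name%2xml}" ;;
		# the bare name "2xml" (no format prefix) is rejected
		*) return 1 ;;
	esac
}

# In a real script, "$0" would be passed instead of a literal name:
input_filter_for "yaml2xml"
input_filter_for "/usr/local/bin/asn12xml"
```

		<p>
			Because the format is taken from the symlink name, supporting another format requires only a new symlink, not a change in the script.
		</p>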
       
   146 		
       
   147 		<p>
       
   148 			If we want to do the same thing without the helper script, it is quite simple.
       
   149 			We use the appropriate <code>relpipe-in-*table</code> tool and extract a single relation with a single attribute and a single record.
       
   150 			The <code>--records</code> expression is <code>'/'</code> i.e. the root node.
       
   151 			The <code>--attribute</code> expression is <code>'.'</code> i.e. still the root node.
       
   152 			And then we just add the <code>--mode raw-xml</code> to this attribute, so we get the XML serialization of the given node (root) instead of the text content.
       
   153 		</p>
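		<p>
			Put together, the call might look like the sketch below.
			The <code>--records</code>, <code>--attribute</code> and <code>--mode raw-xml</code> parts come from the description above;
			the <code>--relation</code> option, the relation and attribute name <code>xml</code>, the <code>string</code> type and the exact argument order are assumptions that should be checked against the documentation of the particular tool.
			The wrapper function <code>to_raw_xml</code> is hypothetical.
		</p>

```sh
#!/bin/sh
# Sketch only: the relation name "xml", the attribute name "xml" and the
# attribute type "string" are assumptions chosen for illustration.
# Usage: to_raw_xml yaml   (the source document is read from STDIN)
to_raw_xml() {
	"relpipe-in-${1}table" \
		--relation "xml" \
		--records "/" \
		--attribute "xml" string "." \
		--mode raw-xml
}
```

		<p>
			The output of this function is still relational data with one record and one attribute, so it has to be piped to some <code>relpipe-out-*</code> filter to obtain the XML text itself.
		</p>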
       
   154 		
       
   155 		<p>
       
   156 			In addition to this, the <code>2xml</code> script also does formatting/indentation and syntax highlighting,
       
   157 			if the required tools (<code>xmllint</code> and <code>pygmentize</code>) are available and the STDOUT is a terminal.
       
   158 		</p>
       
   159 		
       
   160 		<p>
       
   161 			This script is useful when writing the expressions for <code>relpipe-in-*table</code>,
       
   162 			but also as a pipeline filter that allows us to use the whole XML ecosystem also for other formats.
       
   163 			We can read YAML, JSON, INI, MIME or even some binary formats etc., apply an XSLT transformation to such data and generate e.g. an XHTML report or a DocBook document,
       
   164 			or validate such structures using an XSD or RELAX NG schema, or we can process such data using the XQuery functional language.
       
   165 		</p>
       
   166 		
       
   167 
       
   168 	</text>
       
   169 
       
   170 </stránka>