--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-xpath-filtering-transforming.xml Mon Feb 21 00:43:11 2022 +0100
@@ -0,0 +1,202 @@
+<stránka
+ xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+ xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+
+ <nadpis>Filtering and transforming relational data with XPath</nadpis>
+ <perex>do simple restrictions and projections using a well-established query language</perex>
+ <m:pořadí-příkladu>04700</m:pořadí-příkladu>
+
+ <text xmlns="http://www.w3.org/1999/xhtml">
+
+ <p>
+ In <m:name/> <m:a href="release-v0.18">v0.18</m:a> we got a powerful new language for filtering and transformations: XPath.
+ It is now part of the toolset consisting of SQL, AWK, Scheme and others.
+ Although XPath was originally designed for XML, in <m:name/> we can use it for relational data coming from various sources, not only XML,
+ and also for data that violates the rules of normal forms.
+ We can process quite complex tree structures entangled in records, but we can also write simple and intuitive expressions like <code>x = "a" or y = 123</code>.
+ </p>
+
+
+ <h2>Basic filtering</h2>
+
+ <p>Let us have some CSV data:</p>
+ <m:pre jazyk="text" src="examples/film-1.csv"/>
+
+ <p>Formatted as a table, it looks like this:</p>
+ <m:pre jazyk="text" src="examples/film-1.tabular"/>
+
+
+ <p>Attributes of particular relations are available in XPath under their names, so we can directly reference them in our queries:</p>
+ <m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-1.csv \
+ | relpipe-in-csv --relation "film" \
+ | relpipe-tr-xpath \
+ --relation '.*' \
+ --where 'year >= 1980 and year < 1990' \
+ | relpipe-out-tabular]]></m:pre>
+
+ <p>Filtered result:</p>
+ <m:pre jazyk="text" src="examples/film-1.filtered-1.tabular"/>
+
+ <p>
+ n.b. If an attribute name contained any characters that are not valid in an XML name, they would be escaped in the same way as the <code>relpipe-in-*table</code> commands do it,
+ i.e. by adding underscores and Unicode code points of the given characters – e.g. the <code>weird:field</code> attribute will be available as <code>weird_3a_field</code> in XPath.
+ </p>
+
+
+ <h2>Filtering records with tree structures</h2>
+
+ <p>
+ The CSV above is not the best example of data modeling.
+ Actually, it is quite terrible.
+ But in the real world, we often have to deal with such data – either work with them directly or give them some better shape before we start doing our job.
+ </p>
+
+ <p>
+ Usually the best way is to normalize the model, i.e. to follow the rules of <a href="https://en.wikipedia.org/wiki/Database_normalization#Normal_forms">normal forms</a>.
+ In this case, we would break this denormalized CSV table into several relations:
+ <code>film</code>, <code>director</code>, <code>screenwriter</code>…
+ or rather <code>film</code>, <code>role</code>, <code>person</code>, <code>film_person_role</code>…
+ </p>
+
+ <p>
+ But for now, we will keep the data denormalized and just give them a better, machine-readable structure instead of the limited and ambiguous notation <code>screenwriter = name1 + name2</code>
+ (which causes trouble when a value contains certain characters and requires writing a parser for a <em>never-specified syntax</em>).
+ So, we will keep some data in classic relational attributes and some in a nested XML structure.
+ This approach allows us to combine rigid attributes with free-form rich tree structures.
+ </p>
+
+ <m:pre jazyk="text" src="examples/film-2.tabular"/>
+
+ <p>
+ The <code>relpipe-tr-xpath</code> tool seamlessly integrates the schema-backed (<code>year</code>) and schema-free (<code>metadata/film</code>) parts of our data model.
+ We use the same language syntax and principles for both kinds of attributes:
+ </p>
+
+
+ <m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv \
+ | relpipe-in-csv --relation "film" \
+ | relpipe-tr-xpath \
+ --relation '.*' \
+ --xml-attribute 'metadata' \
+ --where 'year = 1986 or metadata/film/screenwriter = "John Hughes"' \
+ | relpipe-out-tabular]]></m:pre>
+
+ <p>Filtered result:</p>
+
+ <m:pre jazyk="text" src="examples/film-2.filtered-1.tabular"/>
+
+ <p>
+ n.b. In the current version, we have to mark the attributes containing XML: <code>--xml-attribute 'metadata'</code>.
+ In later versions, there will be a dedicated data type for XML, so these hints will not be necessary.
+ </p>
+
+ <p>
+ This way, we can work with free-form attributes containing multiple values or run various functions on them.
+ We can e.g. list films that have more than one screenwriter:
+ </p>
+
+ <m:pre jazyk="bash">--where 'count(metadata/film/screenwriter) > 1'</m:pre>
+
+ <p>Well, well… here we are:</p>
+
+ <m:pre jazyk="text" src="examples/film-2.filtered-2.tabular"/>
+
+ <p>
+ We can also run XPath from SQL queries (<code>relpipe-tr-sql</code>) e.g. in PostgreSQL.
+ </p>
+
+ <!--
+ cat relpipe-data/examples/film-2.csv \
+ | relpipe-in-csv -\-relation 'film' \
+ | relpipe-tr-sql \
+ -\-data-source-name myPostgreSQL \
+ -\-relation film_1 "SELECT * FROM film WHERE (xpath('count(/film/screenwriter)', metadata::xml))[1]::text::integer > 1" \
+ -\-relation film_2 "SELECT * FROM film WHERE (xpath('count(/film/screenwriter) > 1', metadata::xml))[1]::text::boolean" \
+ | relpipe-out-tabular
+ -->
+
+
+ <h2>Adding new attributes and transforming data</h2>
+
+ <p>
+ The <code>relpipe-tr-xpath</code> tool performs not only restriction but also projection.
+ It can add, remove or modify the attributes while converting the input to the result set.
+ </p>
+
+ <m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv \
+ | relpipe-in-csv --relation "film" \
+ | relpipe-tr-xpath \
+ --relation '.*' \
+ --xml-attribute 'metadata' \
+ --output-attribute 'title' string 'title' \
+ --output-attribute 'director' string 'metadata/film/director' \
+ --output-attribute 'screenwriter_count' integer 'count(metadata/film/screenwriter)' \
+ | relpipe-out-tabular]]></m:pre>
+
+ <p>We removed some attributes and created new ones:</p>
+ <m:pre jazyk="text" src="examples/film-2.filtered-3.tabular"/>
+
+
+ <p>Or we may concatenate the values:</p>
+
+ <m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv \
+ | relpipe-in-csv \
+ | relpipe-tr-xpath \
+ --relation '.*' \
+ --xml-attribute 'metadata' \
+ --output-attribute 'sentence' string 'concat("The film ", title, " was directed by ", metadata/film/director, " in year ", year, ".")' \
+ | relpipe-out-nullbyte | tr \\0 \\n]]></m:pre>
+ <!-- alias relpipe-out-lines='relpipe-out-nullbyte | tr \\0 \\n' -->
+
+ <p>and build some sentences:</p>
+ <m:pre jazyk="text" src="examples/film-2.filtered-4.txt"/>
+
+ <h2>Extracting values from multiple XML files</h2>
+
+ <p>
+ Input data do not always come from a database or a carefully designed data set;
+ they may be e.g. scattered across our filesystem in some existing file format never intended for use as a database…
+ Despite this, we can still collect and query such data in a relational way.
+ </p>
+
+ <p>
+ For example, Maven (a build system for Java) describes its modules in XML format in <code>pom.xml</code> files.
+ Using <code>find</code> and <code>relpipe-in-filesystem</code>, we collect them and create a relation containing the names and contents of such files:
+ </p>
+
+ <m:pre jazyk="bash"><![CDATA[find . -type f -name 'pom.xml' -print0 \
+ | relpipe-in-filesystem \
+ --relation 'module' \
+ --file path \
+ --file content \
+ | relpipe-tr-xpath \
+ --namespace 'm' 'http://maven.apache.org/POM/4.0.0' \
+ --relation '.*' \
+ --xml-attribute 'content' \
+ --output-attribute 'path' string 'path' \
+ --output-attribute 'group_id' string 'content/m:project/m:groupId' \
+ --output-attribute 'artifact_id' string 'content/m:project/m:artifactId' \
+ --output-attribute 'version' string 'content/m:project/m:version' \
+ | relpipe-out-tabular]]></m:pre>
+ <!-- see also relpipe-in-filesystem -\-streamlet xpath -->
+
+ <p>Then we extract the desired values using <code>relpipe-tr-xpath</code> and get:</p>
+ <m:pre jazyk="text" src="examples/xpath-maven-1.tabular"/>
+
+ <p>
+ This way, we can harvest useful values from XML files – and not only from XML files, but also from various alternative formats once we convert them (on the fly) to XML.
+ Such conversions are already available for formats like <m:a href="examples-reading-querying-uniform-way">INI, ASN.1, MIME, HTML, JSON, YAML etc.</m:a>
+ </p>
+
+
+ <h2>Post scriptum</h2>
+
+ <p>
+ The above-mentioned combination of classic relational attributes with free-form XML structures is definitely not a design of first choice.
+ But sometimes it makes sense, and sometimes we have to work with data not designed by us and need some tools to deal with them.
+ When we are designing the data model ourselves, we should always pursue the normalized form… and break the rules only if we have a really good reason to do so.
+ </p>
+
+ </text>
+
+</stránka>