relpipe-data/examples-xpath-filtering-transforming.xml
branchv_0
changeset 329 5bc2bb8b7946
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-xpath-filtering-transforming.xml	Mon Feb 21 00:43:11 2022 +0100
@@ -0,0 +1,202 @@
+<stránka
+	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+	
+	<nadpis>Filtering and transforming relational data with XPath</nadpis>
+	<perex>do simple restrictions and projections using a well-established query language</perex>
+	<m:pořadí-příkladu>04700</m:pořadí-příkladu>
+
+	<text xmlns="http://www.w3.org/1999/xhtml">
+		
+		<p>
+			In <m:name/> <m:a href="release-v0.18">v0.18</m:a> we got a new powerful language for filtering and transformations: XPath.
+			It is now part of the toolset consisting of SQL, AWK, Scheme and others.
+			Although XPath was originally designed for XML, in <m:name/> we can use it for relational data coming from various sources, not only XML,
+			and also for data that violates the rules of normal forms.
+			We can process quite complex tree structures entangled in records but we can also write simple and intuitive expressions like <code>x = "a" or y = 123</code>.
+		</p>
+
+		
+		<h2>Basic filtering</h2>
+		
+		<p>Let us have some CSV data:</p>		
+		<m:pre jazyk="text" src="examples/film-1.csv"/>
+		
+		<p>Formatted as a table, it looks like this:</p>
+		<m:pre jazyk="text" src="examples/film-1.tabular"/>
+		
+		
+		<p>Attributes of particular relations are available in XPath under their names, so we can directly reference them in our queries:</p>
+		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-1.csv               \
+	| relpipe-in-csv --relation "film"             \
+	| relpipe-tr-xpath                             \
+		--relation '.*'                            \
+			--where 'year >= 1980 and year < 1990' \
+	| relpipe-out-tabular]]></m:pre>
+	
+		<p>Filtered result:</p>
+		<m:pre jazyk="text" src="examples/film-1.filtered-1.tabular"/>
+		
+		<p>
+			N.B. If an attribute name contained characters that are not valid in an XML name, they would be escaped in the same way as the <code>relpipe-in-*table</code> commands do it,
+			i.e. by adding underscores and Unicode code points of the given characters – e.g. the <code>weird:field</code> attribute would be available as <code>weird_3a_field</code> in XPath.
+		</p>
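+		
+		<p>
+			The escaping rule can be sketched in plain Bash. This is only an illustration, not the actual <m:name/> implementation:
+			the function name <code>escape_xml_name</code> is hypothetical and we assume that every character outside <code>A-Z a-z 0-9 _ . -</code> is replaced by its hexadecimal code point wrapped in underscores.
+			The real rules may differ, e.g. for non-ASCII letters or for names starting with a digit.
+		</p>
+		<m:pre jazyk="bash"><![CDATA[# Hypothetical helper: escape characters that we assume are not allowed in an XML name.
+escape_xml_name() {
+	local input="$1" output="" char i
+	for (( i = 0; i < ${#input}; i++ )); do
+		char="${input:i:1}"
+		case "$char" in
+			[A-Za-z0-9_.-]) output+="$char" ;;            # assumed to be safe, kept as is
+			*) output+="$(printf '_%x_' "'$char")" ;;     # e.g. ":" becomes "_3a_"
+		esac
+	done
+	printf '%s\n' "$output"
+}
+
+escape_xml_name 'weird:field'   # prints: weird_3a_field]]></m:pre>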
+		
+		
+		<h2>Filtering records with tree structures</h2>
+		
+		<p>
+			The CSV above is not the best example of data modeling.
+			Actually, it is quite terrible.
+			But in the real world, we often have to deal with such data – either work with them directly or give them some better shape before we start doing our job.
+		</p>
+		
+		<p>
+			Usually the best way is to normalize the model – follow the rules of <a href="https://en.wikipedia.org/wiki/Database_normalization#Normal_forms">Normal forms</a>.
+			In this case, we would break this denormalized CSV table into several relations:
+			<code>film</code>, <code>director</code>, <code>screenwriter</code>…
+			or rather <code>film</code>, <code>role</code>, <code>person</code>, <code>film_person_role</code>…
+		</p>
+		
+		<p>
+			But for now, we will keep the data denormalized and just give them a better, machine-readable structure instead of the limited and ambiguous notation of <code>screenwriter = name1 + name2</code>
+			(which causes trouble when a value contains certain characters and requires writing a parser for a <em>never-specified syntax</em>).
+			So, we will keep some data in classic relational attributes and some in a nested XML structure.
+			This approach allows us to combine rigid attributes with free-form rich tree structures.
+		</p>
+		
+		<m:pre jazyk="text" src="examples/film-2.tabular"/>
+		
+		<p>
+			The <code>relpipe-tr-xpath</code> seamlessly integrates the schema-backed (<code>year</code>) and schema-free (<code>metadata/film</code>) parts of our data model.
+			We use the same language syntax and principles for both kinds of attributes:
+		</p>
+		
+		
+		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv                                            \
+	| relpipe-in-csv --relation "film"                                          \
+	| relpipe-tr-xpath                                                          \
+		--relation '.*'                                                         \
+			--xml-attribute 'metadata'                                          \
+			--where 'year = 1986 or metadata/film/screenwriter = "John Hughes"' \
+	| relpipe-out-tabular]]></m:pre>
+	
+		<p>Filtered result:</p>
+	
+		<m:pre jazyk="text" src="examples/film-2.filtered-1.tabular"/>
+		
+		<p>
+			N.B. In the current version, we have to mark the attributes containing XML: <code>--xml-attribute 'metadata'</code>.
+			In later versions, there will be a dedicated data type for XML, so these hints will not be necessary.
+		</p>
+		
+		<p>
+			This way, we can work with free-form attributes containing multiple values or run various functions on them.
+			We can e.g. list films that have more than one screenwriter:
+		</p>
+		
+		<m:pre jazyk="bash">--where 'count(metadata/film/screenwriter) &gt; 1'</m:pre>
+		
+		<p>Well, well… here we are:</p>
+		
+		<m:pre jazyk="text" src="examples/film-2.filtered-2.tabular"/>
+		
+		<p>
+			We can also run XPath from SQL queries (<code>relpipe-tr-sql</code>) e.g. in PostgreSQL.
+		</p>
+		
+		<!--
+		cat relpipe-data/examples/film-2.csv \
+			| relpipe-in-csv -\-relation 'film' \
+			| relpipe-tr-sql \
+				-\-data-source-name myPostgreSQL \
+				-\-relation film_1 "SELECT * FROM film WHERE (xpath('count(/film/screenwriter)', metadata::xml))[1]::text::integer > 1" \
+				-\-relation film_2 "SELECT * FROM film WHERE (xpath('count(/film/screenwriter) > 1', metadata::xml))[1]::text::boolean" \
+			| relpipe-out-tabular
+		-->
+		
+		
+		<h2>Adding new attributes and transforming data</h2>
+		
+		<p>
+			The <code>relpipe-tr-xpath</code> tool does not only restriction but also projection.
+			It can add, remove or modify attributes while converting the input to the result set.
+		</p>
+		
+		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv                                                            \
+	| relpipe-in-csv --relation "film"                                                          \
+	| relpipe-tr-xpath                                                                          \
+		--relation '.*'                                                                         \
+			--xml-attribute 'metadata'                                                          \
+			--output-attribute 'title'              string  'title'                             \
+			--output-attribute 'director'           string  'metadata/film/director'            \
+			--output-attribute 'screenwriter_count' integer 'count(metadata/film/screenwriter)' \
+	| relpipe-out-tabular]]></m:pre>
+
+		<p>We removed some attributes and created new ones:</p>
+		<m:pre jazyk="text" src="examples/film-2.filtered-3.tabular"/>
+		
+		
+		<p>Or we may concatenate the values:</p>
+		
+		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv       \
+	| relpipe-in-csv                       \
+	| relpipe-tr-xpath                     \
+		--relation '.*'                    \
+			--xml-attribute 'metadata'     \
+			--output-attribute 'sentence' string 'concat("The film ", title, " was directed by ", metadata/film/director, " in year ", year, ".")' \
+	| relpipe-out-nullbyte | tr \\0 \\n]]></m:pre>
+		<!-- alias relpipe-out-lines='relpipe-out-nullbyte | tr \\0 \\n' -->
+		
+		<p>and build some sentences:</p>
+		<m:pre jazyk="text" src="examples/film-2.filtered-4.txt"/>
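+		
+		<p>
+			The last step of the pipeline above uses <code>relpipe-out-nullbyte</code>, which (judging by that pipeline) emits attribute values delimited by zero bytes, and <code>tr</code> then turns each zero byte into a line break.
+			The effect of the <code>tr</code> part alone can be tried without <m:name/>; the two sample values below are made up for illustration:
+		</p>
+		<m:pre jazyk="bash"><![CDATA[# Stand-in for relpipe-out-nullbyte: two made-up values, each followed by a zero byte.
+printf 'first value\0second value\0' | tr \\0 \\n]]></m:pre>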
+		
+		<h2>Extracting values from multiple XML files</h2>
+		
+		<p>
+			Input data do not have to come from a database or a carefully designed data set.
+			They may be, e.g., scattered across our filesystem in an existing file format that was never intended for use as a database.
+			Despite this, we can still collect and query such data in a relational way.
+		</p>
+		
+		<p>
+			For example, Maven (a build system for Java) describes its modules in XML format in <code>pom.xml</code> files.
+			Using <code>find</code> and <code>relpipe-in-filesystem</code>, we collect them and create a relation containing the names and contents of such files:
+		</p>
+		
+		<m:pre jazyk="bash"><![CDATA[find -type f -name 'pom.xml' -print0                                                    \
+	| relpipe-in-filesystem                                                             \
+		--relation 'module'                                                             \
+		--file path                                                                     \
+		--file content                                                                  \
+	| relpipe-tr-xpath                                                                  \
+		--namespace 'm' 'http://maven.apache.org/POM/4.0.0'                             \
+		--relation '.*'                                                                 \
+			--xml-attribute 'content'                                                   \
+			--output-attribute 'path'        string 'path'                              \
+			--output-attribute 'group_id'    string 'content/m:project/m:groupId'       \
+			--output-attribute 'artifact_id' string 'content/m:project/m:artifactId'    \
+			--output-attribute 'version'     string 'content/m:project/m:version'       \
+	| relpipe-out-tabular]]></m:pre>
+		<!-- see also relpipe-in-filesystem -\-streamlet xpath -->
+	
+		<p>Then we extract the desired values using <code>relpipe-tr-xpath</code> and get:</p>
+		<m:pre jazyk="text" src="examples/xpath-maven-1.tabular"/>
+	
+		<p>
+			This way we can harvest useful values from XML files – and not only XML files, also from various alternative formats, after we convert them (on-the-fly) to XML.
+			Such conversions are already available for formats like <m:a href="examples-reading-querying-uniform-way">INI, ASN.1, MIME, HTML, JSON, YAML etc.</m:a>
+		</p>
+		
+		
+		<h2>Post scriptum</h2>
+		
+		<p>
+			The above-mentioned combination of classic relational attributes with free-form XML structures is definitely not a design of first choice.
+			But sometimes it makes sense, and sometimes we have to work with data not designed by us and need some tools to deal with them.
+			When we are designing the data model ourselves, we should always pursue the normalized form …and break the rules only if we have a really good reason to do so.
+		</p>
+		
+	</text>
+
+</stránka>