relpipe-data/examples-xpath-filtering-transforming.xml
author František Kučera <franta-hg@frantovo.cz>
Mon, 21 Feb 2022 01:21:22 +0100
branch v_0
changeset 330 70e7eb578cfa
parent 329 5bc2bb8b7946
permissions -rw-r--r--
Added tag relpipe-v0.18 for changeset 5bc2bb8b7946

<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Filtering and transforming relational data with XPath</nadpis>
	<perex>do simple restrictions and projections using a well-established query language</perex>
	<m:pořadí-příkladu>04700</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			In <m:name/> <m:a href="release-v0.18">v0.18</m:a> we got a new powerful language for filtering and transformations: XPath.
			It is now part of the toolset consisting of SQL, AWK, Scheme and others.
			Although XPath was originally designed for XML, in <m:name/> we can use it for relational data coming from various sources, not only XML,
			and also for data that violates the rules of normal forms.
			We can process quite complex tree structures entangled in records, but we can also write simple and intuitive expressions like <code>x = "a" or y = 123</code>.
		</p>

		
		<h2>Basic filtering</h2>
		
		<p>Let us have some CSV data:</p>		
		<m:pre jazyk="text" src="examples/film-1.csv"/>
		
		<p>Formatted as a table, it looks like this:</p>
		<m:pre jazyk="text" src="examples/film-1.tabular"/>
		
		
		<p>Attributes of particular relations are available in XPath under their names, so we can directly reference them in our queries:</p>
		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-1.csv               \
	| relpipe-in-csv --relation "film"             \
	| relpipe-tr-xpath                             \
		--relation '.*'                            \
			--where 'year >= 1980 and year < 1990' \
	| relpipe-out-tabular]]></m:pre>
	
		<p>Filtered result:</p>
		<m:pre jazyk="text" src="examples/film-1.filtered-1.tabular"/>
		
		<p>
			n.b. If an attribute name contained characters that are not valid in an XML name, they would be escaped in the same way as the <code>relpipe-in-*table</code> commands do it,
			i.e. by adding underscores and Unicode code points of the given characters – e.g. the <code>weird:field</code> attribute will be available as <code>weird_3a_field</code> in XPath.
		</p>
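		
		<p>
			The escaping can be imagined roughly like the following shell function.
			It is only an illustrative sketch, not the actual implementation: the function name <code>escape_name</code> is made up,
			it handles ASCII input only, and both the set of characters left untouched (ASCII letters and digits) and the treatment of the underscore itself are assumptions.
		</p>
		<m:pre jazyk="bash"><![CDATA[# hypothetical helper (not part of relpipe): escapes an attribute name
# by replacing each unsafe character with _<hex code point>_
escape_name() {
	local input="$1" output="" char hex i
	for (( i = 0; i < ${#input}; i++ )); do
		char="${input:i:1}"
		case "$char" in
			[A-Za-z0-9])
				# assumption: only ASCII letters and digits are kept as they are
				output+="$char"
				;;
			*)
				# the leading quote makes printf use the numeric value of the character
				printf -v hex '%x' "'$char"
				output+="_${hex}_"
				;;
		esac
	done
	printf '%s\n' "$output"
}

escape_name 'weird:field'   # prints weird_3a_field]]></m:pre>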
		
		
		<h2>Filtering records with tree structures</h2>
		
		<p>
			The CSV above is not the best example of data modeling.
			Actually, it is quite terrible.
			But in the real world, we often have to deal with such data – either work with them directly or give them some better shape before we start doing our job.
		</p>
		
		<p>
			Usually the best way is to normalize the model – follow the rules of <a href="https://en.wikipedia.org/wiki/Database_normalization#Normal_forms">Normal forms</a>.
			In this case, we would break this denormalized CSV table into several relations:
			<code>film</code>, <code>director</code>, <code>screenwriter</code>…
			or rather <code>film</code>, <code>role</code>, <code>person</code>, <code>film_person_role</code>…
		</p>
		
		<p>
			But for now, we will keep the data denormalized and just give them a better, machine-readable structure instead of the limited and ambiguous notation of <code>screenwriter = name1 + name2</code>
			(which causes trouble when a value contains certain characters and requires writing a parser for a <em>never-specified syntax</em>).
			So, we will keep some data in classic relational attributes and some in a nested XML structure.
			This approach allows us to combine rigid attributes with free-form rich tree structures.
		</p>
		
		<m:pre jazyk="text" src="examples/film-2.tabular"/>
		
		<p>
			The <code>relpipe-tr-xpath</code> tool seamlessly integrates the schema-backed (<code>year</code>) and schema-free (<code>metadata/film</code>) parts of our data model.
			We use the same language syntax and principles for both kinds of attributes:
		</p>
		
		
		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv                                            \
	| relpipe-in-csv --relation "film"                                          \
	| relpipe-tr-xpath                                                          \
		--relation '.*'                                                         \
			--xml-attribute 'metadata'                                          \
			--where 'year = 1986 or metadata/film/screenwriter = "John Hughes"' \
	| relpipe-out-tabular]]></m:pre>
	
		<p>Filtered result:</p>
	
		<m:pre jazyk="text" src="examples/film-2.filtered-1.tabular"/>
		
		<p>
			n.b. In the current version, we have to mark the attributes containing XML: <code>--xml-attribute 'metadata'</code>.
			In later versions, there will be a dedicated data type for XML, so these hints will not be necessary.
		</p>
		
		<p>
			This way, we can work with free-form attributes containing multiple values or run various functions on them.
			For example, we can list films that have more than one screenwriter:
		</p>
		
		<m:pre jazyk="bash">--where 'count(metadata/film/screenwriter) &gt; 1'</m:pre>
		
		<p>Well, well… here we are:</p>
		
		<m:pre jazyk="text" src="examples/film-2.filtered-2.tabular"/>
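		
		<p>
			Such a condition can also be tried in isolation on a single XML value.
			The following sketch assumes that <code>xmllint</code> (a tool from libxml2, not a part of <m:name/>) is installed;
			the sample document and the names in it are made up for illustration.
			The path starts at <code>/film</code> here, because there is no enclosing record with the <code>metadata</code> attribute.
		</p>
		<m:pre jazyk="bash"><![CDATA[# hypothetical metadata value of a single record (the names are made up)
sample='<film><director>Director A</director><screenwriter>Writer A</screenwriter><screenwriter>Writer B</screenwriter></film>'

# evaluate the same kind of condition as in the --where option above;
# xmllint reads the document from standard input when the file name is "-"
result="$(printf '%s' "$sample" | xmllint --xpath 'count(/film/screenwriter) > 1' -)"

echo "$result"   # prints true, because the sample has two screenwriters]]></m:pre>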
		
		<p>
			We can also run XPath from SQL queries (<code>relpipe-tr-sql</code>) e.g. in PostgreSQL.
		</p>
		
		<!--
		cat relpipe-data/examples/film-2.csv \
			| relpipe-in-csv -\-relation 'film' \
			| relpipe-tr-sql \
				-\-data-source-name myPostgreSQL \
				-\-relation film_1 "SELECT * FROM film WHERE (xpath('count(/film/screenwriter)', metadata::xml))[1]::text::integer > 1" \
				-\-relation film_2 "SELECT * FROM film WHERE (xpath('count(/film/screenwriter) > 1', metadata::xml))[1]::text::boolean" \
			| relpipe-out-tabular
		-->
		
		
		<h2>Adding new attributes and transforming data</h2>
		
		<p>
			The <code>relpipe-tr-xpath</code> tool performs not only restriction but also projection.
			It can add, remove or modify attributes while converting the input to the result set.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv                                                            \
	| relpipe-in-csv --relation "film"                                                          \
	| relpipe-tr-xpath                                                                          \
		--relation '.*'                                                                         \
			--xml-attribute 'metadata'                                                          \
			--output-attribute 'title'              string  'title'                             \
			--output-attribute 'director'           string  'metadata/film/director'            \
			--output-attribute 'screenwriter_count' integer 'count(metadata/film/screenwriter)' \
	| relpipe-out-tabular]]></m:pre>

		<p>We removed some attributes and created new ones:</p>
		<m:pre jazyk="text" src="examples/film-2.filtered-3.tabular"/>
		
		
		<p>Or we may concatenate the values:</p>
		
		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv       \
	| relpipe-in-csv                       \
	| relpipe-tr-xpath                     \
		--relation '.*'                    \
			--xml-attribute 'metadata'     \
			--output-attribute 'sentence' string 'concat("The film ", title, " was directed by ", metadata/film/director, " in year ", year, ".")' \
	| relpipe-out-nullbyte | tr \\0 \\n]]></m:pre>
		<!-- alias relpipe-out-lines='relpipe-out-nullbyte | tr \\0 \\n' -->
		
		<p>and build some sentences:</p>
		<m:pre jazyk="text" src="examples/film-2.filtered-4.txt"/>
		
		<h2>Extracting values from multiple XML files</h2>
		
		<p>
			Input data do not always come from a database or a carefully designed data set.
			They may be e.g. scattered across our filesystem in some existing file format that was never intended for use as a database.
			Despite this, we can still collect and query such data in a relational way.
		</p>
		
		<p>
			For example, Maven (a build system for Java) describes its modules in XML format in <code>pom.xml</code> files.
			Using <code>find</code> and <code>relpipe-in-filesystem</code>, we collect them and create a relation containing the names and contents of such files:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[find -type f -name 'pom.xml' -print0                                                    \
	| relpipe-in-filesystem                                                             \
		--relation 'module'                                                             \
		--file path                                                                     \
		--file content                                                                  \
	| relpipe-tr-xpath                                                                  \
		--namespace 'm' 'http://maven.apache.org/POM/4.0.0'                             \
		--relation '.*'                                                                 \
			--xml-attribute 'content'                                                   \
			--output-attribute 'path'        string 'path'                              \
			--output-attribute 'group_id'    string 'content/m:project/m:groupId'       \
			--output-attribute 'artifact_id' string 'content/m:project/m:artifactId'    \
			--output-attribute 'version'     string 'content/m:project/m:version'       \
	| relpipe-out-tabular]]></m:pre>
		<!-- see also relpipe-in-filesystem -\-streamlet xpath -->
	
		<p>Then we extract the desired values using <code>relpipe-tr-xpath</code> and get:</p>
		<m:pre jazyk="text" src="examples/xpath-maven-1.tabular"/>
	
		<p>
			This way we can harvest useful values from XML files – and not only XML files, also from various alternative formats, after we convert them (on-the-fly) to XML.
			Such conversions are already available for formats like <m:a href="examples-reading-querying-uniform-way">INI, ASN.1, MIME, HTML, JSON, YAML etc.</m:a>
		</p>
		
		
		<h2>Post scriptum</h2>
		
		<p>
			The above-mentioned combination of classic relational attributes with free-form XML structures is definitely not a design of first choice.
			But sometimes it makes sense, and sometimes we have to work with data not designed by us and need some tools to deal with them.
			When we are designing the data model ourselves, we should always pursue the normalized form …and break the rules only if we have a really good reason to do so.
		</p>
		
	</text>

</stránka>