diff -r cc60c8dd7924 -r 5bc2bb8b7946 relpipe-data/examples-xpath-filtering-transforming.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/relpipe-data/examples-xpath-filtering-transforming.xml Mon Feb 21 00:43:11 2022 +0100 @@ -0,0 +1,202 @@ + + + Filtering and transforming relational data with XPath + do simple restrictions and projections using a well-established query language + 04700 + + + +

+ In v0.18 we got a new powerful language for filtering and transformations: XPath. + It is now part of the toolset consisting of SQL, AWK, Scheme and others. + Although XPath was originally designed for XML, here we can use it for relational data coming from various sources, not only XML, + and also for data that violates the rules of normal forms. + We can process quite complex tree structures entangled in records, but we can also write simple and intuitive expressions like x = "a" or y = 123. +

+ + +

Basic filtering

+ +

Let us have some CSV data:

+ + +

That looks like this when formatted as a table:

+ + + +

Attributes of particular relations are available in XPath under their names, so we can directly reference them in our queries:

+ = 1980 and year < 1990' \ + | relpipe-out-tabular]]> + +

Filtered result:

+ + +
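To make the semantics of such a restriction concrete, here is a minimal Python sketch. It is not how relpipe-tr-xpath is implemented; the records and the restrict helper are hypothetical and only mimic the idea that each attribute is addressable by its name and that records failing the condition are dropped.

```python
# Hypothetical records; the real CSV content is not reproduced here.
films = [
    {"title": "Film A", "year": 1975},
    {"title": "Film B", "year": 1980},
    {"title": "Film C", "year": 1989},
    {"title": "Film D", "year": 1990},
]


def restrict(records, predicate):
    """Keep only the records for which the predicate holds (relational restriction)."""
    return [r for r in records if predicate(r)]


# Same condition as the XPath expression: year >= 1980 and year < 1990
eighties = restrict(films, lambda r: r["year"] >= 1980 and r["year"] < 1990)

for record in eighties:
    print(record["title"], record["year"])
```

The lower bound is inclusive and the upper bound exclusive, so a film from 1990 is filtered out.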

+ n.b. If an attribute name contained characters that are not valid in an XML name, they would be escaped in the same way as the relpipe-in-*table commands do it, + i.e. by adding underscores and Unicode code points of the given characters – e.g. the weird:field attribute will be available as weird_3a_field in XPath. +
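The escaping rule can be approximated in a few lines of Python. This is only a sketch that reproduces the weird:field example; the function name is hypothetical, and the set of characters treated as safe (ASCII letters, digits and underscore) is an assumption, since real XML names allow more characters than that.

```python
import re


def escape_attribute_name(name: str) -> str:
    """Replace each character outside [A-Za-z0-9_] with _<hex code point>_.

    Assumption: the safe character set is simplified for illustration.
    """
    return re.sub(
        r"[^A-Za-z0-9_]",
        lambda match: "_%x_" % ord(match.group(0)),
        name,
    )


print(escape_attribute_name("weird:field"))
```

The colon has the code point U+003A, which is why it turns into _3a_ between the two halves of the name.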

+ + +

Filtering records with tree structures

+ +

+ The CSV above is not the best example of data modeling. + Actually, it is quite terrible. + But in the real world, we often have to deal with such data – either work with them directly or give them a better shape before we start doing our job. +

+ +

+ Usually the best way is to normalize the model – follow the rules of normal forms. + In this case, we would break this denormalized CSV table into several relations: + film, director, screenwriter… + or rather film, role, person, film_person_role… +

+ +

+ But for now, we will keep the data denormalized and just give them a better, machine-readable structure instead of the limited and ambiguous notation screenwriter = name1 + name2 + (which causes trouble when a value contains certain characters and requires writing a parser for a never-specified syntax). + So, we will keep some data in classic relational attributes and some in a nested XML structure. + This approach allows us to combine rigid attributes with free-form rich tree structures. +

+ + + +

+ The relpipe-tr-xpath seamlessly integrates the schema-backed (year) and schema-free (metadata/film) parts of our data model. + We use the same language syntax and principles for both kinds of attributes: +

+ + + + +

Filtered result:

+ + + +

+ n.b. In the current version, we have to mark the attributes containing XML: --xml-attribute 'metadata'. + In later versions, there will be a dedicated data type for XML, so these hints will not be necessary. +

+ +

+ This way, we can work with free-form attributes containing multiple values or run various functions on them. + We can e.g. list films that have more than one screenwriter: +

+ + --where 'count(metadata/film/screenwriter) > 1' + +

Well, well… here we are:

+ + + +
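What count(metadata/film/screenwriter) > 1 does to each record can be sketched in Python with the standard xml.etree.ElementTree module. The records below are hypothetical, and the sketch assumes that the metadata attribute holds an XML document whose root element is film, as the path in the expression suggests.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: one classic attribute plus one attribute holding XML.
records = [
    {
        "title": "Film A",
        "metadata": "<film><screenwriter>Person 1</screenwriter></film>",
    },
    {
        "title": "Film B",
        "metadata": "<film>"
                    "<screenwriter>Person 2</screenwriter>"
                    "<screenwriter>Person 3</screenwriter>"
                    "</film>",
    },
]


def screenwriter_count(metadata_xml: str) -> int:
    """Count <screenwriter> children of the <film> root element."""
    root = ET.fromstring(metadata_xml)  # root is the <film> element
    return len(root.findall("screenwriter"))


multi = [r["title"] for r in records if screenwriter_count(r["metadata"]) > 1]
print(multi)
```

Storing each screenwriter in its own element is what makes counting trivial; with the name1 + name2 notation we would have to parse the string first.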

+ We can also run XPath from SQL queries (relpipe-tr-sql) e.g. in PostgreSQL. +

+ + + + +

Adding new attributes and transforming data

+ +

+ The relpipe-tr-xpath performs not only restriction but also projection. + It can add, remove or modify attributes while converting the input to the result set. +

+ + + +

We removed some attributes and created new ones:

+ + + +

Or we may concatenate the values:

+ + + + +

and build some sentences:

+ + +
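The projection step can be pictured as building each output record from a list of named expressions evaluated against the input record. The following Python sketch illustrates only that idea; the project helper, the sample records and the output attribute names are all hypothetical and do not mirror the actual options of relpipe-tr-xpath.

```python
# Hypothetical input relation.
films = [
    {"title": "Film A", "year": 1980, "director": "Person 1"},
    {"title": "Film B", "year": 1989, "director": "Person 2"},
]


def project(records, outputs):
    """Build new records containing only the named output attributes."""
    return [
        {name: compute(r) for name, compute in outputs.items()}
        for r in records
    ]


result = project(films, {
    # kept as is
    "title": lambda r: r["title"],
    # new attribute computed from an existing one
    "decade": lambda r: r["year"] // 10 * 10,
    # values concatenated into a sentence
    "sentence": lambda r: f'{r["title"]} was directed by {r["director"]} in {r["year"]}.',
})

for record in result:
    print(record)
```

Attributes that are not listed among the outputs (here year and director) simply do not appear in the result.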

Extracting values from multiple XML files

+ +

+ Input data need not come from a database or a carefully designed data set; + they may be e.g. scattered across our filesystem in some already defined file format never intended for use as a database. + Even so, we can still collect and query such data in a relational way. +

+ +

+ For example, Maven (a build system for Java) describes its modules in XML format in pom.xml files. + Using find and relpipe-in-filesystem, we collect them and create a relation containing the names and contents of such files: +

+ + + + +

Then we extract desired values using relpipe-tr-xpath and get:

+ + +
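For comparison, the same kind of extraction can be sketched in plain Python. The file name and POM content below are hypothetical, and the chosen elements (groupId, artifactId, version) are only an assumption about which values one might want. The POM namespace http://maven.apache.org/POM/4.0.0 is the one Maven uses, and it has to be mapped to a prefix before the elements can be addressed.

```python
import xml.etree.ElementTree as ET

# Prefix mapping for the Maven POM namespace.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical relation of (path, content) pairs, as collected from the filesystem.
files = [
    (
        "module-a/pom.xml",
        '<project xmlns="http://maven.apache.org/POM/4.0.0">'
        "<groupId>org.example</groupId>"
        "<artifactId>module-a</artifactId>"
        "<version>1.0</version>"
        "</project>",
    ),
]


def extract(path: str, content: str) -> dict:
    """Turn one POM document into a flat record of selected values."""
    root = ET.fromstring(content)  # root is the <project> element
    return {
        "path": path,
        "group_id": root.findtext("m:groupId", default="", namespaces=POM_NS),
        "artifact_id": root.findtext("m:artifactId", default="", namespaces=POM_NS),
        "version": root.findtext("m:version", default="", namespaces=POM_NS),
    }


rows = [extract(path, content) for path, content in files]
print(rows)
```

Missing elements yield an empty string here, so one incomplete file does not break the whole relation.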

+ This way we can harvest useful values from XML files – and not only from XML files, but also from various alternative formats, after we convert them (on-the-fly) to XML. + Such conversions are already available for formats like INI, ASN.1, MIME, HTML, JSON, YAML etc. +

+ + +

Post scriptum

+ +

+ The above-mentioned combination of classic relational attributes with free-form XML structures is definitely not a design of first choice. + But sometimes it makes sense, and sometimes we have to work with data not designed by us and need some tools to deal with them. + When we are designing the data model ourselves, we should always pursue the normalized form… and break the rules only if we have a really good reason to do so. +

+ + + +