<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

	<nadpis>Filtering and transforming relational data with XPath</nadpis>
	<perex>do simple restrictions and projections using a well-established query language</perex>
	<m:pořadí-příkladu>04700</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">

		<p>
			In <m:name/> <m:a href="release-v0.18">v0.18</m:a> we got a new powerful language for filtering and transformations: XPath.
			It is now part of the toolset consisting of SQL, AWK, Scheme and others.
			Although XPath was originally designed for XML, in <m:name/> we can use it for relational data coming from various sources, not only XML,
			and also for data that violates the rules of normal forms.
			We can process quite complex tree structures entangled in records, but we can also write simple and intuitive expressions like <code>x = "a" or y = 123</code>.
		</p>


		<h2>Basic filtering</h2>

		<p>Let us have some CSV data:</p>
		<m:pre jazyk="text" src="examples/film-1.csv"/>

		<p>Formatted as a table, they look like this:</p>
		<m:pre jazyk="text" src="examples/film-1.tabular"/>


		<p>Attributes of particular relations are available in XPath under their names, so we can directly reference them in our queries:</p>
		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-1.csv \
	| relpipe-in-csv --relation "film" \
	| relpipe-tr-xpath \
		--relation '.*' \
		--where 'year >= 1980 and year < 1990' \
	| relpipe-out-tabular]]></m:pre>

		<p>Filtered result:</p>
		<m:pre jazyk="text" src="examples/film-1.filtered-1.tabular"/>

		<p>
			n.b. If the attribute names contained any characters that are not valid in an XML name, they would be escaped in the same way as the <code>relpipe-in-*table</code> commands do it,
			i.e. by adding underscores and Unicode code points of the given characters – e.g. the <code>weird:field</code> attribute will be available as <code>weird_3a_field</code> in XPath.
		</p>


		<h2>Filtering records with tree structures</h2>

		<p>
			The CSV above is not the best example of data modeling.
			Actually, it is quite terrible.
			But in the real world, we often have to deal with such data – either work with them directly or give them a better shape before we start doing our job.
		</p>

		<p>
			Usually the best way is to normalize the model – follow the rules of <a href="https://en.wikipedia.org/wiki/Database_normalization#Normal_forms">normal forms</a>.
			In this case, we would break this denormalized CSV table into several relations:
			<code>film</code>, <code>director</code>, <code>screenwriter</code>…
			or rather <code>film</code>, <code>role</code>, <code>person</code>, <code>film_person_role</code>…
		</p>

		<p>
			But for now, we will keep the data denormalized and just give them a better, machine-readable structure instead of the limited and ambiguous notation of <code>screenwriter = name1 + name2</code>
			(which causes trouble when a value contains certain characters and requires writing a parser for a <em>never-specified syntax</em>).
			So, we will keep some data in classic relational attributes and some in a nested XML structure.
			This approach allows us to combine rigid attributes with free-form rich tree structures.
		</p>

		<m:pre jazyk="text" src="examples/film-2.tabular"/>

		<p>
			The <code>relpipe-tr-xpath</code> command seamlessly integrates the schema-backed (<code>year</code>) and schema-free (<code>metadata/film</code>) parts of our data model.
			We use the same language syntax and principles for both kinds of attributes:
		</p>


		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv \
	| relpipe-in-csv --relation "film" \
	| relpipe-tr-xpath \
		--relation '.*' \
		--xml-attribute 'metadata' \
		--where 'year = 1986 or metadata/film/screenwriter = "John Hughes"' \
	| relpipe-out-tabular]]></m:pre>

		<p>Filtered result:</p>

		<m:pre jazyk="text" src="examples/film-2.filtered-1.tabular"/>

		<p>
			n.b. In the current version, we have to mark the attributes containing XML: <code>--xml-attribute 'metadata'</code>.
			In later versions, there will be a dedicated data type for XML, so these hints will not be necessary.
		</p>

		<p>
			This way, we can work with free-form attributes containing multiple values or run various functions on them.
			For example, we can list films that have more than one screenwriter:
		</p>

		<m:pre jazyk="bash">--where 'count(metadata/film/screenwriter) > 1'</m:pre>

		<p>Well, well… here we are:</p>

		<m:pre jazyk="text" src="examples/film-2.filtered-2.tabular"/>

		<p>
			We can also run XPath from SQL queries (<code>relpipe-tr-sql</code>), e.g. in PostgreSQL.
		</p>

		<!--
		cat relpipe-data/examples/film-2.csv \
			| relpipe-in-csv -\-relation 'film' \
			| relpipe-tr-sql \
				-\-data-source-name myPostgreSQL \
				-\-relation film_1 "SELECT * FROM film WHERE (xpath('count(/film/screenwriter)', metadata::xml))[1]::text::integer > 1" \
				-\-relation film_2 "SELECT * FROM film WHERE (xpath('count(/film/screenwriter) > 1', metadata::xml))[1]::text::boolean" \
			| relpipe-out-tabular
		-->


		<h2>Adding new attributes and transforming data</h2>

		<p>
			The <code>relpipe-tr-xpath</code> command performs not only restriction but also projection.
			It can add, remove or modify the attributes while converting the input to the result set.
		</p>

		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv \
	| relpipe-in-csv --relation "film" \
	| relpipe-tr-xpath \
		--relation '.*' \
		--xml-attribute 'metadata' \
		--output-attribute 'title' string 'title' \
		--output-attribute 'director' string 'metadata/film/director' \
		--output-attribute 'screenwriter_count' integer 'count(metadata/film/screenwriter)' \
	| relpipe-out-tabular]]></m:pre>

		<p>We removed some attributes and created new ones:</p>
		<m:pre jazyk="text" src="examples/film-2.filtered-3.tabular"/>


		<p>Or we may concatenate the values:</p>

		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv \
	| relpipe-in-csv \
	| relpipe-tr-xpath \
		--relation '.*' \
		--xml-attribute 'metadata' \
		--output-attribute 'sentence' string 'concat("The film ", title, " was directed by ", metadata/film/director, " in year ", year, ".")' \
	| relpipe-out-nullbyte | tr \\0 \\n]]></m:pre>
		<!-- alias relpipe-out-lines='relpipe-out-nullbyte | tr \\0 \\n' -->

		<p>and build some sentences:</p>
		<m:pre jazyk="text" src="examples/film-2.filtered-4.txt"/>
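
		<p>
			The last stage of the pipeline above turns NUL-delimited values into lines.
			The following is only a minimal sketch of that conversion, assuming each value is terminated by a NUL byte;
			<code>printf</code> stands in for the output of <code>relpipe-out-nullbyte</code> and the two sample values are made up:
		</p>

```shell
# Stand-in for relpipe-out-nullbyte (assumption): two values, each terminated by a NUL byte.
# tr replaces every NUL byte with a newline, so each value ends up on its own line.
lines="$(printf 'first value\0second value\0' | tr '\0' '\n')"

# Print the converted values, one per line.
printf '%s\n' "$lines"
```

		<p>
			The NUL byte is a convenient delimiter because it cannot occur inside ordinary text values, whereas the conversion to newlines is suitable only when the values themselves contain no line breaks.
		</p>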

		<h2>Extracting values from multiple XML files</h2>

		<p>
			Input data do not always come from a database or a carefully designed data set.
			They may be, e.g., scattered across our filesystem in an existing file format that was never intended for use as a database.
			Despite this fact, we can still collect and query such data in a relational way.
		</p>

		<p>
			For example, Maven (a build system for Java) describes its modules in XML format in <code>pom.xml</code> files.
			Using <code>find</code> and <code>relpipe-in-filesystem</code>, we collect them and create a relation containing names and contents of such files:
		</p>

		<m:pre jazyk="bash"><![CDATA[find -type f -name 'pom.xml' -print0 \
	| relpipe-in-filesystem \
		--relation 'module' \
		--file path \
		--file content \
	| relpipe-tr-xpath \
		--namespace 'm' 'http://maven.apache.org/POM/4.0.0' \
		--relation '.*' \
		--xml-attribute 'content' \
		--output-attribute 'path' string 'path' \
		--output-attribute 'group_id' string 'content/m:project/m:groupId' \
		--output-attribute 'artifact_id' string 'content/m:project/m:artifactId' \
		--output-attribute 'version' string 'content/m:project/m:version' \
	| relpipe-out-tabular]]></m:pre>
		<!-- see also relpipe-in-filesystem -\-streamlet xpath -->

		<p>Then we extract desired values using <code>relpipe-tr-xpath</code> and get:</p>
		<m:pre jazyk="text" src="examples/xpath-maven-1.tabular"/>
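
		<p>
			The first stage of the pipeline above uses <code>find</code> with <code>-print0</code>, which terminates each path with a NUL byte instead of a newline, so file names containing spaces or line breaks stay unambiguous.
			A small sketch on a throwaway directory may make this more tangible; the module names are made up and <code>tr</code> is used only to make the list readable:
		</p>

```shell
# Temporary directory tree with two hypothetical Maven modules (names are illustrative only).
dir="$(mktemp -d)"
mkdir -p "$dir/module-a" "$dir/module-b"
touch "$dir/module-a/pom.xml" "$dir/module-b/pom.xml" "$dir/module-a/README"

# -print0 terminates each matching path with a NUL byte;
# tr converts the NUL bytes to newlines and sort gives a stable order for display.
found="$(find "$dir" -type f -name 'pom.xml' -print0 | tr '\0' '\n' | sort)"
printf '%s\n' "$found"

# Count the matching files; the README is not among them.
count="$(printf '%s\n' "$found" | wc -l | tr -d ' ')"

# Clean up the temporary tree.
rm -r "$dir"
```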

		<p>
			This way we can harvest useful values from XML files – and not only XML files, but also various alternative formats, after we convert them (on-the-fly) to XML.
			Such conversions are already available for formats like <m:a href="examples-reading-querying-uniform-way">INI, ASN.1, MIME, HTML, JSON, YAML etc.</m:a>
		</p>


		<h2>Post scriptum</h2>

		<p>
			The abovementioned combination of classic relational attributes with free-form XML structures is definitely not a design of first choice.
			But sometimes it makes sense, and sometimes we have to work with data not designed by us and need some tools to deal with them.
			When we are designing the data model ourselves, we should always pursue the normalized form …and break the rules only if we have a really good reason to do so.
		</p>

	</text>

</stránka>