<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

	<nadpis>Filtering and transforming relational data with XPath</nadpis>
	<perex>do simple restrictions and projections using a well-established query language</perex>
	<m:pořadí-příkladu>04700</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">

		<p>
			In <m:name/> <m:a href="release-v0.18">v0.18</m:a> we got a powerful new language for filtering and transformations: XPath.
			It joins a toolset that already includes SQL, AWK, Scheme and others.
			Although XPath was originally designed for XML, in <m:name/> we can use it for relational data coming from various sources, not only XML,
			and also for data that violate the rules of normal forms.
			We can process quite complex tree structures nested in records, but we can also write simple and intuitive expressions like <code>x = "a" or y = 123</code>.
		</p>


		<h2>Basic filtering</h2>

		<p>Let us have some CSV data:</p>
		<m:pre jazyk="text" src="examples/film-1.csv"/>

		<p>Formatted as a table, it looks like this:</p>
		<m:pre jazyk="text" src="examples/film-1.tabular"/>


		<p>Attributes of each relation are available in XPath under their names, so we can reference them directly in our queries:</p>
		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-1.csv \
	| relpipe-in-csv --relation "film" \
	| relpipe-tr-xpath \
		--relation '.*' \
		--where 'year >= 1980 and year < 1990' \
	| relpipe-out-tabular]]></m:pre>

		<p>Filtered result:</p>
		<m:pre jazyk="text" src="examples/film-1.filtered-1.tabular"/>

		<p>
			n.b. If an attribute name contained characters that are not valid in an XML name, they would be escaped in the same way as the <code>relpipe-in-*table</code> commands do it,
			i.e. by adding underscores and Unicode code points of the given characters – e.g. the <code>weird:field</code> attribute would be available as <code>weird_3a_field</code> in XPath.
		</p>
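
		<p>
			The escaping rule can be sketched in a few lines of Bash.
			This is only an illustration inferred from the single example above, not the actual implementation:
			the helper name <code>escape_xml_name</code> is hypothetical and we assume that ASCII letters, digits and underscores pass through unchanged,
			while every other character becomes its hexadecimal code point wrapped in underscores.
			The real rules (e.g. for a name starting with a digit) may differ.
		</p>
		<m:pre jazyk="bash"><![CDATA[# Hypothetical helper, not part of relpipe: escape characters
# that we assume are not allowed in an XML name.
escape_xml_name() {
	local input="$1" output="" char i
	for ((i = 0; i < ${#input}; i++)); do
		char="${input:i:1}"
		case "$char" in
			# assumption: these characters are kept as they are
			[A-Za-z0-9_]) output+="$char" ;;
			# anything else: underscore, hexadecimal code point, underscore
			*) output+="$(printf '_%x_' "'$char")" ;;
		esac
	done
	printf '%s\n' "$output"
}

escape_xml_name 'weird:field'   # prints: weird_3a_field]]></m:pre>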


		<h2>Filtering records with tree structures</h2>

		<p>
			The CSV above is not the best example of data modeling.
			Actually, it is quite terrible.
			But in the real world, we often have to deal with such data – either work with them directly or give them a better shape before we start doing our job.
		</p>

		<p>
			Usually the best way is to normalize the model, i.e. to follow the rules of <a href="https://en.wikipedia.org/wiki/Database_normalization#Normal_forms">normal forms</a>.
			In this case, we would break this denormalized CSV table into several relations:
			<code>film</code>, <code>director</code>, <code>screenwriter</code>…
			or rather <code>film</code>, <code>role</code>, <code>person</code>, <code>film_person_role</code>…
		</p>

		<p>
			But for now, we will keep the data denormalized and just give them a better, machine-readable structure instead of the limited and ambiguous notation of <code>screenwriter = name1 + name2</code>
			(which causes trouble when the value contains certain characters and requires writing a parser for a <em>never-specified syntax</em>).
			So we will keep some data in classic relational attributes and some in a nested XML structure.
			This approach allows us to combine rigid attributes with free-form rich tree structures.
		</p>

		<m:pre jazyk="text" src="examples/film-2.tabular"/>

		<p>
			The <code>relpipe-tr-xpath</code> command seamlessly integrates the schema-backed (<code>year</code>) and schema-free (<code>metadata/film</code>) parts of our data model.
			We use the same language syntax and principles for both kinds of attributes:
		</p>


		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv \
	| relpipe-in-csv --relation "film" \
	| relpipe-tr-xpath \
		--relation '.*' \
		--xml-attribute 'metadata' \
		--where 'year = 1986 or metadata/film/screenwriter = "John Hughes"' \
	| relpipe-out-tabular]]></m:pre>

		<p>Filtered result:</p>

		<m:pre jazyk="text" src="examples/film-2.filtered-1.tabular"/>

		<p>
			n.b. In the current version, we have to mark the attributes containing XML: <code>--xml-attribute 'metadata'</code>.
			In later versions, there will be a dedicated data type for XML, so these hints will not be necessary.
		</p>

		<p>
			This way, we can work with free-form attributes containing multiple values or run various functions on them.
			For example, we can list films that have more than one screenwriter:
		</p>

		<m:pre jazyk="bash">--where 'count(metadata/film/screenwriter) > 1'</m:pre>

		<p>Well, well… here we are:</p>

		<m:pre jazyk="text" src="examples/film-2.filtered-2.tabular"/>

		<p>
			We can also run XPath from SQL queries (<code>relpipe-tr-sql</code>), e.g. in PostgreSQL.
		</p>

		<!--
		cat relpipe-data/examples/film-2.csv \
			| relpipe-in-csv -\-relation 'film' \
			| relpipe-tr-sql \
				-\-data-source-name myPostgreSQL \
				-\-relation film_1 "SELECT * FROM film WHERE (xpath('count(/film/screenwriter)', metadata::xml))[1]::text::integer > 1" \
				-\-relation film_2 "SELECT * FROM film WHERE (xpath('count(/film/screenwriter) > 1', metadata::xml))[1]::text::boolean" \
			| relpipe-out-tabular
		-->


		<h2>Adding new attributes and transforming data</h2>

		<p>
			The <code>relpipe-tr-xpath</code> command performs not only restriction but also projection.
			It can add, remove or modify the attributes while converting the input to the result set.
		</p>

		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv \
	| relpipe-in-csv --relation "film" \
	| relpipe-tr-xpath \
		--relation '.*' \
		--xml-attribute 'metadata' \
		--output-attribute 'title' string 'title' \
		--output-attribute 'director' string 'metadata/film/director' \
		--output-attribute 'screenwriter_count' integer 'count(metadata/film/screenwriter)' \
	| relpipe-out-tabular]]></m:pre>

		<p>We removed some attributes and created new ones:</p>
		<m:pre jazyk="text" src="examples/film-2.filtered-3.tabular"/>
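
		<p>
			Restriction and projection can also be chained as two stages of one pipeline, because each <code>relpipe-tr-xpath</code> reads and writes the same relational format.
			A minimal sketch using only the options shown above; the function name <code>films_with_more_screenwriters</code> is hypothetical,
			and we assume that the filtered records pass to the second stage unchanged:
		</p>
		<m:pre jazyk="bash"><![CDATA[# Hypothetical wrapper: $1 is the path of a CSV file shaped like film-2.csv.
films_with_more_screenwriters() {
	# first stage: restriction (keep films with more than one screenwriter)
	# second stage: projection (keep the title, add the computed count)
	cat "$1" \
		| relpipe-in-csv --relation "film" \
		| relpipe-tr-xpath \
			--relation '.*' \
			--xml-attribute 'metadata' \
			--where 'count(metadata/film/screenwriter) > 1' \
		| relpipe-tr-xpath \
			--relation '.*' \
			--xml-attribute 'metadata' \
			--output-attribute 'title' string 'title' \
			--output-attribute 'screenwriter_count' integer 'count(metadata/film/screenwriter)' \
		| relpipe-out-tabular
}]]></m:pre>
		<p>We would call it as <code>films_with_more_screenwriters relpipe-data/examples/film-2.csv</code>.</p>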


		<p>Or we may concatenate the values:</p>

		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv \
	| relpipe-in-csv \
	| relpipe-tr-xpath \
		--relation '.*' \
		--xml-attribute 'metadata' \
		--output-attribute 'sentence' string 'concat("The film ", title, " was directed by ", metadata/film/director, " in year ", year, ".")' \
	| relpipe-out-nullbyte | tr \\0 \\n]]></m:pre>
		<!-- alias relpipe-out-lines='relpipe-out-nullbyte | tr \\0 \\n' -->

		<p>and build some sentences:</p>
		<m:pre jazyk="text" src="examples/film-2.filtered-4.txt"/>

		<h2>Extracting values from multiple XML files</h2>

		<p>
			Input data do not have to come from a database or a carefully designed data set.
			They may be, for example, scattered across our filesystem in an existing file format that was never intended for use as a database.
			Despite this, we can still collect and query such data in a relational way.
		</p>

		<p>
			For example, Maven (a build system for Java) describes its modules in XML format in <code>pom.xml</code> files.
			Using <code>find</code> and <code>relpipe-in-filesystem</code>, we collect them and create a relation containing the paths and contents of such files:
		</p>

		<m:pre jazyk="bash"><![CDATA[find -type f -name 'pom.xml' -print0 \
	| relpipe-in-filesystem \
		--relation 'module' \
		--file path \
		--file content \
	| relpipe-tr-xpath \
		--namespace 'm' 'http://maven.apache.org/POM/4.0.0' \
		--relation '.*' \
		--xml-attribute 'content' \
		--output-attribute 'path' string 'path' \
		--output-attribute 'group_id' string 'content/m:project/m:groupId' \
		--output-attribute 'artifact_id' string 'content/m:project/m:artifactId' \
		--output-attribute 'version' string 'content/m:project/m:version' \
	| relpipe-out-tabular]]></m:pre>
		<!-- see also relpipe-in-filesystem -\-streamlet xpath -->

		<p>Then we extract desired values using <code>relpipe-tr-xpath</code> and get:</p>
		<m:pre jazyk="text" src="examples/xpath-maven-1.tabular"/>

		<p>
			This way, we can harvest useful values from XML files, and also from various alternative formats once we convert them (on the fly) to XML.
			Such conversions are already available for formats like <m:a href="examples-reading-querying-uniform-way">INI, ASN.1, MIME, HTML, JSON, YAML etc.</m:a>
		</p>


		<h2>Post scriptum</h2>

		<p>
			The above-mentioned combination of classic relational attributes with free-form XML structures is definitely not a design of first choice.
			But sometimes it makes sense, and sometimes we have to work with data not designed by us and need some tools to deal with them.
			When we design the data model ourselves, we should always pursue the normalized form, and break the rules only if we have a really good reason to do so.
		</p>

	</text>

</stránka>