relpipe-data/examples-xpath-filtering-transforming.xml
branchv_0
changeset 329 5bc2bb8b7946
328:cc60c8dd7924 329:5bc2bb8b7946
       
     1 <stránka
       
     2 	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
       
     3 	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
       
     4 	
       
     5 	<nadpis>Filtering and transforming relational data with XPath</nadpis>
       
     6 	<perex>do simple restrictions and projections using a well-established query language</perex>
       
     7 	<m:pořadí-příkladu>04700</m:pořadí-příkladu>
       
     8 
       
     9 	<text xmlns="http://www.w3.org/1999/xhtml">
       
    10 		
       
    11 		<p>
       
    12 			In <m:name/> <m:a href="release-v0.18">v0.18</m:a> we got a powerful new language for filtering and transformations: XPath.
       
    13 			It is now part of the toolset consisting of SQL, AWK, Scheme and others.
       
    14 			Although XPath was originally designed for XML, in <m:name/> we can use it for relational data coming from various sources, not only XML,
       
    15 			and also for data that violates the rules of normal forms.
       
    16 			We can process quite complex tree structures entangled in records but we can also write simple and intuitive expressions like <code>x = "a" or y = 123</code>.
       
    17 		</p>
       
    18 
       
    19 		
       
    20 		<h2>Basic filtering</h2>
       
    21 		
       
    22 		<p>Let us have some CSV data:</p>		
       
    23 		<m:pre jazyk="text" src="examples/film-1.csv"/>
       
    24 		
       
    25 		<p>Formatted as a table, it looks like this:</p>
       
    26 		<m:pre jazyk="text" src="examples/film-1.tabular"/>
       
    27 		
       
    28 		
       
    29 		<p>Attributes of particular relations are available in XPath under their names, so we can directly reference them in our queries:</p>
       
    30 		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-1.csv               \
       
    31 	| relpipe-in-csv --relation "film"             \
       
    32 	| relpipe-tr-xpath                             \
       
    33 		--relation '.*'                            \
       
    34 			--where 'year >= 1980 and year < 1990' \
       
    35 	| relpipe-out-tabular]]></m:pre>
       
    36 	
       
    37 		<p>Filtered result:</p>
       
    38 		<m:pre jazyk="text" src="examples/film-1.filtered-1.tabular"/>
       
    39 		
       
    40 		<p>
       
    41 			n.b. Any characters that are not valid in an XML name are escaped in the same way as the <code>relpipe-in-*table</code> commands do it,
       
    42 			i.e. by adding underscores and the Unicode code points of the given characters – e.g. the <code>weird:field</code> attribute will be available as <code>weird_3a_field</code> in XPath.
       
    43 		</p>
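		<p>
			The escaping rule from the note above can be approximated in a few lines of shell.
			This is only a sketch: <code>escape_name</code> is a hypothetical helper, not a part of <m:name/>,
			and it assumes that only ASCII letters, digits and underscores are kept as they are,
			while every other character becomes its hexadecimal code point wrapped in underscores.
			The real commands may accept more characters as valid.
		</p>

```shell
# Hypothetical helper (not part of relpipe): approximates the escaping rule
# described above. Assumption: only ASCII letters, digits and underscores
# are left untouched; the real implementation may differ.
escape_name() (
	rest="$1"
	out=""
	while [ -n "$rest" ]; do
		tail="${rest#?}"          # everything except the first character
		char="${rest%"$tail"}"    # the first character only
		case "$char" in
			[A-Za-z0-9_]) out="$out$char" ;;
			# "'x" makes printf use the numeric code of the character x
			*) out="$out$(printf '_%x_' "'$char")" ;;
		esac
		rest="$tail"
	done
	printf '%s\n' "$out"
)

escape_name 'weird:field'   # prints weird_3a_field
```

		<p>
			The function body runs in a subshell, so its working variables do not leak into the calling shell.
		</p>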
       
    44 		
       
    45 		
       
    46 		<h2>Filtering records with tree structures</h2>
       
    47 		
       
    48 		<p>
       
    49 			The CSV above is not the best example of data modeling.
       
    50 			Actually, it is quite terrible.
       
    51 			But in the real world, we often have to deal with such data – either work with them directly or give them a better shape before we start doing our job.
       
    52 		</p>
       
    53 		
       
    54 		<p>
       
    55 			Usually, the best way is to normalize the model, i.e. to follow the rules of <a href="https://en.wikipedia.org/wiki/Database_normalization#Normal_forms">normal forms</a>.
       
    56 			In this case, we would break this denormalized CSV table into several relations:
       
    57 			<code>film</code>, <code>director</code>, <code>screenwriter</code>…
       
    58 			or rather <code>film</code>, <code>role</code>, <code>person</code>, <code>film_person_role</code>…
       
    59 		</p>
       
    60 		
       
    61 		<p>
       
    62 			But for now, we will keep the data denormalized and just give them a better, machine-readable structure instead of the limited and ambiguous notation of <code>screenwriter = name1 + name2</code>
       
    63 			(that makes trouble when the value contains certain characters and requires writing a parser for <em>never-specified syntax</em>).
       
    64 			So, we will keep some data in classic relational attributes and some in nested XML structure.
       
    65 			This approach allows us to combine rigid attributes with free-form rich tree structures.
       
    66 		</p>
       
    67 		
       
    68 		<m:pre jazyk="text" src="examples/film-2.tabular"/>
       
    69 		
       
    70 		<p>
       
    71 			The <code>relpipe-tr-xpath</code> seamlessly integrates the schema-backed (<code>year</code>) and schema-free (<code>metadata/film</code>) parts of our data model.
       
    72 			We use the same language syntax and principles for both kinds of attributes:
       
    73 		</p>
       
    74 		
       
    75 		
       
    76 		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv                                            \
       
    77 	| relpipe-in-csv --relation "film"                                          \
       
    78 	| relpipe-tr-xpath                                                          \
       
    79 		--relation '.*'                                                         \
       
    80 			--xml-attribute 'metadata'                                          \
       
    81 			--where 'year = 1986 or metadata/film/screenwriter = "John Hughes"' \
       
    82 	| relpipe-out-tabular]]></m:pre>
       
    83 	
       
    84 		<p>Filtered result:</p>
       
    85 	
       
    86 		<m:pre jazyk="text" src="examples/film-2.filtered-1.tabular"/>
       
    87 		
       
    88 		<p>
       
    89 			n.b. In the current version, we have to mark the attributes containing XML: <code>--xml-attribute 'metadata'</code>.
       
    90 			In later versions, there will be a dedicated data type for XML, so these hints will not be necessary.
       
    91 		</p>
       
    92 		
       
    93 		<p>
       
    94 			This way, we can work with free-form attributes containing multiple values or run various functions on them.
       
    95 			We can e.g. list films that have more than one screenwriter:
       
    96 		</p>
       
    97 		
       
    98 		<m:pre jazyk="bash">--where 'count(metadata/film/screenwriter) &gt; 1'</m:pre>
       
    99 		
       
   100 		<p>Well, well… here we are:</p>
       
   101 		
       
   102 		<m:pre jazyk="text" src="examples/film-2.filtered-2.tabular"/>
       
   103 		
       
   104 		<p>
       
   105 			We can also run XPath from SQL queries (<code>relpipe-tr-sql</code>) e.g. in PostgreSQL.
       
   106 		</p>
       
   107 		
       
   108 		<!--
       
   109 		cat relpipe-data/examples/film-2.csv \
       
   110 			| relpipe-in-csv -\-relation 'film' \
       
   111 			| relpipe-tr-sql \
       
   112 				-\-data-source-name myPostgreSQL \
       
   113 				-\-relation film_1 "SELECT * FROM film WHERE (xpath('count(/film/screenwriter)', metadata::xml))[1]::text::integer > 1" \
       
   114 				-\-relation film_2 "SELECT * FROM film WHERE (xpath('count(/film/screenwriter) > 1', metadata::xml))[1]::text::boolean" \
       
   115 			| relpipe-out-tabular
       
   116 		-->
       
   117 		
       
   118 		
       
   119 		<h2>Adding new attributes and transforming data</h2>
       
   120 		
       
   121 		<p>
       
   122 			The <code>relpipe-tr-xpath</code> tool performs not only restriction but also projection.
       
   123 			It can add, remove or modify the attributes while converting the input to the result set.
       
   124 		</p>
       
   125 		
       
   126 		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv                                                            \
       
   127 	| relpipe-in-csv --relation "film"                                                          \
       
   128 	| relpipe-tr-xpath                                                                          \
       
   129 		--relation '.*'                                                                         \
       
   130 			--xml-attribute 'metadata'                                                          \
       
   131 			--output-attribute 'title'              string  'title'                             \
       
   132 			--output-attribute 'director'           string  'metadata/film/director'            \
       
   133 			--output-attribute 'screenwriter_count' integer 'count(metadata/film/screenwriter)' \
       
   134 	| relpipe-out-tabular]]></m:pre>
       
   135 
       
   136 		<p>We removed some attributes and created new ones:</p>
       
   137 		<m:pre jazyk="text" src="examples/film-2.filtered-3.tabular"/>
       
   138 		
       
   139 		
       
   140 		<p>Or we may concatenate the values:</p>
       
   141 		
       
   142 		<m:pre jazyk="bash"><![CDATA[cat relpipe-data/examples/film-2.csv       \
       
   143 	| relpipe-in-csv                       \
       
   144 	| relpipe-tr-xpath                     \
       
   145 		--relation '.*'                    \
       
   146 			--xml-attribute 'metadata'     \
       
   147 			--output-attribute 'sentence' string 'concat("The film ", title, " was directed by ", metadata/film/director, " in year ", year, ".")' \
       
   148 	| relpipe-out-nullbyte | tr \\0 \\n]]></m:pre>
       
   149 		<!-- alias relpipe-out-lines='relpipe-out-nullbyte | tr \\0 \\n' -->
       
   150 		
       
   151 		<p>and build some sentences:</p>
       
   152 		<m:pre jazyk="text" src="examples/film-2.filtered-4.txt"/>
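		<p>
			The last step of the pipeline above turns zero-byte-separated values into lines, and its effect can be tried without <m:name/>.
			In the sketch below, <code>printf</code> merely stands in for <code>relpipe-out-nullbyte</code>, assuming that this command terminates each value with a zero byte;
			<code>nul_to_lines</code> is a hypothetical name and the two values are placeholders, not real output.
		</p>

```shell
# Hypothetical wrapper around the conversion used at the end of the pipeline above.
# tr '\0' '\n' is equivalent to the unquoted form: tr \\0 \\n
nul_to_lines() {
	tr '\0' '\n'
}

# printf stands in for relpipe-out-nullbyte (assumption: each value ends with a NUL byte);
# the values are placeholders.
printf 'first value\0second value\0' | nul_to_lines
```

		<p>
			Note that a value which itself contains a line break would occupy more than one line of such output.
		</p>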
       
   153 		
       
   154 		<h2>Extracting values from multiple XML files</h2>
       
   155 		
       
   156 		<p>
       
   157 			Input data need not come from a database or a carefully designed data set.
       
   158 			They may be, e.g., scattered across our filesystem in an existing file format that was never intended for use as a database.
       
   159 			Despite this, we can still collect and query such data in a relational way.
       
   160 		</p>
       
   161 		
       
   162 		<p>
       
   163 			For example, Maven (a build system for Java) describes its modules in XML format in <code>pom.xml</code> files.
       
   164 			Using <code>find</code> and <code>relpipe-in-filesystem</code>, we collect them and create a relation containing the names and contents of such files:
       
   165 		</p>
       
   166 		
       
   167 		<m:pre jazyk="bash"><![CDATA[find -type f -name 'pom.xml' -print0                                                    \
       
   168 	| relpipe-in-filesystem                                                             \
       
   169 		--relation 'module'                                                             \
       
   170 		--file path                                                                     \
       
   171 		--file content                                                                  \
       
   172 	| relpipe-tr-xpath                                                                  \
       
   173 		--namespace 'm' 'http://maven.apache.org/POM/4.0.0'                             \
       
   174 		--relation '.*'                                                                 \
       
   175 			--xml-attribute 'content'                                                   \
       
   176 			--output-attribute 'path'        string 'path'                              \
       
   177 			--output-attribute 'group_id'    string 'content/m:project/m:groupId'       \
       
   178 			--output-attribute 'artifact_id' string 'content/m:project/m:artifactId'    \
       
   179 			--output-attribute 'version'     string 'content/m:project/m:version'       \
       
   180 	| relpipe-out-tabular]]></m:pre>
       
   181 		<!-- see also relpipe-in-filesystem -\-streamlet xpath -->
       
   182 	
       
   183 		<p>Then we extract the desired values using <code>relpipe-tr-xpath</code> and get:</p>
       
   184 		<m:pre jazyk="text" src="examples/xpath-maven-1.tabular"/>
       
   185 	
       
   186 		<p>
       
   187 			This way, we can harvest useful values from XML files and also from various alternative formats, once we convert them (on the fly) to XML.
       
   188 			Such conversions are already available for formats like <m:a href="examples-reading-querying-uniform-way">INI, ASN.1, MIME, HTML, JSON, YAML etc.</m:a>
       
   189 		</p>
       
   190 		
       
   191 		
       
   192 		<h2>Post scriptum</h2>
       
   193 		
       
   194 		<p>
       
   195 			The above-mentioned combination of classic relational attributes with free-form XML structures is definitely not a first-choice design.
       
   196 			But sometimes it makes sense and sometimes we have to work with data not designed by us and need some tools to deal with them.
       
   197 			When we are designing the data model ourselves, we should always pursue the normalized form and break the rules only if we have a really good reason to do so.
       
   198 		</p>
       
   199 		
       
   200 	</text>
       
   201 
       
   202 </stránka>