<stránka
xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
<nadpis>Collecting statistics from XHTML pages</nadpis>
<perex>use XPath to get titles and headline counts from web pages, then show a bar chart and statistics</perex>
<m:pořadí-příkladu>04000</m:pořadí-příkladu>
<text xmlns="http://www.w3.org/1999/xhtml">
<p>
The <code>relpipe-in-filesystem</code> tool and the <code>xpath</code> streamlet allow us to extract multiple values (attributes) from XML files.
We can use this feature to collect data from, for example, XHTML pages.
</p>
<m:pre src="examples/xhtml-filesystem-xpath.sh" jazyk="bash"/>
<p>
The script above shows this bar chart and statistics:
</p>
<m:img src="img/xhtml-filesystem-xpath-1.png"/>
<p>
This pipeline consists of four steps:
</p>
<ul>
<li>
<code>findFiles</code>
– prepares a list of files separated by the <code>\0</code> byte;
if we know the extension, we can add <code>-iname '*.xhtml'</code> to make the pipeline more efficient
</li>
<li>
<code>fetchAttributes</code>
– does the heavy lifting: it tries to parse each given file as XML
and, if the file is valid, extracts several values specified by the XPath expressions;
thanks to the <code>--parallel N</code> option, it can utilize N cores of our CPU;
we can experiment with the value of N and watch how the total time decreases
</li>
<li>
<code>filterAndOrder</code>
– uses SQL to skip the records (files) that are not XHTML
and keeps the five valid files with the most headlines
</li>
<li>
<code>relpipe-out-gui</code>
– displays the data in a GUI window and generates a bar chart from the numeric values
(we could instead use e.g. <code>relpipe-out-tabular</code> to display the data in a text terminal, or format the results as XML, CSV or another format)
</li>
</ul>
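<p>
The first step can be sketched with standard tools only.
The following is a hypothetical variant of <code>findFiles</code>, not the function from the script above:
the <code>-type f</code> test and the final counting line are assumptions added for illustration.
It restricts the search to the <code>.xhtml</code> extension and counts how many paths would be passed to the next step:
</p>
<m:pre jazyk="bash"><![CDATA[# Hypothetical variant of findFiles (an illustration, not the original function).
findFiles() {
	# -iname matches the extension case-insensitively,
	# -print0 terminates each path with a \0 byte instead of a newline.
	find . -type f -iname '*.xhtml' -print0
}

# Count the files that would enter the pipeline:
# keep only the \0 terminators and count them.
findFiles | tr -cd '\0' | wc -c]]></m:pre>
<p>
Because the paths are terminated by <code>\0</code>, file names containing spaces or newlines are still counted correctly.
</p>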
<p>
We can use a similar pipeline to extract any values from any set of XML files (e.g. Maven POM files or WSDL definitions).
With <code>--option mode raw-xml</code> we can even extract sub-trees (XML fragments) from the XML files, so we can also collect arbitrarily structured data, not only simple values such as strings or booleans.
</p>
</text>
</stránka>