--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-xhtml-filesystem-xpath.xml Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,60 @@
+<stránka
+ xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+ xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+
+ <nadpis>Collecting statistics from XHTML pages</nadpis>
+ <perex>use XPath to get titles and headline counts from web pages, then show a bar chart and statistics</perex>
+ <m:pořadí-příkladu>04000</m:pořadí-příkladu>
+
+ <text xmlns="http://www.w3.org/1999/xhtml">
+
+ <p>
+ The <code>relpipe-in-filesystem</code> tool and the <code>xpath</code> streamlet allow us to extract multiple values (attributes) from XML files.
+ We can use this feature to collect data from e.g. XHTML pages.
+ </p>
+
+ <m:pre src="examples/xhtml-filesystem-xpath.sh" jazyk="bash"/>
+
+ <p>
+ The script above will show this bar chart and statistics:
+ </p>
+
+ <m:img src="img/xhtml-filesystem-xpath-1.png"/>
+
+ <p>
+ This pipeline consists of four steps:
+ </p>
+
+ <ul>
+ <li>
+ <code>findFiles</code>
+ – prepares the list of files separated by the <code>\0</code> byte;
+ if we know the extension, we can add <code>-iname '*.xhtml'</code> to make the pipeline more efficient
+ </li>
+ <li>
+ <code>fetchAttributes</code>
+ – does the heavy work: tries to parse each given file as XML
+ and, if it is valid, extracts several values specified by the XPath expressions;
+ thanks to the <code>--parallel N</code> option, it utilizes N cores of our CPU;
+ we can experiment with the N value and see how the total time decreases
+ </li>
+ <li>
+ <code>filterAndOrder</code>
+ – uses SQL to skip the records (files) that are not XHTML
+ and takes the five valid files with the most headlines
+ </li>
+ <li>
+ <code>relpipe-out-gui</code>
+ – displays the data in a GUI window and generates a bar chart from the numeric values
+ (we could use e.g. <code>relpipe-out-tabular</code> to display the data in the text terminal, or format the results as XML, CSV or another format)
+ </li>
+ </ul>
+
+ <p>
+ We can use a similar pipeline to extract any values from any set of XML files (e.g. Maven POM files or WSDL definitions).
+ Using <code>--option mode raw-xml</code> we can even extract sub-trees (XML fragments) from the XML files, so we can also collect arbitrarily structured data, not only simple values like strings or booleans.
+ </p>
+
+ </text>
+
+</stránka>