relpipe-data/examples-xhtml-filesystem-xpath.xml
branch v_0
changeset 294 abbc9bcfbcc4
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-xhtml-filesystem-xpath.xml	Mon Feb 03 22:10:07 2020 +0100
@@ -0,0 +1,60 @@
+<stránka
+	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+	
+	<nadpis>Collecting statistics from XHTML pages</nadpis>
+	<perex>use XPath to get titles and headline counts from web pages and then show a bar chart and statistics</perex>
+	<m:pořadí-příkladu>04000</m:pořadí-příkladu>
+
+	<text xmlns="http://www.w3.org/1999/xhtml">
+		
+		<p>
+			The <code>relpipe-in-filesystem</code> tool and the <code>xpath</code> streamlet allow us to extract multiple values (attributes) from XML files.
+			We can use this feature to collect data from e.g. XHTML pages.
+		</p>
+		
+		<m:pre src="examples/xhtml-filesystem-xpath.sh" jazyk="bash"/>
+		
+		<p>
+			The script above shows this bar chart and statistics:
+		</p>
+		
+		<m:img src="img/xhtml-filesystem-xpath-1.png"/>
+		
+		<p>
+			This pipeline consists of four steps:
+		</p>
+		
+		<ul>
+			<li>
+				<code>findFiles</code>
+				– prepares the list of files, separated by the <code>\0</code> byte;
+				if we know the extension, we can add <code>-iname '*.xhtml'</code> to make the pipeline more efficient
+			</li>
+			<li>
+				<code>fetchAttributes</code>
+				– does the heavy work – tries to parse each given file as XML
+				and, if it is valid, extracts several values specified by the XPath expressions;
+				thanks to the <code>--parallel N</code> option, it utilizes N cores of our CPU;
+				we can experiment with the N value and see how the total time decreases
+			</li>
+			<li>
+				<code>filterAndOrder</code>
+				– uses SQL to skip the records (files) that are not XHTML
+				and takes the five valid files with the highest number of headlines
+			</li>
+			<li>
+				<code>relpipe-out-gui</code>
+				– displays the data in a GUI window and generates a bar chart from the numeric values
+				(we could instead use e.g. <code>relpipe-out-tabular</code> to display the data in a text terminal, or format the results as XML, CSV or another format)
+			</li>
+		</ul>
+		
+		<p>
+			We can use a similar pipeline to extract any values from any set of XML files (e.g. Maven POM files or WSDL definitions).
+			Using <code>--option mode raw-xml</code> we can even extract sub-trees (XML fragments) from the XML files, so we can also collect arbitrarily structured data, not only simple values like strings or booleans.
+		</p>
+		
+	</text>
+
+</stránka>
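
As a rough illustration of the four steps described in the page above (not the relpipe implementation itself), the same idea can be sketched in plain Python using only the standard library. The function names mirror the shell functions named in the list. Treating `h1` to `h3` as the "headlines", the record field names, and ordering by the sum of the counts are all assumptions, since the actual script lives in `examples/xhtml-filesystem-xpath.sh` and is not shown in this changeset. ElementTree supports only a subset of XPath, so the counts are taken with `len(findall(...))` rather than an XPath `count()` call.

```python
import heapq
import os
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML_NS}  # prefix "h" is an arbitrary choice for this sketch


def find_files(root_dir):
    """Yield paths of all files under root_dir (stands in for findFiles)."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            yield os.path.join(dirpath, name)


def fetch_attributes(path):
    """Try to parse one file as XML and extract a few values from it.

    Returns None when the file cannot be read or is not well-formed XML.
    """
    try:
        root = ET.parse(path).getroot()
    except (ET.ParseError, OSError, ValueError, LookupError):
        return None
    return {
        "path": path,
        "root_tag": root.tag,  # "{namespace}localname" of the root element
        "title": root.findtext("h:head/h:title", default=None, namespaces=NS),
        # Assumption: "headlines" means h1, h2 and h3 elements.
        "h1_count": len(root.findall(".//h:h1", NS)),
        "h2_count": len(root.findall(".//h:h2", NS)),
        "h3_count": len(root.findall(".//h:h3", NS)),
    }


def filter_and_order(records, limit=5):
    """Keep only XHTML documents and return those with the most headlines."""
    xhtml_root = "{%s}html" % XHTML_NS
    xhtml = [r for r in records if r is not None and r["root_tag"] == xhtml_root]
    return heapq.nlargest(
        limit,
        xhtml,
        key=lambda r: r["h1_count"] + r["h2_count"] + r["h3_count"],
    )
```

Calling `filter_and_order(fetch_attributes(p) for p in find_files("/usr/share/doc"))` would return up to five records; the directory is only an example. The sketch runs sequentially and has no counterpart to the `--parallel N` option, and it returns plain dictionaries instead of feeding a GUI or tabular output.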