relpipe-data/examples-xhtml-filesystem-xpath.xml
author František Kučera <franta-hg@frantovo.cz>
Mon, 21 Feb 2022 00:43:11 +0100
branch v_0
changeset 329 5bc2bb8b7946
parent 294 abbc9bcfbcc4
permissions -rw-r--r--
Release v0.18

<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Collecting statistics from XHTML pages</nadpis>
	<perex>use XPath to get titles and headline counts from web pages, then show a bar chart and statistics</perex>
	<m:pořadí-příkladu>04000</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			The <code>relpipe-in-filesystem</code> tool and the <code>xpath</code> streamlet allow us to extract multiple values (attributes) from XML files.
			We can use this feature to collect data from e.g. XHTML pages.
		</p>
		
		<m:pre src="examples/xhtml-filesystem-xpath.sh" jazyk="bash"/>
		
		<p>
			The script above will show this bar chart and statistics:
		</p>
		
		<m:img src="img/xhtml-filesystem-xpath-1.png"/>
		
		<p>
			This pipeline consists of four steps:
		</p>
		
		<ul>
			<li>
				<code>findFiles</code>
				– prepares the list of files, separated by the <code>\0</code> byte;
				if we know the extension, we can add <code>-iname '*.xhtml'</code> to make the pipeline more efficient
			</li>
			<li>
				<code>fetchAttributes</code>
				– does the heavy work: tries to parse each given file as XML
				and, if it is valid, extracts several values specified by the XPath expressions;
				thanks to the <code>--parallel N</code> option, it utilizes N cores of our CPU;
				we can experiment with the N value and watch how the total time decreases
			</li>
			<li>
				<code>filterAndOrder</code>
				– uses SQL to skip the records (files) that are not XHTML
				and takes the five valid files with the most headlines
			</li>
			<li>
				<code>relpipe-out-gui</code>
				– displays the data in a GUI window and generates a bar chart from the numeric values
				(we could instead use e.g. <code>relpipe-out-tabular</code> to display the data in the text terminal, or format the results as XML, CSV or another format)
			</li>
		</ul>
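		
		<p>
			The four steps can also be sketched outside of the pipeline, which may help when reasoning about what each stage contributes.
			The following is only an illustrative Python stand-in, not the <code>relpipe</code> implementation:
			the function names mirror the steps above, the record fields (<code>valid_xml</code>, <code>root_xmlns</code>, <code>title</code>, <code>headlines</code>) are hypothetical,
			a thread pool merely plays the role of <code>--parallel N</code>,
			and counting <code>h1</code> to <code>h6</code> elements as headlines is an assumption.
		</p>
		
		<m:pre jazyk="python"><![CDATA[
```python
import os
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

XHTML_NS = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML_NS}


def find_files(root):
    """Step 1: yield every regular file under root (role of findFiles)."""
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            yield os.path.join(directory, name)


def fetch_attributes(path):
    """Step 2: try to parse one file as XML and extract a few values."""
    try:
        doc = ET.parse(path).getroot()
    except (ET.ParseError, OSError):
        # Not well-formed XML (or unreadable): keep the record, mark it invalid.
        return {"path": path, "valid_xml": False, "root_xmlns": None,
                "title": None, "headlines": 0}
    # ElementTree stores the namespace inside the tag as "{uri}localname".
    root_xmlns = doc.tag[1:].split("}")[0] if doc.tag.startswith("{") else ""
    # Assumption: a "headline" is any of the elements h1 to h6.
    headlines = sum(len(doc.findall(f".//h:h{level}", NS)) for level in range(1, 7))
    return {"path": path, "valid_xml": True, "root_xmlns": root_xmlns,
            "title": doc.findtext("h:head/h:title", default="", namespaces=NS),
            "headlines": headlines}


def filter_and_order(records, limit=5):
    """Step 3: keep XHTML files only, most headlines first, top `limit`."""
    xhtml = [r for r in records if r["root_xmlns"] == XHTML_NS]
    return sorted(xhtml, key=lambda r: r["headlines"], reverse=True)[:limit]


def collect(root, parallel=4):
    """Run steps 1 to 3; the pool size stands in for --parallel N."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        records = list(pool.map(fetch_attributes, find_files(root)))
    return filter_and_order(records)


def print_table(rows):
    """Step 4: plain-text output instead of a GUI window."""
    for row in rows:
        print(f'{row["headlines"]:>5}  {row["title"]}  ({row["path"]})')
```
]]></m:pre>
		
		<p>
			Calling <code>print_table(collect("."))</code> would list up to five XHTML files from the current directory.
			Files that fail to parse are kept as records with <code>valid_xml</code> set to false and are dropped only in the filtering step, which follows the division of work described above.
		</p>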
		
		<p>
			We can use a similar pipeline to extract any values from any set of XML files (e.g. Maven POM files or WSDL definitions).
			Using <code>--option mode raw-xml</code> we can even extract sub-trees (XML fragments) from the XML files, so we can also collect arbitrarily structured data, not only simple values like strings or booleans.
		</p>
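		
		<p>
			The difference between extracting a simple value and extracting a sub-tree can be illustrated with a small Python sketch.
			It is not how the <code>xpath</code> streamlet is implemented and it does not reproduce the exact output of the <code>raw-xml</code> mode;
			the sample POM, the <code>extract</code> function and its <code>raw_xml</code> parameter are hypothetical names used only for illustration.
		</p>
		
		<m:pre jazyk="python"><![CDATA[
```python
import xml.etree.ElementTree as ET

POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# A tiny made-up Maven POM used only as sample input.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <artifactId>demo</artifactId>
  <dependencies>
    <dependency><artifactId>junit</artifactId></dependency>
  </dependencies>
</project>"""


def extract(xml_text, xpath, raw_xml=False):
    """Return the first match as plain text or, if raw_xml, as an XML fragment."""
    node = ET.fromstring(xml_text).find(xpath, POM_NS)
    if node is None:
        return None
    if raw_xml:
        # Serialize the whole sub-tree instead of reducing it to a string value.
        return ET.tostring(node, encoding="unicode").strip()
    return (node.text or "").strip()


name = extract(SAMPLE_POM, "m:artifactId")
fragment = extract(SAMPLE_POM, "m:dependencies", raw_xml=True)
```
]]></m:pre>
		
		<p>
			Here <code>name</code> is a plain string, while <code>fragment</code> is itself well-formed XML that a later step could parse again.
		</p>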
		
	</text>

</stránka>