relpipe-data/examples-xhtml-filesystem-xpath.xml
branch v_0
changeset 294 abbc9bcfbcc4
293:b862d16a2e9f 294:abbc9bcfbcc4
       
     1 <stránka
       
     2 	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
       
     3 	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
       
     4 	
       
     5 	<nadpis>Collecting statistics from XHTML pages</nadpis>
       
      6 	<perex>use XPath to get titles and headline counts from web pages and then show a bar chart and statistics</perex>
       
     7 	<m:pořadí-příkladu>04000</m:pořadí-příkladu>
       
     8 
       
     9 	<text xmlns="http://www.w3.org/1999/xhtml">
       
    10 		
       
    11 		<p>
       
     12 			The <code>relpipe-in-filesystem</code> tool and its <code>xpath</code> streamlet allow us to extract multiple values (attributes) from XML files.
       
    13 			We can use this feature to collect data from e.g. XHTML pages.
       
    14 		</p>
       
    15 		
       
    16 		<m:pre src="examples/xhtml-filesystem-xpath.sh" jazyk="bash"/>
       
    17 		
       
    18 		<p>
       
     19 			The script above shows this bar chart and statistics:
       
    20 		</p>
       
    21 		
       
    22 		<m:img src="img/xhtml-filesystem-xpath-1.png"/>
       
    23 		
       
    24 		<p>
       
    25 			This pipeline consists of four steps:
       
    26 		</p>
       
    27 		
       
    28 		<ul>
       
    29 			<li>
       
    30 				<code>findFiles</code>
       
    31 				– prepares the list of files separated by <code>\0</code> byte;
       
     32 				if we know the extension, we can add <code>-iname '*.xhtml'</code> to make the pipeline more efficient
       
    33 			</li>
       
    34 			<li>
       
    35 				<code>fetchAttributes</code>
       
     36 				– does the heavy work – tries to parse each given file as XML
       
    37 				and if valid, extracts several values specified by the XPath expressions;
       
     38 				thanks to the <code>--parallel N</code> option, it utilizes N cores of our CPU;
       
     39 				we can experiment with the value of N and watch how the total time decreases
       
    40 			</li>
       
    41 			<li>
       
    42 				<code>filterAndOrder</code>
       
    43 				– uses SQL to skip the records (files) that are not XHTML
       
     44 				and takes the five valid files with the most headlines
       
    45 			</li>
       
    46 			<li>
       
    47 				<code>relpipe-out-gui</code>
       
     48 				– displays the data in a GUI window and generates a bar chart from the numeric values
       
     49 				(we could use e.g. <code>relpipe-out-tabular</code> to display the data in the text terminal or format the results as XML, CSV or another format)
       
    50 			</li>
       
    51 		</ul>
       
    52 		
       
    53 		<p>
       
    54 			We can use a similar pipeline to extract any values from any set of XML files (e.g. Maven POM files or WSDL definitions).
       
     55 			Using <code>--option mode raw-xml</code> we can even extract sub-trees (XML fragments) from the XML files, so we can also collect arbitrarily structured data, not only simple values like strings or booleans.
       
    56 		</p>
       
    57 		
       
    58 	</text>
       
    59 
       
    60 </stránka>
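
The four pipeline steps described in the page above can be sketched in plain Python to make the data flow concrete. This is only an illustrative stand-in, not the relpipe implementation and not the content of `examples/xhtml-filesystem-xpath.sh`. The function names `find_files`, `fetch_attributes` and `filter_and_order` merely mirror the step names used in the text, the record keys (`valid_xml`, `is_xhtml`, `title`, `headlines`) are hypothetical, and the sketch assumes that "headlines" means the XHTML elements `h1` to `h6`. It runs sequentially, so the effect of `--parallel N` is not reproduced, and the last step prints text instead of opening a GUI window.

```python
import os
import xml.etree.ElementTree as ET

# XHTML namespace in ElementTree's "{uri}" notation
XHTML = "{http://www.w3.org/1999/xhtml}"


def find_files(root):
    """Step 1: yield the path of every file under root."""
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            yield os.path.join(directory, name)


def fetch_attributes(path):
    """Step 2: try to parse the file as XML and extract a few values."""
    try:
        doc = ET.parse(path).getroot()
    except (ET.ParseError, OSError, ValueError, LookupError):
        # Not well-formed XML (or unreadable): keep the record, mark it invalid
        return {"path": path, "valid_xml": False}
    # Count h1..h6 elements anywhere in the document
    headlines = sum(len(doc.findall(f".//{XHTML}h{level}")) for level in range(1, 7))
    return {
        "path": path,
        "valid_xml": True,
        "is_xhtml": doc.tag == XHTML + "html",
        "title": doc.findtext(f".//{XHTML}title"),
        "headlines": headlines,
    }


def filter_and_order(records, limit=5):
    """Step 3: keep XHTML records and take those with the most headlines."""
    pages = [r for r in records if r.get("is_xhtml")]
    return sorted(pages, key=lambda r: r["headlines"], reverse=True)[:limit]


def report(root="."):
    """Step 4: print a plain-text table (a stand-in for the GUI output)."""
    for r in filter_and_order(fetch_attributes(p) for p in find_files(root)):
        print(f"{r['headlines']:5d}  {r['title']}  {r['path']}")
```

Calling `report("some/directory")` prints up to five lines, one per XHTML page, ordered by headline count. Files that are not well-formed XML are kept as records with `valid_xml` set to `False` and are dropped only in the filtering step, which matches the division of work between the second and third step described above.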