relpipe-data/examples-xhtml-filesystem-xpath.xml
author František Kučera <franta-hg@frantovo.cz>
Mon, 21 Feb 2022 01:21:22 +0100
branch v_0
changeset 330 70e7eb578cfa
parent 294 abbc9bcfbcc4
permissions -rw-r--r--
Added tag relpipe-v0.18 for changeset 5bc2bb8b7946
abbc9bcfbcc4 Release v0.15 – streamlets, parallel processing
<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Collecting statistics from XHTML pages</nadpis>
	<perex>use XPath to get titles and headline counts from web pages, then show a bar chart and statistics</perex>
	<m:pořadí-příkladu>04000</m:pořadí-příkladu>
	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			The <code>relpipe-in-filesystem</code> tool and its <code>xpath</code> streamlet allow us to extract multiple values (attributes) from XML files.
			We can use this feature to collect data from e.g. XHTML pages.
		</p>
		
		<m:pre src="examples/xhtml-filesystem-xpath.sh" jazyk="bash"/>
		
		<p>
			The script above will show this bar chart and statistics:
		</p>
		
		<m:img src="img/xhtml-filesystem-xpath-1.png"/>
		
		<p>
			This pipeline consists of four steps:
		</p>
		
		<ul>
			<li>
				<code>findFiles</code>
				– prepares the list of files, separated by the <code>\0</code> byte;
				if we know the extension, we can add <code>-iname '*.xhtml'</code> to make the pipeline more efficient
			</li>
			<li>
				<code>fetchAttributes</code>
				– does the heavy work – tries to parse each given file as XML
				and, if it is valid, extracts several values specified by the XPath expressions;
				thanks to the <code>--parallel N</code> option, it utilizes N cores of our CPU;
				we can experiment with the N value and see how the total time decreases
			</li>
			<li>
				<code>filterAndOrder</code>
				– uses SQL to skip the records (files) that are not XHTML
				and takes the five valid files with the most headlines
			</li>
			<li>
				<code>relpipe-out-gui</code>
				– displays the data in a GUI window and generates a bar chart from the numeric values
				(we could instead use e.g. <code>relpipe-out-tabular</code> to display the data in a text terminal, or format the results as XML, CSV or another format)
			</li>
		</ul>
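		
		<p>
			The <code>findFiles</code> step described above can be tried in isolation.
			The following is only a sketch using the standard <code>find</code> tool;
			the temporary directory, the file names and the <code>find_files</code> helper are assumptions made up for this illustration
			and are not taken from the example script:
		</p>

```shell
#!/bin/sh
# Hypothetical demo directory; the file names are made up for illustration.
demo="$(mktemp -d)"
touch "$demo/index.xhtml" "$demo/About Us.XHTML" "$demo/notes.txt"

# Same idea as the findFiles step: emit matching paths separated by the \0 byte.
# -iname matches case-insensitively, so "About Us.XHTML" is included too.
find_files() {
	( cd "$demo" || exit 1; find . -type f -iname '*.xhtml' -print0 )
}

# Count the \0 separators to see how many files would reach the next step.
count="$(find_files | tr -cd '\0' | wc -c | tr -d ' ')"
echo "$count"

# Translate \0 to newlines for display only; a real pipeline keeps the \0 bytes.
names="$(find_files | tr '\0' '\n' | sort)"
echo "$names"

rm -rf "$demo"
```

		<p>
			Because the separator is the <code>\0</code> byte, file names containing spaces or newlines reach the next step unchanged;
			the <code>-iname</code> filter only reduces the number of files that the next step has to try to parse.
		</p>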
		
		<p>
			We can use a similar pipeline to extract any values from any set of XML files (e.g. Maven POM files or WSDL definitions).
			Using <code>--option mode raw-xml</code> we can even extract sub-trees (XML fragments) from the XML files, so we can also collect arbitrarily structured data, not only simple values like strings or booleans.
		</p>
		
	</text>
</stránka>