author | František Kučera <franta-hg@frantovo.cz> |
Mon, 21 Feb 2022 00:43:11 +0100 | |
branch | v_0 |
changeset 329 | 5bc2bb8b7946 |
parent 294 | abbc9bcfbcc4 |
permissions | -rw-r--r-- |
294
abbc9bcfbcc4
Release v0.15 – streamlets, parallel processing
František Kučera <franta-hg@frantovo.cz>
<stránka
xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

<nadpis>Collecting statistics from XHTML pages</nadpis>
<perex>use XPath to get titles and headline counts from web pages, then show a bar chart and statistics</perex>
<m:pořadí-příkladu>04000</m:pořadí-příkladu>

<text xmlns="http://www.w3.org/1999/xhtml">

<p>
The <code>relpipe-in-filesystem</code> tool and its <code>xpath</code> streamlet allow us to extract multiple values (attributes) from XML files.
We can use this feature to collect data from e.g. XHTML pages.
</p>

<m:pre src="examples/xhtml-filesystem-xpath.sh" jazyk="bash"/>

<p>
The script above displays this bar chart and statistics:
</p>

<m:img src="img/xhtml-filesystem-xpath-1.png"/>

<p>
This pipeline consists of four steps:
</p>

<ul>
<li>
<code>findFiles</code>
– prepares the list of files, separated by the <code>\0</code> byte;
if we know the extension, we can add <code>-iname '*.xhtml'</code> to make the pipeline more efficient
</li>
<li>
<code>fetchAttributes</code>
– does the heavy work: tries to parse each given file as XML
and, if it is valid, extracts several values specified by the XPath expressions;
thanks to the <code>--parallel N</code> option, it utilizes N cores of our CPU;
we can experiment with the value of N and watch how the total time decreases
</li>
<li>
<code>filterAndOrder</code>
– uses SQL to skip the records (files) that are not XHTML
and takes the five valid files with the most headlines
</li>
<li>
<code>relpipe-out-gui</code>
– displays the data in a GUI window and generates a bar chart from the numeric values
(we could instead use e.g. <code>relpipe-out-tabular</code> to display the data in a text terminal, or format the results as XML, CSV or another format)
</li>
</ul>
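
<p>
The first step, with the extension filter mentioned above, could look roughly like the following sketch.
It is only an illustration of the <code>\0</code>-separated file list: the optional directory argument and the <code>countFiles</code> helper are assumptions made here, not part of the example script, and the <code>-type f</code> test is an extra restriction to regular files.
</p>

<m:pre jazyk="bash"><![CDATA[#!/bin/bash

# Hypothetical variant of the findFiles step: list only regular files
# whose name ends with .xhtml (case-insensitive), each path terminated by a \0 byte.
# The directory argument is an assumption; it defaults to the current directory.
findFiles() {
	find "${1:-.}" -type f -iname '*.xhtml' -print0
}

# Hypothetical helper: count the \0-terminated records on standard input,
# i.e. the number of files that would be passed to the next step.
countFiles() {
	tr -cd '\0' | wc -c
}

# Show how many files the rest of the pipeline would have to parse.
findFiles . | countFiles
]]></m:pre>

<p>
Comparing this count with the one obtained without the <code>-iname</code> filter gives a rough idea of how many files the parsing step no longer needs to open.
</p>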

<p>
We can use a similar pipeline to extract any values from any set of XML files (e.g. Maven POM files or WSDL definitions).
Using <code>--option mode raw-xml</code>, we can even extract sub-trees (XML fragments) from the XML files, so we can also collect arbitrarily structured data, not only simple values like strings or booleans.
</p>

</text>

</stránka>