<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

	<nadpis>Collecting statistics from XHTML pages</nadpis>
	<perex>use XPath to get titles and headline counts from web pages, then show a bar chart and statistics</perex>
	<m:pořadí-příkladu>04000</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">

		<p>
			The <code>relpipe-in-filesystem</code> tool and its <code>xpath</code> streamlet allow us to extract multiple values (attributes) from XML files.
			We can use this feature to collect data from e.g. XHTML pages.
		</p>

		<m:pre src="examples/xhtml-filesystem-xpath.sh" jazyk="bash"/>

		<p>
			The script above will show this bar chart and statistics:
		</p>

		<m:img src="img/xhtml-filesystem-xpath-1.png"/>

		<p>
			This pipeline consists of four steps:
		</p>

		<ul>
			<li>
				<code>findFiles</code>
				– prepares the list of files separated by the <code>\0</code> byte;
				if we know the extension, we can add <code>-iname '*.xhtml'</code> to make the pipeline more efficient
			</li>
			<li>
				<code>fetchAttributes</code>
				– does the heavy work – tries to parse each given file as XML
				and, if it is valid, extracts several values specified by the XPath expressions;
				thanks to the <code>--parallel N</code> option, it utilizes N cores of our CPU;
				we can experiment with the value of N and watch how the total time decreases
			</li>
			<li>
				<code>filterAndOrder</code>
				– uses SQL to skip the records (files) that are not XHTML
				and takes the five valid files with the most headlines
			</li>
			<li>
				<code>relpipe-out-gui</code>
				– displays the data in a GUI window and generates a bar chart from the numeric values
				(we could use e.g. <code>relpipe-out-tabular</code> to display the data in the text terminal, or format the results as XML, CSV or another format)
			</li>
		</ul>
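
		<p>
			The four steps above can also be sketched without the Relational pipes tools, which may help when reasoning about what each stage contributes.
			The following is a minimal illustration in Python, not the script referenced above.
			It assumes only the standard library; the function names <code>find_files</code>, <code>fetch_attributes</code>, <code>filter_and_order</code> and <code>collect</code> are hypothetical,
			an in-memory SQLite table stands in for the SQL step, and headlines are assumed to mean the <code>h1</code> to <code>h3</code> elements.
		</p>

```python
import sqlite3
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

XHTML_NS = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML_NS}


def find_files(root, suffix=None):
    """Step 1: list regular files under root, optionally only one extension."""
    files = (p for p in Path(root).rglob("*") if p.is_file())
    if suffix is not None:
        # Case-insensitive match, similar in spirit to find's -iname.
        files = (p for p in files if p.suffix.lower() == suffix)
    return sorted(files)


def fetch_attributes(path):
    """Step 2: parse one file; return None when it is not parseable XML."""
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return None
    # ElementTree stores the namespace inside the tag as "{namespace}name".
    xmlns = root.tag[1:].partition("}")[0] if root.tag.startswith("{") else ""
    title = root.findtext("h:head/h:title", default="", namespaces=NS)
    headlines = sum(len(root.findall(f".//h:h{n}", NS)) for n in (1, 2, 3))
    return (str(path), xmlns, title, headlines)


def filter_and_order(records, limit=5):
    """Step 3: SQL drops non-XHTML records and keeps the top files."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE pages "
        "(path TEXT, root_xmlns TEXT, title TEXT, headlines INTEGER)"
    )
    db.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", records)
    rows = db.execute(
        "SELECT path, title, headlines FROM pages WHERE root_xmlns = ? "
        "ORDER BY headlines DESC, path LIMIT ?",
        (XHTML_NS, limit),
    ).fetchall()
    db.close()
    return rows


def collect(root, parallel=4):
    """Run steps 1 to 3; step 4 (display) is left to the caller."""
    files = find_files(root)
    # Threads keep the sketch simple; a process pool would be needed
    # to really spread the parsing over several CPU cores.
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        records = [r for r in pool.map(fetch_attributes, files) if r is not None]
    return filter_and_order(records)
```

		<p>
			Calling <code>collect(".", parallel=8)</code> returns up to five <code>(path, title, headlines)</code> rows, which could then be printed or charted.
			Doing the filtering in SQL mirrors the <code>filterAndOrder</code> step: the parsing stage stays generic and the selection criteria live in a single query.
		</p>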

		<p>
			We can use a similar pipeline to extract any values from any set of XML files (e.g. Maven POM files or WSDL definitions).
			Using <code>--option mode raw-xml</code> we can even extract sub-trees (XML fragments) from the XML files, so we can also collect arbitrarily structured data, not only simple values like strings or booleans.
		</p>
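
		<p>
			To illustrate the difference between extracting a simple value and extracting a sub-tree, here is a small Python sketch.
			It only approximates the idea using the standard library and says nothing about how the <code>raw-xml</code> mode is implemented;
			the function name <code>extract_fragments</code> is hypothetical, and ElementTree supports only a limited subset of XPath.
		</p>

```python
import xml.etree.ElementTree as ET


def extract_fragments(xml_text, path, namespaces=None):
    """Return every element matched by `path` as a serialized XML string."""
    root = ET.fromstring(xml_text)
    fragments = []
    for element in root.findall(path, namespaces):
        # tostring() serializes the whole sub-tree, including child elements;
        # strip() drops the whitespace that followed the element in the source.
        fragments.append(ET.tostring(element, encoding="unicode").strip())
    return fragments
```

		<p>
			Given a POM-like document without a default namespace, <code>extract_fragments(pom_text, ".//dependency")</code> would return one XML string per <code>dependency</code> element,
			so the nested structure survives instead of being flattened to a single text value.
		</p>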

	</text>

</stránka>