diff -r b862d16a2e9f -r abbc9bcfbcc4 relpipe-data/examples-xhtml-filesystem-xpath.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/relpipe-data/examples-xhtml-filesystem-xpath.xml Mon Feb 03 22:10:07 2020 +0100 @@ -0,0 +1,60 @@ + + + Collecting statistics from XHTML pages + use XPath to get titles and headlines counts from web pages and then show a bar chart and statistics + 04000 + + + +

+ The relpipe-in-filesystem and the xpath streamlet allow us to extract multiple values (attributes) from XML files. + We can use this feature to collect data from e.g. XHTML pages. +

+ + + +

+ The script above will show this bar chart and statistics: +

+ + + +

+ This pipeline consists of four steps: +

+ +
    +
  • + findFiles + – prepares the list of files separated by the \0 byte; + we can add -iname '*.xhtml' if we know the extension, which makes the pipeline more efficient +
  • +
  • + fetchAttributes + – does the heavy work – tries to parse each given file as XML + and, if it is valid, extracts several values specified by the XPath expressions; + thanks to the --parallel N option, it utilizes N cores of our CPU; + we can experiment with the N value and see how the total time decreases +
  • +
  • + filterAndOrder + – uses SQL to skip the records (files) that are not XHTML + and takes the five valid files with the most headlines +
  • +
  • + relpipe-out-gui + – displays the data in a GUI window and generates a bar chart from the numeric values + (we could use e.g. relpipe-out-tabular to display the data in the text terminal or format the results as XML, CSV or another format) +
  • +
+ +
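The first three steps can also be sketched outside of relpipe to make the data flow explicit. The following Python sketch is not the script from this example and does not call any relpipe tool; it only mirrors the find, extract, filter and order logic using the standard library. The function names, the record fields and the choice to count h1 to h6 elements as "headlines" are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

XHTML_NS = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML_NS}


def find_files(root):
    """Step 1: list every regular file under root (any extension)."""
    return [p for p in Path(root).rglob("*") if p.is_file()]


def fetch_attributes(path):
    """Step 2: try to parse the file as XML and extract a few values."""
    try:
        root = ET.parse(str(path)).getroot()
    except (ET.ParseError, OSError, ValueError):
        return None  # not well-formed XML or unreadable: no record
    title = root.findtext("h:head/h:title", default="", namespaces=NS)
    # Assumption: "headlines" means all h1..h6 elements in the XHTML namespace.
    headlines = sum(len(root.findall(f".//h:h{i}", NS)) for i in range(1, 7))
    return {
        "path": str(path),
        "root_tag": root.tag,
        "title": title,
        "headlines": headlines,
    }


def filter_and_order(records, limit=5):
    """Step 3: keep XHTML documents only, take those with the most headlines."""
    xhtml = [r for r in records if r and r["root_tag"] == f"{{{XHTML_NS}}}html"]
    return sorted(xhtml, key=lambda r: r["headlines"], reverse=True)[:limit]


def collect(root, workers=4):
    """Run the three steps; threads only loosely mirror a --parallel N option."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        records = list(pool.map(fetch_attributes, find_files(root)))
    return filter_and_order(records)
```

Calling `collect(".")` returns a list of at most five dictionaries, which could then be printed or plotted. Files that are not well-formed XML are dropped in step 2, and well-formed XML that is not XHTML is dropped in step 3, just as in the pipeline described above.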

+ We can use a similar pipeline to extract any values from any set of XML files (e.g. Maven POM files or WSDL definitions). + Using --option mode raw-xml we can extract even sub-trees (XML fragments) from the XML files, so we can also collect arbitrarily structured data, not only simple values like strings or booleans. +
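To illustrate the difference between extracting a simple value and extracting a sub-tree, here is a small Python sketch using the standard library. It is not how the xpath streamlet is implemented; the helper names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_value(xml_text, path):
    """Return the text content of the first matching element (a simple value)."""
    return ET.fromstring(xml_text).findtext(path)


def extract_fragment(xml_text, path):
    """Return the first matching element serialized as an XML fragment."""
    node = ET.fromstring(xml_text).find(path)
    if node is None:
        return None
    return ET.tostring(node, encoding="unicode")


# Hypothetical sample resembling a simplified POM file (no namespace).
SAMPLE = (
    "<project><artifactId>demo</artifactId>"
    "<dependencies><dependency><artifactId>x</artifactId></dependency></dependencies>"
    "</project>"
)
```

With this sample, `extract_value(SAMPLE, "artifactId")` yields a plain string, while `extract_fragment(SAMPLE, "dependencies/dependency")` yields the whole `dependency` element including its children.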

+ +
+ +