# HG changeset patch
# User František Kučera
# Date 1572372052 -3600
# Node ID eccf2de78284a1a371b9d540694a3282d93c3bb5
# Parent de1b49ba06f1d7a9678394ab9f089ea6c9a5a7c1
examples: Getting the outline of an XHTML page

diff -r de1b49ba06f1 -r eccf2de78284 relpipe-data/examples-in-xmltable-xhtml-outline.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-in-xmltable-xhtml-outline.xml	Tue Oct 29 19:00:52 2019 +0100
@@ -0,0 +1,176 @@
+
+
+ Getting the outline of an XHTML page
+ collect list of headlines (or images, links etc.) of a website using XMLTable
+ 03400
+
+
+
+

+ Because an XHTML web page is an XML document, it can be processed using XML tools (XSLT, XPath, XQuery etc.). + In this example, we will use relpipe-in-xmltable to get a list of headlines (an outline) and other objects from a web page. +

+ + + +

+ This pipeline looks for h1 and h2 headlines and presents them as a relation. + We can fine-tune the XPath expression to get only certain kinds of headlines (this is specific to a particular site): +

+ + + +

And we get this listing:

+ + + +
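
For readers without the relpipe tools at hand, the same idea (select h1 and h2 elements and present them as rows) can be sketched with the Python standard library. This is only an illustration under stated assumptions, not the relpipe-in-xmltable pipeline itself: the sample page, the `outline` function and the column layout are made up for this sketch, and whitespace is collapsed in plain Python.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample page; a real one would come from a file or an HTTP response.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body>
    <h1>Release notes</h1>
    <p>Intro</p>
    <h2>New   features</h2>
    <h2>Bug fixes</h2>
  </body>
</html>"""


def outline(xhtml_text, levels=("h1", "h2")):
    """Return (level, text) rows for the chosen headlines, in document order."""
    root = ET.fromstring(xhtml_text)
    # Map fully qualified tag names such as "{namespace}h1" back to "h1".
    wanted = {f"{{{XHTML_NS}}}{name}": name for name in levels}
    rows = []
    for element in root.iter():  # iter() walks the tree in document order
        level = wanted.get(element.tag)
        if level is not None:
            # Join all nested text and collapse runs of whitespace.
            text = " ".join("".join(element.itertext()).split())
            rows.append((level, text))
    return rows


for level, text in outline(SAMPLE):
    print(f"{level}\t{text}")
```

The tree is walked once instead of running two separate queries, so h1 and h2 rows stay interleaved in the order in which they appear on the page.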

+ Using slightly modified expressions: +

+ + + +

+ we will get a list of images in a different article on the same site: +

+ + + + +
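
As a rough stand-alone illustration of the image variant, the following Python sketch collects the src and alt attributes of all img elements. The sample document and the `images` function are assumptions made for this sketch and do not come from the site used above.

```python
import xml.etree.ElementTree as ET

# Prefix "h" is bound to the XHTML namespace, as in the XPath expressions above.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample article.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><img src="img/a.png" alt="First"/></p>
    <p><img src="img/b.png"/></p>
  </body>
</html>"""


def images(xhtml_text):
    """Return (src, alt) rows for every img element; missing attributes become ''."""
    root = ET.fromstring(xhtml_text)
    return [
        (img.get("src", ""), img.get("alt", ""))
        for img in root.findall(".//h:img", NS)
    ]


for src, alt in images(SAMPLE):
    print(f"{src}\t{alt}")
```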

+ There can be multiple --relation sections, so we can get multiple relations from a single XML stream: +

+ + + +
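
The idea of several relations produced from one input can be sketched in Python as well. The relation names, the `RELATIONS` table and the `extract_relations` function below are hypothetical and only mirror the concept of one parse feeding several record selections.

```python
import xml.etree.ElementTree as ET

NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample page.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Gallery</h1>
    <p><a href="https://example.org/">Home</a></p>
    <p><img src="img/a.png" alt="First"/></p>
  </body>
</html>"""


def text_of(element):
    """Nested text of an element with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


# relation name -> (element path, {attribute name: extractor function})
RELATIONS = {
    "headline": (".//h:h1", {"text": text_of}),
    "link": (".//h:a", {"url": lambda e: e.get("href", ""), "text": text_of}),
    "image": (
        ".//h:img",
        {"src": lambda e: e.get("src", ""), "alt": lambda e: e.get("alt", "")},
    ),
}


def extract_relations(xhtml_text, relations=RELATIONS):
    """Parse the document once and build every configured relation from it."""
    root = ET.fromstring(xhtml_text)
    result = {}
    for name, (path, attributes) in relations.items():
        result[name] = [
            {attr: extractor(element) for attr, extractor in attributes.items()}
            for element in root.findall(path, NS)
        ]
    return result


for name, rows in extract_relations(SAMPLE).items():
    print(name, rows)
```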

+ So we can collect various types of objects in a single run. + Such data can be stored or cataloged for later use. + Or we can run a shell command for each of them. For example, if we have a website with some interesting content, + we can find the XPath pattern matching that content and use it to download the desired files: +

+ + + +
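
A minimal Python sketch of the "one command per record" idea follows. It only builds the commands and prints them; nothing is downloaded. The base URL, the sample links, the `.pdf` filter and the choice of `wget` are all assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical address of the page; relative links are resolved against it.
BASE_URL = "https://example.org/articles/page.xhtml"

# Hypothetical sample page.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><a href="files/report.pdf">Report</a></p>
    <p><a href="about.xhtml">About</a></p>
    <p><a href="/data/table.pdf">Table</a></p>
  </body>
</html>"""


def download_commands(xhtml_text, base_url, suffix=".pdf"):
    """Return one argument vector per link whose absolute URL ends with suffix."""
    root = ET.fromstring(xhtml_text)
    commands = []
    for link in root.findall(".//h:a[@href]", NS):
        url = urljoin(base_url, link.get("href"))
        if url.endswith(suffix):
            commands.append(["wget", url])
    return commands


for argv in download_commands(SAMPLE, BASE_URL):
    # Passing argv to subprocess.run(argv, check=True) would start the download.
    print(" ".join(argv))
```

Keeping each command as a list of arguments rather than one string avoids shell quoting problems if a URL contains spaces or other special characters.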

+ XPath is a very powerful language and allows us to work with the context of the nodes (XPath axes) + or call various functions, so we can easily pick exactly what we want and download just that, or process it in a different way (compute some statistics, catalog it etc.). +

+ +

+ N.B. many web pages are poorly written and contain invalid markup. + But fortunately there is the tidy tool which can usually clean up such garbage: +

+ + + +

+ so we can fix their mistakes and process even such web sites: +

+ + + +

+ The AWK transformation serves just as an illustration of how we can combine various tools. + However, the number of records can also be limited directly by the --records '(//h:h1|//h:h2)[position() <= 10]' XPath expression in the relpipe-in-xmltable transformation. +
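
The effect of the `[position() <= 10]` predicate, which keeps only the first ten headlines in document order, can be approximated in plain Python. The standard library's ElementTree does not evaluate such predicates, so this sketch swaps in `itertools.islice`; the generated sample page and the `first_headlines` function are hypothetical.

```python
from itertools import islice
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
HEADLINE_TAGS = {f"{{{XHTML_NS}}}h1", f"{{{XHTML_NS}}}h2"}

# Hypothetical page: one h1 followed by twelve h2 headlines.
SAMPLE = (
    f'<html xmlns="{XHTML_NS}"><body><h1>Title</h1>'
    + "".join(f"<h2>Section {i}</h2>" for i in range(1, 13))
    + "</body></html>"
)


def first_headlines(xhtml_text, limit=10):
    """Return the text of at most `limit` h1/h2 headlines in document order."""
    root = ET.fromstring(xhtml_text)
    headlines = (el for el in root.iter() if el.tag in HEADLINE_TAGS)
    # islice stops after `limit` items, so the rest of the tree is not visited.
    return ["".join(el.itertext()).strip() for el in islice(headlines, limit)]


for text in first_headlines(SAMPLE):
    print(text)
```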

+ + + +