# HG changeset patch # User František Kučera # Date 1566906637 -7200 # Node ID 1b8576c9640cb9a5dac40d1f4bd1c94bf96f1fce # Parent 1826d1cce40421d1305df311fe7b81ef989ab499 examples: XHTML table processing in SQL diff -r 1826d1cce404 -r 1b8576c9640c relpipe-data/examples-in-xmltable-tr-sql-xhtml-table.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/relpipe-data/examples-in-xmltable-tr-sql-xhtml-table.xml Tue Aug 27 13:50:37 2019 +0200 @@ -0,0 +1,54 @@ + + + Processing data from an XHTML page using XMLTable and SQL + reading a web table and compute some statistics + 03000 + + + +

+ Sometimes there are interesting data in a semi-structured form on a website. + We can read such data and process them as relations using the XMLTable input and e.g. SQL transformation. + This example shows how to read the list of available Relpipe implementations, + filter the commands (executables) and compute statistics, so we can see, how many input filters, output filters and transformations we have: +

+ + + +

This script will generate a relation:

+ + + +

+ Using these tools we can build e.g. an automatic system which watches a website and notifies us about the changes. + In SQL, we can use the EXCEPT operation and compare current data with older ones and SELECT only the new or changed records. +

+ +

+ There are also some caveats: +

+ +

+ What if the table structure changes? + At first, we must say that parsing a web page (which is a presentation form, not designed for machine processing) is always suboptimal and hackish. + The propper way is to arrange a machine-readable format for data exchange (e.g. XML with well-defined schema). + But if we do not have this option and must parse some web page, we can improve it in two ways: +

+ +
    +
  • modify the --records XPath expression so it will select the table with exact number of colums and propper names instead of selecting the first table,
  • +
  • use XQuery which is much more powerful than XMLTable and can generate even dynamic relations with attributes derived from the content of the XHTML table, so if new columns are added, we will get automatically new attributes.
  • +
+ +

+ What if the web page is invalid? Unfortunately, current web is full of invalid and faulty documents that can not be easily parsed. + In such case, we can pass the stream through the tidy tool which fixes the bugs and then pass it to the relpipe-in-xmltable. + It is just one additional step in our pipeline. +

+ + +
+ +
diff -r 1826d1cce404 -r 1b8576c9640c relpipe-data/examples/xhtml-table-sql-statistics.sh --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/relpipe-data/examples/xhtml-table-sql-statistics.sh Tue Aug 27 13:50:37 2019 +0200 @@ -0,0 +1,26 @@ +#!/bin/bash + +URL="https://relational-pipes.globalcode.info/v_0/implementation.xhtml"; + +# or we can use wget, if curl is missing or we like wget more: +# curl() { wget -O - "$1"; } + +curl "$URL" \ + | relpipe-in-xmltable \ + --namespace "h" "http://www.w3.org/1999/xhtml" \ + --relation "implementation" \ + --records "//h:table[1]/h:tbody/h:tr" \ + --attribute "name" string "h:td[1]" \ + --attribute "type" string "h:td[2]" \ + --attribute "subtype" string "h:td[3]" \ + | relpipe-tr-sql \ + --relation \ + "relpipe_commands" \ + "SELECT + subtype AS type, + count(*) AS count + FROM implementation + WHERE type = 'executable' + GROUP BY type, subtype + ORDER BY count DESC" \ + | relpipe-out-tabular diff -r 1826d1cce404 -r 1b8576c9640c relpipe-data/examples/xhtml-table-sql-statistics.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/relpipe-data/examples/xhtml-table-sql-statistics.txt Tue Aug 27 13:50:37 2019 +0200 @@ -0,0 +1,9 @@ +relpipe_commands: + ╭────────────────┬────────────────╮ + │ type (string) │ count (string) │ + ├────────────────┼────────────────┤ + │ output │ 8 │ + │ transformation │ 8 │ + │ input │ 7 │ + ╰────────────────┴────────────────╯ +Record count: 3 diff -r 1826d1cce404 -r 1b8576c9640c relpipe-data/index.xml --- a/relpipe-data/index.xml Tue Aug 27 12:32:43 2019 +0200 +++ b/relpipe-data/index.xml Tue Aug 27 13:50:37 2019 +0200 @@ -178,7 +178,7 @@

- Data can flow through several transformation or directly from the input filter to the output filter. + Data can flow through a sequence of several transformations or directly from the input filter to the output filter.