examples: XHTML table processing in SQL v_0
authorFrantišek Kučera <franta-hg@frantovo.cz>
Tue, 27 Aug 2019 13:50:37 +0200
branchv_0
changeset 268 1b8576c9640c
parent 267 1826d1cce404
child 269 dc36750c7c0a
examples: XHTML table processing in SQL
relpipe-data/examples-in-xmltable-tr-sql-xhtml-table.xml
relpipe-data/examples/xhtml-table-sql-statistics.sh
relpipe-data/examples/xhtml-table-sql-statistics.txt
relpipe-data/index.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples-in-xmltable-tr-sql-xhtml-table.xml	Tue Aug 27 13:50:37 2019 +0200
@@ -0,0 +1,54 @@
+<stránka
+	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
+	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
+	
+	<nadpis>Processing data from an XHTML page using XMLTable and SQL</nadpis>
+	<perex>reading a web table and compute some statistics</perex>
+	<m:pořadí-příkladu>03000</m:pořadí-příkladu>
+
+	<text xmlns="http://www.w3.org/1999/xhtml">
+		
+		<p>
+			Sometimes there are interesting data in a semi-structured form on a website.
+			We can read such data and process them as relations using the XMLTable input and e.g. SQL transformation.
+			This example shows how to read the list of available Relpipe implementations,
+			filter the commands (executables) and compute statistics, so we can see, how many input filters, output filters and transformations we have:
+		</p>
+		
+		<m:pre jazyk="bash" src="examples/xhtml-table-sql-statistics.sh"/>
+
+		<p>This script will generate a relation:</p>
+
+		<m:pre jazyk="text" src="examples/xhtml-table-sql-statistics.txt"/>
+		
+		<p>
+			Using these tools we can build e.g. an automatic system which watches a website and notifies us about the changes.
+			In SQL, we can use the EXCEPT operation and compare current data with older ones and SELECT only the new or changed records.
+		</p>
+		
+		<p>
+			There are also some caveats:
+		</p>
+		
+		<p>
+			What if the table structure changes? 
+			At first, we must say that parsing a web page (which is a presentation form, not designed for machine processing) is always suboptimal and hackish.
+			The propper way is to arrange a machine-readable format for data exchange (e.g. XML with well-defined schema).
+			But if we do not have this option and must parse some web page, we can improve it in two ways:
+		</p>
+		
+		<ul>
+			<li>modify the <code>--records</code> XPath expression so it will select the table with exact number of colums and propper names instead of selecting the first table,</li>
+			<li>use XQuery which is much more powerful than XMLTable and can generate even dynamic relations with attributes derived from the content of the XHTML table, so if new columns are added, we will get automatically new attributes.</li>
+		</ul>
+		
+		<p>
+			What if the web page is invalid? Unfortunately, current web is full of invalid and faulty documents that can not be easily parsed.
+			In such case, we can pass the stream through the <code>tidy</code> tool which fixes the bugs and then pass it to the <code>relpipe-in-xmltable</code>.
+			It is just one additional step in our pipeline.
+		</p>
+
+		
+	</text>
+
+</stránka>
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples/xhtml-table-sql-statistics.sh	Tue Aug 27 13:50:37 2019 +0200
@@ -0,0 +1,26 @@
+#!/bin/bash
+
+URL="https://relational-pipes.globalcode.info/v_0/implementation.xhtml";
+
+# or we can use wget, if curl is missing or we like wget more:
+# curl() { wget -O - "$1"; }
+
+curl "$URL" \
+	| relpipe-in-xmltable \
+		--namespace "h" "http://www.w3.org/1999/xhtml" \
+		--relation "implementation" \
+			--records "//h:table[1]/h:tbody/h:tr" \
+			--attribute "name"     string "h:td[1]" \
+			--attribute "type"     string "h:td[2]" \
+			--attribute "subtype"  string "h:td[3]" \
+	| relpipe-tr-sql \
+		--relation \
+			"relpipe_commands" \
+			"SELECT
+				subtype AS type,
+				count(*) AS count
+			FROM implementation
+			WHERE type = 'executable'
+			GROUP BY type, subtype
+			ORDER BY count DESC" \
+	| relpipe-out-tabular
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/relpipe-data/examples/xhtml-table-sql-statistics.txt	Tue Aug 27 13:50:37 2019 +0200
@@ -0,0 +1,9 @@
+relpipe_commands:
+ ╭────────────────┬────────────────╮
+ │ type  (string) │ count (string) │
+ ├────────────────┼────────────────┤
+ │ output         │ 8              │
+ │ transformation │ 8              │
+ │ input          │ 7              │
+ ╰────────────────┴────────────────╯
+Record count: 3
--- a/relpipe-data/index.xml	Tue Aug 27 12:32:43 2019 +0200
+++ b/relpipe-data/index.xml	Tue Aug 27 13:50:37 2019 +0200
@@ -178,7 +178,7 @@
 		<m:img src="img/relational-pipes-big-picture-1.png"/>
 		
 		<p>
-			Data can flow through several transformation or directly from the input filter to the output filter.
+			Data can flow through a sequence of several transformations or directly from the input filter to the output filter.
 		</p>