relpipe-data/examples-in-xmltable-tr-sql-xhtml-table.xml
author František Kučera <franta-hg@frantovo.cz>
Fri, 17 Jan 2020 19:56:22 +0100
branch v_0
changeset 292 c4b4864225de
parent 268 1b8576c9640c
permissions -rw-r--r--
streamlets preview

<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Processing data from an XHTML page using XMLTable and SQL</nadpis>
	<perex>reading a web table and computing some statistics</perex>
	<m:pořadí-příkladu>03000</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			Sometimes interesting data are published on a website in a semi-structured form.
			We can read such data and process them as relations using the XMLTable input and, for example, an SQL transformation.
			This example shows how to read the list of available Relpipe implementations,
			filter the commands (executables) and compute statistics, so we can see how many input filters, output filters and transformations we have:
		</p>
		
		<m:pre jazyk="bash" src="examples/xhtml-table-sql-statistics.sh"/>

		<p>This script will generate a relation:</p>

		<m:pre jazyk="text" src="examples/xhtml-table-sql-statistics.txt"/>
		
		<p>
			With these tools we can build, for example, an automatic system that watches a website and notifies us about changes.
			In SQL, we can use the EXCEPT operation to compare the current data with an older snapshot and SELECT only the new or changed records.
		</p>
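		
		<p>
			A minimal sketch of the EXCEPT comparison described above.
			To stay self-contained, it uses the standalone <code>sqlite3</code> command instead of <code>relpipe-tr-sql</code>;
			the function name, the table and column names and all sample rows (including the version numbers) are made up for illustration:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Sketch only: table names, column names and sample rows are hypothetical.
# The standalone sqlite3 CLI stands in for relpipe-tr-sql here.

new_or_changed() {
	sqlite3 -list :memory: <<'SQL'
-- older snapshot of the data
CREATE TABLE implementation_old (name TEXT, version TEXT);
INSERT INTO implementation_old VALUES
	('relpipe-in-csv',  '0.14'),
	('relpipe-out-xml', '0.14');

-- current data
CREATE TABLE implementation_new (name TEXT, version TEXT);
INSERT INTO implementation_new VALUES
	('relpipe-in-csv',  '0.14'),
	('relpipe-out-xml', '0.15'),
	('relpipe-tr-sql',  '0.15');

-- rows present in the current data but not in the older snapshot,
-- i.e. records that are new or whose values have changed
SELECT name, version FROM implementation_new
EXCEPT
SELECT name, version FROM implementation_old
ORDER BY name;
SQL
}
]]></m:pre>
		
		<p>
			Calling <code>new_or_changed</code> prints the changed <code>relpipe-out-xml</code> row and the new <code>relpipe-tr-sql</code> row; the unchanged record is filtered out.
			Swapping the two SELECTs would list the records that have disappeared or changed since the older snapshot.
		</p>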
		
		<p>
			There are also some caveats:
		</p>
		
		<p>
			What if the table structure changes?
			First, we must say that parsing a web page (a presentation format that was not designed for machine processing) is always suboptimal and hackish.
			The proper way is to arrange a machine-readable format for data exchange (e.g. XML with a well-defined schema).
			But if we do not have this option and must parse a web page, we can make the parsing more robust in two ways:
		</p>
		
		<ul>
			<li>modify the <code>--records</code> XPath expression so that it selects the table with the exact number of columns and the proper column names instead of selecting the first table,</li>
			<li>use XQuery, which is much more powerful than XMLTable and can even generate dynamic relations with attributes derived from the content of the XHTML table, so if new columns are added, we automatically get new attributes.</li>
		</ul>
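		
		<p>
			The first option could look roughly like the following sketch.
			The header captions (Name, Type, Language), the column count, the presence of <code>thead</code> and <code>tbody</code> elements,
			the relation and attribute names and the function name are all assumptions, not taken from the real page;
			the options other than <code>--records</code> are assumed to work as in the script above and should be checked against the installed version:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Sketch only: captions, column count and all names below are hypothetical.

# Select rows only from a table that has exactly three header cells
# with the expected captions, wherever that table is placed in the page.
RECORDS_XPATH="//h:table[
	count(h:thead/h:tr/h:th) = 3
	and normalize-space(h:thead/h:tr/h:th[1]) = 'Name'
	and normalize-space(h:thead/h:tr/h:th[2]) = 'Type'
	and normalize-space(h:thead/h:tr/h:th[3]) = 'Language'
]/h:tbody/h:tr"

# Reads XHTML from standard input and writes a relation to standard output.
read_implementations() {
	relpipe-in-xmltable \
		--namespace "h" "http://www.w3.org/1999/xhtml" \
		--relation "implementation" \
		--records "$RECORDS_XPATH" \
		--attribute "name" string "h:td[1]" \
		--attribute "type" string "h:td[2]" \
		--attribute "language" string "h:td[3]"
}
]]></m:pre>
		
		<p>
			If the table layout changes, the expression matches nothing and the relation comes out empty,
			which is easier to detect than silently reading values from the wrong columns.
		</p>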
		
		<p>
			What if the web page is invalid? Unfortunately, the current web is full of invalid and faulty documents that cannot be parsed easily.
			In such a case, we can pass the stream through the <code>tidy</code> tool, which fixes the errors, and then pass the result to <code>relpipe-in-xmltable</code>.
			It is just one additional step in our pipeline.
		</p>
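		
		<p>
			A minimal sketch of such an additional step, assuming the HTML Tidy command-line tool is installed.
			The function name <code>fix_xhtml</code> is made up for illustration and the chosen options are only one reasonable combination:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# Sketch only: fix_xhtml is a hypothetical helper name.

# Reads a possibly faulty HTML document from standard input
# and writes well-formed XHTML to standard output.
fix_xhtml() {
	# -asxhtml            convert the input to well-formed XHTML
	# -numeric            emit numeric entities instead of named ones
	# -quiet              suppress the informational summary
	# --show-warnings no  do not list individual warnings
	# tidy exits with 1 when it only found warnings; treat that as success.
	tidy -quiet -asxhtml -numeric --show-warnings no || [ $? -le 1 ]
}
]]></m:pre>
		
		<p>
			The function is then placed between the download and the <code>relpipe-in-xmltable</code> command, e.g. <code>curl … | fix_xhtml | relpipe-in-xmltable …</code>.
			Numeric entities are requested because a generic XML parser does not know the named HTML entities unless it loads the DTD.
		</p>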

		
	</text>

</stránka>