author | František Kučera <franta-hg@frantovo.cz> |
Mon, 21 Feb 2022 00:43:11 +0100 | |
branch | v_0 |
changeset 329 | 5bc2bb8b7946 |
parent 268 | 1b8576c9640c |
permissions | -rw-r--r-- |
268
1b8576c9640c
examples: XHTML table processing in SQL
František Kučera <franta-hg@frantovo.cz>
<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	<nadpis>Processing data from an XHTML page using XMLTable and SQL</nadpis>
	<perex>reading a web table and computing some statistics</perex>
	<m:pořadí-příkladu>03000</m:pořadí-příkladu>
	<text xmlns="http://www.w3.org/1999/xhtml">
		<p>
			Sometimes a website publishes interesting data in a semi-structured form.
			We can read such data and process them as relations using the XMLTable input and e.g. an SQL transformation.
			This example shows how to read the list of available Relpipe implementations,
			filter the commands (executables) and compute statistics, so we can see how many input filters, output filters and transformations we have:
		</p>
		<m:pre jazyk="bash" src="examples/xhtml-table-sql-statistics.sh"/>
		<p>This script generates a relation:</p>
		<m:pre jazyk="text" src="examples/xhtml-table-sql-statistics.txt"/>
		<p>
			Using these tools we can build e.g. an automatic system that watches a website and notifies us about changes.
			In SQL, we can use the EXCEPT operation to compare the current data with an older snapshot and SELECT only the new or changed records.
		</p>
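		<p>
			The EXCEPT comparison can be sketched as follows.
			This is only a minimal illustration that uses the Python <code>sqlite3</code> module as a stand-in for the SQL transformation;
			the table names <code>old_data</code> and <code>current_data</code>, the function <code>new_or_changed</code> and the sample rows are hypothetical:
		</p>

		<m:pre jazyk="python"><![CDATA[import sqlite3


def new_or_changed(old_rows, current_rows):
    """Return rows that are in the current snapshot but not in the old one."""
    db = sqlite3.connect(":memory:")
    # Two snapshots of the same (hypothetical) relation read from a web page.
    db.execute("CREATE TABLE old_data (name TEXT, type TEXT)")
    db.execute("CREATE TABLE current_data (name TEXT, type TEXT)")
    db.executemany("INSERT INTO old_data VALUES (?, ?)", old_rows)
    db.executemany("INSERT INTO current_data VALUES (?, ?)", current_rows)
    # EXCEPT keeps the rows of the first SELECT that the second SELECT lacks,
    # so a changed record is reported with its new values.
    result = db.execute(
        "SELECT name, type FROM current_data "
        "EXCEPT "
        "SELECT name, type FROM old_data "
        "ORDER BY name"
    ).fetchall()
    db.close()
    return result


old = [("relpipe-in-csv", "input"), ("relpipe-tr-sql", "transformation")]
current = [
    ("relpipe-in-csv", "input"),
    ("relpipe-tr-sql", "tr"),           # changed record
    ("relpipe-out-tabular", "output"),  # new record
]
print(new_or_changed(old, current))
]]></m:pre>

		<p>
			Records that disappeared from the page can be found the same way by swapping the two SELECT statements.
		</p>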
		<p>
			There are also some caveats:
		</p>
		<p>
			What if the table structure changes?
			First, we must say that parsing a web page (which is a presentation form, not designed for machine processing) is always suboptimal and hackish.
			The proper way is to arrange a machine-readable format for data exchange (e.g. XML with a well-defined schema).
			But if we do not have this option and must parse a web page, we can make the parsing more robust in two ways:
		</p>
		<ul>
			<li>modify the <code>--records</code> XPath expression so that it selects the table with the exact number of columns and the proper column names instead of selecting the first table,</li>
			<li>use XQuery, which is much more powerful than XMLTable and can generate even dynamic relations with attributes derived from the content of the XHTML table, so if new columns are added, we automatically get new attributes.</li>
		</ul>
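		<p>
			Both ideas can be illustrated outside of Relpipe.
			The following is a minimal Python sketch, not the XPath or XQuery expression that would be used in the pipeline:
			it picks the table by its header names instead of its position and derives the attribute names from the header row.
			The page content, the column names and the functions <code>find_table</code> and <code>read_records</code> are hypothetical:
		</p>

		<m:pre jazyk="python"><![CDATA[import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical page: a navigation table comes first, the data table second.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
<table>
<tr><th>Name</th><th>Type</th></tr>
<tr><td>relpipe-in-csv</td><td>input</td></tr>
<tr><td>relpipe-tr-sql</td><td>transformation</td></tr>
</table>
</body></html>"""


def cell_text(element):
    """Return the text of a cell, including text of nested elements."""
    return "".join(element.itertext()).strip()


def header_names(table):
    """Return the texts of all header cells of the table."""
    return [cell_text(th) for th in table.iter(XHTML + "th")]


def find_table(root, expected_headers):
    """Return the first table whose header names match exactly, or None."""
    for table in root.iter(XHTML + "table"):
        if header_names(table) == expected_headers:
            return table
    return None


def read_records(table):
    """Build one dict per data row, keyed by whatever headers the table has."""
    names = header_names(table)
    records = []
    for row in table.iter(XHTML + "tr"):
        cells = [cell_text(td) for td in row.findall(XHTML + "td")]
        if cells:  # the header row has no td cells
            records.append(dict(zip(names, cells)))
    return records


root = ET.fromstring(PAGE)
table = find_table(root, ["Name", "Type"])
print(read_records(table))
]]></m:pre>

		<p>
			If the expected header is missing, <code>find_table</code> returns nothing instead of silently reading a wrong table,
			and a column added to the page simply becomes one more key in each record.
		</p>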
		<p>
			What if the web page is invalid? Unfortunately, the current web is full of invalid and faulty documents that cannot be easily parsed.
			In such a case, we can pass the stream through the <code>tidy</code> tool, which fixes the errors, and then pass it to <code>relpipe-in-xmltable</code>.
			It is just one additional step in our pipeline.
		</p>
	</text>
</stránka>