author | František Kučera <franta-hg@frantovo.cz> |
Mon, 21 Feb 2022 00:43:11 +0100 | |
branch | v_0 |
changeset 329 | 5bc2bb8b7946 |
parent 290 | e73765513aec |
permissions | -rw-r--r-- |
<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

	<nadpis>Getting the outline of an XHTML page</nadpis>
	<perex>collect a list of headlines (or images, links etc.) from a website using XMLTable</perex>
	<m:pořadí-příkladu>03400</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">

		<p>
			Because an XHTML web page is an XML document, it can be processed using XML tools (XSLT, XPath, XQuery etc.).
			In this example, we will use <code>relpipe-in-xmltable</code> to get a list of headlines (an outline) and other objects from a web page.
		</p>

		<m:pre jazyk="bash"><![CDATA[wget -O - http://blog.frantovo.cz/c/373 \
	| relpipe-in-xmltable \
		--namespace 'h' 'http://www.w3.org/1999/xhtml' \
		--relation 'headlines' \
			--records '//h:h1|//h:h2' \
			--attribute 'level' string 'name()' \
			--attribute 'headline' string '.' \
	| relpipe-out-tabular]]></m:pre>

		<p>
			This pipeline looks for <code>h1</code> and <code>h2</code> headlines and presents them as a relation.
			We can fine-tune the XPath expression to get only certain kinds of headlines (the expression is specific to a particular site):
		</p>

		<m:pre jazyk="bash"><![CDATA[wget -O - http://blog.frantovo.cz/c/373 \
	| relpipe-in-xmltable \
		--namespace 'h' 'http://www.w3.org/1999/xhtml' \
		--relation 'headlines' \
			--records '//h:div[@id="obsah"]//h:h1|//h:div[@id="obsah"]//h:h2' \
			--attribute 'level' string 'name()' \
			--attribute 'headline' string '.' \
	| relpipe-out-tabular]]></m:pre>

		<p>And get this listing:</p>

		<m:pre jazyk="text"><![CDATA[headlines:
╭────────────────┬────────────────────────────────────╮
│ level (string) │ headline (string)                  │
├────────────────┼────────────────────────────────────┤
│ h1             │ Opravujeme myš: výměna spínačů     │
│ h2             │ Popis problému                     │
│ h2             │ Výdrž 50 milionů kliknutí?         │
│ h2             │ Oprava – výměna spínače            │
│ h2             │ Myš Razer Orochi                   │
│ h2             │ Trackball Logitech TrackMan Marble │
│ h2             │ Myší spínače Omron a Panasonic     │
│ h2             │ Závěr                              │
╰────────────────┴────────────────────────────────────╯
Record count: 8]]></m:pre>

		<p>
			Using slightly modified expressions:
		</p>

		<m:pre jazyk="bash"><![CDATA[wget -O - http://blog.frantovo.cz/c/376 \
	| relpipe-in-xmltable \
		--namespace 'h' 'http://www.w3.org/1999/xhtml' \
		--relation 'images' \
			--records '//h:div[@id="obsah"]//h:img' \
			--attribute 'image_file' string '@src' \
			--attribute 'title' string '@title' \
	| relpipe-out-tabular]]></m:pre>

		<p>
			we get a list of images in a different article on the same site:
		</p>


		<m:pre jazyk="text"><![CDATA[images:
╭──────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────╮
│ image_file (string)                      │ title (string)                                                                 │
├──────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────┤
│ /s/1467/nahled_IMG_2778.JPG              │ Nordic nRF52840, Logitech, bezpečnost bezdrátových myší a klávesnic, MouseJack │
│ /s/1472/nahled_IMG_2782.JPG              │ nRF24, hackování bezdrátových klávesnic a myší                                 │
│ /s/1465/nahled_IMG_2768.JPG              │ Nordic nRF52840, Logitech, bezpečnost bezdrátových myší a klávesnic, MouseJack │
│ /s/1475/logitacker-prikazy.png           │ LOGITacker – rozhraní a příkazy                                                │
│ /s/1476/logitacker-zarizeni-modra-1.png  │ LOGITacker – zařízení – bez klávesnice a myši                                  │
│ /s/1477/logitacker-zarizeni-modra-2.png  │ LOGITacker – zařízení – rozpoznána klávesnice a myš                            │
│ /s/1478/logitacker-zarizeni-zelena-1.png │ LOGITacker – zařízení – dešifrovaná klávesnice a myš                           │
│ /s/1463/nahled_IMG_2762.JPG              │ Nordic nRF52840, Logitech, bezpečnost bezdrátových myší a klávesnic, MouseJack │
╰──────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────╯
Record count: 8]]></m:pre>

		<p>
			There can be multiple <code>--relation</code> sections, so we can get multiple relations from a single XML stream:
		</p>

		<m:pre jazyk="bash"><![CDATA[wget -O - http://blog.frantovo.cz/c/373 \
	| relpipe-in-xmltable \
		--namespace 'h' 'http://www.w3.org/1999/xhtml' \
		--relation 'headlines' \
			--records '//h:div[@id="obsah"]//h:h1|//h:div[@id="obsah"]//h:h2' \
			--attribute 'level' string 'name()' \
			--attribute 'headline' string '.' \
		--relation 'images' \
			--records '//h:div[@id="obsah"]//h:img' \
			--attribute 'image_file' string '@src' \
			--attribute 'title' string '@title' \
	| relpipe-out-tabular]]></m:pre>

		<p>
			So we can collect various types of objects in a single run.
			Such data can be stored and cataloged for later use.
			Or we can run a shell command for each of them. For example, if a website has some interesting content,
			we can find the XPath pattern matching that content and use it to download the desired files:
		</p>

		<m:pre jazyk="bash"><![CDATA[# our favorite function, used also in other examples;
# it reads values separated by a \0 byte into the given variables;
# this is safer than relying on space- or newline-separated data:
read_nullbyte() { local IFS=; for v in "$@"; do export "$v"; read -r -d '' "$v"; done }

wget -O - http://blog.frantovo.cz/ \
	| relpipe-in-xmltable \
		--namespace 'h' 'http://www.w3.org/1999/xhtml' \
		--relation 'images' \
			--records '//h:div[@id="plakáty"]//h:img' \
			--attribute 'image_file' string '@src' \
	| relpipe-out-nullbyte \
	| while read_nullbyte img; do wget "https://blog.frantovo.cz$img"; done]]></m:pre>
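
		<p>
			A minimal sketch of how <code>read_nullbyte</code> behaves with more than one attribute per record.
			It rests on several assumptions: the <code>sample_records</code> function is only a stand-in for the <code>relpipe-in-xmltable</code> and <code>relpipe-out-nullbyte</code> part of the pipeline and assumes that every attribute value is terminated by a \0 byte;
			the <code>absolute_url</code> helper and the second sample record are hypothetical.
			Instead of downloading anything, the loop only prints what it would fetch:
		</p>

```shell
#!/bin/bash
# Same helper as in the example above: reads \0-separated values into the named variables.
read_nullbyte() { local IFS=; for v in "$@"; do export "$v"; read -r -d '' "$v"; done }

# Hypothetical helper: turn a path taken from the page into an absolute URL.
absolute_url() {
	local base="$1" path="$2"
	case "$path" in
		http://*|https://*) printf '%s\n' "$path" ;;        # already absolute
		/*)                 printf '%s\n' "$base$path" ;;   # relative to the site root
		*)                  printf '%s\n' "$base/$path" ;;  # simplification: treated as root-relative
	esac
}

# Stand-in for the relpipe part of the pipeline (an assumption, not real tool output):
# two attributes per record, each value terminated by a \0 byte.
sample_records() {
	printf '%s\0' \
		'/s/1475/logitacker-prikazy.png' 'LOGITacker – rozhraní a příkazy' \
		'/s/9999/two words.png' $'a title with\na line break'
}

# One variable per attribute; the loop ends when the stream is exhausted.
sample_records | while read_nullbyte img title; do
	printf 'would download: %s (%s)\n' "$(absolute_url 'https://blog.frantovo.cz' "$img")" "$title"
done
```

		<p>
			Values containing spaces or line breaks arrive intact, which is the reason for using the \0 byte as the separator.
		</p>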

		<p>
			XPath is a very powerful language: it allows us to work with the context of the nodes (<a href="https://www.w3.org/TR/xpath-10/#axes">XPath axes</a>)
			or call various functions. So we can easily pick exactly what we want and download just that, or process it in a different way (compute some statistics, catalog it etc.).
		</p>
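
		<p>
			As an illustration of the statistics mentioned above, the following sketch counts headlines per level in plain Bash (version 4 or newer, because of the associative array).
			It is only one possible approach under stated assumptions: <code>sample_headlines</code> is a hypothetical stand-in for a <code>headlines</code> relation piped through <code>relpipe-out-nullbyte</code>, and <code>count_levels</code> is a name made up for this example:
		</p>

```shell
#!/bin/bash
# Helper from the download example: reads \0-separated values into the named variables.
read_nullbyte() { local IFS=; for v in "$@"; do export "$v"; read -r -d '' "$v"; done }

# Stand-in for the real pipeline (an assumption): level and headline, each terminated by \0.
sample_headlines() {
	printf '%s\0' \
		'h1' 'Opravujeme myš: výměna spínačů' \
		'h2' 'Popis problému' \
		'h2' 'Závěr'
}

# Reads records from standard input and prints one "level TAB count" line per level.
count_levels() {
	declare -A count
	local level headline
	while read_nullbyte level headline; do
		count[$level]=$(( ${count[$level]:-0} + 1 ))
	done
	for level in "${!count[@]}"; do
		printf '%s\t%s\n' "$level" "${count[$level]}"
	done | sort
}

sample_headlines | count_levels
```

		<p>
			The counting happens inside a function that reads the whole stream, because a <code>while</code> loop at the end of a pipeline runs in a subshell and its variables would not be visible afterwards.
		</p>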

		<p>
			N.B. many web pages are poorly written and contain invalid markup.
			Fortunately, there is the <code>tidy</code> tool, which can usually clean up such garbage:
		</p>

		<m:pre jazyk="bash"><![CDATA[wget -O - https://www.abclinuxu.cz/blog \
	| tidy -asxhtml -numeric \
	| relpipe-in-xmltable \
		--namespace 'h' 'http://www.w3.org/1999/xhtml' \
		--relation 'headlines' \
			--records '//h:h1|//h:h2' \
			--attribute 'level' string 'name()' \
			--attribute 'title' string 'normalize-space(.)' \
	| relpipe-tr-awk --relation '.*' --where 'NR <= 10' \
	| relpipe-out-tabular]]></m:pre>

		<p>
			so we can fix their mistakes and process even such websites:
		</p>

		<m:pre jazyk="text"><![CDATA[headlines:
╭────────────────┬────────────────────────────────────────────────────╮
│ level (string) │ title (string)                                     │
├────────────────┼────────────────────────────────────────────────────┤
│ h2             │ KOMIX - Časový posun                               │
│ h2             │ Opus Magnum                                        │
│ h2             │ Jednoduchá CRUD aplikace (Go a MySQL)              │
│ h2             │ Vzdálená správa většího počtu strojů               │
│ h2             │ Haluan kadota, vielä paremmin, en koskaan syntynyt │
│ h2             │ LibreOffice a viac ako 1024 stĺpcov                │
│ h2             │ The Catch CTF 2019                                 │
│ h2             │ Zprávička: Nový programovací jazyk Č++             │
│ h2             │ Stroj se zastaví                                   │
│ h2             │ KOMIX - Užívání                                    │
╰────────────────┴────────────────────────────────────────────────────╯
Record count: 10]]></m:pre>

		<p>
			The AWK transformation is used just as an illustration of how we can combine various tools together.
			However, the number of records can also be limited directly by the <code>--records '(//h:h1|//h:h2)[position() &lt;= 10]'</code> XPath expression in the <code>relpipe-in-xmltable</code> command.
		</p>

	</text>

</stránka>