<stránka
    xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
    xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">

    <nadpis>Getting the outline of an XHTML page</nadpis>
    <perex>collect a list of headlines (or images, links, etc.) from a web page using XMLTable</perex>
    <m:pořadí-příkladu>03400</m:pořadí-příkladu>

    <text xmlns="http://www.w3.org/1999/xhtml">

        <p>
            Because an XHTML web page is an XML document, it can be processed using XML tools (XSLT, XPath, XQuery, etc.).
            In this example, we will use <code>relpipe-in-xmltable</code> to get a list of headlines (an outline) and other objects from a web page.
        </p>

        <m:pre jazyk="bash"><![CDATA[wget -O - http://blog.frantovo.cz/c/373 \
    | relpipe-in-xmltable \
        --namespace 'h' 'http://www.w3.org/1999/xhtml' \
        --relation 'headlines' \
            --records '//h:h1|//h:h2' \
            --attribute 'level' string 'name()' \
            --attribute 'headline' string '.' \
    | relpipe-out-tabular]]></m:pre>

        <p>
            This pipeline looks for <code>h1</code> and <code>h2</code> headlines and presents them as a relation.
            We can fine-tune the XPath expression to get only certain headlines, here those inside the <code>div</code> with <code>id="obsah"</code> (this is specific to the particular site):
        </p>

        <m:pre jazyk="bash"><![CDATA[wget -O - http://blog.frantovo.cz/c/373 \
    | relpipe-in-xmltable \
        --namespace 'h' 'http://www.w3.org/1999/xhtml' \
        --relation 'headlines' \
            --records '//h:div[@id="obsah"]//h:h1|//h:div[@id="obsah"]//h:h2' \
            --attribute 'level' string 'name()' \
            --attribute 'headline' string '.' \
    | relpipe-out-tabular]]></m:pre>

        <p>And we get this listing:</p>

        <m:pre jazyk="text"><![CDATA[headlines:
╭────────────────┬────────────────────────────────────╮
│ level (string) │ headline (string)                  │
├────────────────┼────────────────────────────────────┤
│ h1             │ Opravujeme myš: výměna spínačů     │
│ h2             │ Popis problému                     │
│ h2             │ Výdrž 50 milionů kliknutí?         │
│ h2             │ Oprava – výměna spínače            │
│ h2             │ Myš Razer Orochi                   │
│ h2             │ Trackball Logitech TrackMan Marble │
│ h2             │ Myší spínače Omron a⎵Panasonic     │
│ h2             │ Závěr                              │
╰────────────────┴────────────────────────────────────╯
Record count: 8]]></m:pre>

        <p>
            Using slightly modified expressions:
        </p>

        <m:pre jazyk="bash"><![CDATA[wget -O - http://blog.frantovo.cz/c/376 \
    | relpipe-in-xmltable \
        --namespace 'h' 'http://www.w3.org/1999/xhtml' \
        --relation 'images' \
            --records '//h:div[@id="obsah"]//h:img' \
            --attribute 'image_file' string '@src' \
            --attribute 'title' string '@title' \
    | relpipe-out-tabular]]></m:pre>

        <p>
            we get a list of images from a different article on the same site:
        </p>

        <m:pre jazyk="text"><![CDATA[images:
╭──────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────╮
│ image_file (string)                      │ title (string)                                                                 │
├──────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────┤
│ /s/1467/nahled_IMG_2778.JPG              │ Nordic nRF52840, Logitech, bezpečnost bezdrátových myší a klávesnic, MouseJack │
│ /s/1472/nahled_IMG_2782.JPG              │ nRF24, hackování bezdrátových klávesnic a myší                                 │
│ /s/1465/nahled_IMG_2768.JPG              │ Nordic nRF52840, Logitech, bezpečnost bezdrátových myší a klávesnic, MouseJack │
│ /s/1475/logitacker-prikazy.png           │ LOGITacker – rozhraní a příkazy                                                │
│ /s/1476/logitacker-zarizeni-modra-1.png  │ LOGITacker – zařízení – bez klávesnice a myši                                  │
│ /s/1477/logitacker-zarizeni-modra-2.png  │ LOGITacker – zařízení – rozpoznána klávesnice a myš                            │
│ /s/1478/logitacker-zarizeni-zelena-1.png │ LOGITacker – zařízení – dešifrovaná klávesnice a myš                           │
│ /s/1463/nahled_IMG_2762.JPG              │ Nordic nRF52840, Logitech, bezpečnost bezdrátových myší a klávesnic, MouseJack │
╰──────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────╯
Record count: 8]]></m:pre>

        <p>
            There may be multiple <code>--relation</code> sections, so we can get multiple relations from a single XML stream:
        </p>

        <m:pre jazyk="bash"><![CDATA[wget -O - http://blog.frantovo.cz/c/373 \
    | relpipe-in-xmltable \
        --namespace 'h' 'http://www.w3.org/1999/xhtml' \
        --relation 'headlines' \
            --records '//h:div[@id="obsah"]//h:h1|//h:div[@id="obsah"]//h:h2' \
            --attribute 'level' string 'name()' \
            --attribute 'headline' string '.' \
        --relation 'images' \
            --records '//h:div[@id="obsah"]//h:img' \
            --attribute 'image_file' string '@src' \
            --attribute 'title' string '@title' \
    | relpipe-out-tabular]]></m:pre>

        <p>
            So we can collect various types of objects in a single run.
            Such data can be stored and cataloged for later use.
            We can also run a shell command for each record: for example, if a website has some interesting content,
            we find the XPath pattern that matches it and use it to download the desired files:
        </p>

        <m:pre jazyk="bash"><![CDATA[# our favorite function, used in other examples as well;
# reads values separated by a \0 byte into variables;
# this is safer than space- or newline-separated data:
read_nullbyte() { for v in "$@"; do export "$v"; read -r -d '' "$v"; done }

wget -O - http://blog.frantovo.cz/ \
    | relpipe-in-xmltable \
        --namespace 'h' 'http://www.w3.org/1999/xhtml' \
        --relation 'images' \
            --records '//h:div[@id="plakáty"]//h:img' \
            --attribute 'image_file' string '@src' \
    | relpipe-out-nullbyte \
    | while read_nullbyte img; do wget "https://blog.frantovo.cz$img"; done]]></m:pre>

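        <p>
            To see why the <code>\0</code> separator is safe, here is a minimal sketch that feeds <code>read_nullbyte</code> from <code>printf</code> instead of <code>relpipe-out-nullbyte</code>.
            It assumes that each attribute value arrives terminated by a <code>\0</code> byte; the function <code>sample_records</code>, the attribute names <code>file</code> and <code>title</code> and the sample values are made up for illustration:
        </p>

        <m:pre jazyk="bash"><![CDATA[# the same helper as above: reads \0-terminated values into the named variables
read_nullbyte() { for v in "$@"; do export "$v"; read -r -d '' "$v"; done }

# hypothetical stand-in for relpipe-out-nullbyte: two records, two attributes each;
# the values contain a space and a newline on purpose
sample_records() {
    printf '%s\0' \
        '/s/1/first image.png' 'plain title' \
        '/s/2/second.png' $'title with\na newline'
}

nl=$'\n'

# one loop iteration per record; the newline is replaced by a space just for printing
sample_records | while read_nullbyte file title; do
    printf 'file=[%s] title=[%s]\n' "$file" "${title//$nl/ }"
done]]></m:pre>

        <p>Both records arrive intact, including the value with a space and the value with a newline:</p>

        <m:pre jazyk="text"><![CDATA[file=[/s/1/first image.png] title=[plain title]
file=[/s/2/second.png] title=[title with a newline]]]></m:pre>
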
        <p>
            XPath is a very powerful language: it allows us to work with the context of the nodes (<a href="https://www.w3.org/TR/xpath-10/#axes">XPath axes</a>)
            and to call various functions. So we can easily pick exactly what we want and download just that, or process it in a different way (compute some statistics, catalog it, etc.).
        </p>

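        <p>
            As a sketch of what the axes look like in practice, the following pipeline reads a small local document (a made-up stand-in for a downloaded page) and, for each <code>h2</code>, also picks the nearest preceding <code>h1</code> and the first paragraph that follows.
            It uses only the options shown in the previous examples and assumes that the attribute expressions are evaluated relative to each record node, as the <code>.</code> and <code>@src</code> expressions suggest; the function <code>sample_page</code> and the relation and attribute names are hypothetical:
        </p>

        <m:pre jazyk="bash"><![CDATA[# hypothetical sample page used instead of wget, so the example works offline
sample_page() {
cat <<'XHTML'
<html xmlns="http://www.w3.org/1999/xhtml">
    <body>
        <div id="obsah">
            <h1>Mouse repair</h1>
            <h2>Problem</h2>
            <p>The left button double-clicks.</p>
            <h2>Fix</h2>
            <p>Replace the switch.</p>
        </div>
    </body>
</html>
XHTML
}

# preceding:: and following-sibling:: are XPath 1.0 axes;
# [1] picks the nearest node along the given axis
sample_page \
    | relpipe-in-xmltable \
        --namespace 'h' 'http://www.w3.org/1999/xhtml' \
        --relation 'sections' \
            --records '//h:div[@id="obsah"]//h:h2' \
            --attribute 'headline' string 'normalize-space(.)' \
            --attribute 'chapter' string 'normalize-space(preceding::h:h1[1])' \
            --attribute 'first_paragraph' string 'normalize-space(following-sibling::h:p[1])' \
    | relpipe-out-tabular]]></m:pre>

        <p>
            With this sample, both records should report <i>Mouse repair</i> as their chapter, and each should carry the text of its own paragraph.
        </p>
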
        <p>
            Note: many web pages are poorly written and contain invalid markup.
            Fortunately, there is the <code>tidy</code> tool, which can usually clean up such garbage:
        </p>

        <m:pre jazyk="bash"><![CDATA[wget -O - https://www.abclinuxu.cz/blog \
    | tidy -asxhtml -numeric \
    | relpipe-in-xmltable \
        --namespace 'h' 'http://www.w3.org/1999/xhtml' \
        --relation 'headlines' \
            --records '//h:h1|//h:h2' \
            --attribute 'level' string 'name()' \
            --attribute 'title' string 'normalize-space(.)' \
    | relpipe-tr-awk --relation '.*' --where 'NR <= 10' \
    | relpipe-out-tabular]]></m:pre>

        <p>
            so we can fix their mistakes and process even such websites:
        </p>

        <m:pre jazyk="text"><![CDATA[headlines:
╭────────────────┬────────────────────────────────────────────────────╮
│ level (string) │ title (string)                                     │
├────────────────┼────────────────────────────────────────────────────┤
│ h2             │ KOMIX - Časový posun                               │
│ h2             │ Opus Magnum                                        │
│ h2             │ Jednoduchá CRUD aplikace (Go a MySQL)              │
│ h2             │ Vzdálená správa většího počtu strojů               │
│ h2             │ Haluan kadota, vielä paremmin, en koskaan syntynyt │
│ h2             │ LibreOffice a viac ako 1024 stĺpcov                │
│ h2             │ The Catch CTF 2019                                 │
│ h2             │ Zprávička: Nový programovací jazyk Č++             │
│ h2             │ Stroj se zastaví                                   │
│ h2             │ KOMIX - Užívání                                    │
╰────────────────┴────────────────────────────────────────────────────╯
Record count: 10]]></m:pre>

        <p>
            The AWK transformation is used here just to illustrate how we can combine various tools.
            The number of records can also be limited directly in <code>relpipe-in-xmltable</code>, using the XPath expression <code>--records '(//h:h1|//h:h2)[position() &lt;= 10]'</code>.
        </p>

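        <p>
            The parentheses in that expression matter: in XPath 1.0, a predicate binds to the last step of a path, so <code>//h:h2[position() &lt;= 10]</code> would select the first ten <code>h2</code> children of each parent element, while the parenthesized form numbers the whole result in document order.
            A minimal sketch with a made-up local document and a limit of 2 instead of 10 (the functions <code>sample_page</code> and <code>first_headlines</code> are hypothetical helpers, not part of the tools):
        </p>

        <m:pre jazyk="bash"><![CDATA[# hypothetical sample page: two sections with three headlines each
sample_page() {
cat <<'XHTML'
<html xmlns="http://www.w3.org/1999/xhtml">
    <body>
        <div><h2>A1</h2><h2>A2</h2><h2>A3</h2></div>
        <div><h2>B1</h2><h2>B2</h2><h2>B3</h2></div>
    </body>
</html>
XHTML
}

# first_headlines EXPRESSION: lists the headlines selected by the given XPath expression
first_headlines() {
    sample_page \
        | relpipe-in-xmltable \
            --namespace 'h' 'http://www.w3.org/1999/xhtml' \
            --relation 'headlines' \
                --records "$1" \
                --attribute 'headline' string '.' \
        | relpipe-out-tabular
}

first_headlines '(//h:h2)[position() <= 2]'   # numbered across the whole document
first_headlines '//h:h2[position() <= 2]'     # numbered within each parent element]]></m:pre>

        <p>
            By the XPath rules, the first call should list only A1 and A2, while the second one should list A1, A2, B1 and B2.
        </p>
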
    </text>

</stránka>