author František Kučera <>
Sun, 26 Jul 2020 18:13:48 +0200
changeset 304 95add699346d
parent 290 e73765513aec

	<nadpis>Getting the outline of an XHTML page</nadpis>
	<perex>collect a list of headlines (or images, links etc.) from a website using XMLTable</perex>

	<text xmlns="">
			Because an XHTML web page is an XML document, it can be processed using XML tools (XSLT, XPath, XQuery etc.).
			In this example, we will use <code>relpipe-in-xmltable</code> to get a list of headlines (an outline) and other objects from a web page.
		<m:pre jazyk="bash"><![CDATA[wget -O - \
	| relpipe-in-xmltable \
		--namespace 'h' '' \
		--relation 'headlines' \
			--records '//h:h1|//h:h2' \
			--attribute 'level'    string 'name()' \
			--attribute 'headline' string '.' \
	| relpipe-out-tabular]]></m:pre>
			This pipeline looks for <code>h1</code> and <code>h2</code> headlines and presents them as a relation.
			We can fine-tune the XPath expression to get only certain kinds of headlines (the expression is specific to a particular site):
		<m:pre jazyk="bash"><![CDATA[wget -O - \
	| relpipe-in-xmltable \
		--namespace 'h' '' \
		--relation 'headlines' \
			--records '//h:div[@id="obsah"]//h:h1|//h:div[@id="obsah"]//h:h2' \
			--attribute 'level'    string 'name()' \
			--attribute 'headline' string '.' \
	| relpipe-out-tabular]]></m:pre>
		<p>And we get this listing:</p>
		<m:pre jazyk="text"><![CDATA[headlines:
 │ level (string) │ headline                  (string) │
 │ h1             │ Opravujeme myš: výměna spínačů     │
 │ h2             │ Popis problému                     │
 │ h2             │ Výdrž 50 milionů kliknutí?         │
 │ h2             │ Oprava – výměna spínače            │
 │ h2             │ Myš Razer Orochi                   │
 │ h2             │ Trackball Logitech TrackMan Marble │
 │ h2             │ Myší spínače Omron a Panasonic     │
 │ h2             │ Závěr                              │
Record count: 8]]></m:pre>

			Using slightly modified expressions:
		<m:pre jazyk="bash"><![CDATA[wget -O - \
	| relpipe-in-xmltable \
		--namespace 'h' '' \
		--relation 'images' \
			--records '//h:div[@id="obsah"]//h:img' \
			--attribute 'image_file' string '@src' \
			--attribute 'title'      string '@title' \
	| relpipe-out-tabular]]></m:pre>
			we will get a list of images in a different article on the same site:
		<m:pre jazyk="text"><![CDATA[images:
 │ image_file                      (string) │ title                                                                 (string) │
 │ /s/1467/nahled_IMG_2778.JPG              │ Nordic nRF52840, Logitech, bezpečnost bezdrátových myší a klávesnic, MouseJack │
 │ /s/1472/nahled_IMG_2782.JPG              │ nRF24, hackování bezdrátových klávesnic a myší                                 │
 │ /s/1465/nahled_IMG_2768.JPG              │ Nordic nRF52840, Logitech, bezpečnost bezdrátových myší a klávesnic, MouseJack │
 │ /s/1475/logitacker-prikazy.png           │ LOGITacker – rozhraní a příkazy                                                │
 │ /s/1476/logitacker-zarizeni-modra-1.png  │ LOGITacker – zařízení – bez klávesnice a myši                                  │
 │ /s/1477/logitacker-zarizeni-modra-2.png  │ LOGITacker – zařízení – rozpoznána klávesnice a myš                            │
 │ /s/1478/logitacker-zarizeni-zelena-1.png │ LOGITacker – zařízení – dešifrovaná klávesnice a myš                           │
 │ /s/1463/nahled_IMG_2762.JPG              │ Nordic nRF52840, Logitech, bezpečnost bezdrátových myší a klávesnic, MouseJack │
Record count: 8]]></m:pre>

			There can be multiple <code>--relation</code> sections, so we can get multiple relations from a single XML stream:
		<m:pre jazyk="bash"><![CDATA[wget -O - \
	| relpipe-in-xmltable \
		--namespace 'h' '' \
		--relation 'headlines' \
			--records '//h:div[@id="obsah"]//h:h1|//h:div[@id="obsah"]//h:h2' \
			--attribute 'level'    string 'name()' \
			--attribute 'headline' string '.' \
		--relation 'images' \
			--records '//h:div[@id="obsah"]//h:img' \
			--attribute 'image_file' string '@src' \
			--attribute 'title'      string '@title' \
	| relpipe-out-tabular]]></m:pre>
			This way, we can collect various types of objects in a single run.
			Such data can be stored and catalogued for later use.
			We can also run a shell command for each record: for example, if a website has some interesting content,
			we can find the XPath pattern matching that content and use it to download the desired files:
		<m:pre jazyk="bash"><![CDATA[# our favorite function used also in other examples;
# reads values separated by a \0 byte into the given variables;
# this is safer than space- or newline-separated data:
read_nullbyte() { local IFS=; for v in "$@"; do export "$v"; read -r -d '' "$v"; done }

wget -O - \
	| relpipe-in-xmltable \
		--namespace 'h' '' \
		--relation 'images' \
			--records '//h:div[@id="plakáty"]//h:img' \
			--attribute 'image_file' string '@src' \
	| relpipe-out-nullbyte \
	| while read_nullbyte img; do wget "$img"; done]]></m:pre>
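			How <code>read_nullbyte</code> assigns values can be tried without any web page. The following is a minimal sketch, assuming Bash (because of <code>read -d ''</code>); the file names and titles are made up for illustration, and <code>printf</code> only stands in for <code>relpipe-out-nullbyte</code>, assuming that each attribute value is followed by a zero byte:
		<m:pre jazyk="bash"><![CDATA[# same helper as above: one \0-terminated value per variable name
read_nullbyte() { local IFS=; for v in "$@"; do export "$v"; read -r -d '' "$v"; done }

# hypothetical sample data: two records with two attributes each
printf '%s\0' 'first image.png' 'Front  view' 'second.png' 'Back view' \
	| while read_nullbyte image_file title; do
		printf 'file=[%s] title=[%s]\n' "$image_file" "$title"
	done]]></m:pre>
			Each pass through the loop consumes one value per variable name, so spaces inside the values survive unchanged, and the loop ends when the input runs out.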
			XPath is a very powerful language: it allows us to work with the context of the nodes (<a href="">XPath axes</a>)
			and to call various functions, so we can easily pick exactly what we want and download just that, or process it in a different way (compute some statistics, catalogue it etc.).
			Note that many web pages are poorly written and contain invalid markup.
			Fortunately, there is the <code>tidy</code> tool, which can usually clean up such garbage:
		<m:pre jazyk="bash"><![CDATA[wget -O - \
	| tidy -asxhtml -numeric \
	| relpipe-in-xmltable \
		--namespace 'h' '' \
		--relation 'headlines' \
			--records '//h:h1|//h:h2' \
			--attribute 'level' string 'name()' \
			--attribute 'title' string 'normalize-space(.)' \
	| relpipe-tr-awk --relation '.*' --where 'NR <= 10' \
	| relpipe-out-tabular]]></m:pre>
			so we can fix their mistakes and process even such websites:
		<m:pre jazyk="text"><![CDATA[headlines:
 │ level (string) │ title                                     (string) │
 │ h2             │ KOMIX - Časový posun                               │
 │ h2             │ Opus Magnum                                        │
 │ h2             │ Jednoduchá CRUD aplikace (Go a MySQL)              │
 │ h2             │ Vzdálená správa většího počtu strojů               │
 │ h2             │ Haluan kadota, vielä paremmin, en koskaan syntynyt │
 │ h2             │ LibreOffice a viac ako 1024 stĺpcov                │
 │ h2             │ The Catch CTF 2019                                 │
 │ h2             │ Zprávička: Nový programovací jazyk Č++             │
 │ h2             │ Stroj se zastaví                                   │
 │ h2             │ KOMIX - Užívání                                    │
Record count: 10]]></m:pre>

			The AWK transformation is used here just to illustrate how various tools can be combined.
			The number of records can also be limited directly by the XPath expression <code>--records '(//h:h1|//h:h2)[position() &lt;= 10]'</code> passed to <code>relpipe-in-xmltable</code>.
