relpipe-data/examples-in-xmltable-xhtml-outline.xml
author František Kučera <franta-hg@frantovo.cz>
Mon, 21 Feb 2022 00:43:11 +0100
branchv_0
changeset 329 5bc2bb8b7946
parent 290 e73765513aec
permissions -rw-r--r--
Release v0.18
280
eccf2de78284 examples: Getting the outline of an XHTML page
František Kučera <franta-hg@frantovo.cz>
<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Getting the outline of an XHTML page</nadpis>
	<perex>collect a list of headlines (or images, links, etc.) from a website using XMLTable</perex>
	<m:pořadí-příkladu>03400</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			Because an XHTML web page is an XML document, it can be processed using XML tools (XSLT, XPath, XQuery etc.).
			In this example, we will use <code>relpipe-in-xmltable</code> to get a list of headlines (an outline) and other objects from a web page.
		</p>
		
		<m:pre jazyk="bash"><![CDATA[wget -O - http://blog.frantovo.cz/c/373 \
	| relpipe-in-xmltable \
		--namespace 'h' 'http://www.w3.org/1999/xhtml' \
		--relation 'headlines' \
			--records '//h:h1|//h:h2' \
			--attribute 'level'    string 'name()' \
			--attribute 'headline' string '.' \
	| relpipe-out-tabular]]></m:pre>
	
		<p>
			This pipeline looks for <code>h1</code> and <code>h2</code> headlines and presents them as a relation.
			We can fine-tune the XPath expression to get only certain kinds of headlines (this is specific to a particular site):
		</p>
	
		<m:pre jazyk="bash"><![CDATA[wget -O - http://blog.frantovo.cz/c/373 \
	| relpipe-in-xmltable \
		--namespace 'h' 'http://www.w3.org/1999/xhtml' \
		--relation 'headlines' \
			--records '//h:div[@id="obsah"]//h:h1|//h:div[@id="obsah"]//h:h2' \
			--attribute 'level'    string 'name()' \
			--attribute 'headline' string '.' \
	| relpipe-out-tabular]]></m:pre>
	
		<p>We get this listing:</p>
		
		<m:pre jazyk="text"><![CDATA[headlines:
 ╭────────────────┬────────────────────────────────────╮
 │ level (string) │ headline                  (string) │
 ├────────────────┼────────────────────────────────────┤
 │ h1             │ Opravujeme myš: výměna spínačů     │
 │ h2             │ Popis problému                     │
 │ h2             │ Výdrž 50 milionů kliknutí?         │
 │ h2             │ Oprava – výměna spínače            │
 │ h2             │ Myš Razer Orochi                   │
 │ h2             │ Trackball Logitech TrackMan Marble │
 │ h2             │ Myší spínače Omron a Panasonic     │
 │ h2             │ Závěr                              │
 ╰────────────────┴────────────────────────────────────╯
Record count: 8]]></m:pre>

		<p>
			Using slightly modified expressions:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[wget -O - http://blog.frantovo.cz/c/376 \
	| relpipe-in-xmltable \
		--namespace 'h' 'http://www.w3.org/1999/xhtml' \
		--relation 'images' \
			--records '//h:div[@id="obsah"]//h:img' \
			--attribute 'image_file' string '@src' \
			--attribute 'title'      string '@title' \
	| relpipe-out-tabular]]></m:pre>
	
		<p>
			This gives us a list of images from a different article on the same site:
		</p>
	
	
		<m:pre jazyk="text"><![CDATA[images:
 ╭──────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────╮
 │ image_file                      (string) │ title                                                                 (string) │
 ├──────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────┤
 │ /s/1467/nahled_IMG_2778.JPG              │ Nordic nRF52840, Logitech, bezpečnost bezdrátových myší a klávesnic, MouseJack │
 │ /s/1472/nahled_IMG_2782.JPG              │ nRF24, hackování bezdrátových klávesnic a myší                                 │
 │ /s/1465/nahled_IMG_2768.JPG              │ Nordic nRF52840, Logitech, bezpečnost bezdrátových myší a klávesnic, MouseJack │
 │ /s/1475/logitacker-prikazy.png           │ LOGITacker – rozhraní a příkazy                                                │
 │ /s/1476/logitacker-zarizeni-modra-1.png  │ LOGITacker – zařízení – bez klávesnice a myši                                  │
 │ /s/1477/logitacker-zarizeni-modra-2.png  │ LOGITacker – zařízení – rozpoznána klávesnice a myš                            │
 │ /s/1478/logitacker-zarizeni-zelena-1.png │ LOGITacker – zařízení – dešifrovaná klávesnice a myš                           │
 │ /s/1463/nahled_IMG_2762.JPG              │ Nordic nRF52840, Logitech, bezpečnost bezdrátových myší a klávesnic, MouseJack │
 ╰──────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────╯
Record count: 8]]></m:pre>

		<p>
			There can be multiple <code>--relation</code> sections, so we can get multiple relations from a single XML stream:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[wget -O - http://blog.frantovo.cz/c/373 \
	| relpipe-in-xmltable \
		--namespace 'h' 'http://www.w3.org/1999/xhtml' \
		--relation 'headlines' \
			--records '//h:div[@id="obsah"]//h:h1|//h:div[@id="obsah"]//h:h2' \
			--attribute 'level'    string 'name()' \
			--attribute 'headline' string '.' \
		--relation 'images' \
			--records '//h:div[@id="obsah"]//h:img' \
			--attribute 'image_file' string '@src' \
			--attribute 'title'      string '@title' \
	| relpipe-out-tabular]]></m:pre>
	
		<p>
			So we can collect various types of objects in a single run.
			Such data can be stored or cataloged for later use.
			Or we can, for example, run a shell command for each of them: if a website has some interesting content,
			we find the XPath pattern matching that content and use it to download the desired files:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[# our favorite function, also used in other examples;
# reads values separated by a \0 byte into the given variables;
# this is safer than space- or newline-separated data:
read_nullbyte() { local IFS=; for v in "$@"; do export "$v"; read -r -d '' "$v"; done }

wget -O - http://blog.frantovo.cz/ \
	| relpipe-in-xmltable \
		--namespace 'h' 'http://www.w3.org/1999/xhtml' \
		--relation 'images' \
			--records '//h:div[@id="plakáty"]//h:img' \
			--attribute 'image_file' string '@src' \
	| relpipe-out-nullbyte \
	| while read_nullbyte img; do wget "https://blog.frantovo.cz$img"; done]]></m:pre>
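	
		<p>
			To see why the \0 separator matters, here is a minimal sketch that feeds <code>read_nullbyte</code> from <code>printf</code> instead of <code>relpipe-out-nullbyte</code>.
			The sample values and the <code>sample_values</code> and <code>list_images</code> helpers are made up for illustration; values containing spaces, even leading ones, should arrive intact:
		</p>

```shell
#!/usr/bin/env bash
# Same helper as above: reads one \0-terminated value into each named variable.
read_nullbyte() { local IFS=; for v in "$@"; do export "$v"; read -r -d '' "$v"; done }

# Hypothetical stand-in for relpipe-out-nullbyte: two records with two
# attributes each, every value terminated by a \0 byte.
sample_values() {
	printf '%s\0' '/s/1/my photo.jpg' '  indented title' '/s/2/plain.png' 'second title'
}

# Read the values pairwise and print one record per line.
list_images() {
	sample_values | while read_nullbyte file title; do
		printf '[%s] [%s]\n' "$file" "$title"
	done
}

list_images
# [/s/1/my photo.jpg] [  indented title]
# [/s/2/plain.png] [second title]
```

		<p>
			With a plain <code>while read</code> loop and the default <code>IFS</code>, the leading spaces would be trimmed and a newline inside a value would split it into two.
		</p>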
	
		<p>
			XPath is a very powerful language: it allows us to work with the context of the nodes (<a href="https://www.w3.org/TR/xpath-10/#axes">XPath axes</a>)
			or call various functions, so we can easily pick exactly what we want and download just that, or process it in a different way (compute some statistics, catalog it, etc.).
		</p>
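	
		<p>
			As a sketch of what the axes make possible, the following function pairs each <code>h2</code> headline with the first paragraph that follows it, using the <code>following-sibling</code> axis.
			The function name <code>headlines_with_intro</code>, the relation name <code>sections</code> and the assumption that the paragraphs are siblings of the headlines are made up for this example; the structure of a real page may differ:
		</p>

```shell
# Hypothetical helper: list the h2 headlines of the page at URL $1 together
# with the text of the first paragraph that follows each of them.
headlines_with_intro() {
	wget -O - "$1" \
		| relpipe-in-xmltable \
			--namespace 'h' 'http://www.w3.org/1999/xhtml' \
			--relation 'sections' \
				--records '//h:h2' \
				--attribute 'headline' string 'normalize-space(.)' \
				--attribute 'intro'    string 'normalize-space(following-sibling::h:p[1])' \
		| relpipe-out-tabular
}

# Usage (needs network access and the relpipe tools):
# headlines_with_intro http://blog.frantovo.cz/c/373
```

		<p>
			If a headline has no following <code>p</code> sibling, the XPath expression evaluates to an empty string, so the record is still listed.
		</p>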
		
		<p>
			N.B. many web pages are poorly written and contain invalid markup.
			Fortunately, there is the <code>tidy</code> tool, which can usually clean up such garbage:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[wget -O - https://www.abclinuxu.cz/blog \
	| tidy -asxhtml -numeric \
	| relpipe-in-xmltable \
		--namespace 'h' 'http://www.w3.org/1999/xhtml' \
		--relation 'headlines' \
			--records '//h:h1|//h:h2' \
			--attribute 'level' string 'name()' \
			--attribute 'title' string 'normalize-space(.)' \
	| relpipe-tr-awk --relation '.*' --where 'NR <= 10' \
	| relpipe-out-tabular]]></m:pre>
	
		<p>
			This way we can fix their mistakes and process even such websites:
		</p>
		
		<m:pre jazyk="text"><![CDATA[headlines:
 ╭────────────────┬────────────────────────────────────────────────────╮
 │ level (string) │ title                                     (string) │
 ├────────────────┼────────────────────────────────────────────────────┤
 │ h2             │ KOMIX - Časový posun                               │
 │ h2             │ Opus Magnum                                        │
 │ h2             │ Jednoduchá CRUD aplikace (Go a MySQL)              │
 │ h2             │ Vzdálená správa většího počtu strojů               │
 │ h2             │ Haluan kadota, vielä paremmin, en koskaan syntynyt │
 │ h2             │ LibreOffice a viac ako 1024 stĺpcov                │
 │ h2             │ The Catch CTF 2019                                 │
 │ h2             │ Zprávička: Nový programovací jazyk Č++             │
 │ h2             │ Stroj se zastaví                                   │
 │ h2             │ KOMIX - Užívání                                    │
 ╰────────────────┴────────────────────────────────────────────────────╯
Record count: 10]]></m:pre>

		<p>
			The AWK transformation is used here just to illustrate how various tools can be combined.
			However, the number of records can also be limited directly in <code>relpipe-in-xmltable</code> by using the <code>--records '(//h:h1|//h:h2)[position() &lt;= 10]'</code> XPath expression.
		</p>
		
	</text>
</stránka>