
<stránka
	xmlns="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/strana"
	xmlns:m="https://trac.frantovo.cz/xml-web-generator/wiki/xmlns/makro">
	
	<nadpis>Querying an RDF triplestore using SPARQL</nadpis>
	<perex>use SQL-DK with Jena JDBC driver or a custom script to gather linked data</perex>
	<m:pořadí-příkladu>04300</m:pořadí-příkladu>

	<text xmlns="http://www.w3.org/1999/xhtml">
		
		<p>
			In the Resource Description Framework (<a href="https://www.w3.org/RDF/">RDF</a>) world, there are no relations.
			The data model is quite different.
			It is built on top of triples: subject – predicate – object.
			Although there are no tables (compared to relational databases), RDF is not a schema-less clutter – 
			actually RDF has a schema (ontology, vocabulary), just differently shaped.
			Subjects and predicates are identified by <a href="https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier">IRI</a>s
			(or formerly <a href="https://en.wikipedia.org/wiki/Uniform_Resource_Identifier">URI</a>s)
			that are globally unique (compared to primary keys in relational databases that are almost never globally unique).
			Objects are also identified by IRIs (and yes, one thing can be both a subject and an object) or they can be primitive values like a text string or a number.
		</p>
		
		<m:diagram orientace="vodorovně">
			node [fontname = "Latin Modern Sans, sans-serif"];
			edge [fontname = "Latin Modern Sans, sans-serif"];
			subject	->	object [ label = "predicate"];
		</m:diagram>
		
		<p>
			This <em>triple</em> is also called a <em>statement</em>.
			In the following statement:
		</p>
		
		<blockquote>
			<m:name/> tools are released under the GNU GPL license.
		</blockquote>
		
		<p>we recognize:</p>
		
		<ul>
			<li>Subject: <i><m:name/> tools</i></li>
			<li>Predicate: <i>is released under license</i></li>
			<li>Object: <i>GNU GPL</i></li>
		</ul>
		
		<p>
			This data model is seemingly simple: just a graph, two kinds of nodes and edges connecting them together.
			Or a flat list of statements (triples).
			But it can also be very complicated, depending on how we use it and how rich the ontologies we design are.
			RDF can be studied for years and is a great topic for diploma theses and dissertations,
			but in this example, we will keep it as simple as possible.
		</p>
		
		<p>
			Collections of statements are stored in special databases called triplestores.
			The data inside can be queried using the 
			<a href="https://www.w3.org/TR/sparql11-overview/">SPARQL</a> language through the endpoint provided by the triplestore.
			Popular implementations are 
			<a href="https://jena.apache.org/">Jena</a>,
			<a href="http://vos.openlinksw.com/owiki/wiki/VOS">Virtuoso</a> and
			<a href="https://rdf4j.org/about/">RDF4J</a>
			(all free software).
		</p>
		
		<p>
			The relational model can be easily mapped to RDF.
			We can simply add a prefix to the primary keys to turn them into globally unique IRIs.
			The attributes will become predicates (also prefixed).
			And the values will become objects (either primitive values or IRIs in the case of foreign keys).
			Of course, more complex transformations can be done – this is just the most straightforward way.
		</p>
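		
		<p>
			To illustrate, here is a minimal sketch of such a straightforward mapping, written in the Turtle syntax explained later in this example.
			The <code>person</code> table, its columns and its values are purely hypothetical;
			the point is that the primary key 42 becomes a prefixed IRI, the columns become predicates
			and a foreign key becomes an IRI of another resource:
		</p>
		
		<m:pre jazyk="turtle"><![CDATA[# one row of a hypothetical relational table "person" (primary key 42) expressed as RDF statements:
<http://example.org/person/42> <http://example.org/attribute/name>       "Alice" .
<http://example.org/person/42> <http://example.org/attribute/birth-year> 1970 .
<http://example.org/person/42> <http://example.org/attribute/employer>   <http://example.org/company/7> .]]></m:pre>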
		
		<p>
			Mapping RDF data to the relational model is a bit more difficult.
			Sometimes it is easy, sometimes very cumbersome.
			We can always design some kind of EAV (entity – attribute – value) model in the relational database
			or we can create a relation for each predicate…
			If we do some universal automatic mapping and retain the flexibility of RDF and the richness of the original ontology,
			we usually lose the performance and simplicity of our relational queries.
			A good mapping that feels natural and idiomatic in the relational world and performs well usually requires some hard work.
		</p>
		
		<p>
			But mapping just the results of a SPARQL query obtained from an RDF endpoint is a different story.
			These results can be seen as records and processed using our relational tools,
			stored, transformed or converted to other formats, displayed in GUI windows or safely passed to shell scripts.
			This example shows how we can bridge the RDF and relational worlds.
		</p>
		
		
		<h2>Several ways of connecting to an RDF triplestore</h2>
		
		<p>
			Currently there is no official <code>relpipe-in-rdf</code> or <code>relpipe-in-sparql</code> tool.
			It will probably be part of some future release of <m:name/>.
			But until then, we still have several options for joining the RDF world
			and letting the data from an RDF triplestore flow through our relational pipelines:
		</p>
		
		<ul>
			<li>SQL-DK + Jena JDBC driver + <code>relpipe-in-xml</code></li>
			<li>ODBC-JDBC bridge + Jena JDBC driver + <code>relpipe-in-sql</code></li>
			<li>A native SPARQL ODBC driver + <code>relpipe-in-sql</code></li>
			<li>A shell script + <code>relpipe-in-csv</code> or <code>relpipe-in-xml</code></li>
		</ul>
		
		<p>In this example, we will look at the first and the last option.</p>
		
		<h2>SQL-DK + Jena JDBC driver</h2>
		
		
		<p>
			Apache Jena is not only a triplestore,
			it is a framework consisting of several parts
			and it also provides a special JDBC driver that is ready to use
			(despite this <a href="https://issues.apache.org/jira/browse/JENA-1939">small bug</a>).
			Thanks to this driver, we can use existing Java tools and run SPARQL queries instead of SQL ones.
		</p>
		
		<p>
			One such tool that uses this standard API (JDBC)
			is <a href="https://sql-dk.globalcode.info/">SQL-DK</a>.
			This tool integrates well with <m:name/> because it can output results in the XML format (or alternatively the Recfile format)
			that can be directly consumed by <code>relpipe-in-xml</code> (or alternatively <code>relpipe-in-recfile</code>).
		</p>
		
		<p>First, we download the Jena source code:</p>
		
		<m:pre jazyk="bash"><![CDATA[mkdir -p ~/src; cd ~/src
git clone https://gitbox.apache.org/repos/asf/jena.git]]></m:pre>
		
		<p>
			and apply the <a href="https://git-zaloha.frantovo.cz/gitbox.apache.org/repos/asf/jena.git/commit/?h=JENA-1939_updateCount&amp;id=bdb5439d22b80b2909258449d82fb7b5003fd64c">patch</a>
			for the abovementioned bug (if it has not already been merged upstream).
		</p>
		
		<p>N.B. As always when doing such experiments, we would probably run this under a separate user account or in a virtual machine.</p>
		
		<p>Then we will compile the JDBC driver:</p>
		
		<m:pre jazyk="bash"><![CDATA[cd ~/src/jena/jena-jdbc/
mvn clean install]]></m:pre>

		<p>
			Now we will install SQL-DK (either from sources or from a <code>.deb</code> or <code>.rpm</code> package)
			and run it for the first time (which creates the configuration directory and files):
		</p>
		
		<pre>sql-dk --list-databases</pre>
		
		<p>Then we will register the previously compiled Jena JDBC driver in the <code>~/.sql-dk/environment.sh</code> file:</p>
		
		<m:pre jazyk="bash"><![CDATA[CUSTOM_JDBC=(
	~/src/jena/jena-jdbc/jena-jdbc-driver-bundle/target/jena-jdbc-driver-bundle-*.jar
);]]></m:pre>

		<p>And we should see it among other drivers:</p>
		
		<pre><![CDATA[$ sql-dk --list-jdbc-drivers 
 ╭──────────────────────────────────────────────────┬───────────────────┬─────────────────┬─────────────────┬──────────────────────────╮
 │ class                                  (VARCHAR) │ version (VARCHAR) │ major (INTEGER) │ minor (INTEGER) │ jdbc_compliant (BOOLEAN) │
 ├──────────────────────────────────────────────────┼───────────────────┼─────────────────┼─────────────────┼──────────────────────────┤
 │ org.postgresql.Driver                            │ 9.4               │               9 │               4 │                    false │
 │ com.mysql.jdbc.Driver                            │ 5.1               │               5 │               1 │                    false │
 │ org.sqlite.JDBC                                  │ 3.25              │               3 │              25 │                    false │
 │ org.apache.jena.jdbc.mem.MemDriver               │ 1.0               │               1 │               0 │                    false │
 │ org.apache.jena.jdbc.remote.RemoteEndpointDriver │ 1.0               │               1 │               0 │                    false │
 │ org.apache.jena.jdbc.tdb.TDBDriver               │ 1.0               │               1 │               0 │                    false │
 ╰──────────────────────────────────────────────────┴───────────────────┴─────────────────┴─────────────────┴──────────────────────────╯
Record count: 6]]></pre>

		<p>The driver seems to be present, so we can configure the connection in the <code>~/.sql-dk/config.xml</code> file:</p>
		
		<m:pre jazyk="xml"><![CDATA[<database>
	<name>rdf-dbpedia</name>
	<url>jdbc:jena:remote:query=http://dbpedia.org/sparql</url>
	<userName></userName>
	<password></password>
</database>]]></m:pre>

		<p>
			This will connect us to the DBpedia endpoint (more data sources are mentioned in a chapter below).
			We can test the connection:
		</p>

		<pre><![CDATA[$ sql-dk --test-connection rdf-dbpedia 
 ╭─────────────────────────┬──────────────────────┬─────────────────────┬────────────────────────┬───────────────────────────╮
 │ database_name (VARCHAR) │ configured (BOOLEAN) │ connected (BOOLEAN) │ product_name (VARCHAR) │ product_version (VARCHAR) │
 ├─────────────────────────┼──────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────┤
 │ rdf-dbpedia             │                 true │                true │                        │                           │
 ╰─────────────────────────┴──────────────────────┴─────────────────────┴────────────────────────┴───────────────────────────╯
Record count: 1]]></pre>

		<p>and run our first SPARQL query:</p>

		<pre><![CDATA[$ sql-dk --db rdf-dbpedia --formatter tabular-prefetching --sql "SELECT * WHERE { ?subject ?predicate ?object . } LIMIT 8"
 ╭──────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────────────────────┬─────────────────────────────────────────────────────────╮
 │ subject                                         (org.apache.jena.graph.Node) │ predicate          (org.apache.jena.graph.Node) │ object                     (org.apache.jena.graph.Node) │
 ├──────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────┼─────────────────────────────────────────────────────────┤
 │ http://www.openlinksw.com/virtrdf-data-formats#default-iid                   │ http://www.w3.org/1999/02/22-rdf-syntax-ns#type │ http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat │
 │ http://www.openlinksw.com/virtrdf-data-formats#default-iid-nullable          │ http://www.w3.org/1999/02/22-rdf-syntax-ns#type │ http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat │
 │ http://www.openlinksw.com/virtrdf-data-formats#default-iid-nonblank          │ http://www.w3.org/1999/02/22-rdf-syntax-ns#type │ http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat │
 │ http://www.openlinksw.com/virtrdf-data-formats#default-iid-nonblank-nullable │ http://www.w3.org/1999/02/22-rdf-syntax-ns#type │ http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat │
 │ http://www.openlinksw.com/virtrdf-data-formats#default                       │ http://www.w3.org/1999/02/22-rdf-syntax-ns#type │ http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat │
 │ http://www.openlinksw.com/virtrdf-data-formats#default-nullable              │ http://www.w3.org/1999/02/22-rdf-syntax-ns#type │ http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat │
 │ http://www.openlinksw.com/virtrdf-data-formats#sql-varchar                   │ http://www.w3.org/1999/02/22-rdf-syntax-ns#type │ http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat │
 │ http://www.openlinksw.com/virtrdf-data-formats#sql-varchar-nullable          │ http://www.w3.org/1999/02/22-rdf-syntax-ns#type │ http://www.openlinksw.com/schemas/virtrdf#QuadMapFormat │
 ╰──────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────┴─────────────────────────────────────────────────────────╯
Record count: 8]]></pre>

		<p>
			Not much fun yet, but it proves that the connection is working and we are getting some results from the endpoint.
			We will run some more interesting queries later.
		</p>

		<p>
			When we switch to <code>--formatter xml</code>, we can pipe the stream from SQL-DK
			to <code>relpipe-in-xml</code> and then process it using relational tools.
			We can also use the <code>--sql-in</code> option of SQL-DK, which reads the query from STDIN (instead of from a command-line argument),
			and then wrap it all in a reusable script that reads SPARQL and outputs relational data:
		</p>
		
		<m:pre jazyk="bash">sql-dk --db "rdf-dbpedia" --formatter "xml" --sql-in | relpipe-in-xml</m:pre>
		
		<p>
			For accessing a remote SPARQL endpoint this is a bit of an overkill with a lot of dependencies (so we will use a different approach in the next chapter).
			But the Jena JDBC driver is not only for accessing remote endpoints – we can also use it as an embedded database,
			either an in-memory one or a regular database backed by persistent files.
		</p>
		
		<p>
			The in-memory database loads some initial data and then operates on them.
			So we configure such a connection:
		</p>
		
		<m:pre jazyk="xml"><![CDATA[<database>
	<name>rdf-in-memory</name>
	<url>jdbc:jena:mem:dataset=/tmp/rdf-initial-data.ttl</url>
	<userName></userName>
	<password></password>
</database>]]></m:pre>

		<p>It runs fine, but <a href="https://en.wikipedia.org/wiki/Turtle_(syntax)">turtles</a> are not at home:</p>
		
		<pre><![CDATA[$ echo > /tmp/rdf-initial-data.ttl
$ echo "SELECT * WHERE { ?subject ?predicate ?object . }" | sql-dk --db rdf-in-memory --formatter tabular-prefetching --sql-in 
 ╭──────────────────────────────────────┬────────────────────────────────────────┬─────────────────────────────────────╮
 │ subject (org.apache.jena.graph.Node) │ predicate (org.apache.jena.graph.Node) │ object (org.apache.jena.graph.Node) │
 ├──────────────────────────────────────┼────────────────────────────────────────┼─────────────────────────────────────┤
 ╰──────────────────────────────────────┴────────────────────────────────────────┴─────────────────────────────────────╯
Record count: 0]]></pre>

		<p>
			If we are in desperate need of turtles and have any <a href="https://lv2plug.in/">LV2</a> plugins installed,
			we can find some and put them in our initial data file or reconfigure the database connection:
		</p>
		
		<pre><![CDATA[$ find /usr/lib -name '*.ttl' | head
/usr/lib/lv2/fil4.lv2/manifest.ttl
/usr/lib/lv2/fil4.lv2/fil4.ttl
/usr/lib/ardour5/LV2/a-fluidsynth.lv2/manifest.ttl
/usr/lib/ardour5/LV2/a-fluidsynth.lv2/a-fluidsynth.ttl
/usr/lib/ardour5/LV2/reasonablesynth.lv2/manifest.ttl
/usr/lib/ardour5/LV2/reasonablesynth.lv2/reasonablesynth.ttl
/usr/lib/ardour5/LV2/a-delay.lv2/manifest.ttl
/usr/lib/ardour5/LV2/a-delay.lv2/presets.ttl
/usr/lib/ardour5/LV2/a-delay.lv2/a-delay.ttl
/usr/lib/ardour5/LV2/a-eq.lv2/manifest.ttl

$ cat /usr/lib/lv2/fil4.lv2/manifest.ttl > /tmp/rdf-initial-data.ttl
$ sed s@/tmp/rdf-initial-data.ttl@/usr/lib/lv2/fil4.lv2/manifest.ttl@g -i ~/.sql-dk/config.xml]]></pre>

		<p>and look through Jena/RDF/SPARQL at what is inside:</p>
			
		<pre><![CDATA[$ echo "SELECT * WHERE { ?subject ?predicate ?object . }" | sql-dk --db rdf-in-memory --formatter xml --sql-in | relpipe-in-xml | relpipe-out-tabular 
r1:
 ╭───────────────────────────────────────┬─────────────────────────────────────────────────┬───────────────────────────────────────────╮
 │ subject                      (string) │ predicate                              (string) │ object                           (string) │
 ├───────────────────────────────────────┼─────────────────────────────────────────────────┼───────────────────────────────────────────┤
 │ http://gareus.org/oss/lv2/fil4#ui_gl  │ http://www.w3.org/2000/01/rdf-schema#seeAlso    │ file:///usr/lib/lv2/fil4.lv2/fil4.ttl     │
 │ http://gareus.org/oss/lv2/fil4#ui_gl  │ http://lv2plug.in/ns/extensions/ui#binary       │ file:///usr/lib/lv2/fil4.lv2/fil4UI_gl.so │
 │ http://gareus.org/oss/lv2/fil4#ui_gl  │ http://www.w3.org/1999/02/22-rdf-syntax-ns#type │ http://lv2plug.in/ns/extensions/ui#X11UI  │
 │ http://gareus.org/oss/lv2/fil4#mono   │ http://www.w3.org/2000/01/rdf-schema#seeAlso    │ file:///usr/lib/lv2/fil4.lv2/fil4.ttl     │
 │ http://gareus.org/oss/lv2/fil4#mono   │ http://lv2plug.in/ns/lv2core#binary             │ file:///usr/lib/lv2/fil4.lv2/fil4.so      │
 │ http://gareus.org/oss/lv2/fil4#mono   │ http://www.w3.org/1999/02/22-rdf-syntax-ns#type │ http://lv2plug.in/ns/lv2core#Plugin       │
 │ http://gareus.org/oss/lv2/fil4#stereo │ http://www.w3.org/2000/01/rdf-schema#seeAlso    │ file:///usr/lib/lv2/fil4.lv2/fil4.ttl     │
 │ http://gareus.org/oss/lv2/fil4#stereo │ http://lv2plug.in/ns/lv2core#binary             │ file:///usr/lib/lv2/fil4.lv2/fil4.so      │
 │ http://gareus.org/oss/lv2/fil4#stereo │ http://www.w3.org/1999/02/22-rdf-syntax-ns#type │ http://lv2plug.in/ns/lv2core#Plugin       │
 ╰───────────────────────────────────────┴─────────────────────────────────────────────────┴───────────────────────────────────────────╯
Record count: 9]]></pre>

		<p>
			Now we can be sure that LV2 uses the Turtle format for plugin configurations,
			which is quite ingenious and inspirational – 
			such configuration is well structured and its options (predicates in general) have globally unique identifiers (IRIs).
			The plugins themselves are also identified by IRIs, which is great because it avoids name collisions.
		</p>
		
		<p id="turtle">
			Let us make some turtles of our own.
			We reconfigure the database connection back:
		</p>

		<pre>sed s@/usr/lib/lv2/fil4.lv2/manifest.ttl@/tmp/rdf-initial-data.ttl@g -i ~/.sql-dk/config.xml</pre>
		
		<p>and fill the <code>/tmp/rdf-initial-data.ttl</code> with some new data:</p>
		
		<m:pre jazyk="turtle"><![CDATA[<http://example.org/person/you>
	<http://example.org/predicate/have>
	<http://example.org/thing/nice-day> .]]></m:pre>
	
		<p>
			Turtle is a simple format that contains statements.
			Subjects, predicates and objects are separated by spaces (the tabs and line ends here are just to make it more readable for us).
			And statements end with a <i>full stop</i> like ordinary sentences.
		</p>
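		
		<p>
			Objects do not have to be IRIs – as mentioned at the beginning, they can also be literal values
			like text strings (optionally with a language tag) or numbers.
			A small aside just for illustration (these statements are made up and we will not put them into our data file):
		</p>
		
		<m:pre jazyk="turtle"><![CDATA[<http://example.org/person/you>
	<http://example.org/predicate/have>
	"a nice day"@en .

<http://example.org/person/you>
	<http://example.org/predicate/lucky-number>
	42 .]]></m:pre>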
		
		<p>
			To avoid repeating common parts of IRIs we can declare namespace prefixes:
		</p>
		
		<m:pre jazyk="turtle"><![CDATA[@prefix person:     <http://example.org/person/> .
@prefix predicate:  <http://example.org/predicate/> .
@prefix thing:      <http://example.org/thing/> .

person:you
	predicate:have
		thing:nice-day .]]></m:pre>
		
		<p>
			This format is very concise.
			If we describe the same subject, we use a <i>semicolon</i> to avoid repeating it.
			And if even the predicate is the same (multiple values), we use a <i>comma</i>:
		</p>
		
		<m:pre jazyk="turtle"><![CDATA[@prefix person:     <http://example.org/person/> .
@prefix predicate:  <http://example.org/predicate/> .
@prefix thing:      <http://example.org/thing/> .

person:you
	predicate:have
		thing:nice-day, thing:much-fun;
	predicate:read-about
		thing:relational-pipes .]]></m:pre>

		<p>
			Jena will parse our file and respond to our basic query with these data:
		</p>

		<pre><![CDATA[$ echo "SELECT * WHERE { ?subject ?predicate ?object . }" | sql-dk --db rdf-in-memory --formatter xml --sql-in --relation rdf_results | relpipe-in-xml | relpipe-out-tabular 
rdf_results:
 ╭───────────────────────────────┬─────────────────────────────────────────┬───────────────────────────────────────────╮
 │ subject              (string) │ predicate                      (string) │ object                           (string) │
 ├───────────────────────────────┼─────────────────────────────────────────┼───────────────────────────────────────────┤
 │ http://example.org/person/you │ http://example.org/predicate/read-about │ http://example.org/thing/relational-pipes │
 │ http://example.org/person/you │ http://example.org/predicate/have       │ http://example.org/thing/much-fun         │
 │ http://example.org/person/you │ http://example.org/predicate/have       │ http://example.org/thing/nice-day         │
 ╰───────────────────────────────┴─────────────────────────────────────────┴───────────────────────────────────────────╯
Record count: 3]]></pre>

		<p>Or if we prefer a more vertical format like Recfile:</p>
		
		<pre><![CDATA[$ echo "SELECT * WHERE { ?subject ?predicate ?object . }" | sql-dk --db rdf-in-memory --formatter xml --sql-in --relation rdf_results | relpipe-in-xml | relpipe-out-recfile 
%rec: rdf_results

subject: http://example.org/person/you
predicate: http://example.org/predicate/read-about
object: http://example.org/thing/relational-pipes

subject: http://example.org/person/you
predicate: http://example.org/predicate/have
object: http://example.org/thing/much-fun

subject: http://example.org/person/you
predicate: http://example.org/predicate/have
object: http://example.org/thing/nice-day]]></pre>

		<p>Let us create some more data:</p>
		
		<m:pre jazyk="turtle" src="examples/rdf-heathers.ttl"/>
		
		<p>list them as statements:</p>
		
		<m:pre jazyk="text" src="examples/rdf-heathers.txt"/>
		
		<p>and run some more SPARQL queries…</p>
		
		<p>
			Note:
			<em>
				we use <a href="https://tools.ietf.org/html/rfc4151">the tag: URI scheme</a> for our IRIs.
				It makes URIs (IRIs) globally unique not only in space but also in time (domain owners change over time), which is great.
				In the semantic web and linked data world, this is not common – locators (URLs) are used rather than pure identifiers (URIs, IRIs).
				But here we want to emphasise that we work strictly with our local data
				and make it clear that we do not depend on any on-line resources and nothing will be downloaded from remote servers.
				In a real project, we should use existing ontologies/vocabularies as much as possible instead of inventing new ones,
				but we keep this example rather isolated from the complexity of the outer world and a bit synthetic.
			</em>
		</p>
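		
		<p>
			A tag: IRI consists of an authority (a domain name or an e-mail address), the date on which that authority was held,
			and an arbitrary specific part.
			A hypothetical sketch of such a statement (not necessarily the exact identifiers used in the example files above):
		</p>
		
		<m:pre jazyk="turtle"><![CDATA[<tag:example.org,2020:person/heather-chandler>
	<tag:example.org,2020:predicate/member-of>
	<tag:example.org,2020:group/heathers> .]]></m:pre>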
		
		<p>Find all quotes and names of their authors:</p>
		<m:sparql-example name="examples/rdf-heathers-quotes"/>
		
		<p>List groups and counts of their members:</p>
		<m:sparql-example name="examples/rdf-heathers-members"/>
		
		<p>Filter by a regular expression and list actor names rather than characters:</p>
		<m:sparql-example name="examples/rdf-heathers-much"/>
		
		<p>Now imagine a semantic model of Twin Peaks… How very!</p>
		
		<h2>Improvised relpipe-in-sparql tool</h2>
		
		<p>
			Starting the JVM and always creating a new database from scratch for each query is quite… <i>heavy</i>.
			We can keep Jena running in the background and connect to its SPARQL endpoint – or connect to any other endpoint on the internet.
			So we will hack together a light script and name it <code>relpipe-in-sparql</code> (in some future release there will be an official tool of this name).
		</p>
		
		<p>
			Because SPARQL endpoints accept plain HTTP requests and support CSV besides XML, and because we already have <code>relpipe-in-csv</code>,
			the script can be very simple:
		</p>
		
		<m:pre jazyk="bash"><![CDATA[curl \
	--header "Accept: text/csv" \
	--data-urlencode query="SELECT * WHERE { ?subject ?predicate ?object . } LIMIT 3" \
	https://dbpedia.org/sparql | relpipe-in-csv | relpipe-out-tabular]]></m:pre>
	
		<p>
			It becomes a bit longer if we add some documentation, argument parsing and configuration:
		</p>

		
		<m:pre jazyk="bash" src="examples/relpipe-in-sparql.sh" odkaz="ano"/>
		
		<p>
			Here we even have two implementations that can be switched using the <code>RELPIPE_IN_SPARQL_IMPLEMENTATION</code> environment variable.
			The XML one is more powerful and can be customized (e.g. to specifically handle localized strings or add some new attributes to the relational output).
			On the other hand, the CSV one has fewer dependencies and supports streaming of long result sets (XSLT needs to load the whole document first).
		</p>
		
		<p>Both implementations should work:</p>
		
		<m:pre jazyk="bash"><![CDATA[export RELPIPE_IN_SPARQL_IMPLEMENTATION=xml
export RELPIPE_IN_SPARQL_IMPLEMENTATION=csv
echo 'SELECT * WHERE { ?subject ?predicate "Laura Dern"@en . } LIMIT 3' \
	| relpipe-in-sparql \
		--relation "jurassic" \
		--endpoint "https://dbpedia.org/sparql" \
	| relpipe-out-tabular]]></m:pre>

		<p>and produce the same output:</p>

		<pre><![CDATA[jurassic:
 ╭────────────────────────────────────────┬────────────────────────────────────────────╮
 │ subject                       (string) │ predicate                         (string) │
 ├────────────────────────────────────────┼────────────────────────────────────────────┤
 │ http://dbpedia.org/resource/Laura_Dern │ http://www.w3.org/2000/01/rdf-schema#label │
 │ http://www.wikidata.org/entity/Q220901 │ http://www.w3.org/2000/01/rdf-schema#label │
 │ http://dbpedia.org/resource/Laura_Dern │ http://xmlns.com/foaf/0.1/name             │
 ╰────────────────────────────────────────┴────────────────────────────────────────────╯
Record count: 3]]></pre>

		<p>And maybe somewhere nearby in the graph we will find:</p>

		<blockquote>It's a Unix System… I know this!</blockquote>
		
		<h2>Sources of RDF data</h2>
		
		
		<p>
			The bad news is that we are not querying the real world.
			We are querying an imperfect, incomplete and outdated snapshot of the reality stored in someone's database.
			The good news is that we can improve the content of certain databases just like we improve articles in Wikipedia.
		</p>
		
		<p>
			Some addresses have already <i>leaked</i> into the <code>relpipe-in-sparql --help</code> above.
			Here is a brief description of some publicly available sources of RDF data
			that we can play with.
		</p>
		
		
		<h3>Wikidata</h3>
		
		<p>
			A free and open knowledge base, a sister project of Wikipedia.
			Anyone can use and even edit its content.
		</p>

		<m:sparql-endpoint url="https://query.wikidata.org/sparql" website-url="https://www.wikidata.org/" website-title="Wikidata"/>
		
		
		<h3>DBpedia</h3>
		
		<p>
			They extract structured content from the information created in various Wikimedia projects
			and publish this knowledge graph for everyone.
		</p>
		
		<m:sparql-endpoint url="https://dbpedia.org/sparql" website-url="https://wiki.dbpedia.org/" website-title="DBpedia"/>
		
		<h3>Czech government</h3>
		<p>
			Ministries and other institutions publish some data as open data, and part of it as linked open data (LOD).
		</p>
		
		<m:sparql-endpoint url="https://data.gov.cz/sparql" website-url="https://data.gov.cz/english/" website-title="Open data portal of the Czech Republic"/>
		<m:sparql-endpoint url="https://data.cssz.cz/sparql" website-url="https://data.cssz.cz/" website-title="Open data portal of the Czech Social Security Administration"/>
		<m:sparql-endpoint url="https://cedropendata.mfcr.cz/c3lod/cedr/sparql" website-url="https://cedropendata.mfcr.cz/" website-title="Open Data CEDR III"/>
		
		
		<h2>Running SPARQL queries as scripts</h2>
		
		<p>Besides piping SPARQL queries through <code>relpipe-in-sparql</code> like this:</p>
		<m:pre jazyk="bash"><![CDATA[cat query.sparql | relpipe-in-sparql | relpipe-out-tabular]]></m:pre>
		
		<p>we can make them executable and run them like a (Bash, Perl, PHP etc.) script:</p>
		<m:pre jazyk="bash"><![CDATA[chmod +x query.sparql
./query.sparql | relpipe-out-csv     # output in the CSV format
./query.sparql | relpipe-out-recfile # output in the Recfile format
./query.sparql                       # automatically appends relpipe-out-tabular to the pipeline
]]></m:pre>
		
		<p>(see the <m:a href="implementation">Implementation</m:a> page for a complete list of available transformations and output filters)</p>

		<p>
			We need to add a first-line comment that points to the interpreter.
			The <code>endpoint</code> and <code>relation</code> parameters
			are optional – they say where this query will be executed and what the output relation will be named:
		</p>
		
		<m:pre jazyk="sparql" src="examples/rdf-sample-triples.sparql" odkaz="ano"/>
		
		<p>
			The environment variables <code>RELPIPE_IN_SPARQL_ENDPOINT</code> and <code>RELPIPE_IN_SPARQL_RELATION</code>
			can be set to override the parameters from the file.
			All the magic is done by this (a bit hackish) helper script:
		</p>
		
		<m:pre jazyk="bash" src="examples/rdf-sparql-interpreter.sh" odkaz="ano"/>

		<p>
			This script requires the <code>relpipe-in-sparql</code> we put together earlier.
			Both scripts are just examples (not part of any release yet).
		</p>
		
		
		
		<h2>Samples of SPARQL queries</h2>
		
		<p>
			<i>Hey kid, rock and roll,</i>
			let us list the films where both Coreys starred:
		</p>
		
		<m:sparql-example name="examples/rdf-coreys"/>
		
		<p>
			<i>So Mercedes has scratched our Cadillac, but it was still a great night. </i>
		</p>
		
		<p>Now it is time to visit our friends from the club:</p>
		
		<m:sparql-example name="examples/rdf-breakfast-club"/>
		
		<p>
			Not only <i>pretty in pink</i>, this is true <i>wisdom</i> and we could have much fun traversing this part of the graph.
			But let us turn the globe around… there is also a lot to see in the Eastern Bloc.
		</p>
		
		<m:sparql-example name="examples/rdf-blonde-and-brunette"/>
		
		<p>
			<i>
				Dad, what is this place?
				Where are we?
				Is there anyone here?<br/>
				No. Just us.
			</i>
		</p>
		
		<m:sparql-example name="examples/rdf-return"/>
		
		<h2>P.S.</h2>
		<p>
			<i>
				If you got the impression that RDF is just a poor relational database with a single table consisting of a mere three columns
				and with a freaky SQL dialect, please be assured that this example shows just a small fraction of the wonderful RDF world.
			</i>
		</p>
		
		
	</text>

</stránka>