diff -r 71a627e72815 -r aeda3cb4528d relpipe-data/examples-rdf-sparql.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/relpipe-data/examples-rdf-sparql.xml Mon Jul 27 17:51:53 2020 +0200 @@ -0,0 +1,602 @@ + + + Querying an RDF triplestore using SPARQL + use SQL-DK with Jena JDBC driver or a custom script to gather linked data + 04300 + + + +

+ In the Resource Description Framework (RDF) world, there are no relations.
+ The data model is quite different.
+ It is built on top of triples: subject – predicate – object.
+ Although there are no tables (in contrast to relational databases), RDF is not a schema-less clutter –
+ RDF actually has a schema (an ontology, a vocabulary), just differently shaped.
+ Subjects and predicates are identified by IRIs
+ (formerly URIs)
+ that are globally unique (unlike primary keys in relational databases, which are almost never globally unique).
+ Objects are also identified by IRIs (and yes, a resource can be both a subject and an object), or they can be primitive values like a text string or a number.

+ + + node [fontname = "Latin Modern Sans, sans-serif"]; + edge [fontname = "Latin Modern Sans, sans-serif"]; + subject -> object [ label = "predicate"]; + + +

+ This triple is also called a statement. + In the following statement: +

+ +
+ tools are released under the GNU GPL license. +
+ +

we recognize:

+ + + +

+ This data model is seemingly simple: just a graph with two kinds of nodes and edges connecting them together.
+ Or a flat list of statements (triples).
+ But it can also be very complicated, depending on how we use it and how rich the ontologies we design are.
+ RDF can be studied for years and is a great topic for diploma theses and dissertations,
+ but in this example, we will keep it as simple as possible.

+ +

+ Collections of statements are stored in special databases called triplestores. + The data inside can be queried using the + SPARQL language through the endpoint provided by the triplestore. + Popular implementations are + Jena, + Virtuoso and + RDF4J + (all free software). +

+ +

+ The relational model can be easily mapped to RDF.
+ We can simply add a prefix to the primary keys to make them globally unique IRIs.
+ The attributes will become predicates (also prefixed).
+ And the values will become objects (either primitive values, or IRIs in the case of foreign keys).
+ Of course, more complex transformations can be done – this is just the most straightforward way.
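As a toy illustration of this straightforward mapping, assume a hypothetical person table with a primary key and two attributes (all names below are invented for this example). Prefixed keys and attributes can be emitted as triples with a small shell sketch:

```shell
# Sketch: mapping one row of a hypothetical "person" table to triples.
# The tag: prefix and all names are invented for this example.
prefix='tag:example.com,2020:'

# Print one N-Triples-style statement: <subject> <predicate> object .
emit_triple() { printf '<%s> <%s> %s .\n' "$1" "$2" "$3"; }

# The primary key becomes the subject IRI:
subject="${prefix}person/1"

# Attributes become predicates; values become literals or IRIs (foreign keys):
emit_triple "$subject" "${prefix}attribute/name" '"Alice"'
emit_triple "$subject" "${prefix}attribute/member-of" "<${prefix}group/admins>"
```

A foreign key simply turns into an IRI object, which is exactly how relationships are expressed in RDF.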

+ +

+ Mapping RDF data to the relational model is a bit more difficult.
+ Sometimes easy, sometimes very cumbersome.
+ We can always design some kind of EAV (entity – attribute – value) model in the relational database,
+ or we can create a relation for each predicate…
+ If we do some universal automatic mapping and retain the flexibility of RDF and the richness of the original ontology,
+ we usually lose the performance and simplicity of our relational queries.
+ A good mapping that feels natural and idiomatic in the relational world and performs well usually requires some hard work.
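The EAV idea mentioned above can be sketched in the shell: every triple simply becomes one (entity, attribute, value) row of a single generic relation. Plain text stands in for a real database here, and the sample triples are invented:

```shell
# Sketch: EAV-style mapping – every RDF statement becomes one generic row.
# The sample triples are invented for this example.
triples='alice name "Alice"
alice member-of admins
bob name "Bob"'

# One row per statement: entity = subject, attribute = predicate, value = object.
printf '%s\n' "$triples" | awk '{ printf "entity=%s attribute=%s value=%s\n", $1, $2, $3 }'
```

This keeps all of RDF's flexibility, but queries against such a table quickly degenerate into many self-joins, which is the performance cost mentioned above.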

+ +

+ But mapping mere results of a SPARQL query obtained from an RDF endpoint is a different story.
+ These results can be seen as records and processed using our relational tools:
+ stored, transformed or converted to other formats, displayed in GUI windows or safely passed to shell scripts.
+ This example shows how we can bridge the RDF and relational worlds.

+ + +

Several ways of connecting to an RDF triplestore

+ +

+ Currently there is no official relpipe-in-rdf or relpipe-in-sparql tool.
+ It will probably be part of some future release of .
+ But until then, despite this lack, we still have several options for joining the RDF world
+ and letting the data from an RDF triplestore flow through our relational pipelines:

+ + + +

In this example, we will look at the first and the last option.

+ +

SQL-DK + Jena JDBC driver

+ + +

+ Apache Jena is not only a triplestore,
+ it is a framework consisting of several parts
+ and it also provides a special JDBC driver that is ready to use
+ (despite this small bug).
+ Thanks to this driver, we can use existing Java tools and run SPARQL queries instead of SQL ones.

+ +

+ One such tool that uses this standard API (JDBC)
+ is SQL-DK.
+ This tool integrates well with because it can output results in the XML format (or alternatively the Recfile format)
+ that can be directly consumed by relpipe-in-xml (or alternatively relpipe-in-recfile).

+ +

First, we download the Jena source code:

+ + + +

+ and apply the patch
+ for the abovementioned bug (if not already merged upstream).

+ +

n.b. As always when doing such experiments, we would probably run this under a separate user account or in a virtual machine.

+ +

Then we will compile the JDBC driver:

+ + + +

+ Now we will install SQL-DK (either from sources or from a .deb or .rpm package)
+ and run it for the first time (which creates the configuration directory and files):

+ +
sql-dk --list-databases
+ +

Then we will register the previously compiled Jena JDBC driver in the ~/.sql-dk/environment.sh file:

+ + + +

And we should see it among other drivers:

+ +
+ +

The driver seems to be present, so we can configure the connection in the ~/.sql-dk/config.xml file:

+ + + rdf-dbpedia + jdbc:jena:remote:query=http://dbpedia.org/sparql + + +]]> + +

+ This will connect us to the DBpedia endpoint (more data sources are mentioned in the chapter below).
+ We can test the connection:

+ +
+ +

and run our first SPARQL query:

+ +
+ +

+ Not much fun yet, but it proves that the connection is working and we are getting some results from the endpoint.
+ We will run some more interesting queries later.

+ +

+ When we switch to the --formatter xml, we can pipe the stream from SQL-DK
+ to relpipe-in-xml and then process it using relational tools.
+ We can also use the --sql-in option of SQL-DK, which reads the query from STDIN (instead of from a command line argument),
+ and then wrap it all as a reusable script that reads SPARQL and outputs relational data:

+ + sql-dk --db "rdf-dbpedia" --formatter "xml" --sql-in | relpipe-in-xml + +

+ For accessing a remote SPARQL endpoint, this is a bit of an overkill with a lot of dependencies (so we will use a different approach in the next chapter).
+ But the Jena JDBC driver is not only for accessing remote endpoints – we can use it as an embedded database,
+ either an in-memory one or a regular DB backed by persistent files.

+ +

+ The in-memory database loads some initial data and then operates on them.
+ So we configure such a connection:

+ + + rdf-in-memory + jdbc:jena:mem:dataset=/tmp/rdf-initial-data.ttl + + +]]> + +

It runs fine, but turtles are not at home:

+ +
 /tmp/rdf-initial-data.ttl
+$ echo "SELECT * WHERE { ?subject ?predicate ?object . }" | sql-dk --db rdf-in-memory --formatter tabular-prefetching --sql-in 
+ ╭──────────────────────────────────────┬────────────────────────────────────────┬─────────────────────────────────────╮
+ │ subject (org.apache.jena.graph.Node) │ predicate (org.apache.jena.graph.Node) │ object (org.apache.jena.graph.Node) │
+ ├──────────────────────────────────────┼────────────────────────────────────────┼─────────────────────────────────────┤
+ ╰──────────────────────────────────────┴────────────────────────────────────────┴─────────────────────────────────────╯
+Record count: 0]]>
+ +

+ If we are in desperate need of turtles and have any LV2 plugins installed,
+ we can find some and put them in our initial data file, or reconfigure the database connection:

+ +
 /tmp/rdf-initial-data.ttl
+$ sed s@/tmp/rdf-initial-data.ttl@/usr/lib/lv2/fil4.lv2/manifest.ttl@g -i ~/.sql-dk/config.xml]]>
+ +

and use Jena/RDF/SPARQL to look at what is inside:

+ +
+ +

+ Now we can be sure that LV2 uses the Turtle format for plugin configurations,
+ which is quite ingenious and inspirational –
+ such configuration is well structured and its options (predicates in general) have globally unique identifiers (IRIs).
+ Plugins are also identified by IRIs, which is great because it avoids name collisions.

+ +

+ Let us make some turtles of our own.
+ Reconfigure the database connection back:

+ +
sed s@/usr/lib/lv2/fil4.lv2/manifest.ttl@/tmp/rdf-initial-data.ttl@g -i ~/.sql-dk/config.xml
+ +

and fill the /tmp/rdf-initial-data.ttl with some new data:

+ + + + .]]> + +

+ Turtle is a simple format that contains statements.
+ Subjects, predicates and objects are separated by spaces (the tabs and line-ends here are just to make it more readable for us).
+ And statements end with a full stop, like ordinary sentences.

+ +

+ To avoid repeating common parts of IRIs we can declare namespace prefixes: +

+ + . +@prefix predicate: . +@prefix thing: . + +person:you + predicate:have + thing:nice-day .]]> + +

+ This format is very concise.
+ If we describe the same subject again, we use a semicolon to avoid repeating it.
+ And if even the predicate is the same (multiple values), we use a comma:

+ + . +@prefix predicate: . +@prefix thing: . + +person:you + predicate:have + thing:nice-day, thing:much-fun; + predicate:read-about + thing:relational-pipes .]]> + +

+ Jena will parse our file and respond to our basic query with these data: +

+ +
+ +

Or if we prefer more vertical formats like Recfile:

+ +
+ +

Let us create some more data:

+ + + +

list them as statements:

+ + + +

and run some more SPARQL queries…

+ +

+ Note:
+ 
+ we use the tag: URI scheme for our IRIs.
+ It makes URIs (IRIs) globally unique not only in space but also in time (domain owners change over time),
+ which is great.
+ In the semantic web and linked data world, this is not common: locators (URLs) are used rather than pure identifiers (URIs, IRIs).
+ But here we want to emphasise that we work strictly with our local data
+ and make it clear that we do not depend on any on-line resources and nothing will be downloaded from remote servers.
+ And in a real project, we should use existing ontologies / vocabularies as much as possible instead of inventing new ones.
+ But we keep this example rather isolated from the complexity of the outer world and a bit synthetic.
+ 

+ +

Find all quotes and names of their authors:

+ + +

List groups and counts of their members:

+ + +

Filter by a regular expression and list actor names rather than characters:

+ + +

Now imagine a semantic model of Twin Peaks… How very!

+ +

Improvised relpipe-in-sparql tool

+ +

+ Starting the JVM and always creating a new database from scratch on each query is quite… heavy.
+ We can keep Jena running in the background and connect to its SPARQL endpoint – or connect to any other endpoint on the internet.
+ So we will hack together a light script and name it relpipe-in-sparql (some future release will contain an official tool of this name).

+ +

+ Because SPARQL endpoints accept plain HTTP requests and support CSV besides XML, and because we already have relpipe-in-csv,
+ the script can be very simple:

+ + + +
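Under the hood, such a request follows the SPARQL Protocol: the query travels URL-encoded in the query parameter of the endpoint URL. A rough sketch of building such a GET URL (assuming Bash; the Wikidata endpoint is just an example):

```shell
# Sketch: percent-encode a SPARQL query and build a SPARQL Protocol GET URL.
urlencode() {
	local s="$1" out="" c i
	for ((i = 0; i < ${#s}; i++)); do
		c="${s:i:1}"
		case "$c" in
			[a-zA-Z0-9.~_-]) out+="$c" ;;                 # unreserved characters pass through
			*) printf -v c '%%%02X' "'$c"; out+="$c" ;;   # everything else becomes %XX
		esac
	done
	printf '%s' "$out"
}

# The query goes URL-encoded into the "query" parameter of the endpoint:
sparql_url() { printf '%s?query=%s' "$1" "$(urlencode "$2")"; }

sparql_url 'https://query.wikidata.org/sparql' 'SELECT * WHERE { ?s ?p ?o } LIMIT 1'
```

With an Accept: text/csv header on the request, the endpoint's answer can then be piped straight into relpipe-in-csv.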

+ It becomes a bit longer if we add some documentation, argument parsing and configuration:

+ + + + +

+ Here we even have two implementations that can be switched using the RELPIPE_IN_SPARQL_IMPLEMENTATION environment variable.
+ The XML one is more powerful and can be customized (e.g. to specifically handle localized strings or add some new attributes to the relational output).
+ On the other hand, the CSV one has fewer dependencies and supports streaming of long result sets (XSLT needs to load the whole document first).
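A minimal sketch of such switching (the variable name comes from the text above, but the recognized values "xml" and "csv" and the function name are assumptions for this illustration):

```shell
# Sketch: pick a back-end according to an environment variable.
# The values "xml" and "csv" and the default are assumptions for this example.
relpipe_in_sparql_backend() {
	local impl="${RELPIPE_IN_SPARQL_IMPLEMENTATION:-xml}"
	case "$impl" in
		xml|csv) printf '%s\n' "$impl" ;;
		*) echo "Unsupported implementation: $impl" >&2; return 1 ;;
	esac
}
```

Rejecting unknown values with an error is friendlier than silently falling back, since a typo in the variable would otherwise go unnoticed.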

+ +

Both implementations should work:

+ + + +

and produce the same output:

+ +
+ +

And maybe somewhere nearby in the graph we will find:

+ +
It's a Unix System… I know this!
+ +

Sources of RDF data

+ +

+ +

+ The bad news is that we are not querying the real world.
+ We are querying an imperfect, incomplete and outdated snapshot of reality stored in someone's database.
+ The good news is that we can improve the content of certain databases, just like we improve articles in Wikipedia.

+ +

+ Some addresses have already leaked in the relpipe-in-sparql --help above.
+ Here is a brief description of some publicly available sources of RDF data
+ that we can play with.

+ + +

Wikidata

+ +

+ A free and open knowledge base, a sister project of Wikipedia. + Anyone can use and even edit its content. +

+ + + + +

DBpedia

+ +

+ They extract structured content from the information created in various Wikimedia projects
+ and publish this knowledge graph for everyone.

+ + + +

Czech government

+

+ Ministries and other institutions publish some data as open data, and part of it as linked open data (LOD).

+ + + + + + +

Running SPARQL queries as scripts

+ +

Besides piping SPARQL queries through relpipe-in-sparql like this:

+ + +

we can make them executable and run them like a (Bash, Perl, PHP etc.) script:

+ + +

(see the Implementation page for a complete list of available transformations and output filters)

+ +

+ We need to add a first-line comment that points to the interpreter.
+ The endpoint and relation parameters
+ are optional – with them, we can say where the query will be executed and how the output relation will be named:

+ + + +

+ The environment variables RELPIPE_IN_SPARQL_ENDPOINT and RELPIPE_IN_SPARQL_RELATION
+ can be set to override the parameters from the file.
+ All the magic is done by this (a bit hackish) helper script:

+ + + +

+ This script requires the relpipe-in-sparql we put together earlier. + Both scripts are just examples (not part of any release yet). +
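The parameter handling such a helper needs can be sketched like this (the # endpoint: … comment syntax and the precedence of the environment variable follow the text above; the function name is invented):

```shell
# Sketch: resolve the endpoint for an executable SPARQL script.
# The environment variable overrides the "# endpoint: …" comment in the file.
resolve_endpoint() {
	local file="$1" from_file
	# Extract the value of the first "# endpoint: …" comment line, if any:
	from_file=$(sed -n 's/^#[[:space:]]*endpoint:[[:space:]]*//p' "$file" | head -n 1)
	printf '%s' "${RELPIPE_IN_SPARQL_ENDPOINT:-$from_file}"
}
```

The shebang line itself (#!…) starts with #! and therefore does not match the endpoint: pattern, so it is safely ignored.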

+ + + +

Samples of SPARQL queries

+ +

+ Hey kid, rock and roll, + let us list the films where both Coreys starred: +

+ + + +

+ So Mercedes has scratched our Cadillac, but it was still a great night. +

+ +

Now it is time to visit our friends from the club:

+ + + +

+ Not only pretty in pink, this is true wisdom and we could have much fun traversing this part of the graph. + But let us turn the globe around… there is also a lot to see in the Eastern Bloc. +

+ + + +

+ + Dad, what is this place? + Where are we? + Is there anyone here?
+ No. Just us. +
+

+ + + +

P.S.

+

+ 
+ If you got the impression that RDF is just a poor relational database with a single table consisting of a mere three columns
+ and with a freaky SQL dialect, please be assured that this example shows just a small fraction of the wonderful RDF world.
+ 

+ + +
+ +
\ No newline at end of file