
Java Semantic Web & Linked Open Data webapps – Part 3.0

November 26, 2010

Available tools and technologies

(this section is unfinished, but taking a while to put together, so – more to come)

When you first start trying to find out what the Semantic Web is in technical terms, and then what the Linked Open Data web is, you soon find that you have a lot of reading to do – because you have lots of questions. That is not surprising, since this is a new field (even though Jena, for instance, has been going 10 years) for the average Java web developer who is used to RDBMS, SOA, MVC, HTML, XML and so on. On the face of it, RDF is just XML, right? A semantic repository is some kind of storage do-dah and there's bound to be an API for it, right? Should be an easy thing to pick up, right? But you need answers to these kinds of questions before you can start describing what you want to do as technical requirements, understanding what the various tools and technologies can do and which ones are suitable and appropriate, and then selecting some for your application.

One pathway is to dive in, get dirty and see what comes out the other side. But that to me is just a little unstructured and open-ended, so I wanted to tackle what seemed to be fairly real scenarios (see Part 2 of this series) – a 2-tier web app built around a SPARQL endpoint with links to other datasets and a more corporate style web application that used a semantic repository instead of an RDBMS, delivering a high level API and a semantic “console”.

In general, then, it seems your reading needs to cover the following areas:

  • Metadata – this is at the heart of the Semantic Web and Linked Open Data web. What is it? Is it just stuff about Things? Can I just have a table of metadata associated with my “subjects”? Do I need a special kind of database? Do I need structures of metadata – are there different types of things or “buckets” I need to describe things with? How does this all relate to how I model my things in my application – is it different from Object Relational Modelling? Is there a specific way that I should write my metadata?
  • RDF, RDFS and OWL – what are they, why are they used, how are they different from plain XML or RSS; what is a namespace, what can you model with them, what tools are there, and so on
  • SPARQL – what is it, how do you write it, what makes it different from SQL; what can it NOT do for you, are there different flavours, where does it fit in a web architecture compared to where a SQL engine might sit?
  • Description Logic – you'll come across this and wonder, or worse, give up – it can seem very arcane very quickly – but do you need to know all of it, or any of it? What's a graph, a node, a blank node dammit, a triple, a statement?
  • Ontologies – isn't this just a taxonomy? Or a thesaurus? Why do I need one, and how does metadata fit into it? Should I use RDFS or OWL or something else? Is it XML?
  • Artificial Intelligence, Machine Learning, Linguistics – what!? You mean this is robotics and grammar? Where does it fit in – what's going on, do I need a degree in cybernetics to make use of the semantic web? Am I really creating a knowledge base here and not a data repository? Or is it an information repository?
  • Linked Open Data – what does it actually mean – it seems simple enough? Do I have to have a SPARQL endpoint, or can I just embed some metadata in my documents and wait for them to be crawled? What do I want my application to be able to do in the context of Linked Open Data? Do I need my own URIs? How do I make or “coin” them? How does that fit in with my ontology? How do I host my dataset so someone else can use it? Surely there is best practice and there are examples for all this?
  • Support and community – this seems very academic, and very big – does my problem fit into it? Why can I not just use the traditional technologies that I know and love? Where are all the “users” and applications if this is so cool and useful and groundbreaking? Who can help me get comfortable – is anyone doing any work in this field? Am I doing the right thing? Help!

I’m going to describe these things before listing the tools I came across and ended up selecting for my applications. So – this is going to be a long post, but you can scan and skip the things you know already. Hopefully, you can get started more quickly than I did.

First the End

So you read and you read, and come across tools and libraries and academic reports and W3C documents, and you see it has been going on for some time: some things are available and current, others available and dormant. Most are open source, thankfully, and you can get your hands on them easily, but where to start? What to try first – what is the core issue or risk to take on first? Is there enough to decide whether you should continue?

What is my manager going to say to me when I start yapping on about all these unfamiliar things –

  • why do I need it?
  • what problem is it solving?
  • how will it make or save us money?
  • our information is our information – why would I want to make it public?

Those are tough questions when you're starting from scratch and no one else seems to be using the technologies you think are cool and useful – who is going to believe you if you talk about sea-change, or “Web3.0”, or paradigm shift, or an internet for “machines”? I believe you need to demonstrate-by-doing, and to get to the bottom of these questions so you know the answers before someone asks them of you. And you had better end up believing what you're saying, so that you are convincing and confident. It's risky….*

So – off I go – here is what I found, in simple, probably technically incorrect terms – but you'll get the idea and can work out the details later (if you even need to).

*see my answers to these questions at the end of this section

Metadata, RDF/S, OWL, Ontologies

Coarsely, RDF allows you to write linked lists. URIs allow you to create unique identifiers for anything. If you use the same URI twice, you're saying that the exact same Thing is meant in both places. You create the URIs yourself, or, when you want to identify a thing (“john smith”) or a property of a thing (eg “loginId”) that already exists, you reuse the URI that you or someone else created. You may well have a URI for a concept or idea, and others for its physical forms – eg a URI for a person in your organisation, another for the webpage that shows his photo and telephone number, and another for his HR system details.

Imagine 3 columns in a spreadsheet called Subject, Predicate and Object. Imagine a statement like “John Smith is a user with loginId ‘john123’ and he is in the sales department”. This ends up like

Subject  Predicate   Object
S-ID1    type        User
S-ID1    name        “John”
S-ID1    familyName  “Smith”
S-ID1    loginId     “john123”
S-ID2    type        department
S-ID2    name        “sales”
S-ID2    member      S-ID1

That is it, simply – RDF allows you to say that a Thing with an ID we call S-ID1 has properties, and that the values of those properties are either other Things (S-ID2/member/S-ID1) or literal things like the string “john123”.

So you can build a “graph” or a connected list of Things (nodes) where each Thing can be connected to another Thing. And once you look at one of those Things, you might find that it has other properties that link to different Things that you don’t know about or that aren’t related to what you are looking at – S-ID2 may have another “triple” or “statement” that links it with ID-99 say (another user) or ID-10039 (a car lot space, say). So you can wire up these graphs to represent whatever you want in terms of properties and values (Objects). A Subject, Property or Object can be a reference to another Thing.

Metadata are the properties you use to describe Things. In the case of RDF, each metadatum can be a Thing with its own properties (follow the property to its own definition), or a concrete fact – eg a string or a number. Why is metadata important? Because it helps you contextualise and find things, and differentiate one thing from another even if they have the same name. Some say “Content is King”, but I say Metadata is!

RDF has some predefined properties like “property” and “type”. It's pretty simple and you'll pick it up easily [1]. Now, RDFS extends RDF to add some more predefined properties that allow you to create a “schema” that describes your data or information – “class”, “domain”, “range”, “label”, “comment”. So if you start to formalise the relationships described above – a user has a name, familyName, loginId and so on – before you know it, you've got an ontology on your hands. That was easy, right? No cyborgs, logic bombs, T-Box or A-Box in sight (see the next section). And you can see the difference between an ontology and a taxonomy – the latter is a way of classifying or categorising things, but an ontology does that and also describes and relates them. So keep going, this isn't hard! (Hindsight is great too.)
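
To make that a little more concrete, here is a minimal sketch of such a schema written against the Jena API (the com.hp.hpl.jena packages current at the time of writing). The myCo namespace and property names are just the running example from above, not a published vocabulary.

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Property;
    import com.hp.hpl.jena.rdf.model.Resource;
    import com.hp.hpl.jena.vocabulary.RDF;
    import com.hp.hpl.jena.vocabulary.RDFS;

    public class MiniSchema {
        static final String MYCO = "http://myCompany.com/ontology/";

        public static void main(String[] args) {
            Model schema = ModelFactory.createDefaultModel();
            schema.setNsPrefix("myCo", MYCO);

            // myCo:User is a class of Things
            Resource user = schema.createResource(MYCO + "User");
            user.addProperty(RDF.type, RDFS.Class);
            user.addProperty(RDFS.label, "User");

            // myCo:loginId is a property whose subject is a User and whose value is a literal
            Property loginId = schema.createProperty(MYCO + "loginId");
            loginId.addProperty(RDF.type, RDF.Property);
            loginId.addProperty(RDFS.domain, user);
            loginId.addProperty(RDFS.range, RDFS.Literal);
            loginId.addProperty(RDFS.comment, "the account name a person logs in with");

            schema.write(System.out, "TURTLE");   // print the schema as Turtle
        }
    }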

Next you might look at OWL because you need more expressiveness and control in your information model, and you find out that it has different flavours – Lite, DL and Full [2]. What do you do now? Well, happily, you don't have to think about it too much, because it turns out that you can mix and match things in your ontology – use RDFS and OWL together, and even use things from other ontologies. Mash it up – you don't have to define these properties from scratch yourself. So go ahead and do it, and if you find that you end up in OWL Full instead of OWL DL then you can investigate and see why. The point is: start, dig in and do what you need to do. You can revise and evolve at this stage.

A metadata specification called “Dublin Core” [3] comes up a lot – this is a useful vocabulary for describing things like “title”, “creator”, “relation”, “publisher”. Another, the XSD schema, is useful for defining things like number types – integer, long and float – and is used as part of SPARQL for describing literals. You'll also find that there are properties of things that you thought were so common that someone would already have an ontology or a property defined for them. I had quite a time looking for a definition of old English miles, but luckily it turns out that there was one [4,5]. On the other hand, there wasn't one for a compass bearing of “North” – or at least not one that I could find, so I invented one, because it seemed important to me. Not all things in your dataset will need metadata – and in fact you might find that you and someone working on another project have completely different views on what's important in a dataset – you might be interested in describing financial matters, and someone else might be more interested in the location information. If you think about it long enough, a question might come to mind – should we still maintain our data somewhere in canonical, raw or system-of-record form, and have multiple views of it stored elsewhere? (I don't have an answer for that one yet.)

Once you start, you soon see that the point of reusing properties from other ontologies is that you create connections between datasets and information just by using them – your finance department may use “creator”, and you can now link its records with records in the HR system for the same person, because the value used for “creator” is in fact a unique URI (simply, an ID that looks like a URL), eg http://myCompany.com/people/john123. If you have another John in the company, he'll have a different ID, eg http://myCompany.com/people/john911, so you can be sure that the link is correct and precise – no ambiguity – John123 will not get the payslip meant for John911. There are also other ways of connecting information – you could use owl:sameAs for instance – this makes a connection between two Things when a common vocabulary or ID is not available, or when you want to make a connection where one didn't exist before. But think about these connections before you commit them to statements – the correctness, provenance and trust around that new connection have to be justifiable – you want your information and your assertions about it to have integrity, right?

I needed RDF and RDFS at least – these would be the means by which I would express the definitions and parameters of my concepts, and then also the statements representing actual embodiments of those concepts – instances. It started that way, but I knew I might need OWL if I wanted more control over the structure and integrity of my information – eg to say that John123 could only be a member of one department and one department only, or that he had the role of “salesman” but couldn't also have the role of “paymaster”. So, if you need this kind of thing, read more about it [6,7]. If you don't yet, just keep going, and you can still come back to it later (it turns out I did, in fact).
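
If you do get to that point, OWL restrictions are one way to express such rules. Below is a hedged sketch using Jena's ontology API that says a myCo:User is a member of exactly one Department; the URIs and property names are the running example only, and you would still need an OWL reasoner in the loop to actually detect violations.

    import com.hp.hpl.jena.ontology.CardinalityRestriction;
    import com.hp.hpl.jena.ontology.ObjectProperty;
    import com.hp.hpl.jena.ontology.OntClass;
    import com.hp.hpl.jena.ontology.OntModel;
    import com.hp.hpl.jena.ontology.OntModelSpec;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class OneDepartmentOnly {
        static final String MYCO = "http://myCompany.com/ontology/";

        public static void main(String[] args) {
            OntModel ont = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);

            OntClass user = ont.createClass(MYCO + "User");
            OntClass department = ont.createClass(MYCO + "Department");

            // memberOf links a User to a Department
            ObjectProperty memberOf = ont.createObjectProperty(MYCO + "memberOf");
            memberOf.addDomain(user);
            memberOf.addRange(department);

            // "a User is a member of exactly one Department"
            CardinalityRestriction exactlyOne =
                    ont.createCardinalityRestriction(null, memberOf, 1);
            user.addSuperClass(exactlyOne);

            ont.write(System.out, "RDF/XML-ABBREV");
        }
    }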

The table above now looks like this when you use URIs – it's the same information, just written down in a way that ensures things are unique and connectable.

Namespaces
myCo: http://myCompany.com/people/
rdf:  http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs: http://www.w3.org/2000/01/rdf-schema#
foaf: http://xmlns.com/foaf/0.1/

Subject      Predicate         Object
myCo:S-ID1   rdf:type          myCo:User
myCo:S-ID1   rdfs:label        “John”
myCo:S-ID1   foaf:family_name  “Smith”
myCo:S-ID1   myCo:loginId      “john123”
myCo:S-ID2   rdf:type          myCo:department
myCo:S-ID2   rdfs:label        “sales”
myCo:S-ID2   myCo:member       myCo:S-ID1

The namespaces at the top of the table mean that you can use shorthand in the three columns and don't have to repeat the longer part of the URI each time. That makes things easier to read and take in too, especially if you're a simple human. For predicates, I've changed name to rdfs:label and familyName to foaf:family_name [8]. In the Object column only the myCo namespace is used – in the first case it points to a Subject whose type is defined elsewhere (in the ontology, in fact). I say the ontology is defined elsewhere, but that doesn't have to be physically elsewhere; it's not uncommon to have a single file on disk that contains the RDF defining the ontology as well as the instances that make up the vocabulary or the information base.
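
For illustration, here is a minimal Jena sketch that builds exactly those statements in code and prints them as Turtle – a sketch only, since a real application would read or persist the data rather than hard-code it.

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Property;
    import com.hp.hpl.jena.rdf.model.Resource;
    import com.hp.hpl.jena.vocabulary.RDF;
    import com.hp.hpl.jena.vocabulary.RDFS;

    public class JohnSmithTriples {
        static final String MYCO = "http://myCompany.com/people/";
        static final String FOAF = "http://xmlns.com/foaf/0.1/";

        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            m.setNsPrefix("myCo", MYCO);
            m.setNsPrefix("foaf", FOAF);

            Property familyName = m.createProperty(FOAF + "family_name");
            Property loginId    = m.createProperty(MYCO + "loginId");
            Property member     = m.createProperty(MYCO + "member");

            // myCo:S-ID1 – the user
            Resource john = m.createResource(MYCO + "S-ID1");
            john.addProperty(RDF.type, m.createResource(MYCO + "User"));
            john.addProperty(RDFS.label, "John");
            john.addProperty(familyName, "Smith");
            john.addProperty(loginId, "john123");

            // myCo:S-ID2 – the department, linked back to the user
            Resource sales = m.createResource(MYCO + "S-ID2");
            sales.addProperty(RDF.type, m.createResource(MYCO + "department"));
            sales.addProperty(RDFS.label, "sales");
            sales.addProperty(member, john);

            m.write(System.out, "TURTLE");
        }
    }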

So – why is this better than a database schema? The simple, broad answers are the best ones, I think:

  • You have only 3 columns*
  • You can put anything you like in each column (almost – literals can't be predicates), and it's possible to describe a binary property User->name->John as well as n-ary relationships [9] User->hasVehicle->car->withTransmission->automatic
  • You can define what you know about the things in those columns and use it to create a world view of things (a set of “schema rules”, an ontology).
  • You can (and should) use common properties – define a property called “address1” and use it in Finance and HR so you know you're talking about the same property. But if you don't, you can fix it later with some form of equivalence statement.
  • If there are properties on instances that aren’t in your ontology, they don’t break anything, but they might give you a surprise – this is called an “open world assumption” – that is to say just because it is not defined does not mean it cannot exist – this is a key difference from database schema modelling.
  • You use the same language to define different ontologies, rather than say MySQL DDL for one dataset and Oracle DDL for another
  • There is one language spec for querying any repository – SPARQL**. You use the same one for your repository and any others you can find – and over HTTP: no firewall dodging, no operations team objections; predictable, quick and easy to access
  • You do not have to keep creating new table designs for new information types
  • You can easily add information types that were not there before while preserving older data or facts
  • You can augment existing data with new information that allows you to refine it or expand it – eg provide aliases that allow you to get around OCR errors in extracted text, alternative language expressions
  • Any others ?

*Implementations may add one or two more, or break things up into partitioned tables for contextual or performance reasons
**there are different extensions in different implementations

[1] http://rdfabout.net/
[2] http://www.w3.org/TR/owl-features/
[3] http://dublincore.org/
[4] http://forge.morfeo-project.org/wiki_en/index.php/Units_of_measurement_ontology
[5] http://purl.oclc.org/NET/muo/ucum-instances.owl
[6] http://www.cs.man.ac.uk/~horrocks/ISWC2003/Tutorial/
[7] http://www.w3.org/TR/owl-ref/#sameAs-def
[8] http://xmlns.com/foaf/spec/
[9] http://www.w3.org/TR/swbp-n-aryRelations/

SPARQL, Description Logic (DL), Ontologies

SPARQL [10] aims to allow those familiar with querying relational data to query graph data without too much introduction. It's not too distant, but takes a little getting used to. “SELECT * FROM users” looks like “SELECT * WHERE { ?s rdf:type myCo:User }”, and what you get back is bindings for the variables you asked for rather than every column of a table. Of course this is because you effectively have 3 “columns” in the graph data, and they're populated with a bunch of different things. So you need to dig deeper [11] into tutorials and what others have written [12,13].
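
For example, here is roughly how that query runs from Java against an in-memory Jena model (a sketch – the data.ttl file is assumed to hold the triples from the earlier example, and a real application would more likely point at a persistent store or a remote endpoint).

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.ResultSet;
    import com.hp.hpl.jena.query.ResultSetFormatter;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class FindUsers {
        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            m.read("file:data.ttl", "TURTLE");   // the John Smith triples, say

            String q =
                "PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
                "PREFIX myCo: <http://myCompany.com/people/> " +
                "SELECT * WHERE { ?s rdf:type myCo:User }";

            QueryExecution qe = QueryExecutionFactory.create(q, m);
            try {
                ResultSet results = qe.execSelect();
                ResultSetFormatter.out(System.out, results);   // a small table of ?s bindings
            } finally {
                qe.close();
            }
        }
    }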

One of the key things about SPARQL is that you can use it to find out what is in the graph data without having any idea beforehand [14]. You can ask for the types of data available, then ask for the properties of those types, then DESCRIBE or select a range of types for identified subjects. So it's possible to discover what's available to suit your needs, or for anyone else to do the same with your data.
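
The discovery queries themselves are short – something along these lines (standard SPARQL; the URIs are just the running example):

    # what types of Thing are in the store ?
    SELECT DISTINCT ?type WHERE { ?s a ?type }

    # what properties are used on Things of one of those types ?
    SELECT DISTINCT ?p WHERE { ?s a <http://myCompany.com/people/User> . ?s ?p ?o }

    # tell me everything known about one subject
    DESCRIBE <http://myCompany.com/people/S-ID1>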

Another useful thing is the ability (for some SPARQL engines – Jena's ARQ [15] comes to mind) to federate queries, either by using a “graph” (effectively just a named set of triples) whose URI points to a remote dataset, or, in Jena's case, by using the SERVICE keyword. So you can have separate and independent datasets and query across them easily. Sesame [16] allows a similar kind of thing with its Federated Sail, but you predefine the federation you want rather than specify it in situ. Beware of runtime network calls in the Jena case, and consider hosting your independent data in a single store but under different graphs to avoid them. You'll need more memory in one instance, but you should get better performance. And watch out for JVM memory limits and type-size increases if you (probably) move to a 64-bit JVM [17,18].
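
As an illustration, an ARQ-style federated query might look like the sketch below: the first pattern is matched in your local store, while the SERVICE block is sent to a remote endpoint at query time. The endpoint URL is a placeholder, and the warning above about runtime network calls applies.

    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX myCo: <http://myCompany.com/people/>

    SELECT ?user ?p ?remoteFact
    WHERE {
      ?user rdf:type myCo:User .                      # matched in the local store
      SERVICE <http://example.org/other/sparql> {     # evaluated at the remote endpoint
        ?user ?p ?remoteFact .
      }
    }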

While learning the syntax of SPARQL isn't a huge matter, understanding that you're dealing with a graph of data, and having to navigate or understand that graph beforehand, can be a challenge – especially if it's not your own data you want to federate or link with. Having ontologies and sample data (from your initial SPARQL queries) helps a lot, but it can be like trying to understand several foreign database schemas at once, visualising a chain rather than a hierarchy, taking on multiple inheritance and perhaps cardinality rules, domain and range restrictions, and maybe other advanced ontology capabilities.

SPARQL engines, or the libraries they use, that allow inferencing provide a unique selling point for the Semantic and Linked web of data. Operations you cannot easily do in SQL become possible. Derived statements start to appear, carrying information that is not actually “asserted” in the physical data you loaded into your repository. You might, for instance, ask for all Subjects or things of a certain type. If the ontology of the information set says that one type is a subclass of another – say you ask for “cars” – then you'll get back statements that say your results are cars, but you'll also get statements saying they are “vehicles”. If you did this with an information set that you were not familiar with, say a natural history dataset, then when you ask for “kangaroos” you are also told that each one is an animal, a kangaroo and a marsupial. The animal statement might be easy to understand, but perhaps you expected it to be a mammal too. And you might not have expressly said that a kangaroo was one or the other.
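
Here is a small Jena sketch of exactly that cars/vehicles case – only the subclass relationship and one instance are asserted, and the bundled RDFS reasoner derives the rest (assumed URIs, illustration only).

    import com.hp.hpl.jena.rdf.model.InfModel;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Resource;
    import com.hp.hpl.jena.vocabulary.RDF;
    import com.hp.hpl.jena.vocabulary.RDFS;

    public class SubclassInference {
        static final String MYCO = "http://myCompany.com/ontology/";

        public static void main(String[] args) {
            // the "schema": Car is a subclass of Vehicle
            Model schema = ModelFactory.createDefaultModel();
            Resource vehicle = schema.createResource(MYCO + "Vehicle");
            Resource car = schema.createResource(MYCO + "Car");
            car.addProperty(RDFS.subClassOf, vehicle);

            // the asserted data: myCar is a Car – nothing says it is a Vehicle
            Model data = ModelFactory.createDefaultModel();
            Resource myCar = data.createResource(MYCO + "myCar");
            myCar.addProperty(RDF.type, data.createResource(MYCO + "Car"));

            // wrap both in an RDFS inference model and ask the question
            InfModel inf = ModelFactory.createRDFSModel(schema, data);
            System.out.println("myCar is a Vehicle? "
                    + inf.contains(myCar, RDF.type, vehicle));   // prints true
        }
    }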

Once you get back results from a SPARQL query you can start to explore – you start looking for kangaroos, then you follow the marsupial link, and you end up with the opossum, then you see it's in the USA and not Australia, and you compare the climates of the two countries. Alternatively, of course, you may have started at the top end – asked for marsupials, got back all the kangaroos and koalas etc, then drilled down into living environment and so on. Another scenario deals with disambiguation – you ask for statements about eagles and the system might return you things named eagles, but you'll be able to see that one is a band, one is a US football team, and another a bird of prey. Then you might follow links up or down the classifications in the ontology.

Some engines have features or utilities that allow you to “forward-chain” [19] statements before loading – meaning that, using an ontology or a reasoning engine based on a language specification, derived statements about your things are asserted and materialised for you before you load them into your repository. This is not only about class hierarchies: where a hierarchy isn't explicit, inference might still create a statement – “if a Thing has a title, pages and a hardback cover, then it is … a book”. This saves the effort at runtime and should mean that you get a faster response to your query. Forward chaining (and backward chaining [20]) are common reasoning methods used with inference rules in Artificial Intelligence and logic systems.

It turns out that Description Logic, or “DL” [21], is what we are concerned with here – a formal way of expressing or representing knowledge: things have properties that take certain values. OWL is a DL representation, for instance. And as in object-oriented programming languages – Java, say – there are classes (the ontology, T-Box statements) and instances (the A-Box, instances, vocabularies). There are also notable differences from Java (eg multiple inheritance or typing) and a higher level of formalism, and these can make mapping between your programming language and your ontology or model difficult or problematic. For some languages – Prolog or Lisp, say – this mapping may not be such a problem, and indeed you'll find many semantic tools and technologies built using them.

Although DL and AI can get quite heady once you start delving into them, it is easy to start with the understanding that they allow you to describe or model your information expressively and formally without being bound to an implementation detail like the programming language you'll use – and that once you do implement and make use of your formal knowledge representation – your ontology – hidden information and relationships may well become clear where they were not before. Doing this with a network of information sets means that the scope of discovery and fact is broadened – for your business, this may well be the difference between a sale or not, or provide a competitive edge in a crowded market.

[10] http://www.w3.org/TR/rdf-sparql-query/
[11] http://www.w3.org/2009/sparql/wiki/Main_Page
[12] http://www.ibm.com/developerworks/xml/library/j-sparql/
[13] http://en.wikibooks.org/wiki/XQuery/SPARQL_Tutorial
[14] http://dallemang.typepad.com/my_weblog/2008/08/rdf-as-self-describing-data.html
[15] http://openjena.org/ARQ/
[16] http://wiki.aduna-software.org/confluence/display/SESDOC/Federation
[17] http://jroller.com/mert/entry/java_heapconfig_32_bit_vs
[18] http://portal.acm.org/citation.cfm?id=1107407
[19] http://en.wikipedia.org/wiki/Forward_chaining
[20] http://en.wikipedia.org/wiki/Backward_chaining
[21] http://en.wikipedia.org/wiki/Description_logic

Artificial intelligence, machine learning, linguistics

When you come across Description Logic and the Semantic Web in the context of identifying “things” or entities in documents – for example the name of a company or person, a pronoun or a verb – you'll soon be taken back to memories of school: grammar, clauses, definite articles and so on. And you'll grow to love it, I'm sure, just like you used to 🙂
It's a necessary evil, and it's at the heart of one side of the semantic web – information extraction (“IE”) as a part of information retrieval (“IR”) [22,23]. Here we're interested in the content of documents, tables, databases, pages, excel spreadsheets, pdfs, audio and video files, maps, etc etc. And because these “documents” are written largely for human consumption, in order to get at the content using “a stupid machine” we have to be able to tell the stupid machine what to do and what to look for – it does not “know” about language characteristics – what the difference is between a noun and a verb – let alone how to recognise one in a stream of characters, with variations in position, capitalisation, context and so on. And what if you then want to say that a particular noun, used a particular way, is a word about “politics” or “sport”; that it's English rather than German; that it refers to another word two words previous, and that it's qualified by an adjective immediately after it? This is where Natural Language Processing (NLP) comes in.

You may be familiar with tokenising a string in a high-level programming language, then writing a loop to look at each word and do something with it. NLP will do this kind of thing but apply more sophisticated abstractions, actions and processing to the tokens it finds, even having a rule base or dictionary of tokens to look for, or allowing a user to dynamically define what that dictionary or gazetteer is. Automating this is where Machine Learning (ML) comes in. Combined, and making use of mathematical modelling and statistical analysis, they look at sequences of words and then make a “best guess” at what each word is, and tell you how good that guess is.

You will probably need to “train” the machine learning algorithm or system with sample documents – manually identify and position the tokens you are interested in, tag them with categories (perhaps these categories themselves come from a structured vocabulary you have created, found or bought) and then run the “trained” extractor over your corpus of documents. With luck – or actually, with a lot of training (maybe 20%-30% of the corpus size) – you'll get some output that says “rugby” is a “sports” term and “All Blacks” is a “rugby team”. Now you have your robot, your artificial intelligence.

But the game is not up yet – for the Semantic and Linked web, you now have to do something with that output – organise it and transform it into RDF, a related set of extracted entities: relate one entity to another in a statement, “all blacks”-“type”-“rugby team”, and then collect your statements into a set of facts that mean something to you, or to the user for whom you are creating your application. This may be defined or contextualised by some structure in your source document, but it may not be – you may have to provide an organising structure. At some point you need to define a start – a Subject you are going to describe – and one of the Subjects you come up with will be the very beginning or root Thing of your new information base. You may also consider using an online service like OpenCalais [24], but you're then limited to the range of entities and concepts that those services know about – in OpenCalais' case it's largely business and news topics – wide-ranging for sure, but if you want to extract information about rugby teams and matches it may not be too successful. (There are others available and more becoming available.) In my experience, most often and for now, you'll have to start from scratch, or as near as damn-it. If you're lucky there may be a set or list of terms for the concept you are interested in, but it's a bit like writing software applications for business – no two are the same, even if they follow the same pattern. Unlike software applications though, this will change over time – assuming that people publish their ontologies, taxonomies, term sets, gazetteers and thesauri. Let's hope they do, but get ready to pay for them as well – they're valuable stuff.

So

  1. Design and Define your concepts
    1. Define what you are interested in
    2. Define what things represent what you are interested in
    3. Define how those things are expressed – the terms, relations, ranges and so on – you may need to build up a gazetteer or thesaurus
    4. Understand how and where those things are used – the context, frequency, position
  2. Extract the concepts and metadata
    1. Now tell the “machine” about it – in fact, teach it what you know and what you are interested in – show it by example, or create a set of rules and relations that it understands
    2. Teach it some more – the more you tell it, the more variety, examples and repetition you can throw at it, the better the quality of the results you'll get
    3. Get your output – do you need to organise the output, do you have multiple files and locations where things are stored, do you need to feed the results from the first pass into your next one?
    4. Fashion some RDF
    5. Create URIs for your output – perhaps the entities extracted are tagged with categories (that you provided to the trained system) or with your vocabulary, or perhaps not – but now you need to get from this output to URIs, Subjects, Properties, Objects – to match your ontology or your concept domain. Relate and collect them into “graphs” of information, into RDF.
    6. Stage them somewhere – on a filesystem say (one file or thousands? versions, dates? tests and trials, final runs; spaces, capitalisation, reserved characters, encoding – it's the web after all)
  3. Make it accessible
    1. Find a repository technology you like – if you don't know, if it's your first time, pick one and suck it and see – if you have RDF on disk you might be able to use that directly (though probably more slowly than an optimised online repository). Initialise it, get familiar with it, consider size and performance implications. Do you need backup?
    2. Load your RDF into the repository (a minimal loading sketch follows this list). Or perhaps you want to modify some existing HTML docs you have with the metadata you've extracted – RDFa probably.
    3. Test that what you've loaded matches what you had on disk – you need to be able to query it – how do you do that? Is there a command-line tool – does it do SPARQL? What about when you want to use it on the web – that is the whole point, isn't it? Is there a SPARQL endpoint – do you need to set up Tomcat or Jetty, say, to talk to your repository?
  4. Link it
    1. And what about those URIs – you have URIs for your concept instances (“All Blacks”), URIs for their properties (“rdf:type”), and URIs for the Objects of those properties (“myOnt:Team”). What happens now – what do you do with them? If they're for the web, if they're URIs, shouldn't you be able to click on them? (Now we're talking Linked Data – see the next section.)
    2. Link your RDF with other datasets (see the next section) if you want to be found, to participate, and to add value by association, affiliation, connection – the network effect – the knowledge and the value (make some money, save some money)
  5. Build your application
    1. Now create your application around your information set. You used to have data, now you have information – your application turns that into knowledge and intelligence, and perhaps profit.
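
Steps 3.1–3.3 can start as small as the sketch below: read the RDF you staged on disk into a model, then run a quick sanity query, before worrying about a persistent store (Jena TDB, Sesame or similar) and a proper endpoint. The file name is an assumption for illustration.

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class LoadAndCheck {
        public static void main(String[] args) {
            // step 3.1/3.2 – an in-memory model is enough to start; swap in TDB/Sesame later
            Model repo = ModelFactory.createDefaultModel();
            repo.read("file:staged/extracted.rdf");          // the RDF you fashioned in step 2

            System.out.println("Loaded " + repo.size() + " statements");

            // step 3.3 – a sanity check: is there anything in there at all ?
            QueryExecution qe = QueryExecutionFactory.create("ASK { ?s ?p ?o }", repo);
            try {
                System.out.println("Repository has data: " + qe.execAsk());
            } finally {
                qe.close();
            }
        }
    }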

There are a few tools to help you with all this (see below), but you'll find that they don't do everything you need, and they won't generate RDF for you without some help – so roll your sleeves up. Or – don't. I decided against it, having looked at the amount of work involved in learning all about NLP & ML, in the arcane science (it's new to me), in the amount of time needed to set up training, and at the quality of the output. I decided on the KISS principle – “Keep It Simple, Stupid” – so instead I opted to write something myself, based on grep!

I still had to do 1-5 above, but now I had to write my own code to do the extraction and “RDFication”. It also meant I got my hands dirty and learned hard lessons by doing, rather than by reading or by trusting someone else's code that I didn't understand. And the quality of the output and the meaning of it were still all in my control. It is not real Machine Learning – it's still in the tokenisation world I suppose – but I got what I wanted, and in the process made something I can use again. It also gave me practical and valuable experience so that I can revisit the experts' tools with a better perspective – not so daunting, more comfortable and confident, something to compare to, patterns to witness and create, less to learn and take on, and, importantly, a much better chance of actual, deliverable success.
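
For what it's worth, the shape of my home-grown extractor was roughly the sketch below (heavily simplified, with a hypothetical namespace and gazetteer): scan text for terms with a regex and emit a Jena statement for each hit. It is nowhere near NLP or ML, but it shows how little is needed to get from raw text to RDF you can load.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Resource;
    import com.hp.hpl.jena.vocabulary.RDF;
    import com.hp.hpl.jena.vocabulary.RDFS;

    public class GrepStyleExtractor {
        static final String MYONT = "http://example.org/rugby/";   // hypothetical namespace

        public static void main(String[] args) {
            String text = "The All Blacks beat the Wallabies in Hong Kong last Saturday.";

            // a tiny "gazetteer" of terms we care about, baked into one regex
            Pattern teams = Pattern.compile("All Blacks|Wallabies|Springboks");

            Model m = ModelFactory.createDefaultModel();
            m.setNsPrefix("myOnt", MYONT);

            Matcher match = teams.matcher(text);
            while (match.find()) {
                String term = match.group();
                // coin a URI from the matched term (crudely: spaces become underscores)
                Resource team = m.createResource(MYONT + term.replace(' ', '_'));
                team.addProperty(RDF.type, m.createResource(MYONT + "Team"));
                team.addProperty(RDFS.label, term);
            }

            m.write(System.out, "TURTLE");   // stage this to disk, then load it (step 3)
        }
    }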

It was quite a decision to take – it felt dirty somehow – all that knowledge and science bound up in those tools, it was a shame not to use it – but I wanted to learn and to fail in some ways, I didn't want to spend weeks training a “machine”, and it seemed better to fail with something I understood (grep) rather than take on a body of science that was alien. In the end I succeeded – I extracted my terms with my custom automated grep-based extractor, and I created RDF and loaded it into a repository. It's not pretty, but it worked – I have gained lots of experience, and I know where to go next. I recommend it.

Finally, it's worth noting here the value-add components:

  • ontologies – domain expertise written down
  • vocabularies – these embody statements of knowledge
  • knowledge gathering – collecting a disparate set of facts, or describing and assembling a novel perspective
  • assurance, provenance, trust – certifying and guaranteeing levels of correctness and origin
  • links – connections, relationships, ranges, boundaries, domains, associations – the scaffolding of the brains!
  • the application – a means to access, present and use that knowledge to make decisions and choices

How many business opportunities are there here?

[22] http://en.wikipedia.org/wiki/Information_extraction
[23] http://en.wikipedia.org/wiki/Information_retrieval
[24] http://www.opencalais.com/

Linked Open Data

Having googled and read the W3C docs [25-30] on Linked Open Data, it should become clear that the advantages of Linked Open Data are many.

  • Addressable Things – every Thing has an address, namespaces avoid conflicts, there are different addresses for concepts and embodiments
  • Content for purpose – if you're a human you get HTML or text (say); if you're a machine or program you get structured data (see the content negotiation sketch just after this list)
  • Public vocabulary – if there is a definition for something then you can reuse it, or perhaps even specialise it. If not, you can make one up and publish it. If you use a public definition, type or relationship, and someone else does as well, then you can join or link across your datasets
  • Open links – a Thing can point to or be pointed at from multiple places, sometimes with different intent.
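
The “content for purpose” point is plain HTTP content negotiation. A minimal client-side sketch in plain java.net – asking a Linked Data URI for RDF rather than HTML; the dbpedia URI is only an example:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class AskForData {
        public static void main(String[] args) throws Exception {
            URL thing = new URL("http://dbpedia.org/resource/Kangaroo");
            HttpURLConnection conn = (HttpURLConnection) thing.openConnection();

            // a browser would send Accept: text/html and get a web page;
            // we ask for RDF instead and expect to be 303-redirected to the data
            conn.setRequestProperty("Accept", "application/rdf+xml");
            conn.setInstanceFollowRedirects(true);

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);   // RDF/XML describing the Thing
            }
            in.close();
        }
    }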

Once these are part of your toolkit you can go on to create data-level mashups which add value to your own information. As use cases, consider how and why you might link your data to other datasets:

  • a public dataset that contains the schedules of buses in a city, and their current positions while on that schedule – the essence of this is people, place and time – volumes, locations at point in time, over periods of time, at start and end times.
    • You could create an app based around this data and linkages to Places of Interest based on location : a tourist guide based on public transport.
    • How about an “eco” application comparing the bus company statistics over time with that of cars and taxis, and cross reference with a carbon footprint map to show how Public Transport compares to Private Transport in terms of energy consumption per capita ?
    • Or add the movements of buses around the city to data that provides statistics on traffic light sequences for a journey planner ?
    • Or a mashup of route numbers, planning applications and house values ?
    • Or a mash up that correlates the sale of newspapers, sightings of wild foxes and bus routes – is there a link ??? 🙂
The point is, it might look like a list of timestamps and locations, but it's worth many times that when added to another dataset that may be available – and this may be of interest to a very small number of people with a very specific interest, or to a very large number of people in a shared context – the environment, value for money, etc. And of course, even where the data may seem sensitive, or where making it all available in one place feels risky, it is a way of reaching out to customers, of being open and communicative – building a valuable relationship and a new level of trust, both within the organisation and without.
  • a commercial dataset such as a parts inventory used within a large manufacturing company: the company decides to publish it using URIs and a SPARQL endpoint. It tells its internal and external suppliers that its whole catalog is available – the name, ID, description and function of each part. Now a supplier can see which parts it sells to the company, as well as the ones it doesn't but could. Additionally, if the supplier can also publish its own catalog, and correlate with owl:sameAs – or go further and agree to use a common set of IDs (URIs) – then the supply chain between the two can be made more efficient. A “6mm steel bolt” isn't the same as a “4mm steel rod with a 6mm bolt-on protective cap” (the keywords match) – but with shared URIs there is no mistaking one for the other.
And if the manufacturing company can now design new products using the shared URI scheme and reference the inventory publication and the supply chain system built around it, then it can probably be a lot more accurate about costs, delivery times, and ultimately profit. And imagine if the URI scheme used by the manufacturer and the supplier were also an industry-wide scheme – if its competitors and other suppliers also used it? It hasn't given the crown jewels away, but it has been able to leverage a common information base to save costs, improve productivity and resource efficiency, and drive profit.

So Linked Open Data has value in the public and private sector, for small, medium and large scale interests. Now you just need to build your application or service to use it. How you do that is based around the needs and wishes you have of course. The 5 stars[26] of Linked Open Data mean that you can engage at a small level and move up level by level if you need to or want to.

At the heart of things is a decision to make your information available in a digital form, so that it can be used by your employees, your customers, or the public. (You could even do this at different publishing levels, or add authorization requirements to different datasets.) So you decide what information you want to make available, and why. As already stated, perhaps this is a new means to communicate with your audience, or a new way to engage with them, or it may be a way to drive your business. You may be contributing to the public information space. Either way, if you want it to be linkable, then you need URIs and a vocabulary. So you reuse what you can from RDF/RDFS/FOAF/DC/SIOC etc etc, and then you see that you actually have a lot of valuable information that you want to publish but that has not been codified into an information scheme, an ontology. What do you do – make one up! You own the information and what it means, so you can describe it in basic terms (“it's a number”) or from your perspective (“it's a degree of tolerance in a bolt thread”). And if you can, you should also now try and link this to existing datasets or information that's already out there – a dbPedia or Freebase information item say, or an IETF engineering terminology (is there one?), or a location expressed in Linked Open Data format (wgs84 [31]), or a book reference on Gutenberg [32,33], or a set of statistics from a government department [34,35], or a medical digest entry from Pubmed or the like [36] – and there are many more [37]. Some of these places are virtual hubs of information – dbPedia for instance has many, many links to other datasets – and if you create a link from yours to it, then you are also adding all dbPedia's links to your dataset – all the URIs are addressable after all, and as they are all linked, every connection can be discovered and explored.

This part of the process will likely be the most difficult, but also the most rewarding and valuable part, because the rest of it is mostly “clerical” – once you have the info, a structured way of describing it, and addresses for it (URIs), you “just” publish it. Publishing is a spectrum of things: it might mean creating CSV files of your data that you make available; it might mean that you “simply”* embed RDFa into your next web page refresh cycle (your pages are your data, and google is how people find it – your “API”); or it might mean that you go all the way and create a SPARQL endpoint with a content-negotiating gateway into your information base, and perhaps also a VoID [38] page that describes your dataset in a structured, open-vocabulary, curatable way [45].

* This blog doesn't contain RDFa because it's just too hard to do – wordpress.com doesn't have the plugins available, and the wordpress.org plugins may be limited for what you want to do. Drupal7 [50] does a better job, and Joomla [51] may get there in the end.

Service Oriented Architecture, Semantic Web Services

There are some striking similarities and echoes of Service Oriented Architecture (SOA) in Linked Open Data. It would seem to me that using Linked Open Data technologies and approaches with SOA is something that may well happen over time, organically if not formally. Some call this Semantic Web Services [49] or Linked Services [46]. It is not possible to discuss this fully here and now (it's complex, large in scope, and emerging), but I can attempt to say how a LOD approach might line up against SOA:

  • REST, SOAP – Some love it, some hate it. I fall into the latter camp – SOAP has so much complexity and overhead compared to REST that it's hard to live with, even before you start coding for it. LOD has plenty too, but it starts with a more familiar base. And this kind of sums up SOA for me and for a lot of people. But then again, because I've avoided it, I may be missing something. REST in the Linked Open Data world allows all that can be accomplished with SOAP to be done at a lower cost of entry, and with quicker results. If REST were available with SOA services then SOA might be more attractive, I believe.
  • VoID [38,39], WSDL [40] – WSDL and VoID fill the same kind of need: for SOA, WSDL describes a service; in the Linked Open Data world, VoID describes a dataset and its linkages. That dataset may be dynamically generated though (the 303 redirect architecture means that anything can happen, including dynamic generation and content negotiation), so it is comparable to a business process behind a web-based API.
  • BPM – UDDI [41,42,43], CPoA [44,45], discovery, orchestration, eventing, service bus, collaboration. This is where things start to really differ. UDDI is a fairly rigid means of registering web services, while the Linked Open Data approach (eg CPoA) is more do-it-yourself and federated. Whether this latter approach is good enough for businesses that have SLAs with customers is yet to be seen, but there's certainly nothing to stop orchestration, process engineering and eventing being built on top of it. Combine these basics with semantic metadata about services, datasets, registries and availability, and it's possible to imagine a robust, commercially viable internet of services and datasets, both open and based on easily adopted standards.
  • SLA – Identity, Trust, Provenance, Quality, Ownership, Licensing, Privacy, Governance. The same issues that dog SOA remain in the Linked Open Data world – arguably, because of the ease of entry and the openness, they are more of an issue there. But the Linked Open Data world is new in comparison to SOA and can leverage learnings from it and correct problems. It's also not a prescriptive base to start from, so there is no need for a simple public dataset to implement heavy or commercially oriented APIs where they are not needed.

Whatever happens, it should be better than what was there before, IMO. SOA is too rigid and formal to be an internet-scale architecture – one in which ad-hoc, organic participation is possible, or perhaps the norm, but in which highly structured services and processes can also be designed, implemented and grown. With Linked Open Data and some more research and work (from academia, from people like you and me, and from big industry if it likes!), the ultimate goals of process automation, personalisation (or individualisation), composition and customisation get closer and closer, and may even work where SOA seems to have stalled [46,47,48].

[25] http://www.w3.org/DesignIssues/Semantic.html
[26] http://www.w3.org/DesignIssues/LinkedData.html
[27] http://ld2sd.deri.org/lod-ng-tutorial/
[28] http://esw.w3.org/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
[29] http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
[30] http://www.w3.org/TR/cooluris/
[31] http://en.wikipedia.org/wiki/World_Geodetic_System
[32] http://www.gutenberg.org/wiki/Gutenberg:Information_About_Linking_to_our_Pages#Canonical_URLs_to_Books_and_Authors
[33] http://www.gutenberg.org/wiki/Gutenberg:Feeds
[34] https://svn.eionet.europa.eu/projects/Reportnet/wiki/SparqlEurostat
[35] http://semantic.ckan.net/sparql
[36] http://www.freebase.com/view/user/bio2rdf/public/sparql
[37] http://lists.w3.org/Archives/Public/public-sparql-dev/2010OctDec/0000.html
[38] http://semanticweb.org/wiki/VoiD
[39] http://lists.w3.org/Archives/Public/public-lod/2010Feb/0072.html
[40] http://www.w3.org/TR/wsdl
[41] http://www.w3schools.com/WSDL/wsdl_uddi.asp
[42] http://www.omii.ac.uk/docs/3.2.0/user_guide/reference_guide/uddi/what_is_uddi.htm
[43] http://www.w3.org/TR/2006/WD-ws-policy-20060731/
[44] http://webofdata.wordpress.com/2010/02/
[45] http://void.rkbexplorer.com/
[46] http://stefandietze.files.wordpress.com/2010/10/dietze-et-al-integratedapproachtosws.pdf
[47] http://rapporter.ffi.no/rapporter/2010/00015.pdf
[48] http://domino.research.ibm.com/comm/research_projects.nsf/pages/semanticwebservices.research.html
[49] http://www.serviceweb30.eu/cms/
[50] http://semantic-drupal.com/
[51] http://semanticweb.com/drupal-may-be-the-first-mainstream-semantic-web-winner_b568

  1. Elvis
    December 2, 2010 at 9:49 am

    Hi Ultan,

    I really like your way of learning things by asking a lot of questions first, then answering them by exploring the material out there. I’ve been exploring the internet to study semantic web related technologies for several weeks and found that it is really a promising area to delve into. I’m attracted by its ability to provide more intelligent solutions to many business opportunities. So far I have come across many Java frameworks/technologies, such as Jena and Sesame, and RDF stores like Jena TDB, SDB, Virtuoso, YARS2, BigOWLIM, etc., and found that even the commercial versions’ query performance is much slower than an RDBMS. Besides, I haven’t found any mature architecture to apply to a semantic web application. Do you have any suggestions?

    • December 2, 2010 at 12:59 pm

      I’m going to get to the architecture I have decided on for my two applications in the next while, but at the risk of short circuiting my own story – it’s really just plain old web architecture in the end and you can go with whatever suits you best in your own experience.

      That said, public SPARQL endpoints, or public queryable http based data services are somewhat novel for many programmers and using an API for a web service may be a more familiar style of integration. For a linked data app, where you want to mash up from “foreign” SPARQL endpoints as well as your own, and also traditional web services, you’ll probably need a web style MVC 3-tier architecture:

      1) View/UI —> 2) control logic —> 3) data access, including calls to SPARQL endpoints

      1. View – renders your UI.
      2. Control – handles requests from your UI, then calls into
      3. Data access – your SPARQL endpoints, web services and databases.

      3) then hands back to 2), which filters and formats the data before handing it back to your UI code in 1). 2) will most likely be “thin”.

      3) is the interesting part: here you either decide to talk SPARQL from your code, or to use a library that better suits the way you traditionally program. (Now, I’m a Java head, so this conversation is geared around it, but the principles would be pretty much the same for any language.)

      * For the latter, Jena [1] and Sesame [2], for instance, have programming abstractions and shortcuts to help you code in an RDF/SPARQL world – this avoids the need to get down and dirty with SPARQL calls (much like avoiding having to hard-code SQL into your traditional app). Some of the libs have further abstractions so that you can avoid the RDF world altogether and stick with the Object world. I have used JenaBean [3] and Empire [4]. JenaBean is Jena-specific; Empire is built as a JPA implementation, so it’s “standardised” and works with lots of backends (though there are still specifics it needs to deal with). In order to keep you in the Object world they need to use reflection a lot, and with the help of the libraries needed to talk to the specific repositories they will try to minimise network calls and do lazy loading for collections and so on, just like a traditional ORM such as Hibernate. I started with JenaBean and moved to Empire. I prefer Empire and have even contributed some code to it. But there’s no avoiding the fact that you have an extra layer of translation and reflection logic.
      * For the former – perhaps you want to try it out, or perhaps you’re working in, say, Ruby and need to talk to a semantic repository that runs on Java/Tomcat – you can code your app like a web-service integration application: treat the SPARQL endpoints as just HTTP services and unmarshal the XML. In theory, you can write the SPARQL once (perhaps externalise it and model objects around it, styled like iBatis [6] maybe) and then play with different kinds of endpoints – you might want to try OpenVirtuoso one day and OWLIM the next. In the Java world, Jena and Sesame can help you here too, as they have code to wrap SPARQL (XML marshalling, connection handling etc.) – there’s a sketch of this approach just below. But you may find from one lib or implementation to another that there are differences in how the SPARQL engine is implemented, and if you want to do things like count() the number of triples from a query, you need one that implements that as an extension (it’s not available in SPARQL 1.0). So, YMMV.
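      As a rough illustration of that “talk SPARQL directly” route (my own sketch, not production code – the endpoint URL and query are placeholders, and the package names are the Jena 2.x ones; newer Jena uses org.apache.jena), ARQ wraps the HTTP call and unmarshals the result set for you:

```java
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;

public class RemoteSparqlExample {
    public static void main(String[] args) {
        // Placeholder endpoint - swap in DBpedia, your own Virtuoso/OWLIM instance, etc.
        String endpoint = "http://example.org/sparql";

        String sparql =
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
            "SELECT ?name WHERE { ?person foaf:name ?name } LIMIT 10";

        Query query = QueryFactory.create(sparql);

        // ARQ handles the HTTP request and unmarshals the SPARQL XML results for you
        QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.getLiteral("name").getString());
            }
        } finally {
            qe.close();
        }
    }
}
```

      If you go this route, externalising the queries (properties files, or an iBatis-style mapping of query name to SPARQL string) keeps the endpoint-specific extensions in one place.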

      Another pattern may be to use database translation layers [7] – you have a traditional database and schema, but use a technology to make that appear as RDF to your application. You could then write an analytics/intelligence application around that, while your main application continues to run against the SQL RDBMS. I haven’t tried that, because I wanted to dive into the semantic layer and didn’t want to hobble its performance with translation middleware (a gut feeling, no more) and then add an ORdM-style (“Object RDF Model”) layer on top as well – your needs may be different though.

      An alternative is to go with just 2 layers, like I have with the LTDI [8] application. This is Javascript/Ajax talking to SPARQL, with lots of hardcoded queries. It’s arguably easier to do than a 3-tier architecture, but on the downside you’re in the Javascript world (ugh), and you are dealing with direct client-to-server communication where cross-domain calls will be an issue for a linked data app. JSONP and CORS help here, but you need a server to implement some code for both to get them to work easily for you (there’s a sketch of a tiny CORS proxy below). If you can, and you are happy with hardcoded SPARQL and don’t have any business logic to apply, or you can factor your Javascript appropriately, this may be the way to go. PHP and the like may help you here, but then you’re moving back into the 3-tier world, which you may not want to do. If you like PHP, and you want to get going quickly as well, why not take a look at the Structured Dynamics frameworks [9] ? Or how about Drupal 7 and its SPARQL integration, which Lin at DERI has built ? [10, 11, 12] As you know, there’s a lot to choose from…
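      On the CORS point, the server-side piece can be as small as a proxy servlet that forwards the query to the endpoint and adds an Access-Control-Allow-Origin header. This is a bare-bones sketch of my own (placeholder endpoint URL, no caching, no error handling and a wide-open allowed origin – all things you would tighten up in practice):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Forwards ?query=... to a SPARQL endpoint and adds a CORS header to the response. */
public class SparqlCorsProxyServlet extends HttpServlet {

    // Placeholder: the endpoint you are proxying for
    private static final String ENDPOINT = "http://example.org/sparql";

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String query = req.getParameter("query");
        if (query == null) {
            resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "Missing query parameter");
            return;
        }

        URL url = new URL(ENDPOINT + "?query=" + URLEncoder.encode(query, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/sparql-results+json");

        // Allow browser clients on other domains to read the response
        resp.setHeader("Access-Control-Allow-Origin", "*");
        resp.setContentType("application/sparql-results+json");

        // Copy the endpoint's response straight through to the browser
        InputStream in = conn.getInputStream();
        OutputStream out = resp.getOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        in.close();
    }
}
```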

      Also, if it is within your capabilities, consider going “native” and avoiding having to translate from RDF to some other lingo – I’ve not tried it, but Prolog and LISP, built for language processing and AI, may get you closer to the data or the repository and avoid some of the “translation” required. More to learn, but you might enjoy it ! The Europeana Digital Archive [13] project for instance makes use of a Prolog-based semantic server (ClioPatria) [14] – but I’m not sure if they have used Prolog in the UI. Anyone got other Prolog war stories ?

      In respect of performance, in absolute terms, prepare to be disappointed [15,16]: SPARQL engines are generally slower than SQL engines. That’s an absolute testbed measurement, and the speeds in question may not be an issue for you. (For me it was an issue, even with small amounts of data and a local, but untuned, semantic repository.)

      Semantic repositories may be quicker than an RDBMS for many-to-many relationships over large amounts of data, but for a business-type app with, say, 10m records you may not come across this gain. On the other hand, depending on what your queries are, where you have your data and which engine you use, it may not be an issue. And ultimately, for now, it may boil down to need and advantage: if you are building a decision-support or intelligence-based tool, then the performance cost may be well worth it for the new intelligence an inferring, semantic, linked application makes available – sell this to your manager ! And for a linked app, just being able to query across repositories may be enough to crown your glory !

      A “third way” to deal with performance issues is to split your app – you may be able to write data to both an RDBMS and a semantic repo, then read back from the RDBMS for most of your app, but use the semantic repo for the analytics and intelligence functionality, where the payoff exceeds the performance degradation (there’s a sketch of this below). It also gives you a means to migrate your application, and to try things out. Using a standards-based technology stack, like Empire JPA, pays dividends in this case.
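      As a very rough sketch of that split (my own illustration – the DAO, the Item class and the schema URIs are all invented for the example, and a real Item would carry the usual JPA annotations), the write path fans out to both stores while your reads stay on the RDBMS:

```java
import javax.persistence.EntityManager;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDF;

/** Hypothetical DAO: writes each saved item to the RDBMS via JPA and mirrors it as RDF. */
public class DualWriteItemDao {

    private final EntityManager em;   // standard JPA unit backed by your RDBMS
    private final Model semanticRepo; // Jena model backed by your triple store (TDB, SDB, ...)

    public DualWriteItemDao(EntityManager em, Model semanticRepo) {
        this.em = em;
        this.semanticRepo = semanticRepo;
    }

    public void save(Item item) {
        // 1. the normal relational write - the rest of the app reads from here
        em.persist(item);

        // 2. mirror the same object as RDF for the analytics / inference side
        Resource r = semanticRepo.createResource("http://example.org/item/" + item.getId());
        r.addProperty(RDF.type, semanticRepo.createResource("http://example.org/schema#Item"));
        r.addProperty(semanticRepo.createProperty("http://example.org/schema#", "name"),
                item.getName());
    }

    /** Hypothetical entity - just enough for the sketch (JPA annotations omitted). */
    public static class Item {
        private Long id;
        private String name;

        public Item(Long id, String name) { this.id = id; this.name = name; }

        public Long getId() { return id; }
        public String getName() { return name; }
    }
}
```

      If you use Empire for the semantic side instead of raw Jena calls, the two writes can look almost identical – two persistence units, one persist() each – which is part of why a standards-based stack pays off here.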

      In general you need to work around these performance factors:

      * you’ll most likely have to partition your data into graphs so the SPARQL engine doesn’t have to deal with everything (there’s a sketch of this after the list),
      * you need to think about sizing your data when you have lots of it, and how you might replicate, partition or shard [25] it – hopefully this won’t come up !
      * as well as data size, some SPARQL engines sit outside of, and are independent of, the storage engine, so getting them to work with a distributed or partitioned storage engine may be a question – BigData [20] or NoSQL [26,27] technologies may have answers ?
      * not all SPARQL engines are equal – aggregation, extensibility, federation and inference are differentiating factors
      * inference can be very slow – some inference libs are better than others. Jena does some, more than Sesame, and Pellet [17] is a popular choice.
      * you’ll need lots of RAM, and as fast a CPU as possible. If you split your datasets then you need to be able to federate queries (at the engine level) or create linkage across them in your code
      * if you have a database-backed repository, there will be a lot of network chatter going on
      * if you have a database-backed repository you’ll most likely need an OSIV filter [5].
      * do you need text search – different engines have different capabilities, but perhaps you need a separate search capability altogether – can you do facets natively, or how about getting RDF into SOLR [21,22] so that you can have faceted search ? Can Siren [23] or Sindice [24] help you ?
      * what about concurrency – can you do multiple writes simultaneously on a semantic repository ? (Jena and Sesame are multiple-reader-single-writer, and with RDBMS-backed repos you have to watch out for deadlocking on long-running queries. I wonder if Allegro [18] and OpenVirtuoso [19], even at the commercial level, are any better ? Anyone ?)
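      To illustrate the first point, here is what scoping a query to a named graph looks like (my own sketch – the endpoint, graph URI and class URI are placeholders, and how you load data into named graphs is specific to your repository). The GRAPH clause means the engine only has to match against that partition rather than the whole store:

```java
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;

public class GraphScopedQueryExample {

    // Without a GRAPH clause the engine has to consider every triple in the store:
    static final String UNSCOPED =
        "SELECT ?s WHERE { ?s a <http://example.org/schema#Attraction> } LIMIT 10";

    // With a GRAPH clause, matching is restricted to one partition of the data.
    // The graph URI is a placeholder; loading data into named graphs is
    // repository-specific (TDB, Virtuoso, OWLIM etc. all support it).
    static final String SCOPED =
        "SELECT ?s WHERE { "
        + "  GRAPH <http://example.org/graphs/ireland> { "
        + "    ?s a <http://example.org/schema#Attraction> "
        + "  } "
        + "} LIMIT 10";

    public static void main(String[] args) {
        QueryExecution qe =
            QueryExecutionFactory.sparqlService("http://example.org/sparql", SCOPED);
        try {
            ResultSetFormatter.out(System.out, qe.execSelect());
        } finally {
            qe.close();
        }
    }
}
```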

      As my series of articles progresses, I expect that because of the pace at which these things are changing, the number of players and the evolving patterns for working with them, some of my thoughts will be obsolete by the time I finish ! I may also not have done much to clear up your question about architecting an app on the semantic/linked web, but for me it’s a bit like being back in 1996 – Java hadn’t been around long, things like JDBC hadn’t really materialised, experiments with RMI were good fun but painful in a firewalled business network, spinning penguins made people go “ooh” and “aah”, and acceptance as a corporate technology hadn’t happened. Broadband, what was that ? So the Semantic Web is a bit like that: it’s young and exciting; it has a lot of promise; and if it turns into a world where data access across corporate boundaries is as accessible as HTML access is now (but perhaps, for the machines, the poor poor stupid machines (!), more structured), then you need to know how to deal with it, take advantage of it, and maybe even make some money out of it ! And if it doesn’t, you’ve not done anything wrong or lost anything, because something very much like it will still be needed to deal with the vast swathes of information out there (still like your Google results these days ?), and the same techniques and science will still be needed. The implementation may be different, but it won’t be unfamiliar or alien to you anymore. So go on – get down to it – make lots and lots of mistakes, ask lots and lots of stupid questions, learn quickly and get ready; demonstrate the benefits and the advantages; condense that learning into your elevator pitch and then go evangelise and bore people with your enthusiasm. Submission is just one tiny triple away !

      [1] http://www.openjena.org/
      [2] http://www.openrdf.org/
      [3] http://code.google.com/p/jenabean/
      [4] http://groups.google.com/group/empire-rdf/
      [5] http://stackoverflow.com/questions/3445989/osiv-pattern-pros-and-cons-general-question-about-osiv-and-views
      [6] http://www.mybatis.org/
      [7] http://www.w3.org/2005/Incubator/rdb2rdf/
      [8] http://uoccou.endofinternet.net:8080/resources/sparql.html
      [9] http://www.structureddynamics.com/products.html
      [10] http://drupal.org/project/sparql_views
      [11] http://cph2010.drupal.org/sessions/using-rdf-semantic-web-modules
      [12] http://semantic-drupal.com/
      [13] http://www.europeana.eu/portal/
      [14] http://www.swi-prolog.org/web/ClioPatria.html
      [15] http://www4.wiwiss.fu-berlin.de/benchmarks-200801/
      [16] http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html#results
      [17] http://clarkparsia.com/pellet/
      [18] http://www.franz.com/downloads/clp/ag_survey
      [19] http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/
      [20] http://sourceforge.net/apps/mediawiki/bigdata/index.php?title=Main_Page
      [21] http://people.apache.org/~hossman/apachecon2006us/faceted-searching-with-solr.pdf
      [22] http://fgiasson.com/blog/index.php/2009/04/29/rdf-aggregates-and-full-text-search-on-steroids-with-solr/
      [23] http://siren.sindice.com/
      [24] http://www.sindice.com/developers/api
      [25] http://en.wikipedia.org/wiki/Shard_%28database_architecture%29
      [26] http://nosql-database.org/
      [27] http://www.drdobbs.com/database/224900500

      • Elvis
        December 3, 2010 at 4:02 am

        Hi Ultan,

        Thanks for your reply. Again, your broad knowledge really impressed me.
        The reason I came to the semantic web world is this website, http://www.wanderfly.com – it’s a travel recommendation site. I like the idea very much, so I went around it and tried to find out the algorithm/idea behind this amazing thing. With the idea in my mind, I explored search engines (Lucene/Solr) and also did some research on the recommendation engine/machine-learning library Mahout. Finally, I came across semantic web technologies and found that they are what I need. I started to delve into them, reading several books and many articles and theses on the internet, and found that the performance, especially of the RDF stores, is not so mature – like you said, it’s like the Java of 1996. A little bit disappointed. 🙂
        My idea is to build an intelligent tourism and activity recommendation system that answers “Where can I go?” like Wanderfly does, but goes further to suit my country’s situation. I’ve got many tourism/travel related ontologies from the web (by searching in Swoogle) and will do some modification to the one I’m going to use. I’m working on extracting data from the web using Web-Harvest now. Choosing the RDF store is what stopped me here. Maybe I can just use one first, like Jena TDB, to set up my demo, and if my idea proves to be a lucrative business idea, then consider a commercial one. I really like your “third way” of dealing with performance issues, which is to split my app and write data to both an RDBMS and a semantic repo. As for the architecture, I may choose the MVC 3-tier. For the UI, Wicket and Flex are what’s on my mind now. For the controller, I may use scardf – a Scala DSL wrapper for Jena. I found it really simple for dealing with RDF operations. Scala can reduce a lot of lines of code for me and it can also coexist with Java in a project. Any recommendations for my project? Thanks.

      • December 3, 2010 at 11:18 am

        It boils down to the usual things – define, design, identify risks, implement, test, fix, document. Get familiar with SPARQL and RDF first – do some basic things with them, you’ll have some surprises. Spend some time at the extraction task: do you really need NLP (you need to have a corpus and a large training set, and some time), or could you just use grep for now ? Pick a storage solution – all will load RDF/XML or RDF/TTL and such, so you can always try another later. Stay in the OO world if you can – you may need to write a layer to abstract you from the RDF with the assistance of Jena or Sesame. Performance test as you go along.

        What’s your application going to do – do you need security or access control – how will you tackle that ? Can you map your business objects to RDF in your mind – what’s your object model ? (There’s a sketch below of what that mapping can look like in code.)
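        To give a feel for what mapping a business object to RDF can look like in code, here is a rough sketch in the style of Empire’s annotations. The class, namespaces and property URIs are invented for the example, the annotation and package names are as I recall them from Empire’s docs (so double-check them against the version you use), and a real Empire entity must also implement its SupportsRdfId interface, which I’ve left out for brevity:

```java
import javax.persistence.Entity;

import com.clarkparsia.empire.annotation.Namespaces;
import com.clarkparsia.empire.annotation.RdfProperty;
import com.clarkparsia.empire.annotation.RdfsClass;

// Hypothetical domain class for a tourism recommender: a place you might recommend.
// NOTE: a real Empire entity must also implement Empire's SupportsRdfId interface
// (omitted here to keep the sketch short).
@Namespaces({"ex", "http://example.org/schema#",
             "dc", "http://purl.org/dc/elements/1.1/"})
@Entity
@RdfsClass("ex:Attraction")
public class Attraction {

    @RdfProperty("dc:title")
    private String name;

    @RdfProperty("ex:locatedIn")
    private String region;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getRegion() { return region; }
    public void setRegion(String region) { this.region = region; }
}
```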

        How will you make recommendations – do you have a vocabulary you can use, or will it be generated from arbitrary user tags – and if so, how will you deal with mis-spellings, ambiguity, aliases ? What if you want people to find “historical monuments” when they have tagged things with “medieval castle” ? (One way to bridge that gap is sketched below.) Do you need to summarise the sentiments of what people write – “it was a good visitor attraction” vs “attraction, not bad; cafe expensive” – in order to correlate them ? There is a lot of work in this area, a lot of science, and it’s not an easy problem to solve, so you need to be clear which bits you are going to try to do and what you want to achieve.
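        On the “medieval castle” vs “historical monuments” point, one concrete approach is a small controlled vocabulary with broader/narrower and alternative-label links, SKOS-style. This is my own sketch (invented concept URIs and labels), built with Jena – your recommendation code would then expand a search by walking skos:broader links instead of matching raw tag strings:

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class TagVocabularyExample {
    static final String SKOS = "http://www.w3.org/2004/02/skos/core#";
    static final String EX   = "http://example.org/concepts#";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("skos", SKOS);

        Property broader   = m.createProperty(SKOS, "broader");
        Property prefLabel = m.createProperty(SKOS, "prefLabel");
        Property altLabel  = m.createProperty(SKOS, "altLabel");

        // "medieval castle" is a narrower concept under "historical monument"
        Resource monument = m.createResource(EX + "HistoricalMonument")
                             .addProperty(prefLabel, "historical monument");

        Resource castle = m.createResource(EX + "MedievalCastle")
                           .addProperty(prefLabel, "medieval castle")
                           .addProperty(altLabel, "castle")       // catches a common user tag
                           .addProperty(broader, monument);

        // A search for "historical monuments" can now expand to anything whose
        // concept has a skos:broader path up to HistoricalMonument.
        m.write(System.out, "TURTLE");
    }
}
```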

        Performance: don’t let it stop you, it will only get better – you’re starting out, and so is the technology (in relative terms). Maybe you’ll discover something to help everyone else too !
