Archive

Posts Tagged ‘owl’

What’s the point: Semantic RESTful Web Services?

March 9, 2011 4 comments

Well, I think it’s dawning on me that what Roy Fielding talks about (rather abstractly) [1] is what Henry Story neatly summarises and provides examples of [2] – REST SOA, with connected semantics. I’m one of those who can be accused of implementing REST not in the Roy Fielding sense of the word, but in the anything-that’s-not-WS* “meaning”. I’ve done request mapping, content negotiating, resource rendering in XML, JSON (and a bunch of others), GET, PUT, POST, HEAD etc etc, but never all together, and never in the true Spirit of Roy. But when you add the semantic web, you can really see that there’s something good going on here – “easy” and ubiquitous web services.

Roy talks about representations, resources and connectedness, about agents or service consumers that deal with well-known media types and links, and nothing else – REST implies that a user agent is “thin”, understands basic and well-known types and protocols, and renders a look and feel and a behaviour that reacts to what it is fed. As he says, it should work on the “follow your nose” principle (no need for WADL [3,4]).

For a browser this would mean that you point it at a URL, it displays the content suitably, and it receives and displays links with appropriate CRUD capabilities for the resource and the relations it is given. For example, given a book resource, render it using the .book CSS class, and create links to add it to a shopping cart, get a contents list, or add it to a favourites list. For a chapter in the book, there may be a link to print it, to relate it to a chapter in another similar book, or to annotate it and send it to a colleague. For a daemon or agent it might mean that it alters the time at which it performs an action against a resource, or what action it takes. The navigation and action controls aren’t determined by business or display logic, but by the resource and its relations – the agent consuming the resource knows it has to display or follow a link, the CSS may have display capabilities based on the resource type or context, and the workflow steps will appear at the right time for the right user, under the right circumstances. Client logic is solely there to deal with converting representations to appropriate media types, and to drive application state – using relations and verbs to make transitions with links.

But the thing that got me spinning, as I tried to understand the abstractedness, and as I looked into JAX-RS [5] and its various implementations (well, Spring* in particular TBH, which doesn’t do JAX-RS in fact [6]), was that the connectedness and follow-your-nose principle seemed absent. It’s all very well and cosy (and arguably easy) to create some platform code that maps URIs to classes and methods and HTTP verbs, and then to output XML or JSON or not (think JSP), or perhaps even Atom, OData, RDF, N3 or TTL – but where’s the linked connectedness, the thing we talk about and take for granted in the Linked Open Semantic Data world? And how does it know what links to create, how to generate them, and how they should be presented (if there’s a human involved)?
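To make that concrete, here is a minimal JAX-RS sketch of the URI-to-method mapping described above. The resource class, the path and the Turtle snippet are hypothetical; the point is what is missing – nothing in the code says which links the representation should carry.

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;

// Hypothetical resource class: JAX-RS maps URI template + verb + media type to a
// method, but says nothing about the links the representation should contain.
@Path("/books/{id}")
public class BookResource {

    @GET
    @Produces("text/turtle")
    public String get(@PathParam("id") String id) {
        // in a real service this would be looked up and serialised properly
        return "<http://example.org/books/" + id + "> "
             + "<http://purl.org/dc/elements/1.1/title> \"A placeholder title\" .";
    }
}
```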

Well, Henry blows that lightbulb for me when he illustrates from his foaf profile all the foaf:knows relations [2]. In a RESTful world, where a service returns a foaf file and reads the foaf:knows elements, it can decide what to do based on that predicate – it can deduce that the resource represented is a Person and can create the links it chooses using what it knows about foaf:knows and the REST verbs – create/read/update/delete. It might allow addition of another foaf:knows with a PUT to the URI identifying the owner, an update to a mailing list so that all those foaf:knows objects are added, or automatically update a trust counter against a system resource – because if Henry foaf:knows TBL in this context, then TBL must be “good for it” :-). In addition, it only knows that a URI represents that Person, and the URI could be a hypermedia link in the form of a URL, an ftp or WebDAV link, or some other protocol. Finally, this “knows” concept is really an upfront agreement about what representations are being used for the state of the application (it knows an XML schema, or an Ontology, or perhaps even looks them up on the fly), but navigating through state is controlled by the interactions with the service (HTTP verbs) and the responses (status and agreed representation in the body content) received.
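A rough sketch of the consuming side with Jena (2.x packages): dereference a FOAF profile and walk the foaf:knows values, each of which is a link the agent can decide what to do with. The profile URL is illustrative and assumed to serve RDF.

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.NodeIterator;
import com.hp.hpl.jena.rdf.model.Property;

public class FoafKnows {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // dereference a FOAF profile (URL is illustrative, assumed to serve RDF/XML)
        model.read("http://bblfish.net/people/henry/card");

        Property knows = model.createProperty("http://xmlns.com/foaf/0.1/", "knows");

        // every foaf:knows value is a link the agent can choose to follow, render,
        // or act on (offer an "add contact" PUT, look the URI up, and so on)
        NodeIterator people = model.listObjectsOfProperty(knows);
        while (people.hasNext()) {
            System.out.println(people.next());
        }
    }
}
```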

At first sight those RESTful libraries don’t really need to know that much about the connectedness – they only need to map verbs and serve resources with those links embedded (RDF, anyone?), using well-known vocabularies, classes, relations and constraints – i.e. ontologies. But what about workflow: I POST an object or resource, I get a response with the ID of that resource, and I need a link that tells me where to go for the next state transition?
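In JAX-RS terms, one hedged way of answering that question is to return 201 Created with a Location header and a typed link to the next step. Everything here (the /orders path, the “payment” relation, the use of Turtle) is illustrative:

```java
import java.net.URI;
import java.util.UUID;
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.Response;
import javax.ws.rs.core.UriInfo;

@Path("/orders")
public class OrderResource {

    @POST
    @Consumes("text/turtle")
    public Response create(String body, @Context UriInfo uriInfo) {
        // pretend we persisted the representation and minted an id for it
        String id = UUID.randomUUID().toString();
        URI location = uriInfo.getAbsolutePathBuilder().path(id).build();

        // 201 Created + Location says where the new resource lives; a Link header
        // (a hypothetical "payment" relation) says where the client can go next
        return Response.created(location)
                       .header("Link", "<" + location + "/payment>; rel=\"payment\"")
                       .build();
    }
}
```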

So, lo and behold, we have semantic linked data and REST superadditively combined, in a loosely coupled web (or “cloud”, if you like that keyword) of semantic links, intelligent user agents that understand those links or their context, web-resolvable URIs, and value-added interlinked services – in effect a “Web Service Bus” [7]!

Now

  1. Point your People tool at the RESTful people+location web service and it “just works” to give you a social-network-mashup of connected people and interests (provenance, trust), and then
  2. switch over to your Energy consumption application and it also just works (based on what it has chosen to do and the well-known ontologies and resources it understands) – see how big your carbon footprint is when you meet TBL next week in Geneva if you fly, drive or take the train – and maybe you’ll be able to see who you can meet on the way and who else will be sitting beside you.

But you’re not out of the woods yet; doing semantic RESTful web apps isn’t a clear open space: your application still has to deal with authentication, input validation, long-lived database transaction control, multithreading, performance, and perhaps object-relational mapping, but JAX-RS/REST takes care of the object-message mapping (the interface-to-implementation layer), your client or agent is thin but intelligent, and your middle tier contains your business logic.

Your application will need to honour the request-response state machine, perhaps checking availability using OPTIONS, or ETags.
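For instance, a conditional GET with an ETag might be sketched like this; the resource representation and the way the tag is derived (a simple hash) are placeholders:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.EntityTag;
import javax.ws.rs.core.Request;
import javax.ws.rs.core.Response;

@Path("/books/{id}")
public class BookCacheResource {

    @GET
    @Produces("text/turtle")
    public Response get(@PathParam("id") String id, @Context Request request) {
        String representation =
            "<http://example.org/books/" + id + "> a <http://example.org/Book> .";

        // derive a validator from the current state of the resource
        EntityTag tag = new EntityTag(Integer.toHexString(representation.hashCode()));

        // if the client sent a matching If-None-Match we can answer 304 Not Modified
        Response.ResponseBuilder conditional = request.evaluatePreconditions(tag);
        if (conditional != null) {
            return conditional.build();
        }
        return Response.ok(representation).tag(tag).build();
    }
}
```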

You’ll need to decide how to transform from your programming model of choice – OO perhaps – to Resource. Some of the object-to-RDF mapping within libraries like Empire [15], JenaBean [16], Sommer [22] (defunct?) and object-triple [17] may help. Perhaps this won’t be an issue for you if you can foist the RESTful resource and linkage proposition onto an object model and remain in the object world – but why waste processor and resources when you store data in RDF, convert to an Object on retrieval, process, convert to RDF/XML or JSON on the way out, then parse and walk it in a JSP before rendering as HTML? As an OO programmer on the web you’re familiar with marshalling objects in and out of different serializations – RDF/XML/JSON/HTML – but you do want and need to minimise those transitions. Perhaps for “Big Data” we should stay in the Resource world: persist to a fast native RDF triplestore or an HPC-based system on a cluster of MapReduce or somesuch (CouchDB [20], Heart/HBase [21]? perhaps BigData [18] or SHARD [19], AllegroGraph [23]?), and talk to it with Prolog or some such – forget the Object paradigm and embrace the Linked, Open World Resources, and do it with REST.

You also have to be clear that REST suits what you want to do (other architectures haven’t just been demoted to history), what your services are – what you are interfacing with, what your domain objects are, what service operations are exposed when, what workflow you need to encompass [13], and how granular you need to be. A shopping cart application will need to save items to a shopping list, rather than save the items themselves (or the cart resource, probably), but it will also, behind the service, need to update a stock control or inventory – which isn’t exposed to your end user. So be clear about which service-level CRUD operations you need to expose to your user or “agent”, and which, if any, domain objects you need to directly manipulate.

But in the end, hopefully, you’ve still followed your enterprise principles and patterns, but you’ve adopted a long-lasting web-scale architecture, and if you’ve added the semantic vocabulary, you’ve got the basis for successful evolution, a network effect, adaptable clients and agents, and a successful resolution to an important business case – that’s why you’re doing this, isn’t it, not because it’s cool?

Update, April 24: read Otavio’s paper on RESTfulGrounding [25], but also read Alowisheq, Millard and Tiropanis’ EXPRESS RESTful services paper [26]. RESTfulGrounding does for REST and WADL what OWL-S does for WSDL – it gets semantic descriptions into the syntactic descriptions that automated services might use to interact with a web service, and facilitates discovery, composition, monitoring and execution. EXPRESS takes a different approach: based on an existing RESTful web service, it allows you to create an OWL description that can also be RESTfully accessed to describe the service’s resources, relations and “parameters” (OWL DatatypeProperty and ObjectProperty). They describe an adaptation of Amazon S3 buckets and docs with EXPRESS and compare it with the SA-REST and OWL-S approaches.

I like EXPRESS more than RESTfulGrounding as the simplicity appeals: the way it in turn relies on REST to underpin the service description access and interaction, adheres to RESTful principles for message exchange – using TTL rather than XML – and to follow-your-nose, and the fact that this in turn means I don’t have to learn much if I want to make use of it. It does need the use of a code generator for stubs and URIs and a manual step to define which methods apply to which URIs, and doesn’t do much for discovery and composition – but they acknowledge this and intend to work on it – and a real implementation with these tools needs to be made available so that people like me can try it out. Is there one?

I need to understand more about WADL[27,28] (why is it needed in the first place ?) and how I might go about actually building a set of services that need to be described and then discovered and composed to provide some useful value, but EXPRESS fits nicely into web scale, lo-fi approaches that quickly gain traction and that might make use of a CPoA kind of approach for discovery and composition.

* You’ve got other choices :

  • Apache CXF – perhaps best if you come from the WS* camp or have a mixture [8]
  • GlassFish Jersey – seems to have good traction, with hooks into Spring et al [9]
  • RESTeasy – JBoss jax-rs implementation [10]
  • RESTlet – not sure about this, seems to have good support, taking a different approach apparently – eg RESTlet vs SERVlet, but I need more info to do it justice [11]
  • PLAY Framework – has good REST support I understand from others. [12]
  • Clerezza – Apache incubator project with RDF, jax-rs, scala and “renderlet” support. Looks interesting from an RDF PoV, but maybe not so interesting from an OOD PoV [14]

[1] http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven
[2] http://blogs.sun.com/bblfish/entry/rest_apis_must_be_hypertext
[3] http://wadl.java.net/
[4] http://bitworking.org/news/193/Do-we-need-WADL
[5] http://jcp.org/en/jsr/detail?id=311
[6] http://grzegorzborkowski.blogspot.com/2009/03/test-drive-of-spring-30-m2-rest-support.html
[7] http://wisdomofganesh.blogspot.com/2010/06/wanted-esc-not-esb.html
[8] http://cxf.apache.org/
[9] http://jersey.java.net/
[10] http://www.jboss.org/resteasy
[11] http://www.restlet.org/
[12] http://www.playframework.org/documentation/1.1/routes
[13] http://www.infoq.com/articles/webber-rest-workflow
[14] http://incubator.apache.org/clerezza/
[15] https://github.com/clarkparsia/Empire
[16] http://code.google.com/p/jenabean/
[17] http://code.google.com/p/object-triple/
[18] http://www.systap.com/bigdata.htm
[19] http://www.dist-systems.bbn.com/people/krohloff/shard_overview.shtml
[20] http://couchdb.apache.org/
[21] http://wiki.apache.org/incubator/HeartProposal
[22] http://java.net/projects/sommer/
[23] http://www.franz.com/agraph/allegrograph/
[24] http://blog.cubrid.org/web-2-0/database-technology-for-large-scale-data/
[25] http://www.fullsemanticweb.com/blog/ontologies/restfulgrounding/
[26] http://ebookbrowse.com/express-expressing-restful-semantic-services-using-domain-ontologies-pdf-d12806537
[27] http://java.net/projects/wadl/
[28] http://bitworking.org/news/193/Do-we-need-WADL

Java Semantic Web & Linked Open Data webapps – Part 3.0

November 26, 2010 5 comments

Available tools and technologies

(this section is unfinished, but taking a while to put together, so – more to come)

When you first start trying to find out what the Semantic Web is in technical terms, and then what the Linked Open Data web is, you soon find that you have a lot of reading to do – because you have lots of questions. That is not surprising since this is a new field (even though Jena for instance has been going 10 years) for the average Java web developer who is used to RDBMS, SOA, MVC, HTML, XML and so on. On the face of it, RDF is just XML right ? A semantic repository is some kind of storage do-dah and there’s bound to be an API for it, right ? Should be an easy thing to pick up, right ? But you need answers to these kind of questions before you can start describing what you want to do as technical requirements, understanding what the various tools and technologies can do, which ones are suitable and appropriate, and then select some for your application.

One pathway is to dive in, get dirty and see what comes out the other side. But that to me is just a little unstructured and open-ended, so I wanted to tackle what seemed to be fairly real scenarios (see Part 2 of this series) – a 2-tier web app built around a SPARQL endpoint with links to other datasets and a more corporate style web application that used a semantic repository instead of an RDBMS, delivering a high level API and a semantic “console”.

In general then it seems you need to cover in your reading the following areas

  • Metadata – this is at the heart of the Semantic Web and Linked Open Data web. What is it !! Is it just stuff about Things ? Can I just have a table of metadata associated with my “subjects” ? Do I need a special kind of database ? Do I need structures of metadata – are there different types of things or “buckets” I need to describe things as ? How does this all relate to how I model my things in my application – is it different than Object Relational Modelling ? Is there a specific way that I should write my metadata ?
  • RDF, RDFS and OWL – what is it, why is it used,  how is it different than just XML or RSS;  what is a namespace, what can you model with it, what tools there are and so on
  • SPARQL – what is it, how to write it, what makes it different from SQL; what can it NOT do for you, are there different flavours, where does it fit in a web architecture compared to where a SQL engine might sit ?
  • Description Logic – you’ll come across this and wonder, or worse give up – it can seem very arcane very quickly – but do you need to know it all, or any of it? What’s a graph, a node, a blank node dammit, a triple, a statement?
  • Ontologies – isn’t this just a taxonomy ? Or a thesaurus ? Why do I need one, how does metadata fit into it ? Should I use RDFS or OWL or something else ? Is it XML ?
  • Artificial Intelligence, Machine Learning, Linguistics – what!? you mean this is robotics and grammar? where does it fit in – what’s going on, do I need to have a degree in cybernetics to make use of the semantic web? Am I really creating a knowledge base here and not a data repository? Or is it an information repository?
  • Linked Open Data – what does it actually mean – it seems simple enough ? Do I have to have a SPARQL endpoint or can I just embed some metadata in my documents and wait for them to be crawled. What do I want my application to be able to do in the context of Linked Open Data ? Do I need my own URIs ? How do I make or “coin” them ? How does that fit in with my ontology ? How do I host my data set so someone else can use it ? Surely there is best practice and examples for all this ?
  • Support and community – this seems very academic, and very big – does my problem fit into this? Why can I not just use traditional technologies that I know and love? Where are all the “users” and applications if this is so cool and useful and groundbreaking? Who can help me get comfortable, is anyone doing any work in this field? Am I doing the right thing? Help!

I’m going to describe these things before listing the tools I came across and ended up selecting for my applications. So – this is going to be a long post, but you can scan and skip the things you know already. Hopefully, you can get started more quickly than I did.

First the End

So you read and you read and come across tools and libraries and academic reports and W3C documents, and you see it has been going on for some time, that some things are available and current, others are available and dormant. Most are open source, thankfully, and you can get your hands on them easily, but where to start? What to try first – what is the core issue or risk to take on first? Is there enough to decide that you should continue?

What is my manager going to say to me when I start yapping on about all these unfamiliar things –

  • why do I need it ?
  • what problem is it solving ?
  • how will it make or save us money  ?
  • our information is our information – why would I want to make it public ?

Those are tough questions when you’re starting from scratch, and no one else seems to be using the technologies you think are cool and useful – who is going to believe you if you talk about sea-change, or “Web3.0”, or a paradigm shift, or an internet for “machines”? I believe you need to demonstrate-by-doing, and to get to the bottom of these questions so you know the answers before someone asks them of you. And you’d better end up believing what you’re saying, so that you are convincing and confident. It’s risky….*

So – off I go – here is what I found, in simple, probably technically incorrect terms – but you’ll get the idea and work out the details later (if you even need to)

*see my answers to these questions at the end of this section

Metadata, RDF/S, OWL, Ontologies

Coarsely, RDF allows you to write linked lists. URIs allow you to create unique identifiers for anything. If you use the same URI twice, you’re saying that exact Thing is the same in both places. You create the URIs yourself, or when you want to identify a thing (“john smith”) or a property of a thing (e.g. “loginId”) that already exists, you reuse the URI that you or someone else created. You may well have a URI for a concept or idea, and others for its physical forms – e.g. a URI for a person in your organisation, another for the webpage that shows his photo and telephone number, and another for his HR system details.

Imagine 3 columns in a spreadsheet called Subject, Predicate and Object. Imagine a statement like “John Smith is a user with loginId ‘john123’ and he is in the sales Department“. This ends up like

Subject Predicate Object
S-ID1 type User
S-ID1 name “John”
S-ID1 familyName “Smith”
S-ID1 loginId “john123”
S-ID2 type Department
S-ID2 name “sales”
S-ID2 member S-ID1

That is it, simply – RDF allows you to say that a Thing with an ID we call S-ID1 has properties, and that the values of those properties are either other Things (S-ID2/member/S-ID1) or literal things like the string “john123”.

So you can build a “graph” or a connected list of Things (nodes) where each Thing can be connected to another Thing. And once you look at one of those Things, you might find that it has other properties that link to different Things that you don’t know about or that aren’t related to what you are looking at – S-ID2 may have another “triple” or “statement” that links it with ID-99 say (another user) or ID-10039 (a car lot space, say). So you can wire up these graphs to represent whatever you want in terms of properties and values (Objects). A Subject, Property or Object can be a reference to another Thing.

Metadata are those properties you use to describe Things. And in the case of RDF each metadatum can be a Thing with its own properties (follow the property to its own definition), or a concrete fact – e.g. a string, a number. Why is metadata important? Because it helps you contextualise and find things, and to differentiate one thing from another even if they are called the same name. Some say “Content is King” but I say Metadata is!

RDF has some predefined properties like “property” and “type”. It’s pretty simple and you’ll pick it up easily [1]. Now RDFS extends RDF to add some more predefined properties that allow you to create a “schema” that describes your data or information – “class”, “domain”, “range”, “label”, “comment”. So if you start to formalise the relationships described above – a user has a name, familyName, loginId and so on – before you know it, you’ve got an ontology on your hands. That was easy, right? No cyborgs, logic bombs, T-Box or A-Box in sight (see the next section). And you can see the difference between an ontology and a taxonomy – the latter is a way of classifying or categorising things, but an ontology does that and also describes and relates them. So keep going, this isn’t hard! (Hindsight is great too.)

Next you might look at OWL because you need more expressiveness and control in your information model, and you find out that it has different flavours – DL, Lite, Full [2]. What do you do now? Well, happily, you don’t have to think about it too much, because it turns out that you can mix and match things in your ontology – use RDFS and OWL, and you can even use things from other ontologies. Mash it up – you don’t have to define these properties from scratch yourself. So go ahead and do it, and if you find that you end up in OWL Full instead of DL then you can investigate and see why. The point is: start, dig in and do what you need to do. You can revise and evolve at this stage.

A metadata specification called “Dublin Core” [3] comes up a lot – this is a useful vocabulary for describing things like “title”, “creator”, “relation”, “publisher”. Another, the XSD schema, is useful for defining things like number types – integer, long and float – and is used as part of SPARQL for describing literals. You’ll also find that there are properties of things that you thought were so common that someone would have an ontology or a property defined for them already. I had a hard time looking for a definition of old English miles, but luckily it turns out there was one [4,5]. On the other hand, there wasn’t one for a compass bearing of “North” – or at least one that I could find – so I invented one, because it seemed important to me. Not all things in your dataset will need metadata – and in fact you might find that you and someone working on another project have completely different views on what’s important in a dataset – you might be interested in describing financial matters, and someone else might be more interested in the location information. If you think about it long enough a question might come to mind – should we still maintain our data somewhere in canonical, raw or system-of-record form, and have multiple views of it stored elsewhere? (I don’t have an answer for that one yet.)

Once you start you soon see that the point of reusing properties from other ontologies is that you are creating connections between datasets and information just by using them – you may have a finance department that uses “creator”, and you can now link its records with records in the HR system for the same person – because the value used for the “creator” is in fact a unique URI (simply, an ID that looks like a URL), e.g. http://myCompany.com/people/john123. If you have another John in the company, he’ll have a different ID, e.g. http://myCompany.com/people/john911, so you can be sure that the link is correct and precise – no ambiguity – John123 will not get the payslip meant for John911. There are also other ways of connecting information – you could use owl:sameAs for instance – this makes a connection between two Things when a common vocabulary or ID is not available, or when you want to make a connection where one didn’t exist before. But think about these connections before you commit them to statements – the correctness, provenance and trust around that new connection has to be justifiable – you want your information and assertions about it to have integrity, right?

I needed RDF and RDFS at least – this would be the means by which I would express the definition and parameters of my concepts, and then also the statements to represent actual embodiments of those concepts – instances. It started that way, but I knew I might need OWL if I wanted more control over the structure and integrity of my information – e.g. to say that John123 could only be a member of one department and one department only, that he had a role of “salesman” but couldn’t also have a role of “paymaster”. So, if you need this kind of thing, read more about it [6,7]. If you don’t yet, just keep going, and you can still come back to it later. (It turns out I did, in fact.)

The table above now looks like this when you use URIs – it’s the same information, just written down in a way that ensures things are unique and connectable.

Namespaces
myCo: http://myCompany.com/people/
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs: http://www.w3.org/2000/01/rdf-schema#
foaf: http://xmlns.com/foaf/0.1/
Subject Predicate Object
myCo:S-ID1 rdf:type myCo:User
myCo:S-ID1 rdfs:label “John”
myCo:S-ID1 foaf:family_name “Smith”
myCo:S-ID1 myCo:loginId “john123”
myCo:S-ID2 rdf:type myCo:Department
myCo:S-ID2 rdfs:label “sales”
myCo:S-ID2 myCo:member myCo:S-ID1

The namespaces at the top of the table mean that you can use shorthand in the three columns and don’t have to repeat the longer part of the URI each time. It makes things easier to read and take in too, especially if you’re a simple human. For predicates, I’ve changed name to rdfs:label and familyName to foaf:family_name [8]. In the Object column only the myCo namespace is used – in the first case it points to a Subject with a type defined elsewhere (in the ontology in fact). I say the ontology is defined elsewhere, but that doesn’t have to be physically elsewhere; it’s not uncommon to have a file on disk that contains the RDF that defines the ontology but also contains the instances that make up the vocabulary or the information base.
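If you are coming at this from Java, a small Jena sketch (2.x packages) that builds the same statements with those namespace prefixes and prints them as Turtle might look like this:

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDF;
import com.hp.hpl.jena.vocabulary.RDFS;

public class TripleTable {
    public static void main(String[] args) {
        String myCo = "http://myCompany.com/people/";
        String foaf = "http://xmlns.com/foaf/0.1/";

        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("myCo", myCo);
        m.setNsPrefix("foaf", foaf);

        // the S-ID1 / S-ID2 statements from the table
        Resource user = m.createResource(myCo + "S-ID1")
                .addProperty(RDF.type, m.createResource(myCo + "User"))
                .addProperty(RDFS.label, "John")
                .addProperty(m.createProperty(foaf, "family_name"), "Smith")
                .addProperty(m.createProperty(myCo, "loginId"), "john123");

        m.createResource(myCo + "S-ID2")
                .addProperty(RDF.type, m.createResource(myCo + "Department"))
                .addProperty(RDFS.label, "sales")
                .addProperty(m.createProperty(myCo, "member"), user);

        m.write(System.out, "TURTLE");   // same statements, serialised as Turtle
    }
}
```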

So – why is this better than a database schema ? The simple broad answers are the best ones I think :

  • You have only 3 columns*
  • You can put anything you like in each column (almost – literals can’t be predicates (?)), and it’s possible to describe a binary property User->name->John as well as n-ary relationships [9] User->hasVehicle->car->withTransmission->automatic
  • You can define what you know about the things in those columns and use it to create a world view of things (a set of “schema rules”, an ontology).
  • You can (and should) use common properties – define a property called “address1” and use it in Finance and HR so you know you’re talking about the same property. But if you don’t, you can fix it later with some form of equivalence statement.
  • If there are properties on instances that aren’t in your ontology, they don’t break anything, but they might give you a surprise – this is called an “open world assumption” – that is to say just because it is not defined does not mean it cannot exist – this is a key difference from database schema modelling.
  • You use the same language to define different ontologies, rather than say MySQL DDL for one dataset and Oracle DDL for another
  • There is one language spec for querying any repository – SPARQL **. You use the same for yours and any others you can find – and over Http – no firewall dodging, no operations team objections, predictable, quick and easy to access
  • You do not have to keep creating new table designs for new information types
  • You can easily add information types that were not there before while preserving older data or facts
  • You can augment existing data with new information that allows you to refine it or expand it – eg provide aliases that allow you to get around OCR errors in extracted text, alternative language expressions
  • Any others ?

*Implementations may add one or two more, or break things up into partitioned tables for contextual or performance reasons
**there are different extensions in different implementations

[1] http://rdfabout.net/
[2] http://www.w3.org/TR/owl-features/
[3] http://dublincore.org/
[4] http://forge.morfeo-project.org/wiki_en/index.php/Units_of_measurement_ontology
[5] http://purl.oclc.org/NET/muo/ucum-instances.owl
[6] http://www.cs.man.ac.uk/~horrocks/ISWC2003/Tutorial/
[7] http://www.w3.org/TR/owl-ref/#sameAs-def
[8] http://xmlns.com/foaf/spec/
[9] http://www.w3.org/TR/swbp-n-aryRelations/

SPARQL, Description Logic (DL), Ontologies

SPARQL [10] aims to allow those familiar with querying relational data to query graph data without too much introduction. It’s not too distant, but needs a little getting used to. “Select * from users” looks like “SELECT * WHERE {?s rdf:type myCo:User}”, and then you get back two types of information rather than every column from a table. Of course this is because you effectively have 3 “columns” in the graph data and they’re populated with a bunch of different things. So you need to dig deeper [11] into tutorials and what others have written. [12,13]
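Running that query with Jena’s ARQ, for example, could look like the sketch below; the data file name is illustrative:

```java
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class FindUsers {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("file:people.ttl", "TURTLE");          // hypothetical data file

        String sparql =
            "PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
            "PREFIX myCo: <http://myCompany.com/people/> " +
            "SELECT * WHERE { ?s rdf:type myCo:User }";

        Query query = QueryFactory.create(sparql);
        QueryExecution qe = QueryExecutionFactory.create(query, model);
        try {
            ResultSet results = qe.execSelect();
            ResultSetFormatter.out(System.out, results, query);  // tabular dump of bindings
        } finally {
            qe.close();
        }
    }
}
```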

One of the key things about SPARQL is that you can use it to find out what is in the graph data without having any idea beforehand. [14] You can ask to find the types of data available, then ask for the properties of the types, then DESCRIBE or select a range of types for identified subjects. So it’s possible to discover what’s available to suit your needs, or for anyone else to do the same with your data.

Another useful thing is the ability (for some SPARQL engines – Jena’s ARQ [15] comes to mind) to federate queries, either by using a “graph” (effectively just a named set of triples) that is a URI to a remote dataset, or by using (in Jena’s case) the SERVICE keyword. So you can have separate and independent datasets and query across them easily. Sesame [16] allows a similar kind of thing with Federated Sail, but you predefine the federation you want rather than specify it in-situ. Beware of runtime network calls in the Jena case, and consider hosting your independent data in a single store but under different graphs to avoid them. You’ll need more memory in one instance, but you should get better performance. And watch out for JVM memory limits and type size increases if you (probably) move to a 64-bit JVM. [17,18]
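A hedged sketch of the SERVICE form, against an illustrative remote endpoint:

```java
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class FederatedQuery {
    public static void main(String[] args) {
        Model local = ModelFactory.createDefaultModel();
        local.read("file:people.ttl", "TURTLE");              // hypothetical local data

        // SERVICE sends part of the pattern to a remote SPARQL endpoint at query
        // time (the endpoint URL is illustrative) - note the runtime network call
        String sparql =
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
            "SELECT ?name ?interest WHERE { " +
            "  ?person foaf:name ?name . " +
            "  SERVICE <http://example.org/sparql> { ?person foaf:interest ?interest } " +
            "}";

        QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(sparql), local);
        try {
            ResultSetFormatter.out(qe.execSelect());
        } finally {
            qe.close();
        }
    }
}
```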

While learning the syntax of SPARQL isn’t a huge matter, understanding that you’re dealing with a graph of data and having to navigate or understand that graph beforehand can be a challenge, especially if it’s not your data you want to federate or link with. Having ontologies and sample data (from your initial SPARQL queries) helps a lot, but it can be like trying to understand several foreign database schemas at once, visualising a chain rather than a hierarchy, taking on multiple inheritance and perhaps cardinality rules, domain and range restrictions and maybe other advanced ontology capabilities.

SPARQL engines, or the libraries they are built on, that allow inferencing provide a unique selling point for the Semantic and Linked web of data. Operations you cannot easily do in SQL become possible. Derived statements – information that is not actually “asserted” in the physical data you loaded into your repository – start to appear. You might for instance ask for all Subjects or things of a certain type. If the ontology of the information set says that one type is a subclass of another – say you ask for “cars” – then you’ll get back statements that say your results are cars, but you’ll also get statements saying they are also “vehicles”. If you did this with an information set that you were not familiar with, say a natural history dataset, then when you ask for “kangaroos” you are also told that each one is an animal, a kangaroo, and a marsupial. The animal statement might be easy to understand, but perhaps you expected that it was a mammal. And you might not have expressly said that a kangaroo was one or the other.
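A minimal Jena sketch of that kind of subsumption reasoning, using the built-in RDFS reasoner and an illustrative zoo namespace:

```java
import com.hp.hpl.jena.rdf.model.InfModel;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDF;
import com.hp.hpl.jena.vocabulary.RDFS;

public class Subsumption {
    public static void main(String[] args) {
        String ns = "http://example.org/zoo#";           // illustrative namespace
        Model base = ModelFactory.createDefaultModel();

        Resource animal    = base.createResource(ns + "Animal");
        Resource marsupial = base.createResource(ns + "Marsupial")
                                 .addProperty(RDFS.subClassOf, animal);
        Resource kangaroo  = base.createResource(ns + "Kangaroo")
                                 .addProperty(RDFS.subClassOf, marsupial);

        // only one type is asserted for the instance...
        Resource skippy = base.createResource(ns + "skippy").addProperty(RDF.type, kangaroo);

        // ...but the RDFS reasoner derives the rest
        InfModel inf = ModelFactory.createRDFSModel(base);
        System.out.println(inf.contains(skippy, RDF.type, marsupial)); // true (derived)
        System.out.println(inf.contains(skippy, RDF.type, animal));    // true (derived)
    }
}
```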

Once you get back results from a SPARQL query you can start to explore – you start looking for kangaroos, then you follow the marsupial link, and you end up with Opossum, then you see it’s in the USA and not Australia, and you compare the climates of the two continents. Alternatively of course, you may have started at the top end – asked for marsupials, and you get back all the kangaroos and koalas etc. – then you drill down into living environment and so on. Another scenario deals with disambiguation – you ask for statements about eagles and the system might return you things named eagles, but you’ll be able to see that one is a band, one is a US football team, and the other a bird of prey. Then you might follow links up or down the classifications in the ontology.

Some engines have features or utilities that allow you to “forward-chain” [19] statements before loading – this means that, using an ontology or a reasoning engine based on a language specification, derived statements about things are asserted and materialised for you before you load them into your repository. This is not only to do with class hierarchy; even where a hierarchy isn’t explicit, inference might create a statement – “if a Thing has a title, pages and a hardback cover, then it is … a book”. This saves the effort at runtime and should mean that you get a faster response to your query. Forward chaining (and backward chaining [20]) are common reasoning methods used with inference rules in Artificial Intelligence and Logic systems.

It turns out Description Logic or “DL” [21] is what we are concerned with here – a formal way of expressing or representing knowledge – things have properties that have certain values. OWL is a DL representation, for instance. And like object-oriented programming languages – Java, say – there are classes (ontology, T-Box statements) and instances (A-Box, instances, vocabularies). There are also notable differences from Java (e.g. multiple inheritance or typing) and a higher level of formalism, and these can make mapping between your programming language and your ontology or modelling difficult or problematic. For some languages, Prolog or Lisp say, this mapping may not be such a problem, and indeed you’ll find many semantic tools and technologies built using them.

Despite the fact that DL and AI can get quite heady once you start delving into these things, it is easy to start with the understanding that they allow you to describe or model your information expressively and formally without being bound to an implementation detail like the programming language you’ll use, and that once you do implement and make use of your formal knowledge representation – your ontology – hidden information and relationships may well become clear where they may not have been before. Doing this with a network of information sets means that the scope of discovery and fact is broadened – for your business, this may well be the difference between a sale or not, or provide a competitive edge in a crowded market.

[10] http://www.w3.org/TR/rdf-sparql-query/
[11] http://www.w3.org/2009/sparql/wiki/Main_Page
[12] http://www.ibm.com/developerworks/xml/library/j-sparql/
[13] http://en.wikibooks.org/wiki/XQuery/SPARQL_Tutorial
[14] http://dallemang.typepad.com/my_weblog/2008/08/rdf-as-self-describing-data.html
[15] http://openjena.org/ARQ/
[16] http://wiki.aduna-software.org/confluence/display/SESDOC/Federation
[17] http://jroller.com/mert/entry/java_heapconfig_32_bit_vs
[18] http://portal.acm.org/citation.cfm?id=1107407
[19] http://en.wikipedia.org/wiki/Forward_chaining
[20] http://en.wikipedia.org/wiki/Backward_chaining
[21] http://en.wikipedia.org/wiki/Description_logic

Artificial intelligence, machine learning, linguistics

When you come across Description Logic and the Semantic Web in the context of identifying “things” or entities in documents – for example the name of a company or person, a pronoun or a verb – you’ll soon be taken back to memories of school – grammar, clauses, definite articles and so on. And you’ll grow to love it, I’m sure, just like you used to 🙂
It’s a necessary evil, and it’s at the heart of one side of the semantic web – information extraction (“IE”) as a part of information retrieval (“IR”) [22,23]. Here, we’re interested in the content of documents, tables, databases, pages, excel spreadsheets, pdfs, audio and video files, maps, etc etc. And because these “documents” are written largely for human consumption, in order to get at the content using “a stupid machine”, we have to be able to tell the stupid machine what to do and what to look for – it does not “know” about language characteristics – what the difference is between a noun and a verb – let alone how to recognise one in a stream of characters, with variations in position, capitalisation, context and so on. And what if you then want to say that a particular noun, used a particular way, is a word about “politics” or “sport”; that it’s English rather than German; that it refers to another word two words previous, and that it’s qualified by an adjective immediately after it? This is where Natural Language Processing (NLP) comes in.

You may be familiar with tokenising a string in a high-level programming language, then writing a loop to look at each word and do something with it. NLP will do this kind of thing but apply more sophisticated abstractions, actions and processing to the tokens it finds, even having a rule base or dictionary of tokens to look for, or allowing a user to dynamically define what that dictionary or gazetteer is. Automating this is where Machine Learning (ML) comes in. Combined, and making use of mathematical modelling and statistical analysis, they look at sequences of words and then make a “best guess” at what each word is, and tell you how good that guess is.

You may need (probably) to “train” the machine learning algorithm or system with sample documents – manually identify and position the tokens you are interested in, tag them with categories (perhaps these categories themselves are from a structured vocabulary you have created, found, or bought) and then run the “trained” extractor over your corpus of documents. With luck, or actually, with a lot of training (maybe 20%-30% of the corpus size), you’ll get some output that says “rugby” is a “sports” term and “All Blacks” is a “rugby team”. Now you have your robot, your artificial intelligence.

But the game is not up yet – for the Semantic and Linked web, you now have to do something with that output – organise and transform it into RDF – a related set of extracted entities – relate one entity to another into a statement, “all blacks”-“type”-“rugby team”, and then collect your statements into a set of facts that mean something to you, or the user for whom you are creating your application. This may be defined or contextualised by some structure in your source document, but it may not be – you may have to provide an organising structure. At some point you need to define a start – a Subject you are going to describe – and one of the Subjects you come up with will be the very beginning or root Thing of your new information base. You may also consider using an online service like OpenCalais [24], but you’re then limited to the range of entities and concepts that those services know about – in OpenCalais’ case it’s largely business and news topics – wide-ranging for sure, but if you want to extract information about rugby teams and matches it may not be too successful. (There are others available and more becoming available.) In my experience, most often and for now, you’ll have to start from scratch, or as near as damn-it. If you’re lucky there may be a set or list of terms for the concept you are interested in, but it’s a bit like writing software applications for business – no two are the same, even if they have the same pattern. Unlike software applications though, this will change over time – assuming that people will publish their ontologies, taxonomies, term sets, gazetteers and thesauri. Let’s hope they do, but get ready to pay for them as well – they’re valuable stuff.

So

  1. Design and Define your concepts
    1. Define what you are interested in
    2. Define what things represent what you are interested in
    3. Define how those things are expressed – the terms, relations, ranges and so on – you may need to build up a gazetteer or thesaurus
    4. Understand how and where those things are used – the context, frequency, position
  2. Extract the concepts and metadata
    1. Now tell the “machine” about it, in fact, teach it what you know and what you are interested in – show it by example, or create a set of rules and relations that it understands
    2. Teach it some more – the more you tell it, the more variety, the more examples and repetition you can throw at it, the better the quality of results you’ll get
    3. Get your output – do you need to organise the output, do you have multiple files and locations where things are stored, do you need to feed the results from the first pass into your next one?
    4. Fashion some RDF
    5. Create URIs for your output – perhaps the entities extracted are tagged with categories (that you provided to the trained system) or with your vocabulary, or perhaps not – but now you need to get from this output to URIs, Subjects, Properties, Objects – to match your ontology or your concept domain. Relate and collect them into “graphs” of information, into RDF.
    6. Stage them somewhere – on a filesystem, say (one file, thousands? versions, dates? tests and trials, final runs; spaces, capitalisation, reserved characters, encoding – it’s the web after all)
  3. Make it accessible
    1. Find a repository technology you like – if you don’t know, if it’s your first time, pick one – suck it and see – if you have RDF on disk you might be able to use that directly (maybe slower than an online optimised repository). Initialise it, get familiar with it, consider size and performance implications. Do you need backup?
    2. Load your RDF into the repository. (Or perhaps you want to modify some existing html docs you have with the metadata you’ve extracted – RDFa probably)
    3. Test that what you’ve loaded matches what you had on disk – you need to be able to query it – how do you do that? Is there a command-line tool – does it do SPARQL? What about if you want to use it on the web – this is the whole point, isn’t it? Is there a SPARQL endpoint – do you need to set up Tomcat or Jetty, say, to talk to your repository?
  4. Link it
    1. And what about those URIs – you have URIs for your concept instances (“All Blacks”), URIs for their properties (“rdf:type”), and URIs for the Object of those properties (“myOnt:Team”). What happens now – what do you do with them? If they’re for the web, if they’re URIs, shouldn’t I be able to click on them? (Now we’re talking Linked Data – see next section.)
    2. Link your RDF with other datasets (see next section) if you want to be found, to participate, and to add value by association, affiliation, connection – the network effect – the knowledge and the value (make some money, save some money)
  5. Build your application
    1. Now create your application around your information set. You used to have data, now you have information – your application turns that into knowledge and intelligence, and perhaps profit.

There are a few tools to help you in all this (see below) but you’ll find that they don’t do everything you need, and they won’t generate RDF for you without some help – so roll your sleeves up. Or – don’t – I decided against it, having looked at the amount of work involved in learning all about NLP & ML, in the arcane science (it’s new to me), in the amount of time needed to set up training and the quality of the output. I decided on the KISS principle – “Keep It Simple, Stupid” – so instead I opted to write something myself, based on grep!

I still had to do 1-5 above, but now I had to write my own code to do the extraction and “RDFication”. It also meant I got my hands dirty and learned hard lessons by doing rather than by reading or trusting someone else’s code that I didn’t understand. And the quality of the output and the meaning of it was still all in my control. It is not real Machine Learning, it’s still in the tokenisation world I suppose, but I got what I wanted and in the process made something I can use again. It also gave me practical and valuable experience so that I can revisit the experts’ tools with a better perspective – not so daunting, more comfortable and confident, something to compare to, patterns to witness and create, less to learn and take on, and, importantly, a much better chance of actual, deliverable success.
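Not the code I wrote, but a minimal sketch of the flavour of it: a regex “gazetteer” run over a document, minting URIs and asserting typed statements with Jena. The namespace and the terms are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDF;
import com.hp.hpl.jena.vocabulary.RDFS;

public class GrepExtractor {
    public static void main(String[] args) {
        String ns = "http://example.org/rugby#";        // illustrative namespace
        // the "gazetteer": terms we already know, and the type we want to assign them
        Pattern teams = Pattern.compile("\\b(All Blacks|Wallabies|Springboks)\\b");

        String doc = "The All Blacks beat the Wallabies in the final.";

        Model m = ModelFactory.createDefaultModel();
        Resource teamClass = m.createResource(ns + "Team");

        Matcher match = teams.matcher(doc);
        while (match.find()) {
            String term = match.group(1);
            // mint a URI from the matched term and assert a typed, labelled Subject
            m.createResource(ns + term.replace(' ', '_'))
             .addProperty(RDF.type, teamClass)
             .addProperty(RDFS.label, term);
        }
        m.write(System.out, "TURTLE");                  // stage the RDF for loading
    }
}
```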

It was quite a decision to take – it felt dirty somehow – all that knowledge and science bound up in those tools, it was a shame not to use it – but I wanted to learn and to fail in some ways, I didn’t want to spend weeks training a “machine”, and it seemed better to fail with something I understood (grep) rather than take on a body of science that was alien. In the end – I succeeded – I extracted my terms with my custom automated grep-based extractor, and I created RDF and loaded it into a repository. It’s not pretty, but it worked – I have gained lots of experience, and I know where to go next. I recommend it.

Finally, it’s worth noting here the value-add components

  • ontologies – domain expertise written down
  • vocabularies – these embody statements of knowledge
  • knowledge gathering – collecting a disparate set of facts, or describing and assembling a novel perspective
  • assurance, provenance, trust – certifying and guaranteeing levels of correctness and origin
  • links – connections, relationships, ranges, boundaries, domains, associations – the scaffolding of the brains !
  • the application – a means to access, present and use that knowledge to make decisions and choices

How many business opportunities are there here ?

[22] http://en.wikipedia.org/wiki/Information_extraction
[23] http://en.wikipedia.org/wiki/Information_retrieval
[24] http://www.opencalais.com/

Linked Open Data

Having googled and read the W3C docs [25-30] on Linked Open Data, it should become clear that the advantages of Linked Open Data are many.

  • Addressable Things – every Thing has an address, namespaces avoid conflicts, there are different addresses for concepts and embodiments
  • Content for purpose – if you’re a human you get html or text (say); if you’re a machine or program you get structured data
  • Public vocabulary – if there is a definition for something then you can reuse it, or perhaps even specialise it. If not, you can make one up and publish it. If you use a public definition, type, or relationship, and someone else does as well, then you can join or link across your datasets
  • Open links – a Thing can point to or be pointed at from multiple places, sometimes with different intent.

Once these are part of your toolkit you can go on to create data level mashups which add value to your own information. As use cases, consider how and why you might link that to other datasets for

  • a public dataset that contains the schedules of buses in a city, and their current positions while on that schedule – the essence of this is people, place and time – volumes, locations at point in time, over periods of time, at start and end times.
    • You could create an app based around this data and linkages to Places of Interest based on location : a tourist guide based on public transport.
    • How about an “eco” application comparing the bus company statistics over time with that of cars and taxis, and cross reference with a carbon footprint map to show how Public Transport compares to Private Transport in terms of energy consumption per capita ?
    • Or add the movements of buses around the city to data that provides statistics on traffic light sequences for a journey planner ?
    • Or a mashup of route numbers, planning applications and house values ?
    • Or a mash up that correlates the sale of newspapers, sightings of wild foxes and bus routes – is there a link ??? 🙂
The point is, it might look like a list of timestamps and locations, but it’s worth multiple times that when added to another dataset that may be available – and this may be of interest to a very small number of people with a very specific interest, or to a very large number of people in a shared context – the environment, value for money etc. And of course, even where the data may seem to be sensitive, or where making it all available in one place seems risky, it is a way of reaching out to customers, of being open and communicative – building a valuable relationship and a new level of trust, both within the organisation and without.
  • a commercial dataset such as a parts inventory used within a large manufacturing company: the company decides to publish it using URIs and a SPARQL endpoint. It tells its internal and external suppliers that its whole catalog is available – the name, ID, description and function of each part. Now a supplier can see which parts it sells to the company, as well as the ones it doesn’t, but could. Additionally, if it can also publish its own catalog, and correlate with owl:sameAs or by going further and agreeing to use a common set of IDs (URIs), then the supply chain between the two can be made more efficient. A “6mm steel bolt” isn’t the same as a “4mm steel rod with a 6mm bolt-on protective cap” (the keywords match) – but with shared IDs there is no ambiguity about which part is meant.
And if the manufacturing company can now design new products using the shared URI scheme and reference the inventory publication and the supply chain system built around it, then it can probably be a lot more accurate about costs, delivery times, and ultimately profit. And imagine if the URI scheme used by the manufacturer and the supplier were also an industry-wide scheme – if its competitors and other suppliers also used it? It hasn’t given the crown jewels away, but it has been able to leverage a common information base to save costs, improve productivity and resource efficiency, and drive profit.

So Linked Open Data has value in the public and private sector, for small, medium and large scale interests. Now you just need to build your application or service to use it. How you do that is based around the needs and wishes you have of course. The 5 stars[26] of Linked Open Data mean that you can engage at a small level and move up level by level if you need to or want to.

At the heart of things is a decision to make your information available in a digital form, so that it can be used by your employees, your customers, or the public. (You could even do this at different publishing levels, or add authorization requirements to different data sets.) So you decide what information you want to make available, and why. As already stated, perhaps this is a new means to communicate with your audience, or a new way to engage with them, or it may be a way to drive your business. You may be contributing to the public information space. Either way, if you want it to be linkable, then you need URIs and a vocabulary. So you reuse what you can from RDF/RDFS/FOAF/DC/SIOC etc etc, and then you see that you actually have a lot of valuable information that you want to publish but that has not been codified into an information scheme, an ontology. What do you do – make one up! You own the information and what it means, so you can describe it in basic terms (“it’s a number”) or from your perspective (“it’s a degree of tolerance in a bolt thread”). And if you can, you should also now try and link this to existing datasets or information that’s already out there – a dbPedia or Freebase information item say, or an IETF engineering terminology (is there one?), or a location expressed in Linked Open Data format (wgs84 [31]), or a book reference on Gutenberg [32,33], or a set of statistics from a government department [34,35], or a medical digest entry from Pubmed or the like [36] – and there are many more [37]. Some of these places are virtual hubs of information – dbPedia for instance has many, many links to other data sets – if you create a link from yours to it, then you are also adding all dbPedia’s links to your dataset – all the URIs are addressable after all, and as they are all linked, every connection can be discovered and explored.

This part of the process will likely be the most difficult, but also the most rewarding and valuable part, because the rest of it is mostly “clerical” – once you have the info, and you have a structured way of describing it, of providing addresses for it (URIs) – now you “just” publish it. Publishing is a spectrum of things: it might mean creating CSV files of your data that you make available; that you “simply”* embed RDFa into your next web page refresh cycle (your pages are your data, and google is how people find it, your “API”); or it might mean that you go all the way and create a SPARQL endpoint with a content-negotiating gateway into your information base, and perhaps also a VoID [38] page that describes your dataset in a structured, open-vocabulary, curatable way [45].
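The content-negotiating gateway end of that spectrum can be quite small. A hedged sketch with a plain servlet follows; the /data and /page paths and the URI layout are just one way of doing the 303 redirect pattern described in the Cool URIs document [30]:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical gateway: /resource/{id} identifies the Thing itself; a 303 redirect
// sends humans to an HTML page and machines to the RDF description of it.
public class UriGatewayServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String id = req.getPathInfo();                       // e.g. "/john123"
        String accept = req.getHeader("Accept");

        String target = (accept != null && accept.contains("application/rdf+xml"))
                ? "/data" + id                               // machine-readable description
                : "/page" + id;                              // human-readable page

        resp.setStatus(303);                                 // 303 See Other
        resp.setHeader("Location", target);
    }
}
```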

* This blog doesn’t contain RDFa because it’s just too hard to do – wordpress.com doesn’t have the plugins available, and the wordpress.org plugins may be limited for what you want to do. Drupal7 [50] does a better job, and Joomla [51] may get there in the end.

Service Oriented Architecture, Semantic Web Services

There are some striking similarities and echoes of Service Oriented Architecture (SOA) in Linked Open Data. It would seem to me that using Linked Open Data technologies and approaches with SOA is something that may well happen over time, organically if not formally. Some call this Semantic Web Services [49] or Linked Services [46]. It is not possible to discuss this here and now (it’s complex, large in scope, emerging), but I can attempt to say how a LOD approach might work with SOA.

How some SOA technologies line up against their Linked Open Data counterparts (APIs, services, data), with a comment on each:

  • REST, SOAP – Some love it, some hate it. I fall into the latter camp – SOAP has so much complexity and overhead compared to REST that it’s hard to live with, even before you start coding for it. LOD has lots too, but it starts with a more familiar base. And this kind of sums up SOA for me and for a lot of people. But then again, because I’ve avoided it, I may be missing something. REST, in the Linked Open Data world, allows all that can be accomplished with SOAP to be done at a lower cost of entry, and with quicker results. If REST were available with SOA services then SOA might be more attractive, I believe.
  • VoID [38,39], WSDL [40] – WSDL and VoID fill the same kind of need: for SOA, WSDL describes a service; in the Linked Open Data world, VoID describes a dataset and its linkages. That dataset may be dynamically generated though (the 303 redirect architecture means that anything can happen, including dynamic generation and content negotiation), so it is comparable to a business process behind a web-based API.
  • BPM, UDDI [41,42,43], CPoA [44,45], discovery, orchestration, eventing, service bus, collaboration – This is where things start to really differ. UDDI is a fairly rigid means of registering web services, while the Linked Open Data approach (e.g. CPoA) is more do-it-yourself and federated. Whether this latter approach is good enough for businesses that have SLA agreements with customers is yet to be seen, but there’s certainly nothing to stop orchestration, process engineering and eventing being built on top of it. Combine these basics with semantic metadata about services, datasets, registries and availability, and it’s possible to imagine a robust, commercially viable internet of services and datasets, both open and on easily adopted standards.
  • SLA, identity, trust, provenance, quality, ownership, licensing, privacy, governance – The same issues that dog SOA remain in the Linked Open Data world, and arguably, because of the ease of entry and the openness, they are more of an issue there. But the Linked Open Data world is new in comparison to SOA and can leverage learnings from it and correct problems. It’s also not a prescriptive base to start from, so there is no need for a simple public data set to have to implement heavy or commercially oriented APIs where they are not needed.

Whatever happens it should be better than what was there before, IMO. SOA is too rigid and formal to be an internet scale architecture, and one in which ad-hoc organic participation is possible, or perhaps the norm, but also one in which highly structured services and processes can be designed, implemented and grown. With Linked Open Data, some more research and work (from both academia and people like you and me, and big industry if it likes !) the ultimate goals of process automation, personalisation (or individualisation), and composition and customisation gets closer and closer, and may even work where SOA seems to have stalled.[46,47,48]

[25] http://www.w3.org/DesignIssues/Semantic.html
[26] http://www.w3.org/DesignIssues/LinkedData.html
[27] http://ld2sd.deri.org/lod-ng-tutorial/
[28] http://esw.w3.org/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
[29] http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
[30] http://www.w3.org/TR/cooluris/
[31] http://en.wikipedia.org/wiki/World_Geodetic_System
[32] http://www.gutenberg.org/wiki/Gutenberg:Information_About_Linking_to_our_Pages#Canonical_URLs_to_Books_and_Authors
[33] http://www.gutenberg.org/wiki/Gutenberg:Feeds
[34] https://svn.eionet.europa.eu/projects/Reportnet/wiki/SparqlEurostat
[35] http://semantic.ckan.net/sparql
[36] http://www.freebase.com/view/user/bio2rdf/public/sparql
[37] http://lists.w3.org/Archives/Public/public-sparql-dev/2010OctDec/0000.html
[38] http://semanticweb.org/wiki/VoiD
[39] http://lists.w3.org/Archives/Public/public-lod/2010Feb/0072.html
[40] http://www.w3.org/TR/wsdl
[41] http://www.w3schools.com/WSDL/wsdl_uddi.asp
[42] http://www.omii.ac.uk/docs/3.2.0/user_guide/reference_guide/uddi/what_is_uddi.htm
[43] http://www.w3.org/TR/2006/WD-ws-policy-20060731/
[44] http://webofdata.wordpress.com/2010/02/
[45] http://void.rkbexplorer.com/
[46] http://stefandietze.files.wordpress.com/2010/10/dietze-et-al-integratedapproachtosws.pdf
[47] http://rapporter.ffi.no/rapporter/2010/00015.pdf
[48] http://domino.research.ibm.com/comm/research_projects.nsf/pages/semanticwebservices.research.html
[49] http://www.serviceweb30.eu/cms/
[50] http://semantic-drupal.com/
[51] http://semanticweb.com/drupal-may-be-the-first-mainstream-semantic-web-winner_b568

Having googled and read the W3C docs on Linked Open Data [25-30], it should become clear that the advantages of Linked Open Data are many:

  • Addressable Things – every Thing has an address, namespaces avoid conflicts, and there are different addresses for concepts and for their embodiments
  • Content for purpose – if you're a human you get HTML or text (say); if you're a machine or program you get structured data (see the sketch below)
  • Public vocabulary – if a definition for something already exists you can reuse it, or perhaps even specialise it; if not, you can make one up and publish it. If you use a public definition, type or relationship, and someone else does as well, then you can join or link across your datasets
  • Open links – a Thing can point to, or be pointed at from, multiple places, sometimes with different intent.
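
To make the "content for purpose" point concrete, here is a minimal sketch (plain Java, no libraries) that asks one Linked Data URI for two different representations simply by changing the Accept header. The dbpedia URI is only an example, and the sketch assumes the server still answers plain HTTP and honours the header; error handling is omitted.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ContentNegotiationDemo {

    // Fetch one representation of a Thing, chosen purely by the Accept header.
    static String fetch(String uri, String mediaType) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
        conn.setRequestProperty("Accept", mediaType);
        conn.setInstanceFollowRedirects(true); // follow the 303 to the chosen representation
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return conn.getContentType() + " : " + body.length() + " chars";
    }

    public static void main(String[] args) throws Exception {
        String thing = "http://dbpedia.org/resource/Dublin"; // any conneg-aware Linked Data URI
        System.out.println(fetch(thing, "text/html"));   // a page for humans
        System.out.println(fetch(thing, "text/turtle")); // structured data for programs
    }
}
```

Same URI, two representations: the "address" identifies the Thing, and the negotiation decides how it is rendered for the consumer.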

Once these are part of your toolkit you can go on to create data-level mashups which add value to your own information. As use cases, consider how and why you might link your data to other datasets:

  • a public dataset that contains the schedules of buses in a city, and their current positions while on that schedule – the essence of this is people, place and time: volumes, locations at a point in time, over periods of time, at start and end times.
    • You could create an app based around this data and its linkages to Places of Interest based on location: a tourist guide built on public transport (see the SPARQL sketch after this list).
    • How about an "eco" application comparing the bus company's statistics over time with those of cars and taxis, cross-referenced with a carbon footprint map, to show how public transport compares to private transport in terms of energy consumption per capita?
    • Or add the movements of buses around the city to data that provides statistics on traffic light sequences, for a journey planner?
    • Or a mashup of route numbers, planning applications and house values?
    • Or a mashup that correlates the sale of newspapers, sightings of wild foxes and bus routes – is there a link ??? 🙂
The point is, it might look like just a list of timestamps and locations, but it is worth many times that when added to another dataset that may be available – and that may be of interest to a very small number of people with a very specific interest, or to a very large number of people in a shared context (the environment, value for money, and so on). And of course, even where the data may seem sensitive, or where making it all available in one place feels uncomfortable, publishing it is a way of reaching out to customers and of being open and communicative – building a valuable relationship and a new level of trust, both within the organisation and without.
  • a commercial dataset such as a parts inventory used within a large manufacturing company: the company decides to publish it using URIs and a SPARQL endpoint. It tells its internal and external suppliers that its whole catalog is available – the name, ID, description and function of each part. Now a supplier can see which parts it already sells to the company, as well as the ones it doesn't yet, but could. Additionally, if the supplier can publish its own catalog and correlate it with owl:sameAs (a minimal sketch of such a link follows below) – or go further and agree to use a common set of IDs (URIs) – then the supply chain between the two can be made more efficient. A "6mm steel bolt" isn't the same as a "4mm steel rod with a 6mm bolt-on protective cap": the keywords match, but the URIs don't, so there is no ambiguity about which part is meant.
And if the manufacturing company can now design new products using the shared URI scheme, and reference the inventory publication and supply chain system built around it, then it can probably be a lot more accurate about costs, delivery times, and ultimately profit. Imagine if the URI scheme used by the manufacturer and the supplier were also an industry-wide scheme – if its competitors, and other suppliers, also used it? The company hasn't given the crown jewels away, but it has been able to leverage a common information base to save costs, improve productivity and resource efficiency, and drive profit.
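
As a sketch of the bus/tourist-guide mashup above: a single SPARQL query, run here with a recent Apache Jena, joining stops to nearby attractions. The endpoint URL and the transit:/poi: vocabulary are entirely hypothetical; only the wgs84 geo terms [31] are a real public vocabulary.

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class BusStopMashup {
    public static void main(String[] args) {
        String sparql =
            "PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> " +
            "PREFIX transit: <http://example.org/transit#> " +   // hypothetical vocabulary
            "PREFIX poi: <http://example.org/poi#> " +            // hypothetical vocabulary
            "SELECT ?stop ?lat ?long ?place WHERE { " +
            "  ?stop a transit:BusStop ; geo:lat ?lat ; geo:long ?long . " +
            "  ?place a poi:Attraction ; poi:nearStop ?stop . " +
            "}";
        Query query = QueryFactory.create(sparql);
        // A single endpoint is assumed; in practice the two datasets might sit behind
        // different endpoints and be joined with SERVICE clauses instead.
        try (QueryExecution qe =
                 QueryExecutionFactory.sparqlService("http://example.org/sparql", query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("place") + " is near " + row.get("stop"));
            }
        }
    }
}
```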
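
And for the parts-inventory case, a sketch (again with Jena) of the owl:sameAs link the supplier might publish. Both part URIs are made up; the point is that the correlation becomes an explicit, queryable statement rather than a fuzzy keyword match.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.OWL;
import org.apache.jena.vocabulary.RDFS;

public class PartCatalogLink {
    public static void main(String[] args) {
        Model supplierCatalog = ModelFactory.createDefaultModel();

        Resource manufacturerPart =
            supplierCatalog.createResource("http://manufacturer.example.com/parts/BOLT-6MM-STEEL");
        Resource supplierPart =
            supplierCatalog.createResource("http://supplier.example.com/catalog/item/99871");

        supplierPart.addProperty(RDFS.label, "6mm steel bolt");
        // The supplier asserts that its item 99871 is the same thing as the
        // manufacturer's BOLT-6MM-STEEL, so the two catalogs can be joined exactly.
        supplierPart.addProperty(OWL.sameAs, manufacturerPart);

        // Publish (here: just print) as Turtle.
        supplierCatalog.write(System.out, "TURTLE");
    }
}
```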

So Linked Open Data has value in the public and private sectors, for small, medium and large scale interests. Now you just need to build your application or service to use it. How you do that depends, of course, on your own needs and wishes. The 5 stars [26] of Linked Open Data mean that you can engage at a small level and move up, level by level, if you need or want to.

At the heart of things is a decision to make your information available in digital form, so that it can be used by your employees, your customers, or the public. (You could even do this at different publishing levels, or add authorization requirements to different datasets.) So you decide what information you want to make available, and why. As already stated, perhaps this is a new means to communicate with your audience, or a new way to engage with them, or it may be a way to drive your business. You may be contributing to the public information space. Either way, if you want it to be linkable, then you need URIs and a vocabulary.

So you reuse what you can from RDF/RDFS/FOAF/DC/SIOC etc, and then you see that you actually have a lot of valuable information that you want to publish but that has not been codified into an information scheme, an ontology. What do you do? Make one up! You own the information and what it means, so you can describe it in basic terms ("it's a number") or from your perspective ("it's a degree of tolerance in a bolt thread"). And if you can, you should also now try to link this to existing datasets or information that's already out there – a dbPedia or Freebase information item say, or an IETF engineering terminology (is there one?), or a location expressed in Linked Open Data format (wgs84 [31]), or a book reference on Gutenberg [32,33], or a set of statistics from a government department [34,35], or a medical digest entry from PubMed or the like [36] – and there are many more [37]. Some of these places are virtual hubs of information – dbPedia, for instance, has many links to other datasets – so if you create a link from yours to it, you are also adding all of dbPedia's links to your dataset: all the URIs are addressable after all, and as they are all linked, every connection can be discovered and explored.
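
As a sketch of "make one up": a tiny home-grown vocabulary built with Jena, mixing a custom ex: namespace (hypothetical) with the public RDFS and Dublin Core terms, and linking the resource out to dbPedia so that its links become reachable from ours. The dbPedia target is an illustrative choice.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DC;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.RDFS;

public class HomeGrownVocabulary {
    static final String EX = "http://example.org/engineering#"; // hypothetical namespace

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("ex", EX);

        // "it's a number ... it's a degree of tolerance in a bolt thread"
        Resource boltClass = m.createResource(EX + "Bolt");
        boltClass.addProperty(RDF.type, RDFS.Class);

        Property threadTolerance = m.createProperty(EX, "threadToleranceMm");
        threadTolerance.addProperty(RDF.type, RDF.Property);
        threadTolerance.addProperty(RDFS.comment,
            "Degree of tolerance in a bolt thread, in millimetres");

        // Describe one part with a mix of public (DC) and home-grown (ex:) terms.
        Resource part = m.createResource("http://example.org/parts/BOLT-6MM-STEEL");
        part.addProperty(RDF.type, boltClass);
        part.addProperty(DC.title, "6mm steel bolt");
        part.addLiteral(threadTolerance, 0.05);
        // Link out to a hub dataset so its links become reachable from ours too.
        part.addProperty(RDFS.seeAlso,
            m.createResource("http://dbpedia.org/resource/Screw_thread"));

        m.write(System.out, "TURTLE");
    }
}
```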

This part of the process will likely be the most difficult, but also the most rewarding and valuable, because the rest is mostly "clerical" – once you have the information, a structured way of describing it, and addresses for it (URIs), you "just" publish it. Publishing is a spectrum of things: it might mean creating CSV files of your data and making them available; it might mean "simply"* embedding RDFa into your next web page refresh cycle (your pages are your data, and Google is how people find it – your "API"); or it might mean going all the way and creating a SPARQL endpoint with a content negotiating gateway into your information base, and perhaps also a VoID [38] page that describes your dataset in a structured, open-vocabulary, curatable way [45].

* This blog doesn't contain RDFa because it's just too hard to do – wordpress.com doesn't have the plugins available, and the wordpress.org plugins may be limited for what you want to do. Drupal 7 [50] does a better job, and Joomla [51] may get there in the end.
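
At the "all the way" end of that spectrum, a content negotiating gateway can be as small as this JAX-RS sketch: two methods on the same resource path, and the framework picks one based on the request's Accept header. The part data is hard-coded here; a real service would render it from the underlying model or triple store.

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;

@Path("/parts/{id}")
public class PartResource {

    @GET
    @Produces("text/turtle")
    public String asTurtle(@PathParam("id") String id) {
        // Structured data for machines (a hard-coded triple here, a model in real life).
        return "<http://example.org/parts/" + id + "> "
             + "<http://purl.org/dc/elements/1.1/title> \"6mm steel bolt\" .";
    }

    @GET
    @Produces("text/html")
    public String asHtml(@PathParam("id") String id) {
        // A readable page for humans, rendered from the same underlying data.
        return "<html><body><h1>Part " + id + "</h1><p>6mm steel bolt</p></body></html>";
    }
}
```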

Service Oriented Architecture, Semantic Web Services

There are some striking similarities and echoes of Service Oriented Architecture (SOA) in Linked Open Data. It would seem to me that using Linked Open Data technologies and approaches with SOA is something that may well happen over time, organically if not formally. Some call this Semantic Web Services [49] or Linked Services [46]. It is not possible to discuss this fully here and now (it is complex, large in scope, and still emerging), but I can attempt to say how a LOD approach might map onto SOA:

Technology, APIs, Services, Data : Comment
  • REST, SOAP : Some love it, some hate it. I fall into the latter camp – SOAP has so much complexity and overhead compared to REST that it's hard to live with, even before you start coding for it. LOD has plenty too, but it starts from a more familiar base. And this kind of sums up SOA for me and for a lot of people. But then again, because I've avoided it, I may be missing something. REST in the Linked Open Data world allows all that can be accomplished with SOAP to be done at a lower cost of entry, and with quicker results. If REST were available with SOA services, they might be more attractive, I believe.
  • VoID [38,39], WSDL [40] : WSDL and VoID fill the same kind of need – for SOA, WSDL describes a service; in the Linked Open Data world, VoID describes a dataset and its linkages (see the sketch after this list). That dataset may be dynamically generated though (the 303 redirect architecture means that anything can happen, including dynamic generation and content negotiation), so it is comparable to a business process behind a web-based API.
  • BPM, UDDI [41,42,43], CPoA [44,45], discovery, orchestration, eventing, service bus, collaboration : This is where things start to really differ. UDDI is a fairly rigid means of registering web services, while the Linked Open Data approach (e.g. CPoA) is more do-it-yourself and federated. Whether this latter approach is good enough for businesses that have SLA agreements with customers is yet to be seen, but there's certainly nothing to stop orchestration, process engineering and eventing being built on top of it. Combine these basics with semantic metadata about services, datasets, registries and availability, and it's possible to imagine a robust, commercially viable internet of services and datasets, both open and built on easily adopted standards.
  • SLA, Identity, Trust, Provenance, Quality, Ownership, Licensing, Privacy, Governance : The same issues that dog SOA remain in the Linked Open Data world, and arguably, because of the ease of entry and the openness, they are more of an issue there. But the Linked Open Data world is new in comparison to SOA and can leverage learnings from it and correct problems. It is also not a prescriptive base to start from, so there is no need for a simple public dataset to implement heavy or commercially oriented APIs where they are not needed.
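
For the VoID/WSDL row above, this is roughly what a VoID description looks like when built with Jena: a machine-readable statement of what the dataset is, where its SPARQL endpoint lives, and what it links to. All the example.org URIs (and the target dataset URI for DBpedia) are placeholders; only the void: and dcterms: terms are the real vocabularies.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DCTerms;
import org.apache.jena.vocabulary.RDF;

public class VoidDescription {
    static final String VOID = "http://rdfs.org/ns/void#";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("void", VOID);

        // The dataset itself: what it is and how to query it.
        Resource dataset = m.createResource("http://example.org/parts/dataset");
        dataset.addProperty(RDF.type, m.createResource(VOID + "Dataset"));
        dataset.addProperty(DCTerms.title, "Example parts inventory");
        dataset.addProperty(m.createProperty(VOID, "sparqlEndpoint"),
                            m.createResource("http://example.org/sparql"));
        dataset.addProperty(m.createProperty(VOID, "exampleResource"),
                            m.createResource("http://example.org/parts/BOLT-6MM-STEEL"));

        // A linkset records that this dataset links out to another one.
        Resource linkset = m.createResource("http://example.org/parts/dataset/links-to-dbpedia");
        linkset.addProperty(RDF.type, m.createResource(VOID + "Linkset"));
        linkset.addProperty(m.createProperty(VOID, "subjectsTarget"), dataset);
        linkset.addProperty(m.createProperty(VOID, "objectsTarget"),
                            m.createResource("http://dbpedia.org/")); // placeholder target URI

        m.write(System.out, "TURTLE");
    }
}
```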

Whatever happens, it should be better than what was there before, IMO. SOA is too rigid and formal to be an internet-scale architecture: one in which ad-hoc, organic participation is possible, or perhaps the norm, but in which highly structured services and processes can also be designed, implemented and grown. With Linked Open Data, and some more research and work (from academia, from people like you and me, and from big industry if it likes!), the ultimate goals of process automation, personalisation (or individualisation), composition and customisation get closer and closer, and may even work where SOA seems to have stalled [46,47,48].

[25] http://www.w3.org/DesignIssues/Semantic.html
[26] http://www.w3.org/DesignIssues/LinkedData.html
[27] http://ld2sd.deri.org/lod-ng-tutorial/
[28] http://esw.w3.org/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
[29] http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
[30] http://www.w3.org/TR/cooluris/
[31] http://en.wikipedia.org/wiki/World_Geodetic_System
[32] http://www.gutenberg.org/wiki/Gutenberg:Information_About_Linking_to_our_Pages#Canonical_URLs_to_Books_and_Authors
[33] http://www.gutenberg.org/wiki/Gutenberg:Feeds
[34] https://svn.eionet.europa.eu/projects/Reportnet/wiki/SparqlEurostat
[35] http://semantic.ckan.net/sparql
[36] http://www.freebase.com/view/user/bio2rdf/public/sparql
[37] http://lists.w3.org/Archives/Public/public-sparql-dev/2010OctDec/0000.html
[38] http://semanticweb.org/wiki/VoiD
[39] http://lists.w3.org/Archives/Public/public-lod/2010Feb/0072.html
[40] http://www.w3.org/TR/wsdl
[41] http://www.w3schools.com/WSDL/wsdl_uddi.asp
[42] http://www.omii.ac.uk/docs/3.2.0/user_guide/reference_guide/uddi/what_is_uddi.htm
[43] http://www.w3.org/TR/2006/WD-ws-policy-20060731/
[44] http://webofdata.wordpress.com/2010/02/
[45] http://void.rkbexplorer.com/
[46] http://stefandietze.files.wordpress.com/2010/10/dietze-et-al-integratedapproachtosws.pdf
[47] http://rapporter.ffi.no/rapporter/2010/00015.pdf
[48] http://domino.research.ibm.com/comm/research_projects.nsf/pages/semanticwebservices.research.html
[49] http://www.serviceweb30.eu/cms/
[50] http://semantic-drupal.com/
[51] http://semanticweb.com/drupal-may-be-the-first-mainstream-semantic-web-winner_b568