
Java Semantic & Linked Open Data webapps – Part 4

December 17, 2010

What needs writing ?

Now that we have an idea about what tools and technologies are available and the kind of application we want to build we need to start considering architecture and what code we will write around those tools and technologies. The architecture I planned was broadly formed – but not completely – as I went about creating these applications. I was also going to tackle the Linked Open Data webapp first and then do the Semantic Backed J2EE app. I thought MVC first for both, but went in the end with a 2 tier approach for the former, and an n-tier component based approach for the latter. (More about this in the next section). I’m used to the Spring framework, so I thought I’d go with it, and for UI I’d use jQuery and HTML and/or JSP, perhaps Velocity. But nothing was set in stone, and I was going to try and explore and be flexible.

The tools and technologies cover

  • creating an ontology
  • entity extraction
  • RDF generation
  • using RDF with Java
  • Semantic repositories
  • querying sparql end points
    • inference
    • linking data
  • UI and render
How each of these was handled differs between the two applications, so the rest of this post goes through them category by category, for the Linked Open Data webapp and then for the Semantic Backed J2EE webapp.

Creating an ontology

Linked Open Data webapp: The ontology was going to be largely new, as there is not much around that deals with historical content. Some bibliographic ontologies are out there, but this isn’t about cataloguing books or chapters – it’s about the content within and across the sections of a single book. There are editions for Scotland, Wales and the UK also, so I might get around to doing them at some stage. Some of the content is archaic – measurements are in Old English miles, for instance. Geographic features needed to be described, along with population and natural resources. I wasn’t sure if I needed the expressiveness of OWL over RDFS, but thought that if I was going to start something fresh I might as well leave myself open to evolution and expansion – so OWL was the choice. Some editors don’t do OWL, and in the end I settled for Protege.

Semantic Backed J2EE webapp: Same thoughts here as for the Linked Data app – why limit myself to RDFS ? I can still do RDFS within an OWL ontology. Protege it is.
Entity extraction

Linked Open Data webapp: Having played with GATE, OpenNLP, MinorThird and a foray into UIMA, I settled on writing my own code. I needed close connections between my ontology, extracting the entities and generating RDF from those entities – most of these tools don’t have this capability out of the box (perhaps they do now, 1 year on) – and I also wanted to minimise the number of independent steps at this point, so that I could avoid writing conversion code and configuring multiple parts in different ways for different environments or OSes. There is also a high barrier to entry and a long learning curve for some of these tools. I had read a lot, enough even, and wanted to get my hands dirty. I decided to build my own, based on grep – as most of these tools use regex at the bottom end and build upon it. It wasn’t going to be sophisticated, but it would be agile, best effort, experience based coding I’d be doing, and learning all the way – not a bad approach I think. I’d borrow techniques from the other tools around tokenisation and gazetteering, and if I was lucky, I might be able to use some of the ML libraries (I didn’t in the end). So, with the help of Jena, I wrote components for the following (a minimal sketch of the approach follows the list):

  • Processing files in directories using “tasks”, outputting to a single file, multiple files, multiple directories, different naming conventions, encoding, different RDF serialisations
  • Splitting a single large file into sections based on a heading style used by the author. This was complicated by page indexing and numbering that used a very similar style, and by variations within sections that meant that end-of-section was hard to find. I got most entries out, but from time to time I find an embedded section within another. This can be treated individually, manually, and reimported into the repository to replace the original and create 2 in its place
  • Sentence tokenisation – I could have used some code from the available libraries and frameworks here, but it’s not too difficult, and when I did eventually compare to the others, I discovered that they also came a cropper in the same areas I did. Some manual corrections are still needed no matter how you do it, so I stuck with my own
  • Running regex patterns, accumulating hits in a cache. A “concept” or entity has a configuration element, and a relationship to other elements (a chain can be created).
    • The configuration marries an “Entity” with a “Tag” (URI). Entities are based on a delimiter, gazetteer.
    • Entities can be combined if they have a grouping characteristic.
    • An Entity can be “required”, meaning that unless some “other” token is found in a sentence, the entity won’t be matched. This can also be extended to having multiple required or ancillary matches, so that a proportion need to be found (a likelihood measure) before an entity is extracted.
    • Some Entities can be non-matching – just echo whatever is in the input – good for debug, and for itemising raw content – I use this for echoing the sentences in the section that I’m looking at – the output appears alongside the extracted entities.
    • The Required characteristic can also be used with Gazetteer based greps.
    • Entities have names that are used to match to Tags
  • Creating a Jena Model and adding those entities based on a configured mapping to an ontology element (URI, namespace, nested relationship, quantification (single or list, list type))
  • Outputting a file or appending to a file, with a configured serialisation scheme (xml/ttl/n3/…)
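To make the approach concrete, here is a minimal sketch of the grep-style extraction step feeding a Jena Model. The class name, namespace, regex and file name are all invented for illustration – only the Jena calls (the com.hp.hpl.jena API of the time) are real.

// Minimal sketch of grep-style extraction feeding Jena – names are hypothetical.
import com.hp.hpl.jena.rdf.model.*;

import java.io.FileWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepExtractorSketch {
    static final String NS = "http://example.org/topo#"; // assumed namespace

    public static void main(String[] args) throws Exception {
        String sentence = "The parish contains 12 townlands and lies 5 miles north of the town.";

        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("topo", NS);

        // one "concept" = a regex plus a mapping to an ontology property
        Pattern distance = Pattern.compile("(\\d+)\\s+miles");
        Matcher m = distance.matcher(sentence);

        Resource section = model.createResource(NS + "section/parish-1");
        Property distanceProp = model.createProperty(NS, "distanceInMiles");

        while (m.find()) {
            // accumulate hits as RDF statements rather than strings
            section.addProperty(distanceProp, m.group(1));
        }

        // serialise with a configured scheme, e.g. Turtle
        FileWriter out = new FileWriter("out.ttl");
        model.write(out, "TURTLE");
        out.close();
    }
}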
Semantic Backed J2EE webapp: This was a different kind of application – here no data exists at the start, and all is created and born digital. No extraction needed.
RDF generation

Linked Open Data webapp: I naively started the RDF generation code as a series of string manipulations and concatenations. I thought I could get away with it, and that it would be speedy ! The RDF generation code in Jena didn’t seem particularly sophisticated – the parameters are string based in the end, and you have to declare namespaces as strings etc, so what could possibly go wrong ? Well, things got unwieldy, and when I wanted to validate, integrate and reuse this string manipulation code it became tedious and fractious. Configuration was prone to error. Jena at higher stages of processing then needs proper URIs, and other libraries operate on that basis. So, just in time, I switched – luckily I had built the code thinking that I might end up having to alter my URI definition and RDF generation strategy, so it ended up being a discrete replacement – a new interface implementation that I could plug in.
Tags can be one of the following (a sketch of the idea follows the list):

  • reference – always create the same URI – used with properties mostly – eg rdf:type
  • append – a common and complete base, with just a value appended
  • complex – a base URI, intermediate path, ns prefix, type or subject path, a value URI different from the containing element
  • lookup – based on entity value, return a particular URI – like a reverse gazetteer
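Here is roughly what that Tag abstraction might look like in code. This is a sketch rather than the project’s actual classes – the interface and class names are hypothetical – but it shows why proper Jena Resources beat string concatenation.

// Hypothetical sketch of the "Tag" idea – not the actual project code.
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Resource;

interface Tag {
    /** Turn an extracted entity value into a proper Jena Resource. */
    Resource toResource(Model model, String entityValue);
}

/** "append" style: a fixed base URI with the cleaned-up value appended. */
class AppendTag implements Tag {
    private final String baseUri;

    AppendTag(String baseUri) { this.baseUri = baseUri; }

    public Resource toResource(Model model, String entityValue) {
        String localName = entityValue.trim().replaceAll("\\s+", "_");
        return model.createResource(baseUri + localName);
    }
}

/** "reference" style: always the same URI, e.g. for rdf:type values. */
class ReferenceTag implements Tag {
    private final String uri;

    ReferenceTag(String uri) { this.uri = uri; }

    public Resource toResource(Model model, String ignored) {
        return model.createResource(uri);
    }
}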
Semantic Backed J2EE webapp: Here, RDF generation isn’t driven by extraction or pre-existing entities, but by the Object model I used. See the next section for details.
Using RDF with Java

Linked Open Data webapp: Fairly early on I settled on Jena as opposed to Sesame. There are some notes I found comparing Jena to Sesame1, but some of the arguments didn’t mean anything to me at the early stages. There wasn’t much between them, I thought, but the Jena mailing list seemed a bit more active, and I noted Andy Seaborne’s name on the Sparql working group2. Both are fully featured with Sparql endpoints, repositories, text search and so on, but take different approaches3. Since then I’ve learned a lot of course, and I’ve compiled my own comparison matrix[110]. So – I went for Jena, and I probably would again in many cases, but Sesame may suit things better in others.

While Jena is Object oriented, working with it is based on RDF rather than objects. So if you have a class with properties – a bean – you have to create a Model, the Subject and add the properties and their values, along with the URIs and namespaces that they should be serialised with. You cannot hand Jena a Bean and say “give me the RDF for that object”.
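Concretely, the “bean to RDF by hand” work looks something like the following. The User bean and the namespace are made up for the example; the Jena calls are the standard Model/Resource/Property API.

// Sketch: what "bean to RDF by hand" looks like with plain Jena.
import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.vocabulary.RDF;

public class BeanToRdfByHand {
    static final String NS = "http://myCompany.com/people/";

    static class User {               // a plain bean
        String id = "john123";
        String name = "John";
        String familyName = "Smith";
    }

    public static void main(String[] args) {
        User u = new User();

        Model model = ModelFactory.createDefaultModel();
        Resource userType   = model.createResource(NS + "User");
        Property name       = model.createProperty(NS, "name");
        Property familyName = model.createProperty(NS, "familyName");

        // every property has to be added explicitly, with its URI
        model.createResource(NS + u.id)
                .addProperty(RDF.type, userType)
                .addProperty(name, u.name)
                .addProperty(familyName, u.familyName);

        model.write(System.out, "N3");
    }
}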

For this project that wasn’t an issue – I wasn’t modelling a class hierarchy, I wanted RDF from text, and then to be able to query it, and perhaps use inference. Being able to talk to Sparql endpoints and manipulate RDF was more important than modelling an Object hierarchy.

1. http://www.openrdf.org/forum/mvnforum/viewthread?thread=2043#7470
2. http://www.w3.org/2009/sparql/wiki/User:Andy_Seaborne
3. They’re different because they can be – this isn’t like programming against a standard like JDBC; there isn’t a standard way of modelling RDF in Java or as an Object – there are domain differences that may well make that impossible, in entirety. Multiple inheritance, restrictions and the Open World Assumption make for mismatches. Prolog and Lisp may be different or more suited here, or perhaps some other language.

Semantic Backed J2EE webapp: Here I needed to be able to maintain parallel worlds – an Object base with a completely equivalent RDF representation. And I wanted to be able to program this from an enterprise Java developer’s perspective, rather than a logician’s or information analyst’s. How do I most easily get from Object to RDF without having to code for each triple combination [109]? Well, it turns out there are 2 choices, and I ended up using one and then the other. It was also conceivable that I might not be able to do what I wanted, or that it wouldn’t perform – I saw the impact of inference on query performance in the Linked Data application – so I wanted to code the app so that it would be decoupled from the persistence mechanism. I also needed to exert authorization control – could I do this with RDF ?

  • Java-RDF – I stuck with Jena – why give up a good thing ?
  • Object-RDF – Jena has 2 possibilities – JenaBean and Jastor. I settled for JenaBean as it seemed to have support and wasn’t about static class generation. This allows you to annotate your javabeans with URI and property assertions so that a layer of code can create the RDF for you dynamically, and then do the reverse when you want to query (see the sketch after this list).
  • AdHoc Sparql – the libraries work OK when you are asking for Objects by ID, but if you want Objects that have certain property values or conditions then you need to write Sparql and submit that to the library.
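A JenaBean-style bean would look roughly like this. I am writing it from memory of JenaBean’s documentation, so treat the annotation and helper class names (@Namespace, @Id, Bean2RDF) as assumptions to be checked rather than gospel.

// Rough sketch of the JenaBean approach: annotate the bean, let the library
// generate the triples. Names recalled from JenaBean's docs – verify before use.
import thewebsemantic.Bean2RDF;
import thewebsemantic.Id;
import thewebsemantic.Namespace;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

@Namespace("http://myCompany.com/people/")
public class User {

    @Id
    private String loginId = "john123";

    private String familyName = "Smith";

    public String getLoginId() { return loginId; }
    public String getFamilyName() { return familyName; }

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        new Bean2RDF(model).save(new User()); // the layer of code that writes the RDF
        model.write(System.out, "N3");
    }
}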

So, I could build my app in an MVC style, and treat the domain objects much like I would if I used Hibernate or JDO, say. In addition, I could put in a proxy layer so that the services weren’t concerned about which persistence approach I took – I could revert to traditional RDBMS persistence if I wanted. So I could have View code, controllers, domain objects (DAO), service classes, and a persistence layer consisting of a proxy and an Object to RDF implementation.

I built this, and soon saw that RDF repositories, in particular Jena SDB, when used with JenaBean, are slow. This boils down to the fact that SPARQL is ultimately translated to SQL, and some SPARQL operations have to be performed client side. When you do this in an Object to RDF fashion, where every RDF statement ends up as a SQL join or independent query, you get a very, very chatty storage layer. This isn’t uncommon in ORM land, and lazy loading is used so that, for instance, a property isn’t retrieved until it’s actually needed – eg if a UI action needs to show a particular object property in addition to showing that an object exists. In the SPARQL case, there are more things that need to be done client side, like filtering, and this means that a query may retrieve (lots) more information than it’s actually going to need to create a query solution, and the processing of the solution is going to take place in your application JVM and not in the repository.

I wanted then to see if the performance was significantly better with a local repository (TDB) even if it couldn’t be addressed from multiple application instances, and if Sesame was any better. TDB turned out to be lots faster, but of course you can’t have multiple webapps talking to it unless you address it as a Sparql endpoint – not as an Object in Java code. For Sesame though, I needed to ditch JenaBean, and luckily, in the time I had been building the application a new Java Object-RDF middleware came out, called Empire-JPA[72].

This allows you to program your application in much the same way as JenaBean – annotations and configuration – but uses the JPA API to persist objects to a variety of backends. So I could mark up my beans with Empire annotations (leaving the JenaBean ones in place) and in theory persist the RDF to TDB, SDB, any of the Sesame backends, FourStore and so on (a sketch follows).
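The pattern looks roughly like this. The javax.persistence calls are standard JPA; the Empire-specific annotation names (@RdfsClass, @RdfProperty, @Namespaces) are recalled from its documentation of the time and, along with the persistence unit name, should be treated as assumptions – real Empire entities also need a little extra plumbing for their RDF identifiers.

// Sketch of the annotate-and-persist pattern with Empire-JPA.
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

import com.clarkparsia.empire.annotation.Namespaces;
import com.clarkparsia.empire.annotation.RdfProperty;
import com.clarkparsia.empire.annotation.RdfsClass;

@Entity
@Namespaces({"myCo", "http://myCompany.com/people/"})
@RdfsClass("myCo:User")
public class User {

    @RdfProperty("myCo:loginId")
    private String loginId;

    @RdfProperty("myCo:familyName")
    private String familyName;

    // getters/setters omitted for brevity

    public static void main(String[] args) {
        // "myPersistenceUnit" would point at TDB, SDB or a Sesame store via
        // Empire configuration – the unit name here is made up.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("myPersistenceUnit");
        EntityManager em = emf.createEntityManager();

        User u = new User();
        u.loginId = "john123";
        u.familyName = "Smith";

        em.persist(u);   // triples written to the configured repository
        em.close();
    }
}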

The implementation was slowed down because the SDB support wasn’t there, and the TDB support needed some work, but it was easy to work with Mike Grove at ClarkParsia on this, and it was a breath of fresh air to get some good helpful support, an open attitude, and timely responses.

I discovered along the way that I couldn’t start with a JenaBean setup, persist my objects to TDB say, and switch seamlessly to Empire-JPA (or vice versa). It seems that JenaBean persists some configuration statements and these interfere with Empire in some fashion – but this is an unlikely thing to do in production, so I haven’t followed it through.

Empire is also somewhat slower than JenaBean when it comes to complex object hierarchies, but Mike is working on this, and v 0.7 includes the first tranche of improvements.

Doing things with JPA has the added benefit of giving you the opportunity to revert to RDBMS or to start with RDBMS and try out RDF in parts, or do both. It also means that you have lots of documentation and patterns to follow, and you can work with a J2EE standard which you are familiar with.

But in the end, Semantic Repositories aren’t as quick as SQL RDBMSs. They are still worth it if you want RDF storage for some of your data or for a subset of your functionality, a graph based dataset, a common schema or vocabulary (or parts of one) shared with other departments or companies in your business circle, and the distinct advantage of inference for data mining, relationship expressiveness (“similar” or other soft equivalences rather than just “same”) and discovery.

A note about authorization (ACL) and security: none of the repositories I’ve come across have access control capabilities along the lines of what you might see with an RDBMS – grant authorities and restrictions just aren’t there. (OpenLink Virtuoso may have something, as it has a basis in RDBMS (?)).

You might be able to do some query restriction based on graphs by making use of a username, but if you want to, say, make sure that a field containing a social security number is only visible to the owner or an application administrator (or some other Role) but not to other users, then you need to do that ACL at the application level. I did this in Spring with Spring Security (Acegi), at the object level. Annotations and AOP can be used to set this up for Roles, controllers, Spring beans (that is, beans under control of a Spring context) or beans dynamically created (eg Domain objects created by controllers) – there is a sketch of the idea below. ACL and authentication in Spring depend on a User definition, so I also had to create an implementation that retrieved User objects from the semantic repository, but once that was done, it was an ACL manipulation problem rather than an RDF one.
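In outline, the wiring looks like the following: a service method locked down with a role/ownership rule, and a UserDetailsService that loads users from the semantic repository. The Spring Security types are real (Spring Security 3 era); the UserRepository DAO and its method are hypothetical.

// Sketch of role-based method security plus a repository-backed UserDetailsService.
import org.springframework.security.access.prepost.PreAuthorize;
import org.springframework.security.core.userdetails.UserDetails;
import org.springframework.security.core.userdetails.UserDetailsService;
import org.springframework.security.core.userdetails.UsernameNotFoundException;

public class AccountService {

    // only the owner or an administrator may see the sensitive field
    @PreAuthorize("hasRole('ROLE_ADMIN') or #ownerId == principal.username")
    public String getSocialSecurityNumber(String ownerId) {
        return "..."; // fetched from the domain object in the real app
    }
}

class RepositoryUserDetailsService implements UserDetailsService {

    private final UserRepository users; // hypothetical DAO over the RDF store

    RepositoryUserDetailsService(UserRepository users) {
        this.users = users;
    }

    public UserDetails loadUserByUsername(String username) {
        UserDetails user = users.findByLoginId(username); // e.g. via Empire or SPARQL
        if (user == null) {
            throw new UsernameNotFoundException(username);
        }
        return user;
    }
}

interface UserRepository {
    UserDetails findByLoginId(String loginId);
}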

The result was a success, if you can ignore the large dataset performance concerns. A semantic repository can easily and successfully be used for persistence storage in a Java J2EE application built around DAO, JPA and Service patterns, with enterprise security and access control, while also providing a semantic query capability for advanced and novel information mining, discovery and exploration.

Semantic repositories

Linked Open Data webapp: This application ultimately needs to be able to support lots of concurrent queries – eg 20+ per sec, per instance. Jena uses a Multiple Reader Single Writer approach for this, so it should be fine. But with inference things slow down a lot, and memory needs to be available to service concurrent queries and datasets. The Amazon instance I have for now uses a modest 600MB for heap, but with inference could use lots more, and a lot of CPU. Early on I used a 4 year old Dell desktop to run TDB and Joseki, and queries would get lost in it and never return – or so I thought. Moving to a Pentium Duo made things better, but it’s easy to write queries that tie up the whole dataset when you’re not a sparql expert, and these can in some cases cause the JVM to OoM and/or bomb. SDB suffers (as mentioned in the previous section) and any general purpose RDBMS hosted semantic repository that has to convert from SPARQL to SQL and back and forth will have performance problems. But for this application, TDB currently suffices – I don’t have multiple instances of a Java application, and if I did host the html/js on another instance (a tomcat cluster say) then it would work perfectly well with Joseki in front of TDB or SDB. On the downside, an alternative to Jena is not a real possibility here, as the Sparql in the page code makes heavy use of Jena ARQ extensions for counts and other aggregate functions. Sparql 1.1 specifies these things, so perhaps in future it will be a possibility.

Semantic Backed J2EE webapp: As a real java web application, one of the primary requirements here is that the repository is addressable using java code from multiple instances1. TDB doesn’t allow this because you define it per JVM. Concurrent access leads to unpredictable results, to put it politely. SDB would do it, as the database takes care of the ACIDity, but it’s slow.

I also wanted to be able to demonstrate the application and test performance with RDBMS technology or a Semantic Repository, or indeed NoSQL technology. The class hierarchy and componentisation allows this, but at this stage I’ve not tried going back to RDBMS or the NoSQL route. Empire-JPA allows a variety of repositories to be used, and those based on Sesame include OWLIM and BigData, which seem to offer large scale and clustered repository capability. To use AllegroGraph or RDF2Go would require another implementation of my Persistence Layer, and may require more bean annotations.

So, nothing is perfect, everything is “slow”, but flexibility is available.

1. It might be possible to treat the repository as a remote datasource and use SPARQL Select and Insert/Update queries (and this may be more performant, it turns out), but for this exercise I wanted to stick with tradition and build a J2EE application that didn’t have hard coded queries (or externalised and mapped ones a la iBatis) but that encapsulated the business logic and entities as a bean and service object base.

Querying sparql end points, inference, linking data
Linked Open Data webapp: More important here than in the J2EE webapp, being able to host a dataset on the Linked Data Web with 303 redirects, permanent URLs, slash rather than hash URIs and content negotiation meant that I ended up with Joseki as the Sparql endpoint, and a servlet filter within a base webapp that did the URI rewriting, 303 redirect and content negotiation (a sketch of the filter idea follows). Ontology and instance URIs can be serviced by loading the Ontology into the TDB repository. The application is read only, so there’s no need for the Joseki insert/update servlet. I also host an ancillary dataset for townlands so that I can keep it distinct for use with other applications, but federate it in with the ARQ SERVICE keyword. Making links between extracted entities and GeoNames, DBpedia and any other dataset is done as a decorator object in the extraction pipeline. Jena’s SPARQL objects are used for this, but in the case of the GeoNames webservice, their Java client library is used.
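A minimal sketch of such a filter, using only the standard Servlet API – the /resource, /page and /data path convention is just the usual Linked Data recipe, and the paths here are invented:

// Minimal sketch of a content negotiation / 303 redirect servlet filter.
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class LinkedDataFilter implements Filter {

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        String accept = request.getHeader("Accept");
        String path = request.getRequestURI();

        // /resource/X identifies the Thing itself: 303-redirect to a
        // representation chosen by content negotiation.
        if (path.startsWith("/resource/")) {
            String id = path.substring("/resource/".length());
            String target = (accept != null && accept.contains("application/rdf+xml"))
                    ? "/data/" + id      // machine-readable RDF
                    : "/page/" + id;     // human-readable HTML
            response.setStatus(HttpServletResponse.SC_SEE_OTHER); // 303
            response.setHeader("Location", target);
            return;
        }
        chain.doFilter(req, res);
    }

    public void init(FilterConfig config) {}
    public void destroy() {}
}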

One of the issues here of course is cross-domain scripting. Making client side requests for code from another domain (or making Ajax calls to another domain) isn’t allowed by modern UserAgents unless they support JSONP or CORS. Both require an extra effort on the part of the remote data provider and could do with some seamless support (or acknowledgement at least) from the UI javascript libraries. It happens that Jetty7 has a CORS filter (which I retrofitted to Joseki 3.4.2 [112]). JSONP can be fudged with jQuery it turns out, if the remote dataset provides JSON output – some don’t. The alternative is that anyone wishing to use your dataset on the Linked Open Data web must implement a server side proxy of some kind and (usually) work with RDF/XML. A lot of web developers and mashup artists will baulk at this, but astonishingly, post Web 2.0, these things still seem to be out of the reach of many dataset publishers. Jetty7 with its CORS filter goes a long way to improving this situation, but it would be great to see it in Tomcat too, so that publishers don’t have to implement what is a non-trivial filter (this is a security issue after all), and clients don’t have to resort to (or find/hire/blackmail for) server side code and another network hop.

Vladimir Dzhuvinov has another CORS filter [111], that adds request-tagging and Access-Control-Expose-Headers in the response.

Semantic Backed J2EE webapp: The only need for a Sparql endpoint here is for debug purposes. You need to be able to see the triples as the repository sees them when you use an ORdfM layer, so that you can understand the queries that are generated, why some of your properties aren’t showing up, and so on.

For query handling I needed a full featured console that would allow me inference (performance permitting) and allow me to render results efficiently. I also needed to be able to federate queries across datasets or endpoints – especially to UMBEL, so that I could offer end users the ability to locate data tagged with an UMBEL URI that was “similar” to one they were interested in (eg sharing a skos:broader statement) – there is a sketch of this kind of federated query below. Jena provides the best support here in terms of SPARQL extensions, but inference was too slow for me, and I could mimic some of the basic inference with targeted query writing for Sesame. Sesame doesn’t do well with aggregate functions, and inference is per repository and on-write, so you need to adjust how you view the repository compared to how Jena does it. Sesame is faster with an in-memory database.
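A sketch of the kind of federated lookup described, using Jena ARQ’s SERVICE keyword. The local endpoint, the remote endpoint and the concept URIs are all placeholders, not the real ones used by the application.

// Federated query sketch with Jena ARQ – endpoints and URIs are placeholders.
import com.hp.hpl.jena.query.*;

public class FederatedQuerySketch {
    public static void main(String[] args) {
        String sparql =
            "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> " +
            "SELECT ?similar WHERE { " +
            "  <http://example.org/id/concept-1> skos:broader ?parent . " +
            "  SERVICE <http://example.org/umbel/sparql> { " +   // remote endpoint (placeholder)
            "    ?similar skos:broader ?parent . " +
            "  } " +
            "}";

        // run against a local endpoint, e.g. Joseki in front of TDB (placeholder URL)
        QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://localhost:2020/sparql", sparql);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("similar"));
            }
        } finally {
            qe.close();
        }
    }
}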

UI and render

Linked Open Data webapp: This is an exercise in HTML and Ajax. It’s easy to issue Sparql queries that are generated in Javascript based on what needs to be done, but there’s one for every action on the website, and it’s embedded in the code. That’s not a huge problem given the open nature of the dataset and the limited functionality that’s being offered (the extraction process only deals with a small subset of the available information in the text). jQuery works well with Joseki, local or not [112], so the JSON/JSONP issue didn’t arise for me. Getting a UI based on the Ontology was possible using the jOWL javascript library, but it’s not the prettiest or most intuitive to use. A more sophisticated UI would need lots more work, and someone with an eye for web page design 🙂

Semantic Backed J2EE webapp: Here, the UI is generated with JSP code with embedded JS/Ajax calls back to the API. URLs are mapped to JSP and Role based access control is enforced. Most URLs have to be authenticated. Spring has a Jackson JSON view layer so that the UI could just work with Javascript arrays, but this requires more annotations on the beans for some properties that cause circular references (a sketch of the arrangement follows). The UI code is fairly unsophisticated and, for the sake of genericity, it mostly just spits out what is in the array, assuming that the annotations have taken care of most of the filtering, and that the authorization code has done its work and cloaked location, identity and datetime information. The latter works perfectly well, but some beans have properties that a real user wouldn’t be interested in.
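In outline, the controller-plus-Jackson-view arrangement looks like this. The Spring MVC classes (including MappingJacksonJsonView from Spring 3) are real; the PlaceController and PlaceService names are invented for the example.

// Sketch of a Spring MVC controller handing beans to the Jackson JSON view.
import java.util.List;

import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.servlet.ModelAndView;
import org.springframework.web.servlet.view.json.MappingJacksonJsonView;

@Controller
public class PlaceController {

    private final PlaceService placeService; // hypothetical service bean

    public PlaceController(PlaceService placeService) {
        this.placeService = placeService;
    }

    @RequestMapping("/api/places")
    public ModelAndView listPlaces() {
        List<?> places = placeService.findAll();
        // beans are turned into JSON arrays for the jQuery front end;
        // Jackson annotations on the beans control what gets exposed
        return new ModelAndView(new MappingJacksonJsonView(), "places", places);
    }
}

interface PlaceService {
    List<?> findAll();
}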

Velocity is used in some places when a user sends a message or invitation, but this is done at the object layer.

The UI doesn’t talk Sparql to any endpoint. Sparql queries are generated based on end user actions (the query and reporting console), but this is done at the Java level.

[109] http://www.mindswap.org/~aditkal/SEKE04.pdf
[110] https://uoccou.wordpress.com/wp-admin/post.php?post=241&action=edit
[111] http://blog.dzhuvinov.com/?p=685
[112] https://uoccou.wordpress.com/2010/11/29/cors-servlet-filter/


Final section in Java Semantic Webapps Part 3.1 completed

December 9, 2010

I’ve filled out the tools matrix with the 60 or so tools, libraries and frameworks I looked at for the two projects I created. Not all are used of course, and only a few are used in both. Includes comments and opinion, which I used and why, and all referenced. Phew.

Java Semantic Web & Linked Open Data webapps – Part 3.0

November 26, 2010

Available tools and technologies

(this section is unfinished, but taking a while to put together, so – more to come)

When you first start trying to find out what the Semantic Web is in technical terms, and then what the Linked Open Data web is, you soon find that you have a lot of reading to do – because you have lots of questions. That is not surprising since this is a new field (even though Jena for instance has been going 10 years) for the average Java web developer who is used to RDBMS, SOA, MVC, HTML, XML and so on. On the face of it, RDF is just XML right ? A semantic repository is some kind of storage do-dah and there’s bound to be an API for it, right ? Should be an easy thing to pick up, right ? But you need answers to these kinds of questions before you can start describing what you want to do as technical requirements, understanding what the various tools and technologies can do and which ones are suitable and appropriate, and then selecting some for your application.

One pathway is to dive in, get dirty and see what comes out the other side. But that to me is just a little unstructured and open-ended, so I wanted to tackle what seemed to be fairly real scenarios (see Part 2 of this series) – a 2-tier web app built around a SPARQL endpoint with links to other datasets and a more corporate style web application that used a semantic repository instead of an RDBMS, delivering a high level API and a semantic “console”.

In general then it seems you need to cover in your reading the following areas

  • Metadata – this is at the heart of the Semantic Web and Linked Open Data web. What is it !! Is it just stuff about Things ? Can I just have a table of metadata associated with my “subjects” ? Do I need a special kind of database ? Do I need structures of metadata – are there different types of things or “buckets” I need to describe things as ? How does this all relate to how I model my things in my application – is it different than Object Relational Modelling ? Is there a specific way that I should write my metadata ?
  • RDF, RDFS and OWL – what is it, why is it used,  how is it different than just XML or RSS;  what is a namespace, what can you model with it, what tools there are and so on
  • SPARQL – what is it, how to write it, what makes it different from SQL; what can it NOT do for you, are there different flavours, where does it fit in a web architecture compared to where a SQL engine might sit ?
  • Description Logic – you’ll come across this and wonder, or worse give up – it can seem very arcane very quickly – but do you need to know it all or any of it ? What’s a graph, a node, a blank node dammit, a triple, a statement ?
  • Ontologies – isn’t this just a taxonomy ? Or a thesaurus ? Why do I need one, how does metadata fit into it ? Should I use RDFS or OWL or something else ? Is it XML ?
  • Artificial Intelligence, Machine Learning, Linguistics – what !? You mean this is robotics and grammar ? Where does it fit in – what’s going on, do I need to have a degree in cybernetics to make use of the semantic web ? Am I really creating a knowledge base here and not a data repository ? Or is it an information repository ?
  • Linked Open Data – what does it actually mean – it seems simple enough ? Do I have to have a SPARQL endpoint or can I just embed some metadata in my documents and wait for them to be crawled ? What do I want my application to be able to do in the context of Linked Open Data ? Do I need my own URIs ? How do I make or “coin” them ? How does that fit in with my ontology ? How do I host my data set so someone else can use it ? Surely there is best practice and examples for all this ?
  • Support and community – this seems very academic, and very big – does my problem fit into this ? Why can I not just use traditional technologies that I know and love ? Where are all the “users” and applications if this is so cool and useful and groundbreaking ? Who can help me get comfortable, is anyone doing any work in this field ? Am I doing the right thing ? Help !

I’m going to describe these things before listing the tools I came across and ended up selecting for my applications. So – this is going to be a long post, but you can scan and skip the things you know already. Hopefully, you can get started more quickly than I did.

First the End

So you read and you read, and come across tools and libraries and academic reports and W3C documents, and you see it has been going on some time, that some things are available and current, others available and dormant. Most are OpenSource thankfully and you can get your hands on them easily, but where to start ? What to try first – what is the core issue or risk to take on first ? Is there enough to decide whether you should continue ?

What is my manager going to say to me when I start yapping on about all these unfamiliar things –

  • why do I need it ?
  • what problem is it solving ?
  • how will it make or save us money  ?
  • our information is our information – why would I want to make it public ?

Those are tough questions when you’re starting from scratch, and no one else seems to be using the technologies you think are cool and useful – who is going to believe you if you talk about sea-change, or “Web3.0”, or paradigm shift, or an internet for “machines” ? I believe you need to demonstrate-by-doing, and to get to the bottom of these questions so you know the answers before someone asks them of you. And you had better end up believing what you’re saying, so that you are convincing and confident. It’s risky….*

So – off I go – here is what I found, in simple, probably technically incorrect terms – but you’ll get the idea and work out the details later (if you even need to)

*see my answers to these questions at the end of this section

Metadata, RDF/S, OWL, Ontologies

Coarsely, RDF allows you to write linked lists. URIs allow you to create unique identifiers for anything. If you use the same URI twice, you’re saying that exact Thing is the same in both places. You create the URIs yourself, or when you want to identify a thing (“john smith”) or a property of a thing (eg “loginId”) that already exists, you reuse the URI that you or someone else created. You may well have a URI for a concept or idea, and another for one of its physical forms – eg a URI for a person in your organisation, another for the webpage that shows his photo and telephone number, and another for his HR system details.

Imagine 3 columns in a spreadsheet called Subject, Predicate and Object. Imagine a statement like “John Smith is a user with loginId ‘john123’ and he is in the sales Department“. This ends up like

Subject Predicate Object
S-ID1 type User
S-ID1 name “John”
S-ID1 familyName “Smith”
S-ID1 loginId “john123”
S-ID2 type department
S-ID2 name “sales”
S-ID2 member S-ID1

That is it, simply – RDF allows you to say that a Thing with an ID we call S-ID1 has properties, and that those properties are either other Things (S-ID2/member/S-ID1) or literal things like strings “john123”.

So you can build a “graph” or a connected list of Things (nodes) where each Thing can be connected to another Thing. And once you look at one of those Things, you might find that it has other properties that link to different Things that you don’t know about or that aren’t related to what you are looking at – S-ID2 may have another “triple” or “statement” that links it with ID-99 say (another user) or ID-10039 (a car lot space, say). So you can wire up these graphs to represent whatever you want in terms of properties and values (Objects). A Subject, Property or Object can be a reference to another Thing.

Metadata are those properties you use to describe Things. And in the case of RDF each metadatum can be a Thing with its own properties (follow the property to its own definition), or a concrete fact – eg a string, a number. Why is metadata important ? Because it helps you contextualise and find things, and to differentiate one thing from another even if they have the same name. Some say “Content is King”, but I say Metadata is !

RDF has some predefined properties like “property” and “type”. It’s pretty simple and you’ll pick it up easily [1]. Now RDFS extends RDF to add some more predefined properties that allow you to create a “schema” that describes your data or information – “class”, “domain”, “range”, “label”, “comment”. So if you start to formalise the relationships described above – a user has a name, familyName, loginId and so on – before you know it, you’ve got an ontology on your hands. That was easy, right ? No cyborgs, logic bombs, T-Box or A-Box in sight (see the next section). And you can see the difference between an ontology and a taxonomy – the latter is a way of classifying or categorising things, but an ontology does that and also describes and relates them. So keep going, this isn’t hard ! (Hindsight is great too)

Next you might look at OWL because you need more expressiveness and control in your information model, and you find out that it has different flavours – DL, Lite, Full[2]. What do you do now ? Well, happily, you don’t have to think about it too much, because it turns out that you can mix and match things in your ontology – use RDFS and OWL, and you can even use things from other ontologies. Mash it up – you don’t have to define these properties from scratch yourself. So go ahead and do it, and if you find that you end up in OWL-FULL instead of DL then you can investigate and see why. The point is, start, dig in and do what you need to do. You can revise and evolve at this stage.

A metadata specification called “Dublin Core”[3] comes up a lot – this is a useful vocabulary for describing things like “title”, “creator”, “relation”, “publisher”. Another, the XSD schema, is useful for defining things like number types – integer, long and float – and is used as part of SPARQL for describing literals. You’ll also find that there are properties of things that you thought were so common that someone would have an ontology or a property defined for them already. I had quite a time looking for a definition of Old English miles, but it turns out luckily that there was one[4,5]. On the other hand, there wasn’t one for a compass bearing of “North” – or at least not one that I could find, so I invented one, because it seemed important to me. Not all things in your dataset will need metadata – and in fact you might find that you, and someone working on another project, have completely different views on what’s important in a dataset – you might be interested in describing financial matters, and someone else might be more interested in the location information. If you think about it long enough a question might come to mind – should we still maintain our data somewhere in canonical, raw or system-of-record form, and have multiple views of it stored elsewhere ? (I don’t have an answer for that one yet).

Once you start you soon see that the point of reusing properties from other ontologies is that you are creating connections between datasets and information just by using them – you may have a finance department that uses “creator”, and you can now link its records with the HR system’s records for the same person – because the value used for “creator” is in fact a unique URI (simply, an ID that looks like a URL), eg http://myCompany.com/people/john123. If you have another John in the company, he’ll have a different ID, eg http://myCompany.com/people/john911, so you can be sure that the link is correct and precise – no ambiguity – John123 will not get the payslip meant for John911. There are also other ways of connecting information – you could use owl:sameAs for instance – this makes a connection between two Things when a common vocabulary or ID is not available, or when you want to make a connection where one didn’t exist before. But think about these connections before you commit them to statements – the correctness, provenance and trust around that new connection has to be justifiable – you want your information and assertions about it to have integrity, right ?

I needed RDF and RDFS at least – this would be the means by which I would express the definition and parameters of my concepts, and then also the statements to represent actual embodiments of those concepts – instances. It started that way, but I knew I might need OWL if I wanted more control over the structure and integrity of my information – eg to say that John123 could only be a member of one department and one department only, or that he had a role of “salesman” but couldn’t also have a role of “paymaster”. So, if you need this kind of thing, read more about it [6,7]. If you don’t yet, just keep going, and you can still come back to it later (it turns out I did, in fact).

The table above now looks like this when you use URIs – it’s the same information, just written down in a way that ensures things are unique, and connectable.

Namespaces
myCo: http://myCompany.com/people/
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs: http://www.w3.org/2000/01/rdf-schema#
foaf: http://xmlns.com/foaf/0.1/
Subject Predicate Object
myCo:S-ID1 rdf:type myCo:User
myCo:S-ID1 rdfs:label “John”
myCo:S-ID1 foaf:family_name “Smith”
myCo:S-ID1 myCo:loginId “john123”
myCo:S-ID2 rdf:type myCo:department
myCo:S-ID2 rdfs:label “sales”
myCo:S-ID2 myCo:member myCo:S-ID1

The Namespaces at the top of the table mean that you can use shorthand in the three columns and don’t have to repeat the longer part of the URI each time. Makes things easier to read and take in too, especially if you’re a simple human. For predicates, I’ve changed name to rdfs:label and familyName to foaf:family_name[8]. In the Object column only the myCo namespace is used – in the first case it points to a Subject with a type defined elsewhere (in the ontology in fact). I say the ontology is defined elsewhere, but that doesn’t have to be physically elsewhere; it’s not uncommon to have a file on disk that contains the RDF to define the ontology but also contains the instances that make up the vocabulary or the information base.

So – why is this better than a database schema ? The simple broad answers are the best ones I think :

  • You have only 3 columns*
  • You can put anything you like in each column (almost – literals can’t be predicates (?)), and it’s possible to describe a binary property User->name->John as well as n-ary relationships[9] User->hasVehicle->car->withTransmission->automatic
  • You can define what you know about the things in those columns and use it to create a world view of things (a set of “schema rules”, an ontology).
  • You can (and should) use common properties – define a property called “address1” and use it in Finance and HR so you know you’re talking about the same property. But if you don’t, you can fix it later with some form of equivalence statement.
  • If there are properties on instances that aren’t in your ontology, they don’t break anything, but they might give you a surprise – this is called an “open world assumption” – that is to say just because it is not defined does not mean it cannot exist – this is a key difference from database schema modelling.
  • You use the same language to define different ontologies, rather than say MySQL DDL for one dataset and Oracle DDL for another
  • There is one language spec for querying any repository – SPARQL**. You use the same for yours and any others you can find – and over HTTP – no firewall dodging, no operations team objections, predictable, quick and easy to access
  • You do not have to keep creating new table designs for new information types
  • You can easily add information types that were not there before while preserving older data or facts
  • You can augment existing data with new information that allows you to refine it or expand it – eg provide aliases that allow you to get around OCR errors in extracted text, alternative language expressions
  • Any others ?

*Implementations may add one or two more, or break things up into partitioned tables for contextual or performance reasons
**there are different extensions in different implementations

[1]http://rdfabout.net/
[2]http://www.w3.org/TR/owl-features/
[3]http://dublincore.org/
[4]http://forge.morfeo-project.org/wiki_en/index.php/Units_of_measurement_ontology
[5]http://purl.oclc.org/NET/muo/ucum-instances.owl
[6]http://www.cs.man.ac.uk/~horrocks/ISWC2003/Tutorial/
[7]http://www.w3.org/TR/owl-ref/#sameAs-def
[8]http://xmlns.com/foaf/spec/
[9] http://www.w3.org/TR/swbp-n-aryRelations/

SPARQL, Description Logic (DL), Ontologies

SPARQL [10] aims to allow those familiar with querying relational data to query graph data without too much introduction. It’s not too distant, but needs a little getting used to. “Select * from users” looks like “select * where {?s rdf:type myCo:User}”, and then you get back bindings for the variables you asked for rather than every column from a table. Of course this is because you effectively have 3 “columns” in the graph data and they’re populated with a bunch of different things. So you need to dig deeper[11] into tutorials and what others have written.[12,13]

One of the key things about SPARQL is that you can use it to find out what is in the graph data without having any idea beforehand.[14] You can ask to find the types of data available, then ask for the properties of those types, then DESCRIBE or select a range of types for identified subjects. So, it’s possible to discover what’s available to suit your needs, or for anyone else to do the same with your data.
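To give a flavour of that discovery style, here is a small sketch of the kind of “what is in here ?” queries you might run with Jena ARQ against any endpoint. The endpoint URL and the example class URI are placeholders.

// Sketch of discovery queries with Jena ARQ – endpoint and URIs are placeholders.
import com.hp.hpl.jena.query.*;

public class DiscoverySketch {
    public static void main(String[] args) {
        // 1. what types of thing does the dataset contain?
        String typesQuery =
            "SELECT DISTINCT ?type WHERE { ?s a ?type } LIMIT 50";

        // 2. what properties do instances of one of those types have?
        String propsQuery =
            "SELECT DISTINCT ?p WHERE { ?s a <http://example.org/ns#User> ; ?p ?o } LIMIT 50";

        for (String q : new String[] { typesQuery, propsQuery }) {
            QueryExecution qe = QueryExecutionFactory.sparqlService(
                    "http://example.org/sparql", q);
            try {
                ResultSet rs = qe.execSelect();
                ResultSetFormatter.out(System.out, rs);   // simple tabular dump
            } finally {
                qe.close();
            }
        }
    }
}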

Another useful thing is the ability (for some SPARQL engines – Jena’s ARQ [15] comes to mind) to federate queries, either by using a “graph” (effectively just a named set of triples) that is a URI to a remote dataset, or by using, in Jena’s case, the SERVICE keyword. So you can have separate and independent datasets and query across them easily. Sesame[16] allows a similar kind of thing with Federated Sail, but you predefine the federation you want, rather than specify it in-situ. Beware of runtime network calls in the Jena case, and consider hosting your independent data in a single store but under different graphs to avoid them. You’ll need more memory in one instance, but you should get better performance. And watch out for JVM memory limits and type size increases if you (probably) move to a 64bit JVM.[17,18]

While learning the syntax of SPARQL isn’t a huge matter, understanding that you’re dealing with a graph of data, and having to navigate or understand that graph beforehand, can be a challenge – especially if it’s not your data you want to federate or link with. Having ontologies and sample data (from your initial SPARQL queries) helps a lot, but it can be like trying to understand several foreign database schemas at once, visualising a chain rather than a hierarchy, taking on multiple-inheritance and perhaps cardinality rules, domain and range restrictions and maybe other advanced ontology capabilities.

SPARQL engines or libraries used by SPARQL engines that allow inferencing provide a unique selling point for the Semantic and Linked web of data. Operations you cannot easily do in SQL are possible. Derived statements with information that is not actually “asserted” in the physical data you may have loaded into your repository start to appear. You might for instance ask for all Subjects or things of a certain type. If the ontology of the information set says that one type is a subclass of another – say you ask for “cars” – then you’ll get back statements that say your results are cars, but you’ll also get statements saying they are also “vehicles”. If you did this with an information set that you were not familiar with, say a natural history data set, then when you ask for “kangaroos” you are also told that it’s an animal, a kangaroo, and a marsupial. The animal statement might be easy to understand, but perhaps you expected that it was a mammal. And you might not have expressly said that a Kangaroo was one or the other.

Once you get back results from a SPARQL query you can start to explore – you start looking for kangaroos, then you follow the marsupial link, and you end up with Opossum, then you see it’s in the USA and not Australia, and you compare the climates of the two continents. Alternatively of course, you may have started at the top end – asked for marsupials, and you get back all the kangaroos and koalas etc, then you drill down into living environment and so on. Another scenario deals with disambiguation – you ask for statements about eagles and the system might return you things named eagles, but you’ll be able to see that one is a band, one is a US football team, and the other a bird of prey. Then you might follow links up or down the classifications in the ontology.

Some engines have features or utilities that allow you to “forward-chain”[19] statements before loading – this means that, using an ontology or a reasoning engine based on a language specification, derived statements about things are asserted and materialised for you before you load them into your repository. This is not only for things to do with class hierarchy; where a hierarchy isn’t explicit, inference might also create a statement – “if a Thing has a title, pages and a hardback cover, then it is ….a book”. This saves the effort at runtime and should mean that you get a faster response to your query. Forward chaining (and backward-chaining[20]) are common reasoning methods used with inference rules in Artificial Intelligence and Logic systems.
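As a small illustration of derived statements, here is a sketch of Jena’s built-in RDFS reasoning layered over a tiny schema and one asserted fact. The namespace and class names are invented; the Jena calls are the real API.

// Sketch of RDFS inference with Jena: derived rdf:type statements appear.
import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.vocabulary.RDF;
import com.hp.hpl.jena.vocabulary.RDFS;

public class InferenceSketch {
    public static void main(String[] args) {
        String ns = "http://example.org/zoo#";

        // the "ontology": Kangaroo is a subclass of Marsupial, which is an Animal
        Model schema = ModelFactory.createDefaultModel();
        Resource animal    = schema.createResource(ns + "Animal");
        Resource marsupial = schema.createResource(ns + "Marsupial")
                .addProperty(RDFS.subClassOf, animal);
        Resource kangaroo  = schema.createResource(ns + "Kangaroo")
                .addProperty(RDFS.subClassOf, marsupial);

        // the data: one asserted statement only
        Model data = ModelFactory.createDefaultModel();
        data.createResource(ns + "skippy").addProperty(RDF.type, kangaroo);

        // the inference model materialises the derived rdf:type statements
        InfModel inf = ModelFactory.createRDFSModel(schema, data);
        Resource skippy = inf.getResource(ns + "skippy");
        System.out.println(skippy.hasProperty(RDF.type, marsupial)); // true (derived)
        System.out.println(skippy.hasProperty(RDF.type, animal));    // true (derived)
    }
}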

It turns out Description Logic, or “DL” [21], is what we are concerned with here – a formal way of expressing or representing knowledge – things have properties that have a certain value. OWL is a DL representation, for instance. And like Object oriented programming languages – Java say – there are classes (ontology, T-Box statements) and instances (A-Box, instances, vocabularies). There are also notable differences from Java (eg multiple inheritance or typing), and a higher level of formalism, and these can make mapping between your programming language and your ontology or modelling difficult or problematic. For some languages – Prolog or Lisp say – this mapping may not be such a problem, and indeed you’ll find many semantic tools and technologies built using them.

Despite the fact that DL and AI can get quite heady once you start delving into these things, it is easy to start with the understanding that they allow you to describe or model your information expressively and formally without being bound to an implementation detail like the programming language you’ll use, and that once you do implement and make use of your formal knowledge representation – your ontology – hidden information and relationships may well become clear where they may not have been before. Doing this with a network of information sets means that the scope of discovery and fact is broadened – for your business, this may well be the difference between a sale or not, or provide a competitive edge in a crowded market.

[10]http://www.w3.org/TR/rdf-sparql-query/
[11]http://www.w3.org/2009/sparql/wiki/Main_Page
[12] http://www.ibm.com/developerworks/xml/library/j-sparql/
[13]http://en.wikibooks.org/wiki/XQuery/SPARQL_Tutorial
[14]http://dallemang.typepad.com/my_weblog/2008/08/rdf-as-self-describing-data.html
[15]http://openjena.org/ARQ/
[16]http://wiki.aduna-software.org/confluence/display/SESDOC/Federation
[17]http://jroller.com/mert/entry/java_heapconfig_32_bit_vs
[18]http://portal.acm.org/citation.cfm?id=1107407
[19]http://en.wikipedia.org/wiki/Forward_chaining
[20]http://en.wikipedia.org/wiki/Backward_chaining
[21]http://en.wikipedia.org/wiki/Description_logic

Artificial intelligence, machine learning, linguistics

When you come across Description Logic and the Semantic Web in the context of identifying “things” or entities in documents – for example the name of a company or person, a pronoun or a verb – you’ll soon be taken back to memories of school – grammar, clauses, definite articles and so on. And you’ll grow to love it I’m sure, just like you used to 🙂
It’s a necessary evil, and it’s at the heart of one side of the semantic web – information extraction (“IE”) as a part of information retrieval (“IR”)[22,23]. Here, we’re interested in the content of documents, tables, databases, pages, excel spreadsheets, pdfs, audio and video files, maps, etc etc. And because these “documents” are written largely for human consumption, in order to get at the content using “a stupid machine”, we have to be able to tell the stupid machine what to do and what to look for – it does not “know” about language characteristics – what the difference is between a noun and a verb – let alone how to recognise one in a stream of characters, with variations in position, capitalisation, context and so on. And what if you then want to say that a particular noun, used a particular way, is a word about “politics” or “sport”; that it’s English rather than German; that it refers to another word two words previous, and that it’s qualified by an adjective immediately after it ? This is where Natural Language Processing (NLP) comes in.

You may be familiar with tokenising a string in a high level programming language, then writing a loop to look at each word and then do something with it. NLP will do this kind of thing but apply more sophisticated abstractions, actions and processing to the tokens it finds, even having a rule base or dictionary of tokens to look for, or allowing a user to dynamically define what that dictionary or gazetteer is. Automating this is where Machine Learning (ML) comes in. Combined, and making use of mathematical modelling and statistical analysis, they look at sequences of words and then make a “best guess” at what each word is, and tell you how good that guess is.

You may (probably will) need to “train” the machine learning algorithm or system with sample documents – manually identify and position the tokens you are interested in, tag them with categories (perhaps these categories themselves are from a structured vocabulary you have created or found, or bought) and then run the “trained” extractor over your corpus of documents. With luck, or actually, with a lot of training (maybe 20%-30% of the corpus size), you’ll get some output that says “rugby” is a “sports” term and “All Blacks” is a “rugby team”. Now you have your robot, your artificial intelligence.

But the game is not over yet – for the Semantic and Linked web, you now have to do something with that output – organise and transform it into RDF – a related set of extracted entities – relate one entity to another into a statement “all blacks”-“type”-“rugby team”, and then collect your statements into a set of facts that mean something to you, or the user for whom you are creating your application. This may be defined or contextualised by some structure in your source document, but it may not be – you may have to provide an organising structure. At some point you need to define a start – a Subject you are going to describe – and one of the Subjects you come up with will be the very beginning or root Thing of your new information base. You may also consider using an online service like OpenCalais[24], but you’re then limited to the range of entities and concepts that those services know about – in OpenCalais’ case it’s largely business and news topics – wide ranging for sure, but if you want to extract information about rugby teams and matches it may not be too successful. (There are others available and more becoming available). In my experience, most often and for now, you’ll have to start from scratch, or as near as damn-it. If you’re lucky there may be a set or list of terms for the concept you are interested in, but it’s a bit like writing software applications for business – no two are the same, even if they have the same pattern. Unlike software applications though, this will change over time – assuming that people will publish their ontologies, taxonomies, term sets, gazetteers and thesauri. Let’s hope they do, but get ready to pay for them as well – they’re valuable stuff.

So

  1. Design and Define your concepts
    1. Define what you are interested in
    2. Define what things represent what you are interested in
    3. Define how those things are expressed – the terms, relations, ranges and so on – you may need to build up a gazetteer or thesaurus
    4. Understand how and where those things are used – the context, frequency, position
  2. Extract the concepts and metadata
    1. Now tell the “machine” about it; in fact, teach it what you know and what you are interested in – show it by example, or create a set of rules and relations that it understands
    2. Teach it some more – the more you tell it, the more variety, the more examples and repetition you can throw at it, the better the quality of results you’ll get
    3. Get your output – do you need to organise the output, do you have multiple files and locations where things are stored, do you need to feed the results from the first pass into your next one ?
    4. Fashion some RDF
    5. Create URIs for your output – perhaps the entities extracted are tagged with categories (that you provided to the trained system) or with your vocabulary, or perhaps not – but now you need to get from this output to URIs, Subjects, Properties, Objects – to match your ontology or your concept domain. Relate and collect them into “graphs” of information, into RDF.
    6. Stage them somewhere – on a filesystem say (one file, thousands ? versions, dates ? tests and trials, final runs; spaces, capitalisation, reserved characters, encoding – it’s the web after all)
  3. Make it accessible
    1. Find a repository technology you like – if you don’t know, if it’s your first time, pick one – suck it and see – if you have RDF on disk you might be able to use that directly (maybe slower than an online optimised repository). Initialise it, get familiar with it, consider size and performance implications. Do you need backup ?
    2. Load your RDF into the repository. (Or perhaps you want to modify some existing html docs you have with the metadata you’ve extracted – RDFa probably)
    3. Test that what you’ve loaded matches what you had on disk – you need to be able to query it – how do you do that ? Is there a commandline tool – does it do SPARQL ? What about if you want to use it on the web, this is the whole point isn’t it ? Is there a sparql endpoint – do you need to set up Tomcat or Jetty, say, to talk to your repository ?
  4. Link it
    1. And what about those URIs – you have URIs for your concept instances (“All Blacks”), and URIs for their properties (“rdf:type”), and URIs for the Object of those properties (“myOnt:Team”). What happens now – what do you do with them ? If they’re for the web, if they’re URIs, shouldn’t I be able to click on them ? (Now we’re talking Linked Data – see next section).
    2. Link your RDF with other datasets (see next section) if you want to be found, to participate, and to add value by association, affiliation, connection – the network effect – the knowledge and the value (make some money, save some money)
  5. Build your application
    1. Now create your application around your information set. You used to have data, now you have information – your application turns that into knowledge and intelligence, and perhaps profit.

There are a few tools to help you with all this (see below), but you’ll find that they don’t do everything you need, and they won’t generate RDF for you without some help – so roll your sleeves up. Or – don’t – I decided against it, having looked at the amount of work involved in learning all about NLP & ML, in the arcane science (it’s new to me), in the amount of time needed to set up training, and in the quality of the output. I decided on the KISS principle – “Keep It Simple, Stupid” – so instead I opted to write something myself, based on grep !

I still had to do 1-5 above, but now I had to write my own code to do the extraction and “RDFication”. It also meant I got my hands dirty and learned hard lessons by doing rather than by reading or trusting someone else’s code that I didn’t understand. And the quality of the output and the meaning of it all stayed in my control. It is not real Machine Learning, it’s still in the tokenisation world I suppose, but I got what I wanted and in the process made something I can use again. It also gave me practical and valuable experience so that I can revisit the experts’ tools with a better perspective – not so daunting, more comfortable and confident, something to compare to, patterns to witness and create, less to learn and take on, and, importantly, a much better chance of actual, deliverable success.

It was quite a decision to take – it felt dirty somehow – all that knowledge and science bound up in those tools, and it was a shame not to use it – but I wanted to learn and to fail in some ways, I didn’t want to spend weeks training a “machine”, and it seemed better to fail with something I understood (grep) than to take on a body of science that was alien. In the end I succeeded – I extracted my terms with my custom-automated-grep-based-extractor, created RDF and loaded it into a repository. It’s not pretty, but it worked – I have gained lots of experience, and I know where to go next. I recommend it.
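
A quick sanity check before going anywhere near an endpoint is to load the staged file back into a plain in-memory model and query it locally – a sketch here, assuming the Turtle file and namespace from the earlier example:

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;

    public class LoadAndCheck {
        public static void main(String[] args) {
            // Load the staged file straight into an in-memory model
            Model model = RDFDataMgr.loadModel("aghaboe.ttl");

            String query =
                "PREFIX lewis: <http://example.org/lewis/ontology#> " +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
                "SELECT ?place ?label WHERE { ?place a lewis:Place ; rdfs:label ?label }";

            try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
                ResultSetFormatter.out(qe.execSelect());   // prints a simple text table
            }
        }
    }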

Finally, it’s worth noting here the value-add components

  • ontologies – domain expertise written down
  • vocabularies – these embody statements of knowledge
  • knowledge gathering – collecting a disparate set of facts, or describing and assembling a novel perspective
  • assurance, provenance, trust – certifying and guaranteeing levels of correctness and origin
  • links – connections, relationships, ranges, boundaries, domains, associations – the scaffolding of the brains !
  • the application – a means to access, present and use that knowledge to make decisions and choices

How many business opportunities are there here ?

[22] http://en.wikipedia.org/wiki/Information_extraction
[23] http://en.wikipedia.org/wiki/Information_retrieval
[24] http://www.opencalais.com/

Linked Open Data

Having googled and read the W3C docs [25-30] on Linked Open Data, it should become clear that the advantages of Linked Open Data are many.

  • Addressable Things – every Thing has an address, namespaces avoid conflicts, there are different addresses for concepts and embodiments
  • Content for purpose – if you’re a human you get HTML or text (say), if you’re a machine or program you get structured data (see the sketch after this list)
  • Public vocabulary – if there is a definition for something then you can reuse it, or perhaps even specialise it. If not, you can make one up and publish it. If you use a public definition, type, or relationship, and someone else does as well, then you can join or link across your datasets
  • Open links – a Thing can point to or be pointed at from multiple places, sometimes with different intent.
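
The “content for purpose” point is plain HTTP content negotiation: ask for the representation you want via the Accept header. A minimal client-side sketch – the dbPedia URI is real, but exactly what a given server returns (and whether it 303-redirects you to a separate data document) is up to that server:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ContentNegotiation {
        public static void main(String[] args) throws Exception {
            URL thing = new URL("http://dbpedia.org/resource/Dublin");

            // A browser sends Accept: text/html and gets the human-readable page;
            // a program can ask for RDF and get structured data instead.
            HttpURLConnection con = (HttpURLConnection) thing.openConnection();
            con.setRequestProperty("Accept", "application/rdf+xml");

            System.out.println("Status: " + con.getResponseCode());
            System.out.println("Content-Type: " + con.getContentType());
            try (InputStream in = con.getInputStream()) {
                System.out.println("Bytes received: " + in.readAllBytes().length);
            }
        }
    }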

Once these are part of your toolkit you can go on to create data-level mashups which add value to your own information. As use cases, consider how and why you might link your own data to other datasets, for example :

  • a public dataset that contains the schedules of buses in a city, and their current positions while on that schedule – the essence of this is people, place and time – volumes, locations at point in time, over periods of time, at start and end times.
    • You could create an app based around this data and linkages to Places of Interest based on location : a tourist guide based on public transport.
    • How about an “eco” application comparing the bus company statistics over time with that of cars and taxis, and cross reference with a carbon footprint map to show how Public Transport compares to Private Transport in terms of energy consumption per capita ?
    • Or add the movements of buses around the city to data that provides statistics on traffic light sequences for a journey planner ?
    • Or a mashup of route numbers, planning applications and house values ?
    • Or a mash up that correlates the sale of newspapers, sightings of wild foxes and bus routes – is there a link ??? 🙂
The point is, it might look like a list of timestamps and locations, but it’s worth multiple times that when added to another dataset that may be available – and this may be of interest to a very small number of people with a very specific interest, or to a very large number of people in a shared context – the environment, value for money etc. And of course, even where the data may seem sensitive, or where gathering it all in one place raises concerns, publishing it is a way of reaching out to customers, being open and communicative – building a valuable relationship and a new level of trust, both within the organisation and without.
  • a commercial dataset such as a parts inventory used within a large manufacturing company : the company decides to publish it using URIs and a SPARQL endpoint. It tells its internal and external suppliers that its whole catalog is available – the name, ID, description and function of each part. Now a supplier can see which parts it already sells to the company, as well as the ones it doesn’t, but could. Additionally, if it can also publish its own catalog, and correlate with owl:sameAs (see the sketch below) or go further and agree to use a common set of IDs (URIs), then the supply chain between the two can be made more efficient. A “6mm steel bolt” isn’t the same as a “4mm steel rod with a 6mm bolt-on protective cap” (the keywords match) – but with shared URIs there is no ambiguity.
And if the manufacturing company can now design new products using the shared URI scheme and reference the inventory publication and supply chain system built around it, then it can probably be a lot more accurate about costs, delivery times, and ultimately profit. Imagine if the URI scheme used by the manufacturer and the supplier was also an industry-wide scheme – if its competitors and other suppliers also used it ? It hasn’t given the crown jewels away, but it has been able to leverage a common information base to save costs, improve productivity and resource efficiency, and drive profit.
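
The correlation step itself is a small amount of RDF – the hard part is deciding that two identifiers really do denote the same part. A sketch with Jena, using invented catalog URIs:

    import org.apache.jena.rdf.model.*;
    import org.apache.jena.vocabulary.OWL;

    public class CatalogLinker {
        public static void main(String[] args) {
            Model links = ModelFactory.createDefaultModel();

            // Hypothetical URIs from the manufacturer's and the supplier's catalogs
            Resource manufacturerPart =
                links.createResource("http://manufacturer.example.com/parts/BOLT-6MM-STEEL");
            Resource supplierPart =
                links.createResource("http://supplier.example.com/catalog/item/88213");

            // Assert that the two identifiers denote the same real-world part
            manufacturerPart.addProperty(OWL.sameAs, supplierPart);

            // Publish the link set alongside (or merged into) either dataset
            links.write(System.out, "TURTLE");
        }
    }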

So Linked Open Data has value in the public and private sector, for small, medium and large scale interests. Now you just need to build your application or service to use it. How you do that depends, of course, on your own needs and wishes. The 5 stars [26] of Linked Open Data mean that you can engage at a small level and move up level by level if you need or want to.

At the heart of things is a decision to make your information available in a digital form, so that it can be used by your employees, your customers, or the public. (You could even do this at different publishing levels, or add authorization requirements to different data sets). So you decide what information you want to make available, and why. As already stated, perhaps this is a new means to communicate with your audience, or a new way to engage with them, or it may be a way to drive your business. You may be contributing to the public information space. Either way, if you want it to be linkable, then you need URIs and a vocabulary. So you reuse what you can from RDF/RDFS/FOAF/DC/SIOC etc, and then you see that you actually have a lot of valuable information that you want to publish but that has not been codified into an information scheme, an ontology. What do you do – make one up ! You own the information and what it means, so you can describe it in basic terms (“it’s a number”) or from your perspective (“it’s a degree of tolerance in a bolt thread”). And if you can, you should also now try to link this to existing datasets or information that’s already out there – a dbPedia or Freebase information item say, or an IETF engineering terminology (is there one ?), or a location expressed in Linked Open Data format (wgs84 [31]), or a book reference on Gutenberg [32,33], or a set of statistics from a government department [34,35], or a medical digest entry from PubMed or the like [36] – and there are many more [37]. Some of these places are virtual hubs of information – dbPedia for instance has many, many links to other data sets – and if you create a link from yours to it, then you are also adding all of dbPedia’s links to your dataset – all the URIs are addressable after all, and as they are all linked, every connection can be discovered and explored.

This part of the process will likely be the most difficult, but also the most rewarding and valuable part, because the rest of it is mostly “clerical” – once you have the info, and you have a structured way of describing it and of providing addresses for it (URIs), you “just” publish it. Publishing is a spectrum of things : it might mean creating CSV files of your data that you make available; it might mean that you “simply”* embed RDFa into your next web page refresh cycle (your pages are your data, and Google is how people find them – your “API”); or it might mean that you go all the way and create a SPARQL endpoint with a content-negotiating gateway into your information base, and perhaps also a VoID [38] page that describes your dataset in a structured, open-vocabulary, curatable way [45].

* This blog doesn’t contain RDFa because it’s just too hard to do – wordpress.com doesn’t have the plugins available, and the wordpress.org plugins may be limited for what you want to do. Drupal7 [50] does a better job, and Joomla [51] may get there in the end.
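
For the “all the way” end of that publishing spectrum, the usual pattern behind a content-negotiating gateway is the Cool URIs 303 redirect [30]: one URI for the Thing itself, redirecting to an HTML page or an RDF document depending on the Accept header. A bare-bones servlet sketch – the paths are invented, and a real deployment would sit in front of the SPARQL endpoint or document store:

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Handles URIs like /resource/Aghaboe and 303-redirects to a representation
    public class ThingRedirectServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
            String thing = req.getPathInfo();          // e.g. "/Aghaboe"
            String accept = req.getHeader("Accept");

            String target;
            if (accept != null && accept.contains("application/rdf+xml")) {
                target = "/data" + thing + ".rdf";     // machine-readable document
            } else {
                target = "/page" + thing + ".html";    // human-readable page
            }

            resp.setStatus(HttpServletResponse.SC_SEE_OTHER);   // 303 See Other
            resp.setHeader("Location", target);
        }
    }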

Service Oriented Architecture, Semantic Web Services

There are some striking similarities to and echoes of Service Oriented Architecture (SOA) in Linked Open Data. It would seem to me that using Linked Open Data technologies and approaches with SOA is something that may well happen over time, organically, if not formally. Some call this Semantic Web Services [49] or Linked Services [46]. It is not possible to discuss this fully here and now (it’s complex, large in scope, and still emerging), but I can attempt to say how a LOD approach might work with SOA.

The comparison below matches SOA technologies against their Linked Open Data counterparts (the APIs, services and data concerns they address), with a comment on each.

  • REST, SOAP – Some love it, some hate it. I fall into the latter camp – SOAP has so much complexity and overhead compared to REST that it’s hard to live with, even before you start coding for it. LOD has plenty too, but it starts from a more familiar base. And that kind of sums up SOA for me and for a lot of people. But then again, because I’ve avoided it, I may be missing something. REST in the Linked Open Data world allows all that can be accomplished with SOAP to be done at a lower cost of entry, and with quicker results (see the sketch after this list). If REST were available with SOA services then they might be more attractive, I believe.
  • VoID [38,39], WSDL [40] – WSDL and VoID fill the same kind of need : for SOA, WSDL describes a service; in the Linked Open Data world, VoID describes a dataset and its linkages. That dataset may be dynamically generated though (the 303 redirect architecture means that anything can happen, including dynamic generation and content negotiation), so it is comparable to a business process behind a web-based API.
  • BPM – UDDI [41,42,43], CPoA [44,45], discovery, orchestration, eventing, service bus, collaboration. This is where things start to really differ. UDDI is a fairly rigid means of registering web services, while the Linked Open Data approach (e.g. CPoA) is more do-it-yourself and federated. Whether this latter approach is good enough for businesses that have SLA agreements with customers is yet to be seen, but there’s certainly nothing to stop orchestration, process engineering and eventing being built on top of it. Combine these basics with semantic metadata about services, datasets, registries and availability and it’s possible to imagine a robust, commercially viable internet of services and datasets, both open and built on easily adopted standards.
  • SLA – Identity, Trust, Provenance, Quality, Ownership, Licensing, Privacy, Governance. The same issues that dog SOA remain in the Linked Open Data world, and arguably, because of the ease of entry and the openness, they are more of an issue there. But the Linked Open Data world is new in comparison to SOA and can leverage learnings from it and correct its problems. It’s also not a prescriptive base to start from, so there is no need for a simple public dataset to implement heavy or commercially oriented APIs where they are not needed.
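
To illustrate the lower cost of entry on the REST side: querying a public SPARQL endpoint from Java is a few lines with Jena ARQ – one HTTP request under the covers, no service stubs, no WSDL, no SOAP envelope. The endpoint and query here are just examples (newer Jena releases also offer a builder-style API for the same thing):

    import org.apache.jena.query.*;

    public class RemoteSparqlClient {
        public static void main(String[] args) {
            String endpoint = "http://dbpedia.org/sparql";
            String query =
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
                "SELECT ?label WHERE { " +
                "  <http://dbpedia.org/resource/Dublin> rdfs:label ?label } LIMIT 5";

            // Sends the query over the SPARQL protocol and iterates the results
            try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    System.out.println(results.next().getLiteral("label"));
                }
            }
        }
    }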

Whatever happens, it should be better than what was there before, IMO. SOA is too rigid and formal to be an internet-scale architecture – one in which ad-hoc, organic participation is possible, or perhaps the norm, but in which highly structured services and processes can also be designed, implemented and grown. With Linked Open Data, and some more research and work (from both academia and people like you and me, and big industry if it likes !), the ultimate goals of process automation, personalisation (or individualisation), composition and customisation get closer and closer, and may even work where SOA seems to have stalled [46,47,48].

[25] http://www.w3.org/DesignIssues/Semantic.html
[26] http://www.w3.org/DesignIssues/LinkedData.html
[27] http://ld2sd.deri.org/lod-ng-tutorial/
[28] http://esw.w3.org/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
[29] http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
[30] http://www.w3.org/TR/cooluris/
[31] http://en.wikipedia.org/wiki/World_Geodetic_System
[32] http://www.gutenberg.org/wiki/Gutenberg:Information_About_Linking_to_our_Pages#Canonical_URLs_to_Books_and_Authors
[33] http://www.gutenberg.org/wiki/Gutenberg:Feeds
[34] https://svn.eionet.europa.eu/projects/Reportnet/wiki/SparqlEurostat
[35] http://semantic.ckan.net/sparql
[36] http://www.freebase.com/view/user/bio2rdf/public/sparql
[37] http://lists.w3.org/Archives/Public/public-sparql-dev/2010OctDec/0000.html
[38] http://semanticweb.org/wiki/VoiD
[39] http://lists.w3.org/Archives/Public/public-lod/2010Feb/0072.html
[40] http://www.w3.org/TR/wsdl
[41] http://www.w3schools.com/WSDL/wsdl_uddi.asp
[42] http://www.omii.ac.uk/docs/3.2.0/user_guide/reference_guide/uddi/what_is_uddi.htm
[43] http://www.w3.org/TR/2006/WD-ws-policy-20060731/
[44] http://webofdata.wordpress.com/2010/02/
[45] http://void.rkbexplorer.com/
[46] http://stefandietze.files.wordpress.com/2010/10/dietze-et-al-integratedapproachtosws.pdf
[47] http://rapporter.ffi.no/rapporter/2010/00015.pdf
[48] http://domino.research.ibm.com/comm/research_projects.nsf/pages/semanticwebservices.research.html
[49] http://www.serviceweb30.eu/cms/
[50] http://semantic-drupal.com/
[51] http://semanticweb.com/drupal-may-be-the-first-mainstream-semantic-web-winner_b568

Lewis Topographical Dictionary Ireland and SkyTwenty on EC2

November 17, 2010 Comments off

Both applications are now running on Amazon EC2 in a micro instance.

  • AMI 32bit Ubuntu 10.04 (ami-480df921)
  • OpenJDK6
  • Tomcat6
  • MySQL5
  • Joseki 3.4.2
  • Jena 3.6.2
  • Sesame2
  • Empire 0.7
  • DynDNS ddclient (see [1])

Don’t try installing sun-java6-jdk, it won’t work. You might get it installed if you run the instance as m1.small, and do it as the first task on the AMI instance. That didn’t suit me, as I discovered too late, and my motivation for wanting to install it turned out to be non-propagation of JAVA_OPTS, not the JDK. See the earlier post on setting up Ubuntu.

  • Lewis Topographical Dictionary of Ireland
    • JavaScript/Ajax to SPARQL endpoint. Speedy.
    • Extraction and RDF generation from unstructured text with custom software.
    • SPARQL endpoint on Joseki, with custom content negotiation
    • Ontology for location, roads, related locations, administrative description, natural resources, populations, peerage.
    • Ontology for Peerage – Nobility, Gentry, Commoner.
    • Find locations where peers have more than one seat
    • Did one peer know another, in what locations, degree of separation
    • Linked Open Data connections to dbPedia, GeoNames (Uberblic and Sindice to come) – find people in dbPedia born in 1842 for your selected location. Map on Google Maps with GeoNames-sourced wgs84 lat/long.
  • SkyTwenty
    • Location-based service built as a JPA-based enterprise app on a semantic repo (Sesame native).
    • Spring with SpringSec ACL, OpenID Authorisation.
    • Location and profile tagging with Umbel Subject Concepts.
    • FOAF and SIOC based ontology
    • Semantic query console – “find locations tagged like this”, “find locations posted by people like me”
    • Scheduled queries, with customisable action on success or failure
    • Location sharing and messaging with ACL groups – identity hidden and location and date/time cloaked to medium accuracy.
    • Commercial apps possible – identity hidden and location and date/time cloaked to low accuracy
    • Data mining across all data for aggregate queries – very low accuracy, no app/group/person identifiable
    • To come
      • OpenAuth for application federation,
      • split/dual JPA – to rdbms for typical app behaviour, to semantic repo for query console
      • API documentation

A report on how these were developed and the things learned is planned, warts and all.

[1] http://blog.codesta.com/codesta_weblog/2008/02/amazon-ec2—wh.html – not everything needs to be done, but you’ll get the idea. Install ddclient and follow instructions.

Semantic Progress

December 4, 2009 Comments off

Update to LewisT. Added a new tokenisation concept to pull out the sequence of MPs’ names and locations (Lewis refers to them as “Gentlemen’s seats”). Fixed the empty response and #URI handling in the redirector. Added a lang attribute to untyped Literals.

However, when using a Bag to hold the members of the civil administration, SPARQL gets a little shaky. But using ARQ functions you can get better information, even if Tabulator still doesn’t render it exactly as you expect :
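
A sketch of the kind of query involved, assuming the Bag sits in a locally loaded model: members of an rdf:Bag hang off the numbered properties rdf:_1, rdf:_2 and so on, so the query has to match the predicate by its URI prefix rather than by name.

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;

    public class BagMembers {
        public static void main(String[] args) {
            Model model = RDFDataMgr.loadModel(args[0]);   // file containing the Bag

            // rdf:_1, rdf:_2, ... all share the rdf: namespace followed by an underscore
            String query =
                "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
                "SELECT ?bag ?member WHERE { " +
                "  ?bag a rdf:Bag ; ?p ?member . " +
                "  FILTER(regex(str(?p), \"22-rdf-syntax-ns#_\")) }";

            try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
                ResultSetFormatter.out(qe.execSelect());
            }
        }
    }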

Now just need to link these to Hansard 1803-2005 🙂

Semantic Progress

September 29, 2009 Comments off

Or not, as the case may be. Back to the old question of NE extraction. Can’t get GATE to work (NPE). ML doesn’t seem to work. JAPE is a bit of a pain, and doesn’t have much support, if any, for nested annotations. And I’m not a cunning linguist. Once I’ve banged my head against a wall enough I’ll try asking for help, although support looks scant – must check again. Pity.

MinorThird half works, but the analysis and validation always return zero scores. Downloaded the latest version out of svn tonight; will check tomorrow if it fixes things. Had the same trouble with KEA. If I can get over this hurdle, or find a lib that’s up-to-date, current and open source (rather than LingPipe, which, if I buy into it, will inevitably mean license fees), then I can get on with the ontology-thesaurus mapping, the external linking and backchaining etc. Or I give up on the NE for now, work on the linking, and come back to it. Doing my head in.

And I must get to grips with the Hidden Markov Model. Just like this damned Dell Touchpad that keeps resetting itself…

Categories: Uncategorized

Semantic Progress

September 21, 2009 Comments off

Have to consider usage of SUMO now that I’ve got location info out of Lewis. But before this I want to see how long it takes to define entities for another concept – geoFeatures or Commerce. This should also reinforce the validity, or not, of what I’ve done to date. This may take me into keyword extraction (e.g. KEA). Still wonder if ML and/or NLP would be any less effort. In any case, I’m building up a range of gazetteers at least.

Categories: Work