What needs writing?
Now that we have an idea of which tools and technologies are available and the kind of application we want to build, we need to start considering the architecture and the code we will write around those tools and technologies. The architecture I planned was broadly, but not completely, formed as I went about creating these applications. I was also going to tackle the Linked Open Data webapp first and then do the Semantic Backed J2EE app. I thought MVC first for both, but in the end went with a 2-tier approach for the former and an n-tier, component-based approach for the latter (more about this in the next section). I’m used to the Spring framework, so I thought I’d go with it, and for the UI I’d use jQuery and HTML and/or JSP, perhaps Velocity. But nothing was set in stone, and I was going to try to explore and be flexible.
The tools and technologies cover
- creating an ontology
- entity extraction
- RDF generation
- using RDF with Java
- Semantic repositories
- querying SPARQL endpoints
- linking data
- UI and render
|Category||Linked Open Data webapp||Semantic Backed J2EE webapp|
|creating an ontology||The ontology was going to be largely new, as there is not much around that deals with historical content. Some bibliographic ontologies are out there, but this isn’t about cataloguing books or chapters; it is about the content within and across the sections of a single book. There are editions for Scotland, Wales and the UK also, so I might get around to doing them at some stage. Some of the content is archaic: measurements are in Old English miles, for instance. Geographic features needed to be described, along with population and natural resources. I wasn’t sure if I needed the expressiveness of OWL over RDFS, but thought that if I was going to start something fresh I might as well leave myself open to evolution and expansion, so OWL was the choice. Some editors don’t do OWL, and in the end I settled on Protege.||Same thoughts here as for the Linked Data app: why limit myself to RDFS? I can still do RDFS within an OWL ontology. Protege it is.|
|entity extraction||Having played with GATE, OpenNLP, MinorThird and made a foray into UIMA, I settled on writing my own code. I needed close connections between my ontology, extracting the entities and generating RDF from those entities. Most of these tools don’t have this capability out of the box (perhaps they do now, 1 year on). I also wanted to minimise the number of independent steps at this point, so that I could avoid writing conversion code and configuring multiple parts in different ways for different environments or OSes. There is also a high barrier to entry and a long learning curve for some of these tools. I had read a lot, enough even, and wanted to get my hands dirty. I decided to build my own, based on grep, as most of these tools use regex at the bottom end and build upon it. It wasn’t going to be sophisticated, but it would be agile, best-effort, experience-based coding, and I’d be learning all the way, which is not a bad approach I think. I’d borrow techniques from the other tools around tokenisation and gazetteering, and if I was lucky I might be able to use some of the ML libraries (I didn’t in the end). So, with the help of Jena, I wrote components for
||This was a different kind of application: here no data exists at the start, and everything is created and born digital. No extraction needed.|
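The regex-plus-gazetteer approach described for the Linked Open Data webapp can be pictured with a minimal sketch. This is not the original extraction code: the class name `SimpleExtractor`, the three gazetteer entries and the distance pattern are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleExtractor {
    // Hypothetical gazetteer: a fixed set of known place names.
    static final Set<String> GAZETTEER = Set.of("Dublin", "Cork", "Galway");

    // Candidate tokens: single capitalised words.
    static final Pattern CAPITALISED = Pattern.compile("\\b[A-Z][a-z]+\\b");

    // Distances such as "1 mile" or "126 miles".
    static final Pattern DISTANCE = Pattern.compile("\\b(\\d+)\\s+miles?\\b");

    // Returns gazetteer hits in the order they appear in the text.
    static List<String> places(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CAPITALISED.matcher(text);
        while (m.find()) {
            if (GAZETTEER.contains(m.group())) {
                found.add(m.group());
            }
        }
        return found;
    }

    // Returns every distance in miles mentioned in the text.
    static List<Integer> distancesInMiles(String text) {
        List<Integer> found = new ArrayList<>();
        Matcher m = DISTANCE.matcher(text);
        while (m.find()) {
            found.add(Integer.valueOf(m.group(1)));
        }
        return found;
    }
}
```

Calling `SimpleExtractor.places(...)` on a sentence gives the candidate entities that a later step could turn into RDF. A real pipeline would need multi-word names, archaic spellings and unit conversion, none of which this sketch attempts.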
|RDF generation||I naively started the RDF generation code as a series of string manipulations and concatenations. I thought I could get away with it, and that it would be speedy! The RDF generation code in Jena didn’t seem particularly sophisticated: the parameters are string based in the end, and you have to declare namespaces as strings and so on, so what could possibly go wrong? Well, things got unwieldy, and when I wanted to validate, integrate and reuse this string manipulation code it became tedious and fractious. Configuration was prone to error. Jena at higher stages of processing needs proper URIs, and other libraries operate on that basis. So, just in time, I switched. Luckily I had built the code thinking that I might end up having to alter my URI definition and RDF generation strategy, so it ended up being a discrete replacement: a new interface implementation that I could plug in.
Tags can be
|Here, RDF generation isn’t driven by extraction or preexisting entities, but by the Object model I used. See the next row for details.|
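The idea of hiding URI definition behind an interface, so that a string-concatenation strategy can be swapped for one that produces validated URIs, might look roughly like the sketch below. It uses only `java.net.URI` rather than Jena, and the names `UriStrategy`, `SlashUriStrategy`, `TripleWriter` and the `example.org` base are hypothetical, not taken from the original code.

```java
import java.net.URI;
import java.util.Locale;

// The pluggable part: callers depend only on this interface.
interface UriStrategy {
    URI mint(String type, String label);
}

// One possible implementation: slash URIs under a fixed base.
class SlashUriStrategy implements UriStrategy {
    private final String base;

    SlashUriStrategy(String base) {
        this.base = base;
    }

    @Override
    public URI mint(String type, String label) {
        // Lower-case the label and replace whitespace runs so it is safe as a path segment.
        String slug = label.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", "-");
        // URI.create throws IllegalArgumentException if the result is not a valid URI.
        return URI.create(base + "/" + type + "/" + slug);
    }
}

class TripleWriter {
    // Serialises one statement with three URI terms in N-Triples style.
    static String triple(URI subject, URI predicate, URI object) {
        return "<" + subject + "> <" + predicate + "> <" + object + "> .";
    }
}
```

Because the generation code only sees `UriStrategy`, a change of URI scheme becomes a new implementation rather than an edit to every concatenation.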
|Using RDF with Java||Fairly early on I settled on Jena as opposed to Sesame. There are some notes I found comparing Jena to Sesame[1], but some of the arguments didn’t mean anything to me at the early stages. There wasn’t much between them I thought, but the Jena mailing list seemed a bit more active, and I noted Andy Seaborne’s name on the SPARQL working group[2]. Both are fully featured with SPARQL endpoints, repositories, text search and so on, but take different approaches[3]. Since then I’ve learned a lot of course, and I’ve compiled my own comparison matrix. So I went for Jena, and I probably will in other cases, but Sesame may suit things better in others.
While Jena is object-oriented, working with it is based on RDF rather than objects. So if you have a class with properties (a bean), you have to create a Model and the Subject, then add the properties and their values, along with the URIs and namespaces that they should be serialised with. You cannot hand Jena a bean and say “give me the RDF for that object”.
For this project that wasn’t an issue: I wasn’t modelling a class hierarchy. I wanted RDF from text, and then to be able to query it, and perhaps use inference. Being able to talk to SPARQL endpoints and manipulate RDF was more important than modelling an Object hierarchy.
|Here I needed to be able to maintain parallel worlds: an Object base with a completely equivalent RDF representation. And I wanted to be able to program this from an enterprise Java developer’s perspective, rather than that of a logician or information analyst. How do I most easily get from Object to RDF without having to code for each triple combination? Well, it turns out there are 2 choices, and I ended up using one and then the other. It was also conceivable that I might not be able to do what I wanted, or that it wouldn’t perform (I saw the impact of inference on query performance in the Linked Data application), so I wanted to code the app so that it would be decoupled from the persistence mechanism. I also needed to exert authorization control: could I do this with RDF?
So, I could build my app in an MVC style, and treat the domain objects much like I would if I used Hibernate or JDO, say. In addition, I could put in a proxy layer so that the services weren’t concerned about which persistence approach I took; if I wanted, I could revert to traditional RDBMS persistence. So I could have View code, controllers, domain objects (DAO), service classes, and a persistence layer consisting of a proxy and an Object to RDF implementation.
I built this, and soon saw that RDF repositories, in particular Jena SDB when used with JenaBean, are slow. This boils down to the fact that SPARQL is ultimately translated to SQL, and some SPARQL operations have to be performed client side. When you do this in an Object to RDF fashion, where every RDF statement ends up as a SQL join or independent query, you get a very, very chatty storage layer. This isn’t uncommon in ORM land, and lazy loading is used so that, for instance, a property isn’t retrieved until it’s actually needed (eg if a UI action needs to show a particular object property in addition to showing that an object exists). In the SPARQL case, there are more things that need to be done client side, like filtering, and this means that a query may retrieve (lots) more information than it’s actually going to need to create a query solution, and the processing of the solution is going to take place in your application JVM and not in the repository.
I wanted then to see if the performance was significantly better with a local repository (TDB), even if it couldn’t be addressed from multiple application instances, and if Sesame was any better. TDB turned out to be lots faster, but of course you can’t have multiple webapps talking to it unless you address it as a SPARQL endpoint, not as an Object in Java code. For Sesame though, I needed to ditch JenaBean, and luckily, in the time I had been building the application, a new Java Object-RDF middleware came out, called Empire-JPA.
This allows you to program your application in much the same way as JenaBean (annotations and configuration), but uses the JPA API to persist objects to a variety of backends. So I could mark up my beans with Empire annotations (leaving the JenaBean ones in place) and in theory persist the RDF to TDB, SDB, any of the Sesame backends, FourStore and so on.
The implementation was slowed down because the SDB support wasn’t there, and the TDB support needed some work, but it was easy to work with Mike Grove at ClarkParsia on this, and it was a breath of fresh air to get some good helpful support, an open attitude, and timely responses.
I discovered along the way that I couldn’t start with a JenaBean setup, persist my objects to TDB say, and switch seamlessly to Empire-JPA (or vice versa). It seems that JenaBean persists some configuration statements and these interfere with Empire in some fashion, but this is an unlikely thing to do in production, so I haven’t followed it through.
Empire is also somewhat slower than JenaBean when it comes to complex object hierarchies, but Mike is working on this, and v 0.7 includes the first tranche of improvements.
Doing things with JPA has the added benefit of giving you the opportunity to revert to RDBMS or to start with RDBMS and try out RDF in parts, or do both. It also means that you have lots of documentation and patterns to follow, and you can work with a J2EE standard which you are familiar with.
In the end, Semantic Repositories aren’t as quick as SQL-RDBMS. They are still worth considering if you want RDF storage for some of your data or for a subset of your functionality, a graph based dataset, a common schema or vocabulary (or parts of one) for you and other departments or companies in your business circle, and the distinct advantage of inference for data mining, relationship expressiveness (“similar” or other soft equivalences rather than just “same”) and discovery.
A note about authorization (ACL) and security: None of the repositories I’ve come across have access control capabilities along the lines of what you might see with an RDBMS – grant authorities and restrictions just aren’t there. (OpenVirtuoso may have something as it has a basis in RDBMS (?)).
You might be able to do some query restriction based on graphs by making use of a username, but if you want to, say, make sure that a field containing a social security number is only visible to the owner or application administrator (or some other Role) but not to other users, then you need to do that ACL at the application level. I did this in Spring with Spring Security (Acegi), at the object level. Annotations and AOP can be used to set this up for Roles, controllers, Spring beans (that is, beans under control of a Spring context) or beans dynamically created (eg Domain objects created by controllers). ACL and authentication in Spring depend on a User definition, so I also had to create an implementation that retrieved User objects from the semantic repository, but once that was done, it was an ACL manipulation problem rather than an RDF one.
The result was a success, if you can ignore the large dataset performance concerns. A semantic repository can easily and successfully be used for persistent storage in a Java J2EE application built around DAO, JPA and Service patterns, with enterprise security and access control, while also providing a semantic query capability for advanced and novel information mining, discovery and exploration.
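To give a feel for what an Object-to-RDF layer does with annotated beans, here is a toy sketch using plain reflection. It is only an illustration of the general technique: the `@RdfType` and `@RdfProperty` annotations, the `Person` bean and the `BeanToTriples` class are invented for this example and are not the JenaBean or Empire-JPA annotations or APIs.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical annotations, NOT the JenaBean or Empire ones.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface RdfType { String value(); }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface RdfProperty { String value(); }

// An example domain bean marked up with the hypothetical annotations.
@RdfType("http://xmlns.com/foaf/0.1/Person")
class Person {
    @RdfProperty("http://xmlns.com/foaf/0.1/name")
    String name;

    Person(String name) {
        this.name = name;
    }
}

class BeanToTriples {
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Emits one rdf:type triple for the class and one literal triple per annotated, non-null field.
    static List<String> toTriples(Object bean, String subjectUri) {
        List<String> triples = new ArrayList<>();
        RdfType type = bean.getClass().getAnnotation(RdfType.class);
        if (type != null) {
            triples.add("<" + subjectUri + "> <" + RDF_TYPE + "> <" + type.value() + "> .");
        }
        for (Field field : bean.getClass().getDeclaredFields()) {
            RdfProperty property = field.getAnnotation(RdfProperty.class);
            if (property == null) {
                continue; // unannotated fields are not persisted
            }
            try {
                field.setAccessible(true);
                Object value = field.get(bean);
                if (value != null) {
                    triples.add("<" + subjectUri + "> <" + property.value() + "> \""
                            + escape(value.toString()) + "\" .");
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return triples;
    }

    // Escapes backslashes and double quotes for an N-Triples style literal.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Each annotated field becomes its own statement, which hints at why such a layer is chatty: reading the bean back means at least one lookup per property unless the middleware batches them.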
|Semantic repositories||This application ultimately needs to be able to support lots of concurrent queries, eg 20+ per second, per instance. Jena uses a Multiple Reader Single Writer approach for this, so should be fine. But with inference things slow down a lot, and memory needs to be available to service concurrent queries and datasets. The Amazon instance I have for now uses a modest 600 MB for heap, but with inference could use lots more, and a lot of CPU. Early on I used a 4 year old Dell desktop to run TDB and Joseki, and queries would get lost in it and never return, or so I thought. Moving to a Pentium Duo made things better, but it’s easy to write queries that tie up the whole dataset when you’re not a SPARQL expert, and these can in some cases cause the JVM to OoM and/or bomb. SDB suffers (as mentioned in the previous section), and any general purpose RDBMS hosted semantic repository that has to convert from SPARQL to SQL and back again will have performance problems. But for this application, TDB currently suffices. I don’t have multiple instances of a Java application, and if I did host the html/js on another instance (a Tomcat cluster, say) then it would work perfectly well with Joseki in front of TDB or SDB. On the downside, an alternative to Jena is not a real possibility here, as the SPARQL in the page code makes heavy use of Jena ARQ extensions for counts and other aggregate functions. SPARQL 1.1 specifies these things, so perhaps in future it will be a possibility.||As a real Java web application, one of the primary requirements here is that the repository is addressable using Java code from multiple instances[1]. TDB doesn’t allow this because you define it per JVM. Concurrent access leads to unpredictable results, to put it politely. SDB would do it, as the database takes care of the ACIDity, but it’s slow.
I also wanted to be able to demonstrate the application and test performance with RDBMS technology or a Semantic Repository, or indeed NoSQL technology. The class hierarchy and componentisation allow this, but at this stage I’ve not tried going back to RDBMS or the NoSQL route. Empire-JPA allows a variety of repositories to be used, and those based on Sesame include OWLIM and BigData, which seem to offer large scale and clustered repository capability. To use AllegroGraph or Rdf2Go would require another implementation of my Persistence Layer, and may require more bean annotations.
So, nothing is perfect, everything is “slow”, but flexibility is available.
1. It might be possible to treat the repository as a remote datasource and use SPARQL Select and Insert/Update queries (and this may be more performant, it turns out), but for this exercise I wanted to stick with tradition and build a J2EE application that didn’t have hard coded queries (or externalised and mapped ones a la iBatis) but that encapsulated the business logic and entities as a bean and service object base.
||More important here than in the J2EE webapp: being able to host a dataset on the Linked Data Web with 303 redirects, permanent URLs, slash rather than hash URIs and content negotiation meant that I ended up with Joseki as the SPARQL endpoint, and a servlet filter within a base webapp that did the URI rewriting, 303 redirect and content negotiation. Ontology and instance URIs can be serviced by loading the ontology into the TDB repository. The application is read only, so there’s no need for the Joseki insert/update servlet. I also host an ancillary dataset for townlands so that I can keep it distinct for use with other applications, but federate it in with the ARQ SERVICE keyword. Making links between extracted entities and GeoNames, DBpedia and any other dataset is done by a decorator object in the extraction pipeline. Jena’s SPARQL objects are used for this, but in the case of the GeoNames web service, their Java client library is used.
Vladimir Dzhuvinov has another CORS filter, which adds request-tagging and Access-Control-Expose-Headers in the response.
|The only need for a SPARQL endpoint here is for debugging purposes. You need to be able to see the triples as the repository sees them when you use an ORdfM layer, so that you can understand the queries that are generated, why some of your properties aren’t showing up, and so on.
For query handling I needed a full featured console that would allow me to use inference (performance permitting) and to render results efficiently. I also needed to be able to federate queries across datasets or endpoints, especially to UMBEL, so that I could offer end users the ability to locate data tagged with an UMBEL URI that was “similar” to one they were interested in (eg sharing a skos:broader statement). Jena provides the best support here in terms of SPARQL extensions, but inference was too slow for me, and I could mimic some of the basic inference with targeted query writing for Sesame. Sesame doesn’t do well with aggregate functions, and inference is per repository and on-write, so you need to adjust how you view the repository compared to how Jena does it. Sesame is faster with an in-memory database.
Velocity is used in some places when a user sends a message or invitation, but this is done at the object layer.
The UI doesn’t talk SPARQL to any endpoint. SPARQL queries are generated based on end user actions (the query and reporting console), but this is done at the Java level.
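Generating a SPARQL query at the Java level for the “similar” lookup (items tagged with a concept that shares a skos:broader parent with the one the user picked) could be sketched as follows. This is an assumption about the shape of such a query, not the application’s actual code: the class `SimilarQuery`, the tagging property and the example URIs are hypothetical, and no inference or federation is attempted.

```java
import java.net.URI;

class SimilarQuery {
    static final String SKOS = "http://www.w3.org/2004/02/skos/core#";

    // Validates the value and wraps it in angle brackets for use in a query.
    // URI.create rejects spaces, quotes and angle brackets, which guards against query injection.
    static String iri(String value) {
        URI uri = URI.create(value);
        if (!uri.isAbsolute()) {
            throw new IllegalArgumentException("Not an absolute IRI: " + value);
        }
        return "<" + uri + ">";
    }

    // Builds a query for items tagged with any concept that shares a
    // skos:broader parent with the given concept (the concept itself included).
    static String build(String conceptUri, String tagPropertyUri) {
        return String.join("\n",
                "PREFIX skos: <" + SKOS + ">",
                "SELECT DISTINCT ?item WHERE {",
                "  " + iri(conceptUri) + " skos:broader ?parent .",
                "  ?similar skos:broader ?parent .",
                "  ?item " + iri(tagPropertyUri) + " ?similar .",
                "}");
    }
}
```

The resulting string can be handed to whichever query API the persistence layer exposes, which keeps the UI free of SPARQL.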
I’ve filled out the tools matrix with the 60 or so tools, libraries and frameworks I looked at for the two projects I created. Not all are used of course, and only a few are used in both. Includes comments and opinion, which I used and why, and all referenced. Phew.
Or not, as the case may be. Back to the old question of NE extraction. Can’t get GATE to work (NPE). ML doesn’t seem to work. JAPE is a bit of a pain, and doesn’t have much support, if any, for nested annotations. And I’m not a cunning linguist. Once I’ve banged my head against a wall enough I’ll try asking for help, although support looks scant. Must check again. Pity.
MinorThird half works, but the analysis and validation always return zero scores. Downloaded the latest version out of svn tonight, will check tomorrow if it fixes things. Had the same trouble with KEA. If I can get over this hurdle, or find a lib that’s up-to-date, current and open source (rather than LingPipe, which inevitably, if I buy into it, will mean license fees), then I can get on with the ontology-thesaurus mapping, the external linking, backchaining etc etc. Or I give up on the NE, work on the linking for now, and come back to it. Doing my head in.
And must get to grips with the Hidden Markov Model. Just like this damned Dell touchpad that keeps resetting itself….
Making some progress with the lo-fi rdf-ication job I’m working on.
Need to test more, then work on ontology, OpenVirtuoso investigation, linkages, reference rdf for links, link algorithms, etc etc etc.
And then I’ll give ML and NLP another go…