
DERI (LATC) launch schema.rdfs.org

June 18, 2011 Comments off

Some of the DERI people (and others) involved in LATC have launched schema.rdfs.org to counter the lack of RDFS in schema.org – the Microsoft/Google/Yahoo attempt to kickstart some structured markup publishing so their search engines can try to improve result relevancy. Some of the items in schema.org are quite simple, but that’s probably a good thing : a large term set or number of properties is going to look daunting to anyone interested or starting out for the first time – and indeed this is the reason cited for it not being RDF (it’s microdata). And while I agree with Michael Bergman that it is more than likely another step towards structured/linked/common/open data, adopters urgently need a combination of

  1. Tools (or better still no tools, just an unobtrusive natural way to author microdata or RDFa),
  2. a Reason to do it – payback, and
  3. Support in search UIs to specify vocabulary items

I’d like a WordPress plugin for instance, but then I’d need to host an instance myself or find a hoster that allows plugins, because wordpress.com doesn’t allow it. I’d also like to think that if I placed some RDFa in my blog it would get a higher ranking in search results (it should), but this blog is pretty specialised and it’s not commercially oriented, so I’m happy enough with keyword based results anyway.

So, I’m not going to be doing it too soon, and that’s the problem really. Or is it ? This post isn’t data really, but it does have links and it does talk about concepts, people, technology problems. If I could mark them up with tags and attributes that define what I am talking about, then I could tell those search engines and crawlers what I am talking about rather than hoping they can work it out from the title, the links I have chosen, the feedback comments and so on. Then people looking for these particular topics could find or stumble upon this post more easily. So, while there is arguably some data here, I don’t see it that way, and even if I think there might be a good Reason to do it, it’s too hard without the Tools.

So I wonder finally, if I were to mark up one of the people mentioned in this post with name, address, affiliation, organisation and so on, would the search engine UIs allow me to use this vocabulary directly ? Say I want to find articles about DERI : would the search drop-down prompt me with itemtype="http://schema.org/EducationalOrganization", so that I’d then only get results that have been marked up with this microdata type, and not things that are about the Deri vineyard in Wales, punto deri, courtney deri and so on ?

Sindice kinda does this, couldn’t the Goog do it too ? Or indicate which results are microdata’d, or allow a keyword predicate (like site: say), or allow the results to be filtered (like Search Tools in the left column). The point for me is that schema.org is only half or less than half the story – the search engines need to Support the initiative by making it available at query time, and to allow their results to be manipulated in terms of microdata/RDFS too. Then I might be more tempted to mark up my posts in microdata, RDFS, microformats or whatever, and I might create some extensions to the schemas and contribute a bit more, and my posts might get more traffic in the long tail, that traffic would be more valuable, my ad revenue might go up (if I had ads for myself !), and the ECB might drop their interest rates. Well, maybe not, but they’re not listening to anything else, perhaps some structured data might persuade them. It is the future after all.
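To make the query-time filtering idea concrete, here is a minimal sketch of what "only show me results marked up as an EducationalOrganization" could look like. It is not how any search engine does it: the two pages, their URLs and the helper names are made up for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser

# schema.org type used as the filter; the pages further down are invented.
EDU_ORG = "http://schema.org/EducationalOrganization"


class ItemTypeCollector(HTMLParser):
    """Collects the values of every microdata itemtype attribute in a page."""

    def __init__(self):
        super().__init__()
        self.item_types = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "itemtype" and value:
                # itemtype may hold several space-separated type URIs
                self.item_types.update(value.split())


def item_types(html):
    collector = ItemTypeCollector()
    collector.feed(html)
    return collector.item_types


def filter_by_type(pages, wanted_type):
    """Keep only the pages (url -> html) that declare wanted_type."""
    return [url for url, html in pages.items() if wanted_type in item_types(html)]


# Two hypothetical keyword hits for "DERI"
pages = {
    "http://example.org/deri-institute": (
        '<div itemscope itemtype="http://schema.org/EducationalOrganization">'
        '<span itemprop="name">DERI</span></div>'
    ),
    "http://example.org/deri-vineyard": "<p>Deri vineyard, Wales</p>",
}

print(filter_by_type(pages, EDU_ORG))
```

Only the first page survives the filter, because the vineyard page carries no microdata type at all.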

What’s the point : Semantic RESTful Web Services ?

March 9, 2011 4 comments

Well, I think it’s dawning on me that what Roy Fielding talks about (rather abstractly) [1] is what Henry Story neatly summarises and provides examples of [2] – REST SOA, with connected semantics. I’m one of those who can be accused of implementing REST not in the Roy Fielding sense of the word, but in the “anything that’s not WS*” sense. I’ve done request mapping, content negotiation, resource rendering in XML, JSON (and a bunch of others), GET, PUT, POST, HEAD etc, but never all together, and never in the true Spirit of Roy. But when you add the semantic web, you can really see that there’s something good going on here – “easy” and ubiquitous web services.

Roy talks about representations, resources and connectedness, about agents or service consumers that deal with well-known media types and links, and nothing else – REST implies that a user agent is “thin”, understands basic and well-known types and protocols, and renders a look and feel and a behaviour that reacts to what it is fed. As he says, it should work with the “follow your nose” principle (no need for WADL[3,4]).

For a browser this would mean that you point it at a URL, it displays content suitably, and it receives and displays links with appropriate CRUD capabilities for the resource and any relations it is given. For example, given a book resource, render it using the .book CSS class, and create links to add to shopping cart, get a contents list, add to a favourites list. For a chapter in the book, there may be a link to print it, to relate it to a chapter in another similar book, to annotate it and send to a colleague. For a daemon or agent it might mean that it alters the time at which it performs an action against a resource, or what action it takes. The navigation and action controls aren’t determined by business or display logic, but by the resource and its relations – the agent consuming the resource knows it has to display or follow a link, the CSS may have display capabilities based on the resource type or context, the workflow steps will appear at the right time for the right user, under the right circumstances. Client logic is solely to deal with converting representations to appropriate media-types, and driving application state – using relations and verbs to make transitions with links.

But the thing that got me spinning, as I tried to understand the abstractedness, and as I looked into JAX-RS [5] and its various implementations (well, Spring* in particular TBH, which doesn’t do JAX-RS in fact [6]), was that the connectedness and follow-your-nose principle seemed absent. It’s all very well and cosy (and arguably easy) to create some platform code that maps URIs to classes and methods and HTTP verbs, and then to output XML or JSON or not (think JSP), or perhaps even Atom, OData, RDF, N3 or TTL, but where’s the linked connectedness – the things we talk about and take for granted in the Linked Open Semantic Data world ? And how does it know what links to create, how to generate them, and how they should be presented (if there’s a human involved) ?

Well, Henry lights that lightbulb for me when he illustrates from his foaf profile all the foaf:knows relations [2]. In a RESTful world where a service returns a foaf file, an agent that reads the foaf:knows elements can decide what to do based on that predicate – it can deduce that the resource represented is a Person and can create the links it chooses using what it knows about foaf:knows and REST verbs – create/read/update/delete. It might allow addition of another foaf:knows with a PUT to the URI identifying the owner, an update to a mailing list so that all those foaf:knows objects are added, or automatically update a trust counter against a system resource because if Henry foaf:knows TBL in this context, then TBL must be “good for it” :-). In addition, it only knows that a URI represents that Person, and the URI could be a hypermedia link in the form of a URL, an FTP or WebDAV link, or some other protocol. Finally, this “knows” concept is really an upfront agreement about what representations are being used for the state of the application (it knows an XML schema, or an Ontology, or perhaps even looks them up on the fly), but navigating through state is controlled by the interactions with the service (HTTP verbs) and the responses (status and agreed representation in the body content) received.
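As a rough illustration of an agent deciding what to offer purely from the predicates it understands, here is a small sketch. The triples, the example.org URIs and the `affordances` helper are hypothetical; a real agent would parse RDF with a proper library rather than work on hand-written tuples.

```python
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical triples, as they might come back from dereferencing a profile URI
HENRY = "http://example.org/people/henry#me"
triples = [
    (HENRY, RDF_TYPE, FOAF_PERSON),
    (HENRY, FOAF_KNOWS, "http://example.org/people/tim#i"),
    (HENRY, FOAF_KNOWS, "http://example.org/people/dan#me"),
]


def affordances(subject, triples):
    """Derive (HTTP verb, target URI, meaning) links from known predicates."""
    links = []
    is_person = (subject, RDF_TYPE, FOAF_PERSON) in triples
    knows_someone = any(s == subject and p == FOAF_KNOWS for s, p, o in triples)
    # foaf:knows has domain foaf:Person, so its presence also implies a Person
    if is_person or knows_someone:
        links.append(("PUT", subject, "add another foaf:knows to this profile"))
    for s, p, o in triples:
        if s == subject and p == FOAF_KNOWS:
            links.append(("GET", o, "follow your nose to a known person"))
    return links


for verb, target, meaning in affordances(HENRY, triples):
    print(verb, target, "-", meaning)
```

Nothing here is hard-wired to Henry's profile: the links fall out of the vocabulary, which is the point of the follow-your-nose approach.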

At first sight those RESTful libraries don’t really need to know that much about the connectedness – they only need to map verbs and serve resources with those links embedded (RDF anyone? ) and using those well-known vocabularies, classes,relations and constraints – ie ontologies. But what about workflow : I post an object or resource, I get a response with the ID of that resource, and I need a link that tells me where to go for the next state transition ?

So, lo-and-behold, we have semantic linked data and REST superadditively combined, in a loosely coupled web (or “cloud”, if you like that keyword) of semantic links, intelligent user agents that understand those links or their context, web resolvable URIs, and value-added interlinked services – in effect a “Web Service Bus”. [7] !!

Now

  1. Point your People tool at the RESTful people+location web service and it “just works” to give you a social-network-mashup of connected people and interests (provenance, trust), and then
  2. switch over to your Energy consumption application and it also just works (based on what it has chosen to do and the well-known ontologies and resources it understands) – see how big your carbon footprint is when you meet TBL next week at Geneva if you fly,drive or take the train – and maybe you’ll be able to see who you can meet on the way and who else will be sitting beside you.

But you’re not out of the woods yet; doing semantic RESTful web apps isn’t a clear open space : your application still has to deal with authentication, input validation, long lived database transaction control, multithreading, performance, perhaps object relational mapping. But JAX-RS/REST takes care of the object-message mapping (the interface-to-implementation layer), your client or agent is thin but intelligent, and your middle tier contains your business logic.

Your application will need to honour the request-response state machine, perhaps checking which methods are available using OPTIONS, or using ETags for conditional requests.
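A minimal sketch of that state machine, assuming a single hypothetical resource at /books/42 and an in-memory store. The `handle` function and the `ALLOWED` table are illustrative names, not part of JAX-RS or any framework; the status codes and header names are the standard HTTP ones.

```python
import hashlib

# Hypothetical resource table: which verbs each path supports
ALLOWED = {"/books/42": ["GET", "HEAD", "PUT", "OPTIONS"]}


def etag_for(body):
    """Strong ETag derived from the representation bytes (quoted, as HTTP expects)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def handle(method, path, headers, store, body=b""):
    """Return (status, response headers, response body) for a tiny HTTP subset."""
    if path not in ALLOWED:
        return 404, {}, b""
    allow = {"Allow": ", ".join(ALLOWED[path])}
    if method == "OPTIONS":
        return 204, allow, b""
    if method not in ALLOWED[path]:
        return 405, allow, b""
    tag = etag_for(store[path])
    if method in ("GET", "HEAD"):
        # Conditional read: the client already holds the current version
        if headers.get("If-None-Match") == tag:
            return 304, {"ETag": tag}, b""
        return 200, {"ETag": tag}, store[path] if method == "GET" else b""
    # PUT: refuse the write unless the client has seen the current version
    if headers.get("If-Match") != tag:
        return 412, {"ETag": tag}, b""
    store[path] = body
    return 204, {"ETag": etag_for(body)}, b""


store = {"/books/42": b"v1"}
status, response_headers, _ = handle("GET", "/books/42", {}, store)
print(status, response_headers["ETag"])
```

A client that sends back the ETag in If-None-Match gets a 304 instead of the body, and a PUT carrying a stale If-Match is rejected with 412, which guards against lost updates.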

You’ll need to decide how to transform from your programming model of choice – OO perhaps – to Resource. Some of the object to RDF mapping within libraries like Empire[15], JenaBean[16], Sommer[22] (defunct ?), object-triple[17] may help. Perhaps this won’t be an issue for you if you can foist the RESTful resource and linkage proposition onto an object model and remain in the object world – why waste processor and resource when you store data in RDF, convert to an Object on retrieval, process, convert to RDF/XML or JSON on the way out, then parse and walk in a JSP before rendering as HTML ? As an OO programmer on the web you’re familiar with marshalling objects in and out of different serializations – RDF/XML/JSON/HTML – but you do want and need to minimise those transitions. Perhaps for “Big Data” we should stay in the Resource world : persist to a fast native RDF triplestore or HPC based system on a cluster running MapReduce or somesuch (CouchDB[20], Heart/HBase[21] ? perhaps BigData[18] or SHARD[19], AllegroGraph[23] ?), and talk to it with Prolog or some such – forget the Object paradigm and embrace the Linked, Open World Resources, and also do it with REST.

You also have to be clear that REST suits what you want to do (other architectures haven’t just been demoted to history) : what your services are, what you are interfacing with, what your domain objects are, what service operations are exposed when, what workflow you need to encompass[13], and how granular you need to be. A shopping cart application will need to save items to a shopping list, rather than save the items themselves (or the cart resource probably), but it will also, behind the service, need to update a stock control or inventory system – which isn’t exposed to your end user. So be clear about which service level CRUD operations you need to expose to your user or “agent”, and which if any domain objects you need to directly manipulate.

But in the end, hopefully, you’ve still followed your enterprise principles and patterns, but you’ve adopted a long lasting web-scale architecture, and if you’ve added the semantic vocabulary, you’ve got the basis for successful evolution, a network effect, adaptable clients and agents and a successful resolution to an important business case – that’s why you’re doing this, isn’t it, not because it’s cool ?

Update : April 24 – read Otavio’s paper on RESTfulGrounding [25] but also read Alowisheq, Millard and Tiropanis’ EXPRESS RESTful services paper[26]. RESTfulGrounding does for REST and WADL what OWL-S does for WSDL – it gets Semantic descriptions into the syntactic descriptions that automated services might use to interact with a web service, and facilitates discovery, composition, monitoring and execution. EXPRESS takes a different approach : based on an existing RESTful web service, it allows you to create an OWL description that can also be RESTfully accessed to describe the service’s resources, relations and “parameters” (OWL DatatypeProperty and ObjectProperty). They describe an adaptation of Amazon S3 buckets and docs with EXPRESS and compare with SA-REST and OWL-S approaches.

I like EXPRESS more than RESTfulGrounding as the simplicity appeals : it in turn relies on REST to underpin the service description access and interaction, it adheres to RESTful principles for message exchange (using TTL rather than XML) and to follow-your-nose, and this in turn means I don’t have to learn much if I want to make use of it. It does need the use of a code generator for stubs and URIs and a manual step to define which methods apply to which URIs, and doesn’t do much for discovery and composition – but they acknowledge this and intend to work on it – and a real implementation with these tools needs to be made available so that people like me can try it out. Is there one ?

I need to understand more about WADL[27,28] (why is it needed in the first place ?) and how I might go about actually building a set of services that need to be described and then discovered and composed to provide some useful value, but EXPRESS fits nicely into web scale, lo-fi approaches that quickly gain traction and that might make use of a CPoA kind of approach for discovery and composition.

* You’ve got other choices :

  • Apache CXF – perhaps best if you come from the WS* camp or have a mixture [8]
  • GlassFish Jersey – seems to have good traction, with hooks into Spring et al [9]
  • RESTeasy – JBoss jax-rs implementation [10]
  • RESTlet – not sure about this, seems to have good support, taking a different approach apparently – eg RESTlet vs SERVlet, but I need more info to do it justice [11]
  • PLAY Framework – has good REST support I understand from others. [12]
  • Clerezza – Apache incubator project with RDF, jax-rs, scala and “renderlet” support. Looks interesting from a RDF PoV, but maybe not so interesting from an OOD PoV [14]

[1] http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven
[2] http://blogs.sun.com/bblfish/entry/rest_apis_must_be_hypertext
[3] http://wadl.java.net/
[4] http://bitworking.org/news/193/Do-we-need-WADL
[5] http://jcp.org/en/jsr/detail?id=311
[6] http://grzegorzborkowski.blogspot.com/2009/03/test-drive-of-spring-30-m2-rest-support.html
[7] http://wisdomofganesh.blogspot.com/2010/06/wanted-esc-not-esb.html
[8] http://cxf.apache.org/
[9] http://jersey.java.net/
[10] http://www.jboss.org/resteasy
[11] http://www.restlet.org/
[12] http://www.playframework.org/documentation/1.1/routes
[13] http://www.infoq.com/articles/webber-rest-workflow
[14] http://incubator.apache.org/clerezza/
[15] https://github.com/clarkparsia/Empire
[16] http://code.google.com/p/jenabean/
[17] http://code.google.com/p/object-triple/
[18] http://www.systap.com/bigdata.htm
[19] http://www.dist-systems.bbn.com/people/krohloff/shard_overview.shtml
[20] http://couchdb.apache.org/
[21] http://wiki.apache.org/incubator/HeartProposal
[22] http://java.net/projects/sommer/
[23] http://www.franz.com/agraph/allegrograph/
[24] http://blog.cubrid.org/web-2-0/database-technology-for-large-scale-data/
[25] http://www.fullsemanticweb.com/blog/ontologies/restfulgrounding/
[26] http://ebookbrowse.com/express-expressing-restful-semantic-services-using-domain-ontologies-pdf-d12806537
[27] http://java.net/projects/wadl/
[28] http://bitworking.org/news/193/Do-we-need-WADL

Java Semantic & Linked Open Data webapps – Part 5.2

February 8, 2011 Comments off

Continuation from previous article in series

The overall architecture for the Semantic backed J2EE app is different from the Linked data app already discussed because we need a business logic layer and a decoupling from the persistence layer. We also want to create a Java app rather than a semantic application so that the programming paradigms and patterns are familiar to the Enterprise java developer.

 

Semantically backed J2EE webApp - System diagram

Here we see a fairly standard 3 tier MVC application. The browser requests URIs from the appserver or makes an Ajax call, and gets HTML from server side JSPs or JSON formatted data in response, respectively. The application server contains Java code that maps URIs and API calls to controllers, which make calls to service classes and DAO code. The DAO code makes calls via a persistence proxy to get data from the server that is unmarshalled from RDF to Java objects (or makes writes in the other direction). The persistence layer is configured to use an implementation that takes care of the Object to RDF mapping – two implementations are available (JenaBean and EmpireJPA). These in turn use their own protocols to talk to native or location repositories, or typically JDBC to talk to a standard DBMS. Spring and Spring Security provide infrastructure level services for dependency injection, component wiring, MVC abstractions, and role, method and data level security for beans and dynamically created object instances. These technologies are shown below in the AppServer layer cake.
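The decoupling that the persistence proxy buys can be sketched in a few lines. This is only an illustration of the pattern, written in Python for brevity rather than Java: `PersonStore`, `TripleBackedStore`, `TableBackedStore` and `PersonService` are hypothetical names, and the two stores are in-memory stand-ins for an Object to RDF mapper and a relational back end.

```python
from typing import Optional, Protocol


class PersonStore(Protocol):
    """The persistence proxy: all that service code is allowed to see."""

    def save(self, person_id: str, name: str) -> None: ...

    def load(self, person_id: str) -> Optional[str]: ...


class TripleBackedStore:
    """Stand-in for an Object to RDF mapper: keeps (subject, predicate, object) triples."""

    NAME = "http://xmlns.com/foaf/0.1/name"
    BASE = "http://example.org/people/"  # hypothetical URI namespace

    def __init__(self):
        self.triples = set()

    def save(self, person_id, name):
        subject = self.BASE + person_id
        # Drop any previous name so the subject keeps a single value
        self.triples = {
            t for t in self.triples if not (t[0] == subject and t[1] == self.NAME)
        }
        self.triples.add((subject, self.NAME, name))

    def load(self, person_id):
        subject = self.BASE + person_id
        for s, p, o in self.triples:
            if s == subject and p == self.NAME:
                return o
        return None


class TableBackedStore:
    """Stand-in for a conventional relational table keyed by id."""

    def __init__(self):
        self.rows = {}

    def save(self, person_id, name):
        self.rows[person_id] = name

    def load(self, person_id):
        return self.rows.get(person_id)


class PersonService:
    """Business logic; unaware of which store implementation it was given."""

    def __init__(self, store: PersonStore):
        self.store = store

    def rename(self, person_id, new_name):
        cleaned = new_name.strip()
        if not cleaned:
            raise ValueError("name must not be empty")
        self.store.save(person_id, cleaned)
        return self.store.load(person_id)


for store in (TripleBackedStore(), TableBackedStore()):
    print(type(store).__name__, PersonService(store).rename("p1", " Ada "))
```

The service behaves identically with either store, so swapping the persistence mechanism is a wiring decision rather than a code change.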

 

Technology libraries and tools used in Semantically backed J2EE WebApp


Obviously, there are many things going on here, and they’ll need some discussion

  • Basic building blocks, tool selection
  • Security considerations and restrictions
    • authentication – OpenID, admin login, facebook connect
    • authorisation role,uri,method,data levels
    • registration process
    • ownership, group (friend) and application membership, resolution (date & location cloaking)
    • ACL – data and dynamic object level authorisation
    • Syndication –
      • json,jsonp (get/post),window.name,ajax,
      • cors
      • oauth
      • API and result formats
  • Scale, concurrency, transactions,  failures, and performance
  • URIs, ontology, linkage
  • input, output interfaces
  • ontology to object/interface mapping

Java Semantic & Linked Open Data webapps – Part 5.1

January 18, 2011 1 comment

How to Architect ?

Well – what before how  – this is firstly about requirements, and then about treatment

Linked Open Data app

Create a semantic repository for a read only dataset with a sparql endpoint for the linked open data web. Create a web application with Ajax and html (no server side code) that makes use of this data and demonstrates linkage to other datasets. Integrate free text search and query capability. Generate a data driven UI from ontology if possible.

So – a fairly tall order : in summary

  • define ontology
  • extract entities from digital text and transform to rdf defined by ontology
  • create an RDF dataset and host in a repository.
  • provide a sparql endpoint
  • create a URI namespace and resolution capability. ensure persistence and decoupling if possible
  • provide content negotiation for human and machine addressing
  • create a UI with client side code only
  • create a text index for keyword search and possibly faceted search, and integrate into the UI alongside query driven interfaces
  • link to other datasets – geonames, dbpedia, any others meaningful – demonstrate promise and capability of linkage
  • build an ontology driven UI so that a human can navigate data, with appropriate display based on type, and appropriate form to drive exploration

Here’s what we end up with

Lewis Topographical Dictionary linked data app - system diagram

  1. UserAgent – a browser navigates to Lewis TDI homepage – http://uoccou.endofinternet.net:8080/resources/sparql – and
  2. the webserver (tomcat in fact) returns html and javascript. This is the “application”.
  3. interactions on the webpage invoke javascript that either makes direct calls to Joseki (6) or makes use of permanent URIs (at purl.org) for subject instances from the ontology
  4. purl.org redirects to dynamic dns which resolves to hosted application – on EC2, or during development to some other server. This means we have permanent URIs with flexible hosting locations, at the expense of some network round trips – YMMV.
  5. dyndns resolves to EC2, where a 303 filter intercepts the request and resolves it to a sparql (6) call for html, json or rdf. pluggable logic for different URIs and/or accept headers means this can be a select, describe, or construct.
  6. Joseki as a sparql endpoint provides RDF query processing with extensions for freetext search, aggregates, federation, inferencing
  7. TDB provides a single semantic repository instance (java, persistent, memory mapped) addressable by joseki. For failover or horizontal scaling with multiple sparql endpoints SDB should probably be used. For vertical scaling at TDB – get a bigger machine ! Consider other repository options where physical partitioning, failover/resilience or concurrent webapp instance access is required (ie if you’re building a webapp connected to a repository by code rather than a web page that makes use of a sparql endpoint).
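Step 5 in the list above, the 303 filter, can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the endpoint URL, the `output` parameter name, and the mapping from Accept header to query form. The real filter was pluggable per URI and is not reproduced here.

```python
from urllib.parse import urlencode

# Assumptions: hypothetical endpoint URL and "output" parameter name
SPARQL_ENDPOINT = "http://example.org/sparql"

SUPPORTED = {
    "text/html": "html",
    "application/json": "json",
    "application/rdf+xml": "rdf",
}

# One possible mapping from requested format to SPARQL query form
QUERY_TEMPLATES = {
    "html": "SELECT ?p ?o WHERE {{ <{uri}> ?p ?o }}",
    "json": "SELECT ?p ?o WHERE {{ <{uri}> ?p ?o }}",
    "rdf": "DESCRIBE <{uri}>",
}


def choose_format(accept_header):
    """Return the first supported format in the Accept header (q-values ignored)."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in SUPPORTED:
            return SUPPORTED[media_type]
    return "html"  # fall back to the human-readable view


def redirect_for(resource_uri, accept_header):
    """Build the (status, Location) pair a 303 filter might answer with."""
    fmt = choose_format(accept_header)
    query = QUERY_TEMPLATES[fmt].format(uri=resource_uri)
    location = SPARQL_ENDPOINT + "?" + urlencode({"query": query, "output": fmt})
    return 303, location


print(redirect_for("http://purl.org/example/place/Dublin", "application/rdf+xml"))
```

A machine asking for RDF is sent to a DESCRIBE query, while a browser asking for HTML gets a SELECT whose results can be rendered as a page. A production filter would also honour q-values in the Accept header.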

The next article will provide a similar description of the architecture used for the Java web application, with code that is directly connected to a repository rather than one that talks to a sparql endpoint.

Java Semantic & Linked Open Data webapps – Part 4

December 17, 2010 Comments off

What needs writing ?

Now that we have an idea about what tools and technologies are available and the kind of application we want to build we need to start considering architecture and what code we will write around those tools and technologies. The architecture I planned was broadly formed – but not completely – as I went about creating these applications. I was also going to tackle the Linked Open Data webapp first and then do the Semantic Backed J2EE app. I thought MVC first for both, but went in the end with a 2 tier approach for the former, and an n-tier component based approach for the latter. (More about this in the next section). I’m used to the Spring framework, so I thought I’d go with it, and for UI I’d use jQuery and HTML and/or JSP, perhaps Velocity. But nothing was set in stone, and I was going to try and explore and be flexible.

The tools and technologies cover

  • creating an ontology
  • entity extraction
  • RDF generation
  • using RDF with Java
  • Semantic repositories
  • querying sparql end points
    • inference
    • linking data
  • UI and render
The notes below cover each category twice : first for the Linked Open Data webapp, then for the Semantic Backed J2EE webapp.

creating an ontology

Linked Open Data webapp : The ontology was going to be largely new as there is not much about that deals with historical content. Some bibliographic ontologies are out there, but this isn’t about cataloguing books or chapters, but about the content within and across the sections in a single book. There are editions for Scotland, Wales and UK also, so I might get around to doing them at some stage. Some of the content is archaic – measurements are in Old English miles for instance. Geographic features needed to be described, along with population and natural resources. I wasn’t sure if I needed the expressiveness of OWL over RDFS, but thought that if I was going to start something fresh I might as well leave myself open to evolution and expansion – so OWL was the choice. Some editors don’t do OWL, and in the end I settled for Protege.

Semantic Backed J2EE webapp : Same thoughts here as for the Linked Data app – why limit myself to RDFS ? I can still do RDFS within an OWL ontology. Protege it is.
entity extraction

Linked Open Data webapp : Having played with GATE, OpenNLP, MinorThird and a foray into UIMA I settled on writing my own code. I needed close connections between my ontology, extracting the entities and generating RDF from those entities – most of these tools don’t have this capability out of the box (perhaps they do now, 1 year on) – and I also wanted to minimise the number of independent steps at this point so that I could avoid writing conversion code and configuring multiple parts in different ways and for different environments or OS. There is also a high barrier to entry and a long learning curve for some of these tools. I had read a lot, enough even, and wanted to get my hands dirty. I decided to build my own, based on grep – as most of these tools use regex at the bottom end and build upon it. It wasn’t going to be sophisticated, but it would be agile, best effort, experience based coding I’d be doing, and learning all the way – not a bad approach I think. I’d borrow techniques from the other tools around tokenisation and gazetteering, and if I was lucky, I might be able to use some of the ML libraries (I didn’t in the end). So, with the help of Jena, I wrote components for

  • Processing files in directories using “tasks”, outputting to a single file, multiple files, multiple directories, different naming conventions, encoding, different RDF serialisations
  • Splitting a single large file into sections based on a heading style used by the author. This was complicated by page indexing and numbering that used a very similar style, and variations within sections that meant that end-of-section was hard to find. I got most entries out, but from time to time I find an embedded section within another. This can be treated individually, manually, and reimported into the repository to replace the original and create 2 in its place.
  • Sentence tokenisation – I could have used some code from the available libraries and frameworks here, but it’s not too difficult, and when I did compare to the others eventually, I discovered that they also came a cropper in the same areas I did. Some manual corrections are still needed no matter how you do it, so I stuck with my own
  • Running regex patterns, accumulating hits in a cache. A “concept” or entity has a configuration element, and a relationship to other elements (a chain can be created).
    • The configuration marries an “Entity” with a “Tag” (URI). Entities are based on a delimiter or a gazetteer.
    • Entities can be combined if they have a grouping characteristic.
    • An Entity can be “required” meaning that unless some “other” token is found in a sentence, the entity won’t be matched. This can also be extended to having multiple required or ancillary matches, so that a proportion need to be found (a likelihood measure) before an entity is extracted.
    • Some Entities can be non-matching – just echo whatever is in the input – good for debug, and for itemising raw content – I use this for echoing the sentences in the section that I’m looking at – the output appears alongside the extracted entities.
    • The Required characteristic can also be used with Gazetteer based greps.
    • Entities have names that are used to match to Tags
  • Creating a Jena Model and adding those entities based on a configured mapping to an ontology element (URI, namespace, nested relationship, quantification (single or list, list type))
  • Outputting a file or appending to a file, with a configured serialisation scheme (xml/ttl/n3/…)
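The regex-plus-required-token matching described in the list above can be sketched roughly as follows. This is not the original Java code: `EntityRule`, `gazetteer_pattern` and the sample sentence are all made up for illustration, and the sentence splitter is deliberately naive.

```python
import re
from dataclasses import dataclass, field


@dataclass
class EntityRule:
    name: str                  # entity name, later matched to a Tag (URI)
    pattern: str               # regex with one capturing group for the value
    required: list = field(default_factory=list)  # tokens that must also occur
    min_required: float = 1.0  # proportion of required tokens that must be found


def gazetteer_pattern(names):
    """Turn a list of known names into a single alternation regex."""
    return r"\b(" + "|".join(re.escape(n) for n in names) + r")\b"


def split_sentences(text):
    """Naive splitter: break after . ! or ? when followed by space and a capital."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s.strip()]


def extract(text, rules):
    """Return (entity name, value) hits, sentence by sentence."""
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for rule in rules:
            if rule.required:
                found = sum(1 for token in rule.required if token in lowered)
                if found / len(rule.required) < rule.min_required:
                    continue  # not enough supporting evidence in this sentence
            for match in re.finditer(rule.pattern, sentence):
                hits.append((rule.name, match.group(1)))
    return hits


rules = [
    EntityRule("population", r"(\d[\d,]*) inhabitants",
               required=["parish", "town"], min_required=0.5),
    EntityRule("distance_miles", r"(\d+) miles"),
    EntityRule("place", gazetteer_pattern(["Cork", "Dublin"])),
]

text = "The parish contains 1,234 inhabitants. It is 7 miles from Cork."
print(extract(text, rules))
```

The population rule only fires in the first sentence because at least half of its required tokens ("parish", "town") must appear alongside the number.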
Semantic Backed J2EE webapp : This was a different kind of application – here no data exists at the start, and all is created and born digital. No extraction needed.
RDF generation

Linked Open Data webapp : I naively started the RDF generation code as a series of string manipulations and concatenations. I thought I could get away with it, and that it would be speedy ! The RDF generation code in Jena didn’t seem particularly sophisticated – the parameters are string based in the end, and you have to declare namespaces as a string etc so what could possibly go wrong ? Well, things got unwieldy, and when I wanted to validate, integrate and reuse this string manipulation code it became tedious and fractious. Configuration was prone to error. Jena at higher stages of processing then needs proper URIs and other libraries operate on that basis. So, just in time, I switched – luckily I had built the code thinking that I might end up having to alter my URI definition and RDF generation strategy, so it ended up being a discrete replacement – a new interface implementation that I could plug in.
Tags can be

  • reference – always create the same URI – used with properties mostly – eg rdf:type
  • append – a common and complete base, with just a value appended
  • complex – a base uri, intermediate path, ns prefix, type or subject path, a value URI different from the containing element
  • lookup – based on entity value, return a particular URI – like a reverse gazetteer
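The four Tag kinds in the list above might look something like this when expressed as small URI-building functions. The function names, the example.org namespace and the lookup table entry are hypothetical; the original was a Java interface with pluggable implementations.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def reference_tag(uri):
    """reference: always the same URI, whatever the entity value."""
    return lambda value=None: uri


def append_tag(base):
    """append: a complete base with just the (cleaned) value appended."""
    return lambda value: base + quote(value.strip().replace(" ", "_"))


def complex_tag(base, path, type_path):
    """complex: base URI + intermediate path + type path + value."""
    def build(value):
        slug = quote(value.strip().replace(" ", "_"))
        return "/".join([base.rstrip("/"), path, type_path, slug])
    return build


def lookup_tag(table, default=None):
    """lookup: map a known entity value to a specific URI (a reverse gazetteer)."""
    return lambda value: table.get(value.strip().lower(), default)


# Hypothetical namespace and lookup table, for illustration only
place = append_tag("http://example.org/place/")
parish = complex_tag("http://example.org/", "lewis", "parish")
linked = lookup_tag({"dublin": "http://dbpedia.org/resource/Dublin"})

print(reference_tag(RDF_TYPE)())
print(place("New Ross"))
print(parish("Kilkenny"))
print(linked("Dublin"))
```

Because each Tag is just a function from entity value to URI, swapping strategies does not touch the extraction code, which fits the pluggable replacement described above.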
Semantic Backed J2EE webapp : Here, RDF generation isn’t driven by extraction or preexisting entities, but by the Object model I used. See the next category (Using RDF with Java) for details.
Using RDF with Java

Linked Open Data webapp : Fairly early on I settled on Jena as opposed to Sesame. There are some notes I found comparing Jena to Sesame (1), but some of the arguments didn’t mean anything to me at the early stages. There wasn’t much between them I thought, but the Jena mailing list seemed a bit more active, and I noted Andy Seaborne’s name on the Sparql working group (2). Both are fully featured with Sparql endpoints, repositories, text search and so on, but take different approaches (3). Since then I’ve learned a lot of course, and I’ve compiled my own comparison matrix[110]. So – I went for Jena, and I probably will in other cases, but Sesame may suit things better in others.

While Jena is Object oriented, working with it is based on RDF rather than objects. So if you have a class with properties – a bean – you have to create a Model, the Subject and add the properties and their values, along with the URIs and namespaces that they should be serialised with. You cannot hand Jena a Bean and say “give me the RDF for that object”.

For this project that wasn’t an issue – I wasn’t modelling a class hierarchy, I wanted RDF from text, and then to be able to query it, and perhaps use inference. Being able to talk to Sparql endpoints and manipulate RDF was more important than modelling an Object hierarchy.

1. http://www.openrdf.org/forum/mvnforum/viewthread?thread=2043#7470
2. http://www.w3.org/2009/sparql/wiki/User:Andy_Seaborne
3. They’re different because they can be – this isn’t like programming against a standard like JDBC, there isn’t a standard way of modelling RDF in Java or as an Object – there are domain differences that may well make that impossible, in entirety. Multiple inheritance, restrictions and Open World Assumption make for mismatches. Prolog and LISP may be different or more suited here, or perhaps some other language.

Semantic Backed J2EE webapp : Here I needed to be able to maintain parallel worlds – an Object base with a completely equivalent RDF representation. And I wanted to be able to program this from an enterprise Java developer’s perspective, rather than that of a logician or information analyst. How do I most easily get from Object to RDF without having to code for each triple combination [109]? Well it turns out there are 2 choices, and I ended up using one and then the other. It was also conceivable that I might not be able to do what I wanted, or that it wouldn’t perform – I saw the impact of inference on query performance in the Linked Data application – so I wanted to code the app so that it would be decoupled from the persistence mechanism. I also needed to exert authorization control – could I do this with RDF ?

  • Java-RDF – I stuck with Jena – why give up a good thing?
  • Object-RDF – Jena has 2 possibilities – JenaBean and Jastor. I settled for JenaBean as it seemed to have support and wasn’t about static class generation. This allows you to annotate your javabeans with URI and property assertions so that a layer of code can create the RDF for you dynamically, and then do the reverse when you want to query.
  • AdHoc Sparql – the libraries work OK when you are asking for Objects by ID, but if you want Objects that have certain property values or conditions then you need to write Sparql and submit that to the library.
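
To make the annotation idea concrete, here is a minimal sketch in plain Java, using only reflection. The `RdfType` and `RdfProperty` annotations, the `Person` bean and the example.org URI are all made up for illustration; they are not JenaBean’s (or Empire’s) actual annotations, and a real library would build a Jena Model rather than strings. It only shows the general shape: a layer of code reads the annotations and emits the triples for you.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

class BeanToTriples {
    // Hypothetical annotations, for illustration only (not the JenaBean API).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface RdfType { String value(); }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface RdfProperty { String value(); }

    // An ordinary bean, marked up with the class and property URIs it maps to.
    @RdfType("http://xmlns.com/foaf/0.1/Person")
    static class Person {
        final String uri;  // subject URI; not annotated, so not emitted as a property
        @RdfProperty("http://xmlns.com/foaf/0.1/name") final String name;
        @RdfProperty("http://xmlns.com/foaf/0.1/nick") final String nick;

        Person(String uri, String name, String nick) {
            this.uri = uri;
            this.name = name;
            this.nick = nick;
        }
    }

    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Reads the annotations and emits one N-Triples line per non-null annotated field,
    // plus an rdf:type statement if the class carries @RdfType.
    static List<String> toTriples(String subjectUri, Object bean) {
        List<String> triples = new ArrayList<>();
        RdfType type = bean.getClass().getAnnotation(RdfType.class);
        if (type != null) {
            triples.add("<" + subjectUri + "> <" + RDF_TYPE + "> <" + type.value() + "> .");
        }
        for (Field field : bean.getClass().getDeclaredFields()) {
            RdfProperty property = field.getAnnotation(RdfProperty.class);
            if (property == null) {
                continue;  // unannotated fields are ignored
            }
            try {
                field.setAccessible(true);
                Object value = field.get(bean);
                if (value != null) {
                    triples.add("<" + subjectUri + "> <" + property.value() + "> \""
                            + escape(value.toString()) + "\" .");
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return triples;
    }

    // Escapes backslashes and double quotes so the literal stays well-formed.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Person p = new Person("http://example.org/people/1", "Ada", "ada99");
        for (String triple : toTriples(p.uri, p)) {
            System.out.println(triple);
        }
    }
}
```

Running `main` prints three lines: one rdf:type statement and one statement per annotated field. Going the other way (query results back into beans) is the harder half, and is where the AdHoc Sparql point above starts to bite.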

So, I could build my app in an MVC style, and treat the domain objects much like I would if I used Hibernate or JDO say. In addition, I could put in a proxy layer so that the services weren’t concerned about which persistence approach I took – if I wanted, I could revert to traditional RDBMS persistence. So I could have View code, controllers, domain objects (DAO), service classes, and a persistence layer consisting of a proxy and an Object to RDF implementation.
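
A rough sketch of that proxy layer, in plain Java. The `Store`, `InMemoryStore` and `StoreProxy` names are invented for this example and the in-memory map is only a stand-in; in the real application the backend behind the proxy would wrap JenaBean, Empire-JPA or an RDBMS DAO.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class PersistenceProxyDemo {
    // What the service layer sees; nothing here mentions RDF or SQL.
    interface Store<T> {
        void save(String id, T entity);
        Optional<T> findById(String id);
    }

    // Stand-in backend, purely for the sketch.
    static class InMemoryStore<T> implements Store<T> {
        private final Map<String, T> data = new HashMap<>();

        public void save(String id, T entity) {
            data.put(id, entity);
        }

        public Optional<T> findById(String id) {
            return Optional.ofNullable(data.get(id));
        }
    }

    // The proxy: services hold a reference to this, and the backend can be swapped behind it.
    static class StoreProxy<T> implements Store<T> {
        private Store<T> delegate;

        StoreProxy(Store<T> delegate) {
            this.delegate = delegate;
        }

        void switchTo(Store<T> other) {
            this.delegate = other;
        }

        public void save(String id, T entity) {
            delegate.save(id, entity);
        }

        public Optional<T> findById(String id) {
            return delegate.findById(id);
        }
    }

    public static void main(String[] args) {
        StoreProxy<String> store = new StoreProxy<>(new InMemoryStore<>());
        store.save("letter-1", "Dear Sir");
        System.out.println(store.findById("letter-1").orElse("missing"));
    }
}
```

The point of the indirection is only that the service classes compile against `Store` and never learn which persistence approach sits underneath.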

I built this, and soon saw that RDF repositories, in particular Jena SDB, are slow when used with JenaBean. This boils down to the fact that SPARQL is ultimately translated to SQL, and some SPARQL operations have to be performed client side. When you do this in an Object to RDF fashion, where every RDF statement ends up as a SQL join or independent query, you get a very very chatty storage layer. This isn’t uncommon in ORM land, and lazy loading is used so that, for instance, a property isn’t retrieved until it’s actually needed – eg if a UI action needs to show a particular object property in addition to showing that an object exists. In the SPARQL case, there are more things that need to be done client side, like filtering, and this means that a query may retrieve (lots) more information than it’s actually going to need to create a query solution, and the processing of the solution is going to take place in your application JVM and not in the repository.

I wanted then to see if the performance was significantly better with a local repository (TDB), even if it couldn’t be addressed from multiple application instances, and if Sesame was any better. TDB turned out to be lots faster, but of course you can’t have multiple webapps talking to it unless you address it as a Sparql endpoint – not as an Object in Java code. For Sesame tho, I needed to ditch JenaBean, and luckily, in the time I had been building the application a new Java Object-RDF middleware came out, called Empire-JPA [72].

This allows you to program your application in much the same way as JenaBean – annotations and configuration – but uses the JPA API to persist objects to a variety of backends. So I could mark up my beans with Empire annotations (leaving the JenaBean ones in place) and in theory persist the RDF to TDB, SDB, any of the Sesame backends, FourStore and so on.

The implementation was slowed down because the SDB support wasn’t there, and the TDB support needed some work, but it was easy to work with Mike Grove at ClarkParsia on this, and it was a breath of fresh air to get some good helpful support, an open attitude, and timely responses.

I discovered along the way that I couldn’t start with a JenaBean setup, persist my objects to TDB say, and switch seamlessly to Empire-JPA (or vice versa). It seems that JenaBean persists some configuration statements and these interfere with Empire in some fashion – but this is an unlikely thing to do in production, so I haven’t followed it thru.

Empire is also somewhat slower than JenaBean when it comes to complex object hierarchies, but Mike is working on this, and v 0.7 includes the first tranche of improvements.

Doing things with JPA has the added benefit of giving you the opportunity to revert to RDBMS or to start with RDBMS and try out RDF in parts, or do both. It also means that you have lots of documentation and patterns to follow, and you can work with a J2EE standard which you are familiar with.

In the end, Semantic Repositories aren’t as quick as SQL-RDBMS. But they do make sense if you want RDF storage for some of your data or for a subset of your functionality, a graph based dataset, or a common schema or vocabulary (or parts of one) for you and other departments or companies in your business circle. You also get the distinct advantage of inference for data mining, relationship expressiveness (“similar” or other soft equivalences rather than just “same”) and discovery.

A note about authorization (ACL) and security: None of the repositories I’ve come across have access control capabilities along the lines of what you might see with an RDBMS – grant authorities and restrictions just aren’t there. (OpenVirtuoso may have something as it has a basis in RDBMS (?)).

You might be able to do some query restriction based on graphs by making use of a username, but if you want to, say, make sure that a field containing a social security number is only visible to the owner or application administrator (or some other Role) but not to other users, then you need to do that ACL at the application level. I did this in Spring with Spring Security (Acegi), at the object level. Annotations and AOP can be used to set this up for Roles, controllers, Spring beans (that is, beans under control of a Spring context) or beans dynamically created (eg Domain objects created by controllers). ACL and authentication in Spring depend on a User definition, so I also had to create an implementation that retrieved User objects from the semantic repository, but once that was done, it was an ACL manipulation problem rather than an RDF one.
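
The Spring Security wiring is too long to show here, but the rule it enforces is small. The following is a plain-Java sketch of that rule only, with no Spring involved; `Viewer`, `Profile`, `Role` and `visibleSsn` are hypothetical names for this example, not classes from the application or from Spring.

```java
class FieldAclDemo {
    enum Role { USER, ADMIN }

    // Whoever is asking to see the data.
    static class Viewer {
        final String username;
        final Role role;

        Viewer(String username, Role role) {
            this.username = username;
            this.role = role;
        }
    }

    // A domain object with one sensitive field.
    static class Profile {
        final String owner;  // username of the person the record belongs to
        final String ssn;    // the sensitive field

        Profile(String owner, String ssn) {
            this.owner = owner;
            this.ssn = ssn;
        }
    }

    // Returns the real value only to the owner or an administrator; everyone else gets a mask.
    static String visibleSsn(Profile profile, Viewer viewer) {
        boolean allowed = viewer.role == Role.ADMIN || viewer.username.equals(profile.owner);
        return allowed ? profile.ssn : "***";
    }

    public static void main(String[] args) {
        Profile p = new Profile("alice", "123-45-6789");
        System.out.println(visibleSsn(p, new Viewer("alice", Role.USER)));  // the owner
        System.out.println(visibleSsn(p, new Viewer("bob", Role.USER)));    // another user
        System.out.println(visibleSsn(p, new Viewer("root", Role.ADMIN)));  // an administrator
    }
}
```

Because the check lives in the application, it applies the same way whether the object came from a semantic repository or an RDBMS.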

The result was a success, if you can ignore the large dataset performance concerns. A semantic repository can easily and successfully be used for persistent storage in a Java J2EE application built around DAO, JPA and Service patterns, with enterprise security and access control, while also providing a semantic query capability for advanced and novel information mining, discovery and exploration.

Semantic repositories

This application ultimately needs to be able to support lots of concurrent queries – eg 20+ per second, per instance. Jena uses a Multiple Reader Single Writer approach for this, so should be fine. But with inference things slow down a lot, and memory needs to be available to service concurrent queries and datasets. The Amazon instance I have for now uses a modest 600MB for heap, but with inference could use lots more, and a lot of CPU. Early on I used a 4 year old Dell desktop to run TDB and Joseki, and queries would get lost in it and never return – or so I thought. Moving to a Pentium Duo made things better, but it’s easy to write queries that tie up the whole dataset when you’re not a Sparql expert, and these can in some cases cause the JVM to OoM and/or bomb.

SDB suffers (as mentioned in the previous section) and any general purpose RDBMS hosted semantic repository that has to convert from SPARQL to SQL and back-and-forth will have performance problems. But for this application, TDB currently suffices – I don’t have multiple instances of a Java application, and if I did host the html/js on another instance (a Tomcat cluster say) then it would work perfectly well with Joseki in front of TDB or SDB. On the downside, an alternative to Jena is not a real possibility here as the Sparql in the page code makes heavy use of Jena ARQ extensions for counts and other aggregate functions. Sparql 1.1 specifies these things, so perhaps in future it will be a possibility.

As a real Java web application, one of the primary requirements here is that the repository is addressable using Java code from multiple instances¹. TDB doesn’t allow this because you define it per JVM. Concurrent access leads to unpredictable results, to put it politely. SDB would do it, as the database takes care of the ACIDity, but it’s slow.

I also wanted to be able to demonstrate the application and test performance with RDBMS technology or Semantic Repository, or indeed NoSQL technology. The class hierarchy and componentisation allow this, but at this stage I’ve not tried going back to RDBMS or the NoSQL route. Empire-JPA allows a variety of repositories to be used, and those based on Sesame include OWLIM and BigData, which seem to offer large scale and clustered repository capability. To use AllegroGraph or Rdf2Go would require another implementation of my Persistence Layer, and may require more bean annotations.

So, nothing is perfect, everything is “slow”, but flexibility is available.

1. It might be possible to treat the repository as a remote datasource and use SPARQL Select and Insert/Update queries (and this may be more performant, it turns out), but for this exercise I wanted to stick with tradition and build a J2EE application that didn’t have hard coded queries (or externalised and mapped ones a la iBatis) but that encapsulated the business logic and entities as a bean and service object base.

  • querying sparql end points
  • inference
  • linking data
More important here than in the J2EE webapp, being able to host a dataset on the Linked Data Web with 303 Redirect, permanent urls, slash rather than hash URIs and content negotiation meant that I ended up with Joseki as the Sparql endpoint, and a servlet filter within a base webapp that did the URI rewriting, 303 redirect and content negotiation. Ontology and instance URIs can be serviced by loading the Ontology into the TDB repository. The application is read only, so there’s no need for the Joseki insert/update servlet. I also host an ancillary dataset for townlands so that I can keep it distinct for use with other applications, but federate it in with an ARQ Service keyword. Making links between extracted entities and geoNames, dbPedia and any other dataset is done as a decorator object in the extraction pipeline. Jena’s SPARQL objects are used for this, but in the case of the Geonames webservice, their Java client library is used.
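
The decision at the heart of such a filter can be sketched as a single function. This is only an illustration of the 303-plus-content-negotiation pattern, not the filter used here: the `/id/`, `/data/` and `/page/` path layout is an assumption made for the example, and the Accept handling is deliberately naive (it ignores q-values).

```java
import java.util.Locale;

class LinkedDataRedirect {
    // For an entity URI, returns the path to send back in the Location header of a
    // 303 See Other response; returns null when the request is not for an entity URI.
    static String redirectTarget(String path, String acceptHeader) {
        if (path == null || !path.startsWith("/id/")) {
            return null;  // e.g. the Sparql endpoint itself: pass the request through
        }
        String localPart = path.substring("/id/".length());
        String accept = acceptHeader == null ? "" : acceptHeader.toLowerCase(Locale.ROOT);

        // Naive negotiation: any mention of an RDF media type wins, otherwise serve HTML.
        boolean wantsRdf = accept.contains("application/rdf+xml") || accept.contains("text/turtle");
        return (wantsRdf ? "/data/" : "/page/") + localPart;
    }

    public static void main(String[] args) {
        System.out.println(redirectTarget("/id/townland/42", "application/rdf+xml"));
        System.out.println(redirectTarget("/id/townland/42", "text/html,application/xhtml+xml"));
        System.out.println(redirectTarget("/sparql", "text/html"));
    }
}
```

In a servlet filter, a non-null result would be sent with status 303 and the result as the `Location` header; a null result would simply continue down the filter chain.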

One of the issues here of course is cross-domain scripting. Making client side requests to code from another domain (or making Ajax calls to another domain) isn’t allowed by modern UserAgents unless they support JSONP or CORS. Both require an extra effort on the part of the remote data provider and could do with some seamless support (or acknowledgement at least) from the UI javascript libraries. It happens that Jetty7 has a CORS filter (which I retrofitted to Joseki 3.4.2 [112]). JSONP can be fudged with jQuery it turns out, if the remote dataset provides JSON output – some don’t. The alternative is that anyone wishing to use your dataset on the Linked Open Data web must implement a server side proxy of some kind and (usually) work with RDF/XML. A lot of web developers and mashup artists will baulk at this, but astonishingly, post Web 2.0, JSONP and CORS still seem to be out of the reach of many dataset publishers. Jetty7 with its CORS filter goes a long way to improving this situation, but it would be great to see it in Tomcat too, so that publishers don’t have to implement what is a non-trivial filter (this is a security issue after all), and clients don’t have to revert to server side code (or find/hire/blackmail someone to write it) and another network hop.

Vladimir Dzhuvinov has another CORS filter [111], that adds request-tagging and Access-Control-Expose-Headers in the response.

The only need for a Sparql endpoint here is for debug purposes. You need to be able to see the triples as the repository sees them when you use an ORdfM layer so that you can understand the queries that are generated, why some of your properties aren’t showing up and so on.

For query handling I needed a full featured console that would allow me to use inference (performance permitting) and to render results efficiently. I also needed to be able to federate queries across datasets or endpoints – especially to UMBEL, so that I could offer end users the ability to locate data tagged with an UMBEL URI that were “similar” to one they were interested in (eg sharing a skos:broader statement). Jena provides the best support here in terms of SPARQL extensions, but inference was too slow for me, and I could mimic some of the basic inference with targeted query writing for Sesame. Sesame doesn’t do well with aggregate functions, and inference is per repository and on-write, so you need to adjust how you view the repository compared to how Jena does it. Sesame is faster with an in-memory database.

UI and render

This is an exercise in HTML and Ajax. It’s easy to issue Sparql queries that are generated in Javascript based on what needs to be done, but there’s one for every action on the website, and it’s embedded in the code. That’s not a huge problem given the open nature of the dataset and the limited functionality that’s being offered (the extraction process only deals with a small subset of the available information in the text). jQuery works well with Joseki, local or not [112], so the JSON/JSONP issue didn’t arise for me. Getting a UI based on the Ontology was possible using the jOWL javascript library, but it’s not the prettiest or most intuitive to use. A more sophisticated UI would need lots more work, and someone with an eye for web page design 🙂

Here, the UI is generated with JSP code with embedded JS/Ajax calls back to the API. URLs are mapped to JSP and Role based access control is enforced. Most URLs have to be authenticated. Spring has a Jackson JSON view layer so that the UI could just work with Javascript arrays, but this requires more annotations on the beans for some properties that cause circular references. The UI code is fairly unsophisticated and, for the sake of genericity, it mostly just spits out what is in the array, assuming that the annotations have taken care of most of the filtering, and that the authorization code has done its work and cloaked location, identity and datetime information. The latter works perfectly well, but some beans have properties that a real user wouldn’t be interested in.

Velocity is used in some places when a user sends a message or invitation, but this is done at the object layer.

The UI doesn’t talk Sparql to any endpoint. Sparql queries are generated based on end user actions (the query and reporting console), but this is done at the Java level.

[109] http://www.mindswap.org/~aditkal/SEKE04.pdf
[110] https://uoccou.wordpress.com/wp-admin/post.php?post=241&action=edit
[111] http://blog.dzhuvinov.com/?p=685
[112] https://uoccou.wordpress.com/2010/11/29/cors-servlet-filter/

Final section in Java Semantic Webapps Part 3.1 completed

December 9, 2010 Comments off

I’ve filled out the tools matrix with the 60 or so tools, libraries and frameworks I looked at for the two projects I created. Not all are used of course, and only a few are used in both. Includes comments and opinion, which I used and why, and all referenced. Phew.

CORS Servlet Filter (…and jQuery JSONP tricks)

November 29, 2010 5 comments

I was just about to start writing a CORS[1] servlet filter so that I can move one of my apps onto an independent EC2 host and give it more memory when I came across the CometD project [2] (DOJO event bus in Ajax, interesting in itself), which makes use of Jetty7’s CrossOriginFilter [3].

This seems to do all you need to allow your servlet to interact with cross domain requests and build Javascript RIAs that mash up and link data, semantic or not. The filter allows a list of allowable domains to be set, among other things, so that you can add it to any of your servers, map it to any of your servlets, and give the clients you trust access to your data.
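
The core of the allowed-domains idea is a small decision, sketched here in plain Java. This is not Jetty’s CrossOriginFilter code and covers none of its other features (preflight requests, allowed methods and headers, and so on); `CorsDecision` and the example origins are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CorsDecision {
    private final Set<String> allowedOrigins;

    // Takes the list of origins (scheme + host + optional port) that are trusted.
    CorsDecision(String... origins) {
        this.allowedOrigins = new HashSet<>(Arrays.asList(origins));
    }

    // Returns the value to set on the Access-Control-Allow-Origin response header,
    // or null if the header should be left off.
    String allowOriginHeader(String originRequestHeader) {
        if (originRequestHeader == null) {
            return null;  // no Origin header: not a cross-origin browser request
        }
        return allowedOrigins.contains(originRequestHeader) ? originRequestHeader : null;
    }

    public static void main(String[] args) {
        CorsDecision cors = new CorsDecision("http://app.example.org");
        System.out.println(cors.allowOriginHeader("http://app.example.org"));
        System.out.println(cors.allowOriginHeader("http://other.example.com"));
    }
}
```

A filter built around this would read the `Origin` request header, call `allowOriginHeader`, and add the response header only when the result is non-null.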

Saves me having to write it, and it looked like it was going to be painful to do fully and correctly, so it’s a real relief to see it in Jetty7. All credit to the developers there.

Not sure about the licensing aspects (EPL1 + Apache2) , but you can lift the source, remove the Eclipse logging dependency, and alter as you see fit for your version of servlet engine. I’m trying this now with Tomcat6 and another Jetty6 instance, just as soon as I can get my apps separated and onto different domains (without the filter, a request from localhost to a remote domain using jQuery seems to get thru just fine for some reason)

[1] http://www.w3.org/TR/access-control/
[2] http://cometd.org/
[3] http://download.eclipse.org/jetty/stable-7/apidocs/org/eclipse/jetty/servlets/CrossOriginFilter.html

Separating services onto different hosts on EC2

I wanted to move one of my http services (Joseki) onto a new host so as to be able to give the JVM more memory and avoid EC2 unceremoniously killing it when it asked for too much.  The tomcat service with the webapps would remain put. So I

  • created an AMI from my running instance
  • created a new instance from it
  • reinstalled ddclient because it didn’t seem to work
  • created a new DynDNS account thinking it was tied to my account rather than the host, but that didn’t make any difference
  • checked the ddclient cache file – it seemed to have the right ip addresses – ie one for the tomcat services, and another for the joseki host. However dyndns showed that all hostnames were backed by the same ip address. I suspected the cache, so I did a ‘sudo ddclient -force’ and this seems to have updated DynDNS correctly
  • changed my js files so that all sparql would be directed to the new Joseki host, and started testing

Getting JSONP where there is only JSON

Now I expected that things wouldn’t work – Joseki is on a different host than where the js files have been loaded from – making an Ajax call there shouldn’t work, should it – unless I was using jsonp – but Joseki doesn’t do jsonp!

So, I checked my code, and it is making JSONP calls – I’m doing a jQuery $.ajax call like this

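// remoteurl and callback are defined elsewhere in the page
var options = {
    type: "GET", url: remoteurl, success: callback,
    timeout: 300,       // milliseconds, as a number
    dataType: "jsonp"   // jQuery injects a <script> tag instead of using XMLHttpRequest
};
var resp;
try {
    // catches only errors thrown while setting up the request, not later network failures
    resp = $.ajax( options );
} catch (failed) {
    alert("Remote call failed : " + failed);
}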

So, what’s going on? Have I not understood this whole cross-domain thing [4], or is jQuery doing something strange?

Well, it turns out that I’ve forgotten a trick I used to get this to work before: what’s actually going on is this

  • jQuery, rather than using XMLHttpRequest, is making a DOM call to insert a <script> tag (which can make cross domain calls), because I’ve specified dataType:"jsonp".
  • The url for the script (specified in url:remoteurl) uses my new hostname – but – happens to include “&output=json” and the necessary SPARQL of course. [5]
  • The script tag gets processed making the GET call to the remote URL, the sparql runs on the remote/cross-domain server, and the JSON response is processed by the callback specified in the success:callback option
  • if you change it to a json call, dataType:"json", the request is made (and visible in remote access logs) but the response is aborted – I’m not sure if it’s the browser doing this, or jQuery.

So, hey presto, JSONP where the server does not explicitly support it. However, it won’t work with POST of course (script tags can only issue GETs), so for SPARQL update or insert it will be an issue. CORS really should be used here for that…..
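
For what it’s worth, the request URL described in the list above is easy to assemble by hand, which helps when testing an endpoint outside the browser. A small sketch, assuming a Joseki-style endpoint that accepts `output=json` and `query` parameters as in [5]; the endpoint address in the example is a placeholder, and the `callback` parameter is left out because jQuery appends that itself.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SparqlUrl {
    // Builds a GET URL that asks the endpoint to run the query and return JSON.
    static String selectUrl(String endpoint, String sparql) {
        try {
            // URLEncoder produces form encoding: spaces become '+', braces and '?' are escaped.
            return endpoint + "?output=json&query=" + URLEncoder.encode(sparql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = selectUrl("http://example.org:2020/dataset",
                "SELECT * WHERE { ?s ?p ?o } LIMIT 1");
        System.out.println(url);
    }
}
```

Pasting the printed URL into a browser or curl shows exactly what the script tag receives, which makes it easier to tell a server problem from a jQuery one.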

Back to the real topic though, CORS. I installed the code [3], modified with some more debug logging so I could see what was going on. Having changed the client javascript to make a jQuery $.ajax call with dataType:"json" rather than "jsonp", I expected this to work straight out – after all, Gecko on Firefox 3.6 does the hard work with the headers [6] for “simple” requests (no credentials, not a POST, no custom [non http1.1] headers), so jQuery using XMLHttpRequest should be fine – but it was not. This turned out to be a false negative tho, as my broadband provider is being rubbish today, and when I switched to my rubbish 3G dongle it timed out so often that it looked like failure.

Now the strange thing is that when I remove the CORSFilter servlet mapping from web.xml, jQuery still sends the dataType:json request, Joseki receives it, but the response is never processed by the callback. A Mozilla Hacks post [7] says this :

In Firefox 3.5 and Safari 4, a cross-site XMLHttpRequest will not successfully obtain the resource if the server doesn’t provide the appropriate CORS headers (notably the Access-Control-Allow-Origin header) back with the resource, although the request will go through. And in older browsers, an attempt to make a cross-site XMLHttpRequest will simply fail (a request won’t be sent at all).

Reverting to having the filter in place seems to fix things, but I’m still not 100% convinced that things are correct. I suppose the “will not successfully obtain the resource” is vague enough to be an acceptable explanation for what happens when I don’t have CORSFilter in place, but I would have thought that sending the request and putting load on the network and target server wasn’t something that Mozilla really want to happen.

But when in place, the CORSFilter is getting the Origin header, and setting the Access-Control-Allow-Origin response header, so it’s behaving. I expect it will be different in a range of other browsers (ie Internet Exploder). So for now, it’s not broken, I’m not going to fix it any more. YMMV 🙂

And by the way, a t1.micro on EC2 isn’t really up to it for even a smallish dataset of 340k triples. It copes, just, but you get what you pay for here.

[4] http://en.wikipedia.org/wiki/Same_origin_policy
[5] http://sparql.dyndns-web.com:2020/lewist?output=json&query=PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> … …&callback=jsonp1291128738335
[6] https://developer.mozilla.org/en/HTTP_access_control

[7] http://hacks.mozilla.org/2009/07/cross-site-xmlhttprequest-with-cors/