How to Architect ?
Well – what before how – this is firstly about requirements, and then about treatment.
Linked Open Data app
Create a semantic repository for a read-only dataset with a sparql endpoint for the linked open data web. Create a web application with Ajax and html (no server side code) that makes use of this data and demonstrates linkage to other datasets. Integrate free text search and query capability. Generate a data driven UI from the ontology if possible.
So – a fairly tall order : in summary
- define ontology
- extract entities from digital text and transform to rdf defined by ontology
- create an RDF dataset and host in a repository.
- provide a sparql endpoint
- create a URI namespace and resolution capability. Ensure persistence and decoupling if possible
- provide content negotiation for human and machine addressing
- create a UI with client side code only
- create a text index for keyword search and possibly faceted search, and integrate into the UI alongside query driven interfaces
- link to other datasets – geonames, dbpedia, any others meaningful – demonstrate promise and capability of linkage
- build an ontology driven UI so that a human can navigate data, with appropriate display based on type, and appropriate form to drive exploration
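As a rough illustration of the "transform to rdf" and "link to other datasets" steps in the list above, here is a minimal Python sketch that turns one extracted place name into N-Triples lines. The `BASE` and `ONT` namespaces, the `Place` class and the external link target are all hypothetical placeholders, not the URIs the real dataset uses.

```python
from urllib.parse import quote

# Hypothetical namespaces: neither URI comes from the real dataset.
BASE = "http://example.org/lewis/resource/"
ONT = "http://example.org/lewis/ontology#"

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def escape_literal(text):
    """Escape the characters that N-Triples string literals require."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def place_to_ntriples(name, same_as=()):
    """Return N-Triples lines for one place, plus optional owl:sameAs links."""
    # Mint a URI from the name: lowercase, hyphenated, percent-encoded.
    subject = BASE + quote("-".join(name.lower().split()))
    lines = [
        f"<{subject}> <{RDF_TYPE}> <{ONT}Place> .",
        f'<{subject}> <{RDFS_LABEL}> "{escape_literal(name)}" .',
    ]
    for target in same_as:
        lines.append(f"<{subject}> <{OWL_SAME_AS}> <{target}> .")
    return lines


if __name__ == "__main__":
    # The link target is a placeholder, not a real geonames or dbpedia URI.
    for line in place_to_ntriples("Elphin", ["http://example.org/external/Elphin"]):
        print(line)
```

The output can be loaded into any repository that reads N-Triples. A real pipeline would take the class and property URIs from the ontology rather than hard-coding them.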
Here’s what we end up with:
- UserAgent – a browser navigates to Lewis TDI homepage – http://uoccou.endofinternet.net:8080/resources/sparql – and
- purl.org redirects to dynamic dns which resolves to hosted application – on EC2, or during development to some other server. This means we have permanent URIs with flexible hosting locations, at the expense of some network round trips – YMMV.
- dyndns calls EC2, where a 303 filter intercepts the request and resolves it to a sparql (6) call for html, json or rdf. Pluggable logic for different URIs and/or accept headers means this can be a select, describe, or construct.
- Joseki as a sparql endpoint provides RDF query processing with extensions for freetext search, aggregates, federation, inferencing
- TDB provides a single semantic repository instance (java, persistent, memory mapped) addressable by joseki. For failover or horizontal scaling with multiple sparql endpoints SDB should probably be used. For vertical scaling with TDB – get a bigger machine ! Consider other repository options where physical partitioning, failover/resilience or concurrent webapp instance access is required (ie if you’re building a webapp connected to a repository by code rather than a web page that makes use of a sparql endpoint).
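The 303 step in the flow above can be sketched roughly like this. It is only an illustration of the idea (pick a representation from the Accept header, then redirect to a sparql call), not the actual filter. The `/sparql` path, the `output` parameter name and the format labels are assumptions, so check what your endpoint really accepts.

```python
from urllib.parse import urlencode

# Assumed endpoint path and media-type-to-format mapping (both hypothetical).
SPARQL_ENDPOINT = "/sparql"
SUPPORTED = {
    "text/html": "html",
    "application/json": "json",
    "application/rdf+xml": "rdf",
}


def parse_accept(header):
    """Return the media types in an Accept header, most preferred first."""
    ranges = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media = fields[0].lower()
        if not media:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        if q > 0:  # q=0 means "not acceptable"
            ranges.append((-q, position, media))
    return [media for _, _, media in sorted(ranges)]


def pick_format(accept_header):
    """Pick the first supported format, falling back to html for browsers."""
    for media in parse_accept(accept_header or ""):
        if media in SUPPORTED:
            return SUPPORTED[media]
    return "html"


def resolve(resource_uri, accept_header):
    """Return (status, Location) redirecting a concept URI to a sparql call."""
    query = f"DESCRIBE <{resource_uri}>"
    params = urlencode({"query": query, "output": pick_format(accept_header)})
    return 303, f"{SPARQL_ENDPOINT}?{params}"


if __name__ == "__main__":
    print(resolve("http://example.org/lewis/resource/elphin", "application/rdf+xml"))
```

Swapping the `DESCRIBE` for a `SELECT` or `CONSTRUCT` per URI pattern is where the pluggable logic would go.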
Next article will provide a similar description of the architecture used for the Java web application, with code that is directly connected to a repository rather than one that talks to a sparql endpoint.
You have also decided that you want to make sure it can fulfill all the usual use-cases – performance, concurrent updating, ease of maintenance, basic functional capability (for a business app perhaps) – and, importantly, what’s the benefit: how do you sell this to your manager ? For you it’s cool, it’s being talked about in all the hard to reach places that only you know, and it has more and more promise the more you look into it. You want to do it, but if it doesn’t cut-the-mustard, what’s the point – wait a while, let someone else solve the problems, then come back to it. You can still work in the web app world, doing what you’ve done for the last while, make some money, pay the bills. So, before you even start, the Semantic Web is up against it – it has a lot to prove against a long embedded technology where experience has already paid off and dividends have been reaped. It had better be good……
First question – where to start ? It’s a hard question to answer if you’re not familiar with the Semantic Web, and even if you are – if you can grasp the basic and simple ideas behind it – then you’re still going to have difficulty. And what’s the Linked Open Data web ? How do I use it, why should I bother ? Why would my boss be interested ? How can I make them interested ? How can they make money from it ? Doesn’t this mean my company’s data is going to be available to the public, and my competitors ?
So, over the next while, I’m going to try and relate how and why I did what I did to build two different kinds of web application –
1) a read-only reference data application that makes use of content loaded into a repository and fronted with a sparql endpoint, and talks to geonames, dbpedia, sindice, uberblic and google maps.
2) a “White Label” J2EE Location Based Service that uses JPA behind its DAOs to talk to a semantic repository. It also makes use of OpenID to provide some anonymity, Spring Security to provide security and ACL controlled authorization (all stored in a semantic repository) and integrates the first app using JSONP (Facebook Authentication and OAuth authorization are also on the cards).
It has taken a while, and there have been some good and bad choices, so tune in next time for the first installment in the series, selected from this bunch :
- Target Functionality – what is my application and what parts of the semantic web do I want and/or need to deliver. Does the application want/need to be part of the Linked Open Data web ?
- Selection criteria – technologies, content, delivery
- Available tools and technologies. Can I avoid SOA pitfalls or is this just the same old story ?
- What needs writing ?
- How to architect ?
- Do I need an ontology – how to create ? What’s OWL-DL/Lite/RDFS/DAML/XYZ ?
- Text vs RDF vs XML – Reading, generating, parsing, APIs.
- Size, performance, scale
- Output – files, database, rdf, URIs, linkage, API ?
- Do I need freetext search ? How do I do it ? Why not just use SOLR and faceted search ?
- Content Negotiation – who/what is my audience ? browser, script, machine, API ?
- Mapping – I’ve got location, lets use it – show it and link it
- Ajax, js – can I use semantic web/rdf/rdfa libs in my UI – do I have to do it all ? What help is there ?
- UI – can I build a UI to display and collect queries (forms) using my ontology ? How do I allow a human to easily navigate through my structured, but infinitely (hey, what about closure ?) graphable set of nodes ?
Which do you want to hear about first ?
Like I said, I built two web applications for the Semantic and Linked Open Data web. The articles following this introduction will talk about both in parallel.
Almost there. Have to do a run over Lewis and find which records I can’t match to DBpedia or GeoNames and create manual links, then work out how to get it into TDB (or perhaps SDB). Wonder how the mapping for 303 Concept-to-Document is going to work.
Will then get the semantic pedants to have a look-see, before a friends-and-family and a general announcement. Hope my hardware can cope.
Have to track access and usage as well. Need an open source analytics package.
And then I can go back to NLP and ML. Will contact Getty about the AAT thesaurus.
Getting closer. Integrated some fixed dbpedia lod links, now looking at rest of toplevel info in repository items for possible ontology matching and lod links. Will also need a static decorator for dc publish info, and perhaps foaf links to here.
Getty AAT thesaurus looks like the job for some of the other information yet to conceptize, but there are licensing costs. Need to consider some more.
Drupal7 looks interesting, esp for up-and-running publishing needs – might be a quick way to build an app on top of LewisT. Shame (for me) it’s in PHP!
Got to get the hosting machine set up some more: damned intel i845 graphics chip causing all kinds of issues with package installation for me. I might need something more powerful as well. We’ll see.
- not enough time in the day !
- Drupal7 vs ?
- Calendaring, Todo, fix SCM
- All things RDFized
- dc into lewist
- foaf into lewist
- more geonames into lewist
- publish and be damned !
- Getty – AAT
- domainnames to set up
- machine setup
- openVirtuoso, pubby, sesame ? jena repos with ARQ ?
- HI2111 – elphin, dunraven, newspaper.
Links to geonames in beta. Now for dbpedia, but having trouble with http 404s and the sparql endpoint….
Have to add a proxy host and port as system properties before calling. Not sure why, haven’t had to do this before, perhaps because Jena is casting to HttpURLConnection rather than plain URLConnection.
Turns out the hostname in the url I was using was “wrong”.
ie. it should be all lowercase.
Now just have to get them to increase their mem allocation so my Sparql query does not cause an out of memory error….
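For what it’s worth, the two lessons here (lowercase the hostname, and keep the result set small) can be guarded against with a couple of helper functions. This is a Python sketch rather than the Jena code I was actually running. The function names, the label lookup query and the `LIMIT` are illustrative assumptions, and a `LIMIT` may or may not be enough to avoid the memory problem on the remote side.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def lowercase_host(url):
    """Return the url with its host part lowercased (assumes no user:pass@ prefix)."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))


def build_query_url(endpoint, label, limit=10):
    """Build a GET url for a small label lookup against a sparql endpoint."""
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    query = (
        f"SELECT ?s WHERE {{ ?s <{RDFS_LABEL}> "
        f'"{safe_label}"@en }} LIMIT {int(limit)}'
    )
    return lowercase_host(endpoint) + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Mixed-case host on purpose, to show the normalisation.
    print(build_query_url("http://DBpedia.org/sparql", "Elphin", limit=5))
```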