Archive

Posts Tagged ‘history’

Java Semantic & Linked Open Data webapps – Part 5.1

January 18, 2011 1 comment

How to Architect?

Well – what before how – this is firstly about requirements, and then about their treatment.

Linked Open Data app

Create a semantic repository for a read only dataset with a sparql endpoint for the linked open data web. Create a web application with Ajax and html (no server side code) that makes use of this data and demonstrates linkage to other datasets. Integrate free text search and query capability. Generate a data driven UI from ontology if possible.

So – a fairly tall order. In summary:

  • define ontology
  • extract entities from digital text and transform to rdf defined by the ontology
  • create an RDF dataset and host in a repository.
  • provide a sparql endpoint
  • create a URI namespace and resolution capability; ensure persistence and decoupling where possible
  • provide content negotiation for human and machine addressing
  • create a UI with client side code only
  • create a text index for keyword search and possibly faceted search, and integrate into the UI alongside query driven interfaces
  • link to other datasets – geonames, dbpedia, any others meaningful – demonstrate promise and capability of linkage
  • build an ontology driven UI so that a human can navigate data, with appropriate display based on type, and appropriate form to drive exploration

Here’s what we end up with:

Lewis Topographical Dictionary linked data app - system diagram

  1. UserAgent – a browser navigates to Lewis TDI homepage – http://uoccou.endofinternet.net:8080/resources/sparql – and
  2. the webserver (tomcat in fact) returns html and javascript. This is the “application”.
  3. interactions on the webpage invoke javascript that either makes direct calls to Joseki (6) or makes use of permanent URIs (at purl.org) for subject instances from the ontology
  4. purl.org redirects to dynamic dns which resolves to hosted application – on EC2, or during development to some other server. This means we have permanent URIs with flexible hosting locations, at the expense of some network round trips – YMMV.
  5. dyndns calls EC2, where a 303 filter intercepts the request and resolves it to a sparql (6) call for html, json or rdf. Pluggable logic for different URIs and/or accept headers means this can be a select, describe, or construct (a minimal sketch of such a filter follows this list).
  6. Joseki as a sparql endpoint provides RDF query processing with extensions for freetext search, aggregates, federation, inferencing
  7. TDB provides a single semantic repository instance (java, persistent, memory mapped) addressable by Joseki. For failover or horizontal scaling with multiple sparql endpoints, SDB should probably be used. For vertical scaling with TDB – get a bigger machine! Consider other repository options where physical partitioning, failover/resilience or concurrent webapp instance access is required (i.e. if you're building a webapp connected to a repository by code rather than a web page that makes use of a sparql endpoint).
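
To make step 5 concrete, here is a minimal sketch of the kind of 303/content negotiation filter described: a hedged illustration only, with invented endpoint and page paths. The real filter's pluggable logic maps URIs and Accept headers onto select, describe or construct queries.

import java.io.IOException;
import java.net.URLEncoder;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative 303 filter: redirect machine clients to a DESCRIBE query on the sparql
// endpoint and human clients to an HTML view. Paths and the base URI are invented.
public class See303Filter implements Filter {

    private static final String SPARQL_ENDPOINT = "/joseki/sparql";   // hypothetical
    private static final String HTML_VIEW = "/resources/page";        // hypothetical
    private static final String BASE_URI = "http://example.org";      // hypothetical

    public void init(FilterConfig config) { }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        String accept = request.getHeader("Accept");
        String resource = request.getRequestURI();          // e.g. /id/place/Athlone

        String target;
        if (accept != null && accept.contains("application/rdf+xml")) {
            // machine client: 303 to a DESCRIBE of the requested resource URI
            target = SPARQL_ENDPOINT + "?query="
                    + URLEncoder.encode("DESCRIBE <" + BASE_URI + resource + ">", "UTF-8");
        } else {
            // human client: 303 to the HTML view of the same resource
            target = HTML_VIEW + resource;
        }
        // a production filter would call chain.doFilter() for requests it does not handle
        response.setStatus(HttpServletResponse.SC_SEE_OTHER);   // 303
        response.setHeader("Location", target);
    }

    public void destroy() { }
}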

The next article will provide a similar description of the architecture used for the Java web application, whose code is directly connected to a repository rather than talking to a sparql endpoint.

Java Semantic & Linked Open Data webapps – Part 4

December 17, 2010 Comments off

What needs writing ?

Now that we have an idea about what tools and technologies are available and the kind of application we want to build we need to start considering architecture and what code we will write around those tools and technologies. The architecture I planned was broadly formed – but not completely – as I went about creating these applications. I was also going to tackle the Linked Open Data webapp first and then do the Semantic Backed J2EE app. I thought MVC first for both, but went in the end with a 2 tier approach for the former, and an n-tier component based approach for the latter. (More about this in the next section). I’m used to the Spring framework, so I thought I’d go with it, and for UI I’d use jQuery and HTML and/or JSP, perhaps Velocity. But nothing was set in stone, and I was going to try and explore and be flexible.

The tools and technologies cover

  • creating an ontology
  • entity extraction
  • RDF generation
  • using RDF with Java
  • Semantic repositories
  • querying sparql end points
    • inference
    • linking data
  • UI and render
What follows compares the two applications category by category: the Linked Open Data webapp and the Semantic Backed J2EE webapp.
creating an ontology

Linked Open Data webapp: The ontology was going to be largely new, as there is not much about that deals with historical content. Some bibliographic ontologies are out there, but this isn't about cataloguing books or chapters – it is about the content within and across the sections of a single book. There are editions for Scotland, Wales and the UK also, so I might get around to doing them at some stage. Some of the content is archaic – measurements are in Old English miles, for instance. Geographic features needed to be described, along with population and natural resources. I wasn't sure if I needed the expressiveness of OWL over RDFS, but thought that if I was going to start something fresh I might as well leave myself open to evolution and expansion – so OWL was the choice. Some editors don't do OWL, and in the end I settled for Protege.

Semantic Backed J2EE webapp: Same thoughts here as for the Linked Data app – why limit myself to RDFS? I can still do RDFS within an OWL ontology. Protege it is.
entity extraction

Linked Open Data webapp: Having played with GATE, OpenNLP, MinorThird and a foray into UIMA, I settled on writing my own code. I needed close connections between my ontology, extracting the entities and generating RDF from those entities – most of these tools don't have this capability out of the box (perhaps they do now, 1 year on) – and I also wanted to minimise the number of independent steps at this point, so that I could avoid writing conversion code and configuring multiple parts in different ways for different environments or OSes. There is also a high barrier to entry and a long learning curve for some of these tools. I had read a lot, enough even, and wanted to get my hands dirty. I decided to build my own, based on grep – as most of these tools use regex at the bottom end and build upon it. It wasn't going to be sophisticated, but it would be agile, best effort, experience based coding I'd be doing, and learning all the way – not a bad approach I think. I'd borrow techniques from the other tools around tokenisation and gazetteering, and if I was lucky, I might be able to use some of the ML libraries (I didn't in the end). So, with the help of Jena, I wrote components for the following (a stripped-down sketch of the extraction step appears after the list):

  • Processing files in directories using “tasks”, outputting to a single file, multiple files, multiple directories, different naming conventions, encoding, different RDF serialisations
  • Splitting a single large file into sections based on a heading style used by the author. This was complicated by page indexing and numbering that used a very similar style, and by variations within sections that meant that end-of-section was hard to find. I got most entries out, but from time to time I find an embedded section within another. These can be treated individually, manually, and reimported into the repository to replace the original and create two in its place
  • Sentence tokenisation – I could have used some code from the available libraries and frameworks here, but it's not too difficult, and when I did eventually compare with the others, I discovered that they also came a cropper in the same areas I did. Some manual corrections are still needed no matter how you do it, so I stuck with my own
  • Running regex patterns, accumulating hits in a cache. A “concept” or entity has a configuration element, and a relationship to other elements (a chain can be created).
    • The configuration marries an "Entity" with a "Tag" (URI). Entities are based on a delimiter or gazetteer.
    • Entities can be combined if they have a grouping characteristic.
    • An Entity can be "required", meaning that unless some "other" token is found in a sentence, the entity won't be matched. This can also be extended to having multiple required or ancillary matches, so that a proportion need to be found (a likelihood measure) before an entity is extracted.
    • Some Entities can be non-matching – they just echo whatever is in the input – good for debug, and for itemising raw content – I use this for echoing the sentences in the section that I'm looking at – the output appears alongside the extracted entities.
    • The Required characteristic can also be used with Gazetteer based greps.
    • Entities have names that are used to match to Tags
  • Creating a Jena Model and adding those entities based on a configured mapping to an ontology element (URI, namespace, nested relationship, quantification (single or list, list type))
  • Outputting a file or appending to a file, with a configured serialisation scheme (xml/ttl/n3/…)
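
As a flavour of the above (and only a flavour, since the real pipeline is configuration driven), here is a stripped-down sketch of one regex "Entity" with a required token, written out through Jena. The namespace, property names and URI scheme are invented, and the package names are the Jena 2.x ones of the time.

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

// Simplified, hypothetical version of the grep-style extraction: a sentence is only
// treated as a population statement if it contains the "required" token "inhabitants".
public class PopulationExtractor {

    private static final String NS = "http://example.org/lewis/ontology#";   // invented namespace

    public static void main(String[] args) {
        Pattern population = Pattern.compile("(\\d[\\d,]*)\\s+inhabitants");
        String sentence = "ATHLONE, a market town, contains 11,406 inhabitants.";

        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("lewis", NS);
        Property hasPopulation = model.createProperty(NS, "population");

        Matcher m = population.matcher(sentence);
        if (m.find()) {
            // the URI comes from the configured Entity-to-Tag mapping; this one is invented
            Resource place = model.createResource("http://example.org/lewis/id/Athlone");
            place.addProperty(hasPopulation, m.group(1).replace(",", ""));
        }

        // the serialisation scheme is configurable in the real pipeline (xml/ttl/n3/...)
        model.write(System.out, "TURTLE");
    }
}
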
Semantic Backed J2EE webapp: This was a different kind of application – here no data exists at the start, and all of it is created and born digital. No extraction needed.
RDF generation

Linked Open Data webapp: I naively started the RDF generation code as a series of string manipulations and concatenations. I thought I could get away with it, and that it would be speedy! The RDF generation code in Jena didn't seem particularly sophisticated – the parameters are string based in the end, and you have to declare namespaces as strings and so on – so what could possibly go wrong? Well, things got unwieldy, and when I wanted to validate, integrate and reuse this string manipulation code it became tedious and fractious. Configuration was prone to error. Jena at higher stages of processing then needs proper URIs, and other libraries operate on that basis. So, just in time, I switched – luckily I had built the code thinking that I might end up having to alter my URI definition and RDF generation strategy, so it ended up being a discrete replacement – a new interface implementation that I could plug in.
Tags can be one of the following (a rough sketch of these behaviours as a pluggable interface appears after the list):

  • reference – always create the same URI – used with properties mostly – eg rdf:type
  • append – a common and complete base, with just a value appended
  • complex – a base uri, intermediate path, ns prefix, type or subject path, a value URI different from the containing element
  • lookup – based on entity value, return a particular URI – like a reverse gazetteer
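
The following is a rough illustration, with invented names rather than the project's code, of how those behaviours can sit behind one pluggable interface. The "complex" variant, which combines base URI, intermediate path, prefix and so on, is omitted for brevity.

// Illustrative only: each Tag behaviour answers "what URI do I emit for this entity value?"
public interface Tag {
    String uriFor(String entityValue);
}

// "reference" - always the same URI, e.g. rdf:type.
class ReferenceTag implements Tag {
    private final String uri;
    ReferenceTag(String uri) { this.uri = uri; }
    public String uriFor(String entityValue) { return uri; }
}

// "append" - a common and complete base with the value appended.
class AppendTag implements Tag {
    private final String base;
    AppendTag(String base) { this.base = base; }
    public String uriFor(String entityValue) { return base + entityValue; }
}

// "lookup" - a reverse gazetteer from entity value to a known URI.
class LookupTag implements Tag {
    private final java.util.Map<String, String> table;
    LookupTag(java.util.Map<String, String> table) { this.table = table; }
    public String uriFor(String entityValue) { return table.get(entityValue); }
}
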
Semantic Backed J2EE webapp: Here, RDF generation isn't driven by extraction or preexisting entities, but by the Object model I used. See the next row for details.
Using RDF with Java

Linked Open Data webapp: Fairly early on I settled on Jena as opposed to Sesame. There are some notes I found comparing Jena to Sesame [1], but some of the arguments didn't mean anything to me in the early stages. There wasn't much between them, I thought, but the Jena mailing list seemed a bit more active, and I noted Andy Seaborne's name on the Sparql working group [2]. Both are fully featured with Sparql endpoints, repositories, text search and so on, but take different approaches [3]. Since then I've learned a lot of course, and I've compiled my own comparison matrix [110]. So – I went for Jena, and I probably will in other cases too, but Sesame may suit things better in others.

While Jena is Object oriented, working with it is based on RDF rather than objects. So if you have a class with properties – a bean – you have to create a Model and the Subject, and add the properties and their values, along with the URIs and namespaces they should be serialised with. You cannot hand Jena a bean and say "give me the RDF for that object".
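
For example, serialising even a trivial bean means spelling the statements out by hand, along these lines (namespace and URI invented):

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDF;

// A plain bean...
class Place {
    String uri = "http://example.org/lewis/id/Athlone";   // invented URI scheme
    String name = "Athlone";
}

public class BeanToRdfByHand {
    public static void main(String[] args) {
        Place p = new Place();
        String ns = "http://example.org/lewis/ontology#";  // invented namespace

        // ...has to be turned into statements explicitly: model, subject, properties, URIs
        Model model = ModelFactory.createDefaultModel();
        Resource subject = model.createResource(p.uri);
        subject.addProperty(RDF.type, model.createResource(ns + "Place"));
        subject.addProperty(model.createProperty(ns, "placeName"), p.name);

        model.write(System.out, "N3");
    }
}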

For this project that wasn't an issue – I wasn't modelling a class hierarchy; I wanted RDF from text, and then to be able to query it, and perhaps use inference. Being able to talk to Sparql endpoints and manipulate RDF was more important than modelling an Object hierarchy.

1. http://www.openrdf.org/forum/mvnforum/viewthread?thread=2043#7470
2. http://www.w3.org/2009/sparql/wiki/User:Andy_Seaborne
3. They're different because they can be – this isn't like programming against a standard like JDBC; there isn't a standard way of modelling RDF in Java or as an Object – there are domain differences that may well make that impossible in its entirety. Multiple inheritance, restrictions and the Open World Assumption make for mismatches. Prolog and LISP may be different or more suited here, or perhaps some other language.

Semantic Backed J2EE webapp: Here I needed to be able to maintain parallel worlds – an Object base with a completely equivalent RDF representation. And I wanted to be able to program this from an enterprise Java developer's perspective, rather than that of a logician or information analyst. How do I most easily get from Object to RDF without having to code for each triple combination [109]? Well, it turns out there are two choices, and I ended up using one and then the other. It was also conceivable that I might not be able to do what I wanted, or that it wouldn't perform – I saw the impact of inference on query performance in the Linked Data application – so I wanted to code the app so that it would be decoupled from the persistence mechanism. I also needed to exert authorization control – could I do this with RDF?

  • Java-RDF – I stuck with Jena – why give up a good thing ?
  • Object-RDF – Jena has 2 possibilities – JenaBean and Jastor. I settled for JenaBean as it seemed to have support and wasn't about static class generation. It allows you to annotate your javabeans with URI and property assertions so that a layer of code can create the RDF for you dynamically, and then do the reverse when you want to query.
  • Ad hoc Sparql – the libraries work OK when you are asking for Objects by ID, but if you want Objects that have certain property values or conditions then you need to write Sparql and submit that to the library.

So, I could build my app in an MVC style, and treat the domain objects much like I would if I used Hibernate or JDO, say. In addition, I could put in a proxy layer so that the services weren't concerned about which persistence approach I took – if I wanted, I could revert to traditional RDBMS persistence. So I could have View code, controllers, domain objects (DAO), service classes, and a persistence layer consisting of a proxy and an Object-to-RDF implementation.
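
Schematically, with invented names and the method bodies elided, the layering looks something like this:

import java.util.Collections;
import java.util.List;

// Domain object: unaware of how it is stored.
class User {
    private String id;
    private String openId;
    // getters/setters omitted
}

// The persistence contract the service layer codes against.
interface UserDao {
    User findById(String id);
    List<User> findByTag(String tagUri);
    void save(User user);
}

// One implementation backed by an Object-RDF layer (JenaBean, Empire, ...).
class RdfUserDao implements UserDao {
    public User findById(String id) { return null; }                                // delegate to the O/RDF library
    public List<User> findByTag(String tagUri) { return Collections.emptyList(); }  // ad hoc SPARQL via the library
    public void save(User user) { }                                                 // annotate-and-persist
}

// A drop-in alternative if the RDF route proves too slow.
class JdbcUserDao implements UserDao {
    public User findById(String id) { return null; }
    public List<User> findByTag(String tagUri) { return Collections.emptyList(); }
    public void save(User user) { }
}

// The service layer only sees the interface; Spring wires in whichever DAO is configured.
class UserService {
    private final UserDao dao;
    UserService(UserDao dao) { this.dao = dao; }
    public User lookup(String id) { return dao.findById(id); }
}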

I built this, and soon saw that RDF repositories, in particular Jena SDB, are slow when used with JenaBean. This boils down to the fact that SPARQL is ultimately translated to SQL, and some SPARQL operations have to be performed client side. When you do this in an Object-to-RDF fashion, where every RDF statement ends up as a SQL join or independent query, you get a very, very chatty storage layer. This isn't uncommon in ORM land, and lazy loading is used so that, for instance, a property isn't retrieved until it's actually needed – eg if a UI action needs to show a particular object property in addition to showing that an object exists. In the SPARQL case, there are more things that need to be done client side, like filtering, and this means that a query may retrieve (lots) more information than it actually needs to create a query solution, and the processing of the solution takes place in your application JVM and not in the repository.

I wanted then to see if the performance was significantly better with a local repository (TDB), even if it couldn't be addressed from multiple application instances, and if Sesame was any better. TDB turned out to be lots faster, but of course you can't have multiple webapps talking to it unless you address it as a Sparql endpoint – not as an Object in Java code. For Sesame though, I needed to ditch JenaBean, and luckily, in the time I had been building the application a new Java Object-RDF middleware came out, called Empire-JPA [72].

This allows you to program your application in much the same way as JenaBean – annotations and configuration – but uses the JPA API to persist objects to a variety of backends. So I could mark up my beans with Empire annotations (leaving the JenaBean ones in place) and in theory persist the RDF to TDB, SDB, any of the Sesame backends, FourStore and so on.
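
The attraction is that the calling code is just the standard javax.persistence API. In a sketch like the one below (the persistence-unit name and bean are invented, and the provider-specific annotations are omitted), nothing in the code says whether Empire or a relational provider sits underneath.

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

// Invented entity: a real one needs @Entity plus the provider's RDF mapping annotations.
class LocationPost {
    private String id = "http://example.org/skytwenty/post/1";   // invented, RDF-friendly identifier
    private double latitude;
    private double longitude;
    public String getId() { return id; }
    public void setLatitude(double latitude) { this.latitude = latitude; }
    public void setLongitude(double longitude) { this.longitude = longitude; }
}

public class JpaSketch {
    public static void main(String[] args) {
        // "skytwenty" is an invented persistence-unit name configured in persistence.xml
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("skytwenty");
        EntityManager em = emf.createEntityManager();

        LocationPost post = new LocationPost();
        post.setLatitude(53.42);
        post.setLongitude(-7.94);

        em.getTransaction().begin();
        em.persist(post);            // triples when the provider is an RDF one, rows otherwise
        em.getTransaction().commit();

        LocationPost found = em.find(LocationPost.class, post.getId());
        System.out.println(found != null);

        em.close();
        emf.close();
    }
}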

The implementation was slowed down because the SDB support wasn't there, and the TDB support needed some work, but it was easy to work with Mike Grove at ClarkParsia on this, and it was a breath of fresh air to get good, helpful support, an open attitude, and timely responses.

I discovered along the way that I couldn't start with a JenaBean setup, persist my objects to TDB say, and then switch seamlessly to Empire-JPA (or vice versa). It seems that JenaBean persists some configuration statements and these interfere with Empire in some fashion – but this is an unlikely thing to do in production, so I haven't followed it through.

Empire is also somewhat slower than JenaBean when it comes to complex object hierarchies, but Mike is working on this, and v 0.7 includes the first tranche of improvements.

Doing things with JPA has the added benefit of giving you the opportunity to revert to RDBMS or to start with RDBMS and try out RDF in parts, or do both. It also means that you have lots of documentation and patterns to follow, and you can work with a J2EE standard which you are familiar with.

But in the end, Semantic Repositories aren't as quick as SQL RDBMSs. They are worth it, though, if you want RDF storage for some of your data or a subset of your functionality, a graph based dataset, a common schema and vocabulary (or parts of one) shared with other departments or companies in your business circle, and the distinct advantage of inference for data mining, relationship expressiveness ("similar" or other soft equivalences rather than just "same") and discovery.

A note about authorization (ACL) and security: None of the repositories I’ve come across have access control capabilities along the lines of what you might see with an RDBMS – grant authorities and restrictions just aren’t there. (OpenVirtuoso may have something as it has a basis in RDBMS (?)).

You might be able to do some query restriction based on graphs by making use of a username, but if you want to, say, make sure that a field containing a social security number is only visible to the owner or an application administrator (or some other Role) but not to other users, then you need to do that ACL at the application level. I did this in Spring with Spring Security (Acegi), at the object level. Annotations and AOP can be used to set this up for Roles, controllers, Spring beans (that is, beans under control of a Spring context) or beans created dynamically (eg Domain objects created by controllers). ACL and authentication in Spring depend on a User definition, so I also had to create an implementation that retrieved User objects from the semantic repository, but once that was done, it was an ACL manipulation problem rather than an RDF one.
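
In outline, the repository-backed user lookup and an object-level rule look something like this. Class names and the rule expression are invented, and the real configuration has more to it (a PermissionEvaluator behind hasPermission, for instance).

import java.util.Collections;
import java.util.List;

import org.springframework.security.access.prepost.PreAuthorize;
import org.springframework.security.core.authority.AuthorityUtils;
import org.springframework.security.core.userdetails.User;
import org.springframework.security.core.userdetails.UserDetails;
import org.springframework.security.core.userdetails.UserDetailsService;
import org.springframework.security.core.userdetails.UsernameNotFoundException;

// Loads Spring Security users from the semantic repository rather than a users table.
class RepositoryUserDetailsService implements UserDetailsService {

    public UserDetails loadUserByUsername(String username) throws UsernameNotFoundException {
        String passwordHash = lookupPasswordHash(username);   // placeholder for the repository query
        if (passwordHash == null) {
            throw new UsernameNotFoundException(username);
        }
        return new User(username, passwordHash, true, true, true, true,
                AuthorityUtils.createAuthorityList("ROLE_USER"));
    }

    private String lookupPasswordHash(String username) {
        return null;   // in the real code: fetch the User bean via the Object-RDF persistence layer
    }
}

// With authentication in place, the access rules sit at the service/object level, not in RDF.
class LocationService {

    @PreAuthorize("hasRole('ROLE_PARTNER') or hasPermission(#applicationId, 'READ')")
    public List<String> locationsFor(String applicationId) {
        // query the repository; results are cloaked according to the caller's Role/Blur
        return Collections.emptyList();
    }
}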

The result was a success, if you can ignore the large dataset performance concerns. A semantic repository can easily and successfully be used for persistence storage in a Java J2EE application built around DAO, JPA and Service patterns, with enterprise security and access control, while also providing a semantic query capability for advanced and novel information mining, discovery and exploration.

Semantic repositories

Linked Open Data webapp: This application ultimately needs to be able to support lots of concurrent queries – eg 20+ per second, per instance. Jena uses a Multiple Reader Single Writer approach for this, so it should be fine. But with inference things slow down a lot, and memory needs to be available to service concurrent queries and datasets. The Amazon instance I have for now uses a modest 600MB for heap, but with inference could use lots more, and a lot of CPU. Early on I used a 4 year old Dell desktop to run TDB and Joseki, and queries would get lost in it and never return – or so I thought. Moving to a Pentium Duo made things better, but it's easy to write queries that tie up the whole dataset when you're not a sparql expert, and in some cases they can cause the JVM to OoM and/or bomb. SDB suffers (as mentioned in the previous section), and any general purpose RDBMS-hosted semantic repository that has to convert from SPARQL to SQL and back and forth will have performance problems. But for this application TDB currently suffices – I don't have multiple instances of a Java application, and if I did host the html/js on another instance (a tomcat cluster say) then it would work perfectly well with Joseki in front of TDB or SDB. On the downside, an alternative to Jena is not a real possibility here, as the Sparql in the page code makes heavy use of Jena ARQ extensions for counts and other aggregate functions. Sparql 1.1 specifies these things, so perhaps in future it will be a possibility.

Semantic Backed J2EE webapp: As a real java web application, one of the primary requirements here is that the repository is addressable using java code from multiple instances [1]. TDB doesn't allow this because you define it per JVM. Concurrent access leads to unpredictable results, to put it politely. SDB would do it, as the database takes care of the ACIDity, but it's slow.

I also wanted to be able to demonstrate the application and test performance with RDBMS technology, a Semantic Repository, or indeed NoSQL technology. The class hierarchy and componentisation allow this, but at this stage I've not tried going back to the RDBMS or NoSQL route. Empire-JPA allows a variety of repositories to be used, and those based on Sesame include OWLIM and BigData, which seem to offer large scale and clustered repository capability. To use AllegroGraph or RDF2Go would require another implementation of my Persistence Layer, and may require more bean annotations.

So, nothing is perfect, everything is “slow”, but flexibility is available.

1. It might be possible to treat the repository as a remote datasource and use SPARQL Select and Insert/Update queries (and this may turn out to be more performant), but for this exercise I wanted to stick with tradition and build a J2EE application that didn't have hard coded queries (or externalised and mapped ones a la iBatis) but that encapsulated the business logic and entities as beans and service objects.

  • querying sparql end points
  • inference
  • linking data
Linked Open Data webapp: More important here than in the J2EE webapp: being able to host a dataset on the Linked Data Web with 303 redirects, permanent URLs, slash rather than hash URIs and content negotiation meant that I ended up with Joseki as the Sparql endpoint, and a servlet filter within a base webapp that did the URI rewriting, 303 redirect and content negotiation. Ontology and instance URIs can be serviced by loading the Ontology into the TDB repository. The application is read only, so there's no need for the Joseki insert/update servlet. I also host an ancillary dataset for townlands so that I can keep it distinct for use with other applications, but federate it in with the ARQ SERVICE keyword. Making links between extracted entities and GeoNames, dbPedia and any other dataset is done with a decorator object in the extraction pipeline. Jena's SPARQL objects are used for this, but in the case of the GeoNames web service, their Java client library is used. A sketch of such a remote query follows.
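
As an illustration of the linkage step (the query and endpoint usage here are illustrative, not the application's actual code), ARQ can fire a query at a remote endpoint such as dbPedia directly; the same pattern, or ARQ's SERVICE keyword inside a local query, covers the townlands federation.

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;

// Ask dbpedia for people born in a place already matched to one of our extracted placenames.
public class DbpediaLinker {
    public static void main(String[] args) {
        String query =
            "PREFIX dbo: <http://dbpedia.org/ontology/> " +
            "SELECT ?person WHERE { ?person dbo:birthPlace <http://dbpedia.org/resource/Athlone> } LIMIT 10";

        QueryExecution qe = QueryExecutionFactory.sparqlService("http://dbpedia.org/sparql", query);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.getResource("person").getURI());
            }
        } finally {
            qe.close();
        }
    }
}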

One of the issues here of course is cross-domain scripting. Making client side requests for code from another domain (or making Ajax calls to another domain) isn't allowed by modern UserAgents unless they support JSONP or CORS. Both require an extra effort on the part of the remote data provider, and could do with some seamless support (or acknowledgement at least) from the UI javascript libraries. It happens that Jetty7 has a CORS filter (which I retrofitted to Joseki 3.4.2 [112]). JSONP can be fudged with jQuery, it turns out, if the remote dataset provides JSON output – some don't. The alternative is that anyone wishing to use your dataset on the Linked Open Data web must implement a server side proxy of some kind and (usually) work with RDF/XML. A lot of web developers and mashup artists will baulk at this, but astonishingly, post Web 2.0, these facilities still seem to be out of the reach of many dataset publishers. Jetty7 with its CORS filter goes a long way to improving this situation, but it would be great to see it in Tomcat too, so that publishers don't have to implement what is a non-trivial filter (this is a security issue after all), and clients don't have to revert (or find/hire/blackmail) to server side code and another network hop.

Vladimir Dzhuvinov has another CORS filter [111], that adds request-tagging and Access-Control-Expose-Headers in the response.

Semantic Backed J2EE webapp: The only need for a Sparql endpoint here is for debug purposes. You need to be able to see the triples as the repository sees them when you use an ORdfM (Object-RDF mapping) layer, so that you can understand the queries that are generated, why some of your properties aren't showing up, and so on.

For query handling I needed a full featured console that would allow me inference (performance permitting) and allow me to render results efficiently. I also needed to be able to federate queries across datasets or endpoints – especially to UMBEL, so that I could offer end users the ability to locate data tagged with an UMBEL URI that was "similar" to one they were interested in (eg sharing a skos:broader statement). Jena provides the best support here in terms of SPARQL extensions, but inference was too slow for me, and I could mimic some of the basic inference with targeted query writing for Sesame. Sesame doesn't do well with aggregate functions, and inference is per repository and on-write, so you need to adjust how you view the repository compared to how Jena does it. Sesame is faster with an in-memory database.
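
For comparison, here is the Sesame 2 side of the same coin: an in-memory store and a SPARQL tuple query. The query is illustrative; wrapping the MemoryStore in Sesame's ForwardChainingRDFSInferencer is the usual way to get the per-repository, on-write RDFS inference mentioned above.

import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.memory.MemoryStore;

public class SesameMemorySketch {
    public static void main(String[] args) throws Exception {
        // new SailRepository(new ForwardChainingRDFSInferencer(new MemoryStore())) adds RDFS inference at write time
        Repository repo = new SailRepository(new MemoryStore());
        repo.initialize();

        RepositoryConnection con = repo.getConnection();
        try {
            TupleQuery query = con.prepareTupleQuery(QueryLanguage.SPARQL,
                    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10");
            TupleQueryResult result = query.evaluate();
            while (result.hasNext()) {
                System.out.println(result.next());    // a BindingSet per solution
            }
            result.close();
        } finally {
            con.close();
        }
    }
}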

UI and render

Linked Open Data webapp: This is an exercise in HTML and Ajax. It's easy to issue Sparql queries that are generated in Javascript based on what needs to be done, but there's one for every action on the website, and it's embedded in the code. That's not a huge problem given the open nature of the dataset and the limited functionality that's being offered (the extraction process only deals with a small subset of the available information in the text). jQuery works well with Joseki, local or not [112], so the JSON/JSONP issue didn't arise for me. Getting a UI based on the Ontology was possible using the jOWL javascript library, but it's not the prettiest or most intuitive to use. A more sophisticated UI would need lots more work, and someone with an eye for web page design 🙂

Semantic Backed J2EE webapp: Here, the UI is generated with JSP code with embedded JS/Ajax calls back to the API. URLs are mapped to JSPs, and Role based access control is enforced. Most URLs have to be authenticated. Spring has a Jackson JSON view layer so that the UI can just work with Javascript arrays, but this requires more annotations on the beans for some properties that cause circular references. The UI code is fairly unsophisticated and, for the sake of genericity, it mostly just spits out what is in the array, assuming that the annotations have taken care of most of the filtering, and that the authorization code has done its work and cloaked location, identity and datetime information. The latter works perfectly well, but some beans have properties that a real user wouldn't be interested in.

Velocity is used in some places when a user sends a message or invitation, but this is done at the object layer.

The UI doesn't talk Sparql to any endpoint. Sparql queries are generated based on end user actions (the query and reporting console), but this is done at the Java level.

[109] http://www.mindswap.org/~aditkal/SEKE04.pdf
[110] https://uoccou.wordpress.com/wp-admin/post.php?post=241&action=edit
[111] http://blog.dzhuvinov.com/?p=685
[112] https://uoccou.wordpress.com/2010/11/29/cors-servlet-filter/

Java Semantic Web & Linked Open Data webapps – Part 2

November 24, 2010 Comments off

Selection criteria – technologies, content, delivery

For both applications different technologies are required – they have different needs and outputs, as described in the first part of this series. One is aimed at the Linked Open Data web, the other at exploring the possibilities of using Semantic Web technologies in a J2EE stack, in particular within a Location Based Service. Neither of these applications is an RDFa publishing mechanism or a concept extractor like OpenCalais.

So what kind of content will the tools I need have to work with, what are the criteria, and how does the final package look ?

Again, the comparison below goes category by category: the Linked Open Data webapp and the Semantic backed J2EE webapp.
Content

Linked Open Data webapp: A PDF of a 19th century gazetteer of Ireland's Civil Parishes [1]. This takes us into the world of digital humanities, history and archive data. But this is about the rich content of the gazetteer, not about describing the gazetteer itself, so it's not a bibliographic application. It takes the form of entries for each of the 3600-odd civil parishes in Ireland. Each entry may consist of information regarding

  • placename and aliases
  • location,
  • relation to other locations – distance, bearing,
    other placenames
  • population – core, borough and rural
  • antiquity and history – through the eyes of the author, and with the tendency of "historians" of this time to ignore social and individual aspects
  • natural resources present in the location
  • days of markets and fairs
  • landscape, features and architecture
  • government and official presence – bridewell, police station, post office
  • agriculture and industry
  • politics – the great houses of the gentry and aristocracy, the members of parliament
  • ecclesiastical matters – Tithes, Glebes and church buildings, from the Church of Ireland/England to the Roman Catholic "chapels"
  • educational facilities for the population

Entries for cities and larger towns are long and wavering in their descriptions, while smaller parishes or those known by common names may simply be entries that say "SEE OtherPlaceName". Each entry starts with a capitalised placename, which is then followed by freetext sentences, using 19th century vocabulary and phraseology. There are no subject headings or breaks within entries. Pages are a 2 column layout with page numbers at the bottom and 3 letter index headings on each page – eg "301 BRI" or "199 ATH" – or at least this is how it looks when the text is extracted from the PDF. So, in summary, the content is freetext, but does have some structure, although it's not possible to tell until the content is read. There are variations used throughout, capitalisation can differ from entry to entry, and there are myriad OCR errors.

Semantic backed J2EE webapp: The idea here is that a user might be able to continually and over time record their latitude and longitude, and tag their profile and their locations with a structured vocabulary. To make it useful, allow others to see that info if I invite them to, and allow commercial applications access to my data if I sign up for it. Obviously there are issues to do with privacy, access control and data protection here, not just at an end-user level but also at a commercial level. And if we're going to do it with semantic technology then it needs to justify itself. Oh, and I almost forgot, it needs to be highly performant too. So – it's simple really, right? A user database, login, authorization, post some coordinates, create groups and "friends", report on them, maintain privacy, show some real benefit from choosing Semantic technologies. What could be easier!

So, what kind of content do we have to look at:

  • User profile information – but we want to minimise any identification details, preferably retaining none, not even an email address
  • Roles – User, Administrator, Partner, Root. An admin is a user who owns an Application and can see Location data for all users in the application. A Partner is a repository wide user who can
    see locations in all applications, but cannot identify users or applications by name.
  • Blur – a representation of the degree of fuzziness applied to an identifying entity, to be used with Roles, ACLs or permissions to diminish the accuracy of the information reported to a
    user or machine using the system
  • Location information – lat and long, but also related to Application or Group, User
  • Application profile – we want 3rd parties to be able to create an "application" with some valuable proprietary content that builds on an anonymous individual's location history,
  • Groups – like applications but not commercially orientated. These are for individuals who know each other – friends – or who share an interest – eg a Football Club. They may have an off
    domain web application of their own.
  • Date and time – more than one user might post the same location at the same time, and we want to have a history trail of
    locations for users
  • Devices – individuals may use applications that reside on a mobile device or a web page perhaps, to post location information. They may have multiple devices. Multiple users might share
    the same device.
  • Interests/Tags – what people are interested in, how they categorise themselves, what they think of when they're at a location
  • Permissions – read,write,identify,all and so on – the degree to which a role or user can perform some operation
  • Platforms – the operating system or software that a device runs on
  • Status – Active, Deleted, Archived etc – flags to signify the state of an entity in the system and around which access rules may be tied
  • UserAgent – as well as the Platform and Device, we may want to record the agent that posted a location or a tag
  • Query – commercial applications need to be able to “can” a query that interests them so they can run it again and again, or even schedule it
  • Schedule – a cron-like entity that assumes an owner's identity when run, to query available and accessible data and then perform a Task
  • Task – a coarse entity that encapsulates a query and some action that happens when the query is run and either "succeeds" or "fails"
Technology questions

Linked Open Data webapp: The webapp will consist of a SPARQL endpoint and some HTML with Javascript Ajax calls. This is 2-tier rather than 3-tier architecture, because the SPARQL endpoint is simply a means to pass queries to the repository using http. For an application where developers are focused on UI and dynamic linkage across the web, where the data is effectively read-only and doesn't have much in the way of surrounding business logic – a typical Mashup – then this kind of architecture gets you there fast. Following the guidelines for building Linked Open Data applications [2,3,4], this breaks down into a number of technology problems.

  • Getting a quality corpus of text to use. This also includes making sure that any licensing and privacy issues are considered.
  • Extracting entities from the text – but this is in itself a series of tasks
    • what entities am I interested in ?
    • how do I define the entity ?
    • is the entity actually a compound of more than one thing – eg a distance of "11 miles" may be an entity that is a string, or a compound that is a number (what kind of number?) and a unit of measurement, "miles". Are the miles in this 1842 corpus the same as the ones used today ?
    • do I need to bundle each entry's entities into a single blob of RDF ? this is as much a question of what RDF is as of how you go about developing, debugging and staging the content
    • how are entities related to other content in an entry in the corpus, and to other entries in the corpus
    • is the quality and structure of the content sufficiently consistent to support automated techniques, available technology and time constraints ?
    • can I treat the corpus as a series of text fields, or do I need to consider it as a set of concepts ?
    • for either approach, what technologies are available and what are their limitations ? are they current, documented and actively supported ? do they do what they say they do ? how much effort is needed to learn them ? Will I need to write my own code even if I do use them ? Are there examples of code elsewhere ? Do they have dependencies that aren't compatible with other technologies I use ? Are there licensing issues and costs if my application becomes commercial ? Once I start using them, how long will it take to get usable output ?
  • Once an entity is identified, transforming that into an RDF representation – what URI and tags to use (this relates to the ontology design), what serialisation format (if any) is best to work with – xml/ttl/n3/direct-load. How do I design my URIs ("hash or slash"?) [5] ? Do I need to use or be aware of other URI schemes for compatibility and reuse [6] ? Should I try and clean the source text first (OCR errors) or rectify this by using RDF "tricks" or "aliases" ?
  • Storing or staging the RDF and loading into a repository – one file, many files (how many?), a database, a filesystem, a semantic repository
  • Building a query front end for the data once it is in the repository – check it's in there, check it's correct, make it available – do I need a 303 redirect ? Do I serve the ontology (TBox) alongside the instances (ABox) ? Do I need inference ? Can I use Tomcat/Jetty/OpenVirtuoso ? What hardware do I need ?
  • How do I make the content searchable ? What is search in this context – is it google-esque keyword lookup (TF/IDF) or is it a query console, or a browse capability – how can I do any of these things ? Do I need a separate search instance/technology or do any of the SemWeb technologies do this as well as store and retrieve RDF ? What am I going to index – text, uris (subjects/properties/objects)
  • Once I have a SPARQL endpoint, how do I build a webapp to talk to it – can I get JSON ? can I get JSONP ? What XML formats are output ? Do I need to make server side calls or can I use Ajax client side calls ? Are there security features required – or do any of the SPARQL endpoint technologies have any security or access control facilities ?
  • What libraries of code are available to make calls to a SPARQL endpoint – are there specialist libraries or do I just treat it as an XML web service ?
  • If I am to link with other SPARQL endpoints – eg dbPedia [7]- how do I do that ? Is it a server side or client side problem ? How do I match URIs or more importantly concepts ?
  • Can I build other datasets from related information later, independently, and then link those to my dataset ?
  • How can I build a UI around RDF – are there conventional ways to render forms or graphs ? Do I need to write code myself or are there “black boxes” that I can make use of to render RDF or forms to capture user input as RDF ?
  • How will machine rather than human access be handled ? Do I need to build an API other than the SPARQL endpoint – is this for other applications, for spiders or robots ? Do I need a client side API ? Do I want to service cross domain calls, eg JSONP/CORS ?
  • Are there any concurrency issues to be aware of – will the data extracted ever be updated once we get it out of the text ?
  • Phasing – Will the extraction be phased, does it take place over time and need different staging and migration strategies ?
  • Will I need to deal with versions of my information ?
  • Do we need backup and resilience at the service and/or data levels ?
  • Can we cluster, can we deploy round-robin, can we separate the display logic hosting from query, and this from the data hosting ?
  • Is performance of this application comparable with a traditionally built application that might make thinly proxied JDBC service calls from Javascript ?
  • Can you really treat a semantic repository like an RDBMS ?

Semantic backed J2EE webapp:

  • Is Sign-in really required – why not just drop a cookie and let people post using some random GUID ?
  • How can I represent my entities if I use RDF ? Can I control the IDs in the system ?
  • Are there any libraries of code that bridge the gap from Object to RDF ? What criteria do I need in selecting one ?
    • Licensing, cost
    • Support, documentation, recency
    • Tie-in to particular technology
    • Standards compliance
    • Performance
    • Configuration ease
    • API design – built to interface, modular, separation of concerns ?
    • Object oriented or RDF leaning ?
  • Can I build my app using common patterns – DAO, service layer, MVC ?
  • What if it all turns out to be wrong and semantic repository technology just doesn't do it ?
  • How do I control anonymity ? I don't want or care to know who people really are, and I want users to be secure knowing that they cannot be found by other users if those users don't already know who they are. Likewise, how do I hide or cloak sensitive information even when a user allows another access to their details ?
  • Is the system performant in read and write ?
  • Does (how does) the system scale with concurrency ?
  • How can I allow cross-domain usage of the information, so that users, affinity groups and commercial 3rd parties create useful or novel applications from the location-content and then
    add-on value with proprietary information and perhaps Linked Open Data ? How do I allow the end user to control that access ?
  • Can a message queue be modelled using RDF ?
  • Can messages be “sent” from one user to another or a user to an application ?
  • Can I ensure that identities cannot be inadvertently explored or discovered ?
  • Are there ontologies available that allow me to model my entities ?
  • What tools can I use to create an ontology, and to reuse others ?
  • Do I need inference ? What advantages does it give me, or the users of the system ?
  • Can I combine those advantages with the application data to deliver a new kind of service ? Does Semantic Web technology deliver on its promise ?
  • How can I allow a user to use a structured vocabulary ?
  • Will I need to partition all this data by user, by application, by date ?
  • Can I host this reliably, efficiently and with performance ? What deployment configuration do I need during development and then when I come to host it ?
  • Do I need a purpose-designed semantic repository or can I use my favourite RDBMS as a storage medium ?
Delivery

Linked Open Data webapp: This is really a question about who my users are, what their requirements and expectations of the application are, and what I want to deliver to them – is it a technology demo or a useful, perhaps even attractive application ? Will it be long lasting – what are my expectations for it ?

For me, this is primarily a technology demonstration, a learning tool and, perhaps, if it turns out to actually work and delivers something over and above what a conventional application might, then I can keep it going and running without it costing too much. Cost is a large consideration –

  • Amazon EC2 will cost roughly €2/day for a micro instance,
  • an ISP might charge €70/year but won't be capable of hosting a large memory JVM and giving me the level of control I need, or
  • self-host at home on an old machine – electricity isn't cheap, will the machine be powerful enough, and what do I do if and when the machine fails ?

The intent in this exercise is to learn as much as possible about the practical sides of building such an application, and to try and do what can be done without it becoming a research project. Along the way there are choices of course, and in the end hindsight and experience will pay rewards.

Semantic backed J2EE webapp: The J2EE application is to deliver a "white-label" (customisable) service for application builders who are interested in making use of crowd sourced location data. Individuals need to be secure in trusting their location, and some aspect of their identity (or a moniker for it), to the system. They need to be sure that when they allow another user to get to it, they cannot be identified from it (or from a sequence of locations or recurring location usage) unless they choose to do so.

The tags used against locations or member profiles need to be usefully queryable – not just by equivalence or presence. Any queries run against the system must be available to non-technicians (not just SPARQL experts), so a usable UI is needed.

The service will not deliver a SPARQL endpoint initially, but may deliver one against a subset of aggregate information over long periods of time. Similarly an API delivered for third party users is only open
to registered users or applications.

Users and groups get restricted volume access, and cannot schedule queries, while applications get unrestricted access to their own data, and query capabilities. Partners will get access to repository wide location data, but cannot see under which application or group the location was posted.

The information, and the intelligence within it, created from users attracted to commercial applications (eg an anonymous FourSquare-type application), is within the realm of the 3rd party delivering that application. This web service only deals with the custodianship of location trails, user declared relationships between users, and the tags assigned by users to them.

Building the application must be successful in terms of raw functionality but also in terms of performance. It must not be tied to a particular semantic technology, and it must be possible to compare with a traditional RDBMS based application. Importantly, the application must demonstrate commercial advantage and a pattern of usage where information is sensitive, but also usefully relatable to the Linked Open Data web.

So – many questions, and many choices to make. Answers to most of these will follow in subsequent articles.

[1] http://www.libraryireland.com/topog/
[2] http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
[3] http://www.w3.org/DesignIssues/LinkedData.html
[4] http://virtuoso.openlinksw.com/whitepapers/VirtDeployingLinkedDataGuide.html
[5] http://www.w3.org/TR/swbp-vocab-pub/
[6] http://www.w3.org/TR/cooluris/
[7] http://wiki.dbpedia.org/Interlinking

Lewis Topographical Dictionary Ireland and SkyTwenty on EC2

November 17, 2010 Comments off

Both applications are now running on Amazon EC2 in a micro instance.

  • AMI 32bit Ubuntu 10.04 (ami-480df921)
  • OpenJDK6
  • Tomcat6
  • MySQL5
  • Joseki 3.4.2
  • Jena 3.6.2
  • Sesame2
  • Empire 0.7
  • DynDNS ddclient (see [1])

Don't try installing sun-java6-jdk, it won't work. You might get it installed if you run the instance as m1.small and do it as the first task on the AMI instance. That didn't suit me, as I discovered too late, and my motivation for wanting to install it turned out to be non-propagation of JAVA_OPTS, not the JDK. See the earlier post on setting up Ubuntu.

  • Lewis Topographical Dictionary of Ireland
    • Javascript/Ajax to sparql endpoint. Speedy.
    • Extraction and RDF generation from unstructured text with custom software.
    • Sparql endpoint on Joseki, with custom content negotiation
    • Ontology for location, roads, related locations, administrative description, natural resources, populations, peerage.
    • Ontology for Peerage – Nobility, Gentry, Commoner.
    • Find locations where peers have more than one seat
    • Did one peer know another, in what locations, degree of separation
    • Linked Open Data connections to dbPedia, GeoNames (uberblic and sindice to come) – find people in dbPedia born in 1842 for your selected location. Map on google maps with geoNames sourced wgs84 lat/long.
  • SkyTwenty
    • Location based service built as a JPA based Enterprise app on a Semantic repo (Sesame native).
    • Spring with SpringSec ACL, OpenID Authorisation.
    • Location and profile tagging with Umbel Subject Concepts.
    • FOAF and SIOC based ontology
    • Semantic query console – “find locations tagged like this”, “find locations posted by people like me”
    • Scheduled queries, with customisable action on success or failure
    • Location sharing and messaging with ACL groups – identity hidden, and location and datetime cloaked to medium accuracy.
    • Commercial apps possible – identity hidden and location and date time cloaked to low accuracy
    • Data mining across all data for aggregate queries – very low accuracy, no app/group/person identifiable
    • To come
      • OpenAuth for application federation,
      • split/dual JPA – to rdbms for typical app behaviour, to semantic repo for query console
      • API documentation

A report on how these were developed and the things learned is planned, warts and all.

[1] http://blog.codesta.com/codesta_weblog/2008/02/amazon-ec2—wh.html – not everything needs to be done, but you'll get the idea. Install ddclient and follow the instructions.

Mobile Device Apps Ranking Sep ’10

September 14, 2010 Comments off

Number of apps in mobile app stores in selected categories, as of Sep 2010

Category iTunes Ovi Android
History 2015 242 1102
Heritage 123 2 18
Tourism 158 14 20
Ireland 51 64 73
Genealogy 43 0 9
Veterans 17 0 20
Archive 81 3 199

North Korean news web site

June 28, 2010 Comments off

Just came across a website called NKNews.org about North Korea. Looks good and seems well put together. We need more sites like this to spread the word.

Time to defect ?

April 23, 2010 Comments off

Now that Ireland is officially a financial basketcase, I thought again about emigrating, like I did in the 80's when Charlie Haughey had the country in a midden. But where to go, and how to register dissatisfaction? Could I emigrate and change my nationality? Or perhaps a more headline grabbing, petulant and downright in-your-face solution would be to defect. Where to?


How unpatriotic is it to leave a country that's been murdered by politicians, greed and "I'm alright Jack"? Would I be better off staying to fix it, even when my taxes go to bail out the criminals who got us into the mess, and who by connection to the government are acting treasonably? Is this seditious?

Categories: history, politics