I am doing some work on a Top Secret Project to demonstrate on the SkyTwenty platform the use of email data (in place of location data).
I am making use of Aperture to crawl an IMAP store, then allow sharing of contact and message information, so that queries can be run to discover:
- who-knows-who in what domain
- how many degrees of separation there are between contacts
- do selected contacts have any connection
- how “well” do they know each other and so on.
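As a rough illustration of the first three questions, here is a minimal in-memory sketch in plain Python. It is not how Aperture or the repository answers them (that would be done with sparql over the crawled RDF); the message tuples and the function names are assumptions made up for this example.

```python
from collections import defaultdict, deque


def build_contact_graph(messages):
    """Build an undirected graph linking each sender to its recipients.

    `messages` is an iterable of (sender, [recipients]) pairs, a made-up
    stand-in for what a crawl of an IMAP store would yield.
    """
    graph = defaultdict(set)
    for sender, recipients in messages:
        for rcpt in recipients:
            if rcpt != sender:
                graph[sender].add(rcpt)
                graph[rcpt].add(sender)
    return graph


def degrees_of_separation(graph, start, goal):
    """Return the fewest hops between two contacts, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def who_knows_who(graph, domain):
    """Map each contact in `domain` to the contacts they know in that domain."""
    suffix = "@" + domain
    return {
        person: sorted(n for n in neighbours if n.endswith(suffix))
        for person, neighbours in graph.items()
        if person.endswith(suffix)
    }
```

A `None` result from `degrees_of_separation` answers the "do selected contacts have any connection" question; how "well" two people know each other would need edge weights (message counts, say), which this sketch leaves out.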
Aperture makes use of the Nepomuk message and desktop ontologies, and they’re fairly extensive, so a graphic helps to understand some of the ontological relationships.
The brilliant Protege4 ontology design tool has plugins for GraphViz and OntoGraf that produce some fairly neat images to visualise ontologies, so here they are. I would like it if there were a way to include object and data properties (by annotation perhaps, will try later) but for now I have compiled a table of the class properties from a crawl and a sparql query I ran against the repository I loaded the data into.
Note that OntoGraf needs the Sun JDK to work, so on Ubuntu, which has the OpenJDK by default, you need to install it and agree to the license terms, then make sure that Protege is using the Sun Java at /usr/lib/jvm/java-6-sun-220.127.116.11 (or whatever version).
These tables are incomplete, and represent the classes and properties from the crawl of my nearly empty inbox. The full set of classes and properties for the Nepomuk ontologies is available on another page on this blog.
How to Architect ?
Well – what before how – this is firstly about requirements, and then about treatment
Linked Open Data app
Create a semantic repository for a read only dataset with a sparql endpoint for the linked open data web. Create a web application with Ajax and html (no server side code) that makes use of this data and demonstrates linkage to other datasets. Integrate free text search and query capability. Generate a data driven UI from ontology if possible.
So – a fairly tall order : in summary
- define ontology
- extract entities from digital text and transform to rdf defined by ontology
- create an RDF dataset and host in a repository.
- provide a sparql endpoint
- create a URI namespace and resolution capability. Ensure persistence and decoupling if possible
- provide content negotiation for human and machine addressing
- create a UI with client side code only
- create a text index for keyword search and possibly faceted search, and integrate into the UI alongside query driven interfaces
- link to other datasets – geonames, dbpedia, any others meaningful – demonstrate promise and capability of linkage
- build an ontology driven UI so that a human can navigate data, with appropriate display based on type, and appropriate form to drive exploration
Here’s what we end up with
- UserAgent – a browser navigates to Lewis TDI homepage – http://uoccou.endofinternet.net:8080/resources/sparql – and
- purl.org redirects to dynamic dns which resolves to hosted application – on EC2, or during development to some other server. This means we have permanent URIs with flexible hosting locations, at the expense of some network round trips – YMMV.
- dyndns calls EC2, where a 303 filter intercepts the request and resolves it to a sparql (6) call for html, json or rdf. Pluggable logic for different URIs and/or accept headers means this can be a select, describe, or construct.
- Joseki as a sparql endpoint provides RDF query processing with extensions for freetext search, aggregates, federation, inferencing
- TDB provides single semantic repository instance (java, persistent, memory mapped) addressable by joseki. For failover or horizontal scaling with multiple sparql endpoints SDB should probably be used. For vertical scaling with TDB – get a bigger machine ! Consider other repository options where physical partitioning, failover/resilience or concurrent webapp instance access is required (ie if you’re building a webapp connected to a repository by code rather than a web page that makes use of a sparql endpoint).
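The 303 step in the list above can be sketched roughly as follows. This is not the filter I wrote (that is Java servlet code); it is a small Python illustration of the idea, and the base URI, the `/sparql` path, the `query`/`output` parameter names and the choice of which media type maps to which query form are all assumptions for the example.

```python
from urllib.parse import urlencode

# Hypothetical values, for illustration only.
BASE_URI = "http://example.org/resource/"
ENDPOINT = "/sparql"

# Supported media types and the (assumed) output names passed to the endpoint.
FORMATS = {
    "text/html": "html",
    "application/rdf+xml": "rdf",
    "application/json": "json",
}


def preferred_type(accept_header):
    """Pick the supported media type with the highest q-value (default html)."""
    best, best_q = "text/html", 0.0
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media = fields[0].lower()
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        if media in FORMATS and q > best_q:
            best, best_q = media, q
    return best


def see_other(resource_id, accept_header):
    """Return (status, headers) redirecting a resource URI to a sparql call."""
    media = preferred_type(accept_header)
    uri = BASE_URI + resource_id
    if media == "application/json":
        # A select suits tabular json results.
        query = "SELECT ?p ?o WHERE { <%s> ?p ?o }" % uri
    else:
        # A describe returns a graph for html or rdf rendering.
        query = "DESCRIBE <%s>" % uri
    params = urlencode({"query": query, "output": FORMATS[media]})
    return 303, {"Location": ENDPOINT + "?" + params, "Vary": "Accept"}
```

Keeping the media-type table and the query choice in plain data and one small function is what makes the logic "pluggable": a different URI pattern or accept header only needs another branch or table entry.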
Next article will provide a similar description of the architecture used for the Java web application, with code that is directly connected to a repository rather than one that talks to a sparql endpoint.
Just got my bill from Amazon for the 2 instances I’m running and find I’ve been charged for 728 hours on one of them – I thought this was supposed to be free for a year ! Reading the small print again (ugh) it seems you are entitled to 750 hours free, but it doesn’t explicitly say per instance. So – it seems it’s per account: you can run as many instances as you like, and use a total of 750 hours across them before you get charged. Then again, I suppose that’s reasonable enough – Amazon wouldn’t want to have every SME in the world running in the cloud for free for a year when it could be getting cash from them, would it ? I must have been in a daze :-)
I’ve filled out the tools matrix with the 60 or so tools, libraries and frameworks I looked at for the two projects I created. Not all are used of course, and only a few are used in both. Includes comments and opinion, which I used and why, and all referenced. Phew.
This is a crucially important aspect in a new and evolving technology domain like the Semantic/Linked-Open-Data web – whether it’s a commercial or FOSS component you are thinking about using.
For commercial tools, many offer free end-user or community licensing, limited by size or frequency of use, but if you plan to take your application to market you may well need to upgrade to a commercial license, and these are often very expensive – a Semantic Web or Knowledge based application based on what might be an essential technology component will surely be seen as a large value-add area for commercial companies. While this is true I believe, and commercial licenses can be justified, some technology offerings have small print that takes you straight to commercial licensing once you go to production. Others have smaller but knobbled versions, while some do have true SME quality licensing. So, watch out, it can be a barrier to entry, and we do need to see Mid-level, SME and Cloud offerings for the success of the pervasive or ubiquitous Semantic Linked Open Data web.
Unfortunately, it seems that many tools and libraries born from academic research or OpenSource endeavours, while available for use, are often not maintained. The author or team moves on, or the tool or library is published but languishes. This ends up with a situation where you may find a tool that does what you need but that has no or poor documentation; no active maintenance; no visible community support forums or user-base; or compatibility problems with other tools, libraries or runtime environments. While that removes many from “production” usage or deployment, they can still be an important learning resource, and a means of comparing more current tools and libraries. I will itemise what I’ve come across below, but make sure you cast your professional eye over any offering – once you know what you are looking for, and what help in tools, libraries and environments you need : hopefully this article and the previous two have helped you in that.
- What does it say it does and does-not do ?
- How old is it ? What are its dependencies ?
- How often is code being updated ?
- Is it written in java/php/perl/.NET/ProLog/Lisp/ ? Does it suit you – does it matter if it’s written in Perl but you’re going to write your app in Java – is what you are going to use it for an independent stage in the production of your application, or are all stages inter-twined ? How much will you have to learn ?
- Who is the author ? What else has he/she/they done ? Are they involved in standards process, coding, design, implementation, community ? Blogs, conferences, presentations ?
- Is there documentation ? A tutorial ? A reference ? Sample Code ? Production applications ?
- Is there a means of contacting the authors, and other users ?
- Are there bugs ? Are there many ? Are they being fixed ?
- What are answers to questions like – simple, helpful, understanding, presumptuous, brick-wall !? One sentence answers or contextualised for audience ?
- What is the user group like – beginner, intermediate, advanced, helpful, broad or narrow base, international, academic, commercial,… ?
- How quickly are questions answered ?
- Does it seem like the tool/library is successfully used by the community, or is it too early to say, or unfit for purpose :-( ?
- Under what licensing is the tool/library made available ?
At the application level, this is how things pan out then.
|Linked Open Data webapp||Semantic backed J2EE webapp|
|Metadata, RDF, OWL||Need to have entries for each location in gazetteer. Need list of those locations. Then need to relate one to another from what text describes about road links, directions and bearing. Need metadata fields for each of those. Will also pull out administrative region type, population information, natural resources, and “House” information – seats of power/peerage/members of parliament. Will need RDF, RDFS, OWL for this, along with metadata from other ontologies. A further dataset later added for townland names – this allows parish descriptions from Lewis to encompass townland divisions, and potential for crossover to more detailed reporting at the time (eg Parliamentary reports)||This application associates a member or person with a list of locations and datetimes. Locations are posted by a device on a platform by a useragent at a datetime, and also associated with an application or group. An application is an anonymous association of people with a webapp page or pages that makes use of locations posted by its members. A group is an association of people who know each other by name/ID/email address and who want to share locations. Application owners cannot see locations or members of other applications unless they own each of the applications. Application owners cannot see with full accuracy the location or datetime information. Group owners can see the location and datetime with more accuracy, but not full accuracy, of their members. A further user type (“Partner”) can see all locations for all groups and applications but cannot see names of groups, applications or people, and has less accuracy on location and datetime. Concept subject tags can be associated with profiles and locations. A query capability is exposed to allow data mining with inference to application owners and partners. Queries can be scheduled and actions performed on “success” or “fail”.
Metadata for people, devices, platforms, datetime, location, tags, applications and groups is required. ACL control based on that metadata is performed, but done so at an application logic level, not at a data level.|
|Artificial intelligence, machine learning, linguistics||Machine learning and linguistic analysis avoided in favour of syntactic a-priori extraction via gazetteer and word list after sentences have been delimited within each delimited location entry or report. Aliases and synonyms added later manually as fixup for OCR errors. Quality restricted by text from PDF and structural artifacts (page headings, numbers), newlines, linefeeds and lack of section headings within locations, location delimiters, and linguistic vagaries of author. Much much more information is available within each entry, but for now the original text is also stored sentence by sentence, with each entry.||None required here as no extraction is performed. Tag words and terms are restricted to those available in Umbel (OpenCyc) and condensed to Umbel Subject Concept URIs, which sparql queries can then make use of for broader, narrower and associative queries. “Find everyone who likes sports who posted a location within 1 mile of here”.|
|Linked Open Data||Location name lookups at extraction time link to WGS84 grid location and ID in geonames, then to dbPedia entry. Former done using traditional web service API, latter by Sparql query. Coverage of about 85% achieved. dbPedia lookup based on name attempted but higher error rate (no or ambiguous hits) and lower coverage found (there are many infobox field variations for same type of information). QA manual/“eyeball” deemed sufficient for expected usage and audience. Link to Dictionaries of Biography for houses, possibly using some form of owl:equivalence of peerage ontology. UI level links to Sindice and Uberblic attempted but cross-domain scripting prohibited. Locations mapped to Google maps – could be migrated to OpenStreetMap (geonames basis). Visualisation possible with Google visualisation or other web tool. Server side proxy created for this, and for further dbPedia integration – this provides example link to “people born before 1842 at this location”.||Links to Umbel are performed at query time based on Umbel Subject concepts applied by members to their profile and location. Umbel vocabulary is currently queried directly at the Structured Dynamics endpoint, but could be loaded into same data repository or a separate but more local repository. Large memory footprint. Federated query capability depends on pluggable persistence technology used in application. Applications built on or off domain are free to make use of owl:sameAs for instance to further link proprietary data with data stored in this system, but need to make that association within their own repository. Links can be made to profile identity (local or OpenID) if known or if user expressly associates (after OAuth verification), to wgs84 location (assuming some proximity calculation), to application or group name (if known).|
|Community & Tools||All opensource tooling required for extraction, repository and application/UI code. (Open public data set, no commercial aspects)
Some components need handwriting – eg content negotiation. Most libraries facilitate rather than fulfill requirements – eg RDF generation and serialization, Ontology creation, code generation. Damn – I have to write code !
NLP and ML too advanced, too manual, too time consuming for a beginner, or a one-person prototyping “team”.
UI from RDF a problematic area – would be good to be able to generate a UI now there’s an ontology, but no more advanced than any UI or Form generation from XML or structured data.
Link generation code largely manual, could do with abstraction and ease of use (but this is complex area !). Lots and lots to learn, active support and experience required. Cross domain scripting a problem for Linked Open Data.||Where open linked data isn’t a primary requirement then most other requirements are met by traditional RDBMS based technology and architecture. Open source can meet all component requirements for now (tech demo)
So, 3-tier MVC architecture, DAO and service objects. Enterprise security and ACL.
RDF access – read and write – libraries available, each with differing features, compliance and performance levels.
Federation poorly supported in repository/RDF access libs – complicated area, but Linked Open Data needs it, and being forced to devolve to large repositories isn’t an attractive option.
No JDBC type access wrappers to semantic repositories. SPARQL young and evolving.
Concurrency and multi-instance access considerations need to be made up front, early in development.
Some library or repository specific ORM type tools, one (I found) JPA based library being developed. Lots and lots to learn, active support and experience required.|
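To make the example query in the table concrete (“Find everyone who likes sports who posted a location within 1 mile of here”), here is a small sketch of the proximity part in plain Python. In the application this is a sparql query over Umbel Subject Concept URIs; the dictionary layout, the plain-string tags and the function names below are assumptions for illustration only.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two WGS84 points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def people_nearby(posts, tag, here, radius_miles=1.0):
    """Names of people with `tag` who posted within `radius_miles` of `here`.

    `posts` is a list of dicts with hypothetical keys:
    person, tags (a set of strings), lat, lon.
    """
    lat, lon = here
    return sorted({
        p["person"]
        for p in posts
        if tag in p["tags"]
        and haversine_miles(lat, lon, p["lat"], p["lon"]) <= radius_miles
    })
```

The broader/narrower matching that Umbel gives (someone tagged “football” counting as liking “sports”) is not covered here; that is exactly the part the subject concept hierarchy is for.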
This is as comprehensive a list as I can come up with based on what I looked at and ended up using (or not). There are many many more for sure, some in Java, others in various other languages. As some of the work types in the text->knowledge progression are often independent, being available in Java may not be important or even a consideration for you. So – look here, there and everywhere. See also Dave Beckett’s list for a great source of information about available tools and technologies.
|Category||Tool||Comment||Linked Open Data webapp||Semantic backed J2EE webapp|
|Extraction||GATE ||IDE for configuration of NLP toolsets and training ML engine. Active user group, but tool UI seemed buggy (q1 2010) and documentation was obtuse – not geared towards those not “in the know” IMO. Still, good, but would need a lot of effort and patience.||X||X|
|OpenCalais ||Commercially oriented business and news online entity extraction and linking. Not suitable for historical archive text, commercial.||X||X|
|RDF generation||nothing||This is part of the transformation of source content to “knowledge”. Once entities are extracted they need to be used in RDF triples – how you go about this depends on your vocabulary and ontology, and it’s up to you to use the RDF-Java-Object frameworks (below) that allow you to create a Subject and add a Property with an Object value. I haven’t found a tool that would allow code to generate RDF from tagged entities say, and it’s likely not reasonable to think in this way – however convenient. How would such a tool know which relationships in an ontology were asserted in the entity set you gave it ? The only way to go about this is to code those things yourself from the knowledge you already have about the information, or what you want to assert, or perhaps, if you are dealing with a database, to use its schema as the basis for a set of asserted statements in RDF – using D2R or Triplify say (do you need inference or not ?). This approach was not used in either of these projects however. Perhaps owl2java might have helped ?||X||X|
|NLP, ML||GATE ||NLP engine from Sheffield University with support for ML – see also Extraction category above. Tried but not used.||X||X|
|OpenNLP[89, 90]||NLP library for tokenization, chunking, parsing and coreference. Simpler than GATE, less documentation, dormant ? Tried but not used.||X||X|
|MinorThird ||Probably more ML than NLP, but with tokenization and extraction capability. Getting long in the tooth, and had some compatibility issues when tested.||X||X|
|UIMA[92, 94,95]||“Unstructured Information Management Architecture”. A full blown framework for NLP and ML – “text mining”, a la GATE. Now in Apache (contributed by IBM). Good documentation, active support and development. Came close to using for Linked Data app but came too late, and seemed large and time consuming to learn (in my timescale). However, for a version2 of the project I would use it, over GATE and the custom code I built – documentation for end user and developer is less assuming than GATE, and there are various plugins available, and as it is modular (so is GATE btw) you can create and add your own discrete code into the UIMA processing pipeline. Still need something to generate RDF based around your ontology and the extracted entities tho…|
|SenseRelate||NLP-Wordnet disambiguation toolkit. Couldn’t see how I would integrate this, or what purpose it would serve for my application, as I was using a-priori knowledge of the text for the Linked Open Data webapp, and the application business logic for the Semantic backed J2EE webapp. Also getting old…||X||X|
|LingPipe ||Very interesting toolkit for NLP, text and document processing, but ultimately with a commercial license||X||X|
|Mallet ||Like LingPipe but opensource, with sequence tagging and topic modelling.||X||X|
|Weka ||Another text mining tool, opensource, good docs, current and maintained, also works with GATE||X||X|
|RDF-Java||OpenJena [59, 65]||Maturing framework for RDF with java. Sparql implementation follows standards closely and previews upcoming versions, as Andy Seaborne is on the SPARQL w3c group. Has repository capability as well. Used in both projects, but in J2EE app was just one of the possibilities for repository integration and RDF capability. Support forum high traffic – popular choice. Expected to provide working code examples when describing problems – discussion not entertained ! HP and now Apache backing. Combined with JenaBean and Empire-JPA in J2EE app. TTL/N3 config may seem alien to java webapp developers.||Y||Y|
|KAON ||Another library – didn’t seem as popular as Jena or Sesame. Documentation ? Old, not actively maintained ?||X||X|
|Sesame ||Modular RDF to java library and repository framework. v3 expected soon (Q1 2011 ?). Good documentation and comment available on and off site, but you still need to experiment. Support forum can be slow and low traffic, but still a popular choice. Also home for Elmo  (an object-RDF extension), and Alibaba  – “the next generation of the Elmo codebase”. Combined with Empire  in J2EE app. TTL/N3 config may seem alien to java webapp developers.||X||Y|
|Object-RDF||JenaBean ||Appears now dormant, but Jena Object library with custom annotations to model and map Java Classes to RDF classes. Support very slow. Low activity.||X||Y|
|Empire-JPA||Aka Empire-RDF. From makers of Pellet . JPA implementation for access to semantic repositories, with adapters for Sesame, Jena, Fourstore . Newish, v0.7 about to be released. Support good, interested, helpful.||X||Y|
|RDF2GO ||Abstraction over repository and triplestores, with Jena, Sesame and OWLIM adapters. Decided in favour of Empire.||X||X|
|Repository and/or database||TDB ||Single instance in memory repository, with cmdline and Jena integration. No clustering, replication capability – must be local to webapp. Configuration can be awkward, imo, but easy enough to get started with. Inferencing and custom ontology support, both at configuration and code levels. Single writer multiple reader. Used in both projects but in J2EE app was just one of possible repository technologies. Memory mapped files in 64bit JVM.||Y||Y|
|SDB ||RDBMS backed repository technology for Jena. External connection handling possible. Single writer multiple reader. Slower than TDB, slow compared to Sesame. In J2EE app was just one of possible repository technologies||X||Y|
|Sesame ||Provides proxy http capability in front of in memory, file based or database backed repositories. Inference by configuration, performed on write – inferred statements are asserted and persisted. Allows for multiple web app instances to make use of any of the repositories. Web based “workbench”. Limited reasoning support compared to Jena. Support forum could be described as “slow”. OntoText backing.||X||Y|
|BigData ||Sesame + Zookeeper + MapReduce based clustered semantic repository for very large datasets. Too big for either app at this stage, but Empire/Sesame usage provides growth path.||X||X|
|AllegroGraph ||Lisp based Semantic Repository with community and commercial licensing options for larger datasets. Http interface – could be used as alternative to Jena/Sesame/Empire. Biggish application and framework to read and learn – too big for now !||X||X|
|OWLIM ||Large scale repository based around Sesame. Reasoning support better than Sesame, and takes alternative approach to implementation compared with Jena say. Community and commercial license. Too big for now !||X||X|
|Fourstore ||Semantic repository written in C. Could be used behind Empire.||X||X|
|Content negotiation||Pubby ||WAR file with configuration (N3) for URI mapping, 303 redirect and many other aspects of Linked Data access – for sparql endpoints that support DESCRIBE. Wrote filter that could sit on remote front end as alternative, but may get used later.||X||X|
|SPARQL access & Endpoint||Joseki ||Sparql endpoint for use with Jena. Needs URL rewriting for PURLs and content negotiation code in front (custom code).||Y||X|
|Link generation||N/A||Use custom code from eg Jena or Sesame to create statements in model – once you’ve designed your URI scheme – and get the code to serialise/materialise the URI for you.||Y||Y|
|Ontologies||Protégé ||IDE to create RDFS and OWL ontologies, with reasoning and visualisation.||Y||Y|
|NeOn Toolkit ||Eclipse based tool suite for semantic apps. Broad scope, protege seemed a better fit – easier and quicker to get to grips with at the time. May be used again tho.||X||X|
|KAON – OI-Modeler ||old. still available ? being maintained ?||X||X|
|m2t4 ||looked promising, simple eclipse plugin, had compatibility and maintenance issues. switched to Protege in the end however.||X||X|
|Inference & Reasoning||Jena ||Jena has built in inference capability, but is considered slower than others. In the J2EE app, with an RDBMS backed repository it was poor, IMO. With a TDB repo it’s better, but still something you really need to have before you would deploy in production. This is probably true of all current repositories, but Jena seems to be at the slow end of the scale. However, it does deliver high standards compliance rather than a “degraded” compliance you may get with others.||Y||Y|
|Sesame||Sesame has “reduced” reasoning support – it can do RDFS based reasoning, and if custom ontologies are added to a repository type with inference support it will make use of them. If a “view” of a dataset is required that doesn’t contain inferred statements, then a query parameter needs to be used so they are filtered out.||X||Y|
|Pellet ||“Independent” inference and reasoning. Not used except as plugin in Protege. Supposedly faster than some others.||X||X|
|OWLIM ||OWLIM comes with its own flavour of inference and reasoning “support for the semantics of RDFS, OWL Horst and OWL 2 RL”||X||X|
|BrownSauce||RDF UI generation that might be possible to plug into servlet code and sparql endpoint. Dormant, unsupported ? Dependency compatibility and documentation issues.||X||X|
|Fenfire ||Visualisation interface – Last update 2008. Seems like research project for developers only||X||X|
|Humboldt||Faceted Browser – not publicly available it seems||X||X|
|ZLinks ||Linked data link generator – general purpose, browser plugin||X||X|
|Facet ||Standalone faceted browser for RDF datasets, prolog based. (Can’t integrate with Java/js ?)||X||X|
|Longwell ||Standalone faceted browser for RDF datasets, fresnel  – dormant ? integratable ? extendable ?||X||X|
|jOWL ||JavaScript lib for owl ontology driven browsing. Last release v1 2009. Low traffic support, but code is accessible and customisable.||Y||X|
|Exhibit [102, 103]||“Publishing” for rdf datasets – looked promising and useful but had compatibility issues iirc, and integration with existing semantic repository wasn’t clear||X||X|
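The “RDF generation” row in the matrix says you end up hand-coding the step from extracted entities to triples. A minimal sketch of what that hand-coding amounts to is below, written in plain Python emitting N-Triples rather than with Jena or Sesame as used in the projects. The namespace, the class and property names (`Place`, `population`, `roadTo`) and the entity dictionary layout are all made up for the example; a real ontology would dictate these.

```python
from urllib.parse import quote

# Hypothetical namespace and vocabulary, for illustration only.
NS = "http://example.org/lewis/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def iri(value):
    """Wrap an absolute IRI in N-Triples angle brackets."""
    return "<" + value + ">"


def place_iri(name):
    """Mint a URI for a place name under the hypothetical namespace."""
    return iri(NS + "place/" + quote(name, safe=""))


def literal(text, datatype=None):
    """Serialise a literal, escaping the characters N-Triples requires."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    suffix = "^^" + iri(datatype) if datatype else ""
    return '"' + escaped + '"' + suffix


def entity_to_triples(entity):
    """Yield the statements we choose to assert about one extracted place."""
    subject = place_iri(entity["name"])
    yield subject, iri(RDF_TYPE), iri(NS + "ontology#Place")
    yield subject, iri(RDFS_LABEL), literal(entity["name"])
    if "population" in entity:
        value = literal(str(int(entity["population"])), XSD_INTEGER)
        yield subject, iri(NS + "ontology#population"), value
    for other in entity.get("roads_to", []):
        yield subject, iri(NS + "ontology#roadTo"), place_iri(other)


def to_ntriples(entities):
    """Serialise all entities as one N-Triples document."""
    lines = [
        "%s %s %s ." % triple
        for entity in entities
        for triple in entity_to_triples(entity)
    ]
    return "\n".join(lines) + "\n"
```

The point the row makes shows up in `entity_to_triples`: which properties get asserted is a decision coded by hand from what you know about the source, not something a generic tool could infer from a bag of tagged entities.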