Posts Tagged ‘dbpedia’

Java Semantic & Linked Open Data webapps – Part 5.1

January 18, 2011 1 comment

How to Architect ?

Well, what before how: this is firstly about requirements, and then about treatment.

Linked Open Data app

Create a semantic repository for a read-only dataset with a SPARQL endpoint for the linked open data web. Create a web application with Ajax and HTML (no server-side code) that makes use of this data and demonstrates linkage to other datasets. Integrate free-text search and query capability. Generate a data-driven UI from the ontology if possible.

So, a fairly tall order. In summary:

  • define ontology
  • extract entities from digital text and transform them to RDF as defined by the ontology
  • create an RDF dataset and host it in a repository
  • provide a SPARQL endpoint
  • create a URI namespace and resolution capability; ensure persistence and decoupling if possible
  • provide content negotiation for human and machine addressing
  • create a UI with client side code only
  • create a text index for keyword search and possibly faceted search, and integrate into the UI alongside query driven interfaces
  • link to other datasets – geonames, dbpedia, any others meaningful – demonstrate promise and capability of linkage
  • build an ontology driven UI so that a human can navigate data, with appropriate display based on type, and appropriate form to drive exploration

Here’s what we end up with:

Lewis Topographical Dictionary linked data app - system diagram

  1. UserAgent: a browser navigates to the Lewis TDI homepage, and
  2. the webserver (Tomcat, in fact) returns HTML and JavaScript. This is the “application”.
  3. Interactions on the webpage invoke JavaScript that either makes direct calls to Joseki (6) or makes use of permanent URIs for subject instances from the ontology.
  4. The permanent URI redirects to dynamic DNS, which resolves to the hosted application on EC2, or during development to some other server. This means we have permanent URIs with flexible hosting locations, at the expense of some network round trips. YMMV.
  5. dyndns calls EC2, where a 303 filter intercepts the request and resolves it to a SPARQL (6) call for HTML, JSON or RDF. Pluggable logic for different URIs and/or Accept headers means this can be a SELECT, DESCRIBE, or CONSTRUCT.
  6. Joseki, as a SPARQL endpoint, provides RDF query processing with extensions for free-text search, aggregates, federation and inferencing.
  7. TDB provides a single semantic repository instance (Java, persistent, memory mapped) addressable by Joseki. For failover or horizontal scaling with multiple SPARQL endpoints, SDB should probably be used. For vertical scaling with TDB, get a bigger machine ! Consider other repository options where physical partitioning, failover/resilience or concurrent webapp instance access is required (i.e. if you’re building a webapp connected to a repository by code rather than a web page that makes use of a SPARQL endpoint).
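The “pluggable logic” in step 5 might look something like the following. This is only a rough sketch, not the filter actually deployed: the class name `ConnegResolver`, the three short format names and the choice of DESCRIBE versus SELECT are all assumptions made for illustration, and q-values in the Accept header are ignored for brevity.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for a 303 filter: maps an Accept header to a format and a query form. */
final class ConnegResolver {

    // Assumed mapping from media type to a short format name; a real filter may support more.
    private static final Map<String, String> FORMATS = Map.of(
            "application/rdf+xml", "rdf",
            "application/json", "json",
            "text/html", "html");

    /** Returns the format for the first recognised media type, defaulting to html. */
    static String formatFor(String acceptHeader) {
        if (acceptHeader != null) {
            for (String part : acceptHeader.split(",")) {
                // Drop any parameters such as ";q=0.9" (q-values are not ranked in this sketch).
                int semi = part.indexOf(';');
                String mediaType = (semi >= 0 ? part.substring(0, semi) : part)
                        .trim().toLowerCase(Locale.ROOT);
                String format = FORMATS.get(mediaType);
                if (format != null) {
                    return format;
                }
            }
        }
        return "html";
    }

    /** Picks a query form: DESCRIBE for machine-readable RDF, SELECT for html and json. */
    static String queryFor(String resourceUri, String format) {
        if (format.equals("rdf")) {
            return "DESCRIBE <" + resourceUri + ">";
        }
        return "SELECT ?p ?o WHERE { <" + resourceUri + "> ?p ?o }";
    }
}
```

A servlet filter would then answer the request for the permanent URI with a 303 See Other whose Location header points at the SPARQL endpoint, carrying the chosen query.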

The next article will provide a similar description of the architecture used for the Java web application, whose code is directly connected to a repository rather than talking to a SPARQL endpoint.

Politicians per capita in EU member states

December 8, 2010 3 comments

I was looking for an interesting query to use as the basis for a quick SPARQL in Drupal7 page, and given Ireland’s “austerity” (aka hairshirt) budget yesterday and our glorious IMF bailout, I was minded to create a query ranking the number of members of the houses of legislature in each country against population.

The query took a long time because it’s very difficult to ascertain the field names from the dbpedia infobox fields, the dbpedia properties and ontology, the variations in field usage per country, and the apparent disconnect (for a human) between a topic and what might be expected as a property. I, for instance, was expecting to be able to see a reference to the house of parliament on each country’s page, but it’s in fact a little more organised than that 🙂

In the end the easiest way to do this is to open up the SPARQL endpoint [1], start from something like a wiki page about a country, then find a URI for the dbpedia resource for that country. I ended up using the Czech Republic URI [2].

After much toing-and-froing I came up with the query below. Strangely, some countries have recorded populations that this query doesn’t find (e.g. Germany: simplify the query and take a look). It’s a bit rough around the edges but the evidence is clear [1] : Ireland is over-represented in legislature.

  • Anyone have any opinions why Ireland is over-represented ?
  • Is the situation similar at local level ?
  • Why does the query not pick up dbpprop:populationEstimate for Germany ?
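PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>

SELECT ?s ?t ?estPop ?estPopYear ?cenPop ?cenPopYear ?house ?n
       ((?estPop / ?n) AS ?estPerCapita) ((?cenPop / ?n) AS ?cenPerCapita)
WHERE {
  ?s ?p <> .
  # ?t is a country filed under category ?s
  ?t skos:subject ?s .
  ?t rdf:type dbpedia-owl:Country .

  # ?house is filed under a sub-category ?b of ?s and has a member count ?n
  ?b skos:broader ?s .
  ?house skos:subject ?b .
  ?house dbpedia-owl:numberOfMembers ?n .

  # population figures (estimate and census) recorded for the country
  ?t dbpprop:populationEstimate ?estPop .
  ?t dbpprop:populationCensus ?cenPop .
  ?t dbpprop:populationEstimateYear ?estPopYear .
  ?t dbpprop:populationCensusYear ?cenPopYear .
}
ORDER BY ?t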
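For anyone wanting to run a query like this from code rather than from the endpoint’s web form, here is a small sketch using the `java.net.http` API (Java 11 or later). The class and method names are made up for illustration. The `query` parameter and the `application/sparql-results+json` media type come from the SPARQL protocol; whether a given endpoint honours the Accept header is something to check.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that prepares SPARQL protocol requests without sending them. */
final class SparqlRequests {

    /** Builds the GET URI used by the SPARQL protocol: endpoint?query=(url-encoded query). */
    static URI queryUri(String endpoint, String sparql) {
        return URI.create(endpoint + "?query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8));
    }

    /** Builds a GET request that asks for SELECT results as JSON via the Accept header. */
    static HttpRequest jsonRequest(String endpoint, String sparql) {
        return HttpRequest.newBuilder(queryUri(endpoint, sparql))
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
    }
}
```

The request could then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, passing for example `http://dbpedia.org/sparql` as the endpoint (assumed here to be the public DBpedia endpoint).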




Java Semantic Web & Linked Open Data webapps – Part 2

November 24, 2010 Comments off

Selection criteria – technologies, content, delivery

The two applications require different technologies: they have different needs and outputs, as described in the first part of this series. One is aimed at the Linked Open Data web, the other at exploring the possibilities of using Semantic Web technologies in a J2EE stack, in particular within a Location Based Service. Neither of these applications is an RDFa publishing mechanism or a concept extractor like OpenCalais.

So what kind of content will the tools I need have to work with, what are the criteria, and how does the final package look ?

Linked Open Data webapp vs. Semantic-backed J2EE webapp

Content

For the Linked Open Data webapp: a PDF of a 19th-century gazetteer of Ireland’s Civil Parishes [1]. This takes us into the world of digital humanities, history and archive data. But this is about the rich content of the gazetteer, not about describing the gazetteer itself, so it’s not a bibliographic application. It takes the form of entries for each of 3600-odd civil parishes in Ireland. Each entry may consist of information regarding:

  • placename and aliases
  • location
  • relation to other locations – distance, bearing, other placenames
  • population – core, borough and rural
  • antiquity and history – through the eyes of the author, with the tendency of “historians” at this time to ignore social and individual aspects
  • natural resources present in the location
  • days of markets and fairs
  • landscape, features and architecture
  • government and official presence – bridewell, police station, post office
  • agriculture and industry
  • politics – the great houses of the gentry and aristocracy, the members of parliament
  • ecclesiastical matters – Tithes, Glebes and church buildings, from the Church of Ireland/England to the Roman Catholic “chapels”
  • educational facilities for the population

Entries for cities and larger towns are long and wavering in their descriptions, while smaller parishes or those known by common names may simply be entries that say “SEE OtherPlaceName”. Each entry starts with a capitalised placename, which is then followed by free-text sentences, using 19th-century vocabulary and phraseology. There are no subject headings or breaks within entries. Pages have a 2-column layout with page numbers at the bottom and 3-letter index headings on each page, e.g. “301 BRI” or “199 ATH”, or at least this is how it looks when the text is extracted from the PDF. So, in summary, the content is free text, but does have some structure, although it’s not possible to tell what that structure is until the content is read. There are variations used throughout, capitalisation can be different from entry to entry, and there are myriad OCR errors.
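To give a feel for what “some structure” buys you, here is a rough first-pass splitter for text extracted from the PDF. It is only a sketch under loud assumptions: that page furniture always looks like “301 BRI”, and that an entry starts with a run of capitals followed by a comma. Given the variations in capitalisation and the OCR errors mentioned above, real input would need something more forgiving. `GazetteerSplitter` and `Entry` are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical first-pass splitter for gazetteer text extracted from the PDF. */
final class GazetteerSplitter {

    /** One gazetteer entry: the capitalised placename and the free text that follows it. */
    static final class Entry {
        final String name;
        final String text;

        Entry(String name, String text) {
            this.name = name;
            this.text = text;
        }
    }

    // Page furniture such as "301 BRI": a page number and a three-letter index heading.
    private static final Pattern PAGE_HEADER = Pattern.compile("^\\d+ [A-Z]{3}$");
    // Assumed entry start: at least two capitals (the placename) followed by a comma.
    private static final Pattern ENTRY_START = Pattern.compile("^([A-Z][A-Z' -]+),\\s*(.*)$");

    static List<Entry> split(String extractedText) {
        List<Entry> entries = new ArrayList<>();
        String name = null;
        StringBuilder text = new StringBuilder();
        for (String line : extractedText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || PAGE_HEADER.matcher(trimmed).matches()) {
                continue; // skip blank lines and page headers
            }
            Matcher m = ENTRY_START.matcher(trimmed);
            if (m.matches()) {
                if (name != null) {
                    entries.add(new Entry(name, text.toString()));
                }
                name = m.group(1);
                text = new StringBuilder(m.group(2));
            } else if (name != null) {
                text.append(' ').append(trimmed); // continuation of the current entry
            }
        }
        if (name != null) {
            entries.add(new Entry(name, text.toString()));
        }
        return entries;
    }
}
```

Cross-reference entries (“SEE OtherPlaceName”) come out as ordinary entries here; turning them into aliases is a later step.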

For the J2EE webapp: the idea here is that a user might be able to record their latitude and longitude continually over time, and tag their profile and their locations with a structured vocabulary. To make it useful, allow others to see that info if the user invites them to, and allow commercial applications access to the data if the user signs up to it. Obviously there are issues to do with privacy, access control and data protection here, not just at an end-user level but also at a commercial level. And if we’re going to do it with semantic technology then it needs to justify itself. Oh, and almost (didn’t) forget, it needs to be highly performant too. So it’s simple really, right ? A user database, login, authorization, post some coordinates, create groups and “friends”, report on them, maintain privacy, show some real benefit from choosing Semantic technologies. What could be easier !

So, what kind of content do we have to look at :

  • User profile information – but we want to minimise any identification details, preferably retaining none, not even an email address
  • Roles – User, Administrator, Partner, Root. An Administrator is a user who owns an Application and can see Location data for all users in the application. A Partner is a repository-wide user who can see locations in all applications, but cannot identify users or applications by name.
  • Blur – a representation of the degree of fuzziness applied to an identifying entity, to be used with Roles, ACLs or permissions to diminish the accuracy of the information reported to a user or machine using the system
  • Location information – lat and long, but also related to Application or Group, User
  • Application profile – we want 3rd parties to be able to create an “application” with some valuable proprietary content that builds on an anonymous individual’s location history
  • Groups – like applications but not commercially orientated. These are for individuals who know each other (friends) or who share an interest, e.g. a Football Club. They may have an off-domain web application of their own.
  • Date and time – more than one user might post the same location at the same time, and we want to have a history trail of locations for users
  • Devices – individuals may use applications that reside on a mobile device or a web page perhaps, to post location information. They may have multiple devices. Multiple users might share the same device.
  • Interests/Tags – what people are interested in, how they categorise themselves, what they think of when they’re at a location
  • Permissions – read, write, identify, all and so on – the degree to which a role or user can perform some operation
  • Platforms – the operating system or software that a device runs on
  • Status – Active, Deleted, Archived etc – flags to signify the state of an entity in the system and around which access rules may be tied
  • UserAgent – as well as the Platform and Device, we may want to record the agent that posted a location or a tag
  • Query – commercial applications need to be able to “can” a query that interests them so they can run it again and again, or even schedule it
  • Schedule – a cron-like entity that assumes an owner’s identity when run to query available and accessible data, then perform a Task
  • Task – a coarse entity that encapsulates a query and some action that happens when the query is run and either “succeeds” or “fails”
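The Blur item in the list above is easiest to picture with coordinates. One very simple way to do it (an assumption for illustration, not necessarily how the service ends up doing it) is to report a location with fewer decimal places depending on who is asking. One decimal place of latitude is roughly 11 km, two is roughly 1.1 km.

```java
/** Hypothetical sketch of "Blur": coarsen a coordinate before reporting it. */
final class Blur {

    /**
     * Rounds a coordinate in decimal degrees to the given number of decimal places.
     * Fewer places means a coarser, more anonymous position.
     */
    static double blur(double degrees, int decimalPlaces) {
        double scale = Math.pow(10, decimalPlaces);
        return Math.round(degrees * scale) / scale;
    }
}
```

A role or permission could then carry its own number of decimal places, so that, say, a Partner sees a coarser position than an Administrator does. That particular mapping is a guess at this stage.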
Technology questions

The Linked Open Data webapp will consist of a SPARQL endpoint and some HTML with JavaScript Ajax calls. This is a 2-tier rather than 3-tier architecture, because the SPARQL endpoint is simply a means to pass queries to the repository using HTTP. For an application where developers are focused on UI and dynamic linkage across the web, where the data is effectively read-only and doesn’t have much in the way of surrounding business logic (a typical Mashup), this kind of architecture gets you there fast. Following guidelines for building Linked Open Data applications [2,3,4], this is broken down into a number of technology problems.

  • Getting a quality corpus of text to use. This also includes making sure that any licensing and privacy issues are considered.
  • Extracting entities from the text – but this is in itself a series of tasks
    • what entities am I interested in ?
    • how do I define the entity ?
    • is the entity actually a compound of more than one thing – e.g. a distance of “11 miles” may be an entity that is a string, or a compound that is a number (what kind of number ?) and a unit of measurement, “miles”. Are the miles in this 1842 corpus the same as the ones used today ?
    • do I need to bundle each entry’s entities into a single blob of RDF ? This is as much a question of what RDF is as of how you go about developing, debugging and staging the content
    • how are entities related to other content in an entry in the corpus, and to other entries in the corpus ?
    • is the quality and structure of the content sufficiently consistent to support automated techniques, available technology and time constraints ?
    • can I treat the corpus as a series of text fields, or do I need to consider it as a set of concepts ?
    • for either approach, what technologies are available and what are their limitations ? Are they current, documented and actively supported ? Do they do what they say they do ? How much effort is needed to learn them ? Will I need to write my own code even if I do use them ? Are there examples of code elsewhere ? Do they have dependencies that aren’t compatible with other technologies I use ? Are there licensing issues and costs if my application becomes commercial ? Once I start using them, how long will it take to get usable output ?
  • Once an entity is identified, transforming that into an RDF representation – what URI and tags to use (this relates to the ontology design), what serialisation format (if any) is best to work with – xml/ttl/n3/direct-load. How do I design my URIs (“hash or slash” ?) [5] ? Do I need to use or be aware of other URI schemes for compatibility and reuse ? [6]. Should I try to clean the source text first (OCR errors) or rectify this by using RDF “tricks” or “aliases” ?
  • Storing or staging the RDF and loading into a repository – one file, many files (how many?), a database, a filesystem, a semantic repository
  • Building a query front end for the data once it’s in the repository – check it’s in there, check it’s correct, make it available – do I need a 303 redirect ? Do I serve the ontology (T) alongside the instances (A) ? Do I need inference ? Can I use Tomcat/Jetty/OpenVirtuoso ? What hardware do I need ?
  • How do I make the content searchable ? What is search in this context – is it google-esque keyword lookup (TF/IDF) or is it a query console, or a browse capability – how can I do any of these things ? Do I need a separate search instance/technology or do any of the SemWeb technologies do this as well as store and retrieve RDF ? What am I going to index – text, uris (subjects/properties/objects)
  • Once I have a SPARQL endpoint, how do I build a webapp to talk to it – can I get JSON ? Can I get JSONP ? What XML formats are output ? Do I need to make server-side calls or can I use Ajax client-side calls ? Are there security features required – or do any of the SPARQL endpoint technologies have any security or access control facilities ?
  • What libraries of code are available to make calls to a SPARQL endpoint – are there specialist libraries or do I just treat it as an XML web service ?
  • If I am to link with other SPARQL endpoints, e.g. DBpedia [7], how do I do that ? Is it a server-side or client-side problem ? How do I match URIs or, more importantly, concepts ?
  • Can I build other datasets from related information later, independently, and then link those to my dataset ?
  • How can I build a UI around RDF – are there conventional ways to render forms or graphs ? Do I need to write code myself or are there “black boxes” that I can make use of to render RDF or forms to capture user input as RDF ?
  • How will machine rather than human access be handled ? Do I need to build an API other than the SPARQL endpoint – is this for other applications, for spiders or robots ? Do I need a client-side API ? Do I want to service cross-domain calls (e.g. JSONP/CORS) ?
  • Are there any concurrency issues to be aware of – will the data extracted ever be updated once we get it out of the text ?
  • Phasing – Will the extraction be phased, does it take place over time and need different staging and migration strategies ?
  • Will I need to deal with versions of my information ?
  • Do we need backup and resilience at the service and/or data levels ?
  • Can we cluster, can we deploy round-robin, can we separate the display logic hosting from query, and this from the data hosting ?
  • Is performance of this application comparable with a traditionally built application that might make thinly proxied JDBC service calls from Javascript ?
  • Can you really treat a semantic repository like an RDBMS ?
  • Is Sign-in really required – why not just drop a cookie and let people post using some random GUID ?
  • How can I represent my entities if I use RDF ? Can I control the IDs in the system ?
  • Are there any libraries of code that bridge the gap from Object to RDF ? What criteria do I need in selecting one ?
    • Licensing, cost
    • Support, documentation, recency
    • Tie-in to particular technology
    • Standards compliance
    • Performance
    • Configuration ease
    • API design – built to interface, modular, separation of concerns ?
    • Object oriented or RDF leaning ?
  • Can I build my app using common patterns – DAO, service layer, MVC ?
  • What if it all turns out to be wrong and semantic repository technology just doesnt do it ?
  • How do I control anonymity ? I dont want or care to know who people are really, and I want users to be secure knowing that they cannot be found by other users if they dont already know who they are. Likewise, how do I hide or cloak sensitive information even when a user allows another access to their details ?
  • Is the system performant in read and write ?
  • Does (how does) the system scale with concurrency ?
  • How can I allow cross-domain usage of the information, so that users, affinity groups and commercial 3rd parties create useful or novel applications from the location-content and then add on value with proprietary information and perhaps Linked Open Data ? How do I allow the end user to control that access ?
  • Can a message queue be modelled using RDF ?
  • Can messages be “sent” from one user to another or a user to an application ?
  • Can I ensure that identities cannot be inadvertently explored or discovered ?
  • Are there ontologies available that allow me to model my entities ?
  • What tools can I use to create an ontology, and reuse others ?
  • Do I need inference ? What advantages does it give me, or the users of the system ?
  • Can I combine those advantages with the application data to deliver a new kind of service ? Does Semantic Web technology deliver on its promise ?
  • How can I allow a user to use a structured vocabulary ?
  • Will I need to partition all this data by user, by application, by date ?
  • Can I host this reliably, efficiently and with performance ? What deployment configuration do I need during development and then when I come to host it ?
  • Do I need a purpose-designed semantic repository or can I use my favourite RDBMS as a storage medium ?
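Going back to the “11 miles” question in the entity extraction list: here is a small sketch of treating a distance as a compound (a number plus a unit) rather than a plain string. The class name and the regular expression are made up for illustration, and it only copes with simple forms like “11 miles” or “2.5 miles”, not fractions such as “3½”. Keeping the unit as its own value leaves open the question of whether an 1842 mile is today’s mile.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical extractor that models a distance as a number plus a unit. */
final class DistanceParser {

    /** A compound entity: the numeric value and the unit exactly as written in the text. */
    static final class Distance {
        final double value;
        final String unit;

        Distance(double value, String unit) {
            this.value = value;
            this.unit = unit;
        }
    }

    // Assumed surface form: digits (optionally with a decimal part), whitespace, "mile" or "miles".
    private static final Pattern DISTANCE = Pattern.compile("(\\d+(?:\\.\\d+)?)\\s+(miles?)\\b");

    /** Returns the first distance found in the text, or empty if there is none. */
    static Optional<Distance> parse(String text) {
        Matcher m = DISTANCE.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new Distance(Double.parseDouble(m.group(1)), m.group(2)));
    }
}
```

In RDF terms the value and the unit would then become two properties hanging off one node, instead of one literal “11 miles”.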
Delivery

This is really a question about who my users are, what their requirements and expectations are of the application, what I want to deliver to them – is it a technology demo or a useful, perhaps even attractive application ? Will it be long lasting – what are my expectations for it ?

For me, this is primarily a technology demonstration and a learning tool. If it turns out to actually work and delivers something over and above what a conventional application might, then perhaps I can keep it running without it costing too much. Cost is also a large consideration:

  • Amazon EC2 will cost roughly €2/day for a micro instance,
  • an ISP might charge €70/year but won’t be capable of hosting a large-memory JVM or of giving me the level of control I need, or
  • self-host at home on an old machine – electricity isn’t cheap, will the machine be powerful enough, and what do I do if and when the machine fails ?

The intent in this exercise is to learn as much as possible about the practical sides of building such an application, and to try and do what can be done without it becoming a research project. Along the way there are choices of course, and in the end hindsight and experience will pay rewards.

The J2EE application is to deliver a “white-label” (customisable) service for application builders who are interested in making use of crowd-sourced location data. Individuals need to be secure in trusting their location, and some aspect of their identity (or a moniker for it), to the system. They need to be sure that when they allow another user to get to it they cannot be identified from it (or from a sequence of locations or recurring location usage) unless they choose to be.

The tags used against locations or member profiles need to be queryable usefully – not just by equivalence or presence. Any queries run against the system must be available to non-technicians (people who are not SPARQL experts), so a usable UI is needed.

The service will not deliver a SPARQL endpoint initially, but may deliver one against a subset of aggregate information over long periods of time. Similarly an API delivered for third party users is only open to registered users or applications.

Users and groups get restricted volume access, and cannot schedule queries, while applications get unrestricted access to their own data, and query capabilities. Partners will get access to repository wide location data, but cannot see under which application or group the location was posted.

The information, and the intelligence within it, created from users attracted to commercial applications (e.g. an anonymous FourSquare-type application), is within the realm of the 3rd party delivering that application. This web service only deals with the custodianship of location trails, user-declared relationships between users, and the tags assigned by users to them.

Building the application must be successful in terms of raw functionality but also in terms of performance. It must not be tied to a particular semantic technology, and it must be possible to compare with a traditional RDBMS based application. Importantly, the application must demonstrate commercial advantage and a pattern of usage where information is sensitive, but also usefully relatable to the Linked Open Data web.

So – many questions, and many choices to make. Answers to most of these will follow in subsequent articles.


Building a Semantic Web Application in Java

November 22, 2010 1 comment

I assume you know Java and you have built web applications. You also know what a database is, what hierarchical data are, what a schema is. You may have used ORM libraries like Hibernate. You’ve done some UI work with JavaScript, JSP or RCP perhaps. You know what MVC means in the context of a web application. Now you want to know how and why you would use the Semantic Web and the Linked Open Data web to build a useful application.

You have also decided that you want to make sure it can fulfil all the usual requirements – performance, concurrent updating, ease of maintenance, basic functional capability (for a business app perhaps) – and, importantly, what’s the benefit, and how do you sell this to your manager ? For you it’s cool, it’s being talked about in all the hard-to-reach places that only you know, and it has more and more promise the more you look into it. You want to do it, but if it doesn’t cut the mustard, what’s the point – wait a while, let someone else solve the problems, then come back to it. You can still work in the web app world, doing what you’ve done for the last while, make some money, pay the bills. So, before you even start, the Semantic Web is up against it – it has a lot to prove against a long-embedded technology where experience has already paid and dividends been reaped. It better be good……

First question – where to start ? It’s a hard question to answer if you’re not familiar with the Semantic Web, and even if you are – if you can grasp the basic and simple ideas behind it – then you’re still going to have difficulty. And what’s the Linked Open Data web ? How do I use it, why should I bother ? Why would my boss be interested ? How can I make them interested ? How can they make money from it ? Doesn’t this mean my company’s data is going to be available to the public, and my competitors ?

So, over the next while, I’m going to try and relate how and why I did what I did to build two different kinds of web application –

1) a read-only reference data application that makes use of content loaded into a repository and fronted with a SPARQL endpoint, and talks to geonames, dbpedia, sindice, uberblic and google maps.

2) a “White Label” J2EE Location Based Service that uses JPA behind its DAOs to talk to a semantic repository. It also makes use of OpenID to provide some anonymity, Spring Security to provide security and ACL controlled authorization (all stored in a semantic repository) and integrates the first app using JSONP (Facebook Authentication and OAuth authorization are also on the cards).

It has taken a while, and there have been some good and bad choices, so tune in next time for the first installment in the series, selected from this bunch :

  1. Target Functionality – what is my application and what parts of the semantic web do I want and/or need to deliver. Does the application want/need to be part of the Linked Open Data web ?
  2. Selection criteria – technologies, content, delivery
  3. Available tools and technologies. Can I avoid SOA pitfalls or is this just the same old story ?
  4. What needs writing ?
  5. How to architect ?
  6. Do I need an ontology – how to create ? What’s OWL-DL/Lite/RDFS/DAML/XYZ ?
  7. Text vs RDF vs XML – Reading, generating, parsing, APIs.
  8. Size, performance, scale
  9. Output – files, database, RDF, URIs, linkage, API ?
  10. Do I need freetext search ? How do I do it ? Why not just use SOLR and faceted search ?
  11. Content Negotiation – who/what is my audience ? browser, script, machine, API ?
  12. Mapping – I’ve got location, lets use it – show it and link it
  13. Ajax, js – can I use semantic web/rdf/rdfa libs in my UI  – do I have to do it all ? What help is there ?
  14. UI – can I build a UI to display and collect queries (forms) using my ontology ? How do I allow a human to easily navigate through my structured, but infinitely (hey, what about closure ?) graphable set of nodes ?

Which do you want to hear about first ?

Like I said, I built two web applications for the Semantic and Linked Open Data web. The articles following this introduction will talk about both in parallel.

Semantic Progress

November 7, 2009 Comments off

Cleaned up some of the RDF that I had been trying to ignore, so the code and config is now in a state to do a full run and populate a repository. Getting about 80% hit to GeoNames and about 80% of that again to DBPedia (so roughly 64% of records overall). Will do a manual hitlist later, and work out how to update the repository with decorations like that. Now just have to work out Jena and TDB, and hope my URIs are navigable out of the box – I doubt it.

Semantic Progress

November 2, 2009 Comments off

Almost there. Have to do a run over Lewis and find which records I can’t match to DBPedia or GeoNames and create manual links, then work out how to get into TDB (or perhaps SDB). Wonder how the mapping for 303 Concept-to-Document is going to work.

Will then get the semantic pedants to have a look-see, before a friends-and-family and a general announcement. Hope my hardware can cope.

Have to track access and usage as well. Need an opensource analytics package.

And then I can go back to NLP and ML. Will contact Getty about the AAT thesaurus.

Semantic Progress

October 31, 2009 Comments off

Getting closer. Integrated some fixed dbpedia lod links, now looking at rest of toplevel info in repository items for possible ontology matching and lod links. Will also need a static decorator for dc publish info, and perhaps foaf links to here.

Getty AAT thesaurus looks like the job for some of the other information yet to conceptize, but there are licensing costs. Need to consider some more.

Drupal7 looks interesting, especially for up-and-running publishing needs; it might be a quick way to build an app on top of LewisT. Shame (for me) it’s in PHP!

Got to get the hosting machine set up some more: damned Intel i845 graphics chip causing all kinds of issues with package installation for me. I might need something more powerful as well. We’ll see.

  • not enough time in the day !
  • Drupal7 vs ?
  • Calendaring, Todo, fix SCM
  • All things RDFized
  • dc into lewist
  • foaf into lewist
  • more geonames into lewist
  • publish and be damned !
  • Getty – AAT
  • domainnames to set up
  • machine setup
  • openVirtuoso, pubby, sesame ? jena repos with ARQ ?
  • HI2111 – elphin, dunraven, newspaper.