Archive

Posts Tagged ‘thesaurus’

Using the W3C org ontology to describe Ireland’s entry to the EU (i)

November 22, 2014 Comments off

The EU institutions working on eGovernment interoperability (“Joinup”) have committed their work on an ontology describing organisations to the W3C. This follows earlier work by Dave Reynolds.

There are also ontologies and vocabularies for Registered Organisations, EU institutions, job descriptions and so on. All very useful – really ! – but in trying to use them to describe the history of Irish accession to, and membership of, the EU I have come across a few things that have led to some customisations and extensions – a new ontology, in fact. Of course, this is part of the beauty of Linked Data, and I hope to have some real data to publish soon for review.

The dataset should cover, starting in 1950, the history of Ireland’s interaction with the EU and the events that took place along the way, in context. This will include organisations, events, ministries, ministers, locations, dates and outcomes, and so will require some thought and analysis to avoid duplicating what is there already and to make correct use of the classes and properties already defined.

There are a few things to consider, not least the definitions of a “state”, a “government” and a “ministry” or “department”. The Location ontology allows for geometric coordinates and spatial descriptions, for instance, but the organisation ontology doesn’t cater for describing a political or economic location. So I am adding that, making use of another ontology (the Agricultural Information Management Standards geopolitical ontology/thesaurus) and joining the two with an object property, as sketched below.
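
As a rough illustration of the kind of bridge I have in mind, here is a minimal Jena sketch (Jena 2.x package names) that declares a hypothetical object property linking an org:Organization to an entity from the AIMS geopolitical ontology. The exirl: namespace and property name are my own inventions for this post, and the geopolitical namespace and class name shown are assumptions to be checked against the published ontology.

```java
import com.hp.hpl.jena.ontology.ObjectProperty;
import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class BridgeProperty {
    // Hypothetical namespace for my Ireland/EU extension ontology
    static final String EXIRL = "http://example.org/ireland-eu#";
    // W3C organization ontology
    static final String ORG = "http://www.w3.org/ns/org#";
    // Assumed namespace for the AIMS/FAO geopolitical ontology - check before use
    static final String GEO = "http://aims.fao.org/aos/geopolitical.owl#";

    public static void main(String[] args) {
        OntModel m = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);
        OntClass organization = m.createClass(ORG + "Organization");
        OntClass geoEntity = m.createClass(GEO + "geopolitical_entity"); // assumed class name

        // The bridging object property: an organisation has a geopolitical context
        ObjectProperty hasContext = m.createObjectProperty(EXIRL + "hasGeopoliticalContext");
        hasContext.addDomain(organization);
        hasContext.addRange(geoEntity);
        hasContext.addLabel("has geopolitical context", "en");

        m.write(System.out, "TURTLE");
    }
}
```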

The Joinup vocabulary for EU institutions includes historical names for the economic organisations and treaties – the European Coal and Steel Community, for instance – and is described using SKOS. This gets included as a reference source in my ontology. Treaties are not described as particular things, so I’m working on adding that too; PROV-O may or may not help here. The CPSV vocabulary describes public services but appears incomplete or still in development. It also doesn’t seem to link to the org ontology in any way, so I’ll tackle that as well. There is also a vocabulary for institution names and civil service roles that will be useful, as long as I can link it back to FOAF properly – I’m still trying to see how !

Of course, this all highlights some of the issues and advantages of open-world ontologies:

  • your modelling perspective defines what you want to use and how you see other work – I see some gaps and connections that others may not, or that were not important or particular at the time, or that in fact need to be – or can only be – defined depending on your perspective
  • it’s possible to make up for that by joining and linking classes in a new ontology that reflects your vision
  • it can be hard work trying to figure out what exists already and where and how to extend it
  • tool support (in Protégé at least) is lacking when it comes to creating individuals as well as new classes and properties – ideally I want to pick a class, then get some completion help that lets me see the class hierarchy and the available properties (like a programming IDE – Eclipse or IDEA, for instance). I resort to SPARQL.
  • how to coin linkable URIs, and where to publish them with content negotiation and a SPARQL endpoint ?

Still, it’s going to take some time, but should be interesting to come up with something that can be reviewed, corrected and evolve – without reinventing the wheel. Success will be measured in how many inbound links it gets !

Java Semantic Web & Linked Open Data webapps – Part 3.0

November 26, 2010 5 comments

Available tools and technologies

(this section is unfinished, but taking a while to put together, so – more to come)

When you first start trying to find out what the Semantic Web is in technical terms, and then what the Linked Open Data web is, you soon find that you have a lot of reading to do – because you have lots of questions. That is not surprising, since this is a new field (even though Jena, for instance, has been going for 10 years) for the average Java web developer who is used to RDBMSs, SOA, MVC, HTML, XML and so on. On the face of it, RDF is just XML, right ? A semantic repository is some kind of storage do-dah and there’s bound to be an API for it, right ? Should be an easy thing to pick up, right ? But you need answers to these kinds of questions before you can start describing what you want to do as technical requirements, understanding what the various tools and technologies can do and which ones are suitable and appropriate, and then selecting some for your application.

One pathway is to dive in, get dirty and see what comes out the other side. But that to me is just a little unstructured and open-ended, so I wanted to tackle what seemed to be fairly real scenarios (see Part 2 of this series) – a 2-tier web app built around a SPARQL endpoint with links to other datasets and a more corporate style web application that used a semantic repository instead of an RDBMS, delivering a high level API and a semantic “console”.

In general, then, it seems you need to cover the following areas in your reading:

  • Metadata – this is at the heart of the Semantic Web and Linked Open Data web. What is it !! Is it just stuff about Things ? Can I just have a table of metadata associated with my “subjects” ? Do I need a special kind of database ? Do I need structures of metadata – are there different types of things or “buckets” I need to describe things as ? How does this all relate to how I model my things in my application – is it different than Object Relational Modelling ? Is there a specific way that I should write my metadata ?
  • RDF, RDFS and OWL – what are they, why are they used, how are they different from just XML or RSS; what is a namespace, what can you model with them, what tools there are, and so on
  • SPARQL – what is it, how to write it, what makes it different from SQL; what can it NOT do for you, are there different flavours, where does it fit in a web architecture compared to where a SQL engine might sit ?
  • Description Logic – you’ll come across this and wonder, or worse, give up – it can seem very arcane very quickly – but do you need to know it all, or any of it ? What’s a graph, a node, a blank node dammit, a triple, a statement ?
  • Ontologies – isn’t this just a taxonomy ? Or a thesaurus ? Why do I need one, how does metadata fit into it ? Should I use RDFS or OWL or something else ? Is it XML ?
  • Artificial Intelligence, Machine Learning, Linguistics – what !? you mean this is robotics and grammar ? where does it fit in – what’s going on, do I need to have a degree in cybernetics to make use of the semantic web ? Am I really creating a knowledge base here and not a data repository ? Or is it an information repository ?
  • Linked Open Data – what does it actually mean – it seems simple enough ? Do I have to have a SPARQL endpoint or can I just embed some metadata in my documents and wait for them to be crawled. What do I want my application to be able to do in the context of Linked Open Data ? Do I need my own URIs ? How do I make or “coin” them ? How does that fit in with my ontology ? How do I host my data set so someone else can use it ? Surely there is best practice and examples for all this ?
  • Support and community – this seems very academic, and very big – does my problem fit into this ? Why can I not just use traditional technologies that I know and love ? Where are all the “users” and applications if this is so cool and useful and groundbreaking ? Who can help me get comfortable, is anyone doing any work in this field ? Am I doing the right thing ? Help !

I’m going to describe these things before listing the tools I came across and ended up selecting for my applications. So – this is going to be a long post, but you can scan and skip the things you know already. Hopefully, you can get started more quickly than I did.

First the End

So you read and you read, and you come across tools and libraries and academic reports and W3C documents, and you see it has been going on for some time – some things are available and current, others available and dormant. Most are open source thankfully, and you can get your hands on them easily, but where to start ? What to try first – what is the core issue or risk to take on first ? Is there enough to decide whether you should continue ?

What is my manager going to say to me when I start yapping on about all these unfamiliar things –

  • why do I need it ?
  • what problem is it solving ?
  • how will it make or save us money  ?
  • our information is our information – why would I want to make it public ?

Those are tough questions when you’re starting from scratch, and no one else seems to be using the technologies you think are cool and useful – who is going to believe you if you talk about a sea-change, or “Web 3.0”, or a paradigm shift, or an internet for “machines” ? I believe you need to demonstrate-by-doing, and to get to the bottom of these questions so you know the answers before someone asks them of you. And you had better end up believing what you’re saying, so that you are convincing and confident. It’s risky….*

So – off I go – here is what I found, in simple, probably technically incorrect terms – but you’ll get the idea and work out the details later (if you even need to)

*see my answers to these questions at the end of this section

Metadata, RDF/S, OWL, Ontologies

Coarsely, RDF allows you to write linked lists. URIs allow you to create unique identifiers for anything. If you use the same URI twice, you’re saying that the exact same Thing is meant in both places. You create the URIs yourself, or, when you want to identify a thing (“john smith”) or a property of a thing (eg “loginId”) that already exists, you reuse the URI that you or someone else created. You may well have a URI for a concept or idea, and others for its physical forms – eg a URI for a person in your organisation, another for the webpage that shows his photo and telephone number, another for his HR system details.

Imagine 3 columns in a spreadsheet called Subject, Predicate and Object. Imagine a statement like “John Smith is a user with loginId ‘john123’ and he is in the sales department”. This ends up like this:

Subject  Predicate   Object
S-ID1    type        User
S-ID1    name        “John”
S-ID1    familyName  “Smith”
S-ID1    loginId     “john123”
S-ID2    type        Department
S-ID2    name        “sales”
S-ID2    member      S-ID1

That is it, simply – RDF allows you to say that a Thing with an ID we call S-ID1 has properties, and that those properties are either other Things (S-ID2/member/S-ID1) or literal things like the string “john123”.

So you can build a “graph” – a connected list of Things (nodes) where each Thing can be connected to another Thing. And once you look at one of those Things, you might find that it has other properties that link to different Things that you don’t know about or that aren’t related to what you are looking at – S-ID2 may have another “triple” or “statement” that links it with ID-99, say (another user), or ID-10039 (a car lot space). So you can wire up these graphs to represent whatever you want in terms of properties and values (Objects). A Subject, Property or Object can be a reference to another Thing.

Metadata are those properties you use to describe Things. And in the case of RDF, each metadatum can be a Thing with its own properties (follow the property to its own definition), or a concrete fact – eg a string, a number. Why is metadata important ? Because it helps you contextualise and find things, and differentiate one thing from another even if they have the same name. Some say “Content is King”, but I say Metadata is !

RDF has some predefined properties like “property” and “type”. It’s pretty simple and you’ll pick it up easily [1]. Now, RDFS extends RDF to add some more predefined properties that allow you to create a “schema” that describes your data or information – “class”, “domain”, “range”, “label”, “comment”. So if you start to formalise the relationships described above – a user has a name, familyName, loginId and so on – before you know it, you’ve got an ontology on your hands. That was easy, right ? No cyborgs, logic bombs, T-Box or A-Box in sight (see the next section). And you can see the difference between an ontology and a taxonomy – the latter is a way of classifying or categorising things, but an ontology does that and also describes and relates them. So keep going, this isn’t hard ! (Hindsight is great too.)

Next you might look at OWL, because you need more expressiveness and control in your information model, and you find out that it has different flavours – Lite, DL and Full [2]. What do you do now ? Well, happily, you don’t have to think about it too much, because it turns out that you can mix and match things in your ontology – use RDFS and OWL, and you can even use things from other ontologies. Mash it up – you don’t have to define these properties from scratch yourself. So go ahead and do it, and if you find that you end up in OWL Full instead of DL then you can investigate and see why. The point is: start, dig in and do what you need to do. You can revise and evolve at this stage.

A metadata specification called “Dublin Core” [3] comes up a lot – this is a useful vocabulary for describing things like “title”, “creator”, “relation”, “publisher”. Another, the XSD schema, is useful for defining things like number types – integer, long and float – and is used as part of SPARQL for describing literals. You’ll also find that there are properties of things you thought were so common that someone would have an ontology or a property defined for them already. I had quite a time looking for a definition of old English miles, but it turns out, luckily, that there was one [4,5]. On the other hand, there wasn’t one for a compass bearing of “North” – or at least not one that I could find, so I invented one, because it seemed important to me. Not all things in your dataset will need metadata – and in fact you might find that you and someone working on another project have completely different views on what’s important in a dataset – you might be interested in describing financial matters, and someone else might be more interested in the location information. If you think about it long enough, a question might come to mind – should we still maintain our data somewhere in canonical, raw or system-of-record form, and have multiple views of it stored elsewhere ? (I don’t have an answer for that one yet.)

Once you start, you soon see that the point of reusing properties from other ontologies is that you create connections between datasets and information just by using them – if the finance department uses “creator”, you can now link its records to records in the HR system about the same person, because the value used for “creator” is in fact a unique URI (simply, an ID that looks like a URL), eg http://myCompany.com/people/john123. If you have another John in the company, he’ll have a different ID, eg http://myCompany.com/people/john911, so you can be sure that the link is correct and precise – no ambiguity – John123 will not get the payslip meant for John911. There are also other ways of connecting information – you could use owl:sameAs for instance – this makes a connection between two Things when a common vocabulary or ID is not available, or when you want to make a connection where one didn’t exist before. But think about these connections before you commit them to statements – the correctness, provenance and trust around a new connection has to be justifiable – you want your information and the assertions about it to have integrity, right ?
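
As a small, hedged illustration (Jena 2.x package names; the company URIs are made up), here is how such an owl:sameAs link between two IDs for the same person might be asserted:

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.OWL;

public class SameAsSketch {
    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        // Two URIs that turn out to identify the same person (made-up IDs)
        Resource hrJohn = m.createResource("http://myCompany.com/people/john123");
        Resource financeJohn = m.createResource("http://myCompany.com/finance/staff/jsmith");
        // Assert the link only once you are satisfied it is justified
        m.add(hrJohn, OWL.sameAs, financeJohn);
        m.write(System.out, "TURTLE");
    }
}
```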

I needed RDF and RDFS at least – this would be the means by which I would express the definitions and parameters of my concepts, and also the statements representing actual embodiments of those concepts – instances. It started that way, but I knew I might need OWL if I wanted more control over the structure and integrity of my information – eg to say that John123 could only be a member of one department and one department only, or that he had the role of “salesman” but couldn’t also have the role of “paymaster”. So, if you need this kind of thing, read more about it [6,7]. If you don’t yet, just keep going – you can come back to it later. (It turns out I did, in fact.)
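
For what it’s worth, here is a rough sketch of those kinds of OWL constraints using Jena’s OntModel API (Jena 2.x package names; the myCo class and property URIs are invented for the example): a cardinality restriction saying a User is a member of exactly one Department, and a disjointness axiom so that Salesman and Paymaster cannot overlap.

```java
import com.hp.hpl.jena.ontology.CardinalityRestriction;
import com.hp.hpl.jena.ontology.ObjectProperty;
import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class OwlConstraintsSketch {
    static final String MYCO = "http://myCompany.com/ontology#"; // invented namespace

    public static void main(String[] args) {
        OntModel m = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);

        OntClass user = m.createClass(MYCO + "User");
        OntClass department = m.createClass(MYCO + "Department");
        ObjectProperty memberOf = m.createObjectProperty(MYCO + "memberOf");
        memberOf.addDomain(user);
        memberOf.addRange(department);

        // A User is a member of exactly one Department
        CardinalityRestriction exactlyOne = m.createCardinalityRestriction(null, memberOf, 1);
        user.addSuperClass(exactlyOne);

        // A Salesman cannot also be a Paymaster
        OntClass salesman = m.createClass(MYCO + "Salesman");
        OntClass paymaster = m.createClass(MYCO + "Paymaster");
        salesman.addDisjointWith(paymaster);

        m.write(System.out, "RDF/XML-ABBREV");
    }
}
```

Bear in mind that under the open-world assumption a reasoner treats a restriction like this as something to infer with, not as a database-style validation check.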

The table above now looks like this when you use URIs – it’s the same information, just written down in a way that ensures things are unique, and connectable.

Namespaces
myCo: http://myCompany.com/people/
rdf:  http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs: http://www.w3.org/2000/01/rdf-schema#
foaf: http://xmlns.com/foaf/0.1/

Subject     Predicate         Object
myCo:S-ID1  rdf:type          myCo:User
myCo:S-ID1  rdfs:label        “John”
myCo:S-ID1  foaf:family_name  “Smith”
myCo:S-ID1  myCo:loginId      “john123”
myCo:S-ID2  rdf:type          myCo:Department
myCo:S-ID2  rdfs:label        “sales”
myCo:S-ID2  myCo:member       myCo:S-ID1

The namespaces at the top of the table mean that you can use shorthand in the three columns and don’t have to repeat the longer part of the URI each time. It makes things easier to read and take in too, especially if you’re a simple human. For predicates, I’ve changed name to rdfs:label and familyName to foaf:family_name [8]. In the Object column only the myCo namespace is used – in the first case it points to a type defined elsewhere (in the ontology, in fact). I say the ontology is defined elsewhere, but that doesn’t have to be physically elsewhere; it’s not uncommon to have a single file on disk that contains the RDF defining the ontology as well as the instances that make up the vocabulary or the information base.
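
As a minimal sketch of the same table built programmatically (Jena 2.x package names; myCo is the made-up namespace from the table):

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDF;
import com.hp.hpl.jena.vocabulary.RDFS;

public class TripleTableSketch {
    public static void main(String[] args) {
        String myCo = "http://myCompany.com/people/";
        String foaf = "http://xmlns.com/foaf/0.1/";

        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("myCo", myCo);
        m.setNsPrefix("foaf", foaf);

        Property familyName = m.createProperty(foaf, "family_name");
        Property loginId = m.createProperty(myCo, "loginId");
        Property member = m.createProperty(myCo, "member");

        // myCo:S-ID1 - the user John Smith
        Resource john = m.createResource(myCo + "S-ID1")
                .addProperty(RDF.type, m.createResource(myCo + "User"))
                .addProperty(RDFS.label, "John")
                .addProperty(familyName, "Smith")
                .addProperty(loginId, "john123");

        // myCo:S-ID2 - the sales department, with John as a member
        m.createResource(myCo + "S-ID2")
                .addProperty(RDF.type, m.createResource(myCo + "Department"))
                .addProperty(RDFS.label, "sales")
                .addProperty(member, john);

        m.write(System.out, "TURTLE"); // the same triples, serialised as Turtle
    }
}
```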

So – why is this better than a database schema ? The simple broad answers are the best ones I think :

  • You have only 3 columns*
  • You can put anything you like in each column (almost – literals can’t be predicates (?)), and it’s possible to describe a binary property User->name->John as well as n-ary relationships [9] User->hasVehicle->car->withTransmission->automatic
  • You can define what you know about the things in those columns and use it to create a world view of things (a set of “schema rules”, an ontology).
  • You can (and should) use common properties – define a property called “address1” and use it in Finance and HR so you know you’re talking about the same property. But if you don’t, you can fix it later with some form of equivalence statement.
  • If there are properties on instances that aren’t in your ontology, they don’t break anything, but they might give you a surprise – this is called an “open world assumption” – that is to say just because it is not defined does not mean it cannot exist – this is a key difference from database schema modelling.
  • You use the same language to define different ontologies, rather than say MySQL DDL for one dataset and Oracle DDL for another
  • There is one language spec for querying any repository – SPARQL **. You use the same for yours and any others you can find – and over HTTP: no firewall dodging, no operations team objections; predictable, quick and easy to access (see the sketch after this list)
  • You do not have to keep creating new table designs for new information types
  • You can easily add information types that were not there before while preserving older data or facts
  • You can augment existing data with new information that allows you to refine it or expand it – eg provide aliases that allow you to get around OCR errors in extracted text, alternative language expressions
  • Any others ?

*Implementations may add one or two more, or break things up into partitioned tables for contextual or performance reasons
**there are different extensions in different implementations
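
To make the “same query language, over HTTP” point concrete, here is a minimal sketch using Jena ARQ’s remote query support (Jena 2.x package names). DBpedia’s endpoint is used purely as a familiar example; any SPARQL endpoint would do.

```java
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;

public class RemoteSparqlSketch {
    public static void main(String[] args) {
        String endpoint = "http://dbpedia.org/sparql";   // any public or private endpoint
        String query =
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
            "SELECT ?label WHERE { <http://dbpedia.org/resource/Ireland> rdfs:label ?label }";

        // One plain HTTP request; no drivers, no firewall dodging
        QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.get("label"));
            }
        } finally {
            qe.close();   // release the HTTP connection
        }
    }
}
```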

[1]http://rdfabout.net/
[2]http://www.w3.org/TR/owl-features/
[3]http://dublincore.org/
[4]http://forge.morfeo-project.org/wiki_en/index.php/Units_of_measurement_ontology
[5]http://purl.oclc.org/NET/muo/ucum-instances.owl
[6]http://www.cs.man.ac.uk/~horrocks/ISWC2003/Tutorial/
[7]http://www.w3.org/TR/owl-ref/#sameAs-def
[8]http://xmlns.com/foaf/spec/
[9] http://www.w3.org/TR/swbp-n-aryRelations/

SPARQL, Description Logic (DL), Ontologies

SPARQL [10] aims to allow those familiar with querying relational data to query graph data without too much introduction. It’s not too distant, but it needs a little getting used to. “SELECT * FROM users” becomes something like “SELECT * WHERE { ?s rdf:type myCo:User }”, and you get back bindings for the variables in your pattern rather than every column from a table. Of course this is because you effectively have 3 “columns” in the graph data, and they’re populated with a bunch of different things. So you need to dig deeper [11] into tutorials and what others have written [12,13].

One of the key things about SPARQL is that you can use it to find out what is in the graph data without having any idea beforehand [14]. You can ask for the types of data available, then ask for the properties of those types, then DESCRIBE or select a range of types for identified subjects. So it’s possible to discover what’s available to suit your needs, or for anyone else to do the same with your data.
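
A small sketch of that kind of poking around, run locally with Jena ARQ against whatever Model you have loaded (Jena 2.x package names; “data.ttl” is a placeholder filename):

```java
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.util.FileManager;

public class DiscoveryQueries {
    public static void main(String[] args) {
        // Load whatever RDF you have to hand
        Model model = FileManager.get().loadModel("data.ttl");

        // 1. What types of Thing are in here ?
        run(model, "SELECT DISTINCT ?type WHERE { ?s a ?type }");

        // 2. What properties are used to describe them ?
        run(model, "SELECT DISTINCT ?p WHERE { ?s ?p ?o }");
    }

    static void run(Model model, String queryString) {
        QueryExecution qe = QueryExecutionFactory.create(queryString, model);
        try {
            ResultSetFormatter.out(System.out, qe.execSelect());
        } finally {
            qe.close();
        }
    }
}
```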

Another useful thing is the ability (for some SPARQL engines – Jena’s ARQ [15] comes to mind) to federate queries, either by using a “graph” (effectively just a named set of triples) whose name is a URI pointing to a remote dataset, or, in Jena’s case, by using the SERVICE keyword. So you can have separate and independent datasets and query across them easily. Sesame [16] allows a similar kind of thing with its Federated Sail, but you predefine the federation you want rather than specify it in-situ. Beware of runtime network calls in the Jena case, and consider hosting your independent data in a single store but under different graphs to avoid them. You’ll need more memory in one instance, but you should get better performance. And watch out for JVM memory limits and type size increases if you (probably) move to a 64-bit JVM [17,18].
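
A minimal sketch of the SERVICE style of federation with ARQ, assuming some local data holds myCo:basedIn links pointing at DBpedia country URIs (the local file, the property and the choice of endpoint are all just illustrative):

```java
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.query.Syntax;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.util.FileManager;

public class FederatedQuerySketch {
    public static void main(String[] args) {
        Model local = FileManager.get().loadModel("local.ttl");   // placeholder local dataset

        String queryString =
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
            "PREFIX myCo: <http://myCompany.com/people/> " +
            "SELECT ?office ?countryLabel WHERE { " +
            "  ?office myCo:basedIn ?country . " +              // matched against local data
            "  SERVICE <http://dbpedia.org/sparql> { " +        // evaluated remotely, at runtime
            "    ?country rdfs:label ?countryLabel . " +
            "    FILTER ( lang(?countryLabel) = 'en' ) " +
            "  } " +
            "}";

        // ARQ syntax covers SERVICE even on releases that predate SPARQL 1.1
        Query query = QueryFactory.create(queryString, Syntax.syntaxARQ);
        QueryExecution qe = QueryExecutionFactory.create(query, local);
        try {
            ResultSetFormatter.out(System.out, qe.execSelect());
        } finally {
            qe.close();
        }
    }
}
```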

While learning the syntax of SPARQL isn’t a huge matter, understanding that you’re dealing with a graph of data, and having to navigate or understand that graph beforehand, can be a challenge, especially if it’s not your own data you want to federate or link with. Having ontologies and sample data (from your initial SPARQL queries) helps a lot, but it can be like trying to understand several foreign database schemas at once, visualising a chain rather than a hierarchy, taking on multiple inheritance and perhaps cardinality rules, domain and range restrictions and maybe other advanced ontology capabilities.

SPARQL engines, or the libraries used by SPARQL engines, that allow inferencing provide a unique selling point for the Semantic and Linked web of data. Operations you cannot easily do in SQL become possible. Derived statements carrying information that is not actually “asserted” in the physical data you loaded into your repository start to appear. You might for instance ask for all Subjects or things of a certain type. If the ontology of the information set says that one type is a subclass of another – say you ask for “cars” – then you’ll get back statements that say your results are cars, but you’ll also get statements saying they are also “vehicles”. If you did this with an information set that you were not familiar with, say a natural history dataset, then when you ask for “kangaroos” you are also told that each one is an animal, a kangaroo, and a marsupial. The animal statement might be easy to understand, but perhaps you expected that it was a mammal. And you might not have expressly said that a kangaroo was one or the other.

Once you get back results from a SPARQL query you can start to explore – you start looking for kangaroos, then you follow the marsupial link, and you end up with the opossum, then you see it’s in the USA and not Australia, and you compare the climates of the two continents. Alternatively, of course, you may have started at the top end – asked for marsupials, and got back all the kangaroos and koalas etc. – then you drill down into living environment and so on. Another scenario deals with disambiguation – you ask for statements about eagles and the system might return things named “eagles”, but you’ll be able to see that one is a band, one is a US football team, and another a bird of prey. Then you might follow links up or down the classifications in the ontology.

Some engines have features or utilities that allow you to “forward-chain” [19] statements before loading – using an ontology, or a reasoning engine based on a language specification, derived statements about things are asserted and materialised for you before you load them into your repository. This is not only about class hierarchies: even where a hierarchy isn’t explicit, inference might create a statement – “if a Thing has a title, pages and a hardback cover, then it is … a book”. This saves the effort at runtime and should mean that you get a faster response to your query. Forward chaining (and backward chaining [20]) are common reasoning methods used with inference rules in Artificial Intelligence and logic systems.
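
Here is a small sketch of the kangaroo example using Jena’s built-in RDFS reasoner (Jena 2.x package names; the zoo: namespace is invented). Only the rdf:type Kangaroo statement is asserted in the data; the Marsupial and Animal types come back as derived statements. Jena happens to compute them at query time here, but they are the same statements a forward-chainer would materialise before loading.

```java
import com.hp.hpl.jena.rdf.model.InfModel;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.StmtIterator;
import com.hp.hpl.jena.vocabulary.RDF;
import com.hp.hpl.jena.vocabulary.RDFS;

public class RdfsInferenceSketch {
    public static void main(String[] args) {
        String zoo = "http://example.org/zoo#";   // invented namespace

        // Schema: Kangaroo subClassOf Marsupial subClassOf Animal
        Model schema = ModelFactory.createDefaultModel();
        Resource animal = schema.createResource(zoo + "Animal");
        Resource marsupial = schema.createResource(zoo + "Marsupial")
                .addProperty(RDFS.subClassOf, animal);
        Resource kangaroo = schema.createResource(zoo + "Kangaroo")
                .addProperty(RDFS.subClassOf, marsupial);

        // Data: a single asserted statement
        Model data = ModelFactory.createDefaultModel();
        Resource skippy = data.createResource(zoo + "skippy")
                .addProperty(RDF.type, kangaroo);

        // Ask the inference model for skippy's types: the superclasses appear too
        InfModel inf = ModelFactory.createRDFSModel(schema, data);
        StmtIterator it = inf.listStatements(skippy, RDF.type, (RDFNode) null);
        while (it.hasNext()) {
            System.out.println(it.nextStatement().getObject());
        }
    }
}
```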

It turns out that Description Logic, or “DL” [21], is what we are concerned with here – a formal way of expressing or representing knowledge: things have properties that take certain values. OWL is a DL representation, for instance. And, as in object-oriented programming languages – Java, say – there are classes (the ontology, T-Box statements) and instances (A-Box statements, vocabularies). There are also notable differences from Java (eg multiple inheritance and typing), and a higher level of formalism, and these can make mapping between your programming language and your ontology or model difficult or problematic. For some languages, Prolog or Lisp say, this mapping may not be such a problem, and indeed you’ll find many semantic tools and technologies built using them.

Although DL and AI can get quite heady once you start delving into these things, it is easy to start with the understanding that they allow you to describe or model your information expressively and formally without being bound to an implementation detail like the programming language you’ll use, and that once you do implement and make use of your formal knowledge representation – your ontology – hidden information and relationships may well become clear where they were not before. Doing this with a network of information sets means that the scope of discovery and fact is broadened – for your business, this may well be the difference between a sale or not, or provide a competitive edge in a crowded market.

[10]http://www.w3.org/TR/rdf-sparql-query/
[11]http://www.w3.org/2009/sparql/wiki/Main_Page
[12] http://www.ibm.com/developerworks/xml/library/j-sparql/
[13]http://en.wikibooks.org/wiki/XQuery/SPARQL_Tutorial
[14]http://dallemang.typepad.com/my_weblog/2008/08/rdf-as-self-describing-data.html
[15]http://openjena.org/ARQ/
[16]http://wiki.aduna-software.org/confluence/display/SESDOC/Federation
[17]http://jroller.com/mert/entry/java_heapconfig_32_bit_vs
[18]http://portal.acm.org/citation.cfm?id=1107407
[19]http://en.wikipedia.org/wiki/Forward_chaining
[20]http://en.wikipedia.org/wiki/Backward_chaining
[21]http://en.wikipedia.org/wiki/Description_logic

Artificial intelligence, machine learning, linguistics

When you come across Description Logic and the Semantic Web in the context of identifying “things” or entities in documents – for example the name of a company or person, a pronoun or a verb – you’ll soon be taken back to memories of school – grammar, clauses, definite articles and so on. And you’ll grow to love it, I’m sure, just like you used to 🙂
It’s a necessary evil, and it’s at the heart of one side of the semantic web – information extraction (“IE”) as a part of information retrieval (“IR”) [22,23]. Here, we’re interested in the content of documents, tables, databases, pages, Excel spreadsheets, PDFs, audio and video files, maps, etc. And because these “documents” are written largely for human consumption, in order to get at the content using “a stupid machine”, we have to be able to tell the stupid machine what to do and what to look for – it does not “know” about language characteristics – what the difference is between a noun and a verb – let alone how to recognise one in a stream of characters, with variations in position, capitalisation, context and so on. And what if you then want to say that a particular noun, used in a particular way, is a word about “politics” or “sport”; that it’s English rather than German; that it refers to another word two words previous; and that it’s qualified by an adjective immediately after it ? This is where Natural Language Processing (NLP) comes in.

You may be familiar with tokenising a string in a high-level programming language, then writing a loop to look at each word and do something with it. NLP will do this kind of thing but apply more sophisticated abstractions, actions and processing to the tokens it finds, even having a rule base or dictionary of tokens to look for, or allowing a user to dynamically define what that dictionary or gazetteer is. Automating this is where Machine Learning (ML) comes in. Combined, and making use of mathematical modelling and statistical analysis, they look at sequences of words, make a “best guess” at what each word is, and tell you how good that guess is.

You will probably need to “train” the machine learning algorithm or system with sample documents – manually identify and position the tokens you are interested in, tag them with categories (perhaps these categories themselves come from a structured vocabulary you have created, found or bought) and then run the “trained” extractor over your corpus of documents. With luck, or actually with a lot of training (maybe 20%-30% of the corpus size), you’ll get some output that says “rugby” is a “sports” term and “All Blacks” is a “rugby team”. Now you have your robot, your artificial intelligence.

But the job is not done yet – for the Semantic and Linked web, you now have to do something with that output – organise it and transform it into RDF, a related set of extracted entities – relate one entity to another in a statement “all blacks”-“type”-“rugby team”, and then collect your statements into a set of facts that mean something to you, or to the user for whom you are creating your application. This may be defined or contextualised by some structure in your source document, but it may not be – you may have to provide an organising structure. At some point you need to define a start – a Subject you are going to describe – and one of the Subjects you come up with will be the very beginning or root Thing of your new information base. You may also consider using an online service like OpenCalais [24], but you’re then limited to the range of entities and concepts that those services know about – in OpenCalais’ case it’s largely business and news topics – wide-ranging for sure, but if you want to extract information about rugby teams and matches it may not be too successful. (There are others available and more becoming available.) In my experience, most often and for now, you’ll have to start from scratch, or as near as damn it. If you’re lucky there may be a set or list of terms for the concept you are interested in, but it’s a bit like writing software applications for business – no two are the same, even if they follow the same pattern. Unlike software applications though, this will change over time – assuming that people publish their ontologies, taxonomies, term sets, gazetteers and thesauri. Let’s hope they do, but get ready to pay for them as well – they’re valuable stuff.

So

  1. Design and Define your concepts
    1. Define what you are interested in
    2. Define what things represent what you are interested in
    3. Define how those things are expressed – the terms, relations, ranges and so on – you may need to build up a gazeteer or thesaurus
    4. Understand how and where those things are used – the context, frequency, position
  2. Extract the concepts and metadata
    1. Now tell the “machine” about it – in fact, teach it what you know and what you are interested in: show it by example, or create a set of rules and relations that it understands
    2. Teach it some more – the more you tell it, the more variety, examples and repetition you can throw at it, the better the quality of the results you’ll get
    3. Get your output – do you need to organise the output, do you have multiple files and locations where things are stored, do you need to feed the results from the first pass into your next one ?
    4. Fashion some RDF
    5. Create URIs for your output – perhaps the entities extracted are tagged with categories (that you provided to the trained system) or with your vocabulary, or perhaps not – but now you need to get from this output to URIs, Subjects, Properties, Objects – to match your ontology or your concept domain. Relate and collect them into “graphs” of information, into RDF.
    6. Stage them somewhere – on a filesystem, say (one file or thousands ? versions, dates ? tests and trials, final runs; spaces, capitalisation, reserved characters, encoding – it’s the web after all)
  3. Make it accessible
    1. Find a repository technology you like – if you don’t know, if it’s your first time, pick one – suck it and see. If you have RDF on disk you might be able to use that directly (though maybe more slowly than an online, optimised repository). Initialise it, get familiar with it, consider size and performance implications. Do you need backup ?
    2. Load your RDF into the repository. (Or perhaps you want to modify some existing html docs you have with the metadata you’ve extracted – RDFa probably)
    3. Check that what you’ve loaded matches what you had on disk – you need to be able to query it – how do you do that ? Is there a command-line tool – does it do SPARQL ? What about when you want to use it on the web – this is the whole point, isn’t it ? Is there a SPARQL endpoint – do you need to set up Tomcat or Jetty, say, to talk to your repository ?
  4. Link it
    1. And what about those URIs – you have URIs for your concept instances (“All Blacks”), URIs for their properties (“rdf:type”), and URIs for the Objects of those properties (“myOnt:Team”). What happens now – what do you do with them ? If they’re for the web, if they’re URIs, shouldn’t you be able to click on them ? (Now we’re talking Linked Data – see the next section.)
    2. Link your RDF with other datasets (see the next section) if you want to be found, to participate, and to add value by association, affiliation, connection – the network effect – the knowledge and the value (make some money, save some money)
  5. Build your application
    1. Now create your application around your information set. You used to have data, now you have information – your application turns that into knowledge and intelligence, and perhaps profit.

There are a few tools to help you in all this (see below) but you’ll find that they don’t do everything you need, and they won’t generate RDF for you without some help – so roll your sleeves up. Or – don’t. I decided against it, having looked at the amount of work involved in learning all about NLP and ML, at the arcane science (it’s new to me), at the amount of time needed to set up training, and at the quality of the output. I decided on the KISS principle – “Keep It Simple, Stupid” – so instead I opted to write something myself, based on grep !
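
For what it’s worth, here is a heavily simplified sketch of that grep-style approach (not the code I actually wrote): a tiny hand-made gazetteer, a simple scan over a text file, and Jena to turn each hit into statements. The rugby namespace, the term list and “corpus.txt” are all placeholders.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;
import java.util.List;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDF;
import com.hp.hpl.jena.vocabulary.RDFS;

public class GrepStyleExtractor {
    public static void main(String[] args) throws Exception {
        String rugbyNs = "http://example.org/rugby#";   // placeholder namespace

        // A tiny hand-made gazetteer: surface forms we treat as rugby teams
        List<String> teams = Arrays.asList("All Blacks", "Springboks", "Wallabies");

        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("rugby", rugbyNs);
        Resource teamClass = m.createResource(rugbyNs + "Team");

        BufferedReader in = new BufferedReader(new FileReader("corpus.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            for (String team : teams) {
                if (line.contains(team)) {
                    // Mint a URI from the term and assert type + label
                    String localName = team.toLowerCase().replace(' ', '_');
                    m.createResource(rugbyNs + localName)
                            .addProperty(RDF.type, teamClass)
                            .addProperty(RDFS.label, team);
                }
            }
        }
        in.close();

        m.write(System.out, "TURTLE");
    }
}
```

Adding a proper pattern layer, category tags and an output staging step gets you to something like steps 2.1 to 2.6 above, without any real machine learning involved.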

I still had to do steps 1-5 above, but now I had to write my own code to do the extraction and “RDFication”. It also meant I got my hands dirty and learned hard lessons by doing, rather than by reading or trusting someone else’s code that I didn’t understand. And the quality of the output, and the meaning of it, was still all in my control. It is not real Machine Learning – it’s still in the tokenisation world, I suppose – but I got what I wanted and in the process made something I can use again. It also gave me practical and valuable experience, so that I can revisit the experts’ tools with a better perspective – not so daunting, more comfortable and confident, something to compare to, patterns to witness and create, less to learn and take on, and, importantly, a much better chance of actual, deliverable success.

It was quite a decision to take – it felt dirty somehow – all that knowledge and science bound up in those tools, and it was a shame not to use it – but I wanted to learn and to fail in some ways, I didn’t want to spend weeks training a “machine”, and it seemed better to fail with something I understood (grep) than to take on a body of science that was alien. In the end I succeeded – I extracted my terms with my custom-automated-grep-based extractor, and I created RDF and loaded it into a repository. It’s not pretty, but it worked – I have gained lots of experience, and I know where to go next. I recommend it.

Finally, it’s worth noting here the value-add components

  • ontologies – domain expertise written down
  • vocabularies – these embody statements of knowledge
  • knowledge gathering – collecting a disparate set of facts, or describing and assembling a novel perspective
  • assurance, provenance, trust – certifying and guaranteeing levels of correctness and origin
  • links – connections, relationships, ranges, boundaries, domains, associations – the scaffolding of the brains !
  • the application – a means to access, present and use that knowledge to make decisions and choices

How many business opportunities are there here ?

[22] http://en.wikipedia.org/wiki/Information_extraction
[23] http://en.wikipedia.org/wiki/Information_retrieval
[24] http://www.opencalais.com/

Linked Open Data

Having googled and read the W3C docs [25-30] on Linked Open Data, it should become clear that the advantages of Linked Open Data are many:

  • Addressable Things – every Thing has an address, namespaces avoid conflicts, there are different addresses for concepts and embodiments
  • Content for purpose – if you’re a human you get HTML or text (say); if you’re a machine or program you get structured data (see the sketch after this list)
  • Public vocabulary – if there is a definition for something then you can reuse it, or perhaps even specialise it. If not, you can make one up and publish it. If you use a public definition, type or relationship, and someone else does as well, then you can join or link across your datasets
  • Open links – a Thing can point to or be pointed at from multiple places, sometimes with different intent.
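
A minimal sketch of content negotiation from the client side, in plain Java: the same URI is requested twice with different Accept headers, and a well-behaved Linked Data server returns HTML to one request and RDF to the other. The DBpedia URI is used only as a familiar example.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ContentNegotiationSketch {
    public static void main(String[] args) throws Exception {
        String uri = "http://dbpedia.org/resource/Ireland";   // a URI for a Thing, not a page
        fetch(uri, "text/html");              // what a browser would ask for
        fetch(uri, "application/rdf+xml");    // what a program would ask for
    }

    static void fetch(String uri, String acceptType) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
        conn.setRequestProperty("Accept", acceptType);
        conn.setInstanceFollowRedirects(true);   // follow the 303 redirect to the right document
        System.out.println(acceptType + " -> HTTP " + conn.getResponseCode()
                + ", Content-Type: " + conn.getContentType());

        BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        System.out.println(reader.readLine());   // just peek at the first line of the response
        reader.close();
    }
}
```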

Once these are part of your toolkit you can go on to create data-level mashups which add value to your own information. As use cases, consider how and why you might link your data to other datasets:

  • a public dataset that contains the schedules of buses in a city, and their current positions while on that schedule – the essence of this is people, place and time – volumes, locations at point in time, over periods of time, at start and end times.
    • You could create an app based around this data and linkages to Places of Interest based on location : a tourist guide based on public transport.
    • How about an “eco” application comparing the bus company statistics over time with that of cars and taxis, and cross reference with a carbon footprint map to show how Public Transport compares to Private Transport in terms of energy consumption per capita ?
    • Or add the movements of buses around the city to data that provides statistics on traffic light sequences for a journey planner ?
    • Or a mashup of route numbers, planning applications and house values ?
    • Or a mash up that correlates the sale of newspapers, sightings of wild foxes and bus routes – is there a link ??? 🙂
The point is, it might look like a list of timestamps and locations, but it’s worth many times that when added to another dataset that may be available – and this may be of interest to a very small number of people with a very specific interest, or to a very large number of people in a shared context – the environment, value for money, etc. And of course, even where the data may seem sensitive, or where making it all available in one place feels risky, it is a way of reaching out to customers, of being open and communicative – building a valuable relationship and a new level of trust, both within the organisation and without.
  • a commercial dataset, such as a parts inventory used within a large manufacturing company: the company decides to publish it using URIs and a SPARQL endpoint. It tells its internal and external suppliers that its whole catalog is available – the name, ID, description and function of each part. Now a supplier can see which parts it sells to the company, as well as the ones it doesn’t but could. Additionally, if it can also publish its own catalog and correlate it using owl:sameAs, or go further and agree to use a common set of IDs (URIs), then the supply chain between the two can be made more efficient. A “6mm steel bolt” isn’t the same as a “4mm steel rod with a 6mm bolt-on protective cap”, even though the keywords match.
And if the manufacturing company can now design new products using the shared URI scheme and reference the inventory publication and the supply chain system built around it, then it can probably be a lot more accurate about costs, delivery times and, ultimately, profit. Imagine if the URI scheme used by the manufacturer and the supplier were also an industry-wide scheme – if its competitors and other suppliers also used it ? The company hasn’t given the crown jewels away, but it has been able to leverage a common information base to save costs, improve productivity and resource efficiency, and drive profit.

So Linked Open Data has value in the public and private sector, for small, medium and large scale interests. Now you just need to build your application or service to use it. How you do that is based around the needs and wishes you have of course. The 5 stars[26] of Linked Open Data mean that you can engage at a small level and move up level by level if you need to or want to.

At the heart of things is a decision to make your information available in a digital form, so that it can be used by your employees, your customers, or the public. (You could even do this at different publishing levels, or add authorization requirements to different data sets.) So you decide what information you want to make available, and why. As already stated, perhaps this is a new means to communicate with your audience, or a new way to engage with them, or it may be a way to drive your business. You may be contributing to the public information space. Either way, if you want it to be linkable, then you need URIs and a vocabulary. So you reuse what you can from RDF/RDFS/FOAF/DC/SIOC etc, and then you see that you actually have a lot of valuable information that you want to publish but that has not been codified into an information scheme, an ontology. What do you do – make one up ! You own the information and what it means, so you can describe it in basic terms (“it’s a number”) or from your perspective (“it’s a degree of tolerance in a bolt thread”). And if you can, you should also now try to link this to existing datasets or information that’s already out there – a dbPedia or Freebase information item say, or an IETF engineering terminology (is there one ?), or a location expressed in Linked Open Data format (wgs84 [31]), or a book reference on Gutenberg [32,33], or a set of statistics from a government department [34,35], or a medical digest entry from PubMed or the like [36] – and there are many more [37]. Some of these places are virtual hubs of information – dbPedia for instance has many, many links to other datasets – and if you create a link from yours to it, then you are also adding all of dbPedia’s links to your dataset – all the URIs are addressable after all, and as they are all linked, every connection can be discovered and explored.

This part of the process will likely be the most difficult, but also the most rewarding and valuable part, because the rest of it is mostly “clerical” – once you have the information, a structured way of describing it, and addresses for it (URIs), you “just” publish it. Publishing is a spectrum of things: it might mean creating CSV files of your data that you make available; it might mean that you “simply”* embed RDFa into your next web page refresh cycle (your pages are your data, and Google is how people find them – your “API”); or it might mean that you go all the way and create a SPARQL endpoint with a content-negotiating gateway into your information base, and perhaps also a VoID [38] page that describes your dataset in a structured, open-vocabulary, curatable way [45].

* This blog doesn’t contain RDFa because it’s just too hard to do – wordpress.com doesn’t have the plugins available, and the wordpress.org plugins may be limited for what you want to do. Drupal 7 [50] does a better job, and Joomla [51] may get there in the end.
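
As a small sketch of what a VoID description might contain, built with Jena so it can be serialised however you like: the dataset URI, endpoint and example resource below are all made up, and the property names should be checked against the VoID vocabulary itself [38].

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.DCTerms;
import com.hp.hpl.jena.vocabulary.RDF;

public class VoidSketch {
    public static void main(String[] args) {
        String voidNs = "http://rdfs.org/ns/void#";
        String base = "http://example.org/mydata/";   // made-up dataset URIs

        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("void", voidNs);
        m.setNsPrefix("dcterms", DCTerms.getURI());

        Property sparqlEndpoint = m.createProperty(voidNs, "sparqlEndpoint");
        Property exampleResource = m.createProperty(voidNs, "exampleResource");
        Property vocabulary = m.createProperty(voidNs, "vocabulary");

        m.createResource(base + "dataset")
                .addProperty(RDF.type, m.createResource(voidNs + "Dataset"))
                .addProperty(DCTerms.title, "My example dataset")
                .addProperty(sparqlEndpoint, m.createResource(base + "sparql"))
                .addProperty(exampleResource, m.createResource(base + "id/thing-1"))
                .addProperty(vocabulary, m.createResource("http://xmlns.com/foaf/0.1/"));

        m.write(System.out, "TURTLE");
    }
}
```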

Service Oriented Architecture, Semantic Web Services

There are some striking similarities and echoes of Service Oriented Architecture (SOA) in Linked Open Data. It would seem to me that using Linked Open Data technologies and approaches with SOA is something that may well happen over time, organically if not formally. Some call this Semantic Web Services [49] or Linked Services [46]. It is not possible to discuss this fully here and now (it’s complex, large in scope and still emerging), but I can attempt to sketch how a LOD approach might map onto SOA, technology by technology:

  • REST vs SOAP – Some love it, some hate it. I fall into the latter camp: SOAP has so much complexity and overhead compared to REST that it’s hard to live with, even before you start coding for it. LOD has plenty too, but it starts with a more familiar base. And this kind of sums up SOA for me, and for a lot of people. But then again, because I’ve avoided it, I may be missing something. REST, in the Linked Open Data world, allows all that can be accomplished with SOAP to be done at a lower cost of entry, and with quicker results. If REST were available with SOA services then they might be more attractive, I believe.
  • VoID [38,39] vs WSDL [40] – WSDL and VoID fill the same kind of need: for SOA, WSDL describes a service; in the Linked Open Data world, VoID describes a dataset and its linkages. That dataset may be dynamically generated though (the 303 Redirect architecture means that anything can happen, including dynamic generation and content negotiation), so it is comparable to a business process behind a web-based API.
  • BPM and UDDI [41,42,43] vs CPoA [44,45], discovery, orchestration, eventing, service bus, collaboration – This is where things start to really differ. UDDI is a fairly rigid means of registering web services, while the Linked Open Data approach (eg CPoA) is more do-it-yourself and federated. Whether this latter approach is good enough for businesses that have SLA agreements with customers is yet to be seen, but there’s certainly nothing to stop orchestration, process engineering and eventing being built on top of it. Combine these basics with semantic metadata about services, datasets, registries and availability, and it’s possible to imagine a robust, commercially viable internet of services and datasets, both open and built on easily adopted standards.
  • SLAs vs identity, trust, provenance, quality, ownership, licensing, privacy, governance – The same issues that dog SOA remain in the Linked Open Data world, and arguably, because of the ease of entry and the openness, they are more of an issue there. But the Linked Open Data world is new in comparison to SOA, and can leverage learnings from it and correct problems. It’s also not a prescriptive base to start from, so there is no need for a simple public data set to implement heavy or commercially oriented APIs where they are not needed.

Whatever happens, it should be better than what was there before, IMO. SOA is too rigid and formal to be an internet-scale architecture – one in which ad-hoc, organic participation is possible, or perhaps the norm, but also one in which highly structured services and processes can be designed, implemented and grown. With Linked Open Data and some more research and work (from both academia and people like you and me, and big industry if it likes !), the ultimate goals of process automation, personalisation (or individualisation), and composition and customisation get closer and closer, and may even work where SOA seems to have stalled [46,47,48].

[25] http://www.w3.org/DesignIssues/Semantic.html
[26] http://www.w3.org/DesignIssues/LinkedData.html
[27] http://ld2sd.deri.org/lod-ng-tutorial/
[28] http://esw.w3.org/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
[29] http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
[30] http://www.w3.org/TR/cooluris/
[31] http://en.wikipedia.org/wiki/World_Geodetic_System
[32] http://www.gutenberg.org/wiki/Gutenberg:Information_About_Linking_to_our_Pages#Canonical_URLs_to_Books_and_Authors
[33] http://www.gutenberg.org/wiki/Gutenberg:Feeds
[34] https://svn.eionet.europa.eu/projects/Reportnet/wiki/SparqlEurostat
[35] http://semantic.ckan.net/sparql
[36] http://www.freebase.com/view/user/bio2rdf/public/sparql
[37] http://lists.w3.org/Archives/Public/public-sparql-dev/2010OctDec/0000.html
[38] http://semanticweb.org/wiki/VoiD
[39] http://lists.w3.org/Archives/Public/public-lod/2010Feb/0072.html
[40] http://www.w3.org/TR/wsdl
[41] http://www.w3schools.com/WSDL/wsdl_uddi.asp
[42] http://www.omii.ac.uk/docs/3.2.0/user_guide/reference_guide/uddi/what_is_uddi.htm
[43] http://www.w3.org/TR/2006/WD-ws-policy-20060731/
[44] http://webofdata.wordpress.com/2010/02/
[45] http://void.rkbexplorer.com/
[46] http://stefandietze.files.wordpress.com/2010/10/dietze-et-al-integratedapproachtosws.pdf
[47] http://rapporter.ffi.no/rapporter/2010/00015.pdf
[48] http://domino.research.ibm.com/comm/research_projects.nsf/pages/semanticwebservices.research.html
[49] http://www.serviceweb30.eu/cms/
[50] http://semantic-drupal.com/
[51] http://semanticweb.com/drupal-may-be-the-first-mainstream-semantic-web-winner_b568

Having googled and read the w3c docs [25-30] on Linked Open Data it should become clear that the advantages of Linked Open Data are many.

  • Addressable Things – every Thing has an address, namespaces avoid conflicts, there are different addresses for concepts and embodiments
  • Content for purpose – if your a human you get html or text (say), if your a machine or program you get structured data
  • Public vocabulary – if there is a definition for something then you can reuse it, or perhaps even specialise it. if not, you can make one up and publish it. if you use a public definition, type, or relationship, and some one else does as well, then you can join or link across your datasets
  • Open links – a Thing can point to or be pointed at from multiple places, sometimes with different intent.

Once these are part of your toolkit you can go on to create data level mashups which add value to your own information. As use cases, consider how and why you might link that to other datasets for

  • a public dataset that contains the schedules of buses in a city, and their current positions while on that schedule – the essence of this is people, place and time – volumes, locations at point in time, over periods of time, at start and end times.
    • You could create an app based around this data and linkages to Places of Interest based on location : a tourist guide based on public transport.
    • How about an “eco” application comparing the bus company statistics over time with that of cars and taxis, and cross reference with a carbon footprint map to show how Public Transport compares to Private Transport in terms of energy consumption per capita ?
    • Or add the movements of buses around the city to data that provides statistics on traffic light sequences for a journey planner ?
    • Or a mashup of route numbers, planning applications and house values ?
    • Or a mash up that correlates the sale of newspapers, sightings of wild foxes and bus routes – is there a link ??? 🙂
The point is, it might look like a list of timestamps and locations, but its worth multiple times that when added to another dataset that may be available – and this may be of interest to a very small number of people with a very specific interest or to a very large number of people in a shared context – the environment, value for money etc. And of course, even where the data may seem to be sensitive, or that making it all available in one place, it is a way of reaching out to customers, being open and communicative – build a valuable relationship and a new level of trust, both within in the organisation and without.
  • a commercial dataset such as a parts inventory used within a large manufacturing company : the company decides to publish it using URIs and a SPARQL endpoint. It tells its internal and external suppliers that its whole catalog is available – the name, ID, description and function of each part is available. Now a supplier can see what parts it can sell to the company, as well as the ones it doesn’t, but that it could. Additionally, if it can also publish its own catalog, and correlate with owl:sameAs or by going further and agreeing to use a common set of IDs (URIs), then the supply chain between the two can be made more efficient. A “6mm steel bolt”, isn’t the same as a “4mm steel rod with a 6mm bolt-on protective cap” (the keywords match) – but
And if the manufacturing company can now design new products using the shared URI scheme and reference the inventory publication and supply chain system built around it, then it can probably be a lot more accuate about costs, delivery times, and ultimately profit. What imagine if the URI scheme used by the manufacture and the supplier was also an industry wide scheme – if its competitors, and other suppliers also used it ? It hasn’t given the crown-jewels away, but it has been able to leverage a common information base to save costs, improve productivity, resource efficiency and drive profit.

So Linked Open Data has value in the public and private sector, for small, medium and large scale interests. Now you just need to build your application or service to use it. How you do that is based around the needs and wishes you have of course. The 5 stars[26] of Linked Open Data mean that you can engage at a small level and move up level by level if you need to or want to.

At the heart of things is a decision to make your information available in a digital form, so that it can be used by your employees, your customers, or the public. (You could even do this at different publishing levels, or add authorization requirements to different data sets). So you decide what information you want to make available, and why. As already stated, perhaps this is a new means to communicate with your audience, or a new way to engage with them, or it may be a way to drive your business. You may be contributing to the public information space. Either way, if you want it to be linkable, then you need URIs and a vocabulary. So you reuse what you can from RDF/RDFS/FOAF/DC/SIOC etc etc and then you see that you actually have a lot of information that is valuable that you want to publish but that has not been codifed into an information scheme, an ontology. What do you do – make one up ! You own the information and what it means, so you can describe it in basic terms (“its a number”) your perspective (“its a degree of tolerance in a bolt thread”). And if you can you should also now try and link this to existing datasets or information thats already outthere – a dbPedia or Freebase information item say, or an IETF engineering terminology (is there one ?), or a location expressed in Linked Open Data format (wgs84 [31]) or book reference on Gutenberg [32,33], or a set of statistics from a government department [34, 35], or a medical digest entry from Pubmed or the like [36] – and there are many more [37]. Some of these places are virtual hubs of information – dbPedia for instance has many many links to other data sets – if you create a link from yours to it, then you are also adding all dbpedia’s links to your dataset – all the URIs are addressable afterall, and as they are all linked, every connection can be discovered and explored.

This part of the process will likely be the most difficult, but most rewarding and valuable part, because the rest of it is mostly “clerical” – once you have the info, and you have a structured way of describing it, of providing addresses for it (URIs) – so now you “just” publish it. Publishing is a spectrum of things : it might mean creating CSV files of your data that you make available;  that you “simply” embed RDFa into your next web page refresh cycle (your pages are your data, and google is how people find it, your “API”); or it might mean that you go all the way and create SPARQL endpoint with a content negotiating gateway into your information base, and perhaps also a VoID [38] page that describes your dataset in a structured, open vocabulary, curatable way [45]

Service Oriented Architecture, Semantic Web Services

There are some striking similarities to, and echoes of, Service Oriented Architecture (SOA) in Linked Open Data. It seems to me that using Linked Open Data technologies and approaches with SOA is something that may well happen over time, organically if not formally. Some call this Semantic Web Services [49] or Linked Services [46]. It is not possible to discuss this fully here and now (it's complex, large in scope, and still emerging), but I can attempt to say how a LOD approach might line up against SOA.

How the pieces line up (SOA technology versus the APIs, services and data of Linked Open Data):

  • REST, SOAP: Some love SOAP, some hate it. I fall into the latter camp – SOAP has so much complexity and overhead compared to REST that it is hard to live with, even before you start coding for it. LOD has plenty of complexity too, but it starts from a more familiar base. And this kind of sums up SOA for me and for a lot of people. Then again, because I have avoided it, I may be missing something. REST in the Linked Open Data world allows all that can be accomplished with SOAP to be done at a lower cost of entry, and with quicker results. If REST were available for SOA services, I believe they might be more attractive.
  • VoID [38,39], WSDL [40]: These fill the same kind of need – in SOA, WSDL describes a service; in the Linked Open Data world, VoID describes a dataset and its linkages. That dataset may be dynamically generated, though (the 303 redirect architecture means that anything can happen, including dynamic generation and content negotiation), so it is comparable to a business process behind a web-based API. A minimal content-negotiation sketch follows this list.
  • BPM, UDDI [41,42,43], CPoA [44,45], discovery, orchestration, eventing, service bus, collaboration: This is where things start to really differ. UDDI is a fairly rigid means of registering web services, while the Linked Open Data approach (e.g. CPoA) is more do-it-yourself and federated. Whether the latter approach is good enough for businesses that have SLAs with customers remains to be seen, but there is certainly nothing to stop orchestration, process engineering and eventing being built on top of it. Combine these basics with semantic metadata about services, datasets, registries and availability, and it is possible to imagine a robust, commercially viable internet of services and datasets that is both open and built on easily adopted standards.
  • SLA, identity, trust, provenance, quality, ownership, licensing, privacy, governance: The same issues that dog SOA remain in the Linked Open Data world, and arguably, because of the ease of entry and the openness, they are more of an issue there. But the Linked Open Data world is new compared to SOA and can leverage its lessons and correct its problems. It is also not a prescriptive base to start from, so there is no need for a simple public dataset to implement heavyweight or commercially oriented APIs where they are not needed.
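
To make the VoID/WSDL point above more concrete, here is a minimal, hypothetical sketch of the 303 redirect pattern as a plain Java servlet – the paths and the Accept-header handling are deliberately simplified, and a real deployment would need proper media-type parsing and caching:

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    /**
     * Minimal sketch of 303 "cool URI" content negotiation.
     * Requests for /resource/{id} are redirected to /data/{id} (RDF)
     * or /page/{id} (HTML) depending on the Accept header.
     * Paths and matching logic are simplified and hypothetical.
     */
    public class ThingServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            String id = req.getPathInfo();             // e.g. "/bolt42"
            String accept = req.getHeader("Accept");
            boolean wantsRdf = accept != null
                    && (accept.contains("text/turtle") || accept.contains("application/rdf+xml"));

            // 303 See Other: the thing itself is not a document,
            // so redirect to a document about it.
            resp.setStatus(HttpServletResponse.SC_SEE_OTHER);
            resp.setHeader("Location", (wantsRdf ? "/data" : "/page") + id);
        }
    }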

Whatever happens, it should be better than what was there before, IMO. SOA is too rigid and formal to be an internet-scale architecture: one in which ad-hoc, organic participation is possible, or perhaps the norm, but in which highly structured services and processes can also be designed, implemented and grown. With Linked Open Data and some more research and work (from academia, from people like you and me, and from big industry if it likes!), the ultimate goals of process automation, personalisation (or individualisation), composition and customisation get closer and closer, and may even work where SOA seems to have stalled [46,47,48].

[25] http://www.w3.org/DesignIssues/Semantic.html
[26] http://www.w3.org/DesignIssues/LinkedData.html
[27] http://ld2sd.deri.org/lod-ng-tutorial/
[28] http://esw.w3.org/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
[29] http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
[30] http://www.w3.org/TR/cooluris/
[31] http://en.wikipedia.org/wiki/World_Geodetic_System
[32] http://www.gutenberg.org/wiki/Gutenberg:Information_About_Linking_to_our_Pages#Canonical_URLs_to_Books_and_Authors
[33] http://www.gutenberg.org/wiki/Gutenberg:Feeds
[34] https://svn.eionet.europa.eu/projects/Reportnet/wiki/SparqlEurostat
[35] http://semantic.ckan.net/sparql
[36] http://www.freebase.com/view/user/bio2rdf/public/sparql
[37] http://lists.w3.org/Archives/Public/public-sparql-dev/2010OctDec/0000.html
[38] http://semanticweb.org/wiki/VoiD
[39] http://lists.w3.org/Archives/Public/public-lod/2010Feb/0072.html
[40] http://www.w3.org/TR/wsdl
[41] http://www.w3schools.com/WSDL/wsdl_uddi.asp
[42] http://www.omii.ac.uk/docs/3.2.0/user_guide/reference_guide/uddi/what_is_uddi.htm
[43] http://www.w3.org/TR/2006/WD-ws-policy-20060731/
[44] http://webofdata.wordpress.com/2010/02/
[45] http://void.rkbexplorer.com/
[46] http://stefandietze.files.wordpress.com/2010/10/dietze-et-al-integratedapproachtosws.pdf
[47] http://rapporter.ffi.no/rapporter/2010/00015.pdf
[48] http://domino.research.ibm.com/comm/research_projects.nsf/pages/semanticwebservices.research.html
[49] http://www.serviceweb30.eu/cms/

Lewis Topographical Dictionary Ireland and SkyTwenty on EC2

November 17, 2010 Comments off

Both applications are now running on Amazon EC2 in a micro instance.

  • AMI 32bit Ubuntu 10.04 (ami-480df921)
  • OpenJDK6
  • Tomcat6
  • MySQL5
  • Joseki 3.4.2
  • Jena 3.6.2
  • Sesame2
  • Empire 0.7
  • DynDNS ddclient (see [1])

Don't try installing sun-java6-jdk; it won't work. You might get it installed if you run the instance as m1.small and do it as the first task on the AMI instance. That didn't suit me, as I discovered too late, and my motivation for wanting to install it turned out to be non-propagation of JAVA_OPTS, not the JDK. See the earlier post on setting up Ubuntu.

  • Lewis Topographical Dictionary of Ireland
    • JavaScript/Ajax to the SPARQL endpoint. Speedy.
    • Extraction and RDF generation from unstructured text with custom software.
    • SPARQL endpoint on Joseki, with custom content negotiation.
    • Ontology for location, roads, related locations, administrative description, natural resources, populations, peerage.
    • Ontology for peerage: nobility, gentry, commoner.
    • Find locations where peers have more than one seat (see the query sketch after this list).
    • Did one peer know another, in what locations, and with what degree of separation?
    • Linked Open Data connections to dbPedia and GeoNames (Uberblic and Sindice to come) – find people in dbPedia born in 1842 for your selected location. Mapped on Google Maps with GeoNames-sourced WGS84 lat/long.
  • SkyTwenty
    • Location-based service: a JPA-based enterprise app built on a semantic repository (Sesame native store).
    • Spring, with Spring Security ACLs and OpenID authorisation.
    • Location and profile tagging with UMBEL Subject Concepts.
    • FOAF- and SIOC-based ontology.
    • Semantic query console: “find locations tagged like this”, “find locations posted by people like me”.
    • Scheduled queries, with customisable actions on success or failure.
    • Location sharing and messaging with ACL groups: identity hidden, and location and date-time cloaked to medium accuracy.
    • Commercial apps possible: identity hidden, and location and date-time cloaked to low accuracy.
    • Data mining across all data for aggregate queries: very low accuracy, no app/group/person identifiable.
    • To come:
      • OpenAuth for application federation
      • Split/dual JPA: to an RDBMS for typical app behaviour, to the semantic repo for the query console
      • API documentation
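
As an illustration of the kind of query behind the “more than one seat” item above, a sketch might look like this (Java with Jena assumed; the endpoint URL and the lewis: Peer/hasSeat terms are hypothetical stand-ins for the real ontology, and SPARQL 1.1 aggregates are assumed to be available on the endpoint):

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;

    /**
     * Minimal sketch: ask a remote SPARQL endpoint for peers with more than
     * one seat. The endpoint URL and the lewis: vocabulary are hypothetical.
     */
    public class PeersWithManySeats {
        public static void main(String[] args) {
            String endpoint = "http://example.org/lewis/sparql";   // hypothetical Joseki endpoint
            String query =
                "PREFIX lewis: <http://example.org/lewis/ns#>\n" +
                "SELECT ?peer (COUNT(?seat) AS ?seats)\n" +
                "WHERE { ?peer a lewis:Peer ; lewis:hasSeat ?seat }\n" +
                "GROUP BY ?peer\n" +
                "HAVING (COUNT(?seat) > 1)";

            try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.nextSolution();
                    System.out.println(row.getResource("peer") + " has "
                            + row.getLiteral("seats") + " seats");
                }
            }
        }
    }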

A report on how these were developed and the things learned is planned, warts and all.

[1] http://blog.codesta.com/codesta_weblog/2008/02/amazon-ec2—wh.html – not everything there needs to be done, but you'll get the idea. Install ddclient and follow the instructions.

Semantic Progress

October 6, 2009 Comments off

Switching to Jena; some refactoring required, but it should be more robust (RDF-generation-wise) than what I had. SVN and Weave required, and to follow. I have decided to hold off on the NEE and ML side for now, and instead get linked triples into the data and then publish it to the LOD somewhere…

I will have to consider getting thesauri and ontologies for architectural, historical and ecclesiastical information, but it seems this is going to be a project in itself.

Categories: Work