Archive

Posts Tagged ‘rdf’

Open Semantic Desktop Search – good but….

April 22, 2016 4 comments

….needs more administration documentation I think, or maybe an idiot’s guide for the likes of me. I installed the Desktop version VirtualBox image and it all went fairly smoothly. After setting up doc shares to an archive of about 800k docs on a NAS, things started indexing. Cool ! Facets ! Keywords ! Metadata ! But all was not right. It was slowish, but hey, it’s a VM and my host is not super-top-of-the-range (a Haswell Pentium G3258), so I made sure the VM was running with 2 CPUs and had 4GB RAM and about 40GB disk to play with at the start. Monitoring it is easy with the search GUI or using the XML REST response at http://localhost:8983/solr/admin/cores?action=status. But things seem to halt at times, or the CPU spikes and not much appears to be happening. Where do you find any info about what OSDS is doing right now ?

So, the usual places: web server, syslog, etc. Only trouble is I can’t get a desktop terminal to run in the VM. It seems to start, then nothing. So ctrl-f2 into a console. What user id ? Turns out it’s “user”. What’s the password ? Turns out it’s “live”. I found the log4j.properties for Solr in /var/solr and adjusted it to INFO level with console and file output, restarted Solr and… no more info. Messages and syslog need root access (sudo, of course), but first you have to add user “user” to sudoers. What’s the root password then ? I found it somewhere in the documentation but now (ironic) I can’t re-find it there. So if you find it let me know, and when you do you can update the sudoers to include “user”. Turns out the other place to look for clues is the /tmp dir. It contains the tesseract OCR and tika tmp copies of things, so you can monitor the number of files in there and see progress.

But I still can’t find out what exactly is going on right now (or maybe this is all there is) and, importantly, I cannot really guess when the detection, extraction, OCR and indexing will finish. I have a file count from my archive and can see the number of docs indexed so far, but that doesn’t give me much help in terms of timing. Tesseract seems pretty sensitive, and some quick blog and forum searching seems to confirm that. Still, despite this and the occasional crash or VM abort (with no real way to understand why except removing the most recently active folder share from the VM and wading thru /var/log; making only 1 CPU available to the VM seems to reduce the crash frequency, it turns out), it’s still going to be better than Recoll I think, which won’t have facets or the possibilities of RDF enrichment, vocabularies etc. I’d also like to try out ElasticSearch with it, soon.
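A rough way to turn the archive file count and the indexed-doc count into a finish date is to take two readings of numDocs some hours apart and extrapolate linearly. The sketch below does only that. The function name and the sample numbers are made up for illustration, and the linear assumption is a weak one, since OCR-heavy folders will index far more slowly than plain text.

```python
from datetime import datetime, timedelta

def estimate_finish(total_docs, first_sample, second_sample):
    """Extrapolate a finish time from two (timestamp, numDocs) readings.

    Returns None when no progress was made between the two readings.
    """
    (t0, n0), (t1, n1) = first_sample, second_sample
    elapsed = (t1 - t0).total_seconds()
    indexed = n1 - n0
    if elapsed <= 0 or indexed <= 0:
        return None  # stalled or crashed: nothing to extrapolate from
    rate = indexed / elapsed              # docs per second
    remaining = max(total_docs - n1, 0)   # docs still to index
    return t1 + timedelta(seconds=remaining / rate)

# Hypothetical readings taken 24 hours apart
start = datetime(2016, 4, 22, 9, 0)
later = start + timedelta(hours=24)
eta = estimate_finish(800_000, (start, 10_000), (later, 26_000))
print("estimated finish:", eta)
```

Re-running it every day or so with fresh readings gives a feel for how much the rate drifts.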

So –

  • zenity at 50% ? Kill the pid. It’s just a GUI notification that something’s running, and not really needed
  • Nautilus seems to misbehave, and if you leave it open on /tmp it also seems to take 50% CPU. Kill it
  • Give the VM plenty of RAM. I seem to have come across some SMP bug in Debian on my system so I’ve tuned the VM down to 1 CPU, which seems to help
  • Important dirs on your travels…
    • /opt/solr,
    • /var/solr/logs
    • /var/log/messages
    • /tmp
    • ~
    • /var/opensemanticdesktopsearch
    • /var/lib/opensemanticsearch (*.py files for UI)
    • /usr/share/solr-php-ui/templates/view.index.topbar.php (more UI files, eg header)
    • /usr/share/python-django-common/django/
    • /etc/defaults/solr.sh
    • /var/solr/data/core1/conf
    • /usr/share/solr-php-ui/config.php
    • /usr/share/solr-php-ui/config/config.facets.php (add your facet to this list, even tho it says not to because the UI will overwrite it. It doesn’t tho, so the facets appear in the UI)
    • ./opensemanticsearch/enhancer-rdf (map facets to purls)
    • ./opensemanticsearch-django-webapps/apache.conf
  • tika on localhost:9998
  • default logging for tika and tesseract appears to be system.out
  • sudo apt-get update !

Adding facets

Note that editing and uploading facets via a text file at http://localhost/search-apps/admin/thesaurus/facet/ attempts to overwrite config.facets.php but fails with

'Facet' object has no attribute 'title'

The facet is still created in Django and appears under “ontologies”, but without any named entities. Some debugging and rooting around shows that the Python (Django) code in /var/lib/opensemanticsearch/ontologies.views is looking for facet.title from the form, when it is in fact called facet.label. Changing line 287 in this file to

""".format(     facet.facet.encode('utf-8'),    facet.label.encode('utf-8')

means that you can now upload text files with concepts for facets. These then show up under the facet name in the right hand column, but don’t show up as “concepts” that you can alias, for instance.
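The error and the fix can be reproduced in isolation. In this minimal sketch the Facet class is a hypothetical stand-in for the Django model (it has facet and label attributes and no title), and the output template is invented for illustration, so it is not the real line 287.

```python
class Facet:
    """Hypothetical stand-in for the Django Facet model: no `title` attribute."""

    def __init__(self, facet, label):
        self.facet = facet    # field name used for the facet
        self.label = label    # human readable name shown in the UI

def describe(facet):
    # Using facet.label works; facet.title would raise
    # AttributeError: 'Facet' object has no attribute 'title'
    return "{}: {}".format(facet.facet, facet.label)

example = Facet("organisation_ss", "Organisation")
print(describe(example))

try:
    example.title
except AttributeError as err:
    print("as in the upload error:", err)
```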

Collapsing facet menus

If the list of concepts or even the list of facets gets long, then putting them in an accordion might be a good idea. I used Daniel Stocks’ jQuery-Collapse plugin (https://github.com/danielstocks/jQuery-Collapse). Download it, then include the plugin, eg in

/usr/share/solr-php-ui/templates/view.index.php:

add the following script include

<script src="js/jQuery-Collapse-master/src/jquery.collapse.js"></script>

Then change line 229 (in function print_fact) in

/usr/share/solr-php-ui/index.php

to

<div id="<?= $facet_field ?>" class="facet" data-collapse="accordion">


Counting docs

A quick script to show the number of processed docs on the cmdline

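#!/bin/sh
# Fetch the Solr core status and show the numDocs lines
FILE=/tmp/numdocs.log
echo "Outputting to $FILE"
# quote the URL so the shell does not treat ? as a glob character
wget -o /tmp/status_msg.log -O "$FILE" "http://localhost:8983/solr/admin/cores?action=status"
grep --color numDocs "$FILE"
rm "$FILE"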
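If you want the count as a number rather than a grep hit, the same status response can be parsed. This is only a sketch: it assumes the usual Solr XML layout (a "status" list holding one list per core, each with an "index" list containing numDocs), and the SAMPLE response is cut down by hand rather than captured from OSDS.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

STATUS_URL = "http://localhost:8983/solr/admin/cores?action=status"

def num_docs_per_core(status_xml):
    """Return {core name: numDocs} from a core status XML response."""
    root = ET.fromstring(status_xml)
    counts = {}
    for core in root.findall("./lst[@name='status']/lst"):
        node = core.find("./lst[@name='index']/*[@name='numDocs']")
        if node is not None:
            counts[core.get("name")] = int(node.text)
    return counts

def fetch_counts(url=STATUS_URL):
    """Query a running Solr (not called here, it needs the VM to be up)."""
    with urlopen(url, timeout=10) as resp:
        return num_docs_per_core(resp.read())

# Hand-trimmed sample of the assumed response layout
SAMPLE = """<response>
  <lst name="status">
    <lst name="core1">
      <str name="name">core1</str>
      <lst name="index">
        <int name="numDocs">1234</int>
      </lst>
    </lst>
  </lst>
</response>"""

print(num_docs_per_core(SAMPLE))
```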

Any more tips ?


 

  • Update (May 4 2016) – about half way thru volume of 800k docs now after 25 days processing. Still crashing out but not so often now it seems. About 20gb of disk used in the VM now.
  • Update (June 1 2016) – finished, but only after I disabled pdf ocr at about 700k – have to come back to this
  • Update (June 7 2016) – I’ve been trying to get exif data into Solr from all the jpegs that I have, but without much success until now. After head scratching and debugging and trying to work it out I have had to
    • provide a tika config file :
      <?xml version="1.0" encoding="UTF-8"?>
      <properties>
       <parsers>
         <!-- Most things can use the default -->
         <parser class="org.apache.tika.parser.DefaultParser">
           <!-- Don't use DefaultParser for these mimetypes, alternate config below -->
           <mime-exclude>image/jpeg</mime-exclude>
         </parser>
      
         <!-- JPEG needs special handling - try+combine everything -->
         <parser class="org.apache.tika.parser.jpeg.JpegParser" >
            <mime>image/jpeg</mime>
         </parser>
       </parsers>
      </properties>
    • update/fix the /etc/init.d/tika script start/respawn cmd to correctly use that file (and reboot the VM, as init restart doesn’t seem to work and systemctl daemon-restart doesn’t either, or maybe it’s just my dud config) :
      daemon --respawn --user=tika --name=tika --verbose -o /tmp/tika.log -O /tmp/tika.err -- \
      java -jar /usr/share/java/tika-server.jar -c /home/user/osds-config.xml
    • try and work out if the /usr/lib/python2.7/etl/enhance_extract_text_tika_server.py script was working or not, with lots of extra print statements and verbose = True. The long and the short of it is that it is working, but the extracted metadata fields defined in the script don’t include much in the way of exif fields, and even if they did we’d also have to update the /var/solr/data/core1/conf/schema.xml to include them as fields. That’s the next job…
    • A handy cmdline test of the tika-server is to post a jpeg to it using curl. If your init script isn’t working you won’t get much back, likewise if the file you think you are posting doesn’t actually exist. If you are getting a 415 unsupported media type back in the verbose curl response, it probably means your tika config file is screwed, like mine was, but I kept ignoring that. Fool ! I went back to unit level and defined a single test dir in /etc/opensemanticsearch/filemonitoring/files and put one test jpeg in there. Using the curl cmd you can then test that the tika-server is working (you’ll get back a json blob with exif fields), and then using ‘touch’ and /usr/bin/opensemanticsearch-index-dir you can test the pipeline in full.
      curl -vX POST -H "Accept: application/json" -F file=@exif.jpg http://localhost:9998/rmeta/form -H "Content-type: multipart/form-data"
  • (Update Sept 2016) – new version of OSDS available that seems to work better out of the box. Interface changes, django defaults to english, adding named/entities and facets doesn’t barf.
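For the tika-server check in the June 7 update above, it can help to pull just the exif-looking fields out of the json blob that the rmeta endpoint returns. The sketch below assumes the response is a JSON array with one metadata object per document, which is how rmeta responds. The field names in SAMPLE are illustrative only, since the real keys depend on the Tika version and the image.

```python
import json

def exif_fields(rmeta_json):
    """Collect metadata entries whose key mentions exif (case-insensitive)."""
    found = {}
    for record in json.loads(rmeta_json):   # one object per (embedded) document
        for key, value in record.items():
            if "exif" in key.lower():
                found[key] = value
    return found

# Illustrative response, not captured from a real tika-server
SAMPLE = json.dumps([{
    "Content-Type": "image/jpeg",
    "exif:DateTimeOriginal": "2016-06-07T10:00:00",
    "Exif IFD0:Make": "ExampleCam",
}])

for key, value in sorted(exif_fields(SAMPLE).items()):
    print(key, "=", value)
```

Saving the curl output to a file and feeding it to exif_fields gives a quick list of candidate fields for the schema.xml.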

 

http://www.opensemanticsearch.org/doc/tutorial

http://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/index.html

https://www.elastic.co/products/elasticsearch

https://www.kernel.org/doc/Documentation/sysrq.txt (although this doesnt seem to be possible during the crash as the system is completely unresponsive)

https://help.ubuntu.com/community/DebuggingSystemCrash


CAP and RDF storage at web scale

November 9, 2012 Comments off

Requirements:

  • Store RDF : upload/insert at runtime. (Not a URI for each triple; don’t want a storm of network requests or a lazy load convoy)
  • Possible inference, tho not top priority
  • Scale : multiple sync’d datacentre availability and durability, geo-regional partitioning
  • Interface
    • sparql
    • java (did someone say Prolog ?)
    • programmable standards (JDBC, JPA, in lieu of a JGraphDbConnectivity (“JGBC”) standard)
  • Triple level security/ACL
  • Transaction support
  • Sparql 1.1
  • FOSS
  • Non hadoop : don’t want batch capability or stop/start reconfiguration/scale, want dynamic load and query.

Any suggestions ? Can’t have everything, that would be too much to ask in the 21st century, so I was thinking Mongo, Riak, Redis or Cassandra to get the availability and quick start setup, but I suspect performance may be an issue from various things I’ve read, or that there are multiple translation steps into json/what-not, or an effectively proprietary API (I don’t want to code to one, and then find out it won’t do the job and have to rip lots out). On the other hand, I’ll probably have to take what I can get (and will be grateful), and code/engineer around misgivings as best I can. Hopefully, with a shallow RDF graph I can get away with it. Start small, agile, prove it does work (or does not), re-evaluate, progress and change with an eye to the future.
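To get a feel for what laying a shallow RDF graph over a key-value store involves, here is a minimal in-memory sketch. Plain Python dicts stand in for the store (Redis, Riak or whatever), one index per triple position is kept, and a pattern match intersects the indexes for the bound terms. The class and the example URIs are made up for illustration. Nothing here covers inference, ACLs, transactions or replication.

```python
from collections import defaultdict

class TripleStore:
    """Toy triple store: one set-valued index per position (s, p, o)."""

    def __init__(self):
        self.triples = set()
        self.index = {pos: defaultdict(set) for pos in "spo"}

    def add(self, s, p, o):
        triple = (s, p, o)
        self.triples.add(triple)
        for pos, term in zip("spo", triple):
            self.index[pos][term].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        candidates = self.triples
        for pos, term in zip("spo", (s, p, o)):
            if term is not None:
                candidates = candidates & self.index[pos].get(term, set())
        return sorted(candidates)

store = TripleStore()
store.add("ex:post1", "dc:creator", "ex:me")
store.add("ex:post1", "dc:title", "CAP and RDF storage")
store.add("ex:post2", "dc:creator", "ex:me")

print(store.match(p="dc:creator", o="ex:me"))
```

Each add writes to four structures, which is the kind of translation and write amplification overhead worried about above.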

Categories: cloud, linked, technology

MonetDB and OpenJena

April 6, 2012 1 comment

MonetDB has been updated recently with a Dec 2011-SP2 release. Having previously tried to integrate it with OpenJena and failed because of the use of multiple inner joins, I was happy to find that the update fixed those problems and allows all the integration/unit-tests to pass.

This means of course that I’m now going to have to create a patch to Jena (see Jira issue[1]), and when that’s done, you can follow the instructions below to test it out, literally by running the unit tests. I have been using Ubuntu 11.10 amd64 for this so the notes below reflect this:

1) Download latest MonetDB and JDBC driver

2) Install as per instructions (default username:monetdb with password:monetdb)

3) In your home dir create a my-farm directory

4) Create an "env.sh" file to house your local settings for PATH etc

export JAVA_HOME=/usr/lib/jvm/java-6-sun
# point this to wherever you have SDB installed
# (JENA_HOME is assumed to be set in your environment already)
export SDBROOT=${JENA_HOME}/SDB
export PATH=$SDBROOT/bin:$PATH
# point this to wherever you have downloaded the MonetDB JDBC driver
export SDB_JDBC=~/Downloads/monetdb/jdbcclient.jar

5) Create a "monet_h.ttl" assembly file to define a layout2/hash repository

@prefix sdb:     <http://jena.hpl.hp.com/2007/sdb#> .
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .

# MonetDB

<#store> rdf:type sdb:Store ;
sdb:layout     "layout2/hash" ;
sdb:connection <#conn> ;
.

<#conn> rdf:type sdb:SDBConnection ;
sdb:sdbType       "MonetDB" ;    # Needed for JDBC URL
sdb:sdbHost       "localhost" ;
sdb:sdbName       "TEST2H" ;
sdb:driver        "nl.cwi.monetdb.jdbc.MonetDriver" ;
sdb:sdbUser        "monetdb" ;
sdb:sdbPassword        "monetdb" ;
sdb:jdbcURL    "jdbc:monetdb://localhost:50000/TEST2H";
.

6) create a script – "make_db.sh"– to drop,create and initialise the repo – this needs to be used each time you run the sdbtest suite. It will make use of the env.sh and the monet_h.ttl

cd $JENA_HOME
monetdb stop TEST2H
monetdb destroy TEST2H
monetdb create TEST2H
monetdb release TEST2H
. ./env.sh
bin/sdbconfig --sdb monet_h.ttl --create

7) Run the make_db.sh script

8) Check things went ok with

i) mclient -u monetdb -d TEST2H

ii) \D

You should see a dump of the schema. There should be among other things a prefixes table.

9) Now for the unit tests :

Create a monetdb-hash.ttl file that Jena can use to connect with

@prefix sdb:     <http://jena.hpl.hp.com/2007/sdb#> .
@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .

[] rdf:type sdb:Store ;
sdb:layout     "layout2" ;
sdb:connection _:c ;
.

_:c rdf:type sdb:SDBConnection ;
sdb:sdbType       "MonetDB" ;    # Needed for JDBC URL
sdb:sdbHost       "localhost" ;
sdb:sdbName       "TEST2H" ;
sdb:driver        "nl.cwi.monetdb.jdbc.MonetDriver" ;
sdb:sdbUser        "monetdb" ;
sdb:sdbPassword        "monetdb" ;
sdb:jdbcURL    "jdbc:monetdb://localhost:50000/TEST2H?debug=true&logfile=monet.debug.log";
.

10) If in Eclipse, with the SDB source, create a run configuration for sdbtest.

#Main class : sdb.sdbtest
#Arguments: --sdb monetdb-hash.ttl ./testing/manifest-sdb.ttl

11) Run the test suite – all tests should pass.

12) Next : Load some RDF and test performance !……
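The JDBC URL in the step 9 assembler file carries driver options as query parameters. If you end up scripting several test databases, a small helper like the one below keeps the URLs consistent. The helper name is made up; the debug and logfile options are simply the ones used in step 9.

```python
from urllib.parse import urlencode

def monetdb_jdbc_url(database, host="localhost", port=50000, **options):
    """Build a MonetDB JDBC URL, appending any driver options as a query string."""
    url = "jdbc:monetdb://{}:{}/{}".format(host, port, database)
    if options:
        url += "?" + urlencode(options)
    return url

print(monetdb_jdbc_url("TEST2H"))
print(monetdb_jdbc_url("TEST2H", debug="true", logfile="monet.debug.log"))
```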

[1] https://issues.apache.org/jira/browse/JENA-134

Column stores, Hadoop, Semantic web

November 3, 2011 Comments off

Been trying to do some work on Jena, to get some column store support in there. This is all predicated on having a JDBC driver to talk to the column store. Some have, some don’t, but the ones that do seem to have minimal JDBC implementations. Either temp table support isn’t there, or things like batch support are lacking. Still, pursuing this, because the normalized schema (a simple star-ish schema) used by Jena (and Sesame iirc) seems to marry well with some of the optimisation claims the column stores make (retrieval, compressed storage, materialized views). For near read-only semantic knowledge bases, this might give a significant performance boost over row based RDBMS as semantic backends. And hadoop might come in useful here too at load stage, if RDF needs to be ETLd to some kind of loadable format, or to materialize sparql query results to column store accessible external storage. Might being the operative word I think, but there are interesting possibilities.
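To make the compressed storage claim a little more concrete, here is a toy sketch of why a normalized triple table suits column-wise storage. The triples are already reduced to integer ids (as a node table would do), the rows are sorted by predicate, and the predicate column is run-length encoded. The ids are invented and this is not how any particular column store lays out its data.

```python
from itertools import groupby

def to_columns(triples):
    """Sort (s, p, o) id rows by predicate and split them into three columns."""
    ordered = sorted(triples, key=lambda t: (t[1], t[0], t[2]))
    subjects = [t[0] for t in ordered]
    predicates = [t[1] for t in ordered]
    objects = [t[2] for t in ordered]
    return subjects, predicates, objects

def run_length(column):
    """Run-length encode a column as (value, repeat count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]

# Invented node ids: predicate 10 is used three times, predicate 11 once
triples = [(1, 10, 100), (1, 11, 200), (2, 10, 101), (3, 10, 102)]
subjects, predicates, objects = to_columns(triples)
print(run_length(predicates))
```

Real RDF data tends to have few distinct predicates relative to triples, so the sorted predicate column collapses to a handful of runs.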

Categories: technology

DERI (LATC) launch schema.rdfs.org

June 18, 2011 Comments off

Some of the DERI people (and others) involved in LATC have launched schema.rdfs.org to counter the lack of rdfs in schema.org, the Microsoft/Google/Yahoo attempt to kickstart some RDFa publishing so their search engines can try and improve result relevancy. Some of the items in schema.org are quite simple, but that’s probably a good thing : a large term set or number of properties is going to look daunting to anyone interested or someone starting out for the first time, and indeed this is the reason cited that it is not RDF (it’s microdata). And while I agree with Michael Bergman that it is more than likely another step towards structured/linked/common/open data, adopters urgently need a combination of

  1. Tools (or better still no tools, just an unobtrusive natural way to author microdata or rdfa) and
  2. a Reason to do it – payback
  3. Support in search UIs to specify vocabulary items

I’d like a wordpress plugin for instance, but then I’d need to host an instance myself or find a hoster that allows plugins, because wordpress.com doesn’t allow it. I’d also like to think that if I placed some RDFa in my blog it would get a higher ranking in search results (it should), but this blog is pretty specialised anyway and it’s not commercially oriented, so I’m happy enough with keyword based results.

So, I’m not going to be doing it too soon, and that’s the problem really. Or is it ? This post isn’t data really, but it does have links and it does talk about concepts, people, technology problems. If I could mark them up with tags and attributes that define what I am talking about, then I could tell those search engines and crawlers what I am talking about rather than hoping they can work it out from the title, the links I have chosen, the feedback comments and so on. Then people looking for these particular topics could find or stumble upon this post more easily. So, while there is some data here, arguably I don’t see it that way, and even if I think there might be a good Reason to do it, it’s too hard without the Tools.

So I wonder finally, if I was to mark up one of the people mentioned in this post with name, address, affiliation, organisation and so on, would the search engine UIs allow me to use this vocabulary directly ? Say I want to find articles about DERI : would the search drop down prompt me with itemprop="EducationalOrganization", so that I’d then only get results that have been marked up with this microdata property and not things that are about the Deri vineyard in Wales, punto deri, courtney deri and so on ?

Sindice kinda does this, couldn’t the Goog do it too ??? Or indicate which results are microdata’d, or allow a keyword predicate (like site: say), or allow the results to be filtered (like Search Tools in the left column). The point for me is that schema.org is only half or less than half the story. The search engines need to Support the initiative by making it available at query time, and to allow their results to be manipulated in terms of microdata/rdfs too. Then I might be more tempted to mark up my posts in microdata, rdfs, microformat or whatever, and I might create some extensions to the schemas and contribute a bit more, and my post might get more traffic in the long tail, that traffic would be more valuable, my ad revenue might go up (if I had ads for myself !), and the ECB might drop their interest rates. Well, maybe not, but they’re not listening to anything else, perhaps some structured data might persuade them. It is the future after all.
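The kind of filtering I am asking for is not hard to sketch. The snippet below pulls the declared microdata types out of each page and keeps only the pages marked up with the wanted type. It keys on the itemtype attribute, which is where schema.org microdata declares a type, and the two pages and their URLs are invented for illustration.

```python
from html.parser import HTMLParser

class ItemTypeCollector(HTMLParser):
    """Collect the itemtype URLs declared on itemscope elements."""

    def __init__(self):
        super().__init__()
        self.item_types = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and attrs.get("itemtype"):
            self.item_types.update(attrs["itemtype"].split())

def item_types(html):
    collector = ItemTypeCollector()
    collector.feed(html)
    return collector.item_types

def filter_by_type(pages, wanted):
    """Keep the URLs of pages that declare the wanted microdata type."""
    return [url for url, html in pages.items() if wanted in item_types(html)]

# Invented search results: only the first is marked up
pages = {
    "http://example.org/deri-institute": '<div itemscope itemtype="http://schema.org/EducationalOrganization"><span itemprop="name">DERI</span></div>',
    "http://example.org/deri-vineyard": "<p>Deri vineyard, Wales</p>",
}
print(filter_by_type(pages, "http://schema.org/EducationalOrganization"))
```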

Aperture Nepomuk queries

February 22, 2011 1 comment

Having crawled an Imap store (ie google mail), I now need to query the results to see what’s what, who’s who, and how they are connected, if at all.

These are the namespace prefixes used in the queries

Prefix URI
nie http://www.semanticdesktop.org/ontologies/2007/01/19/nie#
nco http://www.semanticdesktop.org/ontologies/2007/03/22/nco#
nfo http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#
nmo http://www.semanticdesktop.org/ontologies/2007/03/22/nmo#
sesame http://www.openrdf.org/schema/sesame#
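The queries in this post leave the PREFIX declarations out to save space. A small helper, sketched here with a made-up function name, can prepend the prefixes from the table above before a query is sent to the store.

```python
PREFIXES = {
    "nie": "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#",
    "nco": "http://www.semanticdesktop.org/ontologies/2007/03/22/nco#",
    "nfo": "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#",
    "nmo": "http://www.semanticdesktop.org/ontologies/2007/03/22/nmo#",
    "sesame": "http://www.openrdf.org/schema/sesame#",
}

def with_prefixes(query, prefixes=PREFIXES):
    """Prepend one SPARQL PREFIX line per namespace to the query text."""
    header = "\n".join(
        "PREFIX {}: <{}>".format(name, uri) for name, uri in prefixes.items()
    )
    return header + "\n" + query

print(with_prefixes("select * { ?s a nmo:Email } limit 5"))
```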

And these are the queries. Note that each message is in its own graph, and references the folder in which it rests, eg <imap://youraddress@imap.yourprovider.com/INBOX;TYPE=LIST>. This in turn nie:isPartOf another folder, which isn’t nie:isPartOf a parent folder.

An Imap store has a username and password etc, but doesn’t have an associated email address. A folder may contain messages to the owner with an email address the server accepts, but may also contain messages to other addresses if the cc list contains the owner address.

Each entry gives Folder, Relationship : Purpose, followed by the Query where there is one.

inbox, direct : basic find list of emails, with sender email address

select distinct ?subject ?from ?address {
  ?s nmo:from ?o .
  ?o nco:fullname ?from .
  ?o nco:hasEmailAddress ?e .
  ?e nco:emailAddress ?address .
  ?s nmo:messageSubject ?subject .
  ?s a nmo:Email
}

Note : with a Jena TDB dataset, use

select distinct ?subject ?from ?address {
  graph ?g {
    ?s nmo:from ?o .
    ?o nco:fullname ?from .
    ?o nco:hasEmailAddress ?e .
    ?e nco:emailAddress ?address .
    ?s nmo:messageSubject ?subject .
    ?s a nmo:Email
  }
}

inbox, direct : find emails, distinguish replies (and what was replied to), and CC addresses

select distinct ?s ?subject ?r ?to ?refid ?from ?address {
  ?s nmo:from ?o .
  ?s nmo:messageId ?sid .
  ?o nco:fullname ?from .
  ?o nco:hasEmailAddress ?e .
  ?e nco:emailAddress ?address .
  ?s nmo:messageSubject ?subject .
  ?s a nmo:Email
  optional {
    ?s nmo:inReplyTo ?r .
    ?r nmo:messageId ?mid .
  }
  optional {
    ?s nmo:to ?toid .
    ?toid nco:fullname ?to .
  }
  optional {
    ?s nmo:cc ?ccid .
    optional {
      ?ccid nco:fullname ?ccto .
    }
  }
  optional {
    ?s nmo:references ?refid .
  }
}
order by ?subject

Note : nco:fullname is optional as you may not know the email addressee’s name
Note : As with the basic query above, when using a Jena Dataset you need a graph selector in the where clause, eg

select * { graph ?g {?s ?p ?o}}

inbox, direct : most messages direct to you

select (count(?from) as ?count) ?from ?address {
  graph ?g {
    ?s nmo:from ?o .
    ?o nco:fullname ?from .
    ?o nco:hasEmailAddress ?e .
    ?e nco:emailAddress ?address .
    ?s a nmo:Email
  }
}
group by ?from ?address
order by desc(?count)

The remaining entries have no query yet :

  • inbox, direct : most messages CC to you. Not so easy : where you are a CC recipient, it’s not possible to match on the to: field, or with any metadata on the imap server.
  • inbox, direct : fastest replies
  • inbox, direct : most replies
  • inbox : contacts and counts by mail domain
  • inbox, indirect : messages to others on CC list (may not be known to you, but sender knows)
  • outbox, direct : recipients (to, cc, bcc)
  • outbox, direct : replies
  • outbox, direct : most replied to
  • outbox, direct : most sent to
  • outbox, direct : fastest replied to (by message, by recipient)
  • outbox, direct : fastest sent to (by message, by recipient)

Things get more interesting when more than one mailbox is available for analysis… but I’m going to need Sesame3 or revert to Jena because Sesame2 doesn’t do aggregate functions like count. 2 steps forward, 1 step back. So, Jena support in Aperture is minimal and old. It cannot make use of graphs, TDB or SDB (but the libraries are up to date). It also doesn’t support Datasets or Named Graphs in Jena. So, I add ModelSet (the RDF2Go adapter type needed), Dataset and Named Graph support, in TDB to begin with. This involves updating the Aperture Jena adapter. There doesn’t seem to be any activity on the Aperture mailing list tho, as I get zero response to a question about updating the Jena support. Is Aperture another nice-but-dead Semantic Web technology ?
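Until the store can aggregate, one workaround is to run the plain (non-aggregate) query and do the counting on the client. A minimal sketch follows, assuming the result rows have already been turned into dicts keyed by variable name. The rows and addresses are invented.

```python
from collections import Counter

def count_senders(rows):
    """Group result rows by (from, address) and return them most frequent first."""
    counts = Counter((row["from"], row["address"]) for row in rows)
    return counts.most_common()

# Invented bindings, as a select ?from ?address query might return them
rows = [
    {"from": "Alice", "address": "alice@example.org"},
    {"from": "Bob", "address": "bob@example.org"},
    {"from": "Alice", "address": "alice@example.org"},
]

for (sender, address), count in count_senders(rows):
    print(count, sender, address)
```

It means pulling every row over the wire, so it only makes sense for a mailbox-sized dataset.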

Linked Data, OData, GData, DataRSS comparison matrix

February 17, 2011 4 comments

(Update : Its a year since my next-in-line brother died. This post involved a conversation I had with him. I’ve finally, after 5 years, updated it to include an .ODT version of the table below. He’ll probably kick my ass the next time I see him….)

I’m new to OData, having just talked with one of my brothers about it. He’s using it in a large company, but I’m not sure if its an internal tool or for customers. However, having forgotten or never investigated it before because of the lack of Microsoft fanfare, I was struggling to see what the difference between it and Linked Data with RDF is. Much googling and reading [47,48,50] left me with lots of questions, points of view, some pros and cons, and discovery about GData and DataRSS. (I’ve really been sipping the W3C Linked Data Kool Aid too long 🙂 ).

So, I want to create a matrix of criteria that anyone can quickly look at and get salient information about them. (Perhaps I could publish that matrix as linked data sometime…). You’ll understand by now that I haven’t used OData, so I’m going on what I read until I install SharePoint somewhere, or whatever else it takes to get a producer running to play with. And with that, I’ve also made the mental jump to Drupal publishing RDFa. Can you embed OData in a web page ? Would you want to if you had a CMS where the data was also content ?

I hope to end up with information about LoD, OData, GData, DataRSS (and fix this table’s formatting). Help ! RDF_OData_GData_DataRSS

Criteria RDF http://www.w3.org/…/LinkingOpenData OData http://www.odata.org/ GData

http://code.google.com/intl/en/apis/gdata/

DataRSS
Logical Model Graph/EAV.
Technology grounding (esp OWL ) in Description Logic.[12, 13]. “Open
World Assumption” [27]
Graph/EAV. AtomPub
and EDM grounding in entity relationship modelling [11]. “Closed World
Assumption”[28] view (?) but with “OpenTypes” and “Dynamic Properties”
[29]
Unclear/Mixed – whatever google logical Model is behind services, but transcoded and exposed as AtomPub/JSON. Data relations and graphs not controllable by API – eg cannot define a link between data elements that doesnt already exist. GData is primarily a client API.
Physical
model
Not mandated, but probably
backed by a triple store and serialised over Http to RDF/XML, Json,
TTL, N3 or other format. RDBMS backing or proxying possible.
not mandated, but probably
backed by existing RDBMS persistence [4 – “Abstract Data Model”], or
more precisely a non-triple store. (I have no evidence to
support this, but the gist of docs and examples suggests it as a
typical use case) and serialised over Http with Atom/JSON
according to Entity Data Model (EDM)[6] and  Conceptual Schema
Definition Language (CSDL)[11]
Google applications and services publishing data in AtomPub/JSON format, with Google Data Namespace[58] elements.
Intent Data syndication
and web level linking : “The goal of
the W3C SWEO Linking Open Data community project is to extend the Web
with a data commons by publishing various open data sets as RDF on the
Web and by setting RDF links between data items from different data
sources”
Data publishing
and
syndication : “There is a vast
amount of data available today and data is now
being collected and stored at a rate never seen before. Much, if
not most, of this data however is locked into specific applications
or formats and difficult to access or to integrate into new
uses”
Google cloud data publishing [55] : “The Google Data Protocol provides a secure means for external developers to write new applications that let end users access and update the data stored by many Google products. External developers can use the Google Data Protocol directly, or they can use any of the supported programming languages provided by the client libraries.”
Protocol,
operations
http, content
negotiation, RDF, REST-GET. Sparql 1.1 for update
http, content
negotiation, AtomPub/JSON, REST-GET/PUT/POST/DELETE [9]
http,REST (PUT/POST?GET/PATCH/DELETE)[56]
Openness/Extensibility Any and all,
create your own ontology/namespace/URIs with RDFS/OWL/SKOS/…, large
opensource tooling & community, multiple serialisation RDF/XML,
JSON, N3, TTL,…
Any and all (with
a “legacy” Microsoft base), while reuse Microsoft classes and types,
namespaces (EDM)[6] with Atom/JSON serialisation. Large microsoft
tooling and integration with others following.[7,8]
Google applications and services only.
URI minting,
dereferencing
Create your own
URIs and namespaces following guidelines (“slash vs hash”) [15,16]
Subject, predicate and object URIs must be dereferencible, content
negotiation expected. Separation of concept URI and location URI
central.
Unclear whether
concept URI and Location URI are distinguished in specification –
values can certainly be Location URIs, and IDs can be URIs, but
attribute properties aren’t dereferencible to Location URIs.Well specified URI conventions [21]
Atom namespace.  <link rel=”self” …/> denotes URI of item. ETags also used for versioned updates.  Google Data namespace for content “Kinds”.[59], no dereferencing.
Linking,
matching, equivalence
External entities
can inherently be directly linked by reference, and equivalence is
possible with owl:sameAs, owl:seeAlso (and other equivalence assertions)
Navigation
properties link entity elements within a single OData materialisation –
external linkage not possible. Dereferencable attribute properties not
possible but proposed[10].
URIS Not dereferencable, linkage outside of google not possible.
Data Model :
Classes,Types,
Relationships/Ontology
RDF-S, OWL to
create ontology model of data, concepts and relations. Import and
extend external ontologies. Terminology (T-Box) and Asserted ( A-Box,
vocabulary) both possible with OWL presentation.
EDM defines
creation of entities, types, sets, associations and navigation
properties for data and relations. Primitive types a la XSD types[35].
Seems more akin to capabilities of schema definition than ontology
modelling with OWL. Unclear at this stage whether new entities can be
created, and then reused or existing ones imported and extended.
Reasoning Inferrence or reasoning out of
DL terminology and assertion separation possible. May be handled at
repository level (eg Sesame) or at query time (eg Jena)
service may be able to infer
from derived typing[41]
Namespace
handling,
vocabularies
Declare namespaces
as required when importing public or “well known”
ontologies/vocabularies, creating SPARQL queries, short hand URIs,
create new as required for your own custom classes, instances.
namespaces
supported in EDM but unclear if possible to create and use namespace,
or if it can be backed with a custome class/property definition
(ontology). $metadata seems to separate logically and physically type
and service metadata from instance data – ie oData doesn’t “eat its own
dog food”.
AtomPub and Google Data namespace only.
Content negotiation Client and server
negotiate content to best determination.[17,18]
Client specifies
or server fails, or default to Atom representation.[19]. Only XML
serialisation for service metadata.[40]. New mime-types introduced.
Use alt query param (accept-header not used)[57]
Query capability Dereferencibility
central principle to linked data, whether in document, local endpoint
or federated. SPARQL [14] query language allows suitably equipped
endpoints to service structured query requests and return serialised
RDF, json, csv, html, …
Proposed
dereferencible URIs with special $metadata path element allow type
metadata to be retrieved [10]. Running a structured query against an
OData service with something like SPARQL isn’t possible.
Query by author,category,fields.
Interoperability,
discovery
Derefencable URIs,
well-known/common/upperlevel ontologies/vocabularies.VoID [22,52] can be used to provide extensive metadata for a linked
data
endpoint.
Service documents
[32] describe types, datasets. Programmatic Mapping to/from RDF
possible.
middleware,
conversion
RDF-XML, ttl, n3
well known formats. See also content negotiation.
AtomPub, JSON
outputs. OpenVirtuoso mapping, custom code.
Security, privacy, provenance. No additional
specifications above that supplied in web/http architecture. CORS
becoming popular as access filter method for cross-site syndication
capability at client level. Server side access control. Standards for
Provenance and privacy planned and under development[24]. W3C XG
provenance group[25]
No additional
specifications above that mandated in http/atom/json.[23, 31] CORS use
possible for cross site syndication. Dallas/Azure Datamarket for
“trusted commercial and premium public domain data”.[26]
Http wire protocols, but in addition authentication (OpenID) and authorization are required(OAuth). “ClientLogin” and AuthSub are deprecated. [60]. No provenance handling.
Ownership,
license, sponsorship, governance
W3C Supported
community project [1], after a proposal by TBL[2]. Built up
Architecture of World Wide Web [3]
Microsoft owned
and
sponsored under “Open Specification Promise”, [4] but brought to W3C
incubator [5]
Documentation,
support, community
W3C docs, community wikis, forums, blogs, developer groups and libraries [38].
OData.org site
with developer docs, and links to articles and videos, mailing list, MSDN
[39].
Tooling,
producing, consuming
Many and varied, open [1,36,37] et al.
Producers [7], consumers [8], “datamarket” [26], PowerPivot for Excel [49].
Other
SPARQL update v1.1 [42], Semantic
Web Services [43-46,51-54]
Batch request [20], protocol
versioning [33], Service Operations [30]
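
To make the query capability row above a bit more concrete, here is a minimal sketch of how a request to each kind of service might be put together. The endpoint URLs, the entity set name and the function names are all made up for illustration. Only the `query` parameter of the SPARQL protocol, the OData `$metadata` path element with its `$filter` and `$format` options, and the GData `alt`, `author`, `category` and `fields` parameters come from the respective specifications. Nothing is sent over the network; the code only builds URLs and headers.

```python
from urllib.parse import urlencode

# Hypothetical endpoints, used only for illustration.
SPARQL_ENDPOINT = "http://example.org/sparql"
ODATA_SERVICE = "http://example.org/odata.svc"
GDATA_FEED = "http://example.org/feeds/posts"


def sparql_request(query, accept="application/sparql-results+json"):
    """Return (url, headers) for a SPARQL protocol GET request.

    The structured query travels in the `query` parameter and the result
    format is chosen by content negotiation through the Accept header.
    """
    url = SPARQL_ENDPOINT + "?" + urlencode({"query": query})
    return url, {"Accept": accept}


def odata_metadata_url():
    """Type metadata is retrieved from the special $metadata path element."""
    return ODATA_SERVICE + "/$metadata"


def odata_query_url(entity_set, filter_expr, fmt="json"):
    """OData queries are URI conventions ($filter, $format), not a query language."""
    # safe="$" keeps the leading dollar sign of the option names readable.
    params = urlencode({"$filter": filter_expr, "$format": fmt}, safe="$")
    return f"{ODATA_SERVICE}/{entity_set}?{params}"


def gdata_query_url(author=None, category=None, fields=None, alt="json"):
    """GData filters by author, category and fields.

    The output format is picked with the `alt` query parameter rather than
    an Accept header.
    """
    params = {"alt": alt}
    for name, value in (("author", author), ("category", category), ("fields", fields)):
        if value is not None:
            params[name] = value
    return GDATA_FEED + "?" + urlencode(params)


if __name__ == "__main__":
    print(sparql_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1"))
    print(odata_metadata_url())
    print(odata_query_url("Products", "Price gt 10"))
    print(gdata_query_url(author="alice", category="rdf"))
```

The asymmetry is the point: the SPARQL request carries an arbitrary structured query and negotiates its format in the headers, while the OData and GData requests can only combine the fixed options their URI conventions define.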

[1] http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

[2] http://www.w3.org/DesignIssues/LinkedData.html

[3] http://www.w3.org/TR/webarch

[4] http://www.microsoft.com/interop/osp/default.mspx

[5] http://www.w3.org/QA/2010/03/microsoft_bring_odata_to_a_w3c.html

[6] http://www.odata.org/developers/protocols/overview#EntityDataModel

[7] http://www.odata.org/producers

[8] http://www.odata.org/consumers

[9] http://www.odata.org/developers/protocols/operations

[10] http://www.odata.org/blog/2010/4/22/queryable-odata-metadata

[11] http://www.odata.org/media/16348/%5Bmc-csdl%5D.pdf

[12] http://www.w3.org/TR/2009/REC-owl2-direct-semantics-20091027/

[13] http://en.wikipedia.org/wiki/Description_logic

[14] http://www.w3.org/TR/rdf-sparql-query/

[15] http://www.w3.org/TR/cooluris/

[16] http://www.w3.org/wiki/DereferenceURI

[17] http://www.w3.org/TR/webarch/#def-coneg

[18] http://www.w3.org/TR/cooluris/#implementation

[19] http://www.odata.org/developers/protocols/operations#RepresentationFormatsAndContentTypeNegotiation

[20] http://www.odata.org/developers/protocols/batch

[21] http://www.odata.org/developers/protocols/uri-conventions

[22] http://code.google.com/p/void-impl/

[23] http://www.odata.org/developers/protocols/overview#SecurityConsiderations

[24] http://lod2.eu/Welcome.html

[25] http://www.w3.org/2005/Incubator/prov/wiki/Relevant_Technologies

[26] https://datamarket.azure.com/

[27] http://en.wikipedia.org/wiki/Open_world_assumption

[28] http://en.wikipedia.org/wiki/Closed_world_assumption

[29] http://www.odata.org/media/16343/%5Bmc-edmx%5D.pdf

[30] http://www.odata.org/developers/protocols/operations#InvokingServiceOperations

[31] http://blogs.msdn.com/astoriateam/archive/2010/05/10/odata-and-authentication-part-1.aspx

[32] http://www.odata.org/developers/protocols/overview#ServiceMetadataDocument

[33] http://www.odata.org/developers/protocols/overview#ProtocolVersioning

[34] http://www.odata.org/developers/protocols/overview#AbstractTypeSystem

[35] http://www.w3.org/TR/xmlschema-2

[36] http://ckan.net/

[37] http://www.w3.org/wiki/SemanticWebTools#head-805c63479c854babe4657d5184de605910f6d3e2

[38] http://www.w3.org/2001/sw/

[39] http://www.odata.org/developers/articles

[40] http://www.odata.org/developers/protocols/operations#Retrievingthemetadatadocument

[41] http://www.odata.org/blog/2010/8/6/enhancing-odata-support-for-querying-derived-types---revisited

[42] http://www.w3.org/TR/2009/WD-sparql11-update-20091022/

[43] http://www.swsi.org/

[44] http://www.w3.org/Submission/OWL-S/

[45] http://www.serviceweb30.eu/cms/

[46] http://www.w3.org/Submission/WSDL-S/

[47] http://webofdata.wordpress.com/2010/04/14/oh-it-is-data-on-the-web/

[48] http://blog.jonudell.net/2010/01/29/odata-for-collaborative-sense-making/

[49] http://www.powerpivot.com/

[50] http://sqlblog.com/blogs/jamie_thomson/archive/2010/02/03/microsoft-odata-and-rdf.aspx

[51] http://www.wsmo.org/

[52] http://void.rkbexplorer.com/

[53] http://www.alphaworks.ibm.com/tech/wssem

[54] http://rapporter.ffi.no/rapporter/2010/00015.pdf

[55] http://code.google.com/intl/en/apis/gdata/docs/directory.html

[56] http://code.google.com/intl/en/apis/gdata/docs/2.0/basics.html

[57] http://code.google.com/intl/en/apis/gdata/docs/2.0/reference.html#QueryRequests

[58] http://schemas.google.com/g/2005

[59] http://code.google.com/intl/en/apis/gdata/docs/2.0/elements.html

[60] http://code.google.com/intl/en/apis/gdata/docs/auth/overview.html

Categories: linked