….needs more administration documentation I think, or maybe an idiot's guide for the likes of me. I installed the Desktop version VirtualBox image and it all went fairly smoothly. After setting up doc shares to an archive of about 800k docs on a NAS, things started indexing. Cool! Facets! Keywords! Metadata! But all was not right. It was slowish, but hey, it's a VM and my host is not super-top-of-the-range (a Haswell Pentium G3258), so I made sure it was running with 2 CPUs, 4GB RAM and about 40GB of disk to play with at the start. Monitoring it is easy with the search GUI, or via the XML REST response at http://localhost:8983/solr/admin/cores?action=status. But things seem to halt at times, or the CPU spikes and not much appears to be happening. Where do you find any info about what OSDS is doing right now?
So, the usual places: web server, syslog, etc. The only trouble is I can't get a desktop terminal to run in the VM; it seems to start and then nothing. So Ctrl-F2 into a console. What user id? Turns out it's “user”. What's the password? Turns out it's “live”. I found the log4j.properties for Solr in /var/solr and adjusted it to INFO level with console and file output, restarted Solr and… no more info. Messages and syslog need root access, sudo of course, but you have to add user “user” to sudoers. What's the root password then? I found it somewhere in the documentation but now (ironically) I can't re-find it there. So if you find it, let me know, and when you do you can update the sudoers to include user “user”. It turns out the other place to look for clues is the /tmp dir: it contains the Tesseract OCR and Tika tmp copies of things, so you can monitor the number of files in there and see progress.
But I still can't find out what exactly is going on right now (or maybe this is all there is), and importantly I can't really guess when the detection, extraction, OCR and indexing will finish. I have a file count from my archive and can see the number of docs currently indexed, but that doesn't help much in terms of timing. Tesseract seems pretty sensitive, and some quick blog and forum searching seems to confirm that. Still, despite this and the occasional crash or VM abort (with no real way to understand why, except removing the most recently active folder share from the VM and wading through /var/log; making only 1 CPU available to the VM seems to reduce the crash frequency, it turns out), it's still going to be better than Recoll I think, which won't have facets or the possibilities of RDF enrichment, vocabularies etc. I'd also like to try out ElasticSearch with it, soon.
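Since I have both the archive file count and the current indexed count, a back-of-the-envelope ETA from two samples of the doc count is at least possible. A minimal sketch (my own, not part of OSDS; it assumes a constant indexing rate, which OCR-heavy stretches will certainly break):

```python
def eta_days(docs_then, docs_now, hours_between, total_docs):
    """Days remaining, assuming the rate between the two samples holds."""
    rate = (docs_now - docs_then) / float(hours_between)  # docs per hour
    if rate <= 0:
        return None  # stalled (or going backwards): no estimate
    return (total_docs - docs_now) / rate / 24.0

# e.g. 16k docs indexed over the last 24 hours, 800k docs in total:
remaining = eta_days(100000, 116000, 24, 800000)
```

Sample the numDocs value a day apart (from the cores?action=status response) and feed the two counts in; treat the answer as a very rough lower bound.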
- zenity at 50%? Kill the pid. It's just a GUI notification that something's running, and not really needed.
- Nautilus seems to misbehave, and if you leave it open on /tmp it also seems to take 50% CPU. Kill it.
- Give the VM plenty of RAM. I seem to have come across some SMP bug in Debian on my system, so I've tuned the VM down to 1 CPU, which seems to help.
- Important dirs on your travels…
- /var/lib/opensemanticsearch (*.py files for UI)
- /usr/share/solr-php-ui/templates/view.index.topbar.php (more UI files – eg header)
- /usr/share/solr-php-ui/config/config.facets.php (add facets to this list, even though it says not to because the UI will overwrite it; it doesn't though, so they appear in the UI)
- ./opensemanticsearch/enhancer-rdf (map facets to PURLs)
- tika on localhost:9998
- default logging for Tika and Tesseract appears to be System.out
- sudo apt-get update !
Note: editing and uploading facets via text file at http://localhost/search-apps/admin/thesaurus/facet/ attempts to overwrite config.facets.php but fails!
'Facet' object has no attribute 'title'
but the facet is created in Django and appears under “ontologies”, although without any named entities. Some debugging and rooting around shows that the Python code in /var/lib/opensemanticsearch/ontologies.views is looking for facet.title from the form, when it is in fact called facet.label. Changing line 287 in this file to
""".format( facet.facet.encode('utf-8'), facet.label.encode('utf-8')
means that you can now upload text files with concepts for facets. These then show up under the facet name in the right-hand column, but don't show up as “concepts” that you can alias, for instance.
Collapsing facet menus
If the list of concepts or even the list of facets gets long, then putting them in an accordion might be a good idea. I used Daniel Stocks' jQuery plugin (https://github.com/danielstocks/jQuery-Collapse). Download the plugin, then include it in the PHP UI with a script include.
Then change line 229 (in function print_fact) to:
<div id="<?= $facet_field ?>" class="facet" data-collapse="accordion">
A quick script to show the number of processed docs on the cmdline:
FILE=/tmp/numdocs.log
echo "Outputting to $FILE"
wget -o /tmp/status_msg.log -O $FILE "http://localhost:8983/solr/admin/cores?action=status"
grep --color numDocs $FILE
rm $FILE
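If you want the numbers themselves rather than the coloured grep line, the same check can be done in a few lines of Python. A sketch of mine (same status URL as above; the regex parsing of the XML response is my own assumption about its shape):

```python
import re
from urllib.request import urlopen  # urllib2 on the VM's Python 2

STATUS_URL = "http://localhost:8983/solr/admin/cores?action=status"

def num_docs(xml_text):
    """Pull every <int name="numDocs">N</int> value out of the status XML."""
    return [int(n) for n in re.findall(r'<int name="numDocs">(\d+)</int>', xml_text)]

if __name__ == "__main__":
    print(num_docs(urlopen(STATUS_URL).read().decode("utf-8")))
```

Handy for feeding two samples into a rate calculation, rather than eyeballing the grep output.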
Any more tips ?
- Update (May 4 2016): about half way through the volume of 800k docs now, after 25 days of processing. Still crashing out, but not so often now it seems. About 20GB of disk used in the VM now.
- Update (June 1 2016): finished, but only after I disabled PDF OCR at about 700k docs. I'll have to come back to this.
- Update (June 7 2016): I've been trying to get EXIF data into Solr from all the jpegs that I have, but without much success until now. After head scratching and debugging and trying to work it out, I have had to:
- provide a tika config file :
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Most things can use the default -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Don't use DefaultParser for these mimetypes, alternate config below -->
      <mime-exclude>image/jpeg</mime-exclude>
    </parser>
    <!-- JPEG needs special handling - try+combine everything -->
    <parser class="org.apache.tika.parser.jpeg.JpegParser">
      <mime>image/jpeg</mime>
    </parser>
  </parsers>
</properties>
- update/fix the /etc/init.d/tika script's start/respawn cmd to correctly use that file (and reboot the VM, as an init restart doesn't seem to work and systemctl daemon-reload doesn't either, or maybe it's just my dud config):
daemon --respawn --user=tika --name=tika --verbose -o /tmp/tika.log -O /tmp/tika.err -- java -jar /usr/share/java/tika-server.jar -c /home/user/osds-config.xml
- try and work out whether the /usr/lib/python2.7/etl/enhance_extract_text_tika_server.py script was working or not: lots of extra print statements and verbose = True. The long and the short of it is that it is working, but the extracted metadata fields defined in the script don't include much in the way of EXIF fields, and even if they did, we'd also have to update /var/solr/data/core1/conf/schema.xml to include them as fields. That's the next job…
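For reference, a schema.xml field entry would look something like the fragment below. This is just an illustration: the field name exif_Model is my guess, and the real names depend on what Tika emits and how the etl script maps them.

```
<!-- hypothetical example entry for /var/solr/data/core1/conf/schema.xml -->
<field name="exif_Model" type="string" indexed="true" stored="true"/>
```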
- A handy cmdline test of the tika-server is to post a jpeg to it using curl. If your init script isn't working you won't get much back; likewise if the file you think you are posting doesn't actually exist. And if you are getting a 415 Unsupported Media Type back in the verbose curl response, it probably means your Tika config file is screwed, like mine was, but I kept ignoring that. Fool! I went back to unit level and defined a single test dir in /etc/opensemanticsearch/filemonitoring/files and put one test jpeg in there. Using the curl cmd you can then test that the tika-server is working (you'll get back a JSON blob with EXIF fields), and then using 'touch' and /usr/bin/opensemanticsearch-index-dir you can test the pipeline in full.
curl -vX POST -H "Accept: application/json" -F "file=@test.jpg" http://localhost:9998/rmeta/form -H "Content-type: multipart/form-data"
- (Update Sept 2016): a new version of OSDS is available that seems to work better out of the box. Interface changes, Django defaults to English, and adding named entities and facets doesn't barf.
- https://www.kernel.org/doc/Documentation/sysrq.txt (although this doesn't seem to be possible during the crash, as the system is completely unresponsive)
Google do semantic recipe search apparently [1-3], and a university in New York does a Semantic Sommelier. Interesting, yes, but I want more info!
What ontology and what vocabulary is being used: TAP, PIPS? The w3.org food vocabulary, hRecipe, or other microformats? How about UMBEL or YAGO? Any chance of a link explaining the sommelier app? I'd love to know more about it and see it, rather than just some PR. Do they link to other datasets, so that if I pick a wine or a recipe I can find things that go with the flavours and aromas, see photos, maybe learn the history, culture, location and science/tech of the recipe? Hell, commercialisation here I come: perhaps I want to know what stores in my area have the ingredients, or stock the wine, with other related produce and offers? And if a celeb chef happens to endorse it, then maybe I'll go and buy their set of cookware for Christmas. (Or I'll go/link to Amazon and get it cheaper, if they use the same vocabulary…)
If I search for sausages and you have a recipe for “Bangers and Mash”, do I get a recipe snippet or result fraction? If I search for mash and you have a blog post about creating a web page from lots of parts of other pages, does it show up? And if I search for a classic French recipe, will I miss pages that describe the same thing in English, German or Japanese?
This semantic web thing needs to get out there, so you can taste it.
I am doing some work on a Top Secret Project to demonstrate on the SkyTwenty platform the use of email data (in place of location data).
I am making use of Aperture to crawl an IMAP store, then allow sharing of contact and message information, so that queries can be run to discover:
- who-knows-who in what domain
- how many degrees of separation there are between contacts
- do selected contacts have any connection
- how “well” they know each other, and so on.
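The “degrees” and “any connection” queries above boil down to shortest paths over a who-knows-who graph. A toy sketch of the idea (the contact names and edges here are invented; the Aperture crawl would supply the real graph):

```python
from collections import deque

def degrees(graph, a, b):
    """Hops on the shortest path from contact a to contact b, or None if unconnected."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# invented example graph: adjacency lists of who has mailed whom
contacts = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["bob"]}
```

In practice this would be a SPARQL property path over the Nepomuk message data rather than an in-memory BFS, but the shape of the question is the same.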
Aperture makes use of the Nepomuk  message and desktop ontologies, and they’re fairly extensive, so a graphic helps to understand some of the ontological relationships.
The brilliant Protege4 ontology design tool has plugins for GraphViz and OntoGraf that produce some fairly neat images to visualise ontologies, so here they are. I'd like a way to include object and data properties (by annotation perhaps; I'll try later), but for now I have compiled a table of the class properties from a crawl and a SPARQL query I ran against the repository I loaded the data into.
Note that OntoGraf needs the Sun JDK to work, so on Ubuntu, which has OpenJDK by default, you need to install it and agree to the license terms, then make sure that Protege is using the Sun Java at /usr/lib/jvm/java-6-sun-18.104.22.168 (or whatever version).
These tables are incomplete, and represent the classes and properties from the crawl of my nearly empty inbox. The full set of classes and properties for the Nepomuk ontologies are available on another page on this blog.