Open Semantic Desktop Search – good but….
….needs more administration documentation I think, or maybe an idiot’s guide for the likes of me. I installed the Desktop version VirtualBox image and it all went fairly smoothly. After setting up doc shares to an archive of about 800k docs on a NAS, things started indexing. Cool ! Facets ! Keywords ! Metadata ! But all was not right – it was slowish – but hey, it’s a VM and my host is not super-top-of-the-range (a Haswell Pentium G3258), so I made sure it was running with 2 CPUs and had 4 GB RAM and about 40 GB disk to play with at the start. Monitoring it is easy with the search GUI or using the XML REST response at http://localhost:8983/solr/admin/cores?action=status. But things seem to halt at times, or the CPU spikes and not much appears to be happening – where do you find any info about what OSDS is doing right now ?
So – the usual places – web server, syslog, etc. Only trouble is I can’t get a desktop terminal to run in the VM – it seems to start, then nothing. So ctrl-f2 into a console. What user id ? Turns out it’s “user”. What’s the password ? Turns out it’s “live”. I found the log4j.properties for Solr in /var/solr and adjusted it to INFO level with console and file output, restarted Solr and… no more info. Messages and syslog need root access – sudo of course – but you have to add user “user” to sudoers first. What’s the root password then ? I found it somewhere in the documentation but now (ironically) I can’t re-find it there. So if you find it let me know – and when you do, you can update the sudoers to include user “user”. Turns out the other place to look for clues is the /tmp dir – it contains the Tesseract OCR and Tika temp copies of things, so you can monitor the number of files in there and see progress.
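Watching the file count in /tmp is easy to script. A minimal sketch, assuming the Tesseract and Tika temp copies really do land under /tmp on your install (the function name is my own, not part of OSDS):

```python
import os


def count_files(directory):
    """Count regular files under `directory`, including subdirectories."""
    total = 0
    # os.walk silently skips directories it is not allowed to read
    for _root, _dirs, files in os.walk(directory):
        total += len(files)
    return total


if __name__ == "__main__":
    # /tmp is where the temp copies showed up on my VM; adjust if yours differ
    print("files currently in /tmp:", count_files("/tmp"))
```

Run it every few minutes (or under `watch`) and a changing number at least tells you something is still moving.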
But I still can’t find out what exactly is going on right now (or maybe this is all there is) and, importantly, I cannot really guess when the detection, extraction, OCR and indexing will finish. I have a file count from my archive and can see the number of docs indexed so far, but that doesn’t give me much help in terms of timing. Tesseract seems pretty sensitive, and some quick blog and forum searching seems to confirm that. Still – despite this and the occasional crash or VM abort (with no real way to understand why except removing the most recently active folder share from the VM and wading thru /var/log – making only 1 CPU available to the VM seems to reduce the crash frequency, it turns out) – it’s still going to be better than Recoll I think, which won’t have facets or the possibilities of RDF enrichment, vocabularies etc. I’d also like to try out ElasticSearch with it – soon.
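For what it’s worth, the only estimate I can think of is a crude straight-line one from the doc counts. A rough sketch, assuming the indexing rate stays constant, which it certainly won’t when OCR-heavy folders come along (the function name is made up for illustration):

```python
def estimate_remaining_days(total_docs, docs_done, days_elapsed):
    """Linear extrapolation: assume the average rate so far continues."""
    if docs_done <= 0 or days_elapsed <= 0:
        raise ValueError("need some completed work to extrapolate from")
    rate_per_day = docs_done / days_elapsed
    remaining = max(total_docs - docs_done, 0)
    return remaining / rate_per_day


# Numbers from the May update further down: roughly half of 800k docs after 25 days
print(estimate_remaining_days(800_000, 400_000, 25))  # 25.0
```

Treat the result as an order-of-magnitude guess, nothing more.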
- zenity at 50% CPU ? – kill the pid – it’s just a GUI notification that something’s running, and not really needed
- Nautilus seems to misbehave, and if you leave it open on /tmp it also seems to take 50% CPU – kill it
- Give the VM plenty of RAM. I seem to have come across some SMP bug in Debian on my system so I’ve tuned the VM down to 1 CPU, which seems to help
- Important dirs on your travels…
- /var/lib/opensemanticsearch (*.py files for UI)
- /usr/share/solr-php-ui/templates/view.index.topbar.php (more UI files – eg header)
- /usr/share/solr-php-ui/config/config.facets.php (add facets to this list – even tho it says not to because the UI will overwrite it : it doesn’t tho – so they appear in the UI)
- ./opensemanticsearch/enhancer-rdf (map facets to purls)
- tika on localhost:9998
- default logging for tika and tesseract appears to be system.out
- sudo apt-get update !
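Spotting CPU hogs like zenity and Nautilus from the console can be scripted too. A minimal sketch, assuming a Linux `ps` that accepts `-eo pid,pcpu,comm`; the function names and the 40% threshold are my own choices. Note that `ps` reports CPU averaged over the life of the process, so the figures can differ from what `top` shows at any instant.

```python
import subprocess


def busy_processes(ps_output, threshold=40.0):
    """Return (pid, cpu_percent, command) for rows at or above `threshold`.

    Expects the text produced by `ps -eo pid,pcpu,comm`.
    """
    busy = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, cpu, command = parts
        try:
            cpu_value = float(cpu)
        except ValueError:
            continue  # unparseable row, ignore it
        if cpu_value >= threshold:
            busy.append((int(pid), cpu_value, command))
    return busy


def current_busy_processes(threshold=40.0):
    """Run ps and filter its output; needs procps-style ps on the PATH."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return busy_processes(result.stdout, threshold)
```

Calling `current_busy_processes()` in the VM gives you the pids to consider killing; I’d still check each one by hand before doing so.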
Note: editing and uploading facets via text file at http://localhost/search-apps/admin/thesaurus/facet/ attempts to overwrite config.facets.php but fails with
'Facet' object has no attribute 'title'
The facet is still created in Django and appears under “ontologies”, but without any named entities. Some debugging and rooting around shows that the Python code in /var/lib/opensemanticsearch/ontologies.views is looking for facet.title from the form, when it is in fact called facet.label. Changing line 287 in this file to
""".format( facet.facet.encode('utf-8'), facet.label.encode('utf-8')
means that you can now upload text files with concepts for facets. These then show up under the facet name in the right hand column, but don’t show up as “concepts” that you can alias, for instance.
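To make the title/label mix-up concrete, here is a toy reproduction. The `Facet` class below is a hypothetical stand-in with just the two attributes mentioned above, not the real Django model, and the output format is invented for illustration (the real template in ontologies.views is not reproduced here). It is written for Python 3, where strings are already unicode, so the `.encode('utf-8')` calls from the Python 2.7 original are left out.

```python
class Facet:
    """Hypothetical stand-in for the Django model: only the fields used here."""

    def __init__(self, facet, label):
        self.facet = facet  # the Solr field name
        self.label = label  # the human-readable name; there is no `title`


def describe(facet):
    # Reading facet.title here instead would raise AttributeError,
    # because the attribute is called `label`.
    return "{} => {}".format(facet.facet, facet.label)


example = Facet("location_ss", "Location")  # made-up field name and label
print(describe(example))
print(hasattr(example, "title"))
```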
Collapsing facet menus
If the list of concepts or even the list of facets gets long then putting them in an accordion might be a good idea. I used Daniel Stocks’ jQuery plugin (https://github.com/danielstocks/jQuery-Collapse). Download it, then include the plugin, eg in
add the following script include
Then change line 229 (in function print_fact) to
<div id="<?= $facet_field ?>" class="facet" data-collapse="accordion">
A quick script to show the number of processed docs on the cmdline
FILE=/tmp/numdocs.log
echo "Outputting to $FILE"
wget -o /tmp/status_msg.log -O "$FILE" "http://localhost:8983/solr/admin/cores?action=status"
grep --color numDocs "$FILE"
rm "$FILE"
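The same check can be done in Python if you’d rather have the number itself than a grep’d line of XML. A sketch, assuming the status response has the usual Solr layout (a `status` list containing one entry per core, each with an `index` list holding `numDocs`); the function names are mine:

```python
import urllib.request
import xml.etree.ElementTree as ET

STATUS_URL = "http://localhost:8983/solr/admin/cores?action=status"


def num_docs_per_core(status_xml):
    """Map core name -> numDocs from a cores?action=status XML response."""
    root = ET.fromstring(status_xml)
    counts = {}
    for core in root.findall("./lst[@name='status']/lst"):
        # match on the name attribute only, whatever the element type is
        node = core.find("./lst[@name='index']/*[@name='numDocs']")
        if node is not None and node.text:
            counts[core.get("name")] = int(node.text)
    return counts


def fetch_num_docs(url=STATUS_URL):
    """Fetch the status page from a running Solr and parse it."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return num_docs_per_core(response.read())
```

`fetch_num_docs()` needs Solr to be up on port 8983; log its result with a timestamp now and then and you have the raw data for a rate.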
Any more tips ?
- Update (May 4 2016) – about halfway thru the volume of 800k docs now after 25 days processing. Still crashing out but not so often now it seems. About 20 GB of disk used in the VM now.
- Update (June 1 2016) – finished, but only after I disabled pdf ocr at about 700k – have to come back to this
- Update (June 7 2016) – I’ve been trying to get exif data into Solr from all the jpegs that I have, but without much success until now. After head scratching and debugging and trying to work it out I have had to
- provide a tika config file :
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Most things can use the default -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Don't use DefaultParser for these mimetypes, alternate config below -->
      <mime-exclude>image/jpeg</mime-exclude>
    </parser>
    <!-- JPEG needs special handling - try+combine everything -->
    <parser class="org.apache.tika.parser.jpeg.JpegParser">
      <mime>image/jpeg</mime>
    </parser>
  </parsers>
</properties>
- update/fix the /etc/init.d/tika script start/respawn cmd to correctly use that file (and reboot the VM, as init restart doesn’t seem to work and systemctl daemon-restart doesn’t either – or maybe it’s just my dud config) :
daemon --respawn --user=tika --name=tika --verbose -o /tmp/tika.log -O /tmp/tika.err -- java -jar /usr/share/java/tika-server.jar -c /home/user/osds-config.xml
- try and work out if the /usr/lib/python2.7/etl/enhance_extract_text_tika_server.py script was working or not – lots of extra print statements and verbose = True. The long and the short of it is that it is working, but the extracted metadata fields defined in the script don’t include much in the way of exif fields, and even if they did we’d also have to update /var/solr/data/core1/conf/schema.xml to include them as fields. That’s the next job…
- A handy cmdline test of the tika-server is to post a jpeg to it using curl. If your init script isn’t working you won’t get much back, likewise if the file you think you are posting doesn’t actually exist; and if you are getting a 415 Unsupported Media Type back in the verbose curl response, it probably means your tika config file is screwed, like mine was, but I kept ignoring that – fool ! I went back to unit level and defined a single test dir in /etc/opensemanticsearch/filemonitoring/files and put one test jpeg in there. Using the curl cmd you can then test that the tika-server is working (you’ll get back a json blob with exif fields), and then using ‘touch’ and /usr/bin/opensemanticsearch-index-dir you can test the pipeline in full.
curl -vX POST -H "Accept: application/json" -F upload=@/path/to/test.jpg http://localhost:9998/rmeta/form -H "Content-type: multipart/form-data"
- (Update Sept 2016) – new version of OSDS available that seems to work better out of the box. Interface changes, django defaults to english, adding named/entities and facets doesn’t barf.
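Since a broken tika config cost me so much time in the June update, a quick sanity check before restarting tika would have helped. A sketch under stated assumptions: it only proves the XML is well-formed and shows which parser entries mention a mime type, it does not prove Tika will accept the file. The function name is my own, and the embedded config is the one from the June update minus the XML declaration.

```python
import xml.etree.ElementTree as ET

CONFIG = """<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
    </parser>
    <parser class="org.apache.tika.parser.jpeg.JpegParser">
      <mime>image/jpeg</mime>
    </parser>
  </parsers>
</properties>"""


def parsers_for_mime(config_xml, mime_type):
    """Return (claimed, excluded) parser class names for `mime_type`.

    ET.fromstring raises xml.etree.ElementTree.ParseError on malformed XML,
    which is the first thing worth knowing about a hand-edited config.
    """
    root = ET.fromstring(config_xml)
    claimed, excluded = [], []
    for parser in root.iter("parser"):
        cls = parser.get("class")
        mimes = [m.text.strip() for m in parser.findall("mime") if m.text]
        excludes = [m.text.strip() for m in parser.findall("mime-exclude") if m.text]
        if mime_type in mimes:
            claimed.append(cls)
        if mime_type in excludes:
            excluded.append(cls)
    return claimed, excluded


claimed, excluded = parsers_for_mime(CONFIG, "image/jpeg")
print("handled by:", claimed)
print("excluded from:", excluded)
```

To check the real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes in, so the encoding declaration is handled by the parser.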
https://www.kernel.org/doc/Documentation/sysrq.txt (although this doesn’t seem to be possible during the crash as the system is completely unresponsive)
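Going back to the curl test of the tika-server in the June update: the same check can be scripted. A sketch, assuming tika-server is on localhost:9998 and using a plain PUT of the file bytes to /rmeta rather than the multipart form endpoint. The function names are mine, and the exact metadata key names depend on your Tika version, so the filter just looks for a substring.

```python
import json
import urllib.request

TIKA_RMETA_URL = "http://localhost:9998/rmeta"


def build_rmeta_request(image_bytes, url=TIKA_RMETA_URL):
    """Build (but do not send) a PUT request asking Tika for JSON metadata."""
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="PUT",
        headers={"Accept": "application/json", "Content-Type": "image/jpeg"},
    )


def metadata_keys(rmeta_json, needle="exif"):
    """List metadata keys containing `needle` (case-insensitive).

    /rmeta returns a JSON list with one metadata dict per document,
    including any embedded ones.
    """
    keys = set()
    for document in json.loads(rmeta_json):
        keys.update(k for k in document if needle.lower() in k.lower())
    return sorted(keys)


def extract_metadata(path):
    """Send the file at `path` to a running tika-server and return its JSON."""
    with open(path, "rb") as handle:
        request = build_rmeta_request(handle.read())
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Something like `metadata_keys(extract_metadata("/path/to/test.jpg"))` (placeholder path) then shows which exif-ish fields Tika hands back, which is the list you would need to map into schema.xml.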