
Under review
Tika/Tesseract not OCRing content in PDF files
Hi Vlad and Team,
I've noticed that when looking at the results for the Content Extraction in the Control Panel of a PDF I see the following output:
Using Tika in server mode. Server URL: http://tika:9998/tika cURL error: Operation timed out after 5001 milliseconds with 0 bytes received
Resulting in no indexable content:
-> Extracting...No text contents
When examining the process command line in the Tika container it looks like this:
tesseract /tmp/apache-tika-7883861614107540574.tmp /tmp/apache-tika-4559584953829807082.tmp -l eng --psm 1 -c page_separator= -c preserve_interword_spaces=0 txt
The logging from this command
php /var/www/html/cron/add_folder_to_search_index.php /user-files/DOCUMENTS
Looks like this (along with many other similar lines of course):
INDEX /user-files/DOCUMENTS/Some Directory/Some File.pdf -> Extracting......OK
My Tika container compose looks like this:
tika:
  image: logicalspark/docker-tikaserver
  container_name: filerun-tika
  mem_limit: 1g
  restart: unless-stopped
Any idea how I can get this working?
Thanks in advance Team Filerun!
I just added tika server to my filerun instance and am seeing this too on files which are too big - it seems like tika simply needs longer to process them.
Can somebody please advise where this time limit of 5001 milliseconds can be changed?
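To confirm that Tika is simply slower than the limit, rather than failing outright, one option is to send a file to the server directly and time the response. The following is only a rough sketch: it assumes the Tika server accepts a PUT on /tika with an Accept: text/plain header, and that the hostname tika resolves from wherever the script runs (it normally only does inside the Docker network). The function names and the 600-second client timeout are made up for illustration and are not part of FileRun.

```python
import time
import urllib.request

TIKA_URL = "http://tika:9998/tika"  # server URL shown in the Control Panel output
FILERUN_TIMEOUT_S = 5.001           # limit reported in the cURL error (5001 ms)


def build_tika_request(url, data):
    """Build a PUT request that asks Tika to return plain text."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def time_extraction(path, url=TIKA_URL, timeout_s=600):
    """Send the file to Tika and return (seconds taken, extracted text).

    timeout_s is an arbitrary generous value chosen for this sketch.
    """
    with open(path, "rb") as f:
        data = f.read()
    start = time.monotonic()
    request = build_tika_request(url, data)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        text = response.read().decode("utf-8", errors="replace")
    return time.monotonic() - start, text


def report(path):
    """Print how long extraction took and whether it exceeds the 5001 ms limit."""
    elapsed, text = time_extraction(path)
    print(f"{elapsed:.1f}s, {len(text)} characters extracted")
    if elapsed > FILERUN_TIMEOUT_S:
        print("Slower than the 5001 ms limit seen in the cURL error.")
```

Calling report() with the path to one of the affected PDFs, from a container on the same network as Tika, should show whether the file yields text at all and how far over the limit it is.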
The timeout seems to be hardcoded. At least it is mentioned at https://www.filerun.com/changelog on June 28, 2021:
I am very interested in this issue too since I am getting the same error.
Replacing logicalspark/docker-tikaserver with apache/tika solved the problem for me. Tested with a 426 pages document.
Is that a drop-in replacement? No other changes needed?
Please let me know which tag you used with apache/tika, as the latest tag seems to point to version 2.4.0, while the latest tag of logicalspark/docker-tikaserver points to 1.28.4.
Yes, drop-in. Didn't change anything else.
I have to admit that I am very new to filerun. I set up my VM with docker-compose the day before yesterday. PDFs are working, images are not, but I do not know if this is related to the Tika version. Just try the replacement (docker ftw) and let me know your impressions. :-)
I replaced logicalspark/docker-tikaserver with apache/tika, so basically I went from tika 1.28.4 to 2.4.0.
I checked a 50MB file with right click: more options : control panel:
For increasing the size of the indexable files, please see: https://feedback.filerun.com/en/communities/1/topics/1653-ocr-not-running?redirect_to_reply=6160#comment-6148
Thanks, I got that and the issue is fixed. I'm now only wondering what the status of logicalspark/docker-tikaserver vs apache/tika is, since it sounds like the author of logicalspark/docker-tikaserver said to use the apache version. See here: https://github.com/LogicalSpark/docker-tikaserver/issues/35
Hi Vlad,
any chance this timeout can be made configurable?
Hey, I'm giving filerun a try right now and got the same issue about tika-server not responding fast enough.
My NAS has a really low-performance CPU, so OCR for every page takes about 20 seconds.
Making this timeout configurable would help a lot.
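To put rough numbers on that: at about 20 seconds per page, even a single page is well past the 5001 ms limit from the error message. A back-of-the-envelope sketch, assuming pages are OCRed one after another at a constant cost (which real documents will only approximate); the function names are made up for illustration:

```python
TIMEOUT_S = 5.001       # limit reported in the cURL error (5001 ms)
SECONDS_PER_PAGE = 20   # per-page OCR time reported for the slow NAS CPU


def estimated_ocr_seconds(pages, seconds_per_page=SECONDS_PER_PAGE):
    """Estimate total OCR time, assuming a constant cost per page."""
    return pages * seconds_per_page


def fits_in_timeout(pages, seconds_per_page=SECONDS_PER_PAGE, timeout_s=TIMEOUT_S):
    """Return True if the estimated OCR time stays within the timeout."""
    return estimated_ocr_seconds(pages, seconds_per_page) <= timeout_s


print(estimated_ocr_seconds(3))  # 60 seconds for a 3-page scan
print(fits_in_timeout(1))        # False: one page already exceeds the limit
```

Under these assumptions a scanned PDF would only fit within the limit if OCR took about 5 seconds or less for the whole document.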
Trying with a 3-page PDF sized 1MB shows content, although the tika server says it couldn't extract any. Maybe this text was extracted with the previous tika version?
There seems to be a cache indeed. If you try to re-scan a file with a different tika version, you get the error "This file cannot be indexed..." If you try with a freshly imported file instead, it will work.
But you are right about another thing: version 2.4 is not compatible. The success I got with the previously mentioned 426-page file was apparently due to some kind of cache. I also tried scanning a fresh file with 2.4 and got an empty result from the tika server.
However, you can use the tag 1.28.4-full, so apache/tika:1.28.4-full will give you a compatible version. The only question left is if this version does really fix the original timeout problem. Would you give it a try?
I forgot to mention that I tried 1.28.4-full and, as far as I remember, it still didn't work, so I am back to the original tika server I used.
I'm at a loss with this. Unfortunately, there is no help from the dev team so far...maybe because the search and indexing feature is meant for enterpri$e users.
Thanks for your feedback, @ovidiu. I will return when I eventually find a solution.
Any updates? Is this your thread? https://feedback.filerun.com/en/communities/1/topics/1241-content-successfully-indexed-but-no-search-results
No updates so far. I switched to nextcloud in the meantime. The mentioned thread is not mine.