Under review

Tika/Tesseract not OCRing content in PDF files

marcusbabajews 2 years ago updated 2 months ago 18

Hi Vlad and Team,

I've noticed that when looking at the results for the Content Extraction in the Control Panel of a PDF I see the following output:

Using Tika in server mode.
Server URL: http://tika:9998/tika
cURL error: Operation timed out after 5001 milliseconds with 0 bytes received

Resulting in no indexable content:

-> Extracting...No text contents    

When examining the process command line in the Tika container it looks like this:

tesseract /tmp/apache-tika-7883861614107540574.tmp /tmp/apache-tika-4559584953829807082.tmp -l eng --psm 1 -c page_separator= -c preserve_interword_spaces=0 txt


The log output from this command

php /var/www/html/cron/add_folder_to_search_index.php /user-files/DOCUMENTS

 

Looks like this (along with many other similar lines of course):

INDEX /user-files/DOCUMENTS/Some Directory/Some File.pdf -> Extracting......OK

My Tika container compose looks like this:

 tika:
   image: logicalspark/docker-tikaserver
   container_name: filerun-tika
   mem_limit: 1g
   restart: unless-stopped


Any idea how I can get this working?

Thanks in advance Team Filerun!

I just added tika server to my filerun instance and am seeing this too on files which are too big - it seems like tika simply needs longer to process them. 

Can somebody please advise where this time limit of 5001 milliseconds can be changed?
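To check whether Tika really just needs longer, rather than failing outright, the file can be sent straight to the server with a much larger timeout than FileRun allows. Below is a minimal sketch, assuming the server URL from the Control Panel output above is reachable from wherever the script runs. The helper names (`build_request`, `extract_text`, `probe`) and the 600 second default are made up for illustration and are not part of FileRun or Tika.

```python
import time
import urllib.request

# Assumption: same server URL as shown in the Control Panel output.
TIKA_URL = "http://tika:9998/tika"


def build_request(url: str, data: bytes) -> urllib.request.Request:
    """Build a PUT request asking the Tika server for plain text."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL, timeout: float = 600.0) -> str:
    """Upload the file and return whatever text Tika sends back.

    `timeout` is the socket timeout in seconds; 600 is an arbitrary,
    generous value chosen for this sketch.
    """
    with open(path, "rb") as fh:
        request = build_request(url, fh.read())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def probe(path: str) -> tuple[int, float]:
    """Return (number of extracted characters, elapsed seconds)."""
    started = time.monotonic()
    text = extract_text(path)
    return len(text), time.monotonic() - started


# Example call (needs a running Tika server and a real file):
# chars, seconds = probe("/user-files/DOCUMENTS/Some Directory/Some File.pdf")
```

If the elapsed time comes out well above the 5001 milliseconds in the error message, the timeout is the likely culprit. If the server returns nothing even with a long timeout, the problem is elsewhere.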

The timeout seems to be hardcoded. At least it is mentioned at https://www.filerun.com/changelog on June 28, 2021:

Increased timeout for indexing via Apache Tika in server mode from 5 to 50 seconds.

I am very interested in this issue too since I am getting the same error.

Replacing logicalspark/docker-tikaserver with apache/tika solved the problem for me. Tested with a 426 pages document.

Is that a drop-in replacement? No other changes needed?

Please let me know which tag you used with apache/tika, as the latest tag seems to point to version 2.4.0 and the latest tag of logicalspark/docker-tikaserver to 1.28.4.

Yes, drop-in. Didn't change anything else.

I have to admit that I am very new to filerun. I set up my VM with docker-compose the day before yesterday. PDFs are working, images are not, but I do not know if this is related to the Tika version. Just try the replacement (docker ftw) and let me know your impressions. :-)

I replaced logicalspark/docker-tikaserver with apache/tika so basically I went from tika 1.28.4 to 2.4.0 

I checked a 50MB file with right click: more options : control panel:

Thanks, I got that and the issue is fixed. I'm now only wondering what the status of logicalspark/docker-tikaserver vs apache/tika is, since it sounds like the author of logicalspark/docker-tikaserver said to use the apache version, see here: https://github.com/LogicalSpark/docker-tikaserver/issues/35

Hi Vlad,

any chance this timeout can be made configurable?

Hey, I'm giving filerun a try right now and got the same issue about tika-server not responding fast enough.

My NAS has a really low-performance CPU, so OCR for every page takes about 20 seconds.

Making this timeout configurable would help a lot.
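As a rough back-of-the-envelope check of those numbers, here is a small sketch. It assumes pages are OCRed one after another at a constant per-page cost, which is only an approximation of what Tika actually does. The function names are invented for this example.

```python
import math


def estimated_ocr_seconds(pages: int, seconds_per_page: float) -> float:
    """Total OCR time if every page costs the same and pages run sequentially."""
    return pages * seconds_per_page


def max_pages_within(timeout_seconds: float, seconds_per_page: float) -> int:
    """Largest whole number of pages that finishes before the timeout."""
    return math.floor(timeout_seconds / seconds_per_page)


# 20 s per page is the figure from the post above; 50 s is the timeout
# from the changelog entry and 5 s the one seen in the cURL error.
pages_at_50s = max_pages_within(50, 20)
pages_at_5s = max_pages_within(5, 20)
```

Under these assumptions, a 50 second limit leaves room for two pages at 20 seconds each, and a 5 second limit for none, so on slow hardware even short scanned PDFs would time out.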

Trying with a 3-page PDF of 1MB shows content, although the Tika server says it couldn't extract any. Maybe this text was extracted with the previous Tika version?

INFO  [qtp732189840-19] 09:29:36,798 org.apache.tika.server.core.resource.TikaResource /tika (autodetecting type)
WARN  [qtp732189840-19] 09:29:36,869 org.apache.tika.server.core.resource.TikaResource tika: Text extraction failed (null)
org.apache.tika.exception.TikaException: Unable to extract PDF content

There seems to be a cache indeed. If you try to re-scan a file with a different tika version, you get the error "This file cannot be indexed..." If you try with a freshly imported file instead, it will work.


But you are right with another thing: version 2.4 is not compatible. The success I got with the previously mentioned 426p file was apparently due to some kind of cache. I tried as well to scan a fresh file with 2.4 and got an empty result from the tika server.

However, you can use the tag 1.28.4-full, so apache/tika:1.28.4-full will give you a compatible version. The only question left is whether this version really fixes the original timeout problem. Would you give it a try?

I forgot to mention that I tried 1.28.4-full and, as far as I remember, it still didn't work, so I am back to the original Tika server I used.

I'm at a loss with this. Unfortunately, there is no help from the dev team so far...maybe because the search and indexing feature is meant for enterpri$e users.

Thanks for your feedback, @ovidiu. I will return when I eventually find a solution.

No updates so far. I switched to nextcloud in the meantime. The mentioned thread is not mine.