Tika/Tesseract not OCRing content in PDF files

marcusbabajews 1 year ago, updated by ovidiu 7 months ago

Hi Vlad and Team,

I've noticed that when I check the Content Extraction results for a PDF in the Control Panel, I see the following output:

Using Tika in server mode.
Server URL: http://tika:9998/tika
cURL error: Operation timed out after 5001 milliseconds with 0 bytes received

Resulting in no indexable content:

-> Extracting...No text contents    

When examining the process command line in the Tika container it looks like this:

tesseract /tmp/apache-tika-7883861614107540574.tmp /tmp/apache-tika-4559584953829807082.tmp -l eng --psm 1 -c page_separator= -c preserve_interword_spaces=0 txt

The log output from this command

php /var/www/html/cron/add_folder_to_search_index.php /user-files/DOCUMENTS


looks like this (along with many other similar lines, of course):

INDEX /user-files/DOCUMENTS/Some Directory/Some File.pdf -> Extracting......OK

My Tika container compose looks like this:

   image: logicalspark/docker-tikaserver
   container_name: filerun-tika
   mem_limit: 1g
   restart: unless-stopped

Any idea how I can get this working?

Thanks in advance Team Filerun!

I just added a Tika server to my FileRun instance and am seeing this too, on files that are too big. It seems Tika simply needs longer to process them.

Can somebody please advise where this time limit of 5001 milliseconds can be changed?
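
One way to check whether Tika only needs more time is to bypass FileRun and send a PDF straight to the Tika server with a longer client-side timeout. The following is a minimal sketch, not a FileRun setting: the helper name `tika_extract` and the 120-second default are made up for illustration, and the URL is the one shown in the Control Panel output above. It has to be run from a host or container that can resolve the `tika` hostname.

```shell
# Hypothetical helper: PUT a file to the Tika server and print the extracted text.
# Arguments: file path, server URL (assumed default from the Control Panel output),
# and a timeout in seconds (120 is an arbitrary illustrative value).
tika_extract() {
  local file="$1"
  local url="${2:-http://tika:9998/tika}"
  local timeout="${3:-120}"

  # --max-time caps the whole request; Accept: text/plain asks Tika for plain text.
  curl --silent --show-error --max-time "$timeout" \
       --header "Accept: text/plain" \
       --upload-file "$file" "$url"
}

# Example call (path taken from the indexing log above):
# tika_extract "/user-files/DOCUMENTS/Some Directory/Some File.pdf" | head
```

If text comes back after more than five seconds, the OCR itself works and only the timeout on the calling side is too short. If the request still times out or returns nothing, the problem is more likely in the Tika/Tesseract container itself.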