
Tika/Tesseract not OCRing content in PDF files

marcusbabajews 3 months ago

Hi Vlad and Team,

When I look at the Content Extraction results for a PDF in the Control Panel, I see the following output:

Using Tika in server mode.
Server URL: http://tika:9998/tika
cURL error: Operation timed out after 5001 milliseconds with 0 bytes received

This results in no indexable content:

-> Extracting...No text contents    

When I examine the running process in the Tika container, its command line looks like this:

tesseract /tmp/apache-tika-7883861614107540574.tmp /tmp/apache-tika-4559584953829807082.tmp -l eng --psm 1 -c page_separator= -c preserve_interword_spaces=0 txt
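
Since Tesseract is clearly running, my guess is that OCR simply takes longer than the 5-second cURL timeout. Here is a minimal sketch of how that could be checked by sending one PDF straight to Tika with a much longer timeout and timing the response. The function names and the 300-second timeout are my own placeholders, and the `tika` hostname is assumed to resolve only from inside the same Docker network (otherwise the URL needs adjusting):

```python
import time
import urllib.request

# Assumption: same server URL as shown in the Control Panel output.
TIKA_URL = "http://tika:9998/tika"


def build_request(pdf_bytes, url=TIKA_URL):
    """Build a PUT request asking the Tika server for plain-text output."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def time_extraction(pdf_path, timeout_s=300):
    """Send one PDF to Tika and return (seconds elapsed, extracted text).

    timeout_s is a hypothetical generous limit, far above the 5 seconds
    seen in the cURL error, so slow OCR has time to finish.
    """
    with open(pdf_path, "rb") as f:
        req = build_request(f.read())
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return time.monotonic() - start, text
```

Calling `time_extraction("/user-files/DOCUMENTS/Some Directory/Some File.pdf")` should show how long the OCR pass really takes. If it comes back with text after well over 5 seconds, the timeout rather than Tika itself would be the likely culprit.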


The logging from this command

php /var/www/html/cron/add_folder_to_search_index.php /user-files/DOCUMENTS

looks like this (along with many other similar lines, of course):

INDEX /user-files/DOCUMENTS/Some Directory/Some File.pdf -> Extracting......OK

My Docker Compose definition for the Tika container looks like this:

 tika:
   image: logicalspark/docker-tikaserver
   container_name: filerun-tika
   mem_limit: 1g
   restart: unless-stopped


Any idea how I can get this working?

Thanks in advance, Team FileRun!