Under review

# OCR not running

astone 2 months ago updated 2 months ago

Hi all,

I recently got my FileRun server running on IIS, and all seems to be working, except for OCR with PDFs.

I downloaded the Apache Tika app and linked it along with the recommended config file to my FR install (see screenshot below), and I downloaded Tesseract and linked it in my Path system variable. Unfortunately, when I run process_search_index_queue.php, it always reports that the queue is empty, even though I have uploaded PDF files to my server that need to be OCR'd.

Is there anything I might've missed, or is there a way to test whether Apache Tika is working? I checked my PHP error log, and it shows no errors related to this.


FileRun queues files for indexing only when they have been uploaded with FileRun after content indexing was enabled. For indexing other, existing files, there are more command line options that you can find on the documentation site.

I'm using Tika server here, and it seems to be working fine.

Steps I took to enable:
Get the latest Java and the Tika server jar.
Set the environment variable JAVA_HOME to C:\Program Files\Java\jre1.8.0_341

Create a batch file containing:
java -jar "C:\path\to\tika\server.jar" --port 9998

Run the batch file as a scheduled task.
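
To check whether the Tika server started by that batch file is actually answering, something like the following Python sketch may help. It assumes the server listens on localhost port 9998 (the port from the batch file above) and that a plain GET to the /tika endpoint returns a short greeting with HTTP 200, which is how Tika server normally behaves. The function name `tika_server_is_up` is made up for this example.

```python
import http.client
import urllib.request


def tika_server_is_up(base_url="http://localhost:9998", timeout=5):
    """Return True if a GET to <base_url>/tika answers with HTTP 200.

    base_url is an assumption: adjust host and port to match your own
    batch file / scheduled task.
    """
    try:
        with urllib.request.urlopen(base_url + "/tika", timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a broken reply:
        # in all of these cases the server is not usable.
        return False


if __name__ == "__main__":
    if tika_server_is_up():
        print("Tika server is responding")
    else:
        print("Tika server is NOT responding")
```

If this reports that the server is not responding, the problem is on the Java/Tika side (server not started, wrong port, firewall) rather than in FileRun.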

The only issue I've had is with large PDFs, but that's more a limitation of Tika. FileRun should be going by sections for large PDFs.

Thanks for the instructions, I feel so close! After running the process_search PHP file, it starts indexing, but on the Tika Server command window, it says "problem with writing the data". Have you ever experienced that? I've tried Tika Server version 2.4.1 and 1.28.2. I've also tried running Command Prompt in administrator mode which made no difference.

Tika might need permissions to write to a temporary folder.

$config['search']['limit_file_size'] = 10485760;

Add the above inside the config file (see https://docs.filerun.com/advanced_configuration for how to create it) and adjust the value as needed.

Procmon is a good way to figure out whether an app needs access to something.

Vlad, that's a good start (and an option that should be documented!) but cURL times out now.

It would be nice if it were put in a foreach loop over each page, rather than throwing a large PDF at it and causing Java to eat 4GB of RAM.

I was able to get it to work after putting in another path variable for Tesseract... mostly. It seems like a ~4MB PDF keeps failing. It sort of works, but the indexing also gets screwed up where it assigns words that it has OCR'd to different documents.

It would appear that, as an app, Tika requires access to C:\Users\<username>\appdata\local\temp

So whatever account FileRun runs under will require access to this folder.
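
A rough way to confirm that an account can write to its temp folder is the Python sketch below. It only tells you something if it is run as the same account that FileRun (and therefore Tika) runs under. It also assumes that Python's idea of the temp folder matches the one Java uses, which is usually the case on Windows but is not guaranteed. The function name `can_write_to` is made up for this example.

```python
import tempfile


def can_write_to(folder):
    """Return True if a file can be created, written and removed in folder."""
    try:
        # The file is deleted automatically when the with-block ends.
        with tempfile.NamedTemporaryFile(dir=folder, prefix="tika_check_") as handle:
            handle.write(b"test")
            handle.flush()
        return True
    except OSError:
        # Covers a missing folder as well as denied permissions.
        return False


if __name__ == "__main__":
    folder = tempfile.gettempdir()
    state = "writable" if can_write_to(folder) else "NOT writable"
    print(folder, "is", state)
```

If the folder shows up as not writable, granting that account modify rights on it is the first thing to try. Procmon, mentioned earlier, can then confirm which path Tika is really touching.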

So it started working after adding the system variable (below). Running Tika as a server worked, but random PDFs would fail. I pointed my FR server to the jar file directly, and despite it running a bit slower (which the documentation warned me about, but I don't mind), it seems to work every time. Like Imagick, this was just a real doozy I guess lol.