Answered

Full Text Searching for Existing Files

cynical 8 months ago updated 8 months ago 8

Hello,

While the Control Panel information for my files shows that Tika has extracted the text successfully, running a search in Filerun for file contents will not make a match on any files unless I created or modified them via Filerun.

Is this expected behavior?

For example, I had a .doc file that Tika showed as successfully extracted. Searching for words within that document never returned any hits. But if I made a copy of that doc within FileRun, I was then able to search its contents.

Actually, even filename indexing with Elasticsearch isn't working, or at least FileRun doesn't know how to interface with it. I ran reindex_files.sh, which took hours, and as far as I can tell FileRun still doesn't know a file exists on my system unless I have browsed its directory at least once. So it's not just full-text search; plain filename indexing isn't working either, for some reason.


 

Answered

The "Content indexing" section of the file control panel is not indexing anything. It is only a troubleshooting tool to see what text gets extracted from the file.

Indexing happens only when "cron/process_search_index_queue.php" runs (https://docs.filerun.com/file_indexing#testing_indexing).

The filename searching is not related to ElasticSearch in any way. "reindex_files.php" will only index contents, not file names. For indexing filenames, there is "index_filenames.php".

Hi. I was not referring to the Metadata extraction section. There is a 'Content Indexing' section. If you click that, you'll see the plain text version of the file you clicked on, and at the top it says:

Server URL: http://192.168.1.200:9998/tika
Response code: 200

This leads me to believe that Tika is actually indexing the full text of the file, right?

So if Tika is doing the full text indexing and ElasticSearch doesn't do filename searching, then what is ElasticSearch doing? Are Tika and ES doing the same thing? 

Regardless, even though Tika seems to have indexed the plain text of a file, it's not searchable from within the FileRun search UI unless I've modified or created/copied the file within Filerun.

Please let me know if you need more details.

I meant "Content indexing".

"If you click that, you'll see the plain text version of the file you clicked on and at the top it says"

That sounds like Tika doesn't return any text for the file. It should actually show the text contents.

Tika is used to extract the contents, which are then sent to ES for indexing. The search then runs against the ES index.
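For anyone following along, here is a rough sketch of those three steps as plain HTTP requests. It is not FileRun's actual code. The Tika URL is the one quoted earlier in this thread; the Elasticsearch address, the index name "filerun_demo" and the field names "path" and "content" are made up for illustration. The requests are only built, not sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request

TIKA_URL = "http://192.168.1.200:9998/tika"  # Tika address quoted in this thread
ES_URL = "http://localhost:9200"             # assumption: Elasticsearch address
INDEX = "filerun_demo"                       # hypothetical index name


def tika_extract_request(file_bytes: bytes) -> Request:
    """Step 1: ask Tika for the plain text of a file (extraction only)."""
    return Request(TIKA_URL, data=file_bytes,
                   headers={"Accept": "text/plain"}, method="PUT")


def es_index_request(doc_id: str, path: str, text: str) -> Request:
    """Step 2: store the extracted text in Elasticsearch so it becomes searchable."""
    # "path" and "content" are hypothetical field names.
    body = json.dumps({"path": path, "content": text}).encode("utf-8")
    url = f"{ES_URL}/{INDEX}/_doc/{quote(doc_id, safe='')}"
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"}, method="PUT")


def es_search_request(words: str) -> Request:
    """Step 3: the search runs against the Elasticsearch index, not against Tika."""
    body = json.dumps({"query": {"match": {"content": words}}}).encode("utf-8")
    return Request(f"{ES_URL}/{INDEX}/_search", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")
```

Passing any of these to urllib.request.urlopen would send it. The point is that a successful step 1 (what the "Content indexing" panel shows) says nothing about whether step 2 ever happened for that file.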

"you'll see the plain text version of the file" means Tika did its job. It's there. 

However, this isn't limited to Tika. It won't even search contents of plain text files. Please see this video for exactly what I'm talking about. 

Video

So the files themselves seem to be getting indexed correctly when you look at the control panel for them. Tika does its job on the files it handles, and the plain text files seem to get done via FileRun natively. But when it comes time to search for them using 'contents', then FileRun can't find them. But if I create them fresh in FileRun or make a duplicate of them, then they are now searchable. 

When you want full text searching, "cron/process_search_index_queue.php" needs to be run on a schedule to index the files you create (via FileRun).

If you create files outside FileRun, then FileRun won't know about them, so they are not queued for indexing. In this case you need to use either "add_file_to_search_index.php" for individual files, or "add_folder_to_search_index.php". (https://docs.filerun.com/command_line_tools)
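To make the queueing behavior concrete, here is a toy in-memory model. It is only an illustration of the idea described above, not how FileRun is implemented: the class and method names are invented, and the "index" is a simple word-to-paths dictionary standing in for Elasticsearch. Whether the real add_folder script indexes immediately or only queues is not stated here; this toy version only queues.

```python
from collections import deque


class ToyFileStore:
    """In-memory stand-in for the disk, the index queue, and the search index."""

    def __init__(self):
        self.files = {}       # path -> text "on disk"
        self.queue = deque()  # paths waiting to be indexed
        self.index = {}       # word -> set of paths (stand-in for Elasticsearch)

    def create_via_app(self, path, text):
        # The application created the file, so it also queues it for indexing.
        self.files[path] = text
        self.queue.append(path)

    def create_outside(self, path, text):
        # Written straight to disk: nothing queues it for indexing.
        self.files[path] = text

    def add_folder_to_index(self, folder):
        # Toy counterpart of add_folder_to_search_index.php (queues only).
        prefix = folder.rstrip("/") + "/"
        for path in self.files:
            if path.startswith(prefix):
                self.queue.append(path)

    def process_queue(self):
        # Toy counterpart of cron/process_search_index_queue.php.
        while self.queue:
            path = self.queue.popleft()
            for word in self.files[path].lower().split():
                self.index.setdefault(word, set()).add(path)

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))


store = ToyFileStore()
store.create_via_app("/home/a.txt", "quarterly report")
store.create_outside("/home/b.txt", "quarterly budget")
store.process_queue()
before = store.search("quarterly")  # only the file created via the app

store.add_folder_to_index("/home")
store.process_queue()
after = store.search("quarterly")   # both files are found now
```

In this model, running the queue processor as often as you like never picks up b.txt until something explicitly adds it to the queue.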

It is run on a schedule: it runs every minute via cron. I just ran it manually and the queue is empty. The files I create via FileRun aren't the issue anyway. Those work great.

I have also run 'reindex_files.php' already. It took hours, but it completed. Doesn't this accomplish the same thing as the add_file and add_folder scripts above, but across the whole file system?

Just to test, I ran the "add_folder_to_search_index.php" on the folder in question and it said it processed those two text files. However they are still not content-searchable. I've also done this from the GUI under control panel on the folder.

None of these things make the contents searchable. Only creating the file new (or duplicating an existing one) within FileRun makes that file's contents searchable.

Okay I just played around with these PHP files and here's what I found. 

process_search_index_queue.php is working fine. I have it on a schedule.

add_folder_to_search_index.php works. If I point it at a folder, it takes all the plain-text data and injects it into the Elasticsearch database, and the contents are then searchable.

reindex_files.php - Either this isn't working or it's failing somewhere. My understanding was that it is supposed to do the same thing as add_folder_to_search_index.php, but across the entire home directory and its subdirectories. That does not happen. I'm not sure whether it's meant to index file contents or file names across my entire user folder, but it's doing neither. It runs for a long time and seems to be doing something, but in the end it either fails before it finishes(?) or it isn't doing what it is supposed to do. Can you clarify whether I'm using it right, and whether I should run any debugging or pull logs to see what it accomplished?

Also, while working with this, it became clearer how Tika and Elasticsearch do their jobs. I understood both had something to do with indexing and searching, but it wasn't clear which app did what. Perhaps a workflow description or a clarification in the docs would be good? But for others reading this (and assuming I'm correct):

Tika will extract the plain-text from various types of files like docx, xls, etc, and keep that data somewhere(?). 

Elasticsearch will actually be the indexer for that data that Filerun queries. 

But a script or process has to be run on those files if Filerun didn't create/modify them in order for that plain-text data to get put INTO the Elasticsearch database and become searchable. 

So add_folder_to_search_index.php and add_file_to_search_index.php make file contents searchable.

index_filenames.php makes the filenames themselves searchable.

reindex_files.php does something, but it's not clear what, since I think it's broken for me.

So *for me*, it looks like I need to run add_folder_to_search_index.php as well as index_filenames.php on the root user folder to get everything. 

Before figuring this all out, I thought all I had to do was set up the cron to run process_search_index_queue.php regularly and run reindex_files.php once, which isn't correct, for me at least.
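As a quick reference, the roles above can be written down as a small lookup table. This only restates what was said in this thread; it is not an official reference, and the dictionary layout and helper name are just for illustration.

```python
# Script roles as described in this thread (not an official reference).
SCRIPT_ROLES = {
    "process_search_index_queue.php": {
        "indexes": "contents",
        "covers": "files queued by FileRun (created or modified via FileRun)",
    },
    "add_file_to_search_index.php": {
        "indexes": "contents",
        "covers": "a single file created outside FileRun",
    },
    "add_folder_to_search_index.php": {
        "indexes": "contents",
        "covers": "a folder of files created outside FileRun",
    },
    "index_filenames.php": {
        "indexes": "filenames",
        "covers": "not specified in this thread",
    },
    "reindex_files.php": {
        "indexes": "contents",  # per the staff reply; did not work in my case
        "covers": "unclear in this thread",
    },
}


def scripts_that_index(kind):
    """Return the script names that index the given kind: 'contents' or 'filenames'."""
    return sorted(name for name, role in SCRIPT_ROLES.items()
                  if role["indexes"] == kind)
```

For example, scripts_that_index("filenames") returns only index_filenames.php, which matches the point that filename search needs its own script.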