Files / Search not working #files


Jud Eson
 

Hi all,
I think this is a bug. Please confirm.
- Files recently uploaded are not found when you do a word search from the Files area.

Details:
I am the owner of a group used for a condo association. The group was started on 03/19/19 so is grandfathered in to have more features than newly created groups.  We have stored files in folders in the file section.  A folder for "Board Minutes" with sub folders for each year.  Board minutes are stored as "Print to PDF" files and are searchable.
 
I open a PDF file that was uploaded last year and find an uncommon word in it.
I search for the word using the PDF reader and it is found.
I search for the word using the groups.io Files/Search and the file shows up in the results.

I upload a new file (created from Print to PDF from MS Word) with the word "abracadabra" in it.
I search for the word abracadabra using the groups.io Files/Search and the file does not show up in the results.

What am I missing?
- Jud


 

It seems that files uploaded to groups.io after the change that affected new groups are no longer searchable.

I did this test to confirm:
Look at the file list and pick a file uploaded years ago.  Find an uncommon word in it.


Frances
 

On Mon, Sep 21, 2020 at 09:59 AM, Jud Eson wrote:
It seems that files uploaded to groups.io after the change that affected new groups are no longer searchable.

I did this test to confirm:
Look at the file list and pick a file uploaded years ago.  Find an uncommon word in it.
I tested.
You are right. Older files are searched for keywords but newer ones aren't.
Old PDF documents containing the word were retrieved.
I uploaded a new PDF with the same uncommon word. It was not retrieved.
I created the PDF by using Print, Save as PDF on Mac OS 10.15.6

I tested again with another word and the new PDF was not retrieved.

My only concern is that it is possible that there needs to be a time lag between uploading and searching to build an index. Possible? If so, my test may not be relevant.

Frances
 
--
GMF wiki for help. Search box at the top of each page.

Check out the new groups.io Help Center. Use your browser to search or download the PDF.


Duane
 

On Mon, Sep 21, 2020 at 08:59 AM, Jud Eson wrote:
It seems that files uploaded to groups.io after the change that affected new groups are no longer searchable.
I just did some checking on one of my groups.  A pdf upload on July 12 @ 1350 CDT is searched.  A pdf uploaded July 22 is not.  That's not conclusive though because on another group, one uploaded on July 12 @ 1902 CDT is not searched and that's after the one that is.  Maybe someone can narrow it down further.

Duane
--
The official Groups.io user documentation is in the Groups.io Help Center.
GMF's Unofficial Help Wiki: https://groups.io/g/GroupManagersForum/wiki


Christos G. Psarras
 

Interesting, apparently something is indeed afoot and in a peculiar way.  To add to Duane's and Frances' tests, I checked in one of our premium groups (2019) and searched for a certain keyword that exists in several PDFs, and it's found in a PDF uploaded on February 7 (2020) but not in another one uploaded April 8.  But just like Duane's test, a search on a different keyword will find a PDF from June 14, but not another from July 29.

I don't think it's date related, it seems the "searcher/indexer" is missing uploads or something like that, maybe it needs a manual refresh or whatever.

Cheers,
Christos


Bruce Bowman
 

PDF files are binary. Observing a sequence of characters in Acrobat Reader is no guarantee that the same string actually occurs within the file.

My experience is that "Print to PDF" in Windows 10 does not create a searchable document. 

Regards,
Bruce

Check out the groups.io Help Center and groups.io Owners Manual


Christos G. Psarras
 

Bruce,

>>> PDF files are binary. Observing a sequence of characters in Acrobat Reader is no guarantee
>>> that same string actually occurs within the file.

>>> My experience is that "Print to PDF" in Windows 10 does not create a searchable document.

While it's true that PDFs are either 8-bit binary files or 7-bit ASCII text files, they are still searchable if one opens them as a text stream and searches for something.  The difference is when the PDF was created from page images (or bitmaps) in "binary" mode where the text is a graphical representation of itself instead of the actual text; in that case it's not searchable, as the characters themselves are not stored in the PDF, only their graphical representation. Maybe that's what Win10 is doing, I don't know.  But in my case, and I suspect in Jud's case as well, the PDFs which are failing the search were created the same way as the others which work OK, and in my case they were created by Word and PDF Complete so I know it's not something I'm doing.

You are right that the string may not be (visually) present in the file as such if it contains formatting within its characters (or it was a bitmap image), but if not, it should be there and searchable, and found.

But just to make sure we are not missing something, I did another test, only this time I searched TXT files for a keyword that is there in both files, and lo and behold, one uploaded a month or so ago doesn't get listed in the search results, while another one uploaded earlier this year shows up.

So apparently there is a problem somewhere, more and more I now think some search index or something needs updating or refreshing, I think we should report this to beta.

Cheers,
Christos


Jud Eson
 

 "apparently there is a problem somewhere, more and more I now think some search index or something needs updating or refreshing, I think we should report this to beta"

Christos,
I have done sufficient testing with PDF files and .txt files to be certain that text indexing that happened to old files is no longer happening with new files.
How do I report it to someone that can confirm the bug and possibly fix it?
- Jud


Christos G. Psarras
 

Hi Jud,

I have done sufficient testing with PDF files and .txt files to be certain that text indexing that happened to old files is no longer happening with new files. 
How do I report it to someone that can confirm the bug and possibly fix it?
You can post it to the beta group, https://beta.groups.io/g/main that's what the developer reads; you can include in it a link to this GMF thread as a reference. 
 
Then please post the link of the beta thread in a message in this thread, for future reference.  I'll add my confirming results in your beta thread as well.

Cheers,
Christos


peteski7
 

I wonder if on the older PDFs the text was stored using the old standard single-byte (ASCII) character sets, and in some newer PDFs the text is stored as Unicode (multi-byte) characters?  Unless the search engine was set up to handle Unicode, the search will fail.  Just thinking out loud . . .
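
A quick sketch of that idea in Python, purely as an illustration: it assumes a naive search that compares raw bytes, which is a guess and not how groups.io's search is known to work. The same word stored as UTF-16 is invisible to an ASCII byte search until the bytes are decoded with the right encoding.

```python
# Toy illustration: a naive byte-level search for an ASCII word
# misses the same word when it is stored as UTF-16.
word = "abracadabra"

ascii_bytes = word.encode("ascii")        # one byte per character
utf16_bytes = word.encode("utf-16-be")    # two bytes per character

needle = word.encode("ascii")

found_in_ascii = needle in ascii_bytes
found_in_utf16 = needle in utf16_bytes

# Decoding with the correct encoding first makes the word findable again.
found_after_decode = word in utf16_bytes.decode("utf-16-be")

print(found_in_ascii, found_in_utf16, found_after_decode)
```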

Peteski


Jud Eson
 

Posted in beta group at https://beta.groups.io/g/main/topic/files_recently_uploaded_are/77036376


 

Peteski,

I wonder if on the older PDFs the text was stored using the old
standard single-byte (ASCII) character sets, and in some newer PDFs
the text is stored as Unicode (multi-byte) characters?
PDF files are typically compressed (like zip files). Moreover, the characters are not always encoded as either ASCII or Unicode. So it typically takes some special purpose code to make the content searchable. Later versions of Windows have that built in; earlier versions required a plug-in to Explorer's search indexer.

So being unable to search content of a PDF file would not be a surprise, rather it would be a surprise to find one where the text content is searchable in plain text.
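
A minimal Python sketch of the compression point, under stated assumptions: the byte string below is a made-up stand-in for a page's text, not a real PDF, and zlib is used because PDF streams are commonly deflate-compressed (the FlateDecode filter). A plain byte search of the compressed data should miss the word, while the same search after inflating finds it.

```python
import zlib

# Hypothetical stand-in for the text content of a page (not a real PDF).
page_text = b"The magic word is abracadabra. " * 20

# Deflate-compress it, as a PDF writer commonly does for content streams.
compressed = zlib.compress(page_text)

# A naive search of the stored (compressed) bytes.
found_raw = b"abracadabra" in compressed

# The same search after the "special purpose" step of inflating the data.
found_after_inflate = b"abracadabra" in zlib.decompress(compressed)

print(len(page_text), len(compressed))
print(found_raw, found_after_inflate)
```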

Shal


--
Help: https://groups.io/helpcenter
More Help: https://groups.io/g/GroupManagersForum/wiki
Even More Help: Search button at the top of Messages list


Jud Eson
 

I posted in 
https://beta.groups.io/g/main/message/26318

Mark acknowledged the bug, ran a FT index on files. My testing confirmed that things now work as expected. 
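
For anyone curious why a missed indexing step would produce exactly these symptoms, here is a toy sketch in Python. It is an assumption about the general mechanism, not groups.io's actual code; the class, method, and file names are invented for illustration. Search only consults the index, so a file that was uploaded but never indexed stays invisible until a reindex is run.

```python
from collections import defaultdict


class FileSearch:
    """Toy full-text index: search only sees files that were indexed."""

    def __init__(self):
        self.files = {}                # name -> text (everything uploaded)
        self.index = defaultdict(set)  # word -> names of indexed files

    def upload(self, name, text, index_now=True):
        self.files[name] = text
        if index_now:
            self._index_file(name)

    def _index_file(self, name):
        for word in self.files[name].lower().split():
            self.index[word].add(name)

    def reindex_all(self):
        # Rebuild the index from every stored file.
        self.index.clear()
        for name in self.files:
            self._index_file(name)

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))


store = FileSearch()
store.upload("minutes-2019.pdf", "budget abracadabra approved")
# Simulate the bug: the upload succeeds but the indexing step is skipped.
store.upload("minutes-2020.pdf", "abracadabra roof repair", index_now=False)

before = store.search("abracadabra")  # only the older file
store.reindex_all()
after = store.search("abracadabra")   # both files

print(before)
print(after)
```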

BTW - PDF documents created directly from word processing (as opposed to made from a scanned document) have a layer of text that corresponds to each key stroke.

US Courts, government agencies and law offices depend on this to find relevant documents stored in document management systems. 

https://www.onelegal.com/blog/how-to-make-a-pdf-text-searchable/


 

Jud,

Mark acknowledged the bug, ran a FT index on files. My testing
confirmed that things now work as expected.
Yup. Saw that.

BTW - PDF documents created directly from word processing (as opposed
to made from a scanned document) ...
I wasn't talking about scans or other forms of image content. But that's not relevant now since Groups.io's search engine is apparently capable of dealing with PDF file compression.

Shal

