Find PDFs that contain just images

Discussion related to "Everything" 1.5 Alpha.
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Find PDFs that contain just images

Post by NotNull »

A PDF file can consist of text, images, or a combination of both (like magazines or scans).

The ones that are just images can be easily found using Everything 1.5:


"C:\some folder"  ext:pdf     !dotall:regex:fromdisk:content:^(.{50}) 
This shows all PDFs in which fewer than 50 text characters can be found.
Besides all-picture PDFs, it also lists encrypted and corrupted PDFs.
This might be useful if you want to convert those to text (using OCR).

Just sharing ...


The 50-character limit is a practical one. I found that some "noise" characters were detected once in a while, but never more than 20, and a text PDF will contain more than 50 characters.
If you find that (for example) 30 works better, please post your experience so this can be updated.
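
For readers who want to see what the negated regex does, here is a minimal Python sketch of the same test. It assumes that `dotall:` behaves like Python's `re.DOTALL` (the dot also matches newlines) and that the text has already been extracted from the PDF by some other tool. The function name `looks_image_only` and the `threshold` parameter are made up for this illustration and are not part of Everything.

```python
import re


def looks_image_only(extracted_text: str, threshold: int = 50) -> bool:
    """Return True when fewer than `threshold` characters of text were found."""
    # Same idea as ^(.{50}) in the query: match the first `threshold` characters.
    # re.DOTALL lets "." match newlines too (assumed equivalent of dotall:).
    pattern = re.compile(r"^(.{%d})" % threshold, re.DOTALL)
    # The query negates the match with "!", so "no match" means "list this file".
    return pattern.search(extracted_text) is None
```

Calling it with `threshold=30` would correspond to trying the lower limit mentioned above.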
horst.epp
Posts: 1443
Joined: Fri Apr 04, 2014 3:24 pm

Re: Find PDFs that contain just images

Post by horst.epp »

Very useful :D
I stored it as the bookmark "PDF which need OCR".
As all my PDFs are indexed, mine is (without fromdisk:):
"C:\temp\pdf" ext:pdf !dotall:regex:content:^(.{50})

[Edit]
Using the same search on all my PDFs, it finds many files that contain a lot of text
but are still displayed.
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Find PDFs that contain just images

Post by NotNull »

Could you share one of those for further inspection (if privacy allows)?
(or send it by DM. I think you have my e-mail address already?)


BTW: The search query still needs optimizing. It looks like Everything reads the entire PDF, even when it should already be able to tell that this is a text PDF (because it has found more than 50 characters of text).
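
The early exit described here can be sketched in Python as follows. This only illustrates the idea and says nothing about how Everything reads PDFs internally. The function name `has_at_least` and the notion of text arriving in chunks (for example one string per page) are assumptions made for the example.

```python
from typing import Iterable


def has_at_least(chunks: Iterable[str], threshold: int = 50) -> bool:
    """Return True as soon as `threshold` characters have been seen."""
    seen = 0
    for chunk in chunks:
        seen += len(chunk)
        if seen >= threshold:
            # Enough text found: stop here, the remaining chunks are never read.
            return True
    # All chunks were read and the total stayed below the threshold.
    return False
```

With a lazy source of chunks, a text PDF would be recognized after the first page or two, and only the image-only files would need to be read to the end.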
ChrisGreaves
Posts: 684
Joined: Wed Jan 05, 2022 9:29 pm

Re: Find PDFs that contain just images

Post by ChrisGreaves »

NotNull wrote: Wed Mar 01, 2023 2:58 pm A PDF file can consist of text, images or a combination of both (like magazines or scans)
Hi NotNull. A bit more grist for your mill.
Content_09.png
According to your search criteria, just one of the 1,502 PDF objects in T:\Greaves\Admin (banking etc.) matches, and that one is a zero-length file.
I suspect that someone smarter than me could work out how to exclude zero-byte files (or, for that matter, files smaller than your "50") from the content search?

Anyway, good find, and as such carefully tucked away in t:\greaves\training\everything\tutorial
Cheers, Chris
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Find PDFs that contain just images

Post by NotNull »

You could add size:>50 to the query (or just delete the zero-byte PDF file :D)
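
The same size filter, written out as a small Python sketch for anyone scripting this outside Everything. It assumes that `size:>50` means "larger than 50 bytes". The function name `big_enough` and the `min_bytes` parameter are hypothetical.

```python
import os


def big_enough(path: str, min_bytes: int = 50) -> bool:
    """Return True when the file is larger than `min_bytes` bytes."""
    # A zero-byte PDF can never hold 50 characters of text, so it is
    # skipped before any content check is attempted.
    return os.path.getsize(path) > min_bytes
```

Running this check first keeps empty files out of the list of PDFs that need OCR.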