Find PDFs that contain just images

Post by **NotNull** » Wed Mar 01, 2023 2:58 pm

A PDF file can consist of text, images or a combination of both (like magazines or scans)

The ones that are just images can be easily found using Everything 1.5:

"C:\some folder"  ext:pdf     !dotall:regex:fromdisk:content:^(.{50})

This will show all PDFs in which fewer than 50 text characters can be found.
Besides image-only PDFs, this will also list encrypted and corrupted PDFs.
Might be useful if you want to convert those to text (using OCR).

Just sharing ...

The 50-character limit is a practical one. I found that some "noise" characters were detected once in a while, but never more than 20, and a text PDF will contain more than 50 characters.
If you find (for example) 30 works better, please post your experience so this can be updated.
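For anyone wondering why the regex works, here is a minimal Python sketch of the same test. It assumes that Everything's dotall: modifier behaves like Python's re.DOTALL (so "." also matches line breaks) and that the PDF text has already been extracted into a string. The names HAS_TEXT and looks_image_only are made up for this illustration and are not part of Everything.

```python
import re

# Same pattern as in the query. With DOTALL, "." also matches newlines,
# so the pattern only matches when the text starts with at least 50 characters.
HAS_TEXT = re.compile(r"^(.{50})", re.DOTALL)


def looks_image_only(extracted_text: str) -> bool:
    """Return True when fewer than 50 characters of text were extracted.

    This mirrors the leading "!" (NOT) in the Everything query.
    """
    return HAS_TEXT.search(extracted_text) is None


if __name__ == "__main__":
    print(looks_image_only(""))            # nothing extracted
    print(looks_image_only("x" * 49))      # just under the threshold
    print(looks_image_only("line\n" * 20)) # 100 characters spread over many lines
```

Without the dotall flag, the last case would not match the pattern, because no single line holds 50 characters, and a multi-line text PDF would be wrongly reported as image-only.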

Post by **horst.epp** » Wed Mar 01, 2023 4:22 pm

Very useful

I stored it as Bookmark "PDF which need OCR"
As all my PDFs are indexed, it's
"C:\temp\pdf" ext:pdf !dotall:regex:content:^(.{50})

[Edit]
Using the same search on all my PDFs, it finds many files that contain a lot of text but are still displayed.

Post by **NotNull** » Wed Mar 01, 2023 4:56 pm

Could you share one of those for further inspection (if privacy allows)?
(or send it by DM. I think you have my e-mail address already?)

BTW: The search query still needs optimizing. It looks like Everything reads the entire PDF, even when it should already be able to tell that this is a text PDF (because it contains more than 50 characters of text).

Post by **ChrisGreaves** » Wed Mar 01, 2023 6:51 pm

NotNull wrote: ↑Wed Mar 01, 2023 2:58 pm A PDF file can consist of text, images or a combination of both (like magazines or scans)

Hi NotNull. A bit more grist for your mill.

Attachment: Content_09.png (31.31 KiB)

Just one of 1,502 PDF objects in T:\Greaves\Admin (banking etc) satisfies your search criteria, and that is a zero-length file.
I suspect that someone smarter than me could work out how to exclude zero-byte files (or, for that matter, files with byte counts less than your "50") from the content search?

Anyway, good find, and as such carefully tucked away in t:\greaves\training\everything\tutorial
Cheers, Chris

Post by **NotNull** » Wed Mar 01, 2023 8:06 pm

You could add size:>50 to the query (or just delete the zero-byte PDF file).

voidtools forum
