File hash checksum management; column display

Have a suggestion for "Everything"? Please post it here.
Post Reply
raccoon
Posts: 1017
Joined: Thu Oct 18, 2018 1:24 am

File hash checksum management; column display

Post by raccoon »

So I've been using Everything for many years now and just discovered these forums. It's one of the handiest tools in my kit, so hopefully I can spam all sorts of great and annoying ideas here and lend a hand.

One of my most painstaking endeavors is maintaining file integrity and managing duplicates and backups. To this end, I always copy+verify (with FastCopy) and I manually create .sha512 hash files of important directories and media libraries (with HashCheck Shell Extension). And yet, I waste way too many hours waiting for DupeGuru to manually scan and compare file contents despite my already having all this checksum data for almost every file in my 10 TB collection.

It would be amazing if Everything could detect these checksum files and load the hash data into a column alongside normal search results. I could then use this to see which files don't yet have any checksum data, and a new hashdupe: function could easily compare these known file hashes to identify and display duplicates. Everything might even generate and validate checksum data on its own, without the aid of external tools.
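To make the idea concrete, here's a rough sketch (Python, purely for illustration -- the function name and details are mine, not a proposal for Everything's internals) of reading a sha512sum/HashCheck-style .sha512 file into a path-to-hash map, which is about all a column or a hashdupe: search would need:

Code:

# Minimal sketch: parse a sha512sum-style .sha512 file into {path: digest}.
# Assumes lines of the form "<128 hex chars>  <name>" or "<hex> *<name>";
# blank lines and ';' comment lines (some tools write a header) are skipped.
from pathlib import Path

def load_sha512_file(checksum_path):
    hashes = {}
    base = Path(checksum_path).parent
    for line in Path(checksum_path).read_text(encoding="utf-8-sig").splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        digest, _, name = line.partition(" ")
        name = name.lstrip("* ")          # '*' marks binary mode in sum files
        if len(digest) == 128:            # SHA-512 digests are 128 hex chars
            hashes[(base / name).resolve()] = digest.lower()
    return hashes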

I understand there are a lot of minutiae behind my request, but I'm willing to help work out the finer points of the logistics.

I haven't deleted a file since 1988.
void
Developer
Posts: 17159
Joined: Fri Oct 16, 2009 11:31 pm

Re: File hash checksum management; column display

Post by void »

Hopefully there will be CRC/SHA1/256 columns in the next major version.

I am considering a column for SFV and other verification files. This column would display Yes/No/Unknown (or blank while hashes are still being loaded and calculated).

Note: reading and calculating hashes is extremely slow.
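As a rough illustration of why (hashing has to read every byte of every file, so it is limited by disk throughput rather than by the hash algorithm; the snippet below is just an example, not Everything's code):

Code:

import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    # Hash in 1 MiB chunks so memory use stays flat; the disk read dominates.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# A 10 TB library at ~150 MB/s sequential read needs roughly
# 10e12 / 150e6 ≈ 67,000 seconds (about 18.5 hours) just to read the data once.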

Thanks for the suggestions.
NotNull
Posts: 5517
Joined: Wed May 24, 2017 9:22 pm

Re: File hash checksum management; column display

Post by NotNull »

void wrote: Wed Sep 18, 2019 1:49 am Hopefully there will be CRC/SHA1/256 columns in the next major version.
SHA1 or SHA2 (= SHA-256/SHA-512/...)?


BTW:
Using SHA-256 or SHA-512 will result in a 256- or 512-bit hash value (= 32 or 64 bytes).
Everything uses roughly 100 bytes to store all information about a file/folder.
Adding hashes to the Everything database would therefore increase the size of this database - in RAM and on disk - significantly: roughly 32% and 64% respectively
(and that is without the extra overhead, like the last time the hash value was calculated).
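A quick back-of-the-envelope version of that arithmetic (illustrative only; the ~100 bytes per entry is an estimate):

Code:

# Extra bytes per database entry vs. an estimated ~100-byte baseline.
base_bytes = 100
for bits in (256, 512):
    extra = bits // 8                      # 32 or 64 bytes per hash
    print(f"SHA-{bits}: +{extra} bytes -> about +{extra / base_bytes:.0%}")
# SHA-256: +32 bytes -> about +32%
# SHA-512: +64 bytes -> about +64%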
raccoon
Posts: 1017
Joined: Thu Oct 18, 2018 1:24 am

Re: File hash checksum management; column display

Post by raccoon »

Logistically, given that file hashes are not expected to change all that frequently, especially for large read-only media libraries, I would suggest a json/xml or flat file of hashes, stored externally from the database and loaded only when requested by a function (e.g. hashdupe:).

We can discuss the nitty-gritty of file formats in general -- i.e., for other metadata and features like tagging -- maybe in another thread. From a user standpoint, I'd suggest people will want their hash data in two places: in a central file where Everything can access it readily, and scattered throughout their various drives and libraries so it can travel with them and be read by other users. For me, any file that exceeds 500 MB gets a .sha512 living by its side, and those all get collected into a giant file at the root path for monthly scanning. (Until you start doing this, you have no idea how many files become corrupted by bit flipping. It does happen. But I'm relying on a hodgepodge of scripts and manual editing to maintain all this.)
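Something along these lines is what I picture for the central file (purely a sketch of the idea; the format and field names are made up, not a proposal for what Everything should actually use):

Code:

import json, time

def save_hash_store(store_path, entries):
    # entries: {file_path: {"sha512": hex_digest, "hashed_at": unix_time}}
    with open(store_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=1)

def load_hash_store(store_path):
    with open(store_path, "r", encoding="utf-8") as f:
        return json.load(f)

# Example:
# save_hash_store("hashes.json", {
#     "D:/media/movie.mkv": {"sha512": "ab12...", "hashed_at": time.time()},
# })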
NotNull
Posts: 5517
Joined: Wed May 24, 2017 9:22 pm

Re: File hash checksum management; column display

Post by NotNull »

raccoon wrote: Fri Sep 20, 2019 5:37 pm Until you start doing this, you have no idea how many files become corrupted by bit flipping. It does happen
You are right: I have no idea. How often does this happen?
BTW: If you use SHA-512 to detect bitrot, you might want to consider using CRC instead. That is much "cheaper" in terms of time and processor power. If you combine file size with CRC, that should give pretty accurate duplicate detection.
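In Python terms, the size+CRC idea looks roughly like this (a sketch only, not how Everything or DupeGuru actually does it):

Code:

import zlib
from collections import defaultdict
from pathlib import Path

def crc32_of_file(path, chunk_size=1 << 20):
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

def find_likely_duplicates(paths):
    # Group by size first: files of different sizes can never be duplicates,
    # so only the size-collision groups need to be read and CRC'd.
    by_size = defaultdict(list)
    for p in map(Path, paths):
        by_size[p.stat().st_size].append(p)
    dupes = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        for p in group:
            dupes[(size, crc32_of_file(p))].append(p)
    return [g for g in dupes.values() if len(g) > 1]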

raccoon wrote: Fri Sep 20, 2019 5:37 pm I would suggest a json/xml or flat file of hashes, stored externally from the database and loaded only when requested by a function (e.g. hashdupe:).
We can discuss the nitty-gritty of file formats in general -- i.e., for other metadata and features like tagging -- maybe in another thread.
There are already quite a few threads about tagging/metadata on this forum. One of those matches your idea for how to store hashes :)
therube
Posts: 5060
Joined: Thu Sep 03, 2009 6:48 pm

Re: File hash checksum management; column display

Post by therube »

"Cheaper".

That is the whole point of some of these non-crypto algorithms.


As far as "bit flip", I tend to check files from my flash drive against what I have stored on my HDD - prior to deleting them from my flash drive (as a means to try to identify that it may be failing) & have not had a comparison fail. (My flash drive typically holds archive programs & data files that I may want to have the ability to copy elsewhere, onto a different computer.)

Otherwise, when connected to a network, I'll compare files against backed-up copies from time to time. Again (knock, knock, knock) I'm not seeing issues. (My regular backup program is set to only check [source vs destination] date/size, which of course is essentially meaningless, but my every-so-often comparison checks have yet to yield discrepancies.)
raccoon
Posts: 1017
Joined: Thu Oct 18, 2018 1:24 am

Re: File hash checksum management; column display

Post by raccoon »

CRC, MD5 and SHA1 are all vulnerable to intentional collision attacks. It's not really worth debating end-of-the-world file-pocalypse scenarios, though, because SHA-512 is roughly as fast as SHA-256 (often faster on 64-bit CPUs), and for whole-file hashing the disk is the bottleneck anyway. It's not really a big deal. If I'm going to keep hashes of every file in my 10 TB library, they're going to be useful and future-proof.

I have detected 5 drives with files starting to corrupt over the past 25 years, thanks to good backup practices and file hashing. Files will start corrupting long before you see chkdsk and SMART errors.
Post Reply