identifying duplicates from checksums
I was wondering if duplicate files could be quickly identified by generating and storing checksums for all files. I saw a response that this would be too much of a burden on system resources, particularly with respect to files that are constantly changing. This seems like a good point with respect to constantly changing files, but many files have content that changes rarely if ever. So maybe checksums could be calculated and stored for all files after a certain time has passed without data change. Does anybody know if there is already such a mechanism in place in a filesystem to do this, or if there is utility software for this purpose?
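A minimal sketch of the idea in Python, under stated assumptions: a checksum is only calculated once a file has gone unmodified for `min_age` seconds, and a stored checksum is reused as long as the file's size and modification time still match. The names `cached_hash`, `min_age` and the dictionary layout of the cache are hypothetical and not taken from any existing filesystem or utility.

```python
import hashlib
import os
import time


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_hash(path, cache, min_age=3600, now=None):
    """Return the SHA-256 of path, reusing cache[path] if size and mtime match.

    Returns None for files modified less than min_age seconds ago
    (treated as "still changing" and not hashed).
    """
    st = os.stat(path)
    now = time.time() if now is None else now
    if now - st.st_mtime < min_age:
        return None
    entry = cache.get(path)
    if entry and entry["size"] == st.st_size and entry["mtime"] == st.st_mtime_ns:
        return entry["sha256"]  # unchanged since last scan: no re-read needed
    digest = file_sha256(path)
    cache[path] = {"size": st.st_size, "mtime": st.st_mtime_ns, "sha256": digest}
    return digest
```

The cache here is a plain dictionary; persisting it between runs (for example as a JSON file) is left out of the sketch.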
Re: identifying duplicates from checksums
AFAIK no file system has such an integrated feature.
Re: identifying duplicates from checksums
There is voidhash, by someone we know.
(Search will turn up at least 2 threads.)
---
Burden? Not sure it would be any more of a burden than anything else?
Now, it would/could be a burden during the actual scanning phase (IOW, Everything would be busy during that time).
And then the fact that hashes are not particularly ? expected to be, not lost, per se.
---
Programs, two of the best IMO:
AllDup & DuplicateCleaner.
---
Hash programs. Oh, that's a real tough one - depending on your needs.
Fast, xxHash.
ramble (& still a work in progress)...
Code:
---
hash:
-r, recursive (directory tree)
-c, check (of .md5 .sha1 ...)
-xxh, xxh hash & if not, then what?
file, individual file(s)
dir, directory(s)
LFN & Unicode ?
vhash (voidhash)
-r, but not "standard"
-c, no
-xxh, no, but has MD5 & up
file, NO - only on directories, not individual file(s) !
dir, YES
LFN & Unicode, most likely are not an issue
-r, (is by default)
BUT its recursion writes a hash file into EVERY directory (WITHOUT touching date/time ;-))
rather than a single hash file containing all hashed files
output is standard, in a DOS manner (so no, '*filename', only 'filename')
works with Salamander (& anything else that would deal with "standard" hash files) [of which there is no "standard" ;-)]
LFN & Unicode, most likely are not an issue
hash (FcHash)
-r, but not "standard"
-c, NO
-xxh, YES + MD5 & up
file, YES
dir, YES
LFN & Unicode, most likely are not an issue
no output is "standard"
- so only good for checking against "other" of its own output
(like [externally] comparing 2 output files of its creation)
IOW, not easily working with other tools
hash --recurs --non_stop /1/ > out1
hash --recurs --non_stop /2/ > out2
compare out1 out2
doing something like DEL a dir_tree is NOT really feasible
-r, BUT it only /demarks/ directories with a directory /HEADER/ - NOT as dir/name !
c:/out/ :
file 1
file 2
file 3
c:/out/X/ :
xfile 1
xfile 2
xfile 3
so... you cannot then manipulate all the files in the list
like if you wanted to delete them all
%s/^/DEL "/
%s/$/"/
del "file 1"
del "file 2"
del "file 3"
del "x/xfile 1" - NOT FOUND
del "x/xfile 2" - NOT FOUND
del "x/xfile 3" - NOT FOUND
if all you wanted is something to COMPARE directory trees
then this is fine (& it even defaults to XXH3 which is theoretically the quickest...)
fsum
-r, YES
-c, YES
-xxh, no (OK, so be it, but does have many other hashes, incl. sha512, so /kind of/ futureproof)
file, YES
dir, YES, but... <FAIL, also>!
UNIX-like, so includes * in its output
which Salamander does not like! (%s/\s\*/ /), but a switch to turn it off... ;-)
default output is MD5
& (with -jnc switch) is exactly like md5sum
but, -sha1 (& most all else)
then oddly "says" it is sha1, so, *file1, becomes ?SHA1*file1 (or ?SHA256*file1)
(again, if there were a switch to turn it off...)
dir <FAIL> !!!
so if /test/X/ & /test/X/1.txt & /test/X/X.txt
fsum X/* says:
hashof 1.txt
*BUT* any file name that starts with an X (like X.txt) gives:
NOT FOUND ******* X.txt <------- HUH ?
777 (7zip)
-r, YES, by default
-c, NO
-xxh, NO, also no MD5 - essentially... only SHA1 !
file, YES
dir, YES
LFN & Unicode, most likely are not an issue
output is non-standard, but workable
xxh (xxhsum)
-r, NO !!! - a great limitation (so anything done, can only deal with a single directory)
-c, YES (but given it doesn't recurse... fine if only dealing with 1 directory, but otherwise...)
-xxh, of course ;-) - though ONLY xxh, NOTHING else
for checking/comparing hashes of sets of files (passed to it), it's fine :-)
md5sum | sha1sum
-r, NO
-c, YES
-xxh, ONLY does MD5 (SHA1) - NOTHING MORE !
- only thing different here is you can do stuff like pipe to it or use STDIN
so you can, DIR | md5sum - not that i'm sure what value there is in that ?
(& both of these are rather slow) [so in general, a no-go]
(both functions are available in 'busybox')
hashmyfiles (nirsoft)
didn't seem particularly feasible (for the command line)
-
void
for dir (ONLY)
so long as you're OK with a hash <sidecar, is that what he calls it ?>
in /EVERY/ directory... (& if that's what you want, probably no easier
way to do it...)
(but, i'm still missing the use case?
for static, relatively static directory structures, for easy comparison /between
such/, hashtree.bat is fine, i'd think & does all that void does, difference
being a single file with all the hashes vs. a hash in every directory.)
hash
in general, is fine
but output is non-standard
so if you wanted to manipulate... it'll be hard
fsum
in general, is fine
standard output (with addition of '*')
- odd issue with dir
fsum -r X, is fine, finding X/x.txt
fsum -r X/*, or fsum X/x*, FAILS !!
AH!
/d specifies the "working directory"
fsum /dX *, sets the "working directory" to X, then searches for *
^--- WORKS
fsum X *, (seemingly) does an fsum on X, & as X is a directory * then finds the files within
^--- WORKS
fsum /dX/*, presumably ? sets..., oh, it must be looking for /directories/ within X,
rather than the /files/ in the directory X
^--- this fails
777
limited to sha1 ONLY (essentially)
in general, is fine
standard output (with addition of file 'Size')
but is workable
does (seemingly) CWD, then subdirectories
- where fsum does subdirectories then CWD
(so file ordering between the 2 do differ)
AH!
777 does NOT output its file listing in a consistent order as it traverses a tree?
- that makes it far more difficult to compare (what should be, relatively, the same outputs)!
/unix/lss /unix/lss
/unix/sed /unix/sed
/unix/TAR/ /unix/unz
/unix/TAR/gzip /unix/zip
/unix/unz /unix/TAR/
well, it does - kind of, but not in a way that is beneficial for being able to compare
2 trees... /each/ tree can be output, consistently, but 2 of the same trees - presumably
cause of the way, order ?, they happened to be written to, will be /traversed/ in a
different order, resulting in differing order listings for the same "data" trees.
- if it were simply a (alpha) sequential, top down, all would be fine, but as it is,
without consistent output, "value" drops for 777...
fsum IS consistent in its output!
xxh
limited to xxh ONLY
limited to files, or directories (whatever can be passed to it)
but does NOT recurse !
does do -c
/probably/ will have issues with LFN ?
so, for what it does, it is fine
md5sum/sha1sum
too slow (compared to the others)
too limited in scope
- though... they are what all else sprung from ;-)
hashtree.bat:
:: HashTree - SjB 03-08-2022
FcHash.exe --xxh --recurs --non_stop . 2>&1 | tee 0hashtree
HARDWARE, BABY !!!
K:\fcp>tail 00*
==> 0007 <==
WARNING: Cannot open 1 file
TimeThis : Command Line : 7hash
TimeThis : Start Time : Thu Oct 20 16:35:01 2022
TimeThis : Command Line : 7hash
TimeThis : Start Time : Thu Oct 20 16:35:01 2022
TimeThis : End Time : Thu Oct 20 16:39:16 2022
TimeThis : Elapsed Time : 00:04:15.147
==> 0007-home <==
WARNING: Cannot open 1 file
TimeThis : Command Line : 7hash
TimeThis : Start Time : Thu Oct 20 23:50:41 2022
TimeThis : Command Line : 7hash
TimeThis : Start Time : Thu Oct 20 23:50:41 2022
TimeThis : End Time : Fri Oct 21 00:02:17 2022
TimeThis : Elapsed Time : 00:11:35.923
K:\fcp>
K:\ffc\LIB>tail 00*
==> 0007-home <==
WARNING: Cannot open 1 file
TimeThis : Command Line : 777.exe h * -r -scrcsha1
TimeThis : Start Time : Thu Oct 20 23:28:42 2022
TimeThis : Command Line : 777.exe h * -r -scrcsha1
TimeThis : Start Time : Thu Oct 20 23:28:42 2022
TimeThis : End Time : Thu Oct 20 23:42:43 2022
TimeThis : Elapsed Time : 00:14:01.374
==> 0007-office <==
WARNING: Cannot open 1 file
TimeThis : Command Line : 7hash
TimeThis : Start Time : Thu Oct 20 16:26:48 2022
TimeThis : Command Line : 7hash
TimeThis : Start Time : Thu Oct 20 16:26:48 2022
TimeThis : End Time : Thu Oct 20 16:32:01 2022
TimeThis : Elapsed Time : 00:05:13.716
K:\ffc\LIB>
we're talking HUGE time diff here...
is it /simply/ hardware (CPU, & guess, mainly ?)
or is there something more involved ?
(like... 1 USB port vs another, both USB 2.0 AFAIK
- all are black, none are blue...)
"slow" (which i've never considered "slow")
i5-3570k, 4-core 3.4 GHz, 16 GB RAM, HDD's (but i don't see where that would matter, as i'm accessing USB flash drive)
"fast" dell AIO, i7-?, 8 GB RAM, SSD (but again, i'm accessing USB flash drive)
fsum can specify a "base" /directory/, followed by a /file/ spec to hash
fsum -d/tmp/ *.bat
- hash *.bat files in /tmp/ directory
rhash - CANNOT ???
rhash /tmp/*.bat
- /tmp/*.bat: no such file or directory ???
fsum & exf are ALIKE, with diffs
hash methods
/d directory spec, /dX vs. /d X (space or not)
exf is more "forgiving" & more interoperable - maybe
exf -c can check fsum checksum files
exf -c /may/ be able to check void checksum files
exf can do sha1 in 2 ways; ?SHA1* & just '*'
can output fullpath (rather than none or just relative)
K:/lib/dac vs /lib/dac
switches can be in any order, basically
exf * -sha1 vs. exf -sha1 *
-c can be verbose or not
most all support forward-slash
sha1 c:/tmp/out vs. sha1 c:\tmp\out
directory traversal is NOT the same between fsum & exf !!!
exf may ? be using more RAM than fsum, 15% vs. 5% PERHAPS <--- NEED 2 VERIFY !?
***NONE*** of them traverse "the same" trees - with different parents, equally - potentially !!!
/LIB/ETC2/BASIS/JCS vs /FFC/ETC2/BASIS/JCS
...file1 ...file1
...file2 ...file4
...file3 ...file2
...file4 ...file3
I believe FastCopy might (or might have option) to store hashes in ADS?
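One way around the traversal-order problem described in the notes above is to not rely on listing order at all: key every hash by its path relative to the tree root and compare the two sets. The following Python sketch illustrates that; `tree_manifest` and `compare_trees` are hypothetical names, SHA-1 is chosen only because most of the tools above support it, and none of this reflects how fsum, exf, FcHash or 7-Zip work internally.

```python
import hashlib
import os


def tree_manifest(root):
    """Map relative path (forward slashes) -> SHA-1 hex digest for every file under root."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            h = hashlib.sha1()
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            manifest[rel] = h.hexdigest()
    return manifest


def compare_trees(root_a, root_b):
    """Return (only_in_a, only_in_b, differing), each a sorted list of relative paths."""
    a, b = tree_manifest(root_a), tree_manifest(root_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_a, only_b, differing
```

Because the paths are relative to each root, trees with different parents (such as /LIB/ETC2/BASIS/JCS and /FFC/ETC2/BASIS/JCS) compare equal when their contents match, whatever order the files were written in.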
Re: identifying duplicates from checksums
I've been looking for something to give me a single hash of an entire directory, not just individual hashes of each file.
Any ideas please?
Re: identifying duplicates from checksums
May be the following:
HashCheck Shell Extension (archive) can be used to get a hash of a directory. This can be done by:
Using HashCheck on the directory.
This will generate a .md5 file which contains a listing of the hashes of each file in that directory, including all files in sub-directories.
Use HashCheck again on the .md5 file it generated above.
This final generated .md5 file contains a hash of the entire directory.
http://code.kliu.org/hashcheck/
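The two-step procedure above amounts to hashing a listing of per-file hashes. A rough Python sketch of the same idea follows; the `digest *relative/path` line format and the sort-by-path step are assumptions made for the example, so the result will not match what HashCheck itself produces.

```python
import hashlib
import os


def directory_digest(root):
    """MD5 over a path-sorted 'digest *relative/path' listing of every file under root."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as f:
                # Reads the whole file at once; acceptable for a sketch.
                file_md5 = hashlib.md5(f.read()).hexdigest()
            entries.append((rel, file_md5))
    entries.sort()  # order by path so the result does not depend on walk order
    listing = "".join(f"{digest} *{rel}\n" for rel, digest in entries)
    return hashlib.md5(listing.encode("utf-8")).hexdigest()
```

Two copies of a folder give the same digest only if every relative path and every file's content match.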
Re: identifying duplicates from checksums
7-zip gives you "something", just not sure what it is, but maybe it is a "directory" (for data:, for data and names:) hash?
(And you could parse its output, if that is in fact the expected data...)
Code:
7-Zip 22.01 (x64) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
Scanning
7 files, 267227 bytes (261 KiB)
SHA1 Size Name
---------------------------------------- ------------- ------------
5a2e3423e5d4890871810a8153fedb4a32fe85ac 283 0hash - Copy.md5
bc70620da9b1f2824624d43b707a30685b3daaf4 335 0hash.md5
e7726eefe954042687f527c7e3a4a36dc21f71f9 3210 fcp-home-LIB-ntfs.TXT
6a5bd73a3906e4aed2e384b4c1c4e34b96f8b20c 112188 UsbTreeView381-home.TXT
c2a006a4aed4226325eac306d31a53cf4ee1555a 149650 UsbTreeView381-office.TXT
381c8c1a61a774d6dffbed64119ca170ce2d84e5 869 UsbTreeView381.ini
7e668b379e2314a5c784bb8b7b808881c8b37da1 692 xxx
---------------------------------------- ------------- ------------
df90fa505f81103fdee9f82fc919112ecc16ac88-00000004 267227
Files: 7
Size: 267227
SHA1 for data: df90fa505f81103fdee9f82fc919112ecc16ac88-00000004
SHA1 for data and names: 0589c73b768e62094d2d98d26fffb81523b4cc39-00000004
Re: identifying duplicates from checksums
Thanks, there seem to be a few ways. I'm using a programme called file hashes, which appears to be similar to HashCheck. I'm using 7zip on the directory to compress, or using the hash facility in 7zip to generate a file, then saving the hash file. Or I'm just zipping the directory into two different locations and using the hash tab on the properties menu to compare.
Seems to work, but it's a lot of faffing around; there doesn't appear to be an easy way. I'm quite surprised that there isn't anything to hash directories easily.
Re: identifying duplicates from checksums
See this link for a list of the active and inactive HashTab/HashCheck project forks, and pick the one to your liking. They're all effectively the same design and system, with some differences. HashCheck (not HashTab) is the only fork I know of that gives the ability to generate and verify checksum digest files for individual files, selected groups of files, or entire folder structures.
https://github.com/gurnec/HashCheck/iss ... -886823066
Re: identifying duplicates from checksums
Thanks, I'll look at the latest version of HashCheck.
Completely off subject, over the past couple of years I've subscribed to various GitHub projects to tell me when there is an update. Do you have any idea if I can get a list of everything I've subscribed to?
Re: identifying duplicates from checksums
Visit https://github.com/watching and https://github.com/notifications/subscriptions while logged in.
You can find this link by going to your Settings > Notifications (https://github.com/settings/notifications)
Re: identifying duplicates from checksums
That's brilliant, thanks so much. I've been looking for that for ages...
Re: identifying duplicates from checksums
Thanks all for the great information and links!
Re: identifying duplicates from checksums
To find potential duplicates in Everything, include the following in your search:
dupe:size
This search is instant.
To find duplicates in Everything, include the following in your search:
dupe:size;sha256
This search will calculate the checksum for files that share the same size.
ReFS supports CRC64
I'll look into adding this property to Everything.
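The two searches above follow a common two-stage pattern: group files by size first (cheap, metadata only), then hash only the files whose size is shared. A minimal Python sketch of that pattern is below; it is an illustration under that assumption, not a description of how Everything implements `dupe:`, and `find_duplicates` is a hypothetical name.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(root):
    """Return lists of paths under root whose size and SHA-256 both match."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size: cannot be a duplicate, so the file is never read
        by_hash = defaultdict(list)
        for path in paths:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
        duplicates.extend(sorted(group) for group in by_hash.values() if len(group) > 1)
    return duplicates
```

Files with a unique size are never opened, which is why the size-only stage is so much cheaper than the checksum stage.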
Re: identifying duplicates from checksums
That's great, Void! Is the sha256 calculation persistent so that it doesn't have to be recalculated if that search is repeated?
Re: identifying duplicates from checksums
The sha256 calculation will last until you close the search window.
You can change the search and instantly go back to your previously completed dupe:size;sha256 search.
I recommend using sha256sum to create a .sha256 file in each folder.
Everything can quickly pull the sha256 values from the .sha256 file with the sha256sum SHA-256 property.
dupe:size;sha256sum-sha256
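For anyone wanting to read such sidecar files outside of Everything, here is a small Python sketch that parses md5sum/sha256sum-style lines (`<hex digest>`, a space, then either a second space or `*`, then the file name). Treating lines that start with `#` or `;` as comments is an assumption, as is the name `parse_checksum_file`; this is not Everything's own parser.

```python
def parse_checksum_file(text):
    """Parse md5sum/sha256sum-style lines into {filename: lowercase hex digest}."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or (assumed) comment line
        digest, _, name = line.partition(" ")
        name = name.lstrip(" ")
        if name.startswith("*"):
            name = name[1:]  # '*' marks binary mode in sha256sum output
        result[name] = digest.lower()
    return result
```

For example, `parse_checksum_file(open("folder.sha256", encoding="utf-8").read())` would give a dictionary that can be compared against freshly computed hashes; the file name `folder.sha256` is only a placeholder.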
Re: identifying duplicates from checksums
Is it possible to get one hash for an entire folder, rather than the individual files?
Re: identifying duplicates from checksums
I'll consider a property to do this.
Thank you for the suggestion.
For now, please consider 7zip to calculate the folder CRC/SHA.
Re: identifying duplicates from checksums
Thanks, a folder hash would be useful.
Re: identifying duplicates from checksums
Harryray, do you mean a single checksum calculated from the data in all the files in the folder? If so, how can that be used? I guess it just tells you that no file in the folder has been added / deleted / changed? Or maybe you mean a single checksum file that contains multiple checksums, one for each file in the folder?
Re: identifying duplicates from checksums
Various reasons, for example, if I want to copy a folder and make sure it's been copied correctly, or just to check whether the folders are identical.
Re: identifying duplicates from checksums
"Is it possible to get one hash for an entire folder, rather than the individual files?"
Be careful with that - as the way a directory is parsed will make a difference.
If you always use the same tool, & that tool does not change, it shouldn't matter. Otherwise...
"if I want to copy a folder and make sure it's been copied correctly, or just to check whether the folders are identical"
Without thinking too much into that, I'm not sure there is a good enough reason for doing it like that, as compared to comparing the hashes of the individual files within. (Obviously you can write those to a single check file, & with that you only need to compare that one check file against another output check file, or against the copied files.)
Code:
T:\K-ORSAIR\Log>777 h * -scrcSHA1
7-Zip 22.01 (x64) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
Scanning
8 files, 1319415 bytes (1289 KiB)
SHA1 Size Name
---------------------------------------- ------------- ------------
ee86395d36c52773a7ae0aa6a13bcbdcc3068bf0 1313166 20221115-195929-0.log
01750351dca6f91b37bf305744d90dfbdad9dbf5 979 20221116-001646-0.log
11ecb2eca08a228ef468868fc99352395b1c81f8 1239 20221116-001744-0.log
6297de9cff646c0be16080423f00d8b6d8efa626 1239 20221116-002009-0.log
236396fa87738d0be7f85d67a6acd6442507e93b 1327 20221116-002103-0.log
72dd99c5993e1e575217c6af727e827fa9dc38a8 388 Log.sha1
4d2875229f5de8e627f38de486560d015e4f2b23 385 Log2.sha1
411282674d0f66f3bd79bc78d59fdcb918646b71 692 log3.sha1
---------------------------------------- ------------- ------------
85faf581c17aaa65d3b4b04364ca46471884477e-00000004 1319415
Files: 8
Size: 1319415
SHA1 for data: 85faf581c17aaa65d3b4b04364ca46471884477e-00000004
SHA1 for data and names: 4ccf490bdad59382671d3b883a3ba545b73bfea9-00000004
Everything is Ok
T:\K-ORSAIR\Log>dir /b | sha1sum
77b804d9ebcb71e3660cfce17f09b3c630f1e815 *-
T:\K-ORSAIR\Log>
does the /ordering/ of the files make a difference ?
ah, it does!
so depending on how a particular utility /lists/ a directory
output may differ !!!
Code:
T:\K-ORSAIR\Log>dir /b | sha1sum
105be8f554fd4cc877d06cf327dc37d5f0b7b0fe *-
T:\K-ORSAIR\Log>dir /b /os | sha1sum
26de98c63110a755bf17107e635b2970a861cf1f *-
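The same effect can be reproduced in a few lines of Python. This sketch hashes a listing of three of the file names from the transcript above, once in name order and once in size order (385, 388, 692 bytes), and shows that sorting the names first gives a stable result. The helper name `listing_hash` and the `\r\n` line ending are assumptions, so the digests will not match the `dir /b | sha1sum` values shown.

```python
import hashlib


def listing_hash(names):
    """SHA-1 of a one-name-per-line listing, in exactly the order given."""
    data = "".join(name + "\r\n" for name in names).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


by_name = ["Log.sha1", "Log2.sha1", "log3.sha1"]
by_size = ["Log2.sha1", "Log.sha1", "log3.sha1"]

# Same names, different order: the two digests differ.
print(listing_hash(by_name) == listing_hash(by_size))

# Sorting before hashing removes the dependence on listing order.
print(listing_hash(sorted(by_name)) == listing_hash(sorted(by_size)))
```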
Code:
T:\K-ORSAIR\Log\X>777 h -scrcsha1
7-Zip 22.01 (x64) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
Scanning
1 file, 334 bytes (1 KiB)
SHA1 Size Name
---------------------------------------- ------------- ------------
a7cd0d7356f553cec274181707afc78a1bfe29f4 334 x
---------------------------------------- ------------- ------------
a7cd0d7356f553cec274181707afc78a1bfe29f4 334
Size: 334
SHA1 for data: a7cd0d7356f553cec274181707afc78a1bfe29f4
Everything is Ok
T:\K-ORSAIR\Log\X>dir/b | sha1sum
b57bc04cd1994067ff0f49c776e5f7553d3aeed4 *-
T:\K-ORSAIR\Log\X>dir
Volume in drive T is TOSH_8TB
Volume Serial Number is 6044-BACB
Directory of T:\K-ORSAIR\Log\X
11/16/2022 09:30 AM <DIR> .
11/16/2022 09:30 AM <DIR> ..
11/16/2022 09:30 AM 334 x
1 File(s) 334 bytes
and also as you can see, a "directory" hash of a directory with only a single file is the same hash as the file itself.
Re: identifying duplicates from checksums
Just noting that with 7-Zip version 24.05 it can now also do XXH64 & BLAKE2sp hashes.
Code:
C:\P\2P>7hash.bat *
C:\P\2P>777.exe h -scrcXXH64 *
7-Zip 24.05 (x64) : Copyright (c) 1999-2024 Igor Pavlov : 2024-05-14
Scanning
6 files, 45377 bytes (45 KiB)
XXH64 Size Name
---------------- ------------- ------------
6AFD3BB85F8FB268 5512 APT
6F51B77A5B55BE32 4293 MONTHLY.CASH
640B92242DAF8175 24417 RENTROLL
A248DB093F2110EE 5456 SEC.DEP
05623D5EE68F1510 5336 TEN
467A1DA3626C8438 363 VACANCY
---------------- ------------- ------------
2C7FBB6270B19C45-00000002 45377
Files: 6
Size: 45377
XXH64 for data: 2C7FBB6270B19C45-00000002
XXH64 for data and names: B3148A072B0CE03B-00000003
Everything is Ok
C:\P\2P>
Re: identifying duplicates from checksums
"I've been looking for something to give me a single hash of an entire directory, not just individual hashes of each file."
(I've not messed with it, particularly.)
paq Hash file or directory recursively.
Powered by blake3 cryptographic hashing algorithm.
See also, Everything: Is there is away to generate folder hash ?
Re: identifying duplicates from checksums
Hi Raccoon,
"How I can get a list of everything I've subscribed to?"
Visit https://github.com/watching and https://github.com/notifications/subscriptions while logged in.
You can find this link by going to your Settings > Notifications (https://github.com/settings/notifications)
A bit old this, but do you know of any way I can download a list of watched files from GitHub, all at once?
Thanks.
Re: identifying duplicates from checksums
7-Zip 24.09
7-Zip now can calculate the following [additional] hash checksums: SHA-512, SHA-384, SHA3-256 and MD5