Detect duplicate files inside the search result list

Christian.Ziegelt
Posts: 11
Joined: Sun Jun 02, 2024 1:57 pm

Post by Christian.Ziegelt »

Hi everybody,

Maybe some of you have already run into the same topic and can share your ideas about it.

As a software developer I often use Everything to filter source code files and try to match changes across many different folders.
Basically it's something that git can do if folder / libraries / code is organized in a good way.

As most know - that's not always the case.

Sometimes one ends up with 20 folders containing the same file - where most of the copies are 99%-100% the same.
But it really sucks to manually pick two of them at a time and run a diff/merge tool to find the ones that differ.

What I'm searching for is an extra column in the result list showing a hash value or some other criterion for "same or similar".

I know there are lots of dupe finding tools - but all I tested did not help me.
I want to explicitly compare the files in the result list of my own search.

It does not need to be "instant" - so the hash does not have to be in the index - it's fine to calculate it on demand for the selected files or the whole result list.

Thanks for your ideas and for pointing me to something existing (if available :-) )

Best
Christian
horst.epp
Posts: 1456
Joined: Fri Apr 04, 2014 3:24 pm

Post by horst.epp »

You can have such columns in Everything, but how will a hash column help here?
If a file differs from another one at the binary level, its hash value will differ as well.
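
To illustrate the point, here is a minimal Python sketch (outside of Everything, with made-up file contents) showing that a cryptographic hash only answers "exactly the same or not". A one-character change gives an unrelated digest, so a hash column can group exact duplicates but says nothing about files that are merely similar.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical file contents, used only for illustration.
original = b"int main() { return 0; }\n"
identical = b"int main() { return 0; }\n"
near_copy = b"int main() { return 1; }\n"   # differs in a single character

print(sha256_hex(original) == sha256_hex(identical))  # True
print(sha256_hex(original) == sha256_hex(near_copy))  # False
```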
Christian.Ziegelt
Posts: 11
Joined: Sun Jun 02, 2024 1:57 pm

Post by Christian.Ziegelt »

Yes, you're right.

Basically 17 of those 20 results will be binary identical.
Unfortunately, I can't tell them apart by filename, size or date - as these are mostly the same (or "corrupted" by git / Windows / etc.).

So I'd like to see which ones are "exactly" the same so I don't need to bother with those - or can even delete them and handle the rest.

My ideal would be - grouping files by "category" in the result list (sorting is kind of grouping here) and then coloring the result lines in Everything by these "groups".

For example: sort by hash and color all results with the same hash in the same color.
Or sort by size and color all files of the same size in one color.

This would help a lot with working efficiently.
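
As a rough sketch of the "sort, then color by group" idea, independent of how Everything itself does it: sort the rows by the chosen key and give each run of equal keys the next color from a small palette. The row fields (`path`, `size`, `hash`) and the palette names are assumptions made up for this example.

```python
from itertools import groupby

def color_bands(rows, key, palette=("white", "lightgray")):
    """Sort rows by key and assign one palette color per run of equal keys."""
    ordered = sorted(rows, key=key)
    colored = []
    for band, (_, group) in enumerate(groupby(ordered, key=key)):
        color = palette[band % len(palette)]  # alternate colors between groups
        colored.extend((row, color) for row in group)
    return colored

# Hypothetical result rows with shortened, made-up hash values.
rows = [
    {"path": "a/util.cpp", "size": 120, "hash": "aa11"},
    {"path": "b/util.cpp", "size": 120, "hash": "bb22"},
    {"path": "c/util.cpp", "size": 120, "hash": "aa11"},
]

for row, color in color_bands(rows, key=lambda r: r["hash"]):
    print(color, row["hash"], row["path"])
```

Passing `key=lambda r: r["size"]` instead groups by size; with the sample rows above all three files then land in one color, which is exactly the case where size alone cannot tell them apart.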
NotNull
Posts: 5517
Joined: Wed May 24, 2017 9:22 pm

Post by NotNull »

When sizes of files differ, they very likely have a different hash too.

Does the following help? (requires Everything 1.5)

ext:cpp dupe:size,md5
(replace cpp with the actual file extension you are looking for).

Duplicates will be grouped together, separated by a line.
To use color-coded grouping, see here
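
For anyone who wants the same kind of check outside of Everything, the idea behind comparing size first and MD5 second can be sketched in Python. This is only an illustration under my own assumptions, not how Everything implements `dupe:`; the function names `file_md5` and `find_duplicates` are made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(paths):
    """Return lists of paths that share both size and MD5 digest."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[Path(path).stat().st_size].append(path)

    by_content = defaultdict(list)
    for size, same_size in by_size.items():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, skip the hashing
        for path in same_size:
            by_content[(size, file_md5(path))].append(path)

    return [group for group in by_content.values() if len(group) > 1]
```

Checking the size first is cheap and means only files that share a size with another file are ever read and hashed.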
void
Developer
Posts: 17149
Joined: Fri Oct 16, 2009 11:31 pm

Post by void »

Everything 1.5 will have support for a sha256 column (and other hash columns).

Temporarily add the sha256 column when searching for cpp files:

*.cpp addcol:sha256
Christian.Ziegelt
Posts: 11
Joined: Sun Jun 02, 2024 1:57 pm

Post by Christian.Ziegelt »

Wow - exactly what I was searching for.
Unfortunately I was a bit impatient and invested 4h to build a helper tool with the SDK.

But since it's part of Everything now I'll skip the development of this tool ;-)

Would be great if I could enable / disable this column based on - let's say - the filter profile.
So if I do a quick search for code files I don't want the hash calculation to slow things down - but if I'm looking for dupes - I select the other filter setting.
therube
Posts: 5056
Joined: Thu Sep 03, 2009 6:48 pm

Post by therube »

If you Index (a hash, or whatever) you can limit the index based on file type (extension).

If you don't Index (a hash), you can still "gather" the hash.
So long as the hash column is outside of the current viewport, the hash will not be computed.
Bring the column into focus, & it will.

So if you add an md5 hash column, & your window is sized such that the md5 column is not seen by default (off to the right, outside of the normal view), hashes will not be computed. Scroll to the right such that the md5 column becomes visible, & the hashes will then be computed (& only for those files that are currently visible, too).
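
The behavior described above can be modeled with a small Python sketch: hashes are computed only for rows inside the visible window, only while the column is visible, and then cached. This is a toy model with made-up names (`LazyHashColumn`, `render`), not Everything's actual implementation, and it hashes in-memory bytes instead of real files.

```python
import hashlib

class LazyHashColumn:
    """Toy model of a hash column that is only filled in for visible rows."""

    def __init__(self, read_bytes):
        self._read_bytes = read_bytes  # callable: path -> file content as bytes
        self._cache = {}
        self.computed = 0              # number of hashes actually calculated

    def value(self, path):
        if path not in self._cache:
            self._cache[path] = hashlib.md5(self._read_bytes(path)).hexdigest()
            self.computed += 1
        return self._cache[path]

    def render(self, rows, first_visible, visible_count, column_visible):
        """Return (path, hash) pairs for the visible window only."""
        if not column_visible:
            return []                  # column scrolled out of view: no hashing
        window = rows[first_visible:first_visible + visible_count]
        return [(path, self.value(path)) for path in window]

# Hypothetical result list of 1000 files held in memory.
files = {f"file{i}.cpp": f"content {i}".encode() for i in range(1000)}
rows = list(files)
column = LazyHashColumn(files.__getitem__)

column.render(rows, 0, 30, column_visible=False)
print(column.computed)  # 0
column.render(rows, 0, 30, column_visible=True)
print(column.computed)  # 30
column.render(rows, 10, 30, column_visible=True)
print(column.computed)  # 40, the 20 overlapping rows come from the cache
```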


You could add a bookmark, where the hash column is included (where normally it might not be).
Last edited by therube on Thu Jun 06, 2024 4:55 pm, edited 1 time in total.
Reason: s/scene/seen/
Christian.Ziegelt
Posts: 11
Joined: Sun Jun 02, 2024 1:57 pm

Post by Christian.Ziegelt »

Thanks for the feedback - I deleted my question as I saw it live as described (mostly) :-)

So I'm using a filter to display hash values column and it works like a charm.
I only need to figure out how to sort by one column (like name or size) but still color-group by the hash column.

Or maybe sort by first and second column.

That way I could sort by file name first, so equal names are grouped, and then sort by hash value second, so identical files end up together within each file name group.
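
The two-level ordering described here can be sketched in Python with a composite sort key: name first, hash second. The rows, folder names and shortened hash values are invented for the example, and the helper `sort_by_name_then_hash` is a made-up name.

```python
from operator import itemgetter

def sort_by_name_then_hash(rows):
    """Primary sort by file name, secondary sort by hash within each name."""
    return sorted(rows, key=itemgetter("name", "hash"))

# Hypothetical result rows.
rows = [
    {"name": "util.cpp", "folder": "proj_a", "hash": "bb22"},
    {"name": "main.cpp", "folder": "proj_a", "hash": "cc33"},
    {"name": "util.cpp", "folder": "proj_b", "hash": "aa11"},
    {"name": "util.cpp", "folder": "proj_c", "hash": "bb22"},
    {"name": "main.cpp", "folder": "proj_b", "hash": "cc33"},
]

for row in sort_by_name_then_hash(rows):
    print(row["name"], row["hash"], row["folder"])
```

With this ordering, the two identical copies of `util.cpp` (hash `bb22`) sit next to each other inside the `util.cpp` block, while the differing copy (`aa11`) stands apart.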