Recursive dupes

Discussion related to "Everything" 1.5 Alpha.
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Recursive dupes

Post by anmac1789 »

Is there a way to do recursive duplicate searching? For example, I want to find all duplicates by comparing custom columns across 2 folders like these:

C:\folder1\subfolder1-1\subfolder1-2
D:\folder1\testing folders\subfolder1-1\subfolder1-2

Let's say folder1\ in C:\ and D:\ matches, but subfolder1-1\ in C:\folder1\ and D:\folder1\testing folders\ doesn't match. How can I make sure I don't delete folder1\ on either volume without knowing that everything below it doesn't match?
raccoon
Posts: 1017
Joined: Thu Oct 18, 2018 1:24 am

Re: Recursive dupes

Post by raccoon »

What we need is an easy way to produce a column that shows relative paths below the starting directories of A and B that are being compared for dupes. I cannot think of an easy means to do this yet.

Everything does have a Fullpath column, and also a way to turn on fullpaths under the Name column, but no convenient way to snip off the D:\Backup2005\ and the Z:\Backup2009\ prefixes so that the remainder relative paths can be name-duped against each other. You would have to perform some string manipulation for each scenario in order to design such a col1 or regmatch1 column yourself.
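
To make the idea concrete outside Everything, here is a minimal Python sketch of "snip off the prefix, then name-dupe the remainder". The function name relative_dupes and the sample paths in the usage note are made up for illustration; nothing here is Everything syntax.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def relative_dupes(paths, roots):
    """Return relative paths that occur under more than one of the given roots."""
    groups = defaultdict(dict)  # relative path -> {root: full path}
    for full in map(PureWindowsPath, paths):
        for root in map(PureWindowsPath, roots):
            try:
                rel = full.relative_to(root)
            except ValueError:
                continue  # this path does not sit under this root
            if rel.parts:  # skip the root folder itself
                groups[str(rel).lower()][str(root)] = str(full)
            break
    # Keep only relative paths that were seen under at least two roots.
    return {rel: sorted(hit.values()) for rel, hit in groups.items() if len(hit) > 1}
```

Calling relative_dupes with a list of full paths and the two roots D:\Backup2005 and Z:\Backup2009 returns only the sub-paths that exist on both sides, each with its pair of full paths.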
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

raccoon wrote: Sat Feb 11, 2023 3:25 am What we need is an easy way to produce a column that shows relative paths below the starting directories of A and B that are being compared for dupes. I cannot think of an easy means to do this yet.

Everything does have a Fullpath column, and also a way to turn on fullpaths under the Name column, but no convenient way to snip off the D:\Backup2005\ and the Z:\Backup2009\ prefixes so that the remainder relative paths can be name-duped against each other. You would have to perform some string manipulation for each scenario in order to design such a col1 or regmatch1 column yourself.
I think a better approach would be an if-then-else statement in the search syntax...

Another idea I have is to do something like ViceVersa Pro does. It is a folder comparison tool. It lacks an organized view of dates and times, but it places a red dot on either side of a folder comparison to show which side is older/newer. This only works for files, and their company is reluctant to ship updates. Maybe you could incorporate some of their approach to comparing folders/files? Their website is: https://www.tgrmn.com/?camp=goog_cmt&gc ... gJB1fD_BwE

Another prospect is to have a folder tree view together with custom columns? You already have a folders view to begin with..
void
Developer
Posts: 16672
Joined: Fri Oct 16, 2009 11:31 pm

Re: Recursive dupes

Post by void »

How large are the folders?

Would calculating the sha256 sum of a folder be useful?
Calculate the sha256 sum of just content or filenames and content? or an option to choose?

I have a compare folders feature in the works (has been on my TODO list for a long time)
Check Tools -> Compare Folders for an idea of how this might work..
The plan is to be able to compare two folders from either the index, a folder from disk or a file list.
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

void wrote: Fri Feb 17, 2023 4:55 am How large are the folders?

Would calculating the sha256 sum of a folder be useful?
Calculate the sha256 sum of just content or filenames and content? or an option to choose?

I have a compare folders feature in the works (has been on my TODO list for a long time)
Check Tools -> Compare Folders for an idea of how this might work..
The plan is to be able to compare two folders from either the index, a folder from disk or a file list.
The folders range from 1 GB up to 22 GB. I'm not certain what sha256 is. Is it like a unique identifier for each file and folder, like spectrometry?

Yes, I saw the compare folders and I was shocked that I couldn't do anything with it lol. I saw it in the 1338 update. Something like a blend of vice versa pro and everything would make best use of compare folders
void
Developer
Posts: 16672
Joined: Fri Oct 16, 2009 11:31 pm

Re: Recursive dupes

Post by void »

SHA256 is a hash algorithm.

Calculating the folder sha256 involves calculating the sha256 hash value from all the file content inside this folder/subfolders.

The final folder sha256 hash will be effectively unique (two different sets of data producing the same hash is astronomically unlikely).
You can compare it against another folder.
If the hashes match, the folders contain the same data.
If the hashes differ, the folders contain different data. (could be missing data, modified data or added data)

Calculating the hash will take some time.
A rough guess is around 100MB/s



7zip can calculate the sha256 hash of a folder (file content, or file content and filenames)
It might be useful to add these properties to Everything.
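
As a rough illustration of the two variants mentioned above (content only, or content plus filenames), here is a minimal Python sketch. It is not 7-Zip's or Everything's algorithm, so its values will not match theirs; the function names and the way per-file digests are combined are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash one file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def folder_sha256(root, include_names=False):
    """Combine the per-file hashes of everything below root into one hash."""
    root = Path(root)
    records = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        record = file_sha256(path)
        if include_names:
            # Prefix the relative name so renames and moves change the result.
            name = path.relative_to(root).as_posix().encode("utf-8")
            record = name + b"\0" + record
        records.append(record)
    folder_digest = hashlib.sha256()
    for record in sorted(records):  # sorted so traversal order does not matter
        folder_digest.update(record)
    return folder_digest.hexdigest()
```

Two folders with the same file contents give the same content-only value even if the files are named differently; with include_names=True a rename alone is enough to change the result.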
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

void wrote: Fri Feb 17, 2023 7:19 am SHA256 is a hash algorithm.

Calculating the folder sha256 involves calculating the sha256 hash value from all the file content inside this folder/subfolders.

The final folder sha256 hash will be effectively unique (two different sets of data producing the same hash is astronomically unlikely).
You can compare it against another folder.
If the hashes match, the folders contain the same data.
If the hashes differ, the folders contain different data. (could be missing data, modified data or added data)

Calculating the hash will take some time.
A rough guess is around 100MB/s



7zip can calculate the sha256 hash of a folder (file content, or file content and filenames)
It might be useful to add these properties to Everything.
100 MB/s is more than enough for my purposes. Usually my folders are less than 7 or 8 GB in total size. There are only a few exceptions, when I create large archive zip files that can be 20-30 GB in total size.

I do agree that calculating a sha256 hash of file and folder contents, or of times, filenames and folder names, would be a more effective way to identify duplicates than long lines of search syntax for finding duplicates or similarities.
raccoon
Posts: 1017
Joined: Thu Oct 18, 2018 1:24 am

Re: Recursive dupes

Post by raccoon »

If I've not misread, I think your ask isn't to detect whether two folders are identical, but to identify which files make the two folders nonidentical. Or more specifically, which files ARE identical so they can be deleted to save space and allow the remaining dissimilar files to be given attention to figure out why they are dissimilar.

At least that's usually my objective in detecting duplicates. Dismiss the duplicates and scrutinize the remaining files for quality or modernity.
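
A minimal Python sketch of that objective, assuming two folder trees on disk: it sorts files into identical, different and one-sided buckets by relative path, using a byte-for-byte comparison from the standard filecmp module. The function name classify and the bucket names are made up for this example.

```python
import filecmp
from pathlib import Path


def classify(folder_a, folder_b):
    """Bucket files by relative path: identical, different, or on one side only."""
    a, b = Path(folder_a), Path(folder_b)
    rel_a = {p.relative_to(a) for p in a.rglob("*") if p.is_file()}
    rel_b = {p.relative_to(b) for p in b.rglob("*") if p.is_file()}
    identical, different = [], []
    for rel in sorted(rel_a & rel_b):
        # shallow=False forces a content comparison instead of trusting size/mtime.
        same = filecmp.cmp(a / rel, b / rel, shallow=False)
        (identical if same else different).append(rel.as_posix())
    return {
        "identical": identical,
        "different": different,
        "only_in_a": sorted(r.as_posix() for r in rel_a - rel_b),
        "only_in_b": sorted(r.as_posix() for r in rel_b - rel_a),
    }
```

The "identical" bucket is the set that can be dismissed; the other three are the ones left to scrutinize.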
therube
Posts: 4955
Joined: Thu Sep 03, 2009 6:48 pm

Re: Recursive dupes

Post by therube »

Again I'll point to, viewtopic.php?p=53680#p53680.

IMO, you use the tool that is correct for the job.
So sometimes you use 1, next time you use the other, & sometimes you mix & match.

Since Everything is great at finding everything, once you've found your wanted files or file locations, take that information & plug it into a different tool.

If you want to "sync" or "update" a sync or update tool is what you need.
So use a sync tool, be it ViceVersa or FreeFileSync or...

If you want to compare directories, or directory trees (& then maybe update 1 to the other or whatnot),
IMO Salamander's Directory Compare feature works very well, in particular with 2 particular directories,
very easily pointing out differences (based on various criteria).

It can also handle trees (subdirectories). It will identify "different" branches, but at that point, that is
all that you know, they're different, so it is a bit limited in that respect. You'd have to actually traverse
into those directories (again running a Directory Compare) to know what specifically is different.
(And that is not to say dealing with trees is lacking, it just may not be the correct tool to use - depending
on a particular situation.)
therube
Posts: 4955
Joined: Thu Sep 03, 2009 6:48 pm

Re: Recursive dupes

Post by therube »

sha256 hash ... 100 MB/s is more than enough
If you're looking for "equality" (duplicates), then IMO you don't need a "crypto" hash (like md5, sha1, sha256...). Other, faster hashes exist that are not crypto but will give equally relevant results.

"100 MB/s".

"I/O" & cache matter.
IOW, your hash program can only hash as quickly as it can read (be sent) data.
So if your "sending" is slow, the hash you use (as far as speed is concerned) is relatively unimportant.
So if you're sending at 30 MB/s & your slowest hash reads at 100 MB/s, it doesn't matter.
But if you're sending at 20000 MB/s, then you will most certainly see a difference between faster & slower hashes.

Cache matters. (And if you have lots of cash you can purchase a very large cache.)
If you're reading from "cache" (of various sorts) that can be many times faster than reading "cold".

Don't recall, maybe it was a 4GB file or something? But... speed was along the lines of...


IF a file is cached (& you use --use_cache), hash can COMPUTE the hash  (2.2 sec)
quicker than sha1 can VERIFY the hash                                   (13.7 sec)

if file is not cached, you're limited by "BUS", & everything,
compute & check will both take                                  (2 min 30 sec)

(In this case, "hash" is xxHash using the xxh128 algorithm, & sha1 is sha1. And sha256 would be slower yet than sha1.
IMO, sha256 is overkill - depending on needs.)
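
The point that read speed caps hash speed can be sketched in a few lines of Python. This is only an illustration under stated assumptions: xxHash is a third-party package, so zlib.crc32 stands in here as a standard-library non-crypto checksum (at 32 bits it is far too short to be the only duplicate test), and effective_rate_mb_s assumes reading and hashing overlap so that the slower stage sets the pace. The helper names are made up.

```python
import hashlib
import time
import zlib


def hash_rate_mb_s(hash_once, payload, rounds=5):
    """Measure how many MB/s one hash function digests from memory."""
    start = time.perf_counter()
    for _ in range(rounds):
        hash_once(payload)
    elapsed = time.perf_counter() - start
    return len(payload) * rounds / elapsed / 1e6


def effective_rate_mb_s(read_mb_s, hash_mb_s):
    """With reading and hashing overlapped, the slower stage sets the pace."""
    return min(read_mb_s, hash_mb_s)


payload = bytes(4 * 1024 * 1024)  # 4 MiB of zeros, already "cached" in RAM
rates = {
    "sha256": hash_rate_mb_s(lambda data: hashlib.sha256(data).digest(), payload),
    "sha1": hash_rate_mb_s(lambda data: hashlib.sha1(data).digest(), payload),
    "crc32": hash_rate_mb_s(zlib.crc32, payload),
}
for name, rate in rates.items():
    slow_disk = effective_rate_mb_s(30, rate)
    print(f"{name}: ~{rate:.0f} MB/s from memory, ~{slow_disk:.0f} MB/s behind a 30 MB/s read")
```

The measured in-memory rates will vary by machine, but behind a 30 MB/s read every algorithm ends up at roughly the same effective speed, which is the "it doesn't matter" case above.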
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

Uh... I'm lost for words, but I think directory compare would suffice for my purpose.
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Recursive dupes

Post by NotNull »

I probably said this before, but I think the current setup of Everything is not well suited for such queries.

In Everything you can search for a file by specifying part of that file's name, a date range and a wealth of other attributes/properties that identify this file. Everything is very good at that.
Relations with other files/folders is basically outside Everything's comfort-zone.

Compare it to people:
I'm looking for a male person, between 18 and 80 years old, brown hair and shoesize 12.
That's easy (if you're Everything ;))
Everything already has implemented functions for close relationships, like parent, child and brother/sister. All require specific functions. That list is quite long already.
I bet that in the future someone will want to find a male person, between 18 and 80 years old, brown hair and shoesize 12, who has an uncle with a dog named Tarzan.
The list of (complicated) relation-functions will explode. It will be ugly (it already is, imo).


IF this relationship-feature will be part of Everything, a separate interface is more fitting here, with 3 main input fields:
1. Main object properties (use Everything search syntax)
2. Relationship(s) (haven't thought this through; maybe clicking a tree in a GUI or selecting relationship from a list)
3. Relation properties (use Everything search syntax or "same xyz property as main object")
(there can be more than 1 relation, so relation1 + relation1 properties; relation2 + relation2 properties)

(and maybe a SQL syntax for complex queries. or even better: SQLplus query syntax)

Then a lot of those child- parent- sibling- functions can be removed.
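
To show what a SQL-style relation query could look like, here is a small Python sketch using the standard sqlite3 module. The table layout (root, rel_path, size) and the sample rows are invented for this example and have nothing to do with how Everything stores its index; the "relation" is simply "same relative path and same size on the other side".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (root TEXT, rel_path TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("c:\\folder1", "docs\\a.txt", 10),
        ("x:\\folder2", "docs\\a.txt", 10),
        ("c:\\folder1", "docs\\b.txt", 20),
        ("x:\\folder2", "docs\\b.txt", 25),
        ("c:\\folder1", "only-here.txt", 5),
    ],
)

# Main object: files under the first root.
# Relation: a row under the second root with the same relative path.
# Relation property: same size as the main object.
same = conn.execute(
    """
    SELECT a.rel_path
    FROM files AS a
    JOIN files AS b ON b.rel_path = a.rel_path AND b.size = a.size
    WHERE a.root = ? AND b.root = ?
    ORDER BY a.rel_path
    """,
    ("c:\\folder1", "x:\\folder2"),
).fetchall()
print(same)
```

Only docs\a.txt comes back, because docs\b.txt differs in size and only-here.txt has no counterpart on the other side.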



Anyway, that is my opinion. Back on topic ....
raccoon wrote: Sat Feb 11, 2023 3:25 am What we need is an easy way to produce a column that shows relative paths below the starting directories of A and B that are being compared for dupes. I cannot think of an easy means to do this yet.
That looked like a bit of fun to have (thanks!).

I just wrote a filter that should be able to get this done.
In my 2 minutes of testing, this worked perfectly. Ergo: this is perfect ;)

Create a new Filter:


Name : Compare=
Search : regex:#quote:#regex-escape:<#element:<search:,;,1>>(\\.*$)#quote: | regex:#quote:#regex-escape:<#element:<search:,;,2>>(\\.*$)#quote:  -add-column:regexmatch1
Macro : comp<search>
How to use:
  • enable the Compare= filter
  • Use the following search syntax:
    "c:\some\folder";"x:\another\path"

    note the ; that separates the two folders. Don't use spaces before or after it

    The extra Regular Expression Match 1 column will show the part of the paths that comes after "c:\some\folder" or "x:\another\path"
  • Right-click the Regular Expression Match 1 column header
  • Select Find Regular Expression Match 1 duplicates
All files/folders that are the same "on both sides" will be listed.
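
For readers who want to see what the filter does step by step, here is a Python sketch that mimics it: split the search on ;, regex-escape each folder, append (\\.*$), capture the tail and group on it. It is an approximation under assumptions, not Everything's implementation: the function name is made up, and case-insensitive matching against the full path is assumed to mirror Everything's default behaviour.

```python
import re
from collections import defaultdict


def compare_by_subpath(search, full_paths):
    """Group full paths by the part that follows any of the ;-separated folders."""
    roots = [part.strip('"') for part in search.split(";")]
    # Same shape as the filter: escaped folder followed by the group (\\.*$).
    patterns = [re.compile(re.escape(root) + r"(\\.*$)", re.IGNORECASE) for root in roots]
    by_tail = defaultdict(list)
    for path in full_paths:
        for pattern in patterns:
            match = pattern.search(path)
            if match:
                # match.group(1) plays the role of the Regular Expression Match 1 column.
                by_tail[match.group(1).lower()].append(path)
                break
    # "Find duplicates" on that column: keep tails that occur more than once.
    return {tail: hits for tail, hits in by_tail.items() if len(hits) > 1}
```

Note that the captured tail starts with the backslash that follows the folder, so a file present on both sides shows up under a key such as \a\b.txt.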
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

I will give this a test comparing two folder structures. Does it compare only filenames and folder names, or does it compare only the properties under the regex column?
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Recursive dupes

Post by NotNull »

This Everything Filter does not summon dark forces and does not kill puppies, so you could just try and see for yourself. Takes 1 minute ...

On the other hand: this is just a demo to find out what the possibilities are.
It will likely give false results when a foldername contains a semicolon (;), although I did not test this.
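
The semicolon concern can be illustrated in Python, assuming (untested, as noted above) that the element lookup behaves like a plain split on ; that ignores quotes. Both function names are hypothetical; the second one shows what a quote-aware split would return instead.

```python
def split_naive(search):
    """Plain split on ';', the behaviour the filter is assumed to have."""
    return [part.strip('"') for part in search.split(";")]


def split_quote_aware(search):
    """Split on ';' only when it appears outside double quotes."""
    roots, current, in_quotes = [], [], False
    for ch in search:
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == ";" and not in_quotes:
            roots.append("".join(current))
            current = []
        else:
            current.append(ch)
    roots.append("".join(current))
    return roots


search = r'"c:\data;old";"x:\backup"'
print(split_naive(search))        # three pieces: the first folder is cut in two
print(split_quote_aware(search))  # two pieces: both folders survive intact
```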
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

interesting …
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

How can I use the normal files: and find-dupes: or dcdupe: etc. functions with this regex kind of syntax? Also, when you say macro: comp<search>, do I have to substitute something in place of search or is that actually part of the command?

It seems like I cannot add custom columns together with regex columns. Is there a reason for this?
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Recursive dupes

Post by NotNull »

the comp: macro was added for 2 reasons
- to get a parameter to process (parameter= "c:\folder1";"x:\folder2")
- to add extra search options, like you are asking now.

So you can do things like:
file: comp:"c:\folder1";"x:\folder2"

(I virtually don't know anything about dupe-functions, so can't help you there)


BTW:
I wanted to call the macro comp= as it compares the same subpaths, but didn't know if that would give issues. Maybe you can test?
(and I had in mind a comp> and comp< macro to search for files that are not "on the other side". Maybe that will come someday ..)
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

NotNull wrote: Sat Feb 18, 2023 7:27 pm the comp: macro was added for 2 reasons
- to get a parameter to process (parameter= "c:\folder1";"x:\folder2")
- to add extra search options, like you are asking now.

So you can do things like:
file: comp:"c:\folder1";"x:\folder2"

(I virtually don't know anything about dupe-functions, so can't help you there)


BTW:
I wanted to call the macro comp= as it compares the same subpaths, but didn't know if that would give issues. Maybe you can test?
(and I had in mind a comp> and comp< macro to search for files that are not "on the other side". Maybe that will come someday ..)

When I compare 2 folders, I only get the results for the 1st folder, not together with those for the second.
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Recursive dupes

Post by NotNull »

What does your search look like? What is the active filter?

[Attachment: 2023-02-20 19_37_35-file_ comp__t__sync_;_c__tools_ - Everything (1.5a) 1.5.0.1338a (x64).png]
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

My active search filter is compare=
[Attachment: Screenshot 2023-02-20 134353.jpg]
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Recursive dupes

Post by NotNull »

and your search query?
(replace actual paths with something similar if privacy requires so)
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

NotNull wrote: Mon Feb 20, 2023 6:47 pm and your search query?
(replace actual paths with something similar if privacy requires so)
My search query is

file: comp:"E:\Users\username\Downloads";"C:\Users\username\Downloads"

Should I set it to everything filter or is that not necessary ?
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Recursive dupes

Post by NotNull »

anmac1789 wrote: Mon Feb 20, 2023 6:50 pm Should I set it to everything filter or is that not necessary ?
Yes, please.
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

NotNull wrote: Mon Feb 20, 2023 7:47 pm
anmac1789 wrote: Mon Feb 20, 2023 6:50 pm Should I set it to everything filter or is that not necessary ?
Yes, please.

Okay, I see that it works now. Thank you
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

Will this work for 3 paths or 4 ? How many paths can comp: take?
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Recursive dupes

Post by NotNull »

anmac1789 wrote: Mon Mar 06, 2023 10:53 pm How many paths can comp: take?
Two.
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

NotNull wrote: Tue Mar 07, 2023 8:54 am
anmac1789 wrote: Mon Mar 06, 2023 10:53 pm How many paths can comp: take?
Two.
Can it be expanded to 3 or N directories? Or should I use multiple entries like comp:|comp: to compare more than 2 folders?
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Recursive dupes

Post by NotNull »

The following is for 3 folders; you can expand to as many as you like. I think you will get the pattern on closer inspection...


regex:#quote:#regex-escape:<#element:<search:,;,1>>(\\.*$)#quote: | regex:#quote:#regex-escape:<#element:<search:,;,2>>(\\.*$)#quote:  | regex:#quote:#regex-escape:<#element:<search:,;,3>>(\\.*$)#quote:  -add-column:regexmatch1
Search for "c:\folder1";"c:\folder2";"c:\folder3" and activate the compare filter.
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

NotNull wrote: Fri Mar 17, 2023 8:01 pm The following is for max. 3 folders; you can expand to as many as you like. I think you will get the pattern on closer inspection...


regex:#quote:#regex-escape:<#element:<search:,;,1>>(\\.*$)#quote: | regex:#quote:#regex-escape:<#element:<search:,;,2>>(\\.*$)#quote:  | regex:#quote:#regex-escape:<#element:<search:,;,3>>(\\.*$)#quote:  -add-column:regexmatch1
Search for "c:\folder1";"c:\folder2";"c:\folder3" and activate the compare filter.
Thank you for suggesting this. I understand that the element number has changed and everything else has remained the same.
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Recursive dupes

Post by NotNull »

Correct. Things might get clearer when the line is split up:


regex:#quote:#regex-escape:<#element:<search:,;,1>>(\\.*$)#quote:
 | 
regex:#quote:#regex-escape:<#element:<search:,;,2>>(\\.*$)#quote:
 | 
regex:#quote:#regex-escape:<#element:<search:,;,3>>(\\.*$)#quote:

-add-column:regexmatch1
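
Since only the element number changes from term to term, the search line for any number of folders can be generated. A small Python sketch, with a made-up function name, that builds the string to paste into the filter's Search field:

```python
def build_compare_search(folder_count):
    """Build the Compare= filter search text for the given number of folders."""
    if folder_count < 2:
        raise ValueError("need at least two folders to compare")
    terms = [
        # Only the trailing element index differs between the terms.
        f"regex:#quote:#regex-escape:<#element:<search:,;,{index}>>(\\\\.*$)#quote:"
        for index in range(1, folder_count + 1)
    ]
    return " | ".join(terms) + " -add-column:regexmatch1"


print(build_compare_search(3))
```

The output follows the same pattern as the lines above, with terms joined by " | " and the column option at the end.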
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

NotNull wrote: Fri Mar 17, 2023 9:30 pm Correct. Things might get clearer when the line is split up:


regex:#quote:#regex-escape:<#element:<search:,;,1>>(\\.*$)#quote:
 | 
regex:#quote:#regex-escape:<#element:<search:,;,2>>(\\.*$)#quote:
 | 
regex:#quote:#regex-escape:<#element:<search:,;,3>>(\\.*$)#quote:

-add-column:regexmatch1
Funny you typed it out like that because I did that with notepad just before I typed my last post lol
anmac1789
Posts: 668
Joined: Mon Aug 24, 2020 1:16 pm

Re: Recursive dupes

Post by anmac1789 »

anmac1789 wrote: Fri Mar 17, 2023 10:05 pm
NotNull wrote: Fri Mar 17, 2023 9:30 pm Correct. Things might get clearer when the line is split up:


regex:#quote:#regex-escape:<#element:<search:,;,1>>(\\.*$)#quote:
 | 
regex:#quote:#regex-escape:<#element:<search:,;,2>>(\\.*$)#quote:
 | 
regex:#quote:#regex-escape:<#element:<search:,;,3>>(\\.*$)#quote:

-add-column:regexmatch1
Funny you typed it out like that because I did that with notepad just before I typed my last post lol
"c:\folder1";"c:\folder2";"c:\folder3"


Can I type it out like this or should I prefix the 3 paths with comp:"path to folder1";"path to folder2";"path to folder3" ??
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Recursive dupes

Post by NotNull »

Your method will work too as it is "just" another way to activate the filter.

So you can use the following search "c:\folder1";"c:\folder2";"c:\folder3" and enable the Compare= filter from the list
- or -
Use the following search: comp:"c:\folder1";"c:\folder2";"c:\folder3"
NotNull
Posts: 5458
Joined: Wed May 24, 2017 9:22 pm

Re: Recursive dupes

Post by NotNull »

TIP:

If you create the 3-folder compare filter, you can use it on 2 folders too by adding a non-existing dummy folder:
comp:"c:\folder1";"c:\folder2";"dsfdsfdvdfvfdvbdd"

Or by adding one of the folders twice:
comp:"c:\folder1";"c:\folder2";"c:\folder1"
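
If the queries are assembled by a script, the padding trick can be automated. A minimal Python sketch with a hypothetical helper name, assuming a filter built for a fixed number of folders (3 here) and repeating the first folder to fill unused slots:

```python
def comp_query(folders, slots=3):
    """Build a comp: search, repeating the first folder to fill unused slots."""
    if not 2 <= len(folders) <= slots:
        raise ValueError(f"expected between 2 and {slots} folders")
    padded = list(folders) + [folders[0]] * (slots - len(folders))
    return "comp:" + ";".join(f'"{folder}"' for folder in padded)


print(comp_query([r"c:\folder1", r"c:\folder2"]))
```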