Function to find Partial Name Dupes
One task I keep running into is locating partial name dupes. This is tricky because I'm not searching for a verbatim string; I want to know whether multiple files or folders share the same substring of a given length or at a given position.
Simple Method: Find all records with duplicates matching the first N characters. E.g., all files with the same first 15 characters.
DupeLeft:15
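For illustration, here is a minimal Python sketch of what a DupeLeft:15 search could compute. It is not Everything's implementation; the function name dupe_left, the case-insensitive comparison, and the rule that names shorter than N never qualify are all assumptions.

```python
from collections import defaultdict

def dupe_left(names, n):
    """Group names by their first n characters and keep groups of two or more."""
    groups = defaultdict(list)
    for name in names:
        if len(name) >= n:  # assumption: shorter names cannot qualify
            # assumption: compare case-insensitively, as Windows filenames do
            groups[name[:n].lower()].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}

names = [
    "Quarterly Report 2008.xls",
    "Quarterly Report 2009.xls",
    "Budget 2013.xls",
]
print(dupe_left(names, 15))
```

Both "Quarterly Report" files share the key "quarterly repor", so they are reported together; the budget file has no partner and is dropped.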
Advanced Method: Allow the user to compose a regular expression pattern that defines the length and composition of the substring a record must contain to be compared against other records. The portions of the pattern captured as backrefs are extracted and matched against other records for dupe comparison; the other portions of the pattern act only as qualifying filters.
DupeRegex:"^(.{15,})"
DupeRegex:"^(.*)(?:19|20)\d\d"
In the above example, any files whose names contain /(?:19|20)\d\d/ are compared for substring duplication of the portion of the name preceding that number (the /(.*)/ backref), so the number (year) need not match between duplicates, only the substring to the left of it.
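A rough Python sketch of the proposed DupeRegex behaviour, using the year pattern above. The helper name dupe_regex and the sample filenames are made up for illustration, and Python's re module stands in for whatever regex engine Everything uses.

```python
import re
from collections import defaultdict

YEAR_PREFIX = re.compile(r"^(.*)(?:19|20)\d\d")

def dupe_regex(names, pattern):
    """Group names by capture group 1; names that do not match are ignored."""
    groups = defaultdict(list)
    for name in names:
        m = pattern.search(name)
        if m:
            groups[m.group(1)].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}

names = [
    "Holiday Photos 2019.zip",
    "Holiday Photos 2020.zip",
    "Holiday Photos.zip",   # no year, so it never qualifies
    "Taxes 1999.pdf",       # qualifies, but has no partner
]
print(dupe_regex(names, YEAR_PREFIX))
```

Note that (.*) is greedy, so when a name contains several year-like numbers the capture extends up to the last one.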
Thoughts?
Re: Function to find Partial Name Dupes
I like the dupeleft: idea.
The DupeRegex: search could work, although performing a regex search for each filename would be very slow.
Maybe something like:
regex:"^(.*)(?:19|20)\d\d" dupestartwith:\1
-the result would have to match the first regex search and a duplicate would have to exist that starts with the first captured sub-expression.
-a startwith search for each filename would be instant.
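To make the two-step idea concrete, here is a Python sketch: one regex pass to get \1, then a cheap prefix lookup in a sorted list instead of a second regex pass. The sorted-list-plus-binary-search index is my assumption about how a "starts with" lookup could be made fast; it is not a description of Everything's internals. The sketch also assumes distinct names and case-sensitive comparison.

```python
import bisect
import re

def dupe_startwith(names, pattern):
    """Return names that match `pattern` and have another name starting with group 1."""
    ordered = sorted(names)
    regex = re.compile(pattern)
    hits = []
    for name in names:
        m = regex.search(name)
        if m is None:
            continue
        prefix = m.group(1)
        # Names sharing a prefix are adjacent in sorted order, so one binary
        # search finds the start of the run; no second regex pass is needed.
        i = bisect.bisect_left(ordered, prefix)
        while i < len(ordered) and ordered[i].startswith(prefix):
            if ordered[i] != name:  # never match the current file itself
                hits.append(name)
                break
            i += 1
    return hits

names = ["Movie 2019.mkv", "Movie 2020.mkv", "Other 2001.mkv", "notes.txt"]
print(dupe_startwith(names, r"^(.*)(?:19|20)\d\d"))
```

The partner file only has to start with the captured text; it does not have to match the regex itself.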
Thank you for the suggestions.
Re: Function to find Partial Name Dupes
Aye, I recognize the regex thing would have to be a multi-pass recursion. Though, I'm not sure that your solution reduces that recursive property as \1 would have to be resolved for each record, and then all records scanned again in kind. Seems basically like my idea, but limiting \1 to the left-side of the string. Maybe some savings if plain string compare is faster than PCRE. But, perhaps, just creating an index of the value of \1\2\3\4\5... for each record is enough, and just fast search / sort / compare / hashtable lookup those.
I'm not sure it's necessary to attempt to match multiple resolves per entry. ie, no need to pull your hair out over supporting //g patterns.
DupeRegex:"\b(\w{8,})\b"
Records:
foo documents bar.ext resolves to: \1 == documents
baz documents quux.ext resolves to: \1 == documents (match)
aaa raspberries bbb butts.ext resolves to: \1 == raspberries
butts ccc raspberries ddd.ext resolves to: \1 == raspberries (match)
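A small Python sketch of the "one resolve per entry" point: only the first match in each name is used as the key, rather than every match a //g style search would return. The helper name first_capture is hypothetical.

```python
import re

WORD8 = re.compile(r"\b(\w{8,})\b")

def first_capture(name):
    """Return \\1 from the first match only, or None if nothing matches."""
    m = WORD8.search(name)
    return m.group(1) if m else None

print(first_capture("foo documents bar.ext"))          # documents
print(first_capture("butts ccc raspberries ddd.ext"))  # raspberries

# A name with two qualifying words still resolves to a single key:
print(first_capture("documents raspberries.ext"))      # documents
print(WORD8.findall("documents raspberries.ext"))      # what a //g search would see
```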
Another example but with multiple backrefs. We just clobber them together into a single [invalid-file-character...say-colon] delimited string.
DupeRegex:"(\w{6,}) \d{4}.*\.([^.]+)" // 6-or-more word characters, followed by a space, 4 numbers, and a file extension.
Records must match the 6-or-more word characters and have the same file extension.
doctor voidtools 2020 DVD.mp4 resolves to: \1:\2:\3:\4:\5:\6:\7:\8:\9 == voidtools:mp4:::::::
mr voidtools 2019 facebook.mp4 resolves to: \1:\2:\3:\4:\5:\6:\7:\8:\9 == voidtools:mp4::::::: (match)
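Sketched in Python, the joined-backref key might look like the following. The names capture_key and find_dupes are made up, and for brevity the key joins only the groups the pattern actually defines instead of padding out to \9.

```python
import re
from collections import defaultdict

def capture_key(pattern, name, sep=":"):
    """Join all capture groups with a character that cannot occur in a filename."""
    m = re.search(pattern, name)
    if m is None:
        return None
    return sep.join(group or "" for group in m.groups())

def find_dupes(pattern, names):
    """Index every name by its joined key and keep keys shared by 2+ names."""
    index = defaultdict(list)
    for name in names:
        key = capture_key(pattern, name)
        if key is not None:
            index[key].append(name)
    return {key: found for key, found in index.items() if len(found) > 1}

pattern = r"(\w{6,}) \d{4}.*\.([^.]+)"
names = ["doctor voidtools 2020 DVD.mp4", "mr voidtools 2019 facebook.mp4"]
print(find_dupes(pattern, names))
```

Because the index is a hash table keyed on the joined string, each name is matched against the regex once and no second pass over the records is needed.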
Re: Function to find Partial Name Dupes
"I'm not sure that your solution reduces that recursive property as \1 would have to be resolved for each record"
Everything has this information already from the previous regex: search term. It is stored in the regex matches.
Generating the startwith term from \1 would be instant.
A startwith search is really a lookup in Everything and is instant (it really is only a few instructions).
Currently I have the following search planned for a future release:
path:regex:(.*)\.mp4 fileexists:\1\.jpg
This will return mp4 files where a jpg exists with the same stem in the same folder.
This search is instant.
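As a rough model of why this kind of search can be fast, here is a Python sketch where the file index is a set of full paths, so the "does the jpg exist" check is a single lookup. The set-based index, the case-insensitive comparison, and the choice to require the whole path to match the pattern are assumptions made for the example, not a description of how Everything stores its index.

```python
import re

MP4 = re.compile(r"(.*)\.mp4", re.IGNORECASE)

def mp4_with_jpg(paths):
    """Return mp4 paths for which a jpg with the same stem exists in the same folder."""
    index = {p.lower() for p in paths}  # assumption: case-insensitive filenames
    hits = []
    for path in paths:
        m = MP4.fullmatch(path)
        # group 1 is the full path without the extension, so the folder is included
        if m and (m.group(1) + ".jpg").lower() in index:
            hits.append(path)
    return hits

paths = [
    r"C:\video\clip.mp4",
    r"C:\video\clip.jpg",
    r"C:\video\other.mp4",
    r"D:\pics\other.jpg",  # same stem, different folder: does not count
]
print(mp4_with_jpg(paths))
```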
dupestartwith: would be similar to fileexists:, except it will never match the current file.
I've added dupestartwith: to my TODO list.
Re: Function to find Partial Name Dupes
Cool beans, will be fun to get my hands on!
What do you think about the intricacies of my last post, where you create a column of resolved \1:\2:\3:\4:\5:\6:\7:\8:\9 and then just find dupes within that column? I feel this is totally doable and perhaps easier than your proposal, and doesn't limit the user to left-side alignment (arbitrary).
Re: Function to find Partial Name Dupes
"What do you think about the intricacies of my last post, where you create a column of resolved \1:\2:\3:\4:\5:\6:\7:\8:\9 and then just find dupes within that column? I feel this is totally doable and perhaps easier than your proposal, and doesn't limit the user to left-side alignment (arbitrary)."
It might work, I'm not sure where to specify the \1:\2:\3:\4...
It could be done as a search term...
I'm looking into a "duplicate view" mode with the following options:
Show all items.
Show duplicates only.
Show unique items only.
Show only one instance of each value.
These dupe modes would be based on the current sort.
For example, sort by size and select duplicate view -> show duplicates only to show only files that share their size with at least one other file.
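The four modes can be modelled in a few lines of Python. This is only a sketch of the idea; the mode names and the duplicate_view function are invented for the example. Since the modes depend on the current sort, the rows are sorted by the chosen column first and then split into runs of equal values.

```python
from itertools import groupby

def duplicate_view(rows, key, mode):
    """Filter rows by how often their sort-key value occurs."""
    if mode not in ("all", "duplicates", "unique", "one_each"):
        raise ValueError(f"unknown mode: {mode}")
    out = []
    # groupby only merges adjacent items, which is why the sort comes first
    for _, run in groupby(sorted(rows, key=key), key=key):
        run = list(run)
        if mode == "all":
            out.extend(run)
        elif mode == "duplicates" and len(run) > 1:
            out.extend(run)
        elif mode == "unique" and len(run) == 1:
            out.extend(run)
        elif mode == "one_each":
            out.append(run[0])
    return out

def by_size(row):
    return row[1]

rows = [("a.txt", 10), ("b.txt", 10), ("c.txt", 20)]
print(duplicate_view(rows, by_size, "duplicates"))
print(duplicate_view(rows, by_size, "one_each"))
```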
There will be a new column 'regex match 1' to show the captured regex match 1. You will be able to sort by this column.
For example, search for regex:"\b(\w{8,})\b", sort by regex match 1, select duplicate view -> show duplicates only to show files that share the same captured regex match 1.
This has the limitation of only matching one captured regex match.
I already had a solution working for size and name. However, I think I'll rewrite it to support all columns.
Re: Function to find Partial Name Dupes
Basically this, but instead of just 'regex match 1', it will contain all regex matches, and they would be tokenized with a colon delimiter (invalid file character). And within this column, you look for duplicates (or uniques if that's your thing). This way the regex pattern can contain more-than-one back-reference instead of just one. And, as well, the duplicate matching doesn't have to be left-aligned to the file name or file path, since the back-reference(s) may appear anywhere in the pattern.
Here's my example again, with coloring.
DupeRegex:"(\w{6,})\s\d{4}.*\.([^.]+)"
Name | Regex Matches
------------------------------------------- | --------------------------------------------------
doctor voidtools 2020 DVD.mp4 | voidtools:mp4
mr voidtools 2019 facebook.mp4 | voidtools:mp4 <-- look, a duplicate!
Quarterly Report 2008.xls | Report:xls
Copy of Quarterly Report 2008.xls | Report:xls <-- look, a duplicate!
Budget Finance Report 2013.xls | Report:xls <-- look, another duplicate!
All we're really doing here is the same as any old Regex:"pattern" search, but pulling out [all] back references and seeing if any other records share matching back references.
Last edited by raccoon on Mon Apr 06, 2020 7:37 am, edited 1 time in total.
Re: Function to find Partial Name Dupes
I'll add an 'all regex matches' column.
Then you'll be able to search for:
regex:"(\w{6,})\s\d{4}.*\.([^.]+)"
sort by 'all regex matches'
select duplicate view -> find duplicates.
I'll consider a duperegex: search which will automate this, i.e. the sort by 'all regex matches' would be done behind the scenes.
Thanks for the suggestion.
Re: Function to find Partial Name Dupes
This functionality has been added to Everything 1.5 Alpha.
Search for:
regex:"(\w{6,})\s\d{4}.*\.([^.]+)"
Right click the column header, under the Search submenu, click Regular Expression Match 0.
Right click the Regular Expression Match 0 column header and click Find Regular Expression Match 0 Duplicates.
Re: Function to find Partial Name Dupes
I'm trying to find all files with the pattern
filename (n).ext
I can find these with regex:^(.+)\s[(]\d+[)]\.([^.]+)
Since the filename is captured in the first set of brackets and the extension in the second, I also wanted to match any "unnumbered" files named "filename.ext",
so I tried
regex:^(.+)\s[(]\d+[)]\.([^.]+)|^\1\.\2
However, this did not work: I only get the files matching "filename (n).ext".
Is this task possible? I want to find files that have (n) at the end of the filename because they were copied and pasted into the same directory.
Thanks
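A short Python check of why the alternation attempt cannot work. A regex is evaluated against one filename at a time, so \1 and \2 in the second branch can only refer to text captured from that same name; when the first branch did not match, the groups are unset and the backreferences fail. The demonstration uses Python's re module; I believe PCRE treats unset backreferences the same way by default, but that Everything behaves identically is an assumption.

```python
import re

# The pattern tried above, unchanged.
PATTERN = re.compile(r"^(.+)\s[(]\d+[)]\.([^.]+)|^\1\.\2")

def matches(name):
    return PATTERN.search(name) is not None

print(matches("filename (1).ext"))  # True: the first branch matches
print(matches("filename.ext"))      # False: \1 and \2 are unset in the second branch
```

Finding the unnumbered partner therefore needs a lookup against other records, not a longer pattern.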
Re: Function to find Partial Name Dupes
You want to find filename (n).ext where filename.ext exists?
Please try:
regex:^(.+)\s[(]\d+[)]\.([^.]+)$ fileexists:\1\.\2
You want to find filename.ext where filename (2).ext or filename (3).ext or filename (4).ext or ... exists?
Please try:
regex:^(.+)\.([^.]+)$ fileexists:\1" "\(2\)\.\2 | fileexists:\1" "\(3\)\.\2 | fileexists:\1" "\(4\)\.\2 | fileexists:\1" "\(5\)\.\2 | fileexists:\1" "\(6\)\.\2 | fileexists:\1" "\(7\)\.\2 | fileexists:\1" "\(8\)\.\2 | fileexists:\1" "\(9\)\.\2
You want to combine both of these?
<regex:^(.+)\s[(]\d+[)]\.([^.]+)$ fileexists:\1\.\2> | <regex:^(.+)\.([^.]+)$ fileexists:\1" "\(2\)\.\2 | fileexists:\1" "\(3\)\.\2 | fileexists:\1" "\(4\)\.\2 | fileexists:\1" "\(5\)\.\2 | fileexists:\1" "\(6\)\.\2 | fileexists:\1" "\(7\)\.\2 | fileexists:\1" "\(8\)\.\2 | fileexists:\1" "\(9\)\.\2>
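To show what the combined search is after, here is a Python sketch that pairs each numbered copy with its unnumbered original. It departs from the searches above in two ways, both assumptions of the sketch: it accepts any number in the brackets instead of a hardcoded 2 to 9, and it treats names as a flat list from a single folder. The function name copy_pairs is hypothetical.

```python
import re

NUMBERED = re.compile(r"^(.+)\s[(](\d+)[)]\.([^.]+)$")

def copy_pairs(names):
    """Return (original, numbered copy) pairs where both names exist."""
    index = set(names)
    pairs = []
    for name in names:
        m = NUMBERED.match(name)
        if m:
            original = f"{m.group(1)}.{m.group(3)}"
            if original in index:  # the equivalent of the fileexists: check
                pairs.append((original, name))
    return pairs

names = ["a.txt", "a (2).txt", "a (3).txt", "b (1).txt", "c.txt"]
print(copy_pairs(names))
```

Working from the numbered name back to the original means one lookup per file covers both directions, since every pair is reported with both members.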
Re: Function to find Partial Name Dupes
Thanks for the quick reply. I tried putting regex: in front of fileexists: eg.
<regex:^(.+)\.([^.]+)$ regex:fileexists:^$1:" "\(\d+\)\.$2:$> but that gave nothing, so obviously can't use regex with fileexists: ...
So I used the last example you gave me with the hardcoded number (1) and used
file: online: <regex:^(.+)\s[(]1[)]\.([^.]+)$ fileexists:\1\.\2> | <regex:^(.+)\.([^.]+)$ fileexists:\1" "\(1\)\.\2>
It found suspect redundant copies, got size and folder matched (ran Dupe on size column).
Also ran pairs through "Beyond Compare" (Binary compare) just to be sure.
When I confirmed the exact copy "filename (1)", I deleted it. However the partner "filename" is still left behind in the results.
How do I refresh so these disappear? F5 did nothing?
Re: Function to find Partial Name Dupes
Change the search (eg: add a space to the end) to refresh the results.
Re: Function to find Partial Name Dupes
Refined to DUPE on the "\1.\2" combo using column1
<regex:^(.+)\s[(]2[)]\.([^.]+)$ fileexists:\1\.\2 column1:=$regular-expression-match-1:.$regular-expression-match-2:> | <regex:^(.+)\.([^.]+)$ fileexists:\1" "\(2\).\2 column1:=$regular-expression-match-1:.$regular-expression-match-2:> addcolumn:column1 sort:path;name-descending dupe:column1;size
I notice that fileexists: uses "\" ahead of the "." "(" and ")" characters. Is this the same escape "\" that is used in regex? Would it be possible for fileexists: to support full regex in the future? Then there would be no need to hardcode the 2, 3 etc. in a chain as in your answer above, and \d could be used instead.
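For comparison, the column1;size dupe above can be sketched in Python as grouping on a normalised name plus the size. The normalised helper and the sample sizes are made up, and unlike the search above the sketch strips any " (n)" suffix rather than only " (2)".

```python
import re
from collections import defaultdict

COPY_SUFFIX = re.compile(r"^(.+?)(?:\s[(]\d+[)])?\.([^.]+)$")

def normalised(name):
    """Map 'name (n).ext' and 'name.ext' to the same key, 'name.ext'."""
    m = COPY_SUFFIX.match(name)
    return f"{m.group(1)}.{m.group(2)}" if m else name

def dupes_by_name_and_size(files):
    """files is a list of (name, size); group on (normalised name, size)."""
    groups = defaultdict(list)
    for name, size in files:
        groups[(normalised(name), size)].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}

files = [("a.txt", 100), ("a (2).txt", 100), ("b.txt", 5), ("b (2).txt", 7)]
print(dupes_by_name_and_size(files))
```

The b files share a normalised name but differ in size, so only the a pair is reported. Matching size is still only a hint; a binary compare remains the real confirmation.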
Re: Function to find Partial Name Dupes
Everything 1.5.0.1341a makes some improvements to sibling:
Please try the following search:
regex:^(.+)\.([^.]+)$ regex:sibling:$1:\s\(\d+\)\.$2:
Please note this search is rather slow.
$1: will now be correctly escaped for a regex: search in 1341a+.
regex: will now override ; list syntax in 1341a+.
fileexists: will continue to match an absolute filename in the index.
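The escaping point matters because the captured text is pasted into a second regex. A Python sketch of the same idea, using re.escape where the search above relies on $1: being escaped; the function name numbered_siblings and the sample names are hypothetical, and folder_names is assumed to hold the names from one folder.

```python
import re

def numbered_siblings(name, folder_names):
    """Return names in the same folder of the form 'stem (n).ext' for this name."""
    m = re.match(r"^(.+)\.([^.]+)$", name)
    if m is None:
        return []
    stem, ext = m.groups()
    # Escape the captures so characters such as [ ] ( ) . are taken literally.
    sibling = re.compile(re.escape(stem) + r"\s\(\d+\)\." + re.escape(ext) + r"$")
    return [other for other in folder_names if other != name and sibling.match(other)]

folder = ["notes [v1.0].txt", "notes [v1.0] (2).txt", "notes v (2).txt"]
print(numbered_siblings("notes [v1.0].txt", folder))
```

Without the escaping, "[v1.0]" would be read as a character class and "notes v (2).txt" would match instead of the real sibling. Compiling a fresh regex for every file is also a plausible reason why this style of search is slow.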