[strikethrough]Everything[/strikethrough] ***I*** have a problem in identifying duplicate content.

General discussion related to "Everything".
ChrisGreaves
Posts: 697
Joined: Wed Jan 05, 2022 9:29 pm

[strikethrough]Everything[/strikethrough] ***I*** have a problem in identifying duplicate content.

Post by ChrisGreaves »

Unless your file is a direct Windows Copy/Paste of a file, Everything has a problem in identifying duplicate content. I remain convinced that we can NOT truly detect duplicate audio files using solely a computer program.

In particular, in Everything 1.5, when can we trust size in finding duplicates?

I have copied my 20,000+ music files to an external USB drive, F:.
I want to eliminate duplicate audio files.
The topic Find Duplicates is my starting-point, but please see also these topics.
[Image: Dupes_01.jpg]
In this exercise I am focusing on a file “Coco” which I had previously found to be a duplicate in name and size.
The size is reported here as 2,273 KB.
[Image: Dupes_02.jpg]
The first file is reported as 2,326,924 bytes, and 2,330,624 bytes on disk.
[Image: Dupes_03.jpg]
The second file is reported as 2,326,942 bytes, and 2,330,624 bytes on disk.
Subtle, huh?!!???
I loaded each file into Audacity and listened; to my ear they are the same audio. I always suspect that a slight difference in length-of-sound can be caused if I have edited long applause from an end of the track. I think too that I should be looking at the SHA256 or some other clever feature of Everything.
[Image: Dupes_04.jpg]
Appending “name” makes no difference to my original search results, since dupe: defaults to Name.
[Image: Dupes_05.jpg]
Appending “size” DOES make a difference to my original search results. The empty result list suggests to me that Everything has recognised the “Size” figure in preference to the “Size on disk” figure from Properties.
[Image: Dupes_06.jpg]
Undaunted, I replace “size” with “sha256”. I am not surprised that the checksum value differs between the files. After all, if one file is 18 bytes longer than the other, then I would expect a checksum to report a difference.
For perhaps twenty years I have known that defining duplicates in audio files is a tricky business. My 50-year involvement with Gilbert and Sullivan operas gives me the edge in detecting different recordings of “The Gondoliers”, but to your ears they will be the same opera. Likewise for Maurice André vs. Wynton Marsalis when it comes to trumpet audios.
[Image: Dupes_07.jpg]
Undaunted I import both stereo tracks into Audacity 3.4.2 and play them through (five minutes). Above I show the last few seconds of the two tracks. The audio waves have the same pattern, but there are subtle differences. It seems to me that one audio file is amplified over the other.
Although the tracks differ by 18 bytes, the screenshot above suggests that they are the same length. But then:-
(a) I am not sure that my eyes can detect 18/2273000 of a file and
(b) Perhaps Audacity neatly aligns two imported tracks.
How do I know my test is valid? How do I know, for sure, that both tracks are playing and that I am not listening to just one track?
[Image: Dupes_08.jpg]
Easy! I select and delete about one second of audio from the first track and play back. The orange-circled parts are clearly offset in time, and sure enough at that point the playback goes into stutter-mode; too far apart to be an echo; the tracks are evidently out-of-step.
It follows that had they been out of step by a fraction of a second in the original form, I would have heard a stutter, or at least an echo effect.

I do not expect Everything to ignore the so-called silence at the start or end of an audio track. That silence is not silence; it is background noise including the sound of your refrigerator in the kitchen, traffic outside in the road, .... Likewise I do not expect Everything to recognize “null content” in video or graphic files (pictures, images). Likewise I would not expect Everything to ignore any “null” content, such as the <space> characters at the start, at the end, and especially in the middle of any form of text file.

In short, Everything in a general sense ought ONLY to perform bit-wise comparisons, which is, I think, what any checksum approach provides, including SHA256.
Conclusion: Unless your file is a direct Windows Copy/Paste of a file, Everything has a problem in identifying duplicate content by SIZE.

Cheers, Chris
Last edited by ChrisGreaves on Tue Dec 24, 2024 1:48 pm, edited 2 times in total.
therube
Posts: 5056
Joined: Thu Sep 03, 2009 6:48 pm

Re: Everything has a problem in identifying duplicate content.

Post by therube »

dupe: here is dup'ing on Name, & "01 - Coco B.mp3" is dup'd in two different places.
So yes, you have a dupe.

If you did a sizedupe:, then those two files would not dup, cause they are different sizes.

Size, you are displaying as KB, & KB is not "exact".
If you display Bytes, it would be readily apparent that the file sizes are different.
(Well readily - so long as you catch the size diff.)
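To make the KB point concrete, here is a small sketch using the byte counts quoted in the first post. Two things are assumptions, not stated anywhere in the thread: that the KB column rounds up to the next whole KB, and that the drive uses 4,096-byte clusters (which happens to be consistent with the 2,330,624 "size on disk" figure).

```python
import math

CLUSTER_BYTES = 4096  # assumed allocation unit; the thread does not state it


def size_in_kb(size_bytes: int) -> int:
    """Size in whole KB, rounded up (assumed display rule)."""
    return math.ceil(size_bytes / 1024)


def size_on_disk(size_bytes: int, cluster: int = CLUSTER_BYTES) -> int:
    """Bytes occupied when the file is stored in whole clusters."""
    return math.ceil(size_bytes / cluster) * cluster


first, second = 2_326_924, 2_326_942  # byte sizes quoted in the first post

for size in (first, second):
    # Same KB figure and same size on disk, despite different byte counts.
    print(size, size_in_kb(size), size_on_disk(size))

print(second - first)  # the byte difference that the KB display hides
```

Under those assumptions both files show the same KB figure and the same size on disk, so only the Bytes display separates them.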


> I am focusing on a file “Coco” which I had previously found to be a duplicate in name and size

Coco is same name - not same size.
You did not find it to be the same size. (It is only close in size, not the same.)

> Everything has recognised the “Size” figure in preference to the “Size on disk” figure from properties

Of course.

> Everything has a problem in identifying duplicate content by SIZE

Well it does not (have a problem), because size is not an indicator of content.
A duplicate by size is just that, size - nothing more.

If one wants to infer that a duplicate size (which you do not even have here) & also a duplicate name equate to the files being the "same", you can certainly infer that, you might hope that be the case, but it is a false assumption.

Duplicate size simply means duplicate size - nothing more.
Duplicate name simply means duplicate name - nothing more.
Duplicate size & name simply means that, size & name are the same - nothing more.

You cannot determine equality other than in size, or name, or size & name (because that is all you checked for).

None of them can say that the content is the same - because you have done nothing to find duplicate content - only size & name.

And for media, the "physical" (if you will) audio/video can be the same - exactly, hash verified, where the file size (& name & date & ...) need not be. And there are ways to verify exactness of the "media" itself.
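As a rough illustration of "same media, different file": the sketch below hashes an MP3's bytes after dropping a leading ID3v2 tag and a trailing ID3v1 tag, so two copies that differ only in those tags get the same hash. The function names `strip_id3` and `audio_hash` are made up for this example, it ignores other tag formats (APEv2, Lyrics3), and nothing in the thread establishes that tags explain the difference between the two Coco files.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Drop a leading ID3v2 tag and a trailing ID3v1 tag, if present."""
    start, end = 0, len(data)
    if len(data) >= 10 and data[:3] == b"ID3":
        flags = data[5]
        tag_size = 0
        for byte in data[6:10]:
            # ID3v2 sizes are "synchsafe": 7 useful bits per byte.
            tag_size = (tag_size << 7) | (byte & 0x7F)
        start = 10 + tag_size  # 10-byte header plus tag body
        if flags & 0x10:
            start += 10  # optional 10-byte footer
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128  # ID3v1 is a fixed 128-byte block at the end
    return data[start:end]


def audio_hash(data: bytes) -> str:
    """SHA-256 of the bytes left after removing ID3 tags."""
    return hashlib.sha256(strip_id3(data)).hexdigest()


# Made-up payload standing in for MP3 frames.
frames = b"\xff\xfb" + bytes(range(200))
with_v2 = b"ID3\x03\x00\x00" + bytes([0, 0, 0, 32]) + b"\x00" * 32 + frames
with_v1 = frames + b"TAG" + b"\x00" * 125

print(audio_hash(frames) == audio_hash(with_v2) == audio_hash(with_v1))
print(hashlib.sha256(frames).hexdigest() == hashlib.sha256(with_v2).hexdigest())
```

The whole-file hashes differ while the tag-stripped hashes agree, which is the distinction being drawn here between file identity and media identity.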
void
Developer
Posts: 17149
Joined: Fri Oct 16, 2009 11:31 pm

Re: Everything has a problem in identifying duplicate content.

Post by void »

I recommend
dupe:size
to instantly find possibly duplicated files.

I recommend
dupe:size;sha256
to find duplicated content.
The sha256 will only be calculated for files with the same size.
Calculating sha256 will take a very long time.
For the best performance, combine the dupe:size;sha256 search with other search filters.
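The size-then-hash idea can be sketched outside Everything as follows. This is not Everything's implementation, only a minimal Python illustration of the same two-stage approach; `sha256_of` and `find_duplicate_content` are names invented for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_content(paths):
    """Return groups of paths whose size and SHA-256 both match."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        for path in group:
            by_hash[(size, sha256_of(path))].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

A call such as `find_duplicate_content(Path("F:/").rglob("*.mp3"))` would restrict the work to MP3 files on the USB drive, which mirrors the advice to combine the search with other filters.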

Finding duplicates in Everything 1.5
ChrisGreaves
Posts: 697
Joined: Wed Jan 05, 2022 9:29 pm

Re: Everything has a problem in identifying duplicate content.

Post by ChrisGreaves »

void wrote: Thu Dec 19, 2024 8:33 pm
> I recommend
> dupe:size
> to instantly find possibly duplicated files.
Thank you Void for this quick response.
[Image: Dupes_10.jpg]
Sadly, on my collection of 20,000 MP3 files it finds 2,632 items, so about 1,300 duplicate pairs.
Now some of them might indeed be the same audio track but misnamed. Haydn Symphony 093.MP3 and Haydn Symphony 091.MP3 might easily be an error in my naming/saving of the file, but I have listened to both and they are different audios. (I am not capable of listening to an audio and declaring the name of it, but I recognize different tunes when I hear them!).
The "size"s are the same, so that works, but I strongly suspect that this is partly due to an audio editor outputting "packets" of 64 bytes of audio. The MP3 format has a header, a trailer, and a series of packets of data; so a quantum effect comes into play. But please don't let's go any deeper than that.
> I recommend
> dupe:size;sha256
> to find duplicated content. The sha256 will only be calculated for files with the same size.
[Image: Dupes_11.jpg]
Agreed. This search produces 16 items from 20,000, and only two of the eight "duplicate" pairs are NOT duplicates.
> Calculating sha256 will take a very long time.
It does indeed, but that's what sleeping-at-night time is for (grin!)
> For the best performance, combine the dupe:size;sha256 search with other search filters.
My guess at this time is that not only is a definition of "duplicate" an insoluble problem, but the features that could be built into Everything would need to consider every Audio format, every Video format, and every Picture/Image format to provide a full solution.
That is, for now, SHA256 is our best hope.
I shall consider testing duplicate images, of which I have about 102,000 spinning around at 7200 rpm.

You know I love Everything, but locating audio duplicates has been plaguing me for a long time. Over several years I have wrestled with the definition of "duplicate audio" and it is probably impossible to define. Let alone "my" versus "your" definition, I have my own definitions that shift over time.
Some years ago I couldn't get enough of "If" by Bread, so I downloaded it and made a half-dozen copies. Seven duplicates. After a year I cut that down to four copies. Duplicates. Now I am down to just one copy on my system. No duplicates. So the definition of "duplicate audio" (to be removed) varies over time for just one user.
And that is just for MP3 files.
I suspect that a rational and isolated definition of "duplicates" is an insoluble problem.

Thanks again for your insights.
Cheers, Chris
ChrisGreaves
Posts: 697
Joined: Wed Jan 05, 2022 9:29 pm

Re: Everything has a problem in identifying duplicate content.

Post by ChrisGreaves »

therube wrote: Thu Dec 19, 2024 7:36 pm
> So yes, you have a dupe. ...
Therube, thanks for wading through my lengthy post, and for your comments.
I agree that using (as an example) combinations of "size" and "name" will not be especially good at detecting what to MY ear are duplicates.
SHA256 is an improvement, but I suspect that not only is there no easy definition of a "duplicated audio file" that can be used by all users, but that such a definition can never be made, which takes us out of Everything and into the deep pool of Philosophy.
Cheers, Chris
therube
Posts: 5056
Joined: Thu Sep 03, 2009 6:48 pm

Re: Everything has a problem in identifying duplicate content.

Post by therube »

> dupe:size;sha256
> This search produces 16 items from 20,000, and only two of the eight "duplicate" pairs are NOT duplicates
Run that by me again.
You're saying that Shostakovich <> 05, & that Bach <> Bruckner?

That, using a sha256 hash, you've run into two sets of (hash) collisions?

If so, something is wrong, somewhere.


Run those two sets of files through a different program, such as HashMyFiles, & see if they still compare or not.
Or use a different hash algorithm in Everything - on those 4 files; dupe:size;sha1 or dupe:size;sha512 or dupe:size;md5 (or whatever else Everything/Windows may afford).
therube
Posts: 5056
Joined: Thu Sep 03, 2009 6:48 pm

Re: Everything has a problem in identifying duplicate content.

Post by therube »

> to MY ear are duplicates
Duplicates.
If you want "duplicates", files that are exactly the same, then mathematics my boy, mathematics.
Or you can compare content of the files in question.
(Or A.I. - cause A.I. is all knowing, of course. [Not!])

Duplicates.
If you want files that are similar, or similar enough, for you to consider duplicates, that is an entirely different matter.


But if you knock "true" duplicates out of the picture, then all that are left are ones that you need to deal with in some other manner. Might make the work load a little bit lighter (depending on one's situation).
void
Developer
Posts: 17149
Joined: Fri Oct 16, 2009 11:31 pm

Re: Everything has a problem in identifying duplicate content.

Post by void »

> Agreed. This search produces 16 items from 20,000, and only two of the eight "duplicate" pairs are NOT duplicates
"Shostakovich - Symphony No 05"... and "05- Why was I born"... have the same content.
"J. S. Bach Trio Sonata BWV 1039"... and "Bruckner - Symphony No.7"... have the same content.

If you play them, do they have the exact same audio?

They are listed in your
dupe:size
search, no need to do the dupe:size;sha256 search again.
NotNull
Posts: 5517
Joined: Wed May 24, 2017 9:22 pm

Re: Everything has a problem in identifying duplicate content.

Post by NotNull »

ChrisGreaves wrote: Thu Dec 19, 2024 7:02 pm
> I remain convinced that we can NOT truly detect duplicate audio files using solely a computer program.
A couple of ways I can think of, but indeed: in several stages of impractical to infeasible.


Would it be possible to "invert" the amplitudes of song A in Audacity (or a similar program) and superimpose that output on song B?
If the songs are exactly the same, the + output and the - output should cancel each other out. A bit like how noise cancelling works.
If they largely cancel each other out, shifting song A a bit to the left/right (on the time axis) should fix silence at the beginning/end.
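The inversion idea can be sketched on plain lists of samples. This is only a toy illustration with a synthetic tone, not a procedure for real MP3 files (which would first have to be decoded to samples); `residual_peak` and `best_shift` are names invented here.

```python
import math


def residual_peak(a, b, shift=0):
    """Largest leftover amplitude after adding inverted a (delayed by shift) to b."""
    peak = 0.0
    overlap = 0
    for i, sample_b in enumerate(b):
        j = i - shift
        if 0 <= j < len(a):
            peak = max(peak, abs(sample_b - a[j]))  # b + (-a)
            overlap += 1
    return peak if overlap else math.inf


def best_shift(a, b, max_shift):
    """Shift (in samples) of a that cancels b most completely."""
    return min(range(-max_shift, max_shift + 1),
               key=lambda s: residual_peak(a, b, s))


# Synthetic 440 Hz tone at an assumed 8,000 samples per second.
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]
delayed = [0.0] * 5 + tone          # same audio with 5 samples of leading silence
louder = [1.2 * x for x in tone]    # same audio, amplified

print(residual_peak(tone, tone))            # identical tracks cancel completely
print(best_shift(tone, delayed, 20))        # offset that restores cancellation
print(residual_peak(tone, louder) > 0.1)    # amplified copy does not cancel
```

Note the last case: if one track is amplified relative to the other, as the Audacity screenshot earlier in the thread seemed to suggest, plain inversion leaves a residual even though the recording is the same.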


There is also "audio fingerprinting". Based on a couple of characteristics of a song, this generates a unique "number".
Mobile apps like SoundHound and Shazam operate this way.
On Windows, there is MusicBrainz Picard. Here MusicBrainz is the online database of fingerprints and Picard is the client application that generates a fingerprint, searches for matches in the database and can "tag" your music files with the information from the database.

This is quite helpful to detect songs, but I expect it to be less useful in case of classical music as this is basically "the same old song, but by another orchestra and another conductor" ;) . But maybe fingerprinting is advanced enough to distinguish between them.
Regular songs can possibly be problematic too. For example: Pink Floyd's album (The) Dark Side of the Moon is available in 1400+ different versions (!!). Doubt if fingerprinting can detect each one.


What might help in Everything is showing these tags. That gives extra information about the music files:
- In Everything, right-click the result-list header (for example on "Name")
- Select Add Columns
- In the left pane, select Audio
- In the right pane, 'CTRL click' all properties of interest
- Press OK

Now the resultlist has some extra columns with information that might help in finding duplicates (you might need to scroll to the right for that)

ChrisGreaves wrote: Fri Dec 20, 2024 2:46 pm
> into the deep pool of Philosophy
... and I'll jump right in, without any swimming certificate .. to tell a (short) story about "Less is more".

I went through a similar process and ended up deleting most of the music on my systems. Now I'm down to less than 10.000 songs (estimated) and instead of "cattle" that needs to be moved around, I now have lots of "pets" to cherish.
It's well structured, but in the off-chance that there are duplicates: I don't care.
ChrisGreaves
Posts: 697
Joined: Wed Jan 05, 2022 9:29 pm

***I*** have a problem in identifying duplicate content.

Post by ChrisGreaves »

My sincere apologies to all who have responded to date. I thought I knew what I was doing, but I did not really understand SHA256 and other features of Everything. In particular I fell afoul of my life-long cautions about the use of the term "duplicate".

I shall spend part of this Christmas break setting up TWO more sets of files:-
(1) my current 20,000 tracks which play on my jukebox 24/7 and
(2) my accumulated weekly backup of, perhaps, 30,000 tracks, whose greater number results in part from repeated downloads, and sometimes by moving folders around.
That is, duplicates will have arisen as genuinely innocent acts on my part; I am not looking for carefully-constructed examples to thwart Everything.
> "Shostakovich - Symphony No 05"... and "05- Why was I born"... have the same content.
> "J. S. Bach Trio Sonata BWV 1039"... and "Bruckner - Symphony No.7"... have the same content.
@VOID you are correct, those two pairs of files are indeed identical in content

Thanks again, Chris