How to search for en dash or em dash?

Post by **MartinPC** » Thu Dec 03, 2020 7:39 pm

With its default settings, at least, Everything appears to conflate hyphens [-], en dashes [–], and em dashes [—]. Whether I search for – or — or #x2013: or #x2014:, my results list includes files that have only hyphens in them.

I'm pretty sure I have some filenames with en dashes in them, and possibly some with em dashes. I'd like to be able to find them without having to wade through 300,000+ "hyphens-only" files.

Is there a configuration setting or search operator that would allow me to override Everything's default dash conflation and search only for the specific, literal type of dash I've entered?

Hopefully, I haven't overlooked something obvious ... but I wouldn't count on it!

Thanks for any help anyone is able to provide.

Post by **NotNull** » Thu Dec 03, 2020 8:08 pm

You can enable the Match Diacritics filter (Menu:Search > Match Diacritics) [1] and search for:

– | —

to find all file and folder names that contain an en dash and/or an em dash.

Alternatively, search for

diacritics:— | diacritics:–

(No need to enable the Match Diacritics filter in this case)

[1] Don't forget to change the filter back to Everything afterwards; otherwise all your following searches will be "diacritics-sensitive".

Note:
I hope the forum software doesn't mess up the dashes, but I suspect you get the point.

Post by **therube** » Thu Dec 03, 2020 8:22 pm

regex:\x{2013}
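
The same idea can be sketched outside Everything. The Python snippet below is only an illustration, not anything Everything does internally: it matches the two code points U+2013 (en dash) and U+2014 (em dash) and never the ASCII hyphen U+002D. The function name and sample filenames are hypothetical.

```python
import re

# U+2013 is the en dash, U+2014 is the em dash; the ASCII hyphen is U+002D
# and is deliberately not part of the character class.
DASH_RE = re.compile("[\u2013\u2014]")


def has_en_or_em_dash(name: str) -> bool:
    """Return True if the name contains an en dash or an em dash."""
    return DASH_RE.search(name) is not None


if __name__ == "__main__":
    # Hypothetical sample filenames, for illustration only.
    samples = [
        "report-final.txt",            # hyphen only
        "2019\u20132020 budget.xlsx",  # en dash
        "notes \u2014 draft.md",       # em dash
    ]
    for name in samples:
        print(name, has_en_or_em_dash(name))
```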

Post by **NotNull** » Thu Dec 03, 2020 8:28 pm

And another suggestion if you want to replace the dashes with regular hyphens:

Everything has a multi-rename feature that you can use to replace patterns with something else in multiple filenames at once.

Use one of the searches from above
Select the files where you want to replace the dashes with hyphens (or something else)
Menu:File > Rename
Enable Regex
Old Format:
–|—
(don't use spaces here)
New Format: - (I used __ here for clarity)
A preview of the new names is shown in the New Filenames box
Something like this: [screenshot of the Rename dialog: 2020-12-03 21_22_23-Rename.png]
If all looks good, press the OK button
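
For anyone who would rather script this outside Everything, the replacement step amounts to a single regex substitution. This is a minimal Python sketch, assuming the same "en dash | em dash" alternation as the Old Format field above. `replace_dashes` is a hypothetical helper: it only computes the new name and does not touch the file system.

```python
import re

# Same alternation as the Old Format field: en dash (U+2013) | em dash (U+2014).
DASH_RE = re.compile("\u2013|\u2014")


def replace_dashes(name: str, replacement: str = "-") -> str:
    """Return the name with every en dash and em dash replaced."""
    return DASH_RE.sub(replacement, name)


if __name__ == "__main__":
    # Hypothetical filename; prints the proposed new name as a preview.
    old = "2019\u20132020 \u2014 summary.txt"
    print(old, "->", replace_dashes(old))
```

Passing `replacement="__"` mirrors the "for clarity" variant mentioned in the steps.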

Post by **MartinPC** » Fri Dec 04, 2020 3:27 am

Wow — lots of feedback, all of it prompt and useful! Thank you all!

I think the "Match diacritics" Search menu setting is going to serve me best, most of the time.

I can see how conflating all types of dashes might be a good default for most users, but I happened to come across a similarly useful conflation — possibly a more useful conflation — that wasn't an Everything default:

By default, Everything doesn't seem to treat single-character ligatures and their equivalent two-character sequences as interchangeable. Examples:

æ and ae

œ and oe

ß and ss

ĳ [Unicode hexadecimal character 0133] and ij

If you search for æ with Match diacritics disabled, you (apparently) get results containing æ or any variant of a (a, à, etc.) but not ae. If you search for æ with Match diacritics enabled, you only get results containing æ (and not ae).

Ditto for œ, mutatis mutandis.

If you search for ß, you only get results containing ß, regardless of whether Match diacritics is enabled or not.

Ditto for ĳ, mutatis mutandis.

And if you search for a two-character sequence, you don't get results containing the equivalent ligature.

This is important because some filenames and content may contain the "proper" ligatures, and others may have been typed by people who used the quick-and-dirty two-character substitute. (There may also be different conventions in different countries, e.g., Switzerland and Liechtenstein, where ß has fallen out of use and is generally replaced with ss.) Typically, users will want to find both (and only both).

Is there a general setting anywhere that would enable "ligature conflation"? Or is it necessary to use special "or" search syntax each time? And if there isn't a general setting, do you think it's worth proposing one as a new feature?
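
To make concrete what I mean by "ligature conflation", here is a rough Python sketch. It is purely an illustration of the idea and says nothing about how Everything works internally: both the query and the filename are folded to the two-character form before comparing. The table and the function names are my own assumptions.

```python
# Lowercase ligatures mapped to their two-character equivalents.
# str.casefold() already turns the sharp s (U+00DF) into "ss" and lowercases
# the uppercase ligatures, so only three entries are needed here.
LIGATURES = {
    "\u00e6": "ae",  # ae ligature
    "\u0153": "oe",  # oe ligature
    "\u0133": "ij",  # ij ligature
}


def fold_ligatures(text: str) -> str:
    """Casefold the text and expand ligatures to two-character sequences."""
    text = text.casefold()
    for ligature, pair in LIGATURES.items():
        text = text.replace(ligature, pair)
    return text


def matches(query: str, name: str) -> bool:
    """Substring match that treats ligatures and their expansions as equal."""
    return fold_ligatures(query) in fold_ligatures(name)
```

With this folding, searching for either the ligature or the two-character sequence finds both spellings, which is the "both and only both" behaviour described above.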

I realize this is off-topic, but my curiosity got piqued and the respondents to this thread seem fairly knowledgeable, so I thought I'd bring it up on the off chance someone has an answer.

Again, thank you all for the very helpful feedback. It fixed my problem! I appreciate it!

Post by **therube** » Fri Dec 04, 2020 4:20 pm

Future version, 1.5:

æ + Match Diacritics, finds, Ágætis byrjun.mp3
æ + (no match), finds both: Ágætis byrjun.mp3 & also Antrum Sibyllae.mp3

ß + Match Diacritics, finds Arne Zank - Ich weiß es nicht.mp3
ß + (no match), finds both: Ich weiß es nicht.mp3 & also Miss Moon.mp3

The others, not sure about offhand.

Post by **NotNull** » Fri Dec 04, 2020 5:25 pm

œ and æ are 'composed' characters (probably not the right term for it), like ä and ò.
You can search for - for example - halos to find hælos and HÆLOS

In the upcoming major upgrade, Everything 1.5 (currently in development), the mechanism to find those characters will change. Then, in addition to what therube already mentioned, you can use:

- ss to find ß
- oe to find œ
- etc.

Quick tip:
I use this to find all characters outside the standard printable ASCII range (space through tilde):

regex:"[^ -~]"
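
The character class `[^ -~]` matches any single character that is not in the range from space (U+0020) to tilde (U+007E). A small Python sketch of the same pattern, with a hypothetical helper name, shows which characters in a filename would trigger a match:

```python
import re

# Anything outside the printable ASCII range: space (0x20) through tilde (0x7E).
NOT_PRINTABLE_ASCII = re.compile(r"[^ -~]")


def unusual_chars(name: str) -> list:
    """Return the distinct characters in the name that the pattern matches, sorted."""
    return sorted(set(NOT_PRINTABLE_ASCII.findall(name)))


if __name__ == "__main__":
    # Hypothetical filename; prints each matching character with its code point.
    for ch in unusual_chars("caf\u00e9 \u2013 men\u00fc.txt"):
        print(ch, "U+%04X" % ord(ch))
```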

P.S.: I liked reading your posts!

Post by **MartinPC** » Sat Dec 05, 2020 5:43 am

Very cool news that more sophisticated digraph/ligature/diacritic support will be coming with version 1.5!

This is actually a somewhat more involved subject than I initially assumed. For example:

The uppercase version of ĳ [U+0133] is Ĳ [U+0132] in Dutch but a standard capital Y [U+0059] in Afrikaans.

In German, ä, ö and ü were derived from ae, oe, and ue and are still replaced with the original two characters in a pinch — but not necessarily in other languages.

The Icelandic eth (Ð and ð) and thorn (Þ and þ) are usually transliterated as D and d and TH/Th and th on keyboards and in character sets that don't support them.

In French, there is at least one word that can be correctly spelled using either ñ or ny (cañon / canyon). (Would it be worth including a substitution for that one word? Probably not.)

Successfully searching for any type of quotation mark in a French filename or document would require returning ", “, ”, «, and ». And if you're searching for a word or phrase inside quotation marks, you would have to ignore any kind of space following « or preceding ». Normally, you're supposed to use narrow no-break spaces inside French quotation marks [« word »] or, failing that, regular no-break spaces [« word »], but most typists use breaking regular spaces [« word »] or no spaces at all [«word»]. (Would it be worth providing substitutions for quotes? Absolutely.)
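
As a rough illustration of the quotation-mark case, here is a Python regex sketch (my own assumption of how one could express it, not an Everything feature). It accepts any of the listed opening and closing marks and lets any amount of whitespace, including none, sit between the marks and the phrase. In Python, `\s` on text strings also matches the no-break space U+00A0 and the narrow no-break space U+202F. `quoted_pattern` is a hypothetical helper name.

```python
import re

OPENING = "\"\u201c\u00ab"  # straight quote, left curly quote, left guillemet
CLOSING = "\"\u201d\u00bb"  # straight quote, right curly quote, right guillemet


def quoted_pattern(phrase: str) -> "re.Pattern":
    """Build a regex matching the phrase inside any supported quotation marks."""
    return re.compile(
        "[" + OPENING + "]"   # any opening mark
        + r"\s*"              # optional spaces of any kind after it
        + re.escape(phrase)   # the literal phrase
        + r"\s*"              # optional spaces of any kind before the close
        + "[" + CLOSING + "]"  # any closing mark
    )
```

Note that this sketch does not check that the opening and closing marks belong to the same pair; a stricter version would need one alternative per pair.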

I'm sure there are lots of other examples of common variances between the typographic ideal and what real-world typists do in practice. I'm not familiar with any non-Latin alphabets or syllabaries, but I'm guessing that some of them present similar challenges.

I haven't thought this through, but I wonder whether it would ultimately be useful to add language-specific "character substitution tables," distinct from the global "Match diacritics" setting. I'm not a coder, but I have a hunch it wouldn't be conceptually or technically difficult, just tedious work fleshing out the contents for each language. And for the end user, it could be as easy as simply enabling or disabling common character substitutions for a given language. But as I said, I haven't really thought it through.
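
For what it's worth, the idea could be sketched in Python along these lines. Everything here is hypothetical: the table contents are just the German and Icelandic examples from this post, the language codes and function names are made up, and this is not a proposal for how Everything should implement it.

```python
# Hypothetical per-language substitution tables (lowercase forms only).
TABLES = {
    "de": {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"},
    "is": {"\u00f0": "d", "\u00fe": "th"},
}


def fold(text: str, enabled_langs=()) -> str:
    """Lowercase the text, then apply the tables of the enabled languages."""
    # lower() is used instead of casefold() so that the sharp s is only
    # expanded when the German table is actually enabled.
    text = text.lower()
    for lang in enabled_langs:
        for special, plain in TABLES[lang].items():
            text = text.replace(special, plain)
    return text


def matches(query: str, name: str, enabled_langs=()) -> bool:
    """Substring match after folding both sides with the same tables."""
    return fold(query, enabled_langs) in fold(name, enabled_langs)
```

Enabling a language is then just a matter of passing its code, e.g. `matches("mueller", "Müller.txt", ["de"])`, while leaving the list empty keeps the literal behaviour.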

Thanks once again to everyone who chimed in. I'm grateful for such a responsive and supportive forum!

voidtools forum