Search by file name language type

Post by **Ismale.d** » Tue Feb 21, 2023 6:25 am

As per the title, can this be done?
For example, if the file name contains Chinese, or both Chinese and English?

Thanks!

Post by **void** » Tue Feb 21, 2023 6:30 am

Please try the following searches:
Chinese (Han):
regex:[\p{Han}]

English (Latin):
regex:[\p{Latin}]

Chinese and English:
regex:[\p{Han}] regex:[\p{Latin}]

Chinese and English ignoring the extension:
regex:[\p{Han}].*\.[^.]*$ regex:[\p{Latin}].*\.[^.]*$
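As a rough illustration of how the combined searches above behave, here is a minimal Python sketch. It is not how Everything evaluates a search. Python's built-in re module has no \p{...} script classes, so the ranges below are narrower stand-ins (an assumption), and the function names are made up for this example.

```python
import re

# Assumption: these ranges only approximate the \p{Han} and \p{Latin} classes.
HAN = r"[\u4E00-\u9FFF]"   # CJK Unified Ideographs block only
LATIN = r"[A-Za-z]"        # basic Latin letters only


def has_both(name):
    # Two space-separated regex: terms must both match somewhere in the name.
    return bool(re.search(HAN, name)) and bool(re.search(LATIN, name))


def has_both_ignoring_extension(name):
    # The added suffix requires the character to occur before the last dot,
    # so letters that appear only in the extension no longer count.
    suffix = r".*\.[^.]*$"
    han = re.search(HAN + suffix, name)
    latin = re.search(LATIN + suffix, name)
    return bool(han) and bool(latin)
```

With these stand-ins, "中文.txt" passes the first check only because of the Latin letters in its extension, and fails the second. Note that the second form also needs a dot in the name, so a file with no extension would not match.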

The following scripts are also supported:
regex:[\p{Adlam}]
regex:[\p{Ahom}]
regex:[\p{Anatolian_Hieroglyphs}]
regex:[\p{Arabic}]
regex:[\p{Armenian}]
regex:[\p{Avestan}]
regex:[\p{Balinese}]
regex:[\p{Bamum}]
regex:[\p{Bassa_Vah}]
regex:[\p{Batak}]
regex:[\p{Bengali}]
regex:[\p{Bhaiksuki}]
regex:[\p{Bopomofo}]
regex:[\p{Brahmi}]
regex:[\p{Braille}]
regex:[\p{Buginese}]
regex:[\p{Buhid}]
regex:[\p{Canadian_Aboriginal}]
regex:[\p{Carian}]
regex:[\p{Caucasian_Albanian}]
regex:[\p{Chakma}]
regex:[\p{Cham}]
regex:[\p{Cherokee}]
regex:[\p{Chorasmian}]
regex:[\p{Common}]
regex:[\p{Coptic}]
regex:[\p{Cuneiform}]
regex:[\p{Cypriot}]
regex:[\p{Cypro_Minoan}]
regex:[\p{Cyrillic}]
regex:[\p{Deseret}]
regex:[\p{Devanagari}]
regex:[\p{Dives_Akuru}]
regex:[\p{Dogra}]
regex:[\p{Duployan}]
regex:[\p{Egyptian_Hieroglyphs}]
regex:[\p{Elbasan}]
regex:[\p{Elymaic}]
regex:[\p{Ethiopic}]
regex:[\p{Georgian}]
regex:[\p{Glagolitic}]
regex:[\p{Gothic}]
regex:[\p{Grantha}]
regex:[\p{Greek}]
regex:[\p{Gujarati}]
regex:[\p{Gunjala_Gondi}]
regex:[\p{Gurmukhi}]
regex:[\p{Han}]
regex:[\p{Hangul}]
regex:[\p{Hanifi_Rohingya}]
regex:[\p{Hanunoo}]
regex:[\p{Hatran}]
regex:[\p{Hebrew}]
regex:[\p{Hiragana}]
regex:[\p{Imperial_Aramaic}]
regex:[\p{Inherited}]
regex:[\p{Inscriptional_Pahlavi}]
regex:[\p{Inscriptional_Parthian}]
regex:[\p{Javanese}]
regex:[\p{Kaithi}]
regex:[\p{Kannada}]
regex:[\p{Katakana}]
regex:[\p{Kayah_Li}]
regex:[\p{Kharoshthi}]
regex:[\p{Khitan_Small_Script}]
regex:[\p{Khmer}]
regex:[\p{Khojki}]
regex:[\p{Khudawadi}]
regex:[\p{Lao}]
regex:[\p{Latin}]
regex:[\p{Lepcha}]
regex:[\p{Limbu}]
regex:[\p{Linear_A}]
regex:[\p{Linear_B}]
regex:[\p{Lisu}]
regex:[\p{Lycian}]
regex:[\p{Lydian}]
regex:[\p{Mahajani}]
regex:[\p{Makasar}]
regex:[\p{Malayalam}]
regex:[\p{Mandaic}]
regex:[\p{Manichaean}]
regex:[\p{Marchen}]
regex:[\p{Masaram_Gondi}]
regex:[\p{Medefaidrin}]
regex:[\p{Meetei_Mayek}]
regex:[\p{Mende_Kikakui}]
regex:[\p{Meroitic_Cursive}]
regex:[\p{Meroitic_Hieroglyphs}]
regex:[\p{Miao}]
regex:[\p{Modi}]
regex:[\p{Mongolian}]
regex:[\p{Mro}]
regex:[\p{Multani}]
regex:[\p{Myanmar}]
regex:[\p{Nabataean}]
regex:[\p{Nandinagari}]
regex:[\p{New_Tai_Lue}]
regex:[\p{Newa}]
regex:[\p{Nko}]
regex:[\p{Nushu}]
regex:[\p{Nyakeng_Puachue_Hmong}]
regex:[\p{Ogham}]
regex:[\p{Ol_Chiki}]
regex:[\p{Old_Hungarian}]
regex:[\p{Old_Italic}]
regex:[\p{Old_North_Arabian}]
regex:[\p{Old_Permic}]
regex:[\p{Old_Persian}]
regex:[\p{Old_Sogdian}]
regex:[\p{Old_South_Arabian}]
regex:[\p{Old_Turkic}]
regex:[\p{Old_Uyghur}]
regex:[\p{Oriya}]
regex:[\p{Osage}]
regex:[\p{Osmanya}]
regex:[\p{Pahawh_Hmong}]
regex:[\p{Palmyrene}]
regex:[\p{Pau_Cin_Hau}]
regex:[\p{Phags_Pa}]
regex:[\p{Phoenician}]
regex:[\p{Psalter_Pahlavi}]
regex:[\p{Rejang}]
regex:[\p{Runic}]
regex:[\p{Samaritan}]
regex:[\p{Saurashtra}]
regex:[\p{Sharada}]
regex:[\p{Shavian}]
regex:[\p{Siddham}]
regex:[\p{SignWriting}]
regex:[\p{Sinhala}]
regex:[\p{Sogdian}]
regex:[\p{Sora_Sompeng}]
regex:[\p{Soyombo}]
regex:[\p{Sundanese}]
regex:[\p{Syloti_Nagri}]
regex:[\p{Syriac}]
regex:[\p{Tagalog}]
regex:[\p{Tagbanwa}]
regex:[\p{Tai_Le}]
regex:[\p{Tai_Tham}]
regex:[\p{Tai_Viet}]
regex:[\p{Takri}]
regex:[\p{Tamil}]
regex:[\p{Tangsa}]
regex:[\p{Tangut}]
regex:[\p{Telugu}]
regex:[\p{Thaana}]
regex:[\p{Thai}]
regex:[\p{Tibetan}]
regex:[\p{Tifinagh}]
regex:[\p{Tirhuta}]
regex:[\p{Toto}]
regex:[\p{Ugaritic}]
regex:[\p{Unknown}]
regex:[\p{Vai}]
regex:[\p{Vithkuqi}]
regex:[\p{Wancho}]
regex:[\p{Warang_Citi}]
regex:[\p{Yezidi}]
regex:[\p{Yi}]
regex:[\p{Zanabazar_Square}]

PCRE Unicode character properties

Post by **Ismale.d** » Wed Feb 22, 2023 10:30 pm

Hi, thanks for the reply. However, can you give me more hints on what terms I should use to search for the syntax for, say, Japanese?

The pcre2pattern spec is too technical for me. I have tried searching for "pre2 unicode language list", "pre2 unicode Japanese", etc., and I can't find anything that works.

Post by **void** » Wed Feb 22, 2023 11:40 pm

To search for Hiragana OR Katakana:
regex:[\p{Hiragana}\p{Katakana}]

To search for Hiragana OR Katakana OR Han:
regex:[\p{Hiragana}\p{Katakana}\p{Han}]

Using Unicode ranges might be better.

For example, Kanji:
regex:[\u4E00-\u9FFF]

https://stackoverflow.com/questions/19899554/unicode-range-for-japanese
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

Post by **Ismale.d** » Mon Mar 06, 2023 8:45 am

Wow, this is pretty new to me. Thanks for the help and the references! I couldn't have understood it otherwise.

Post by **Ismale.d** » Mon Mar 06, 2023 10:06 am

void wrote: ↑Wed Feb 22, 2023 11:40 pm
To search for Hiragana OR Katakana:
regex:[\p{Hiragana}] | regex:[\p{Katakana}]

To search for Hiragana OR Katakana OR Han:
regex:[\p{Hiragana}] | regex:[\p{Katakana}] | regex:[\p{Han}]

Using Unicode ranges might be better.

For example, Kanji:
regex:[\u4E00-\u9FFF]

https://stackoverflow.com/questions/19899554/unicode-range-for-japanese
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

Oh, I played around with it a bit, and the range syntax doesn't actually work. Maybe an additional symbol is needed? I also tried ranges for other languages, especially English, and none of them worked. (AC00, D743; U+0000, U+007F)

Post by **void** » Tue Mar 07, 2023 4:58 am

The PCRE syntax is:
\x{hhh..} character with hex code hhh..

Use a - inside [ and ] to specify a range.

Please try the following:

regex:[\p{Hiragana}\p{Katakana}\x{4E00}-\x{9FFF}]
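For anyone building such ranges from code point tables, the following Python sketch formats code points into the \x{hhh..} syntax described above. The helper names pcre_range and everything_regex are hypothetical, and the script only produces the search text; it does not run the search.

```python
def pcre_range(first, last):
    # Format an inclusive code point range as \x{HHHH}-\x{HHHH}.
    if first > last:
        raise ValueError("first must not exceed last")
    return f"\\x{{{first:04X}}}-\\x{{{last:04X}}}"


def everything_regex(*ranges):
    # Join one or more (first, last) ranges inside a single character class.
    body = "".join(pcre_range(first, last) for first, last in ranges)
    return f"regex:[{body}]"


print(everything_regex((0x4E00, 0x9FFF)))  # prints regex:[\x{4E00}-\x{9FFF}]
```

Several ranges can be passed at once, for example the Hiragana (3040 to 309F) and Katakana (30A0 to 30FF) blocks together with the Kanji range, which gives one class covering all three.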

PCRE Non-printing characters

Post by **Ismale.d** » Tue Mar 07, 2023 1:23 pm

Works perfectly! Really appreciate the help!

Post by **samiaziz** » Tue Jan 23, 2024 3:39 pm

That is very useful. However, what can I do to search for filenames written in a given language (like Korean) exclusively without any characters from another language?

Post by **NotNull** » Tue Jan 23, 2024 4:28 pm

The Korean alphabet is called Hangul (says Internet ..)

With that:

!regex:[^\p{Hangul}]

Explanation:
regex:[^\p{Hangul}] = show all files/folders that have non-Korean characters anywhere in their names.
!regex:... = show all files/folders except the ones found above, meaning only files whose names consist exclusively of Korean characters.

samiaziz wrote: ↑Tue Jan 23, 2024 3:39 pm
what can I do to search for filenames written in a given language (like Korean) exclusively without any characters from another language?

Note that the search query above will not list files with a "normal" extension, like .txt, .jpg or .zip, as those extensions consist of non-Korean characters. The same goes for files with digits (0...9) in their names.
So I don't know how practical this will be, but this is what you asked
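The double negation can be sketched in Python as follows. This is only an illustration under an assumption: the range U+AC00 to U+D7A3 (precomposed Hangul syllables) stands in for \p{Hangul}, which covers more characters, and is_hangul_only is a made-up name.

```python
import re

# Assumption: precomposed Hangul syllables only, narrower than \p{Hangul}.
NON_HANGUL = re.compile(r"[^\uAC00-\uD7A3]")


def is_hangul_only(name):
    # The negated class finds any character outside the range;
    # negating that result keeps names with no such character.
    return NON_HANGUL.search(name) is None
```

Here "한글" passes, while "한글.txt" and "한글2" do not, because the dot, the Latin letters and the digit all fall outside the class.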

Regular Expressions Syntax

Post by **void** » Wed Jan 24, 2024 2:55 am

Please consider the following search to ignore the extension:

regex:^[\p{Hangul}]+\.[a-z]+$

Post by **samiaziz** » Sat Jan 27, 2024 6:34 pm

void wrote: ↑Wed Jan 24, 2024 2:55 am
Please consider the following search to ignore the extension:

regex:^[\p{Hangul}]+\.[a-z]+$

Thanks a lot. That is exactly what I was looking for.

The following search gives the same result of ignoring the extension:

regex:stem:^[\p{Hangul}]+$
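The two searches agree for names such as "한글.txt", but the patterns are not strictly equivalent. The Python sketch below compares them under some assumptions: the U+AC00 to U+D7A3 range stands in for \p{Hangul}, stem: is taken to mean the name without its last extension (modeled here with PurePath.stem), matching is case sensitive, and the function names are invented for the example.

```python
import re
from pathlib import PurePath

# Assumption: precomposed Hangul syllables only, narrower than \p{Hangul}.
HANGUL_WITH_EXT = re.compile(r"^[\uAC00-\uD7A3]+\.[a-z]+$")
HANGUL_ONLY = re.compile(r"^[\uAC00-\uD7A3]+$")


def matches_with_extension(name):
    # Hangul name, a dot, then an extension made of the letters a-z.
    return HANGUL_WITH_EXT.search(name) is not None


def matches_stem(name):
    # Test only the part of the name before the last extension.
    return HANGUL_ONLY.search(PurePath(name).stem) is not None
```

In this sketch both functions accept "한글.txt", but only matches_stem accepts "한글.mp4" (the extension contains a digit) and "한글" (no extension at all).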

Post by **samiaziz** » Sat Jan 27, 2024 6:55 pm

NotNull wrote: ↑Tue Jan 23, 2024 4:28 pm
So I don't know how practical this will be, but this is what you asked

Thank you for your response,

I have some downloaded files whose names are only in a foreign language, and I want to add a translation in my language to these file names without removing the original names.

When I search for the Korean language in the file name for example, the search result lists:

file names in Korean only,

and file names in Korean and other languages (which I have already changed).

I simply want to exclude the second category of files from the search results by searching file names in Korean only.

Post by **NotNull** » Sat Jan 27, 2024 7:29 pm

I understand. Thanks for explaining. What I meant was that usually the file extension is *not* in Korean, so that would skip lots of files that still might be of interest. But you mentioned:

exclusively without any characters from another language?

And the name of any .txt file contains characters from another language, namely t, x and t.

But now I get that you wanted the filename *without extension* to be Korean-only.

Anyway .. problem solved
