Content Indexing - html tokens

Discussion related to "Everything" 1.5 Alpha.
aaathemtheyzzz
Posts: 10
Joined: Wed Mar 04, 2020 10:23 pm

Post by aaathemtheyzzz »

"Everything will treat the following file types as plain text files: htm"
Does this mean that when I search for "to be or not to be", Everything will also search all the HTML tags that may sit in the middle of the phrase, like:

to <b>be</b> or <i>not</i> to <b>be</b> 
If this is the case, then when someone searches for "type javascript", almost every htm file on the face of the earth will be in the results :shock:

I think HTML-to-text parsing can be easy, even with the help of external preprocessors such as HTMLAsText from nirsoft.net.
Because of varied needs and industry conflicts, I think support for external preprocessors such as pdftotext, and for calling other COM classes (e.g. Word, Excel, or any format that shows up in the future), will be a must.
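
To illustrate the kind of preprocessing suggested here, the following is a minimal sketch of HTML-to-text conversion using only Python's standard `html.parser` module. It is not how Everything or HTMLAsText works; `TextExtractor` and `html_to_text` are hypothetical names chosen for this example. Note that this simple version keeps the bodies of script and style elements as text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are skipped.
        self.parts.append(data)


def html_to_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse runs of whitespace so a phrase search sees single spaces.
    return " ".join("".join(extractor.parts).split())


sample = "to <b>be</b> or <i>not</i> to <b>be</b>"
print(html_to_text(sample))  # to be or not to be
```

With the tags stripped first, a literal phrase search for "to be or not to be" would find this text.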
void
Developer
Posts: 16676
Joined: Fri Oct 16, 2009 11:31 pm

Re: Content Indexing - html tokens

Post by void »

htm and html files are treated as text/plain.

to <b>be</b> or <i>not</i> to <b>be</b>
is treated literally, tags included, as:
to <b>be</b> or <i>not</i> to <b>be</b>

content:"to <b>be</b> or <i>not</i> to <b>be</b>"
would match:
to <b>be</b> or <i>not</i> to <b>be</b>

content:"to be or not to be"
would *not* match:
to <b>be</b> or <i>not</i> to <b>be</b>
as the spaces are treated as literal.

content:<to be or not to be>
would match:
to <b>be</b> or <i>not</i> to <b>be</b>
as the search expression is expanded to: to AND be AND or AND not AND to AND be

content:<type javascript>
would match most htm files.
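
The difference between the quoted form and the < > form can be sketched in Python. This only models the behaviour described above (literal substring versus AND of the individual terms) and assumes case-insensitive matching; `phrase_match` and `all_terms_match` are hypothetical helper names, not part of Everything.

```python
def phrase_match(content, phrase):
    # Models content:"..." : the whole phrase must appear as one
    # literal substring, spaces included.
    return phrase.lower() in content.lower()


def all_terms_match(content, terms):
    # Models content:<...> : the expression is expanded to
    # term AND term AND ..., each tested independently.
    haystack = content.lower()
    return all(term.lower() in haystack for term in terms.split())


html = "to <b>be</b> or <i>not</i> to <b>be</b>"
script_tag = '<script type="text/javascript">'

print(phrase_match(html, "to be or not to be"))          # False
print(all_terms_match(html, "to be or not to be"))       # True
print(all_terms_match(script_tag, "type javascript"))    # True
```

The last line shows why content:<type javascript> hits most htm files: both words appear in a typical script tag.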

You could do something like:
content:javascript !regex:content:"<.*javascript.*>"
To ignore javascript inside < and >
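
The effect of this exclusion can be approximated in Python. This is a sketch only: it uses Python's `re` module rather than Everything's own regex engine, assumes case-insensitive matching, and `search_hit` is a hypothetical helper name. Note that the pattern is greedy, so it drops any file in which javascript sits anywhere between a < and a later > on the same line, including plain text wrapped in tags.

```python
import re

# Same pattern as in the search above, compiled with Python's re module.
INSIDE_ANGLE_BRACKETS = re.compile(r"<.*javascript.*>", re.IGNORECASE)


def search_hit(content):
    # Models: content:javascript combined with the negated regex term.
    has_word = "javascript" in content.lower()
    excluded = INSIDE_ANGLE_BRACKETS.search(content) is not None
    return has_word and not excluded


print(search_hit("Notes on javascript closures"))      # True
print(search_hit('<script type="text/javascript">'))   # False
print(search_hit("<p>javascript</p>"))                 # False
```

A tighter pattern such as <[^>]*javascript[^>]*> would restrict the exclusion to a single tag, though it would still exclude the whole file rather than just that occurrence.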

content:"<script type=&quot:text/javascript&quot:"
would match:
<script type="text/javascript"