Content Indexing - html tokens

Post by **aaathemtheyzzz** » Tue Apr 27, 2021 5:41 am

Everything will treat the following file types as plain text files: htm

Does this mean that when I search for "to be or not to be", Everything will also search all the HTML tags that may sit in the middle of the phrase, like:

to <b>be</b> or <i>not</i> to <b>be</b>

If this is the case, then when someone searches for "type javascript", almost every htm file on the face of the earth will be in the results.

I think HTML-to-text parsing can be easy, even with the help of external preprocessors like HTMLAsText from nirsoft.net.
Because of varied needs and industry conflicts, I think support for external preprocessors like pdftotext, and for calling other COM classes (e.g. Word, Excel, or any format that shows up in the future), will be a must.
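To illustrate the kind of HTML-to-text preprocessing meant here, this is a minimal sketch using only Python's standard library. It is not how Everything or HTMLAsText works; the names `TextExtractor` and `html_to_text` are made up for this example, and skipping `<script>` and `<style>` bodies is an assumption about what a searcher would want.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumption: these bodies are not wanted

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML fragment with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(html_to_text("to <b>be</b> or <i>not</i> to <b>be</b>"))
```

With text extracted this way, a phrase search would see the words without the tags in between, and "javascript" inside a script tag's attributes would no longer count as content.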

Post by **void** » Tue Apr 27, 2021 11:40 am

htm and html are treated as text/plain.

to <b>be</b> or <i>not</i> to <b>be</b>
is treated as:
to <b>be</b> or <i>not</i> to <b>be</b>

content:"to be or not to be"
would match:
to be or not to be

content:"to be or not to be"
would *not* match:
to <b>be</b> or <i>not</i> to <b>be</b>
as the spaces are treated as literal.

content:<to be or not to be>
would match:
to <b>be</b> or <i>not</i> to <b>be</b>
as the search expression is expanded to: to AND be AND or AND not AND to AND be
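The difference between the quoted phrase and the `< >` group can be modelled roughly in Python. This is only a sketch of the two matching rules described above, assuming plain case-insensitive substring matching; `phrase_match` and `all_terms_match` are hypothetical helpers, not part of Everything.

```python
def phrase_match(content, phrase):
    """Quoted phrase: the whole phrase, spaces included, must appear literally."""
    return phrase.lower() in content.lower()


def all_terms_match(content, terms):
    """Grouped terms: split on spaces and require every term to appear (AND)."""
    lowered = content.lower()
    return all(term.lower() in lowered for term in terms.split())


if __name__ == "__main__":
    plain = "to be or not to be"
    tagged = "to <b>be</b> or <i>not</i> to <b>be</b>"

    print(phrase_match(plain, "to be or not to be"))      # True
    print(phrase_match(tagged, "to be or not to be"))     # False
    print(all_terms_match(tagged, "to be or not to be"))  # True
```

The tagged text fails the phrase test because "to be" never occurs with only a space between the words, while each individual word is still present somewhere in the file.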

content:<type javascript>
would match most htm files.

You could do something like:
content:javascript !regex:content:"<.*javascript.*>"
To ignore javascript inside < and >
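As a rough check of what that exclusion does, here is a sketch with Python's `re` module. Everything's regex engine may differ in details such as how `.` treats newlines, so take this as an approximation. The `wanted` helper and the stricter pattern `STRICT_TAG` are not from this thread; they are added here only for comparison.

```python
import re

# Pattern from the search above: greedy, so it can span several tags on a line.
GREEDY_TAG = re.compile(r"<.*javascript.*>", re.IGNORECASE)

# Hypothetical stricter variant: "javascript" must sit inside a single tag.
STRICT_TAG = re.compile(r"<[^<>]*javascript[^<>]*>", re.IGNORECASE)


def wanted(content, tag_pattern):
    """True if content mentions javascript and tag_pattern does not exclude it."""
    return "javascript" in content.lower() and not tag_pattern.search(content)


if __name__ == "__main__":
    samples = [
        '<script type="text/javascript"></script>',
        "Learn javascript today",
        "<p>Learn javascript today</p>",
    ]
    for text in samples:
        print(wanted(text, GREEDY_TAG), wanted(text, STRICT_TAG), text)
```

In this model the greedy pattern also rejects the third sample, because `<p>` ... `</p>` on one line looks like "javascript between < and >". The stricter variant only rejects the first sample.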

content:"<script type=&quot:text/javascript&quot:"
would match:
<script type="text/javascript"

voidtools forum
