Only content index the first X plain text lines of file type

Post by **cbear92** » Thu Feb 03, 2022 2:01 pm

Hi all, I have a bunch of plain text files with data that can be quite large, but the first 10 lines or so contain all the pertinent information that I want to be able to search. Indexing all of the file content is impossible without consuming several GBs of memory or more.

Is it possible to only index the first X lines of a plain text file of specified file extension? If not, is that something that people would be interested in as a feature?

Running 1.5a.

Post by **NotNull** » Thu Feb 03, 2022 6:00 pm

Topic moved to the Everything 1.5 Alpha forum.

Post by **horst.epp** » Thu Feb 03, 2022 6:34 pm

I don't think that an option to index only a few lines makes sense.
Everything can query the Windows index content,
so let Windows index your text files.
The Windows indexer will not need much memory to do so, regardless of the file sizes.

Post by **raccoon** » Thu Feb 03, 2022 8:49 pm

Everything can index the first N or last N bytes as searchable content, but not lines. Make a rough estimate of how many bytes are in 10 lines and use that.

Options are First or Last 1/2/4/8/16/32/64/128/256/512 Bytes. If you choose the largest option, "First 512 Bytes," that averages out to about 50 bytes per line for a 10 line capture, which seems reasonable. If you really need First 1024 or 2048 or 4096 or 8192 bytes, then issue a feature request.
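To check whether a 512-byte window really covers the first 10 lines, the size of those lines can be measured directly. The following is a minimal sketch in Python, not part of Everything; the helper name `bytes_for_first_lines` and the sample data are made up for illustration.

```python
import io

def bytes_for_first_lines(stream, n_lines=10):
    """Count the bytes taken up by the first n_lines lines of a binary stream."""
    total = 0
    for _ in range(n_lines):
        line = stream.readline()  # includes the trailing newline, if any
        if not line:              # end of file reached early
            break
        total += len(line)
    return total

# Hypothetical sample: 12 lines of 10 bytes each ("key=value" plus newline).
sample = io.BytesIO(b"key=value\n" * 12)
needed = bytes_for_first_lines(sample)
print(needed)         # 100
print(needed <= 512)  # True: a "First 512 Bytes" property would cover it
```

For a real file, open it with `open(path, "rb")` and pass the file object in place of `sample`.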

Post by **horst.epp** » Fri Feb 04, 2022 7:26 am

raccoon wrote: ↑Thu Feb 03, 2022 8:49 pm Everything can index the first N or last N bytes as searchable content, but not lines. Make a rough estimate of how many bytes are in 10 lines and use that.

Options are First or Last 1/2/4/8/16/32/64/128/256/512 Bytes. If you choose the largest option, "First 512 Bytes," that averages out to about 50 bytes per line for a 10 line capture, which seems reasonable. If you really need First 1024 or 2048 or 4096 or 8192 bytes, then issue a feature request.

The size limit is a global content setting.
If he sets this to limit the content for text files
it will make almost useless any other sort of content indexing.

Post by **void** » Fri Feb 04, 2022 7:40 am

I will consider an option to only index xx bytes when using content indexing.
Thanks for the suggestion.

For now, please consider copying all your text files with only the first few lines and indexing those.

-or-

Disable content indexing and store your txt files on an NVMe SSD.
Everything will max out your NVMe SSD read speeds.

-or-

Use Windows Indexing to index your file content as mentioned by horst.epp.
Use the si: search function to search your system index.

-or-

Disable content indexing, store your txt files on an SSD/HDD and use the startwith:binary:content: search.

This will treat the content as binary (you can only search ASCII text)
Only the required content is read from disk. (not the whole file)
For example, startwith:binary:content:hello will only read 5 bytes from each file.

If the content is unknown, the wildcards: modifier can also be used: wildcards:binary:content:

For example, wildcards:binary:content:??llo* will only read 5 bytes from each file.
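The idea of reading only as many bytes as the search term needs can be illustrated outside Everything. This is a rough Python sketch of the concept, not Everything's implementation; `starts_with` and `starts_with_pattern` are hypothetical helper names, and the pattern helper only handles `?` as a single-byte wildcard (a trailing `*` is simply left off, since it matches whatever follows).

```python
import os
import tempfile

def starts_with(path, prefix: bytes) -> bool:
    """Read only len(prefix) bytes and compare them to the prefix."""
    with open(path, "rb") as f:
        return f.read(len(prefix)) == prefix

def starts_with_pattern(path, pattern: bytes) -> bool:
    """Like starts_with, but b'?' in the pattern matches any single byte."""
    with open(path, "rb") as f:
        head = f.read(len(pattern))
    if len(head) < len(pattern):
        return False  # file is shorter than the pattern
    return all(p == ord("?") or p == h for p, h in zip(pattern, head))

# Demo on a temporary file holding made-up ASCII content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world\nsecond line\n")
    demo_path = tmp.name

print(starts_with(demo_path, b"hello"))          # True, 5 bytes read
print(starts_with_pattern(demo_path, b"??llo"))  # True, 5 bytes read
print(starts_with(demo_path, b"world"))          # False
os.remove(demo_path)
```

Because the comparison is done on raw bytes, it only lines up with the search text when the file stores that text as single-byte ASCII, which matches the restriction described above.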

binary:

Post by **raccoon** » Fri Feb 04, 2022 4:52 pm

horst.epp wrote: ↑Fri Feb 04, 2022 7:26 am
raccoon wrote: ↑Thu Feb 03, 2022 8:49 pm Everything can index the first N or last N bytes as searchable content, but not lines. Make a rough estimate of how many bytes are in 10 lines and use that.

Options are First or Last 1/2/4/8/16/32/64/128/256/512 Bytes. If you choose the largest option, "First 512 Bytes," that averages out to about 50 bytes per line for a 10 line capture, which seems reasonable. If you really need First 1024 or 2048 or 4096 or 8192 bytes, then issue a feature request.
The size limit is a global content setting.
If he sets this to limit the content for text files
it will make almost useless any other sort of content indexing.

Are we talking about the same thing? "First 512 Bytes" is a literal Named Property and not a setting or subsetting. It's its own independent collection, like Artist and Album. This should not interfere with other content indexing.

Post by **horst.epp** » Sat Feb 05, 2022 4:54 pm

raccoon wrote: ↑Fri Feb 04, 2022 4:52 pm Are we talking about the same thing? "First 512 Bytes" is a literal Named Property and not a setting or subsetting. It's its own independent collection, like Artist and Album. This should not interfere with other content indexing.

So we are talking about property indexing and not content indexing.

Post by **raccoon** » Sat Feb 05, 2022 8:12 pm

Properties are content too. At least as I understand it. I don't think adding the First 512 Bytes property creates a detriment.

Post by **horst.epp** » Sun Feb 06, 2022 11:31 am

raccoon wrote: ↑Sat Feb 05, 2022 8:12 pm Properties are content too. At least as I understand it. I don't think adding the First 512 Bytes property creates a detriment.

For me, any properties are a form of metadata and not the content.

Post by **raccoon** » Sun Feb 06, 2022 4:05 pm

horst.epp wrote: ↑Sun Feb 06, 2022 11:31 am For me, any properties are a form of metadata and not the content.

I guess that's a more accurate way of putting it, yes. As you would use First512Bytes:<stuff> to interact with the "First 512 Bytes" property. I just didn't understand the statement "If he sets this to limit the content for text files. it will make almost useless any other sort of content indexing." I don't think that is accurate.

Post by **horst.epp** » Sun Feb 06, 2022 4:51 pm

raccoon wrote: ↑Sun Feb 06, 2022 4:05 pm
horst.epp wrote: ↑Sun Feb 06, 2022 11:31 am For me, any properties are a form of metadata and not the content.
I guess that's a more accurate way of putting it, yes. As you would use First512Bytes:<stuff> to interact with the "First 512 Bytes" property. I just didn't understand the statement "If he sets this to limit the content for text files. it will make almost useless any other sort of content indexing." I don't think that is accurate.

If he were to set the size limit for content indexing too small,
there would be no useful index for other types, such as document files.

Post by **raccoon** » Sun Feb 06, 2022 6:45 pm

horst.epp wrote: ↑Sun Feb 06, 2022 4:51 pm If he were to set the size limit for content indexing too small,
there would be no useful index for other types, such as document files.

But, again, this is an isolated and independent PROPERTY unto itself. First512Bytes is a fully qualified PROPERTY and not a SETTING to be set.

Post by **cbear92** » Wed Feb 09, 2022 8:43 pm

Apologies for the late reply. I had a funeral to deal with this week. I also wrote out a reply and then my browser crashed, so I lost it.

horst.epp wrote: ↑Thu Feb 03, 2022 6:34 pm I don't think that an option to index only a few lines makes sense.
Everything can query the Windows index content,
so let Windows index your text files.
The Windows indexer will not need much memory to do so, regardless of the file sizes.

Thanks for the reply. I was hoping to avoid using the Windows indexer if at all possible, especially since it is indexing my NAS drives, and I wanted to have it all under the single server instance. But I will look into that more, as I'm not super familiar with it.

raccoon wrote: ↑Thu Feb 03, 2022 8:49 pm Everything can index the first N or last N bytes as searchable content, but not lines. Make a rough estimate of how many bytes are in 10 lines and use that.

Options are First or Last 1/2/4/8/16/32/64/128/256/512 Bytes. If you choose the largest option, "First 512 Bytes," that averages out to about 50 bytes per line for a 10 line capture, which seems reasonable. If you really need First 1024 or 2048 or 4096 or 8192 bytes, then issue a feature request.

I tried this and it worked well. 512 Bytes covers probably 99% of my use cases I reckon, but I would like to aim for completeness with 1024 Bytes so I think I may put in a feature request. Only problem is that the property index doesn't carry forward from server to client, so it needs to be indexed twice.

void wrote: ↑Fri Feb 04, 2022 7:40 am I will consider an option to only index xx bytes when using content indexing.
Thanks for the suggestion.

For now, please consider copying all your text files with only the first few lines and indexing those.

-or-

Disable content indexing and store your txt files on an NVMe SSD.
Everything will max out your NVMe SSD read speeds.

-or-

Use Windows Indexing to index your file content as mentioned by horst.epp.
Use the si: search function to search your system index.

-or-

Disable content indexing, store your txt files on an SSD/HDD and use the startwith:binary:content: search.

This will treat the content as binary (you can only search ASCII text)
Only the required content is read from disk. (not the whole file)
For example, startwith:binary:content:hello will only read 5 bytes from each file.

If the content is unknown, the wildcards: modifier can also be used: wildcards:binary:content:

For example, wildcards:binary:content:??llo* will only read 5 bytes from each file.

binary:

Thanks for the reply. I have rather a lot of files that are continuously being produced, so unfortunately duplicating them (even truncated) is not an option with my current NAS storage.

I tried using the startwith:binary:content search modifier but it wouldn't bring up any results for me, even when I pointed it to the same directory as a test file. The only way that I got the search to work using "first-512-bytes" was to first use python to convert my search string to utf-8 hex format and then search using that string. A modifier function that would convert plain text to hex would be really useful in this case as that is what the property is expecting. The wildcard modifier worked great though.
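The text-to-hex conversion mentioned above takes only a line or two of Python. This is a minimal sketch; the function name `to_utf8_hex` is made up, and the exact hex format the property search accepts (letter case, separators) is not stated in this thread, so treat the output format as an assumption.

```python
def to_utf8_hex(text: str) -> str:
    """Encode text as UTF-8 and return the bytes as a lowercase hex string."""
    return text.encode("utf-8").hex()

print(to_utf8_hex("hello"))  # 68656c6c6f
```

Use `to_utf8_hex(text).upper()` if uppercase hex digits are needed.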

As I mentioned above, is there a way for the property indexing to carry forward to the client instance, rather than indexing the properties on both instances separately?

Post by **void** » Sat Feb 12, 2022 1:23 am

Had a funeral to deal with this week.

Sorry for your loss.

Only problem is that the property index doesn't carry forward from server to client, so it needs to be indexed twice.

A Property Server is planned for Everything 1.6.
The Everything Server in 1.5 will be limited to filenames only.

I tried using the startwith:binary:content search modifier but it wouldn't bring up any results for me, even when I pointed it to the same directory as a test file.

Could you please give some more details on the files you are searching?
The binary: search modifier will treat your files as a byte stream.

For example:
Is the file text/plain?
What is the encoding of the file?

If I add a function/property to search only x bytes of text, the function would have to be very strict with the encoding.
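The encoding matters because the same text is stored as different bytes under different encodings, so a byte-level search for ASCII text will miss, for example, UTF-16 files. The Python sketch below shows this and checks the start of some data for a byte order mark; `sniff_bom` is a hypothetical helper, and a missing BOM does not prove a file is ASCII or UTF-8.

```python
import codecs

# UTF-32 BOMs are listed before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(head: bytes):
    """Return an encoding name if head starts with a known BOM, else None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None

ascii_bytes = "hello".encode("ascii")
utf16_bytes = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")

print(ascii_bytes.startswith(b"hello"))  # True
print(utf16_bytes.startswith(b"hello"))  # False: BOM plus two bytes per character
print(sniff_bom(utf16_bytes))            # utf-16-le
print(sniff_bom(ascii_bytes))            # None
```

For a real file, pass in the first few bytes, for example `open(path, "rb").read(4)`.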

The 'first x bytes' properties are designed for binary file types.

Ideally, Everything needs a 'first x characters' property.
I have put this on my TODO list.

As I mentioned above, is there a way for the property indexing to carry forward to the client instance, rather than indexing the properties on both instances separately?

Currently, no.
Each instance must maintain its own property index.

voidtools forum
