Only content index the first X plain text lines of file type
Hi all, I have a bunch of plain text files with data that can be quite large, but the first 10 lines or so contain all the pertinent information that I want to be able to search. Indexing all of the file content is impossible without consuming several GB of memory or more.
Is it possible to only index the first X lines of a plain text file of specified file extension? If not, is that something that people would be interested in as a feature?
Running 1.5a.
Re: Only content index the first X plain text lines of file type
Topic moved to the Everything 1.5 Alpha forum.
Re: Only content index the first X plain text lines of file type
I don't think that an option to index only a few lines makes sense.
Everything can query the Windows content index,
so let Windows index your text files.
The Windows indexer does not need a large amount of memory to do this, regardless of the file sizes.
Re: Only content index the first X plain text lines of file type
Everything can index the first N or last N bytes as searchable content, but not lines. Make a rough estimate of how many bytes is in 10 lines and use that.
Options are First or Last 1/2/4/8/16/32/64/128/256/512 Bytes. If you choose the largest option, "First 512 Bytes," that averages out to about 50 bytes per line for a 10 line capture, which seems reasonable. If you really need First 1024 or 2048 or 4096 or 8192 bytes, then issue a feature request.
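To pick a size, it may help to measure how many bytes the first 10 lines actually occupy in a few sample files. Here is a minimal Python sketch of that estimate. The helper names are hypothetical (nothing here is part of Everything), and the size list simply mirrors the options listed above.

```python
# Hypothetical helpers for estimating a header size; not part of Everything.
# The sizes mirror the "First N Bytes" options mentioned in this thread.
PROPERTY_SIZES = (1, 2, 4, 8, 16, 32, 64, 128, 256, 512)


def header_size(data: bytes, n_lines: int = 10) -> int:
    """Count the bytes taken by the first n_lines lines, line endings included."""
    lines = data.splitlines(keepends=True)
    return sum(len(line) for line in lines[:n_lines])


def header_size_of_file(path, n_lines: int = 10, probe: int = 8192) -> int:
    """Same as header_size, but reads at most `probe` bytes from the file."""
    with open(path, "rb") as f:
        return header_size(f.read(probe), n_lines)


def smallest_fitting_size(n_bytes: int, sizes=PROPERTY_SIZES):
    """Return the smallest listed size that covers n_bytes, or None if none does."""
    for size in sizes:
        if size >= n_bytes:
            return size
    return None
```

If `smallest_fitting_size` returns `None` for many of your files, their headers are longer than 512 bytes and a feature request for a larger size would be the next step.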
Re: Only content index the first X plain text lines of file type
raccoon wrote: ↑Thu Feb 03, 2022 8:49 pm
> Everything can index the first N or last N bytes as searchable content, but not lines. Make a rough estimate of how many bytes is in 10 lines and use that.
> Options are First or Last 1/2/4/8/16/32/64/128/256/512 Bytes. If you choose the largest option, "First 512 Bytes," that averages out to about 50 bytes per line for a 10 line capture, which seems reasonable. If you really need First 1024 or 2048 or 4096 or 8192 bytes, then issue a feature request.
The size limit is a global content setting.
If he sets this to limit the content for text files
it will make almost useless any other sort of content indexing.
Re: Only content index the first X plain text lines of file type
I will consider an option to only index xx bytes when using content indexing.
Thanks for the suggestion.
For now, please consider copying all your text files with only the first few lines and indexing those.
-or-
Disable content indexing and store your txt files on an NVMe SSD.
Everything will max out your NVMe SSD read speeds.
-or-
Use Windows Indexing to index your file content as mentioned by horst.epp.
Use the si: search function to search your system index.
-or-
Disable content indexing, store your txt files on an SSD/HDD and use the startwith:binary:content: search.
This will treat the content as binary (you can only search ASCII text)
Only the required content is read from disk. (not the whole file)
For example, startwith:binary:content:hello will only read 5 bytes from each file.
If the content is unknown, the wildcards: modifier can also be used: wildcards:binary:content:
For example, wildcards:binary:content:??llo* will only read 5 bytes from each file.
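To illustrate why a prefix search only needs a handful of bytes per file, here is a rough Python sketch of the same idea. It is not how Everything implements startwith:binary:content: internally; the function names are made up, and the comparison is a plain case-sensitive byte match, which is an assumption of this sketch.

```python
from pathlib import Path


def starts_with(path, needle: bytes) -> bool:
    """Read only len(needle) bytes from the start of the file and compare them."""
    with open(path, "rb") as f:
        return f.read(len(needle)) == needle


def find_by_prefix(root, needle: bytes, pattern: str = "*.txt"):
    """Return files under root matching `pattern` whose content begins with needle."""
    return sorted(
        p for p in Path(root).rglob(pattern)
        if p.is_file() and starts_with(p, needle)
    )
```

For a 5-byte needle such as `b"hello"`, each file costs one 5-byte read, no matter how large the file is.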
Re: Only content index the first X plain text lines of file type
horst.epp wrote: ↑Fri Feb 04, 2022 7:26 am
> raccoon wrote: ↑Thu Feb 03, 2022 8:49 pm
> > Everything can index the first N or last N bytes as searchable content, but not lines. Make a rough estimate of how many bytes is in 10 lines and use that.
> > Options are First or Last 1/2/4/8/16/32/64/128/256/512 Bytes. If you choose the largest option, "First 512 Bytes," that averages out to about 50 bytes per line for a 10 line capture, which seems reasonable. If you really need First 1024 or 2048 or 4096 or 8192 bytes, then issue a feature request.
> The size limit is a global content setting.
> If he sets this to limit the content for text files
> it will make almost useless any other sort of content indexing.
Are we talking about the same thing? "First 512 Bytes" is a literal Named Property and not a setting or subsetting. It's its own independent collection, like Artist and Album. This should not interfere with other content indexing.
Re: Only content index the first X plain text lines of file type
So we are talking about property indexing and not content indexing.
Re: Only content index the first X plain text lines of file type
Properties are content too. At least as I understand it. I don't think adding the First 512 Bytes property creates a detriment.
Re: Only content index the first X plain text lines of file type
I guess that's a more accurate way of putting it, yes. As you would use First512Bytes:<stuff> to interact with the "First 512 Bytes" property. I just didn't understand the statement "If he sets this to limit the content for text files. it will make almost useless any other sort of content indexing." I don't think that is accurate.
Re: Only content index the first X plain text lines of file type
raccoon wrote: ↑Sun Feb 06, 2022 4:05 pm
> I guess that's a more accurate way of putting it, yes. As you would use First512Bytes:<stuff> to interact with the "First 512 Bytes" property. I just didn't understand the statement "If he sets this to limit the content for text files. it will make almost useless any other sort of content indexing." I don't think that is accurate.
If he were to set the size limit for content indexing too small,
there would be no useful index for other types, like document files for example.
Re: Only content index the first X plain text lines of file type
But, again, this is an isolated and independent PROPERTY unto itself. First512Bytes is a fully qualified PROPERTY and not a SETTING to be set.
Re: Only content index the first X plain text lines of file type
Apologies for the late reply. Had a funeral to deal with this week. Also wrote out a reply and then browser crashed so I lost it.
Thanks for the reply. I was hoping to avoid using the Windows indexer if at all possible, especially since it is indexing my NAS drives, and I was wanting to have it all under the single server instance. But I will look into that more as I'm not super familiar with it.
raccoon wrote: ↑Thu Feb 03, 2022 8:49 pm
> Everything can index the first N or last N bytes as searchable content, but not lines. Make a rough estimate of how many bytes is in 10 lines and use that.
> Options are First or Last 1/2/4/8/16/32/64/128/256/512 Bytes. If you choose the largest option, "First 512 Bytes," that averages out to about 50 bytes per line for a 10 line capture, which seems reasonable. If you really need First 1024 or 2048 or 4096 or 8192 bytes, then issue a feature request.
I tried this and it worked well. 512 Bytes covers probably 99% of my use cases I reckon, but I would like to aim for completeness with 1024 Bytes so I think I may put in a feature request. Only problem is that the property index doesn't carry forward from server to client, so it needs to be indexed twice.
void wrote: ↑Fri Feb 04, 2022 7:40 am
> I will consider an option to only index xx bytes when using content indexing.
> Thanks for the suggestion.
> For now, please consider copying all your text files with only the first few lines and indexing those.
> -or-
> Disable content indexing and store your txt files on an NVMe SSD.
> Everything will max out your NVMe SSD read speeds.
> -or-
> Use Windows Indexing to index your file content as mentioned by horst.epp.
> Use the si: search function to search your system index.
> -or-
> Disable content indexing, store your txt files on an SSD/HDD and use the startwith:binary:content: search.
> This will treat the content as binary (you can only search ASCII text)
> Only the required content is read from disk. (not the whole file)
> For example, startwith:binary:content:hello will only read 5 bytes from each file.
> If the content is unknown, the wildcards: modifier can also be used: wildcards:binary:content:
> For example, wildcards:binary:content:??llo* will only read 5 bytes from each file.
Thanks for the reply. I have rather a lot of files that are continuously being produced, so unfortunately duplicating them (even truncated) is not an option with my current NAS storage.
I tried using the startwith:binary:content search modifier but it wouldn't bring up any results for me, even when I pointed it to the same directory as a test file. The only way that I got the search to work using "first-512-bytes" was to first use Python to convert my search string to UTF-8 hex format and then search using that string. A modifier function that would convert plain text to hex would be really useful in this case as that is what the property is expecting. The wildcard modifier worked great though.
As I mentioned above, is there a way for the property indexing to carry forward to the client instance, rather than indexing the properties on both instances separately?
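For anyone wanting to do the same text-to-hex conversion, a minimal Python sketch could look like the following. The function name and the optional separator are my own assumptions; the thread does not state the exact hex layout the property expects (separator, letter case), so adjust to whatever matches in your search.

```python
def to_utf8_hex(text: str, sep: str = "") -> str:
    """Encode text as UTF-8 and return the bytes as lowercase hex pairs."""
    return sep.join(f"{byte:02x}" for byte in text.encode("utf-8"))


# Example: build the hex form of a search term before pasting it into a search.
term_hex = to_utf8_hex("hello")
spaced_hex = to_utf8_hex("hello", sep=" ")
```

Note that non-ASCII characters expand to more than one byte in UTF-8, so the hex string can be longer than two digits per character.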
Re: Only content index the first X plain text lines of file type
> Had a funeral to deal with this week.
Sorry for your loss.
> Only problem is that the property index doesn't carry forward from server to client, so it needs to be indexed twice.
A Property Server is planned for Everything 1.6.
The Everything Server in 1.5 will be limited to filenames only.
> I tried using the startwith:binary:content search modifier but it wouldn't bring up any results for me, even when I pointed it to the same directory as a test file.
Could you please give some more details on the files you are searching.
The binary: search modifier will treat your files as a byte stream.
For example:
Is the file text/plain?
What is the encoding of the file?
If I add a function/property to search only x bytes of text, the function would have to be very strict with the encoding.
The 'first x bytes' properties are designed for binary file types.
Ideally, Everything needs a 'first x characters' property.
I have put this on my TODO list.
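The difference between "first x bytes" and "first x characters" can be shown with a short Python sketch. This is only an illustration of the encoding problem described above, not of anything Everything does; the function names are hypothetical. It may also be one reason a byte-wise search finds nothing in a given file, although the cause is not confirmed in this thread.

```python
def first_bytes_as_text(data: bytes, n_bytes: int, encoding: str = "utf-8") -> str:
    """Cut at a byte boundary, then decode; a split character becomes U+FFFD."""
    return data[:n_bytes].decode(encoding, errors="replace")


def first_chars(data: bytes, n_chars: int, encoding: str = "utf-8") -> str:
    """Decode first, then cut at a character boundary (needs the right encoding)."""
    return data.decode(encoding)[:n_chars]


utf8 = "naïve".encode("utf-8")          # 'ï' takes two bytes in UTF-8
utf16 = "hello".encode("utf-16-le")     # every ASCII letter is followed by a zero byte

cut_by_bytes = first_bytes_as_text(utf8, 3)   # ends in the replacement character
cut_by_chars = first_chars(utf8, 3)           # keeps 'ï' intact
ascii_needle_found = b"hello" in utf16        # False: the zero bytes break the match
```

A byte limit can therefore cut a character in half, and an ASCII search term will not match a UTF-16 file byte for byte, which is why a character-based property would need to know the encoding.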
> As I mentioned above, is there a way for the property indexing to carry forward to client instance, rather than indexing the properties on both instances separately?
Currently, no.
Each instance must maintain its own property index.