
HOWTO Elasticsearch caching highlights

Global cache management logic

In Elasticsearch, cache expiry is not based on elapsed time: cached entries are evicted when their memory is needed for a different request/data set and there is not enough free memory.

Different kinds of data are cached, and the caching decision does not follow the same rules for each of them:

  • field data
  • filter bitset results

Bitsets caching

Bitsets are the result of matching a set of documents against a filter (e.g. type:apache AND tenant:mytenant).

Bitsets are cached automatically when a filter subpart is reused soon enough (several times within the last 256 requests). This means that the full request result itself is not cached; only useful intermediate results are, and these may accelerate other requests that share common filtering parts.
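
As an illustration (not taken from the platform configuration), this is roughly what such a filter looks like when sent to Elasticsearch directly; a minimal Python sketch, assuming the cluster answers on localhost:9200 and that an index pattern such as mytenant-* exists. The clauses placed in the bool query's filter array run in filter context, so Elasticsearch may cache each of them as a bitset and reuse it in later queries sharing the same clause.

    import requests

    # Both term clauses run in filter context (no scoring); Elasticsearch may
    # cache the matching-document bitset of each clause for reuse in later queries.
    query = {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"type": "apache"}},
                    {"term": {"tenant": "mytenant"}},
                ]
            }
        }
    }

    resp = requests.post("http://localhost:9200/mytenant-*/_search", json=query)
    print(resp.json()["hits"]["total"])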

For example:

  • an update/automatic refresh of a dashboard, or additional filtering on the same scope,
  • the switch from a 'counting' request to a 'searching' request, once the customer has filtered enough to reduce the amount of interesting logs and now wants to actually see/summarize/export the matched logs, or apply a complex dashboard to this selection.

Bitsets are incrementally updated by Elasticsearch if new documents are indexed in the time scope covered by a bitset while it is still cached (this allows efficient 'real-time' refreshing of requests/dashboards).

Information source from the Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/guide/current/filter-caching.html

Field data caching

Field data is the actual value contained in each field of the document (i.e. the raw message or a parsed field), for all documents within an Elasticsearch index. It is not to be mistaken for the "indexing" information that Elasticsearch always keeps in memory for all opened indexes, and that allows filtering/boolean selection of documents from the index.

Field data is used for the aggregations performed by requests. An "aggregation" on field data may be displaying the field value, sorting on it, building a time histogram on it, running a regexp/wildcard expression on it... that is, anything except pure filtering.

Elasticsearch ALWAYS assumes that if you aggregate on a field of an index, you will be interested in making further aggregation requests on the same field for ALL documents in this index (even if your query contains a filter that selects only 100 documents from the index!).
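
To illustrate this point, here is a minimal Python sketch (host, index pattern and field names are placeholders, not the platform defaults): even though the filter matches only a small subset of documents, the terms aggregation makes Elasticsearch load field data for the aggregated field across every document of every index the query touches.

    import requests

    # The filter is cheap, but the aggregation on "action" triggers field data
    # loading for that field on all documents of the selected indices.
    query = {
        "size": 0,
        "query": {"bool": {"filter": [{"term": {"obs.ip": "192.168.0.254"}}]}},
        "aggs": {"actions": {"terms": {"field": "action"}}},
    }

    resp = requests.post("http://localhost:9200/mytenant-arkoon-*/_search", json=query)
    print(resp.json()["aggregations"]["actions"]["buckets"])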

Therefore, each aggregated field used in a dashboard will load a lot of data into memory, for each index selected by your query. With the standard LMC configuration, this may be one index for each (TENANT; DAY; LOG TYPE). So do not be too generous with your time scope selection, and always restrict the request to the log type(s) and days that really interest you, and to the dashboard panes that you really need at a given time.
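
To make that concrete, a request can target a handful of explicit indices instead of a broad wildcard pattern. The index names below follow a made-up "tenant-type-day" pattern chosen for illustration only; it is not the actual LMC naming convention.

    import requests

    # Narrow scope: one tenant, one log type, two specific days.
    indices = "mytenant-arkoon-2024.01.14,mytenant-arkoon-2024.01.15"
    query = {"query": {"bool": {"filter": [{"term": {"action": "drop"}}]}}}

    resp = requests.post(f"http://localhost:9200/{indices}/_search", json=query)
    print(resp.json()["hits"]["total"])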

With the memory settings configured in a standard PunchPlatform deployment, Elasticsearch will automatically "evict" old field data from memory if you issue a new request and there is not enough room in memory for the related data. But because of the massive amount of log data that you may have in store, this unloading/reloading may cause a lot of I/O and wait time; combining many fields/indicators with many days is therefore a killer for response time and overall platform availability, for ALL its users.
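
A quick way to observe this behaviour is the node statistics API, which reports how much memory field data currently uses and how many evictions have occurred; frequent evictions are a sign that queries keep pushing each other's field data out of memory. A minimal read-only sketch, assuming the cluster answers on localhost:9200:

    import requests

    # Field data memory usage and eviction count, per node.
    stats = requests.get("http://localhost:9200/_nodes/stats/indices/fielddata").json()
    for node_id, node in stats["nodes"].items():
        fd = node["indices"]["fielddata"]
        print(node.get("name", node_id),
              "fielddata bytes:", fd["memory_size_in_bytes"],
              "evictions:", fd["evictions"])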

If your request (or any of the multiple requests hidden behind your dashboard panes) requires too much field data at once, compared to the overall memory available on the servers for field data, your request may be rejected by the cluster to avoid a breakdown.
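
This rejection is typically enforced by Elasticsearch's field data circuit breaker. Its configured limit, current estimated usage and the number of times it has tripped can also be read from the node statistics; a minimal read-only sketch, again assuming localhost:9200:

    import requests

    # The "fielddata" breaker rejects requests that would load more field data
    # than the configured limit.
    stats = requests.get("http://localhost:9200/_nodes/stats/breaker").json()
    for node_id, node in stats["nodes"].items():
        breaker = node["breakers"]["fielddata"]
        print(node.get("name", node_id),
              "limit:", breaker["limit_size"],
              "estimated:", breaker["estimated_size"],
              "tripped:", breaker["tripped"])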

Information source from the Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html

Good usage of Elasticsearch cache and query design efficiency

Avoid fully displaying a complex dashboard on a wide time scope

Displaying multiple histograms and indicators requires actually loading the related field indexes and data into Elasticsearch memory. While you are browsing the time scope to locate the interesting time range, and have not yet reduced the number of interesting documents, avoid unfolding all panels of your favorite complex dashboard. If you wait until you have incrementally refined your time selection and filtering before opening the additional dashboard panels, you will get faster response times while drilling down, and when you finally unfold the more detailed indicator panels, Kibana will only make Elasticsearch load useful data into memory.

Avoid running regexp/wildcard search on a wide document selection

Running a "regexp/wildcard" filter on a text field (especially the whole raw message) is heavily CPU consuming, and requires LOTS of memory to load all logs in order to search them. To avoid getting bad response time, and loading the whole platform for your own use (emptying caches for other people data), the proper way is to :

  • First, design appropriate boolean filters for the interesting logs, building them on a small time selection (e.g. on 30 minutes of my logs: tenant:mytenant AND type:arkoon AND action:drop AND obs.ip:192.168.0.254).
  • Then, when you have reduced the number of logs to search to the bare minimum, design your actual regexp and check that you are happy with the result (e.g. on 30 minutes of my logs: tenant:mytenant AND type:arkoon AND action:drop AND obs.ip:192.168.0.254 AND target.host.name:lk).
  • Then, apply your request in the "Discover" tab of Kibana (NOT YOUR DASHBOARD) to the full time scope that interests you. This builds the filtered bitsets in Elasticsearch memory without yet loading the full field data, and lets you check how many logs are found over your whole time scope.
  • Lastly, once you are comfortable with the number of documents that will be retrieved, you can either request an export of the logs (using the export plugin) or repeat the request in your favorite dashboard.
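
A minimal Python sketch of this staged approach, reusing the field names from the examples above (host, index pattern and the wildcard pattern are illustrative placeholders):

    import requests

    ES = "http://localhost:9200"
    INDICES = "mytenant-arkoon-*"       # placeholder index pattern

    # Step 1: cheap boolean filters only, on a narrow 30-minute window.
    base_filters = [
        {"term": {"tenant": "mytenant"}},
        {"term": {"type": "arkoon"}},
        {"term": {"action": "drop"}},
        {"term": {"obs.ip": "192.168.0.254"}},
        {"range": {"@timestamp": {"gte": "now-30m"}}},
    ]
    count = requests.post(
        f"{ES}/{INDICES}/_count",
        json={"query": {"bool": {"filter": base_filters}}},
    ).json()["count"]
    print("filtered documents in the last 30 minutes:", count)

    # Step 2: only once the filtered set is small enough, add the expensive
    # wildcard clause (the pattern below is illustrative).
    wildcard_query = {
        "query": {
            "bool": {
                "filter": base_filters,
                "must": [{"wildcard": {"target.host.name": "lk*"}}],
            }
        }
    }
    resp = requests.post(f"{ES}/{INDICES}/_search", json=wildcard_query)
    print(resp.json()["hits"]["total"])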