Description
Log Rate Analysis uses the following to identify significant text patterns:

- `categorize_text` is run over the full time range to identify log message category patterns.
- Each returned category pattern is then queried for document counts in the baseline and deviation time ranges.
The code reuses `CATEGORY_LIMIT = 1000`, which means that in the worst case we end up with 2 msearch requests (one for baseline, one for deviation), each containing 1000 queries looking up counts of category patterns.

`CATEGORY_LIMIT`: https://github.com/elastic/kibana/blob/main/x-pack/platform/packages/shared/ml/aiops_log_pattern_analysis/create_category_request.ts#L17
`fetchCategoryCounts`: `x-pack/platform/packages/shared/ml/aiops_log_rate_analysis/queries/fetch_category_counts.ts`
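To illustrate the worst case, here's a minimal sketch (function and type names are hypothetical, not the actual Kibana code) of how each category pattern becomes one header/body pair in a single msearch request, so 1000 patterns produce 2000 msearch lines per time range:

```typescript
// Hypothetical sketch: one msearch body covering all category patterns.
// With CATEGORY_LIMIT = 1000 this yields 1000 sub-requests (2000 lines)
// in a single msearch, once for the baseline and once for the deviation range.
interface CategoryPattern {
  key: string;
  regex: string;
}

function buildCategoryCountMsearchBody(
  index: string,
  categories: CategoryPattern[]
): object[] {
  const body: object[] = [];
  for (const category of categories) {
    // msearch alternates header lines and query lines
    body.push({ index });
    body.push({
      size: 0,
      track_total_hits: true,
      query: { regexp: { message: category.regex } },
    });
  }
  return body;
}
```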
On the outer level, when running the analysis on keyword and text fields, we apply chunking and maintain a queue to split up queries, but that's not happening for these inner msearch queries. This can overload and potentially OOM clusters.
This code is used 1) on the main AIOps Log Rate Analysis page in the ML plugin and 2) in the contextual insights on O11y Alert Details pages.
Since `CATEGORY_LIMIT` is hard-coded and not configurable, this needs to be fixed to make the feature more resilient.
Options to fix this
Quick fix: Remove text field analysis from contextual insights here: https://github.com/elastic/kibana/blob/main/x-pack/platform/packages/shared/ml/aiops_log_rate_analysis/queries/fetch_log_rate_analysis_for_alert.ts#L158 - Obviously the analysis will no longer identify significant text patterns (it will still identify significant keyword fields), but it eliminates the risk of running into large msearches completely.
Quick fix alternative: Overwrite `CATEGORY_LIMIT` with something much smaller, for example in `x-pack/platform/packages/shared/ml/aiops_log_rate_analysis/queries/fetch_categories.ts` with something like:

```ts
request.query = query;

if (request.aggs.categories?.categorize_text) {
  request.aggs.categories.categorize_text.size = 50;
}
```
This means we'll likely miss low-count/long-tail text patterns that are still significant, but the analysis will still identify high-count ones.
Longer term fix: Similar to how we chunk/queue requests on the outer level for the list of fields to analyse, we'd need to create another inner chunked queue for these requests so we don't throw all of them into a single msearch.
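A minimal sketch of what that inner chunking could look like (names and the chunk size are illustrative assumptions, not the actual implementation): split the category-count requests into fixed-size batches and issue one msearch per batch sequentially, so at most one batch is in flight at a time.

```typescript
// Hypothetical sketch of the proposed inner chunking for category-count queries.
const CHUNK_SIZE = 50; // illustrative; the real limit would need tuning

function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

async function fetchCategoryCountsChunked<Req, Res>(
  requests: Req[],
  runMsearch: (batch: Req[]) => Promise<Res[]>
): Promise<Res[]> {
  const results: Res[] = [];
  for (const batch of chunk(requests, CHUNK_SIZE)) {
    // awaiting each batch keeps at most CHUNK_SIZE queries in one msearch
    results.push(...(await runMsearch(batch)));
  }
  return results;
}
```

Running batches sequentially trades latency for cluster safety; a bounded-concurrency queue (like the one on the outer level) would be a middle ground.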
Maybe related: On the Log Rate Analysis page in the ML plugin there's a setting to exclude frozen data, which is ON by default. That is likely not the case on the O11y alert details page, where the feature is used as part of contextual insights.