[ML] AIOps: Log Rate Analysis: Limit msearch items #235562

@walterra

Description

Log Rate Analysis uses the following to identify significant text patterns:

  • a categorize_text aggregation is run over the full time range to identify log message category patterns
  • each returned category pattern is then queried for document counts in the baseline and deviation time ranges
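The first step can be sketched as a request shaped roughly like the following (a hedged illustration only; the real request in fetch_categories.ts is built dynamically, and the field name here is an assumption):

```typescript
// Sketch of the categorization request shape (assumed; the actual code in
// fetch_categories.ts constructs this per analyzed field).
const CATEGORY_LIMIT = 1000;

const categorizeRequest = {
  aggs: {
    categories: {
      categorize_text: {
        field: 'message', // assumed field name for illustration
        size: CATEGORY_LIMIT, // up to 1000 category patterns are returned
      },
    },
  },
};
```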

The code reuses CATEGORY_LIMIT = 1000, which means that in the worst case we end up with 2 msearch requests for baseline and deviation, each containing up to 1000 queries looking for counts of the categorize patterns.
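To make the worst case concrete (simple arithmetic, not code from the repo):

```typescript
// Worst case per analyzed text field: two msearch requests (baseline and
// deviation), each holding one count query per returned category pattern.
const CATEGORY_LIMIT = 1000;
const timeRanges = 2; // baseline + deviation
const worstCaseCountQueries = timeRanges * CATEGORY_LIMIT; // 2000 count queries
```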

At the outer level, when running the analysis on keyword and text fields, we apply chunking and maintain a queue to split up queries. That does not happen for these inner msearch queries, which can overload and potentially OOM clusters.

This code is used (1) on the main AIOps Log Rate Analysis page in the ML plugin and (2) in the contextual insights on the O11y Alert Details pages.

Since CATEGORY_LIMIT is hard-coded and not configurable, this needs to be fixed to make the feature more resilient.

Options to fix this

Quick fix: Remove text field analysis from contextual insights here: https://github.com/elastic/kibana/blob/main/x-pack/platform/packages/shared/ml/aiops_log_rate_analysis/queries/fetch_log_rate_analysis_for_alert.ts#L158. The analysis will then no longer identify significant text patterns (it will still identify significant keyword fields), but it completely eliminates the risk of running into large msearches.

Quick fix alternative: Overwrite CATEGORY_LIMIT with something much smaller, for example in x-pack/platform/packages/shared/ml/aiops_log_rate_analysis/queries/fetch_categories.ts with something like:

```ts
request.query = query;
if (request.aggs.categories?.categorize_text) {
  // Cap the number of returned categories to keep the follow-up
  // baseline/deviation count msearches small.
  request.aggs.categories.categorize_text.size = 50;
}
```

This means we'll likely miss low-count, long-tail text patterns that are still significant, but high-count patterns will still be identified.

Longer-term fix: Similar to how we chunk/queue requests at the outer level for the list of fields to analyse, we'd need to create another inner chunked queue for these requests so we don't throw all of them into a single msearch.
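A minimal sketch of such inner chunking (a hypothetical helper, not the existing Kibana queue implementation; the chunk size and the commented client call shape are assumptions):

```typescript
// Split the per-category count queries into bounded msearch batches.
// CHUNK_SIZE is an assumed value; the real fix should pick something informed
// by cluster limits and ideally reuse the existing outer queue utilities.
const CHUNK_SIZE = 50;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Usage sketch: instead of one msearch with up to 1000 count queries,
// issue sequential msearches of at most CHUNK_SIZE queries each.
// for (const batch of chunk(countQueries, CHUNK_SIZE)) {
//   await esClient.msearch({ searches: batch }); // hypothetical call shape
// }
```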

Maybe related: On the Log Rate Analysis page in the ML plugin there's a setting to exclude frozen data, which is ON by default (frozen data excluded). That is likely not the case on the O11y Alert Details page, where the feature is used as part of contextual insights.

Metadata

Labels

:ml · bug (Fixes for quality problems that affect the customer experience) · v9.3.0
