[ML] AIOps: Log Rate Analysis: Limit msearch items #235562

@walterra

Description

Log Rate Analysis uses the following to identify significant text patterns:

  • a categorize_text aggregation is run over the full time range to identify log message category patterns
  • each returned category pattern is then queried for document counts in the baseline and deviation time ranges
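The first step can be sketched as a request shaped roughly like the following (a hedged illustration only; the real request in fetch_categories.ts is built dynamically, and the field name here is an assumption):

```typescript
// Sketch of the categorization request shape (assumed; the actual code in
// fetch_categories.ts constructs this per analyzed field).
const CATEGORY_LIMIT = 1000;

const categorizeRequest = {
  aggs: {
    categories: {
      categorize_text: {
        field: 'message', // assumed field name for illustration
        size: CATEGORY_LIMIT, // up to 1000 category patterns are returned
      },
    },
  },
};
```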

The code reuses CATEGORY_LIMIT = 1000, which means that in the worst case we end up with 2 msearch requests for baseline and deviation, each containing up to 1000 queries looking for counts of the categorize patterns.
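To make the worst case concrete (simple arithmetic, not code from the repo):

```typescript
// Worst case per analyzed text field: two msearch requests (baseline and
// deviation), each holding one count query per returned category pattern.
const CATEGORY_LIMIT = 1000;
const timeRanges = 2; // baseline + deviation
const worstCaseCountQueries = timeRanges * CATEGORY_LIMIT; // 2000 count queries
```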

At the outer level, when running the analysis on keyword and text fields, we apply chunking and maintain a queue to split up queries. That does not happen for these inner msearch queries, which can overload and potentially OOM clusters.

This code is used (1) on the main AIOps Log Rate Analysis page in the ML plugin and (2) in the contextual insights on the O11y Alert Details pages.

Since CATEGORY_LIMIT is hard-coded and not configurable, this needs to be fixed to make the feature more resilient.

Options to fix this

Quick fix: Remove text field analysis from contextual insights here: https://github.com/elastic/kibana/blob/main/x-pack/platform/packages/shared/ml/aiops_log_rate_analysis/queries/fetch_log_rate_analysis_for_alert.ts#L158. The analysis will then no longer identify significant text patterns (it will still identify significant keyword fields), but it completely eliminates the risk of running into large msearches.

Quick fix alternative: Overwrite CATEGORY_LIMIT with something much smaller, for example in x-pack/platform/packages/shared/ml/aiops_log_rate_analysis/queries/fetch_categories.ts with something like:

```ts
request.query = query;
if (request.aggs.categories?.categorize_text) {
  // Cap the number of returned categories to keep the follow-up
  // baseline/deviation count msearches small.
  request.aggs.categories.categorize_text.size = 50;
}
```

This means we'll likely miss low-count, long-tail text patterns that are still significant, but high-count patterns will still be identified.

Longer-term fix: Similar to how we chunk/queue requests at the outer level for the list of fields to analyse, we'd need to create another inner chunked queue for these requests so we don't throw all of them into a single msearch.
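A minimal sketch of such inner chunking (a hypothetical helper, not the existing Kibana queue implementation; the chunk size and the commented client call shape are assumptions):

```typescript
// Split the per-category count queries into bounded msearch batches.
// CHUNK_SIZE is an assumed value; the real fix should pick something informed
// by cluster limits and ideally reuse the existing outer queue utilities.
const CHUNK_SIZE = 50;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Usage sketch: instead of one msearch with up to 1000 count queries,
// issue sequential msearches of at most CHUNK_SIZE queries each.
// for (const batch of chunk(countQueries, CHUNK_SIZE)) {
//   await esClient.msearch({ searches: batch }); // hypothetical call shape
// }
```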

Maybe related: On the Log Rate Analysis page in the ML plugin there's a setting to exclude frozen data, which is ON by default (frozen data excluded). That is likely not the case on the O11y Alert Details page, where the feature is used as part of contextual insights.

Metadata

Labels

:ml · bug (Fixes for quality problems that affect the customer experience) · v9.3.0
