Concurrent fetch of azure metricdefinitions and batchApi usage by MichaelKatsoulis · Pull Request #41790 · elastic/beats

MichaelKatsoulis · 2024-11-26T12:08:18Z

The changes affect the Azure monitor metricset and the related metricsets. The affected metricsets are:

monitor
container_registry
container_instance
container_service
compute_vm
compute_vm_scaleset
database_account
storage_account

A new boolean configuration parameter, enable_batch_api, is introduced.
If set to false (the default), nothing changes in the way metrics are collected for these metricsets.

If set to true:

The metric definitions of resources are collected asynchronously, and the results are written to a channel.
The channel is read until the number of collected definitions reaches 50 (the batch API limit).
The metric definitions are then grouped based on the grouping criteria, and the Azure batch API is used to retrieve
metrics for multiple resources with one API call.
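
As a rough illustration of the flow described above, here is a minimal Go sketch that fetches definitions concurrently, writes them to a channel, and flushes them in groups of 50. The names `definition`, `fetchDefinition`, and `collectInBatches` are hypothetical and are not taken from the PR code; the real implementation may bound concurrency and handle errors differently.

```go
package main

import (
	"fmt"
	"sync"
)

// batchSize mirrors the limit of 50 mentioned in the description.
const batchSize = 50

// definition is a hypothetical stand-in for a collected metric definition.
type definition struct {
	ResourceID string
}

// fetchDefinition is a hypothetical placeholder for the per-resource API call.
func fetchDefinition(resourceID string) definition {
	return definition{ResourceID: resourceID}
}

// collectInBatches fetches definitions concurrently, writes them to a channel,
// and hands them to flush in groups of at most batchSize.
func collectInBatches(resourceIDs []string, flush func([]definition)) {
	out := make(chan definition)
	var wg sync.WaitGroup
	for _, id := range resourceIDs {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			out <- fetchDefinition(id)
		}(id)
	}
	// Close the channel once every producer has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	// A single consumer reads the channel, so batch needs no locking.
	batch := make([]definition, 0, batchSize)
	for def := range out {
		batch = append(batch, def)
		if len(batch) == batchSize {
			flush(batch)
			batch = make([]definition, 0, batchSize)
		}
	}
	if len(batch) > 0 {
		flush(batch) // flush the final partial batch
	}
}

func main() {
	ids := make([]string, 120)
	for i := range ids {
		ids[i] = fmt.Sprintf("resource-%d", i)
	}
	collectInBatches(ids, func(b []definition) {
		fmt.Println("flushing batch of", len(b))
	})
}
```

With 120 resource IDs, the sketch flushes two full batches of 50 and a final batch of 20.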

The grouping criteria are:

Namespace
SubscriptionID
Location
Names
TimeGrain
Dimensions
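
One way to picture the grouping is a comparable struct used as a map key, so that resources sharing all six criteria land in the same batch. The sketch below is an assumption-laden illustration: `metricDefinition`, `groupKey`, `joinSorted`, and `groupResourceIDs` are hypothetical names, and treating metric names and dimensions as order-independent is a choice made for this example, not something the PR description states.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// metricDefinition is a hypothetical, simplified view of one resource's metric definition.
type metricDefinition struct {
	ResourceID     string
	Namespace      string
	SubscriptionID string
	Location       string
	Names          []string
	TimeGrain      string
	Dimensions     []string
}

// groupKey holds the six grouping criteria in comparable form so it can be a map key.
type groupKey struct {
	Namespace, SubscriptionID, Location, Names, TimeGrain, Dimensions string
}

// joinSorted turns a slice-valued criterion into an order-independent string.
func joinSorted(values []string) string {
	sorted := append([]string(nil), values...)
	sort.Strings(sorted)
	return strings.Join(sorted, ",")
}

// groupResourceIDs buckets resource IDs by the grouping criteria.
func groupResourceIDs(defs []metricDefinition) map[groupKey][]string {
	groups := make(map[groupKey][]string)
	for _, d := range defs {
		key := groupKey{
			Namespace:      d.Namespace,
			SubscriptionID: d.SubscriptionID,
			Location:       d.Location,
			Names:          joinSorted(d.Names),
			TimeGrain:      d.TimeGrain,
			Dimensions:     joinSorted(d.Dimensions),
		}
		groups[key] = append(groups[key], d.ResourceID)
	}
	return groups
}

func main() {
	defs := []metricDefinition{
		{ResourceID: "vault-1", Namespace: "Microsoft.KeyVault/vaults", SubscriptionID: "sub-1",
			Location: "westeurope", Names: []string{"Availability", "ServiceApiHit"}, TimeGrain: "PT1M"},
		{ResourceID: "vault-2", Namespace: "Microsoft.KeyVault/vaults", SubscriptionID: "sub-1",
			Location: "westeurope", Names: []string{"ServiceApiHit", "Availability"}, TimeGrain: "PT1M"},
		{ResourceID: "vault-3", Namespace: "Microsoft.KeyVault/vaults", SubscriptionID: "sub-1",
			Location: "eastus", Names: []string{"Availability", "ServiceApiHit"}, TimeGrain: "PT1M"},
	}
	for key, ids := range groupResourceIDs(defs) {
		fmt.Println(key.Location, ids)
	}
}
```

Here `vault-1` and `vault-2` share every criterion and would be served by one batch call, while `vault-3` differs in location and forms its own group.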

Proposed commit message

WHAT: Introduce the enable_batch_api parameter for concurrent fetching of Azure metric definitions and collection of metric values using the batch API
WHY: Helps mitigate scalability problems

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Author's Checklist

[ ]

How to test this PR locally

Related issues

Relates Improve Azure Monitor scalability and performance #38624

Use cases

Screenshots

Logs

mergify · 2024-11-26T12:09:02Z

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b concurrent-fetch-of-azure-metricdefinitions upstream/concurrent-fetch-of-azure-metricdefinitions
git merge upstream/main
git push upstream concurrent-fetch-of-azure-metricdefinitions

mergify · 2024-11-26T12:09:02Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @MichaelKatsoulis? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify · 2024-11-26T12:09:03Z

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

zmoog · 2025-01-10T11:54:50Z

Microsoft.DocumentDb/databaseAccounts (1 resource)

resource type: Microsoft.DocumentDb/databaseAccounts
resource count: 1 resource
versions tested:

8.17.1 (branch 8.17)
9.0.0 (branch MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions)

Activity:

I created one "Azure Cosmos DB for NoSQL", with Provisioned throughput (default settings)
I set up the standard Metricbeat database account module

# x-pack/metricbeat/modules.d/azure.yml
- module: azure
  metricsets:
  - database_account
  enabled: true
  period: 300s
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s

~~8.17.1 and 9.0.0 are creating the same metrics (cardinality and values).~~

UPDATE: I didn't build the right version, I'm re-testing 9.0.0

8.17.1

9.0.0

Data collected regularly: yes

Issues

(1) Timegrain for azure.database_account.create_account.count is empty

In version 8.17.1, the timegrain for this field is PT5M.

(2) The azure.database_account.service_availability.avg (timegrain PT1H) is missing

Version 9.0.0 always collects 7 documents with PT5M, while version 8.17.1 collects 7 documents with PT5M plus 1 document with PT1H during the first iteration and again every 60 minutes.

Is 9.0.0 missing the PT1H document on the first iteration? Waiting for the next iteration to double-check.

After 75 minutes, there is still no azure.database_account.service_availability.avg field with PT1H.

UPDATE: tested by @MichaelKatsoulis

I managed to collect the azure.database_account.service_availability.avg field with PT1H using the PR code. The problem is that the API request asks for values of both the ServiceAvailability and ReplicationLatency metrics with the Average aggregation. When values for both metrics are requested, service_availability.avg is always nil. If we remove ReplicationLatency and request values only for ServiceAvailability, service_availability.avg is returned correctly. I still do not know the reason for that.

zmoog · 2025-01-10T12:42:59Z

UPDATE: I built the wrong version, I'm re-testing 9.0.0 with Microsoft.DocumentDb/databaseAccounts (1 resource) and I'll update the previous comment.

My apologies for the noise.

zmoog · 2025-01-10T14:28:18Z

Microsoft.KeyVault/vaults (10 resources)

resource type: Microsoft.KeyVault/vaults
resource count: 10 resources
versions tested:

8.17.1 (branch 8.17)
9.0.0 (branch MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions)

Activity:

I set up a custom Metricbeat config using the Azure Monitor metricset to target the key vaults

- module: azure  
  metricsets:  
    - monitor  
  enabled: true  
  period: 60s  
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s  
  resources:  
  - resource_query: "resourceType eq 'Microsoft.KeyVault/vaults'"  
    resource_group:  
    - "mbranca-az-scalability-kv-r10"    
    metrics:  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
          - name: StatusCode  
            value: '*'  
          - name: StatusCodeClass  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - ServiceApiLatency  
          - Availability  
          - ServiceApiResult  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - ServiceApiHit  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
          - name: TransactionType  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - SaturationShoebox  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M

Notes:

When the key vaults are unused (as in this resource group), they generate only a subset of metrics:

Availability
API Hits
API Results.

8.17.1

In progress.

I can see the three metrics (Availability, API Hits, API Results), grouped in two documents. So 2 documents x 10 resources = 20 documents per iteration.

9.0.0

In progress.

First iterations are okay. I get the same number of documents (20) as 8.17.1 and same values.

Still checking, but this case looks good.

zmoog · 2025-01-10T15:01:17Z

@MichaelKatsoulis, I found a couple of issues related to timegrain in the Microsoft.DocumentDb/databaseAccounts (1 resource) test.

zmoog · 2025-01-10T17:49:42Z

Microsoft.ContainerRegistry/registries (1 resource)

resource type: Microsoft.ContainerRegistry/registries
resource count: 1 resource
versions tested:

8.17.1 (branch 8.17)
9.0.0 (branch MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions)

Activity:

I set up a Metricbeat config using the container_registry metricset to target the container registry

- module: azure
  metricsets:
  - container_registry
  enabled: true
  period: 300s
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s

Since we had an issue with PT1H metrics, I tried another metricset that uses this timegrain.

8.17.1

After one iteration, 8.17.1 collected:

1 document with PT5M every 5 minutes
1 document with PT1H every 60 minutes

9.0.0

After one iteration, 9.0.0 collected:

1 document with PT5M every 5 minutes
1 document with PT1H every 60 minutes

Conclusion

✅ With the recent code changes 8.17.1 and 9.0.0 yield the same outcome.

Metrics docs

https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/microsoft-containerregistry-registries-metrics

zmoog

The performance gains from the new sdk/monitor/query/azmetrics package with batch API are extremely compelling.

I would love to simplify the internal structure, but I am also okay with going with the PR as-is, collecting customer feedback, and switching to the batch API in the next release.

I added a few non-blocking comments for things we may want to address before merging.

zmoog · 2025-04-14T08:23:35Z

x-pack/metricbeat/module/azure/_meta/docs.asciidoc

+_boolean_
+Optional, by default is set to False. Set this to True when facing scalability issues. When configured, the azure batch api will be used
+to fetch metrics of multiple resources in one api call. 
+Currently supported metricsets are monitor, container_registry, container_instance, container_service, compute_vm, compute_vm_scaleset, database_account and storage.


Can we also add storage to the list, or remove it because the metricset supports all the metricsets, right?

Isn't storage in this list?

x-pack/metricbeat/module/azure/client_utils.go

…d batchApi usage (#43923)

* Concurrent fetch of azure metricdefinitions and batchApi usage (#41790)
* Use concurrency in metricsdefinition collection
* Change ResourceConfigurations.Metrics to a map
* Use batch API
* New queryResourceClient per location
* Wait for 50 reource ids before fetching the metrics
* Set timegrain if is equal to ''"
* Use batch API as feature
* Use baseclient to tackle code duplication
* Add unit tests for concurrent fetching of metric definitions
* Add batch client unit tests
* Add support of batch API for storage accounts
* Update docs and add unit tests form storage client
* Split metric names by 20

(cherry picked from commit 13f8fde)

# Conflicts:
# go.mod
# go.sum

* Resolve conflicts

---------

Co-authored-by: Michalis Katsoulis <michaelkatsoulis88@gmail.com>

…nd batchApi usage (#44243)

* Concurrent fetch of azure metricdefinitions and batchApi usage (#41790)
* Use concurrency in metricsdefinition collection
* Change ResourceConfigurations.Metrics to a map
* Use batch API
* New queryResourceClient per location
* Wait for 50 reource ids before fetching the metrics
* Set timegrain if is equal to ''"
* Use batch API as feature
* Use baseclient to tackle code duplication
* Add unit tests for concurrent fetching of metric definitions
* Add batch client unit tests
* Add support of batch API for storage accounts
* Update docs and add unit tests form storage client
* Split metric names by 20

(cherry picked from commit 13f8fde)

# Conflicts:
# go.mod
# go.sum
# metricbeat/docs/modules/azure.asciidoc

* Resolve conflicts

---------

Co-authored-by: Michalis Katsoulis <michaelkatsoulis88@gmail.com>
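
The commit titled "Split metric names by 20" suggests that metric names are sent in groups of at most 20 per request. A minimal sketch of such chunking might look like the following; `chunkNames` is a hypothetical helper, and both the reason for the limit and the way the PR actually splits names are assumptions here.

```go
package main

import "fmt"

// chunkNames splits names into consecutive slices of at most size elements.
// The returned chunks share the backing array of names.
func chunkNames(names []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var chunks [][]string
	for start := 0; start < len(names); start += size {
		end := start + size
		if end > len(names) {
			end = len(names)
		}
		chunks = append(chunks, names[start:end])
	}
	return chunks
}

func main() {
	names := make([]string, 45)
	for i := range names {
		names[i] = fmt.Sprintf("Metric%d", i)
	}
	for i, c := range chunkNames(names, 20) {
		fmt.Printf("request %d carries %d metric names\n", i+1, len(c))
	}
}
```

For 45 names and a chunk size of 20, this yields three requests carrying 20, 20, and 5 names.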

…nd batchApi usage (#44241)

The changes affect azure monitor and relevant metricsets. The list of metricsets affected are:

- `monitor`
- `container_registry`
- `container_instance`
- `container_service`
- `compute_vm`
- `compute_vm_scaleset`
- `database_account`
- `storage_account`

A new configuration parameter is introduced `enable_batch_api` of type boolean.
If set to `false`(default) nothing changes in the way the metrics are collected for these metricsets.

If set to `true`:

- The metric definitions of resources are collected asynchronously and write the results in a channel.
- The channel is read and when the number of definitions collected reach 50 (batch API limit)
- The metrics definitions are grouped based on criteria(1) and the azure BatchAPI is used to retrieve metrics of multiple resources with one api call.

1. Grouping criteria are
   - Namespace
   - SubscriptionID
   - Location
   - Names
   - TimeGrain
   - Dimensions

…nd batchApi usage (#44242)

The changes affect azure monitor and relevant metricsets. The list of metricsets affected are:

- `monitor`
- `container_registry`
- `container_instance`
- `container_service`
- `compute_vm`
- `compute_vm_scaleset`
- `database_account`
- `storage_account`

A new configuration parameter is introduced `enable_batch_api` of type boolean.
If set to `false`(default) nothing changes in the way the metrics are collected for these metricsets.

If set to `true`:

- The metric definitions of resources are collected asynchronously and write the results in a channel.
- The channel is read and when the number of definitions collected reach 50 (batch API limit)
- The metrics definitions are grouped based on criteria(1) and the azure BatchAPI is used to retrieve metrics of multiple resources with one api call.

1. Grouping criteria are
   - Namespace
   - SubscriptionID
   - Location
   - Names
   - TimeGrain
   - Dimensions

shabbywu · 2025-07-08T02:28:34Z

x-pack/metricbeat/module/azure/azure.go

+		return fmt.Errorf("no resources were found based on all the configurations options entered")
+	}
+
+	metricStores := make(map[ResDefGroupingCriteria]*MetricStore)


metricStores is only accessed from a single goroutine, so no mutex is needed?

shabbywu · 2025-07-08T02:36:20Z

x-pack/metricbeat/module/azure/client_utils.go

+	groupedMetrics := map[ResDefGroupingCriteria][]Metric{
+		criteria: store.GetMetrics(),
+	}
+	metricValues := client.GetMetricsInBatch(groupedMetrics, referenceTime, report)


On another point: if MetricStore did need a mutex, there would be a bug here when another goroutine calls AddMetric on the store.

The bug is that the slice copy returned by GetMetrics does not share a later resize of the store's slice, see playground.

But for now, MetricStore is only used in a single goroutine, so it works well, right?
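
The slice behavior mentioned in this comment can be reproduced in isolation. In the sketch below, `demoStore`, `Add`, and `Get` are hypothetical stand-ins (the real `MetricStore` is not shown in this conversation): a slice value returned by a getter is a copy of the slice header, so it does not observe elements appended to the store afterwards.

```go
package main

import "fmt"

// demoStore is a hypothetical stand-in for a store that keeps metrics in a slice.
type demoStore struct {
	metrics []string
}

// Add appends a metric; append may reallocate the backing array.
func (s *demoStore) Add(m string) {
	s.metrics = append(s.metrics, m)
}

// Get returns the slice header by value, not a live view of the store.
func (s *demoStore) Get() []string {
	return s.metrics
}

func main() {
	s := &demoStore{}
	s.Add("a")

	snapshot := s.Get() // copies the slice header: length 1
	s.Add("b")          // grows the store's slice; snapshot is unaffected

	fmt.Println(len(snapshot), len(s.Get())) // prints "1 2"
}
```

In a single goroutine this is harmless as long as the snapshot is taken after all additions. If several goroutines were involved, both `Add` and `Get` would need synchronization, and `Get` would likely need to return an explicit copy.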

MichaelKatsoulis added 9 commits November 7, 2024 14:07

Use concurrency in metricsdefinition collection

5b9beae

Fix conflicts

a103910

Handle errors

0486d0e

Remove commented code

1b8314e

Change ResourceConfigurations.Metrics to a map

b1180db

Use batch API

245a8e3

New queryResourceClient per location

33a8e0f

Updates

121b69f

Fix error handling

a25ce30

MichaelKatsoulis requested review from a team as code owners November 26, 2024 12:08

botelastic bot added the needs_team (Indicates that the issue/PR needs a Team:* label) label Nov 26, 2024

MichaelKatsoulis marked this pull request as draft November 26, 2024 12:08

MichaelKatsoulis requested a review from zmoog November 26, 2024 12:08

mergify bot assigned MichaelKatsoulis Nov 26, 2024

mergify bot added the backport-8.x (Automated backport to the 8.x branch with mergify) label Nov 26, 2024

MichaelKatsoulis added 3 commits November 28, 2024 10:41

Wait for 50 reource ids before fetching the metrics

976d38a

Handle metric definitions update

9456997

Fix error in storage accounts

ed7c6f8

MichaelKatsoulis added 3 commits January 14, 2025 14:53

Set timegrain if is equal to ''

550a83f

remove comments

d1af82c

Set correct endtime

166f9a2

Merge remote-tracking branch 'upstream/main' into concurrent-fetch-of-azure-metricdefinitions

fe312e1

MichaelKatsoulis requested a review from zmoog April 14, 2025 07:19

MichaelKatsoulis added backport-active-9 (Automated backport with mergify to all the active 9.[0-9]+ branches) and removed backport-8.x (Automated backport to the 8.x branch with mergify) labels Apr 14, 2025

zmoog approved these changes Apr 14, 2025

MichaelKatsoulis added 2 commits April 14, 2025 15:06

Remove not needed comment from the code

ea710a3

Merge branch 'main' into concurrent-fetch-of-azure-metricdefinitions

1483a38

MichaelKatsoulis merged commit 13f8fde into elastic:main Apr 15, 2025
179 of 182 checks passed

mergify bot mentioned this pull request Apr 15, 2025

[9.0](backport #41790) Concurrent fetch of azure metricdefinitions and batchApi usage #43923

Merged

6 tasks

MichaelKatsoulis mentioned this pull request May 5, 2025

Add enable_batch_api option in azure resource metrics elastic/integrations#13783

Merged

5 tasks

zmoog added backport-active-all (Automated backport with mergify to all the active branches), backport-8.17 (Automated backport with mergify), backport-8.18 (Automated backport to the 8.18 branch), backport-8.19 (Automated backport to the 8.19 branch) labels May 7, 2025

shabbywu reviewed Jul 8, 2025

MichaelKatsoulis merged 36 commits into elastic:main from MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions

6 participants
