Use Azure blob batch API to delete blobs in batches #114566

nicktindall merged 46 commits into elastic:main
Conversation
2d9b166 to 8e08e60
api "com.azure:azure-identity:1.13.2"
api "com.azure:azure-json:1.2.0"
api "com.azure:azure-storage-blob:12.27.1"
api "com.azure:azure-storage-blob-batch:12.23.1"
This is the version consistent with the others from the BOM
requires reactor.core;
requires reactor.netty.core;
requires reactor.netty.http;
requires com.azure.storage.blob.batch;
IntelliJ seemed to optimize the requires directives; the ones removed above are all transitively required by com.azure.storage.blob.batch.
PUT_BLOCK("PutBlock"),
PUT_BLOCK_LIST("PutBlockList");
PUT_BLOCK_LIST("PutBlockList"),
BLOB_BATCH("BlobBatch");
We can't be specific about the type of operation we're performing in a batch without inspecting the request body. I think it's better to track BlobBatch than potentially erroneously track BatchDelete (if one day we start using batch to "set access tier")
Pinging @elastic/es-distributed (Team:Distributed)
ywangd left a comment:
Sorry for the delay here. I took a second, closer look at the changes. I think we might want to consider adding controls for resource usage (heap and concurrent requests).
Btw, the PR should now be labelled as >enhancement due to the new setting.
// locationMode is set per repository, not per client
this.locationMode = Repository.LOCATION_MODE_SETTING.get(metadata.settings());
this.maxSinglePartUploadSize = Repository.MAX_SINGLE_PART_UPLOAD_SIZE_SETTING.get(metadata.settings());
this.maxDeletesPerBatch = Repository.DELETION_BATCH_SIZE_SETTING.get(metadata.settings());
Nit: Can we rename this field to deletionBatchSize, which is consistent with the setting name and avoids clashing with the static MAX_ELEMENTS_PER_BATCH?
public static <E extends Exception> void doPrivilegedVoidExceptionExplicit(Class<E> exception, StorageRunnable action) throws E {
    doPrivilegedVoidException(action);
}
Is this necessary? The existing code works OK without the explicit throws? If we want to change this, I'd prefer to update the existing method so that it explicitly throws IOException in its catch block when the cause is an IOException. Since that requires some cascading changes, I think a separate PR would be better.
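For illustration, the alternative suggested here (having the existing wrapper rethrow an IOException cause directly, instead of adding a generic explicit-throws variant) can be sketched in plain Java. This is a minimal sketch: the real SocketAccess/AccessController plumbing is elided, and all names are illustrative rather than the actual Elasticsearch API.

```java
import java.io.IOException;

// Illustrative sketch only: the real SocketAccess wrapper also routes the
// action through AccessController.doPrivileged, which is elided here.
class PrivilegedRunner {

    interface StorageRunnable {
        void run() throws Exception;
    }

    // Runs the action; if the failure is (or wraps) an IOException, rethrow it
    // as the checked type so callers can handle it directly.
    static void doPrivilegedVoidException(StorageRunnable action) throws IOException {
        try {
            action.run();
        } catch (Exception e) {
            if (e instanceof IOException ioe) {
                throw ioe; // surface the checked IOException to the caller
            }
            if (e.getCause() instanceof IOException ioe) {
                throw ioe; // unwrap a wrapped IOException cause
            }
            throw new RuntimeException(e);
        }
    }
}
```

The cascading changes mentioned above come from every caller now having to handle (or declare) the checked IOException.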
for (BlobItem blobItem : blobContainerClient.listBlobs(options, null)) {
    if (blobItem.isPrefix()) {
        continue;
    }
    blobNames.add(blobItem.getName());
    bytesDeleted.addAndGet(blobItem.getProperties().getContentLength());
    blobsDeleted.incrementAndGet();
}
if (blobNames.isEmpty() == false) {
    deleteListOfBlobs(client, blobNames.iterator());
I wonder whether there is an issue in materializing all blobItems from the listing before invoking delete. If there are a large number of items, it could be rather inefficient. IIUC, listBlobs returns an Iterable that lazily loads. I think this change means we no longer leverage that?
Good call, I've changed this now to use Flux all the way through. I think that should pipeline all this stuff better.
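The lazy-consumption point above can be sketched with plain JDK types, with no Azure or Reactor dependencies. This is an illustrative shape only (the class and method names are made up): draining a possibly lazy iterator of blob names in fixed-size batches keeps at most one batch's worth of names in memory, instead of materializing the full listing first.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch: consume a (possibly lazy) iterator of blob names in
// fixed-size batches, so the whole listing is never held in memory at once.
class BatchedDeleter {

    // Returns the number of batches submitted.
    static int drainInBatches(Iterator<String> blobNames, int maxPerBatch, Consumer<List<String>> submitBatch) {
        int batches = 0;
        while (blobNames.hasNext()) {
            List<String> batch = new ArrayList<>(maxPerBatch);
            while (batch.size() < maxPerBatch && blobNames.hasNext()) {
                batch.add(blobNames.next());
            }
            submitBatch.accept(batch); // e.g. build and submit one blob batch request
            batches++;
        }
        return batches;
    }
}
```

With a Flux-based pipeline, the same effect comes from the subscriber pulling items on demand rather than the producer eagerly collecting them.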
final List<Mono<Void>> batchResponses = new ArrayList<>();
while (blobNames.hasNext()) {
    final BlobBatch currentBatch = batchAsyncClient.getBlobBatch();
    int counter = 0;
    while (counter < maxDeletesPerBatch && blobNames.hasNext()) {
        currentBatch.deleteBlob(container, blobNames.next());
        counter++;
    }
    batchResponses.add(batchAsyncClient.submitBatch(currentBatch));
This is more likely a theoretical concern. Technically the number of concurrent requests here is also unbounded, while previously it was hard-coded to 100.
I've limited these. There is an underlying limit imposed at the node level (max open connections for a pool shared between clients, which defaults to 50; Azure is HTTP/1.1, so that's effectively a global maximum on concurrent requests for a node), and the blocks that dispatch the requests are also bounded by a thread pool limit in the reactor runtime (which seems to default to 5 threads).
So I think even with a limit of 100 the actual number of concurrent requests would be much lower. In any case, I've added an explicit limit which is configurable and defaults to 10. Because these are bulk requests, that means by default a maximum of 2560 concurrent individual deletes.
Thanks for explaining. Makes sense from the networking perspective. I should have been clearer: by unbounded number of requests, I mostly mean the number of request objects that are constructed during this process. I guess they are eagerly instantiated even when the underlying network stack is not ready to consume them? If so, they consume memory and in extreme cases may even lead to an OOM.
Ah yep, I understand. I think the limit I added should restrict that, as it limits the number of concurrent subscribers. As I understand it, nothing happens until a subscriber asks for the next value(s), so there should be at most 10 batch requests being processed at any one time.
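The effect of a concurrency cap like the one discussed here (in the real code, Reactor's flatMap with a maxConcurrency argument) can be demonstrated with JDK-only primitives. This is an illustrative sketch, not the actual implementation: a Semaphore bounds how many simulated batch requests are in flight at once, just as the subscriber limit bounds in-flight batches.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative, JDK-only sketch of bounding in-flight work: at most `limit`
// tasks run concurrently; the rest wait for a permit.
class BoundedSubmitter {

    // Runs `tasks` simulated batch requests and reports the peak concurrency.
    static int maxObservedConcurrency(int tasks, int limit) throws InterruptedException {
        Semaphore permits = new Semaphore(limit);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                permits.acquireUninterruptibly(); // wait for a concurrency slot
                try {
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(10); // simulate one batch request round-trip
                    inFlight.decrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    permits.release();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return peak.get();
    }
}
```

With the defaults discussed above (10 concurrent batches, up to 256 deletes per batch), the worst case is 2560 individual deletes in flight.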
final CountDownLatch allRequestsFinished = new CountDownLatch(deleteTasks.size());
final List<Throwable> errors = new CopyOnWriteArrayList<>();
Similarly, I think we should limit the number of errors. Also, seeing the CountDownLatch makes me wonder whether it is possible to leverage the Flux approach, similar to how the existing deletion code relies on Flux.then().block(). Maybe something like Flux#fromIterable, so that it takes a custom Iterable implementation that internally constructs deletion requests which in turn consume the listing response. I feel it could somewhat address my previous comments about limiting resource usage. It's just a rough idea; there may be issues that I just haven't noticed.
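The bounded-error idea can be sketched as follows. This is a minimal, illustrative sketch (the class name and message wording are made up, not the PR's actual code): keep only the first `maxKept` errors and count the rest, so an unbounded stream of failures cannot exhaust the heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: retain at most `maxKept` error objects, counting any
// further failures instead of storing them.
class BoundedErrors {
    private final int maxKept;
    private final List<Throwable> kept = new ArrayList<>();
    private final AtomicInteger total = new AtomicInteger();

    BoundedErrors(int maxKept) {
        this.maxKept = maxKept;
    }

    synchronized void record(Throwable t) {
        if (kept.size() < maxKept) {
            kept.add(t); // keep the first few for diagnostics
        }
        total.incrementAndGet(); // but always count every failure
    }

    synchronized String summary() {
        int dropped = total.get() - kept.size();
        return dropped > 0
            ? kept.size() + " errors kept, " + dropped + " further errors dropped"
            : total.get() + " error(s)";
    }
}
```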
Hi @nicktindall, I've created a changelog YAML for you.
ywangd left a comment:
LGTM
Thanks for the iterations!
    logger.info("Using SAS token authentication");
    secureSettings.setString("azure.client.default.sas_token", System.getProperty("test.azure.sas_token"));
} else {
    logger.info("Using key authentication");
Nit: can we add --> at the beginning of these logging messages? It's an informal convention to make test logging messages easier to search.
// We need to use a container-scoped BlobBatchClient, so the restype=container parameter
// is sent, and we can support all SAS token types
// See https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch?tabs=shared-access-signatures#authorization
final BlobBatchAsyncClient batchAsyncClient = new BlobBatchClientBuilder(
    azureBlobServiceClient.getAsyncClient().getBlobContainerAsyncClient(container)
).buildAsyncClient();
I assume this still works with non-container-scoped tokens and the other Azure credential types that we support?
I've just kicked off the tests with the latest changes https://buildkite.com/elastic/elasticsearch-periodic/builds/4557
EDIT: third party tests all still pass :)
    });
}, maxConcurrentBatchDeletes).collectList().block();
if (errors.isEmpty() == false) {
    final IOException ex = new IOException("Error deleting batches");
Nit: I think we can include a brief message about exactly how many errors were encountered when errorsCollected is greater than 10, so that it is clear that some errors were skipped.
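One way to act on this nit can be sketched as follows. This is an illustrative sketch, not the PR's actual code (the helper name, the 10-error cap implied above, and the message wording are assumptions): attach the retained errors as suppressed exceptions and note in the message how many were encountered in total.

```java
import java.io.IOException;
import java.util.List;

// Illustrative sketch: summarise collected batch-delete failures in a single
// IOException, noting when some errors were not retained.
class BatchDeleteErrors {

    static IOException summarise(List<Throwable> collected, int totalErrors) {
        String message = "Error deleting batches";
        if (totalErrors > collected.size()) {
            message += " (" + totalErrors + " errors in total, only the first "
                + collected.size() + " are included)";
        }
        IOException ex = new IOException(message);
        collected.forEach(ex::addSuppressed); // keep the retained causes visible
        return ex;
    }
}
```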
} catch (RuntimeException e) {
    throw new IOException("Error deleting batches", e);
Mostly for my own education: why are we specifically catching RuntimeException here? Is there a concrete concern about anything thrown here, or is it to match the existing code? The existing code seems to catch the broader Exception instead.
It was really because there are no checked exceptions thrown in the try/catch block. Perhaps it's safer just to catch Exception in case there's a SocketAccess.doPrivilegedVoidException-type scenario going on in there somewhere.
/**
 * The maximum number of concurrent batch deletes
 */
static final Setting<Integer> MAX_CONCURRENT_BATCH_DELETES_SETTING = Setting.intSetting("max_concurrent_batch_deletes", 10, 1);
I suggest we give it a sensible max value, e.g. 100.
Co-authored-by: Yang Wang <ywangd@gmail.com>
This PR implements blob deletion as one or more blob batch requests, rather than deleting each blob individually. The reason this wasn't implemented originally was due to concerns around the blob batch API's SAS token auth support. The difference in the approach in this PR is down to the use of a container-scoped client, which sends an additional request parameter (restype=container). Using the API in this way means that SAS tokens are supported.

I ran this branch through the elasticsearch / periodic pipeline (results here) and everything passed. If I'm reading it correctly, that includes running the AzureStorageCleanupThirdPartyTests using a SAS token, and that test includes code paths that use the new bulk delete.

Closes ES-9777