AWS Debuts a Distributed SQL Database, Amazon S3 Tables for Iceberg

Staying abreast of the latest trends in data management, Amazon Web Services has introduced support for Apache Iceberg tables in its S3 object storage service, and debuted a distributed SQL database that offers virtually unlimited scalability with transactional consistency and low latency.
Matt Garman, the new AWS CEO, introduced the technology at the company’s annual AWS re:Invent conference, held this week in Las Vegas.
A New Bucket Type
For organizations building out multisource open source data lakehouses for analytics, the company has introduced a managed service for Apache Iceberg tables, called Amazon S3 Tables.
The company claims the data store service offers three times faster query performance and up to 10 times more transactions per second for analytics workloads, compared to storing the same data in a general-purpose S3 bucket.
AWS claims that Amazon S3 Tables, now generally available, is “the first cloud object store with fully-managed support for Apache Iceberg,” though the company is following in the steps of both Snowflake and Databricks, which earlier this year expanded on their support of Apache Iceberg.
A new bucket type, Amazon S3 Tables brings a number of benefits for faster analytics, allowing applications to discover data more quickly through queryable object metadata.
S3 is currently the largest object store in the world, holding over 400 trillion objects for millions of customers. AWS found that Apache Parquet had become one of the fastest-growing data file formats on S3; Parquet stores tabular data in a columnar layout well suited to querying. Iceberg is one of a number of open table formats (OTFs) that can manage collections of Parquet files.
Iceberg is not itself a query engine; rather, it provides a table layer that lets users query Parquet data with SQL through their preferred engine, such as Apache Spark or Apache Flink.
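To make that concrete, here is a minimal PySpark sketch of querying an Iceberg table with SQL. The catalog name, warehouse path and table are hypothetical, and the setup assumes the Iceberg Spark runtime JAR is available to Spark; S3 Tables itself uses an AWS-provided catalog rather than the plain Hadoop catalog shown here.

    # Minimal PySpark sketch: querying an Iceberg table with SQL.
    # Catalog name, warehouse path and table are hypothetical; assumes the
    # iceberg-spark-runtime JAR is on Spark's classpath.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("iceberg-demo")
        # Register an Iceberg catalog named "demo" backed by a Hadoop catalog.
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hadoop")
        .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
        .getOrCreate()
    )

    # Plain SQL against the Iceberg table, as with any other Spark table.
    spark.sql("""
        SELECT customer_id, SUM(amount) AS total
        FROM demo.sales.orders
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 10
    """).show()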
Iceberg comes with its own challenges, Garman told the audience.
“A lot of customers will tell you that — as many open source projects are — Iceberg is actually really challenging to manage, particularly at scale. It’s hard to manage the performance, the scalability, the security,” Garman said. “And so what happens is, you hire dedicated teams to take care of things like table maintenance, data compaction, access controls, all of the things that go into managing and trying to get better performance out of your Iceberg implementations.”
“S3 is completely reinventing object storage specifically for the data lake world to deliver better performance, better cost and better scale.” — Matt Garman, AWS re:Invent
And that is the selling point of Amazon S3 Tables: to take care of all these chores automatically. “We basically improve the performance and scalability of all of your Iceberg tables,” Garman said.
Amazon S3 Tables takes care of the maintenance that comes with Iceberg tables, such as compaction and snapshot chores. It also offers row-level transactions, queryable snapshots via time-travel functionality, schema evolution, and table-level access controls.
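Time travel, for instance, lets you query a table as it existed at an earlier snapshot. A brief sketch, reusing the hypothetical Spark session and table from above (the FOR TIMESTAMP AS OF syntax requires Spark 3.3 or later):

    # Iceberg time travel: query the table as of an earlier point in time.
    # Reuses the hypothetical "spark" session and demo.sales.orders table.
    spark.sql("""
        SELECT COUNT(*) AS orders_then
        FROM demo.sales.orders FOR TIMESTAMP AS OF '2024-12-01 00:00:00'
    """).show()

    # DataFrame equivalent, with the snapshot timestamp in epoch milliseconds.
    df = (
        spark.read
        .format("iceberg")
        .option("as-of-timestamp", "1733011200000")
        .load("demo.sales.orders")
    )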
Amazon S3 Tables is integrated (in preview) with the AWS Glue Data Catalog, which provides a gateway to AWS’ own visualization and analysis services such as Amazon Athena, Redshift, EMR, and QuickSight.
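In practice, that integration means a cataloged table can be queried from Athena like any other. A hedged boto3 sketch; the database, table and result-bucket names are hypothetical:

    # Hypothetical sketch: querying a Glue-cataloged table from Amazon Athena.
    import boto3

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString="SELECT customer_id, amount FROM orders LIMIT 10",
        QueryExecutionContext={"Database": "sales"},  # hypothetical Glue database
        ResultConfiguration={"OutputLocation": "s3://my-results-bucket/athena/"},
    )
    # Athena runs asynchronously; poll get_query_execution() with this ID.
    print(resp["QueryExecutionId"])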
Better Metadata
Amazon S3 Tables also eliminates the need for customers to build and/or maintain their own metadata systems.
As the size of user data grows to the petabyte level, object metadata becomes ever more important; attributes such as the date and location of origin can be essential for finding the data you need, Garman explained.
Managing metadata can also be a chore, Garman pointed out. You have to store the metadata, associate it with the relevant object, and then build an event-processing pipeline to surface it during searches.
A related feature, S3 Metadata, now in preview, automatically generates metadata for each new object, including system information about the object itself (size, source and so on). Users and applications can add their own custom metadata as well (e.g., product SKUs, transaction IDs, content ratings, customer details).
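S3 has long supported attaching user-defined key-value metadata to an object at upload time; the new feature is what makes that metadata easy to find and query. A minimal boto3 sketch, with hypothetical bucket, key and values:

    # Attaching custom metadata to an S3 object at upload time.
    # Bucket, key and metadata values are hypothetical.
    import boto3

    s3 = boto3.client("s3")
    with open("clip-001.mp4", "rb") as body:
        s3.put_object(
            Bucket="my-media-bucket",
            Key="videos/clip-001.mp4",
            Body=body,
            Metadata={  # stored with the object as x-amz-meta-* headers
                "product-sku": "SKU-12345",
                "content-rating": "PG",
            },
        )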
“S3 Metadata is the fastest and easiest way for you to instantly discover information about your S3 data,” Garman said.
“We automatically store all of your object metadata in an Iceberg table, and then you can use your favorite analytics tool to easily interact with and query that data, so you can quickly learn more about your objects and find the object you’re looking for,” he said. “And as objects change, S3 automatically updates that metadata in minutes, so it’s always up to date.”
Amazon Aurora DSQL, a Distributed SQL Database
Garman also introduced a new, distributed version of the company’s Aurora SQL database service, called Amazon Aurora DSQL.
PostgreSQL-compatible DSQL offers nearly unlimited scalability, according to the company, as partitions can be spread out across multiple disks and even across multiple availability zones.
It offers strong consistency and 99.999% multiregion availability.
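Because DSQL is PostgreSQL-compatible, connecting to it looks like connecting to any Postgres database. A simplified sketch using psycopg2; the endpoint is hypothetical, and the credential handling is deliberately glossed over (DSQL authenticates with short-lived IAM tokens rather than static passwords):

    # Connecting to a PostgreSQL-compatible endpoint such as Aurora DSQL.
    # The endpoint is hypothetical; DSQL actually uses short-lived IAM auth
    # tokens in place of the static password placeholder shown here.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.dsql.us-east-1.on.aws",  # hypothetical endpoint
        port=5432,
        dbname="postgres",
        user="admin",
        password="<iam-auth-token>",
        sslmode="require",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT now()")
        print(cur.fetchone())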
Single-server database systems can offer strong consistency, though they are confined to a single region where the server lives. There are also distributed databases that can offer multiregion availability, though they suffer in performance, as it takes time to synchronize the database cluster across all the regions.
AWS built DSQL to do both, Garman said.
And, as a managed service, DSQL has no infrastructure to manage: no need to provision, patch, or manage database instances. Updates and security patching happen with no downtime.
There are a number of high-performance distributed relational databases, such as CockroachDB, although AWS claims that DSQL is four times faster than competing distributed SQL databases.
Aurora DSQL does this by decoupling transaction processing from storage.
“We actually separated the transaction processing from the storage layer so you don’t need every single statement to go check at commit time,” Garman explained. “Instead, you do the single on commit; we parallelize all of the writes at the same time across all of the regions, so you can get strong consistency across regions with super fast writes to the database.”
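A practical consequence of commit-time validation is that a conflicting transaction fails at COMMIT rather than blocking on locks, so client code is expected to retry. A hedged sketch of that pattern, continuing the hypothetical psycopg2 connection from above (the accounts table is invented for illustration):

    # With commit-time (optimistic) concurrency control, conflicts surface
    # as an error at COMMIT; the standard client-side pattern is to retry.
    # The "accounts" table and the connection are hypothetical.
    import psycopg2

    def transfer(conn, src, dst, amount, retries=3):
        for _ in range(retries):
            try:
                with conn.cursor() as cur:
                    cur.execute(
                        "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                        (amount, src),
                    )
                    cur.execute(
                        "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                        (amount, dst),
                    )
                conn.commit()  # conflicts are detected here, at commit time
                return
            except psycopg2.errors.SerializationFailure:
                conn.rollback()  # another writer won; retry the transaction
        raise RuntimeError("transfer did not commit after retries")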
The Amazon Elastic Compute Cloud (EC2) instances holding the database are synchronized through the Amazon Time Sync Service, which ensures microsecond-level time precision. As a result, each region’s copy of the database sees database operations in the exact order in which they occurred.
As companies build an international customer base, they find that a single-node database can’t offer global consistency and sufficiently low latency.
Companies such as Autodesk, Electronic Arts, Klarna, QRT, and Razorpay are exploring the additional benefits a multiregion distributed database would bring. For instance, Razorpay, an Indian financial technology company, could use DSQL to support its growing user base with the strong multiregional consistency needed for financial use cases.
The serverless NoSQL database DynamoDB has gone global as well, offering multiregion, multi-active operation that provides 99.999% availability (using the same technology and architecture as DSQL).