How to start with DynamoDB in your Ktor application (Part 2)

October 11, 2024

Header

In the first post about DynamoDB, I covered how to integrate DynamoDB with a Ktor application, and I performed basic CRUD operations that should show how easy it is, to start with this type of NoSQL database. In this first follow-up post, I will start exploring more advanced DynamoDB features, starting with the table design including composite keys and global secondary indexes.

This post contains a lot more text and less code because I need to explain some concepts first before diving further into code. So before I start with the topic of table design, it is necessary to explain some important related topics first.

Concepts

What is a partition?

In the last post, I used the word "partition" multiple times but did not explain what it means in the context of DynamoDB. In DynamoDB, a partition is a fundamental concept that refers to the way data is distributed across multiple physical storage units for scalability and performance. DynamoDB automatically divides the table’s data into partitions based on the following key factors:

Partition Key (Hash Key): Each item in DynamoDB has a partition key, which is part of the primary key. The value of the partition key is hashed, and this hash determines the partition in which the item will be stored. Items with the same partition key are stored together.
Size of Data: A partition can store up to 10 GB of data. Once a partition reaches this limit, DynamoDB automatically splits the data across more partitions.
Provisioned Throughput (or On-Demand Mode): DynamoDB also uses partitions to manage read and write capacity. Each partition can support a maximum of 3,000 read capacity units (RCUs) and 1,000 write capacity units (WCUs). If a table exceeds these throughput limits, DynamoDB will increase the number of partitions.

Throughput and Capacity

Throughput is another topic that is important to understand. Throughput refers to the rate at which read and write operations can be performed on the tables. This throughput is measured and controlled using Read Capacity Units (RCUs) and Write Capacity Units (WCUs). The throughput capacity for a DynamoDB table can be specified, either in provisioned mode or on-demand mode, depending on the application’s needs.

Provisioned Mode

It is necessary to explicitly define the number of RCUs and WCUs for the DynamoDB table at creation time.

Read Capacity Units (RCUs): 1 RCU represents one strongly consistent read per second for an item up to 4 KB in size. For eventually consistent reads, 1 RCU can handle two reads per second for items up to 4 KB in size.
Write Capacity Units (WCUs): 1 WCU represents one write per second for an item up to 1 KB in size.

Example:

If I have 10 RCUs, my table can perform 10 strongly consistent reads per second on 4 KB items, or 20 eventually consistent reads per second. If my items are larger than 4 KB, they consume more RCUs. Similarly, if I have 5 WCUs, my table can perform 5 writes per second for items of 1 KB or smaller.

Demand Mode

DynamoDB automatically scales the throughput based on the demand for the application. I don’t need to specify RCUs and WCUs in advance. I’m billed based on the actual read and write requests my application makes. This is perfect for lazy developers but may be expensive on extensive access. So decide wisely which mode to use.

Throttling

If your table exceeds the provisioned throughput (in provisioned mode), DynamoDB throttles requests, causing them to fail with a ProvisionedThroughputExceededException. Properly setting throughput and designing efficient partition keys are important to prevent throttling.

Factors Affecting Throughput:

Item Size: Larger items consume more capacity units.
Consistency of Reads: Strongly consistent reads consume more RCUs than eventually consistent reads.
Traffic Spikes: Sudden increases in traffic can lead to throttling if your throughput is not appropriately configured or if you’re in provisioned mode without auto-scaling.

Consistent vs Eventually Consistent Read

An eventually consistent read might not reflect the results of a recently completed write operation. This means that if I perform a read right after a write, the data returned might not include the latest changes. However, after a short time, (usually a second or two), all copies of the data are updated, and eventually consistent reads will return the latest data.

A strongly consistent read returns the most up-to-date result, reflecting all writes that received a successful response prior to the read. In other words, the read operation ensures that I always get the most recent data for the specified item.

Feature	Eventually Consistent Read	Strongly Consistent Read
Default Behavior	Yes	No
Read Capacity Units (RCU)	1 RCU per 4 KB of data	2 RCUs per 4 KB of data
Latency	Lower latency (faster reads)	Higher latency (slightly slower)
Consistency	Might return stale data	Always returns up-to-date data
Availability	High	Might fail if consistency can’t be guaranteed (in time)

Feature

Eventually Consistent Read

Strongly Consistent Read

Default Behavior

Yes

Read Capacity Units (RCU)

1 RCU per 4 KB of data

2 RCUs per 4 KB of data

Latency

Lower latency (faster reads)

Higher latency (slightly slower)

Consistency

Might return stale data

Always returns up-to-date data

Availability

High

Might fail if consistency can’t be guaranteed (in time)

Eventually Consistent Read: Suitable for most use cases where high availability and throughput are more important than up-to-the-second accuracy. It is the default option and is cost-effective due to lower read capacity usage.

Strongly Consistent Read: Ideal for scenarios where the latest state of the data is essential. It should be used when the application cannot tolerate stale data, even temporarily. Examples include inventory checks in e-commerce or verifying the latest financial transactions.

Now that the necessary concepts are explained, I can continue with the topic of composite keys.

Composite Keys for Tables

In the previous post I created a ProductEntity with a partition key that consists of a single property (productId). I’m sure that every entry in the table has a different value for productId. This is important because a partition key has to be unique across all items. If I try to insert a second product with the same productId the existing item is updated and not a second one is inserted. So what to do if there is no single property that uniquely identifies an item? Besides the creation of an artificial partition key, (e.g. by using a UUID) or adding a suffix to an existing property, using a composite key is an option.

DynamoDB allows tables to have composite keys, which consist of a partition key (hash key) and a sort key (range key). The partition key is used to determine the partition (or node) where the data is stored. The sort key is used to uniquely identify items within the same partition. Composite keys allow for more advanced data modeling and querying capabilities, especially when I need to retrieve multiple related items that share the same partition key.

Now that I’ve talked a lot, let’s have a look at how this can be implemented in my Kotlin application.

data class OrderEntity(
    @DynamoKtPartitionKey
    val customerId: String,
    @DynamoKtSortKey
    val orderDate: Long,
    val productId: String,
    val amount: Double,
    val paymentType: String,
)

You may wonder why using a Long type for representing the OrderDate. This is done for simplicity. I will show in a later post how to use a date type and use a custom converter for storage. DynamoDB not support the storage of date types, so a conversion is necessary on storage and retrieving.

This table design allows me to query orders by customerId and get all of them sorted by the date they were ordered.

I create a repository for the OrderEntity and also a service and http endpoints for testing the functionality. I implement the repository in the same way the ProductRepository of the previous post is implemented. When executing the endpoint for adding a new order, sadly this does not work and returns an exception:

software.amazon.awssdk.services.dynamodb.model.DynamoDbException: The number of conditions on the keys is invalid (Service: DynamoDb, Status Code: 400, Request ID: 7e2df78e-d61e-40c3-a450-ad3bc546e54a)
	at software.amazon.awssdk.services.dynamodb.model.DynamoDbException$BuilderImpl.build(DynamoDbException.java:104)
	at software.amazon.awssdk.services.dynamodb.model.DynamoDbException$BuilderImpl.build(DynamoDbException.java:58)

So what I did wrong? Debugging the exception I can see that the problem occurs inside the findById - method because just providing the partition key is no longer enough, but I also have to provide the sort key as well.

suspend fun findById(customerId: String, orderDate: Long): Order? {
        return table.getItem(
            Key.builder().partitionValue(customerId).sortValue(orderDate).build()
        ).await()?.toOrder()
    }

Updating the repository method for finding an order also makes it necessary to update the OrderService. Restarting the application with these changes makes it possible again to create new orders.

In the next step I also update the deleteById because it also needs the orderDate for deletion of an order.

suspend fun deleteById(customerId: String, orderDate: Long) {
    table.deleteItem(Key.builder().partitionValue(customerId).sortValue(orderDate).build()).await()
}

With this change, the CRUD operations for the order are working the same as for the product.

What other purposes has the sort key?

Sorting Items within a Partition

The primary use of a sort key is to sort items within the same partition. Items that share the same partition key are stored together, and the sort key determines the order in which they are stored and queried.

Example: In an Order table with a partition key CustomerId and a sort key OrderDate, multiple orders for the same customer (i.e., the same CustomerId) will be sorted by OrderDate.

Range Queries on Sort Key

The sort key enables efficient range queries within the same partition. I can query or filter items within a partition based on a range of values of the sort key.

Common range operations on the sort key:

Query by exact match: Find a specific item based on both the partition key and sort key.

see findById(…) above.

Query by range: Find all items where the sort key falls between two values (BETWEEN).

suspend fun findInRange(customerId: String, from: Long, to: Long): List<Order> {
    return buildList {
        table.query(
            QueryConditional.sortBetween(
                Key.builder().partitionValue(customerId).sortValue(from).build(),
                Key.builder().partitionValue(customerId).sortValue(to).build()
            )
        ).asFlow().collect { it.items().stream().forEach { item -> add(item.toOrder()) } }
    }
}

Query by prefix: Find items where the sort key begins with a specific prefix (begins_with).

suspend fun findStartingWith(customerId: String, localDate: LocalDate): List<Order> {
    return buildList {
        table.query(
            QueryConditional.sortBeginsWith(
                Key.builder().partitionValue(customerId).sortValue(localDate.toEpochSecond(
                    LocalTime.MIN, ZoneOffset.UTC
                )).build(),
                )
            ).asFlow().collect { it.items().stream().forEach { item -> add(item.toOrder()) } }
    }
}

Query by comparison: Use comparison operators like >, <, >=, ⇐ to query on the sort key.

suspend fun findBefore(customerId: String, orderDate: Long): List<Order> {
    return buildList {
        able.query(
            QueryConditional.sortLessThan(
                Key.builder().partitionValue(customerId).sortValue(orderDate).build(),
                )
            ).asFlow().collect { it.items().stream().forEach { item -> add(item.toOrder()) } }
    }
}

Storing Multiple Items under the Same Partition Key

A sort key allows me to store multiple items under the same partition key. As already mentioned above without a sort key, DynamoDB tables can only store one item per partition key. Adding a sort key makes it possible to have multiple items with the same partition key, but with different sort keys, making my data model much more flexible.

Example: In the Order table, I can store all orders (OrderId) made by the same customer (CustomerId) and use OrderDate as the sort key to uniquely identify each order under the same customer.

Efficient Pagination

When querying items based on both partition and sort key, DynamoDB can return items in a sorted order (by the sort key), which enables efficient pagination.

Example: In the Order table with CustomerId as the partition key and OrderDate as the sort key, I can paginate through a customer’s orders, retrieving them page by page, ordered by date.

Performing Aggregations

I can model DynamoDB tables to perform time-based aggregations or grouping by using the sort key to organize data.

Example: In the Order table, I can group orders by month or year by using a sort key like OrderDate. This would let me query all orders for a customer for a specific year or month.

Modeling 1-to-Many Relationships

A sort key is essential in 1-to-many relationships, where multiple child items are related to a single parent. I can model such relationships by storing all related items under the same partition key and distinguishing them with the sort key.

Example: In a Customer-Order model, the CustomerId would be the partition key, and the OrderDate could be the sort key, allowing me to store multiple orders for a single customer.

Conditional Writes

The sort key can also be used to enforce conditional uniqueness across items within the same partition key. If I attempt to insert an item with the same partition key and sort key combination, DynamoDB will reject the insert (unless I overwrite the existing item).

Example: In the Order table, two orders cannot have the same OrderId and OrderDate within the same customer (CustomerId). DynamoDB will ensure this condition automatically.

There is another topic that is related to the table design - Global Secondary Index. In the next part I will explain how this can help work with tables.

Global Secondary Index

In DynamoDB, a Global Secondary Index (GSI) is a powerful feature that allows for querying data on non-primary key attributes with the flexibility to define different partition and sort keys compared to the base table.

Purpose of a Global Secondary Index:

Querying on non-primary key attributes: By default, DynamoDB only allows querying based on the primary key (partition key and optionally the sort key) of a table. A GSI provides a way to query on attributes other than the primary key, which increases flexibility in querying patterns.
Improved Query Performance: GSIs allow you to efficiently retrieve a subset of data from a table without scanning the entire table, which is crucial when working with large datasets.
Separate read/write capacity: A GSI has its own provisioned or on-demand capacity settings, allowing you to scale reads and writes independently of the main table.

How to Use a Global Secondary Index:

When I define a GSI, I specify a new partition key (and optionally a sort key) from the existing attributes of my table. I also specify which attributes to project from the main table into the GSI. There are three types of projection:

Keys only: Only the partition and sort keys are projected into the index.
Include: I specify additional attributes to project.
All: All table attributes are projected into the index.

Once the GSI is created, I can use it to query data as I would with the base table’s primary key. I specify the index name, the partition key, and (optionally) the sort key of the GSI. Queries on a GSI are indexed and efficient, similar to querying the main table’s primary key.

Let’s have a look on an example that makes the advantage of using GSI clearer.

Example:

Let’s have a look again on the Order table with the following structure:

Partition Key (PK): CustomerId (a unique ID for each customer who placed the order)
Sort Key (SK): OrderDate (the date the order was placed)

Other attributes include:

PaymentTpe: The type which is used for payment.
OrderTotal: The total value of the order.
ProductId: The ID of the product that was ordered.

This structure is useful when I want to query orders by CustomerId and maybe sort them by OrderDate if I’m fetching a range of orders, but it doesn’t help much if I want to query by other important attributes like PaymentType or ProductId. This especially is relevant if the table contains a large amount of data. When query data using the partition and sort key the performance is very fast, independent of the size of data in the table. But when query for an other attribute, DynamoDB needs to scan the whole table, which gets slower as more items the table contains. This is the use-case for global secondary indexes.

Problem:

I now want to support the following additional query patterns:

Find all orders by a specific PaymentType.
Find all orders for a specific ProductId.

These query patterns are not supported by the main table’s CustomerId and OrderDate primary key. Using a scan operation to filter by PaymentType or ProductId would be inefficient and costly, especially as the table grows.

When creating an GSI it is important to first think about the modeling of the partition/sort key. In contrast to the table in a GSI the key out of partition key and optional sort key does not need to be unique. If the key is not chosen well, the result of a query may return too much items and it is necessary to filter locally.

Solution: Create a GSI

To support these queries, I would create two different GSIs with different primary keys:

GSI #1: Query by OrderId

I want to query the table by PaymentType to retrieve all orders placed by a specific customer. To do this, I can create a GSI with:

Partition Key: PaymentType Sort Key: OrderDate

This index will allow me to efficiently query all orders that belong to a specific order and optionally filter them by date.

GSI #2: Query by ProductId

To find all orders that contain a specific ProductId, I would create another GSI:

Partition Key: ProductId Sort Key: OrderDate

This index would allow me to retrieve all orders that include a particular product.

This time I will use the AWS CLI for the creation of the necessary elements in the local DynamoDB. This should show that all operations can also be done using the CLI. I start by creating the Orders table.

aws dynamodb create-table \
    --table-name OrderEntity \
    --attribute-definitions \
        AttributeName=customerId,AttributeType=S \
        AttributeName=orderDate,AttributeType=N \
    --key-schema \
        AttributeName=customerId,KeyType=HASH \
        AttributeName=orderDate,KeyType=RANGE \
    --billing-mode PAY_PER_REQUEST \
    --endpoint-url http://localhost:8000

In the next step I create both GSI by using an update-table request.

aws dynamodb update-table \
  --table-name OrderEntity \
  --attribute-definitions AttributeName=paymentType,AttributeType=S \
                           AttributeName=orderDate,AttributeType=N \
  --global-secondary-index-updates \
      "[{\"Create\":{\"IndexName\": \"PaymentType-OrderDate-index\",
      \"KeySchema\":[{\"AttributeName\":\"paymentType\",\"KeyType\":\"HASH\"},
                    {\"AttributeName\":\"orderDate\",\"KeyType\":\"RANGE\"}],
      \"Projection\":{\"ProjectionType\":\"ALL\"},
      \"ProvisionedThroughput\":{\"ReadCapacityUnits\":5,\"WriteCapacityUnits\":5}}}]" \
  --endpoint-url http://localhost:8000

aws dynamodb update-table \
  --table-name OrderEntity \
  --attribute-definitions AttributeName=productId,AttributeType=S \
                           AttributeName=orderDate,AttributeType=N \
  --global-secondary-index-updates \
      "[{\"Create\":{\"IndexName\": \"ProductId-OrderDate-index\",
      \"KeySchema\":[{\"AttributeName\":\"productId\",\"KeyType\":\"HASH\"},
                    {\"AttributeName\":\"orderDate\",\"KeyType\":\"RANGE\"}],
      \"Projection\":{\"ProjectionType\":\"ALL\"},
      \"ProvisionedThroughput\":{\"ReadCapacityUnits\":5,\"WriteCapacityUnits\":5}}}]" \
  --endpoint-url http://localhost:8000

With these commands, the table and both GSI are created and I can switch over to the sample project to create the necessary code for querying the data.

In order to use a secondary index for querying data it is necessary to add additional annotations to the OrderEntity class.

data class OrderEntity(
    @DynamoKtPartitionKey
    val customerId: String,
    @DynamoKtSortKey
    @DynamoKtSecondarySortKey(indexNames = [PRODUCT_ID_ORDER_DATE_INDEX, PAYMENT_TYPE_ORDER_DATE_INDEX])
    val orderDate: Long,
    @DynamoKtSecondaryPartitionKey(indexNames = [PRODUCT_ID_ORDER_DATE_INDEX])
    val productId: String,
    val amount: Double,
    @DynamoKtSecondaryPartitionKey(indexNames = [PAYMENT_TYPE_ORDER_DATE_INDEX])
    val paymentType: String,
)

With this the query operations can be implemented as below:

suspend fun findAllByProductIdInRange(productId: String, from: Instant, to: Instant): List<Order>{
    return buildList {
        table.index(PRODUCT_ID_ORDER_DATE_INDEX).query(
            QueryConditional.sortBetween(
                Key.builder().partitionValue(productId).sortValue(from.toEpochMilli()).build(),
                Key.builder().partitionValue(productId).sortValue(to.toEpochMilli()).build()
            )
        ).asFlow().collect { it.items().stream().forEach { item -> add(item.toOrder()) } }
    }
}

suspend fun findAllByPaymentTypeUntil(paymentType: String, until: Instant): List<Order>{
    return buildList {
        table.index(PAYMENT_TYPE_ORDER_DATE_INDEX).query(
            QueryConditional.sortLessThanOrEqualTo(
                Key.builder().partitionValue(paymentType).sortValue(until.toEpochMilli()).build()
            )
        ).asFlow().collect { it.items().stream().forEach { item -> add(item.toOrder()) } }
    }
}

The first query method is using the ProductId-OrderDate-index index to query all orders by productId and orderDate in range.

The second query method is using the PaymentType-OrderDate-index index to query all orders by paymentType that have a orderDate before a given timestamp.

As you can see querying data from an index works nearly the same as for the table. I just need to specify which index I use to run the query on.

Conclusion

In the second part of my posts related to DynamoDB, I started by explaining some of the concepts that are important to understand when working with DynamoDB and want to benefit from its advantages. Comparing to traditional relational databases like Postgresql or MySQL, there is a concept like a partition that you need to understand to store and retrieve items with high performance, independent of the size. For this, it is also important to know that tables have limits regarding read and write operations. Depending on your needs, it is possible to use provisioned mode, where I have complete control over the costs or when you are lazy and the expected traffic is within the expected range the demand mode. This one scales automatically depending on the traffic and no throttling occurs on unexpected peaks. Creating composite primary keys for tables helps for better modeling of table to support constant request times and prevent scanning the whole table, what leads to poor performance and costs. This is one of the differences when working with DynamoDB comparing to relation databases, I need to think about query patterns before the creation of the table. With the help of the sort key as an extension to the partition key, there are a lot of performant queries possible, including searching in a specified time range or after a specific start time. In case, there is no single way to access the table global secondary indexes allows creating additional projections of the original table with different partition and sort key. With this the same data can be queried in a performant way by different attributes. But that is not for free, because the table data is "copied" for the index, additional costs exists. So it is important to think about what access patterns are really necessary before creating a GSI for every use-case.

Even I covered quite a lot of topics in this post, there are still topics, I named in the first post, but still uncovered until now. It seems that I will have to write at minimum one additional post covering the missing topics like converters for custom data types, batch processing for read and write operations, filter expressions, and TTL. So you can look forward to what’s to come soon.

You can find the full code used for this article on Github.

Twitter Facebook LinkedIn

Matthias Schenk