Designing Data-Intensive Applications – Secondary Indexes, Rebalancing, Routing

Description

We wrap up the discussion on partitioning from our collective favorite book, Designing Data-Intensive Applications, while Allen is properly substituted, Michael can't stop thinking about Kafka, and Joe doesn't live in the real sunshine state.

The full show notes for this episode are available at https://www.codingblocks.net/episode172.

Survey Says

News

Game Ja Ja Ja Jam is coming up, sign up is open now! (itch.io)
Joe finished the Create With Code Unity Course (learn.unity.com)
New MacBook Pro Review, notch be darned!

Last Episode …

Best book evar!

In our previous episode, we talked about data partitioning, which refers to how you can split up data sets, which is great when you have data that's too big to fit on a single machine, or you have special performance requirements. We talked about two different partitioning strategies: key ranges which works best with homogenous, well-balanced keys, and also hashing which provides a much more even distribution that helps avoid hot-spotting.

This episode we're continuing the discussion, talking about secondary indexes, rebalancing, and routing.

Partitioning, Part Deux

Partitioning and Secondary Indexes

Last episode we talked about key range partitioning and key hashing to deterministically figure out where data should land based on a key that we chose to represent our data.
- But what happens if you need to look up data by something other than the key?
- For example, imagine you are partitioning credit card transactions by a hash of the date. If I tell you I need the data for last week, then it's easy, we hash the date for each day in the week.
- But what happens if I ask you to count all the transactions for a particular credit card?
  - You have to look at every single record. in every single partition!
Secondary Indexes refer to metadata about our data that help keep track of where our data is.
In our example about counting a user's transactions in a data set that is partitioned by date, we could keep a separate data structure that keeps track of which partitions each user has data in.
We could even easily keep a count of those transactions so that you could return the count of a user's transaction solely from the information in the secondary index.
Secondary indexes are complicated. HBase and Voldemort avoid them, while search engines like Elasticsearch specialize in them.
There are two main strategies for secondary indexes:
- Document based partitioning, and
- Term based partitioning.

Document Based Partitioning

Remember our example dataset of transactions partitioned by date? Imagine now that each partition keeps a list of each user it holds, as well as the key for the transaction.
When you query for users, you simply ask each partition for the keys for that user.
Counting is easy and if you need the full record, then you know where the key is in the partition. Assuming you store the data in the partition ordered by key, it's a quick lookup.
Remember Big O? Finding an item in an ordered list is O(log n). Which is much, much, much faster than looking at every row in every partition, which is O(n).
We have to take a small performance hit when we insert (i.e. write) new items to the index, but if it's something you query often it's worth it.
Note that each partition only cares about the data they store, they don't know anything about what the other partitions have. Because of that, we call it a local index.
Another name for this type of approach is "scatter/gather": the data is scattered as you write it and gathered up again when you need it.
This is especially nice when you have data retention rules. If you partition by date and only keep 90 days worth of data, you can simply drop old partitions and the secondary index data goes with them.

Term Based Partitioning

If we are willing to make our writes a little more complicated in exchange for more efficient reads, we can step up to term based partitioning.
One problem with having each partition keeping track of their local data is you have to query all the partitions. What if the data's only on one partition? Our client still needs to wait to hear back from all partitions before returning the result.
What if we pulled the index data away from the partitions to a separate system?
Now we check this secondary index to figure out the keys, which we can then go look up on the appropriate indices.
We can go one step further and partition this secondary index so it scales better. For example, userId 1-100 might be on one, 101-200 on another, etc.
The benefit of term based partitioning is you get more efficient reads, the downside is that you are now writing to multiple spots: the node the data lives on and any partitions in our indexing system that we need to account for any secondary indexes. And this is multiplied by replication.
This is usually handled by asynchronous writes that are eventually consistent. Amazon's DynamoDB states it's global secondary indexes are updated within a fraction of a second normally.

Rebalancing Partitions

What do you do if you need to repartition your data, maybe because you're adding more nodes for CPU, RAM, or losing nodes?
Then it's time to rebalance your partitions, with the goals being to …
- Distribute the load equally-ish (notice we didn't say data, could have some data that is more important or mismatched nodes),
- Keep the database operational during the rebalance procedure, and
- Minimize data transfer to keep things fast and reduce strain on the system.
Here's how not to do it: hash % (number of nodes)
- Imagine you have 100 nodes, a key of 1000 hashes to 0. Going to 99 nodes, that same key now hashes to 1, 102 nodes and it now hashes to 4 … it's a lot of change for a lot of keys.

Partitions > Nodes

You can mitigate this problem by fixing the number of partitions to a value higher than the number of nodes.
This means you move where the partitions go, not the individual keys.
- Same recommendation applies to Kafka: keep the numbers of partitions high and you can change nodes.
- In our example of partitioning data by date, with a 7 years retention period, rebalancing from 10 nodes to 11 is easy.
What if you have more nodes than partitions, like if you had so much data that a single day was too big for a node given the previous example?
- It's possible, but most vendors don't support it. You'll probably want to choose a different partitioning strategy.
Can you have too many partitions? Yes!
- If partitions are large, rebalancing and recovering from node failures is expensive.
- On the other hand, there is overhead for each partition, so having many, small partitions is also expensive.

Other methods of partitioning

Dynamic partitioning:
- It's hard to get the number of partitions right especially with data that changes it's behavior over time.
  - There is no magic algorithm here. The database just handles repartitioning for you by splitting large partitions.
  - Databases like HBase and RethinkDB create partitions dynamically, while Mongo has an option for it.
Partitioning proportionally to nodes:
- Cassandra and Ketama can handle partitioning for you, based on the number of nodes. When you add a new node it randomly chooses some partitions to take ownership of.
  - This is really nice if you expect a lot of fluctuation in the number of nodes.

Automated vs Manual Rebalancing

We talked about systems that automatically rebalance, which is nice for systems that need to scale fast or have workloads that are homogenized.
You might be able to do better if you are aware of the patterns of your data or want to control when these expensive operations happen.
Some systems like Couchbase, Riak, and Voldemort will suggest partition assignment, but require an administrator to kick it off.
But why? Imagine launching a large online video game and taking on tons of data into an empty system … there could be a lot of rebalancing going on at a terrible time. It would have been much better if you could have pre-provisioned ahead of time … but that doesn't work with dynamic scaling!

Request Routing

One last thing … if we're dynamically adding nodes and partitions, how does a client know who to talk to?
This is an instance of a more general problem called "service discovery".
There are a couple ways to solve this:
- The nodes keep track of each other. A client can talk to any node and that node will route them anywhere else they need to go.
- Or a centralized routing service that the clients know about, and it knows about the partitions and nodes, and routes as necessary.
- Or require that clients be aware of the partitioning and node data.
No matter which way you go, partitioning and node changes need to be applied. This is notoriously difficult to get right and REALLY bad to get wrong. (Imagine querying the wrong partitions …)
Apache ZooKeeper is a common coordination service used for keeping track of partition/node mapping. Systems check in or out with ZooKeeper and ZooKeeper notifies the routing tier.
Kafka (although not for much longer), Solr, HBase, and Druid all use ZooKeeper. MongoDb uses a custom ConfigServer that is similar.
Cassandra and Riak use a "gossip protocol" that spreads the work out across the nodes.
Elasticsearch has different roles that nodes can have, including data, ingestion and … you guessed it, routing.

Parallel Query Execution

So far we've mostly talked about simple queries, i.e. searching by key or by secondary index … the kinds of queries you would be running in NoSQL type situations.
What about? Massively Parallel Processing (MPP) relational databases that are known for having complex join, filtering, aggregations?
The query optimizer is responsible for breaking down these queries into stages which target primary/secondary indexes when possible and run these stages in parallel, effectively breaking down the query into subqueries which are then joined together.
That's a whole other topic, but based on the way we talked about primary/secondary indexes today you can hopefully have a better understanding of how the query optimizer does that work. It splits up the query you give it into distinct tasks, each of which could run across multiple partitions/nodes, runs them in parallel, and then aggregates the results.
- Designing Data-Intensive Applications goes into it in more depth in future chapters while discussing batch processing.

Resources We Like

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (Amazon)

Tip of the Week

PowerLevel10k is a Zsh "theme" that adds some really nice features and visual candy. It's highly customizable and works great with Kubernetes and Git. (GitHub)
If for some reason VS Code isn't in your path, you can add it easily within VS Code. Open up the command palette (CTRL+SHIFT+P / COMMAND+SHIFT+P) and search for "path". Easy peasy!
Gently Down the Stream is a guidebook to Apache Kafka written and illustrated in the style of a children's book. Really neat way to learn! (GentlyDownThe.Stream)
PostgreSQL is one of the most powerful and versatile databases. Here is a list of really cool things you can do with it that you may not expect. (HakiBenita.com)

Check out PowerLevel10k