Fun MySQL fact of the day: paginating non-unique ranges
Yesterday, we started considering ways to paginate data without using OFFSET
. We looked at a rather "simple" case where we were paging on a primary or unique key, but in some cases, that's not good enough. This is especially true in a system where the primary or unique key does not necessarily reflect the natural ordering of your data. Today, we'll consider a more involved solution where we want to page on a "creation date", which is a pretty common situation.
Let's consider billions of rows in table, neat_facts
, with with no secondary indices: (pk=id,account_id,creation_date,neat_fact)
. Let's also consider that we have tens of thousands of accounts and a use case where we want to query for account_id
paged 10 results at a time in order of creation_date
. Like yesterday, our first query will be pretty simple:
SELECT id, created_at, neat_fact FROM neat_facts WHERE account_id = 3 ORDER BY created_at, id LIMIT 10;
+---------+---------------------+-------------+
| id | created_at | neat_fact |
|---------+---------------------+-------------|
| 859238 | 2009-07-20 10:18:09 | red chair |
| ... | ... | ... |
| 1031257 | 2009-07-29 13:52:47 | purple hat |
+---------+---------------------+-------------+
If you explain this query, you'll see that this is doing a full table scan, which we absolutely do not want on a table with several billion rows. Knowing what we've learned over the last few months, we know there are a handful of ways to improve this by creating a secondary index. While I'd challenge you to consider what indices you'd choose to create, I propose to CREATE INDEX neat_facts__accountid_createdat ON neat_facts(account_id, created_at)
. You may have chosen a different one, and that's okay: this is, after all, time to learn. Either way, what we're looking for here is the storage engine reading in order of (account_id,created_at,id)
. Now, let's look at the query for page 2.
SELECT id, created_at, neat_fact FROM neat_facts WHERE account_id = 3 AND created_at > '2009-07-29 13:52:47' OR (created_at = '2009-07-29 13:52:47' AND id > 1031257) ORDER BY created_at, id LIMIT 10;
Much like yesterday, we are "pivoting" based on the data from our previous query, except, today, we are dealing with a non-unique created_at
column. To be sure we don't show duplicate results, possibly getting the paging stuck in an infinite loop, we also need to make the search criteria unique, which we do with (created_at,id)
. Now, if you run this query against the index you thought to create, what kind of query plan do you have? With our neat_facts__accountid_createdat
index, you'll see a query type of range
that is Using index condition
. While Using index condition
isn't the most efficient strategy (i.e. compared to covering index optimisation), I'd like to convince you it's probably good enough: the storage engine will only need to do 10 random table reads.
Of course, you could coax MySQL into using only covering index optimisation with an index like (account_id,created_at,id,neat_fact)
, but doing so would require a secondary index that's 8-bytes bigger than PRIMARY
, at which point you may want to consider redesigning your table, instead: (pk=(account_id,created_at,fact_number),neat_fact)
. If you did that, your pagination query will much the same as yesterday's and, in exchange, you'll trade off some write throughput and fragmentation for potentially-increased read throughput and the additional complexity of managing fact_number
.
Our job is, of course, all about trade-offs. And what's more fun than data modelling and query optimisation?