Pessimistic Optimism: The Case Of Unexpected Deadlocks
Despite the fact I've been developing software in one way or another for almost 20 years, I'm constantly surprised how things work, and, once the frustration wears off, I'm intrigued when things break. Lucky for me, over the last few weeks, I've been surprised, frustrated, and intrigued by what, on the surface, seems like a very trivial problem: MySQL reporting deadlocks in unexpected places and unexplainable slow query log entries when using optimistic concurrency control.
The Example
Before we start, let's get on the same page so that, if you wish, you can follow along. I assume you have a basic understanding of SQL and ACID. I also assume you have a basic understanding of FOREIGN KEY
s and UNIQUE
constraints. This article will use READ COMMITTED
, which is the minimum necessary level for optimistic concurrency control.
Throughout this article, we will use the following setup in a MySQL database:
CREATE DATABASE test;
USE test;
CREATE TABLE parent (
id BIGINT PRIMARY KEY,
version INT NOT NULL
);
CREATE TABLE child (
id BIGINT PRIMARY KEY,
parent_id BIGINT NOT NULL,
reference BIGINT NOT NULL,
CONSTRAINT parentid_reference_uk UNIQUE(parent_id, reference),
CONSTRAINT parentid_fk FOREIGN KEY(parent_id) REFERENCES parent(id)
);
INSERT INTO parent(id, version) VALUES(1, 0);
To summarise, we've created a one-to-many relationship between parent
and child
: each child
has one parent
, and each parent
has many child
references. I'd like to call out the reference
column and parentid_reference_uk
constraint in the child
table. These two things make the frustrating magic happen, but if you're wondering, reference
is simply an arbitrary value.
Optimistic Concurrency Control
If you've followed any of my other articles, or if you've been in the field for a while, you've probably seen the term 'optimistic locking' a number of times. OCC is much like optimistic locking: you are able to modify an entity without locking it under the optimistic assumption that a competing write will be rare.
If you're already familiar with the OCC algorithm, feel free to skip this section. Otherwise, it's pretty simple and I will use the parent
and child
tables to demonstrate. In this scenario, the parent
-child
relationship is composition, meaning the child
entities are owned exclusively by a parent
. Further, to uphold invariants of the relationship, this means child
entities should not be modified outside of the parent
graph.
-
Record the version - Before modifying a
parent
entity, you first need to record its current version.BEGIN; SELECT version FROM parent WHERE id = 1; COMMIT;
A read like this could be performed off a replica database or performed in time long prior to the subsequent steps. For this example, let's assume the
version=0
whereid=1
. One important point is that the transaction is being committed after the read. -
Modify the entity - Next, in a new transaction, we'll modify the parent entry by inserting a reference
child
row.BEGIN; INSERT INTO child(id, parent_id, reference) VALUES(1, 1, 1);
Notice that we have not yet committed or rolled back the transaction. It needs to remain open for the next step.
-
Validate the changes - Once the
parent
graph is in the desirable state, we have to validate that our new changes were not conflicted. We do this by bumping up theparent
entity'sversion
while asserting the currentversion
is still0
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0;
This update will return one of two results: an update count of
0
or1
. If the update count is0
, the parent record was modified by another transaction some time between step 1 and now. In this case, we shouldROLLBACK
; otherwise, an update count of1
indicates that the record was not changed by another transaction. We may thenCOMMIT
. After committing, the newversion
is1
, which means any other prior reads ofparent(version=0)
are invalid.
Deadlocks
Before moving along, let's be clear about the term, "deadlock". A deadlock is, essentially, a state when one thread, thread 1, is preventing another thread, thread 2, from making meaningful progress. Further, thread 2 is also preventing thread 1 from making meaningful progress. For example,
- Thread 1 acquires lock A
- Thread 2 acquires lock B
- Thread 1 waits for lock B
- Thread 2 waits for lock A
Many people much smarter than me have found ways to describe, detect, and prevent deadlocks, so I'll leave the Internet at your disposal.
Reproducing The Surprise
Given my understanding of optimistic concurrency control, I certainly never expected to observe deadlocks in the way demonstrated below. Yet, as usual, I was humbled by the unusually large logs in production informing me I was wrong. Step one in diagnosing an issue is reproducing it, so I set-up two connections to a local database and started testing things out.
-
First, in both connections, disable
autocommit
and set aREAD COMMITTED
transaction isolation levelSET AUTOCOMMIT = 0; SET TRANSACTION ISOLATION LEVEL READ COMMITTED; BEGIN;
Both transactions,
2511
and2510
, are ready---TRANSACTION 2511, not started MySQL thread id 68, OS thread handle 0x7fce0e270700, query id 563 127.0.0.1 someuser cleaning up ---TRANSACTION 2510, not started MySQL thread id 67, OS thread handle 0x7fce0e23f700, query id 562 127.0.0.1 someuser cleaning up
-
In the first transaction, we'll insert a new child row that references
parent(id=1)
. As noted, we'll arbitrarily usereference=1
.INSERT INTO child(id, parent_id, reference) VALUES(1, 1, 1); Query OK, 1 row affected (0.00 sec)
Note that this transaction has not yet been committed. Taking a look at the InnoDB engine status, we can see that transaction
2510
now has a single insert with1 row lock
.---TRANSACTION 2511, not started MySQL thread id 68, OS thread handle 0x7fce0e270700, query id 563 127.0.0.1 someuser cleaning up ---TRANSACTION 2510, ACTIVE 3 sec 3 lock struct(s), heap size 360, 1 row lock(s), undo log entries 1 MySQL thread id 67, OS thread handle 0x7fce0e23f700, query id 565 127.0.0.1 someuser cleaning up
Hopefully, you're not as confused as I was, uncertain about which 1 row was locked. If you are, we will clear it up later.
-
In the second transaction, we'll insert a new child row that purposefully duplicates the
parentid_reference_uk
constraint.INSERT INTO child(id, parent_id, reference) VALUES(2, 1, 1); <hanging>
Again, this transaction has not yet been committed. And, in fact, the
INSERT
call is blocking. Taking a look at the InnoDB engine status, we have a bit more information to help understand what's going on.---TRANSACTION 2511, ACTIVE 2 sec inserting mysql tables in use 1, locked 1 LOCK WAIT 4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1 MySQL thread id 68, OS thread handle 0x7fce0e270700, query id 567 127.0.0.1 someuser update INSERT INTO child(id, parent_id, reference) VALUES(2, 1, 1) ------- TRX HAS BEEN WAITING 2 SEC FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 14 page no 4 n bits 72 index `parentid_reference_uk` of table `test`.`child` trx id 2511 lock mode S waiting ------------------ ---TRANSACTION 2510, ACTIVE 22 sec 4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1 MySQL thread id 67, OS thread handle 0x7fce0e23f700, query id 565 127.0.0.1 someuser cleaning up
We can see that transaction
2551
is waiting to lock a row inparentid_reference_uk
. You may also notice that transaction2510
now has 2 row locks. This is the first sign that something weird is about to happen, but let's proceed. Also notice thatundo log entries 1
in both transactions indicates that something has happened in both. -
In the first transaction, we'll bump the version of the
parent(id=1)
record to ensure that no other transactions have changed it.UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0; Query OK, 1 row affected (0.01 sec) Rows matched: 1 Changed: 1 Warnings: 0
Note, again, that we have not yet called
COMMIT
. The changes made in this transaction are uncommitted and should not yet be visible to any other transactions. -
The second transaction, which, as you may recall, was waiting for a lock, has now failed. The failure is reported as a deadlock
INSERT INTO child(id, parent_id, reference) VALUES(2, 1, 1); ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
Taking a look in the InnoDB engine status, we can hopefully start to diagnose the deadlock
------------------------ LATEST DETECTED DEADLOCK ------------------------ *** (1) TRANSACTION: TRANSACTION 2511, ACTIVE 23 sec inserting mysql tables in use 1, locked 1 LOCK WAIT 4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1 MySQL thread id 68, OS thread handle 0x7fce0e270700, query id 567 127.0.0.1 someuser update INSERT INTO child(id, parent_id, reference) VALUES(2, 1, 1) *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 14 page no 4 n bits 72 index `parentid_reference_uk` of table `test`.`child` trx id 2511 lock mode S waiting *** (2) TRANSACTION: TRANSACTION 2510, ACTIVE 43 sec starting index read mysql tables in use 1, locked 1 6 lock struct(s), heap size 1184, 3 row lock(s), undo log entries 1 MySQL thread id 67, OS thread handle 0x7fce0e23f700, query id 569 127.0.0.1 someuser updating UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0 *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 14 page no 4 n bits 72 index `parentid_reference_uk` of table `test`.`child` trx id 2510 lock_mode X locks rec but not gap *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 8 page no 3 n bits 72 index `PRIMARY` of table `test`.`parent` trx id 2510 lock_mode X locks rec but not gap waiting *** WE ROLL BACK TRANSACTION (1)
If, like me, you don't initially understand how this could possibly be a deadlock, be comforted that it took me quite some time to understand it. If the deadlock reported a conflict on
PRIMARY
on both threads, I would have understood, but some information is missing here, and it took a deep-dive into the MySQL source code before I could fill in the gap.
A Deep Dive
In the following sections, I reference the MySQL source code to help explain this behaviour. Because I am neither a MySQL developer nor an expert of its inner workings, however, I encourage any readers to correct my understanding wherever necessary. Further, I encourage you to follow along in the source code.
At a very, very high level, MySQL is a database API that is backed by numerous storage engines such as InnoDB, blackhole, NDB, MyISAM, etc. To simplify in an extreme way, consider that the storage engines are responsible for enforcing (or not enforcing) all the aspects of ACID, while MySQL handles the remaining functionality. In this article, as in most deployments of MySQL, I am using the InnoDB engine.
If you're following along with the source code, spend a few minutes exlporing example/ha_example.cc
, which is a documented example of a custom storage engine. Then, when you're ready, open up innobase/handler/ha_innodb.cc
, the InnoDB storage engine entry point, and come back.
InnoDB Inserts
Before proceeding, you'll need to know a few InnoDB terms. First, within InnoDB, everything is a B+tree. The data in a table are stored in a tree called the "clustered index" and is ordered by the primary key while the data stored in table indices are called "secondary indices". Within each tree, "rows" (or records) are stored on "pages" (of a predictable size).
Indices
When InnoDB performs an insert, a number of things happen. Using the parent
and child
tables demonstrated in this article, I will explain the pertinent steps. Recall that the first statement was INSERT INTO child
. So, starting in innobase/handler/ha_innodb.cc
, navigate to ha_innobase::write_row(uchar*)
, which is the handler for an INSERT
statement. Give this function a quick read and then come back at Step-5
:
/* Step-5: Execute insert graph that will result in actual insert. */
error = row_insert_for_mysql((byte*) record, m_prebuilt);
This call to row_insert_for_mysql
in innobase/row/row0mysql.cc
evaluates the insert and determines whether it is for an intrinsic table
or a regular table. Intrinsic tables are in-memory tables used to store intermediary results; such tables are optimised to avoid many of the expensive guarantees of ACID, since an intrinsic table insert is only visible to the current thread (which cannot contend with itself).
if (dict_table_is_intrinsic(prebuilt->table)) {
return(row_insert_for_mysql_using_cursor(mysql_rec, prebuilt));
} else {
return(row_insert_for_mysql_using_ins_graph(mysql_rec, prebuilt));
}
In this case, child
is not an intrinsic table, so we proceed to row_insert_for_mysql_using_ins_graph
. To save you the leg work of groking the next couple steps, the next calls are in innobase/row/row0ins.cc
: row_ins_step(que_thr_t*)
, which then calls row_ins(ins_node_t*, que_thr_t*)
. An insert essentially consists of inserting the new record into all non-fulltext indices of the table, starting with the clustered index (i.e. first_index
):
node->index = dict_table_get_first_index(node->table);
...
while (node->index != NULL) {
if (node->index->type != DICT_FTS) {
err = row_ins_index_entry_step(node, thr);
...
node->index = dict_table_get_next_index(node->index);
Stepping through row_ins_index_entry_step
, we end up in row_ins_index_entry
, which branches between a clustered index or secondary index:
if (dict_index_is_clust(index)) {
return(row_ins_clust_index_entry(index, entry, thr, 0, false));
} else {
return(row_ins_sec_index_entry(index, entry, thr, false));
}
As noted, the first index is always the clustered index. In this example, this is a straight-forward insert, so we'll skip it and move on to the more interesting row_ins_sec_index_entry
, which, among other, things eventually calls the following two functions (in this order):
row_ins_check_foreign_constraint
for each foreign key indexrow_ins_scan_sec_index_for_duplicate
to look for duplicates in each index
The child
table has 2 secondary indices: parentid_reference_uk
, a unique index, and parentid_fk
, a foreign key index.
Foreign Keys
Starting with row_ins_check_foreign_constraint
, an index scan is performed to find a matching record in the foreign index. If the record is found, the matched record in the foreign index is locked with a read lock (shared
) through row_ins_set_shared_rec_lock
.
cmp = cmp_dtuple_rec(entry, rec, offsets);
if (cmp == 0) {
...
err = row_ins_set_shared_rec_lock(LOCK_REC_NOT_GAP, block, rec, check_index, offsets, thr);
In this case, the clustered index of parent(id=1)
is locked with a read lock to ensure that the foreign key constraint is upheld until the transaction is either committed or rolled-back.
Unique Keys
With the parent
row locked, insertion may continue. InnoDB next finds the appropriate insertion point in the index
btr_cur_search_to_nth_level(index, 0, entry, PAGE_CUR_LE, search_mode, &cursor, 0, __FILE__, __LINE__, &mtr);
Upon return, cursor
points to the necessary insertion point for the new record, but before insertion happens, InnoDB needs to ensure that the record does not violate any UNIQUE
constraints.
n_unique = dict_index_get_n_unique(index);
if (dict_index_is_unique(index) && (cursor.low_match >= n_unique || cursor.up_match >= n_unique)) {
...
err = row_ins_scan_sec_index_for_duplicate(flags, index, entry, thr, check, &mtr, offsets_heap);
First, n_unique
is set to the number of fields that are unique to the index. In the parentid_reference_uk
secondary index, n_unique
is 2, as there are 2 fields. From the call to btr_cur_search_to_nth_level
, both low_match
and up_match
are set. These fields report the number of fields in the record before and after the insertion point that match the new entry. So, in the case of the first insert to child
, low_match=0
and up_match=0
, whereas in the second insert to child
, low_match=2
and up_match=0
.
As such, during the second transaction's INSERT
into child
, a call is made to row_ins_scan_sec_index_for_duplicate
err = row_ins_set_shared_rec_lock(lock_type, block, rec, index, offsets, thr);
...
cmp = cmp_dtuple_rec(entry, rec, offsets);
if (cmp == 0 && !index->allow_duplicates) {
if (row_ins_dupl_error_with_rec(rec, entry, index, offsets)) {
err = DB_DUPLICATE_KEY;
A shared lock is placed on each potential match prior to performing a full comparison. This ensures the consistency guarantees within the transaction. For this article, the important part of this method is the attempted call to row_ins_set_shared_rec_lock
, which, as mentioned, places a read lock on each potential match.
Assuming all things succeed without failure, the record is inserted in row_ins_sec_index_entry_by_modify
and another customer is satisfied.
Putting It All Together
To summarise, when InnoDB performs an INSERT
, it first inserts into the clustered index, then into each secondary index. For each index, the following two steps happen in order
- If this is a foreign key, lock the parent record
- If this is a unique key, lock the potential matching records
Then, let's briefly review the the OCC pattern again
- Record the row's version
- Perform database changes
- Validate that the changes in #2 reflect the state recorded in #1
- Commit or roll-back
Putting these together, the deadlock becomes apparent
- Transaction 1 inserts the data into the clustered index of
child
- Transaction 1 locks
parent(id=1)
with a read lock due to the foreign key - Transaction 1 inserts the data into
parentid_reference_uk
and locks the record with a write lock - Transaction 2 inserts the data into the clustered index of
child
- Transaction 2 locks
parent(id=1)
with a read lock due to the foreign key - Transaction 2 attempts to put a read lock on
parentid_reference_uk(parent_id=1,reference=1)
- Transaction 1 attempts to promote its read lock to a write lock on
parent(id=1)
to perform theUPDATE
At this point, there's a deadlock: neither transaction 1 nor 2 can make meaningful progress until they each release their locks for the other. Something has to give, so InnoDB kills one of the transactions and lets the other proceed. While I will not say that this is incorrect behaviour, I will say this is extremely surprising: I do not expect uncommitted changes to deadlock another transaction. To be fair, however, InnoDB does describe their UPDATE
reads as "semi-consistent", so this point is, at least, documented.
Indeed, a deadlock would arise without the unique index. However, in such a case, the deadlock would happen at a time we can reason about: when both transactions attempt to UPDATE parent
. And, in such a case, MySQL actually reports the deadlock in a useful way:
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 32 page no 3 n bits 72 index PRIMARY of table `test`.`parent` trx id 2511 lock_mode X locks rec but not gap waiting
*** (2) TRANSACTION:
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 32 page no 3 n bits 72 index PRIMARY of table `test`.`parent` trx id 2510 lock mode S locks rec but not gap
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 32 page no 3 n bits 72 index PRIMARY of table `test`.`parent` trx id 2510 lock_mode X locks rec but not gap waiting
*** WE ROLL BACK TRANSACTION (2)
The Workaround
Unless somebody can recommend another fix that doesn't compromise data integrity, I believe the only work-around is to be less optimistic. By changing the order of operations, the deadlock can be avoided at the cost of a write lock for the duration of the transaction:
BEGIN;
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0;
INSERT INTO child(id, parent_id, reference) VALUES(1, 1, 1);
COMMIT;
By updating the parent first, all subsequent transactions will block waiting to acquire a write lock. This will prevent them from modifying the child
collection until after the current transaction is finished. I would argue rather strongly, however, at this point, we no longer have optimistic locking since a write lock is being held on the entire graph for the entire operation.
Further Thoughts
MySQL, and, in particular, InnoDB is an interesting piece of software that powers a significant portion of the Internet. If you can remember its numerous nuances, it will probably take you wherever you want to go. This all said, I will never describe MySQL as my favourite RDBMS; but, rather, an old acquaintance that comes around once or twice a year that you just have to humour and feed.
In closing, while the MySQL Internals are fairly well documented, the knowledge I gained from in-person training from an exceptional DBA and from Percona was rather useful in understanding this issue. I would strongly encourage all readers to participate in similar training no matter who it is from.