Using Primary Key Parts In A Secondary Index Of An InnoDB Table
Recently, I was asked what would happen if a primary key was explicitly put into a secondary index of an InnoDB table. For example, given a table employees
,
CREATE TABLE employees (
id BIGINT PRIMARY KEY,
first_name VARCHAR(25) NOT NULL,
last_name VARCHAR(25) NOT NULL,
date_of_birth DATETIME NOT NULL
)
the question is what happens when we add a secondary index in the form of
ALTER TABLE employees
ADD KEY employees_name_idx(first_name, last_name, id);
Up until this question, I had assumed (and even been told) that InnoDB would generate an index in the order of (first_name, last_name, id, id)
, leading to a duplicate listing of id
. More, if the primary key was not trailing, such as in the case of
ALTER TABLE employees
ADD KEY employees_name_idx(first_name, id, last_name);
I had similarly assumed that the index would be created in order, (first_name, id, last_name, id)
.
I was wrong. This isn't at all what happens. As it turns out, on creation of an index, InnoDB only appends the primary key to the end of an index if it is not already present. Similarly, during retrieval of a record from an InnoDB secondary index, InnoDB algorithmically finds the primary key for use during lookup in PRIMARY
.
Examining the data
We can understand this a little bit better by using a debugger. In the next several sections, we'll look at how the underlying index structure looks while exploring some of the InnoDB source code.
Without id
When no primary key is added into the index, we should expect InnoDB to append the primary key, id
, to the end of the secondary index:
ALTER TABLE employees
ADD KEY employees_name_without_pk(first_name, last_name);
SHOW CREATE TABLE employees;
...
PRIMARY KEY (`id`),
KEY `employees_name_without_pk` (`first_name`,`last_name`)
We don't actually see the primary key appended into this index, but it's there. We can use a debugger to find it:
(lldb) b row_search_mvcc
Breakpoint 2: where = mysqld`row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long) + 63 at row0sel.cc:4485:3, address = 0x000055555a738a24
(lldb) continue
Then, upon running a query, SELECT * FROM employees WHERE first_name = 'mary'
, we hit the breakpoint:
* thread #39, name = 'mysqld', stop reason = breakpoint 2.1
frame #0: 0x000055555a738a24 mysqld`row_search_mvcc(buf="\U00000004", mode=PAGE_CUR_GE, prebuilt=0x00007fff1cc64e80, match_mode=1, direction=0) at row0sel.cc:4485:3
(lldb) print prebuilt->index->name
(id_name_t) $0 = (m_name = "employees_name_without_pk")
(lldb) print prebuilt->index->n_fields
(unsigned int) $1 = 3
(lldb) print prebuilt->index->fields[0].name
(id_name_t) $4 = (m_name = "first_name")
(lldb) print prebuilt->index->fields[1].name
(id_name_t) $5 = (m_name = "last_name")
(lldb) print prebuilt->index->fields[2].name
(id_name_t) $6 = (m_name = "id")
So while the SHOW CREATE TABLE
command didn't show us the trailing id
, we can actually see it in the running code.
With id
Keeping our debugger on the same breakpoint, what happens when we explicitly include the id
column in the secondary index?
ALTER TABLE employees
DROP KEY employees_name_without_pk,
ADD KEY employees_name_with_pk(first_name, last_name, id);
SHOW CREATE TABLE employees;
...
PRIMARY KEY (`id`),
KEY `employees_name_with_pk` (`first_name`,`last_name`,`id`)
In this case, we can see that MySQL does show id
as a part of the secondary index; however, is the underlying index actually different? Let's run the same SELECT * FROM employees WHERE first_name = 'mary'
:
(lldb) print prebuilt->index->name
(id_name_t) $10 = (m_name = "employees_name_with_pk")
(lldb) print prebuilt->index->n_fields
(unsigned int) $11 = 3
(lldb) print prebuilt->index->fields[0].name
(id_name_t) $12 = (m_name = "first_name")
(lldb) print prebuilt->index->fields[1].name
(id_name_t) $13 = (m_name = "last_name")
(lldb) print prebuilt->index->fields[2].name
(id_name_t) $14 = (m_name = "id")
In this case, n_fields
is still 3
, and not 4
as it would be with a redundant id
column. If you want to live dangerously and risk a segfault (this isn't production, right), we can even be positive there's no extra field:
(lldb) print prebuilt->index->fields[3]
(dict_field_t) $15 = {
col = nullptr
name = (m_name = "")
prefix_len = 3936
fixed_len = 91
is_ascending = 1
}
Underlying mechanism
Even if we put id
between first_name
and last_name
to create an order of (first_name, id, last_name)
, n_fields
is still 3
. Let's look into how this behaviour is happening:
Index creation
To see what's happening during index creation, let's set a breakpoint on dict_index_build_internal_non_clust
inside dict0dict.cc
:
(lldb) b dict_index_build_internal_non_clust
Breakpoint 1: where = mysqld`::dict_index_build_internal_non_clust(const dict_table_t *, dict_index_t *) + 36 at dict0dict.cc:3057:3, address = 0x000055555a993a7e
(lldb) continue
Next, we'll create another index:
ALTER TABLE employees
ADD KEY employees_first_id_last(first_name, id, last_name);
At this point, we'll hit our new breakpoint, several layers deep in the ALTER
call stack:
* thread #40, name = 'mysqld', stop reason = breakpoint 1.1
frame #0: 0x000055555a993a7e mysqld`::dict_index_build_internal_non_clust(table=0x00007fff1c1024a0, index=0x00007fff1c17d440) at dict0dict.cc:3057:3
frame #1: 0x000055555a991622 mysqld`dict_index_add_to_cache_w_vcol(table=0x00007fff1c1024a0, index=0x00007fff1c17d440, add_v=0x0000000000000000, page_no=4294967295, strict=1) at dict0dict.cc:2380:52
frame #2: 0x000055555a6c2423 mysqld`row_merge_create_index(trx=0x00007fffea519118, table=0x00007fff1c1024a0, index_def=0x00007fff1c128ae0, add_v=0x0000000000000000) at row0merge.cc:3611:39
frame #3: 0x000055555a4fbeea mysqld`::prepare_inplace_alter_table_dict<dd::Table>(ha_alter_info=0x00007fffe87a4f30, altered_table=0x00007fff1c1755d0, old_table=0x00007fff1c031770, old_dd_tab=0x00007fff1c032e08, new_dd_tab=0x00007fff1c00cac8, table_name="employees", flags=33, flags2=16, fts_doc_id_col=18446744073709551615, add_fts_doc_id=false, add_fts_doc_id_idx=false, prebuilt=0x00007fff1c116140) at handler0alter.cc:4888:31
frame #4: 0x000055555a50edc3 mysqld`bool ha_innobase::prepare_inplace_alter_table_impl<dd::Table>(this=0x00007fff1c100b28, altered_table=0x00007fff1c1755d0, ha_alter_info=0x00007fffe87a4f30, old_dd_tab=0x00007fff1c032e08, new_dd_tab=0x00007fff1c00cac8) at handler0alter.cc:6094:42
frame #5: 0x000055555a4e5e74 mysqld`ha_innobase::prepare_inplace_alter_table(this=0x00007fff1c100b28, altered_table=0x00007fff1c1755d0, ha_alter_info=0x00007fffe87a4f30, old_dd_tab=0x00007fff1c032e08, new_dd_tab=0x00007fff1c00cac8) at handler0alter.cc:1183:53
frame #6: 0x00005555589bb142 mysqld`handler::ha_prepare_inplace_alter_table(this=0x00007fff1c100b28, altered_table=0x00007fff1c1755d0, ha_alter_info=0x00007fffe87a4f30, old_table_def=0x00007fff1c032e08, new_table_def=0x00007fff1c00cac8) at handler.cc:5136:37
frame #7: 0x0000555558f33a56 mysqld`::mysql_inplace_alter_table(thd=0x00007fff1c001040, schema=0x00007fff1c017eb8, new_schema=0x00007fff1c017eb8, table_def=0x00007fff1c032e08, altered_table_def=0x00007fff1c00cac8, table_list=0x00007fff1c118a20, table=0x00007fff1c031770, altered_table=0x00007fff1c1755d0, ha_alter_info=0x00007fffe87a4f30, inplace_supported=HA_ALTER_INPLACE_NO_LOCK_AFTER_PREPARE, alter_ctx=0x00007fffe87a5a20, columns=0x00007fffe87a4ea0, fk_key_info=0x00007fff1c11b998, fk_key_count=0, fk_invalidator=0x00007fffe87a4e70) at sql_table.cc:13157:52
frame #8: 0x0000555558f414e1 mysqld`mysql_alter_table(thd=0x00007fff1c001040, new_db="test", new_name=0x0000000000000000, create_info=0x00007fffe87a6f00, table_list=0x00007fff1c118a20, alter_info=0x00007fffe87a6d60) at sql_table.cc:17381:36
...
In this frame, the index
argument passed into dict_index_build_internal_non_clust
has already been created in mysql_inplace_alter_table
(frame #7). However, in its current state, it only includes the user-defined fields, and InnoDB does not know if those user-defined fields are capable of uniquely identifying a record in PRIMARY
:
(lldb) print index->name
(id_name_t) $4 = (m_name = "employees_first_id_last")
(lldb) p index->n_fields
(unsigned int) $6 = 3
We can consider this index
argument a "template" of the eventual index that InnoDB needs to create, which it do in a variable, new_index
:
* thread #40, name = 'mysqld', stop reason = step over
frame #0: 0x000055555a993cb4 mysqld`::dict_index_build_internal_non_clust(table=0x00007fff1c1024b0, index=0x00007fff1c17d440) at dict0dict.cc:3071:36
3068 ut_ad(!dict_index_is_ibuf(clust_index));
3069
3070 /* Create a new index */
-> 3071 new_index = dict_mem_index_create(table->name.m_name, index->name,
3072 index->space, index->type,
3073 index->n_fields + 1 + clust_index->n_uniq);
3074
Skipping ahead in this function, we can see InnoDB map new_index->n_user_defined_cols = index->n_fields
, which solidifies the understanding that InnoDB is, in fact, creating the index we asked it to create:
* thread #40, name = 'mysqld', stop reason = step over
frame #0: 0x000055555a993ce7 mysqld`::dict_index_build_internal_non_clust(table=0x00007fff1c1024b0, index=0x00007fff1c17d440) at dict0dict.cc:3078:43
3075 /* Copy other relevant data from the old index
3076 struct to the new struct: it inherits the values */
3077
-> 3078 new_index->n_user_defined_cols = index->n_fields;
3079
3080 new_index->id = index->id;
The value of new_index->n_fields
will be updated to a value greater than or equal to new_index->n_user_defined_cols
later.
After skipping ahead a couple more lines, the new_index
metadata is updated to indicate whether or not a full column is indexed, therefore usable for covering index optimization:
* thread #40, name = 'mysqld', stop reason = step over
frame #0: 0x000055555a993da7 mysqld`::dict_index_build_internal_non_clust(table=0x00007fff1c1024b0, index=0x00007fff1c17d440) at dict0dict.cc:3090:10
3087 static_cast<ibool *>(ut_zalloc_nokey(table->n_cols * sizeof *indexed));
3088
3089 /* Mark the table columns already contained in new_index */
-> 3090 for (i = 0; i < new_index->n_def; i++) {
3091 field = new_index->get_field(i);
3092
...
3097 /* If there is only a prefix of the column in the index
3098 field, do not mark the column as contained in the index */
3099
3100 if (field->prefix_len == 0) {
3101 indexed[field->col->ind] = TRUE;
3102 }
3103 }
After completing this loop, the code continues to examine the index and determine whether or not the entire primary key is included in the new secondary index. In this case, clust_index->n_uniq
is a count of the columns used in to uniquely identify a record (i.e. the primary key) in the clust_index
(i.e. PRIMARY
). This loop works because the first n_uniq
columns of clust_index
are the primary key, in primary key order:
* thread #40, name = 'mysqld', stop reason = step over
frame #0: 0x000055555a993f0e mysqld`::dict_index_build_internal_non_clust(table=0x00007fff1c1024b0, index=0x00007fff1c17d440) at dict0dict.cc:3108:3
3105 /* Add to new_index the columns necessary to determine the clustered
3106 index entry uniquely */
3107
-> 3108 for (i = 0; i < clust_index->n_uniq; i++) {
3109 field = clust_index->get_field(i);
3110
3111 if (!indexed[field->col->ind]) {
3112 dict_index_add_col(new_index, table, field->col, field->prefix_len,
3113 field->is_ascending);
On line 3111
, the code checks whether or not the nth column of the primary key is included in this new secondary index. If it isn't, this column of the primary key is appended to the end of the secondary index. In our case, clust_index->n_uniq
is 1
with id
being the entire primary key. This means that the check on line 3111
is false and the loop exits with no extra columns being added. At this point, InnoDB completes the index creation, which, for this article, is an uninteresting process.
Index lookup
To see how InnoDB re-derives the full primary key from a secondary index, we'll set another breakpoint within row_search_mvcc
. In the current git rev I am looking at, 3933c4e
, we're interested in line 5597
:
(lldb) b row0sel.cc:5597
Breakpoint 6: where = mysqld`row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long) + 14103 at row0sel.cc:5597:3, address = 0x000055555a73c0fc
Next, we'll re-run the query, SELECT * FROM employees WHERE first_name = 'mary'
to trigger this breakpoint:
* thread #40, name = 'mysqld', stop reason = breakpoint 6.1
frame #0: 0x000055555a73c0fc mysqld`row_search_mvcc(buf="\U00000004", mode=PAGE_CUR_GE, prebuilt=0x00007fff1cc43110, match_mode=1, direction=0) at row0sel.cc:5597:3
5594 }
5595 }
5596
-> 5597 if (use_clustered_index) {
5598 requires_clust_rec:
5599 ut_ad(index != clust_index);
5600 /* We use a 'goto' to the preceding label if a consistent
(lldb) p index->name
(id_name_t) $35 = (m_name = "employees_first_id_last")
(lldb) p use_clustered_index
(bool) $11 = true
At this breakpoint, we've seen that this query is reading from the employees_first_id_last
secondary index, and we've seen that InnoDB is at a point where it needs to perform a lookup in PRIMARY
for the full record it has found. Next, InnoDB will invoke Row_sel_get_clust_rec_for_mysql::operator
, which is where we'll place our next breakpoint:
(lldb) b row0sel.cc:3216
Breakpoint 7: where = mysqld`Row_sel_get_clust_rec_for_mysql::operator()(row_prebuilt_t*, dict_index_t*, unsigned char const*, que_thr_t*, unsigned char const**, unsigned long**, mem_block_info_t**, dtuple_t const**, mtr_t*, lob::undo_vers_t*) + 242 at row0sel.cc:3216:29, address = 0x000055555a735708
(lldb) continue
* thread #40, name = 'mysqld', stop reason = breakpoint 7.1
frame #0: 0x000055555a735708 mysqld`Row_sel_get_clust_rec_for_mysql::operator(this=0x00007fffe87a6440, prebuilt=0x00007fff1cc43110, sec_index=0x00007fff1c129510, rec="mary\x80", thr=0x00007fff1cc43c00, out_rec=0x00007fffe87a6270, offsets=0x00007fffe87a6290, offset_heap=0x00007fffe87a6288, vrow=0x0000000000000000, mtr=0x00007fffe87a6ca0, lob_undo=0x00007fff1cc43338)(row_prebuilt_t*, dict_index_t*, unsigned char const*, que_thr_t*, unsigned char const**, unsigned long**, mem_block_info_t**, dtuple_t const**, mtr_t*, lob::undo_vers_t*) at row0sel.cc:3216:29
3213 *out_rec = nullptr;
3214 trx = thr_get_trx(thr);
3215
-> 3216 row_build_row_ref_in_tuple(prebuilt->clust_ref, rec, sec_index, *offsets,
3217 trx);
3218
3219 clust_index = sec_index->table->first_index();
On line 3216
, InnoDB will dispatch onto row_build_row_ref_in_tuple
, which populates the prebuilt->clust_ref
member with the primary key. Stepping into this function, we can see how:
(lldb) step
(lldb) next ...
* thread #40, name = 'mysqld', stop reason = step over
frame #0: 0x000055555a726a1b mysqld`row_build_row_ref_in_tuple(ref=0x00007fff1cc43970, rec="mary\x80", index=0x00007fff1c129510, offsets=0x00007fffe87a6660, trx=0x00007fffea519118) at row0row.cc:829:24
826 ut_ad(!index->is_clustered());
827 ut_a(index->table);
828
-> 829 clust_index = index->table->first_index();
830 ut_ad(clust_index);
831
832 if (!offsets) {
On line 829
, a reference to the PRIMARY
index is stored into clust_index
, which is used on line 840
:
* thread #40, name = 'mysqld', stop reason = step over
frame #0: 0x000055555a726b61 mysqld`row_build_row_ref_in_tuple(ref=0x00007fff1cc43970, rec="maryd-last\x80", index=0x00007fff1c113ba0, offsets=0x00007fffe87a6660, trx=0x00007fffea519118) at row0row.cc:842:3
839 ut_ad(!rec_offs_any_extern(offsets));
840 ref_len = dict_index_get_n_unique(clust_index);
841
-> 842 ut_ad(ref_len == dtuple_get_n_fields(ref));
843
844 dict_index_copy_types(ref, clust_index, ref_len);
845
(lldb) print ref_len
(ulint) $15 = 1
After stepping forward a bit more to line 840
, we can see that ref_len
is set to n_unique
of clust_index
; this number, n_unique
represents the number of columns used to uniquely identify a record in an index (in this case, in PRIMARY
). Because the primary key for this table is a single column, id
, we expect ref_len
to be 1
, which it is.
Stepping ahead a bit more to line 846
, a ref_len
is used as the upper bounds of a for
loop to iterate over the primary key:
(lldb) next ...
* thread #40, name = 'mysqld', stop reason = step over
frame #0: 0x000055555a726bc5 mysqld`row_build_row_ref_in_tuple(ref=0x00007fff1cc43970, rec="mary\x80", index=0x00007fff1c129510, offsets=0x00007fffe87a6660, trx=0x00007fffea519118) at row0row.cc:846:10
843
844 dict_index_copy_types(ref, clust_index, ref_len);
845
-> 846 for (i = 0; i < ref_len; i++) {
847 dfield = dtuple_get_nth_field(ref, i);
848
849 pos = dict_index_get_nth_field_pos(index, clust_index, i);
...
853 field = rec_get_nth_field(rec, offsets, pos, &len);
854
855 dfield_set_data(dfield, field, len);
Of particular interest is line 849
, calling dict_index_get_nth_field_pos
, which will search in index
for the i
-th column in clust_index
, which we know is id
:
(lldb) p clust_index->fields[0].name
(id_name_t) $20 = (m_name = "id")
Because id
is the second column in our secondary index, we should expect the return from dict_index_get_nth_field_pos
to return 1
(0-based):
(lldb) call dict_index_get_nth_field_pos(index, clust_index, 0)
(ulint) $25 = 1
If we want to be sure, we can dump the fields from both indexes:
(lldb) parray 6 clust_index->fields
(dict_field_t *const) $21 = 0x00007fff1c114bc8 {
(dict_field_t) [0] = { name = (m_name = "id") }
(dict_field_t) [1] = { name = (m_name = "DB_TRX_ID") }
(dict_field_t) [2] = { name = (m_name = "DB_ROLL_PTR") }
(dict_field_t) [3] = { name = (m_name = "first_name") }
(dict_field_t) [4] = { name = (m_name = "last_name") }
(dict_field_t) [5] = { name = (m_name = "date_of_birth")
(lldb) parray 3 index->fields
(dict_field_t *const) $22 = 0x00007fff1c116018 {
(dict_field_t) [0] = { name = (m_name = "first_name") }
(dict_field_t) [1] = { name = (m_name = "id") }
(dict_field_t) [2] = { name = (m_name = "last_name") }
}
In this (modified) dump, we can see that id
is, in fact, at index 0
of clust_index
and index 1
of index
, agreeing with our understanding. Proceeding through to line 855
, the code writes the value of id
into ref
(which, you may recall, is a pointer to prebuilt->clust_ref
), which we can further scrutinise:
(lldb) call mach_read_int_type((const unsigned char *) ref->fields[0].data, 8, 0)
(ib_uint64_t) $82 = 4
In this dump, we can see that the primary key read into prebuilt->clust_ref
(through ref
) is 4
(which is, in fact, our employee's id
). With the interesting part of this function is complete, we may step out:
(lldb) finish
* thread #41, name = 'mysqld', stop reason = step out
frame #0: 0x000055555a735740 mysqld`Row_sel_get_clust_rec_for_mysql::operator(this=0x00007fffe87a6440, prebuilt=0x00007fff1c118fa0, sec_index=0x00007fff1c115cb0, rec="mary\x80", thr=0x00007fff1c119a90, out_rec=0x00007fffe87a6270, offsets=0x00007fffe87a6290, offset_heap=0x00007fffe87a6288, vrow=0x0000000000000000, mtr=0x00007fffe87a6ca0, lob_undo=0x00007fff1c1191c8)(row_prebuilt_t*, dict_index_t*, unsigned char const*, que_thr_t*, unsigned char const**, unsigned long**, mem_block_info_t**, dtuple_t const**, mtr_t*, lob::undo_vers_t*) at row0sel.cc:3219:28
3216 row_build_row_ref_in_tuple(prebuilt->clust_ref, rec, sec_index, *offsets,
3217 trx);
3218
-> 3219 clust_index = sec_index->table->first_index();
3220
3221 btr_pcur_open_with_no_init(clust_index, prebuilt->clust_ref, PAGE_CUR_LE,
3222 BTR_SEARCH_LEAF, prebuilt->clust_pcur, 0, mtr);
3223
3224 clust_rec = btr_pcur_get_rec(prebuilt->clust_pcur);
From here, lines 3219
through 3224
are responsible for performing the actual read on PRIMARY
(using prebuilt->clust_ref
), ultimately returning the associated record in PRIMARY
.
Closing thoughts
In this article, we used a debugger to inspect the internals of MySQL/InnoDB. In doing this, we learned that a secondary index of an InnoDB table may contain the primary key (or parts of a primary key) in an arbitrary position of the secondary index. In such a case, InnoDB is capable of re-deriving the primary key for a subsequent look-up for the corresponding record in PRIMARY
without needing to append the full primary key to the end of the index. I find it interesting is that this approach is O(n)
, suggesting that an equivalent O(1)
implementation of finding the primary key at the end of the index was not worth sacrificing the flexibility of this capability.
Further, this implementation will change the way I propose solutions for certain data access problems. In the past, I avoided having the primary key in arbitrary parts of the secondary index because I assumed, and been informed, that it would contribute to data bloat. While the need/want to have a unique field in an arbitrary position of a secondary index is somewhat uncommon, it is a pattern I have encountered in the past. I hope that the information in this article is similarly useful to you.