2021-12-11 21:33 UTC

Using Primary Key Parts In A Secondary Index Of An InnoDB Table

Recently, I was asked what would happen if a primary key was explicitly put into a secondary index of an InnoDB table. For example, given a table employees,

CREATE TABLE employees (
    id BIGINT PRIMARY KEY,
    first_name VARCHAR(25) NOT NULL,
    last_name VARCHAR(25) NOT NULL,
    date_of_birth DATETIME NOT NULL
)

the question is what happens when we add a secondary index in the form of

ALTER TABLE employees
ADD KEY employees_name_idx(first_name, last_name, id);

Up until this question, I had assumed (and even been told) that InnoDB would generate an index in the order of (first_name, last_name, id, id), leading to a duplicate listing of id. More, if the primary key was not trailing, such as in the case of

ALTER TABLE employees
ADD KEY employees_name_idx(first_name, id, last_name);

I had similarly assumed that the index would be created in order, (first_name, id, last_name, id).

I was wrong. This isn't at all what happens. As it turns out, on creation of an index, InnoDB only appends the primary key to the end of an index if it is not already present. Similarly, during retrieval of a record from an InnoDB secondary index, InnoDB algorithmically finds the primary key for use during lookup in PRIMARY.

Examining the data

We can understand this a little bit better by using a debugger. In the next several sections, we'll look at how the underlying index structure looks while exploring some of the InnoDB source code.

Without `id`

When no primary key is added into the index, we should expect InnoDB to append the primary key, id, to the end of the secondary index:

ALTER TABLE employees
    ADD KEY employees_name_without_pk(first_name, last_name);

SHOW CREATE TABLE employees;
...
  PRIMARY KEY (`id`),
  KEY `employees_name_without_pk` (`first_name`,`last_name`)

We don't actually see the primary key appended into this index, but it's there. We can use a debugger to find it:

(lldb)  b row_search_mvcc
Breakpoint 2: where = mysqld`row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long) + 63 at row0sel.cc:4485:3, address = 0x000055555a738a24

(lldb) continue

Then, upon running a query, SELECT * FROM employees WHERE first_name = 'mary', we hit the breakpoint:

* thread #39, name = 'mysqld', stop reason = breakpoint 2.1
    frame #0: 0x000055555a738a24 mysqld`row_search_mvcc(buf="\U00000004", mode=PAGE_CUR_GE, prebuilt=0x00007fff1cc64e80, match_mode=1, direction=0) at row0sel.cc:4485:3

(lldb)  print prebuilt->index->name
(id_name_t) $0 = (m_name = "employees_name_without_pk")

(lldb) print prebuilt->index->n_fields
(unsigned int) $1 = 3

(lldb) print prebuilt->index->fields[0].name
(id_name_t) $4 = (m_name = "first_name")

(lldb) print prebuilt->index->fields[1].name
(id_name_t) $5 = (m_name = "last_name")

(lldb) print prebuilt->index->fields[2].name
(id_name_t) $6 = (m_name = "id")

So while the SHOW CREATE TABLE command didn't show us the trailing id, we can actually see it in the running code.

With `id`

Keeping our debugger on the same breakpoint, what happens when we explicitly include the id column in the secondary index?

ALTER TABLE employees
    DROP KEY employees_name_without_pk,
    ADD KEY employees_name_with_pk(first_name, last_name, id);

SHOW CREATE TABLE employees;
...
  PRIMARY KEY (`id`),
  KEY `employees_name_with_pk` (`first_name`,`last_name`,`id`)

In this case, we can see that MySQL does show id as a part of the secondary index; however, is the underlying index actually different? Let's run the same SELECT * FROM employees WHERE first_name = 'mary':

(lldb) print prebuilt->index->name
(id_name_t) $10 = (m_name = "employees_name_with_pk")

(lldb) print prebuilt->index->n_fields
(unsigned int) $11 = 3

(lldb) print prebuilt->index->fields[0].name
(id_name_t) $12 = (m_name = "first_name")

(lldb) print prebuilt->index->fields[1].name
(id_name_t) $13 = (m_name = "last_name")

(lldb) print prebuilt->index->fields[2].name
(id_name_t) $14 = (m_name = "id")

In this case, n_fields is still 3, and not 4 as it would be with a redundant id column. If you want to live dangerously and risk a segfault (this isn't production, right), we can even be positive there's no extra field:

(lldb) print prebuilt->index->fields[3]
(dict_field_t) $15 = {
  col = nullptr
  name = (m_name = "")
  prefix_len = 3936
  fixed_len = 91
  is_ascending = 1
}

Underlying mechanism

Even if we put id between first_name and last_name to create an order of (first_name, id, last_name), n_fields is still 3. Let's look into how this behaviour is happening:

Index creation

To see what's happening during index creation, let's set a breakpoint on dict_index_build_internal_non_clust inside dict0dict.cc:

(lldb) b dict_index_build_internal_non_clust
Breakpoint 1: where = mysqld`::dict_index_build_internal_non_clust(const dict_table_t *, dict_index_t *) + 36 at dict0dict.cc:3057:3, address = 0x000055555a993a7e

(lldb) continue

Next, we'll create another index:

ALTER TABLE employees
ADD KEY employees_first_id_last(first_name, id, last_name);

At this point, we'll hit our new breakpoint, several layers deep in the ALTER call stack:

* thread #40, name = 'mysqld', stop reason = breakpoint 1.1
    frame #0: 0x000055555a993a7e mysqld`::dict_index_build_internal_non_clust(table=0x00007fff1c1024a0, index=0x00007fff1c17d440) at dict0dict.cc:3057:3
    frame #1: 0x000055555a991622 mysqld`dict_index_add_to_cache_w_vcol(table=0x00007fff1c1024a0, index=0x00007fff1c17d440, add_v=0x0000000000000000, page_no=4294967295, strict=1) at dict0dict.cc:2380:52
    frame #2: 0x000055555a6c2423 mysqld`row_merge_create_index(trx=0x00007fffea519118, table=0x00007fff1c1024a0, index_def=0x00007fff1c128ae0, add_v=0x0000000000000000) at row0merge.cc:3611:39
    frame #3: 0x000055555a4fbeea mysqld`::prepare_inplace_alter_table_dict<dd::Table>(ha_alter_info=0x00007fffe87a4f30, altered_table=0x00007fff1c1755d0, old_table=0x00007fff1c031770, old_dd_tab=0x00007fff1c032e08, new_dd_tab=0x00007fff1c00cac8, table_name="employees", flags=33, flags2=16, fts_doc_id_col=18446744073709551615, add_fts_doc_id=false, add_fts_doc_id_idx=false, prebuilt=0x00007fff1c116140) at handler0alter.cc:4888:31
    frame #4: 0x000055555a50edc3 mysqld`bool ha_innobase::prepare_inplace_alter_table_impl<dd::Table>(this=0x00007fff1c100b28, altered_table=0x00007fff1c1755d0, ha_alter_info=0x00007fffe87a4f30, old_dd_tab=0x00007fff1c032e08, new_dd_tab=0x00007fff1c00cac8) at handler0alter.cc:6094:42
    frame #5: 0x000055555a4e5e74 mysqld`ha_innobase::prepare_inplace_alter_table(this=0x00007fff1c100b28, altered_table=0x00007fff1c1755d0, ha_alter_info=0x00007fffe87a4f30, old_dd_tab=0x00007fff1c032e08, new_dd_tab=0x00007fff1c00cac8) at handler0alter.cc:1183:53
    frame #6: 0x00005555589bb142 mysqld`handler::ha_prepare_inplace_alter_table(this=0x00007fff1c100b28, altered_table=0x00007fff1c1755d0, ha_alter_info=0x00007fffe87a4f30, old_table_def=0x00007fff1c032e08, new_table_def=0x00007fff1c00cac8) at handler.cc:5136:37
    frame #7: 0x0000555558f33a56 mysqld`::mysql_inplace_alter_table(thd=0x00007fff1c001040, schema=0x00007fff1c017eb8, new_schema=0x00007fff1c017eb8, table_def=0x00007fff1c032e08, altered_table_def=0x00007fff1c00cac8, table_list=0x00007fff1c118a20, table=0x00007fff1c031770, altered_table=0x00007fff1c1755d0, ha_alter_info=0x00007fffe87a4f30, inplace_supported=HA_ALTER_INPLACE_NO_LOCK_AFTER_PREPARE, alter_ctx=0x00007fffe87a5a20, columns=0x00007fffe87a4ea0, fk_key_info=0x00007fff1c11b998, fk_key_count=0, fk_invalidator=0x00007fffe87a4e70) at sql_table.cc:13157:52
    frame #8: 0x0000555558f414e1 mysqld`mysql_alter_table(thd=0x00007fff1c001040, new_db="test", new_name=0x0000000000000000, create_info=0x00007fffe87a6f00, table_list=0x00007fff1c118a20, alter_info=0x00007fffe87a6d60) at sql_table.cc:17381:36
    ...

In this frame, the index argument passed into dict_index_build_internal_non_clust has already been created in mysql_inplace_alter_table (frame #7). However, in its current state, it only includes the user-defined fields, and InnoDB does not know if those user-defined fields are capable of uniquely identifying a record in PRIMARY:

(lldb) print index->name
(id_name_t) $4 = (m_name = "employees_first_id_last")

(lldb) p index->n_fields
(unsigned int) $6 = 3

We can consider this index argument a "template" of the eventual index that InnoDB needs to create, which it do in a variable, new_index:

* thread #40, name = 'mysqld', stop reason = step over
    frame #0: 0x000055555a993cb4 mysqld`::dict_index_build_internal_non_clust(table=0x00007fff1c1024b0, index=0x00007fff1c17d440) at dict0dict.cc:3071:36
   3068   ut_ad(!dict_index_is_ibuf(clust_index));
   3069
   3070   /* Create a new index */
-> 3071   new_index = dict_mem_index_create(table->name.m_name, index->name,
   3072                                     index->space, index->type,
   3073                                     index->n_fields + 1 + clust_index->n_uniq);
   3074

Skipping ahead in this function, we can see InnoDB map new_index->n_user_defined_cols = index->n_fields, which solidifies the understanding that InnoDB is, in fact, creating the index we asked it to create:

* thread #40, name = 'mysqld', stop reason = step over
    frame #0: 0x000055555a993ce7 mysqld`::dict_index_build_internal_non_clust(table=0x00007fff1c1024b0, index=0x00007fff1c17d440) at dict0dict.cc:3078:43
   3075   /* Copy other relevant data from the old index
   3076   struct to the new struct: it inherits the values */
   3077
-> 3078   new_index->n_user_defined_cols = index->n_fields;
   3079
   3080   new_index->id = index->id;

The value of new_index->n_fields will be updated to a value greater than or equal to new_index->n_user_defined_cols later.

After skipping ahead a couple more lines, the new_index metadata is updated to indicate whether or not a full column is indexed, therefore usable for covering index optimization:

* thread #40, name = 'mysqld', stop reason = step over
    frame #0: 0x000055555a993da7 mysqld`::dict_index_build_internal_non_clust(table=0x00007fff1c1024b0, index=0x00007fff1c17d440) at dict0dict.cc:3090:10
   3087       static_cast<ibool *>(ut_zalloc_nokey(table->n_cols * sizeof *indexed));
   3088
   3089   /* Mark the table columns already contained in new_index */
-> 3090   for (i = 0; i < new_index->n_def; i++) {
   3091     field = new_index->get_field(i);
   3092
   ...
   3097     /* If there is only a prefix of the column in the index
   3098     field, do not mark the column as contained in the index */
   3099
   3100     if (field->prefix_len == 0) {
   3101       indexed[field->col->ind] = TRUE;
   3102     }
   3103   }

After completing this loop, the code continues to examine the index and determine whether or not the entire primary key is included in the new secondary index. In this case, clust_index->n_uniq is a count of the columns used in to uniquely identify a record (i.e. the primary key) in the clust_index (i.e. PRIMARY). This loop works because the first n_uniq columns of clust_index are the primary key, in primary key order:

* thread #40, name = 'mysqld', stop reason = step over
    frame #0: 0x000055555a993f0e mysqld`::dict_index_build_internal_non_clust(table=0x00007fff1c1024b0, index=0x00007fff1c17d440) at dict0dict.cc:3108:3
   3105   /* Add to new_index the columns necessary to determine the clustered
   3106   index entry uniquely */
   3107
-> 3108   for (i = 0; i < clust_index->n_uniq; i++) {
   3109     field = clust_index->get_field(i);
   3110
   3111     if (!indexed[field->col->ind]) {
   3112       dict_index_add_col(new_index, table, field->col, field->prefix_len,
   3113                          field->is_ascending);

On line 3111, the code checks whether or not the nth column of the primary key is included in this new secondary index. If it isn't, this column of the primary key is appended to the end of the secondary index. In our case, clust_index->n_uniq is 1 with id being the entire primary key. This means that the check on line 3111 is false and the loop exits with no extra columns being added. At this point, InnoDB completes the index creation, which, for this article, is an uninteresting process.

Index lookup

To see how InnoDB re-derives the full primary key from a secondary index, we'll set another breakpoint within row_search_mvcc. In the current git rev I am looking at, 3933c4e, we're interested in line 5597:

(lldb) b row0sel.cc:5597
Breakpoint 6: where = mysqld`row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long) + 14103 at row0sel.cc:5597:3, address = 0x000055555a73c0fc

Next, we'll re-run the query, SELECT * FROM employees WHERE first_name = 'mary' to trigger this breakpoint:

* thread #40, name = 'mysqld', stop reason = breakpoint 6.1
    frame #0: 0x000055555a73c0fc mysqld`row_search_mvcc(buf="\U00000004", mode=PAGE_CUR_GE, prebuilt=0x00007fff1cc43110, match_mode=1, direction=0) at row0sel.cc:5597:3
   5594     }
   5595   }
   5596
-> 5597   if (use_clustered_index) {
   5598   requires_clust_rec:
   5599     ut_ad(index != clust_index);
   5600     /* We use a 'goto' to the preceding label if a consistent

(lldb) p index->name
(id_name_t) $35 = (m_name = "employees_first_id_last")

(lldb) p use_clustered_index
(bool) $11 = true

At this breakpoint, we've seen that this query is reading from the employees_first_id_last secondary index, and we've seen that InnoDB is at a point where it needs to perform a lookup in PRIMARY for the full record it has found. Next, InnoDB will invoke Row_sel_get_clust_rec_for_mysql::operator, which is where we'll place our next breakpoint:

(lldb) b row0sel.cc:3216
Breakpoint 7: where = mysqld`Row_sel_get_clust_rec_for_mysql::operator()(row_prebuilt_t*, dict_index_t*, unsigned char const*, que_thr_t*, unsigned char const**, unsigned long**, mem_block_info_t**, dtuple_t const**, mtr_t*, lob::undo_vers_t*) + 242 at row0sel.cc:3216:29, address = 0x000055555a735708

(lldb) continue

* thread #40, name = 'mysqld', stop reason = breakpoint 7.1
    frame #0: 0x000055555a735708 mysqld`Row_sel_get_clust_rec_for_mysql::operator(this=0x00007fffe87a6440, prebuilt=0x00007fff1cc43110, sec_index=0x00007fff1c129510, rec="mary\x80", thr=0x00007fff1cc43c00, out_rec=0x00007fffe87a6270, offsets=0x00007fffe87a6290, offset_heap=0x00007fffe87a6288, vrow=0x0000000000000000, mtr=0x00007fffe87a6ca0, lob_undo=0x00007fff1cc43338)(row_prebuilt_t*, dict_index_t*, unsigned char const*, que_thr_t*, unsigned char const**, unsigned long**, mem_block_info_t**, dtuple_t const**, mtr_t*, lob::undo_vers_t*) at row0sel.cc:3216:29
   3213   *out_rec = nullptr;
   3214   trx = thr_get_trx(thr);
   3215
-> 3216   row_build_row_ref_in_tuple(prebuilt->clust_ref, rec, sec_index, *offsets,
   3217                              trx);
   3218
   3219   clust_index = sec_index->table->first_index();

On line 3216, InnoDB will dispatch onto row_build_row_ref_in_tuple, which populates the prebuilt->clust_ref member with the primary key. Stepping into this function, we can see how:

(lldb) step
(lldb) next ...

* thread #40, name = 'mysqld', stop reason = step over
    frame #0: 0x000055555a726a1b mysqld`row_build_row_ref_in_tuple(ref=0x00007fff1cc43970, rec="mary\x80", index=0x00007fff1c129510, offsets=0x00007fffe87a6660, trx=0x00007fffea519118) at row0row.cc:829:24
   826    ut_ad(!index->is_clustered());
   827    ut_a(index->table);
   828
-> 829    clust_index = index->table->first_index();
   830    ut_ad(clust_index);
   831
   832    if (!offsets) {

On line 829, a reference to the PRIMARY index is stored into clust_index, which is used on line 840:

* thread #40, name = 'mysqld', stop reason = step over
    frame #0: 0x000055555a726b61 mysqld`row_build_row_ref_in_tuple(ref=0x00007fff1cc43970, rec="maryd-last\x80", index=0x00007fff1c113ba0, offsets=0x00007fffe87a6660, trx=0x00007fffea519118) at row0row.cc:842:3
   839    ut_ad(!rec_offs_any_extern(offsets));
   840    ref_len = dict_index_get_n_unique(clust_index);
   841
-> 842    ut_ad(ref_len == dtuple_get_n_fields(ref));
   843
   844    dict_index_copy_types(ref, clust_index, ref_len);
   845

(lldb) print ref_len
(ulint) $15 = 1

After stepping forward a bit more to line 840, we can see that ref_len is set to n_unique of clust_index; this number, n_unique represents the number of columns used to uniquely identify a record in an index (in this case, in PRIMARY). Because the primary key for this table is a single column, id, we expect ref_len to be 1, which it is.

Stepping ahead a bit more to line 846, a ref_len is used as the upper bounds of a for loop to iterate over the primary key:

(lldb) next ...

* thread #40, name = 'mysqld', stop reason = step over
    frame #0: 0x000055555a726bc5 mysqld`row_build_row_ref_in_tuple(ref=0x00007fff1cc43970, rec="mary\x80", index=0x00007fff1c129510, offsets=0x00007fffe87a6660, trx=0x00007fffea519118) at row0row.cc:846:10
   843
   844    dict_index_copy_types(ref, clust_index, ref_len);
   845
-> 846    for (i = 0; i < ref_len; i++) {
   847      dfield = dtuple_get_nth_field(ref, i);
   848
   849      pos = dict_index_get_nth_field_pos(index, clust_index, i);
   ...
   853      field = rec_get_nth_field(rec, offsets, pos, &len);
   854
   855      dfield_set_data(dfield, field, len);

Of particular interest is line 849, calling dict_index_get_nth_field_pos, which will search in index for the i-th column in clust_index, which we know is id:

(lldb) p clust_index->fields[0].name
(id_name_t) $20 = (m_name = "id")

Because id is the second column in our secondary index, we should expect the return from dict_index_get_nth_field_pos to return 1 (0-based):

(lldb) call dict_index_get_nth_field_pos(index, clust_index, 0)
(ulint) $25 = 1

If we want to be sure, we can dump the fields from both indexes:

(lldb) parray 6 clust_index->fields
(dict_field_t *const) $21 = 0x00007fff1c114bc8 {
  (dict_field_t) [0] = { name = (m_name = "id") }
  (dict_field_t) [1] = { name = (m_name = "DB_TRX_ID") }
  (dict_field_t) [2] = { name = (m_name = "DB_ROLL_PTR") }
  (dict_field_t) [3] = { name = (m_name = "first_name") }
  (dict_field_t) [4] = { name = (m_name = "last_name") }
  (dict_field_t) [5] = { name = (m_name = "date_of_birth")

(lldb) parray 3 index->fields
(dict_field_t *const) $22 = 0x00007fff1c116018 {
  (dict_field_t) [0] = { name = (m_name = "first_name") }
  (dict_field_t) [1] = { name = (m_name = "id") }
  (dict_field_t) [2] = { name = (m_name = "last_name") }
}

In this (modified) dump, we can see that id is, in fact, at index 0 of clust_index and index 1 of index, agreeing with our understanding. Proceeding through to line 855, the code writes the value of id into ref (which, you may recall, is a pointer to prebuilt->clust_ref), which we can further scrutinise:

(lldb) call mach_read_int_type((const unsigned char *) ref->fields[0].data, 8, 0)
(ib_uint64_t) $82 = 4

In this dump, we can see that the primary key read into prebuilt->clust_ref (through ref) is 4 (which is, in fact, our employee's id). With the interesting part of this function is complete, we may step out:

(lldb) finish
* thread #41, name = 'mysqld', stop reason = step out
    frame #0: 0x000055555a735740 mysqld`Row_sel_get_clust_rec_for_mysql::operator(this=0x00007fffe87a6440, prebuilt=0x00007fff1c118fa0, sec_index=0x00007fff1c115cb0, rec="mary\x80", thr=0x00007fff1c119a90, out_rec=0x00007fffe87a6270, offsets=0x00007fffe87a6290, offset_heap=0x00007fffe87a6288, vrow=0x0000000000000000, mtr=0x00007fffe87a6ca0, lob_undo=0x00007fff1c1191c8)(row_prebuilt_t*, dict_index_t*, unsigned char const*, que_thr_t*, unsigned char const**, unsigned long**, mem_block_info_t**, dtuple_t const**, mtr_t*, lob::undo_vers_t*) at row0sel.cc:3219:28
   3216   row_build_row_ref_in_tuple(prebuilt->clust_ref, rec, sec_index, *offsets,
   3217                              trx);
   3218
-> 3219   clust_index = sec_index->table->first_index();
   3220
   3221   btr_pcur_open_with_no_init(clust_index, prebuilt->clust_ref, PAGE_CUR_LE,
   3222                              BTR_SEARCH_LEAF, prebuilt->clust_pcur, 0, mtr);
   3223
   3224   clust_rec = btr_pcur_get_rec(prebuilt->clust_pcur);

From here, lines 3219 through 3224 are responsible for performing the actual read on PRIMARY (using prebuilt->clust_ref), ultimately returning the associated record in PRIMARY.

Closing thoughts

In this article, we used a debugger to inspect the internals of MySQL/InnoDB. In doing this, we learned that a secondary index of an InnoDB table may contain the primary key (or parts of a primary key) in an arbitrary position of the secondary index. In such a case, InnoDB is capable of re-deriving the primary key for a subsequent look-up for the corresponding record in PRIMARY without needing to append the full primary key to the end of the index. I find it interesting is that this approach is O(n), suggesting that an equivalent O(1) implementation of finding the primary key at the end of the index was not worth sacrificing the flexibility of this capability.

Further, this implementation will change the way I propose solutions for certain data access problems. In the past, I avoided having the primary key in arbitrary parts of the secondary index because I assumed, and been informed, that it would contribute to data bloat. While the need/want to have a unique field in an arbitrary position of a secondary index is somewhat uncommon, it is a pattern I have encountered in the past. I hope that the information in this article is similarly useful to you.