Fun MySQL fact of the day: loose scan semi-join
Yesterday, we considered MySQL's FirstMatch
semi-join optimisation, and while it's pretty simple and fairly straight-forward, some queries may just not benefit from it.
Let's go back to our on-going example relation of employee->company
and imagine a company
table with tens of millions of companies. Let's also imagine that each company may hire many tens to many tens-of-thousands of employees, of which likely less than, say, 10% have the same name. Now, let's further imagine we have need to find all companies that employ at least 1 individual named "Jane". Because we want to have a unique list of companies, one possible way of writing this query is as follows:
EXPLAIN SELECT id, name
FROM company
WHERE id IN (
SELECT company_id
FROM employee
WHERE full_name = 'JANE'
);
For sanity's sake, let's say the employee
table has a secondary index, (full_name,company_id)
. Now, without any type of semi-join optimisation, we have a pretty good idea that MySQL will do a full table scan on company
and then look for any employee named "Jane" in the employee
table for each row in company
. Naturally, this will be slow, and even if we go back to the FirstMatch
optimisation we discussed yesterday, MySQL will still need to perform a full table scan on company
. This even holds true for table pullouts and materialization. There must be a better way! And there is: the LooseScan
optimisation, which will invert the query, first looking for all employees named "Jane" and then, for each match, join the company
table by primary key using company_id
. Once MySQL finds a match, it will skip to the next "Jane" at the next company_id
, skipping over all other "Jane"s at the current company.
Now, if you've used MySQL for a while, you may have heard the term "Loose Index Scan" before. It isn't something new to MySQL, but historically, it had been used only when performing a GROUP BY
. While these two optimisations do both use a loose scan over a grouping of keys, when you're using a subquery, you'll know MySQL has chosen a LoseScan
semi-join optimisation when the Extra
column of your EXPLAIN
shows LoseScan
. This is not true for a GROUP BY
, but that's a fun fact for another day.