Cardinality & Selectivity

Warning: This blogpost has been posted over two years ago. That is a long time in development-world! The story here may not be relevant, complete or secure. Code might not be complete or obsoleted, and even my current vision might have (completely) changed on the subject. So please do read further, but use it with caution.
Posted on 07 Feb 2010
Tagged with: [ MySQL

Cardinality and selectivity are two keywords that are very important when dealing with optimization in MySQL queries and indexes. This article will go a bit in depth on both terms and tries to let you understand their usefulness…

Cardinality

CREATE TABLE `users` (
  `user_id` INT UNSIGNED NOT NULL ,
  `first_name` VARCHAR( 25 ) NOT NULL ,
  `last_name` VARCHAR( 50 ) NOT NULL ,
  `city` VARCHAR( 50 ) NOT NULL ,
  `country` VARCHAR( 50 ) NOT NULL ,
  `mailing` TINYINT( 1 ) NOT NULL DEFAULT '1',
  PRIMARY KEY ( `user_id` )
) ENGINE = MYISAM ;

Selectivity

SELECT * FROM users WHERE country="Netherlands";

You have a cardinality of 10 (there are 10 unique countries in your userbase) so the selectivity will be:

selectivity = 10 / 10000 * 100% = 0.1%

which is very low (i’ve seen much, much, much lower values though).
So, there we are… MySQL has calculated the selectivity of an index he’s planning to use.  At this point, it will decide whether or not this index is useful. With such a low selectivity, it’s GUESSING that there are so many duplicates records that it will probably be faster for him to skip browsing the index.  This is part of the cost-based optimizer. When it decides it’s too costly to browse an index (because of the low selectivity), it will never use this index.

Your application is ready and you are going to test it against real life data. You fill your userbase with around 1000 users by a simple script that will INSERT random users. Since your script is very sophisticated, it will randomly assign users to all the different countries just to make sure you cover all your bases.

However, after deployment of your application, it will only attract dutch visitors. Maybe 1 or 2 belgian users but that’s it. Your cardinality will be 2, and your userbase is already on 1000 users. Now your selectivity will be 2 / 1000 = 0.2%.

Catches:

I do not know exactly the actual percentage below which MySQL will not use the index (and subsequently does a full table search instead). Most people say it’s around 30% and this number seems pretty ok to use.

Because of low cardinality, don’t create indices on boolean-columns (“sex”, “active”, “deleted”, “mailinglist” etc), things like “country” (on a large userbase) etc. Even user age is a very bad candidate, since your users will be between 10 and 80, so it’s not a very large unique set.