Improving MySQL index efficiency - columns in multiple indexes?

Trying to use indexes more efficiently on massive data.

I have an open source application that logs millions of records to a MySQL database. I've used MySQL databases for years in web development, and I understand enough about choosing efficient field types and the basics of why/how indexes are useful, but the sheer volume of data our application logs, combined with the fact that it's hard to predict exactly which columns will be queried, has me a bit under water.

The application logs events by players. We have a very advanced purge system, but some servers are so busy that they have 50 million records after just eight weeks.

At that size, even with our existing indexes, queries may still take 30-90 seconds.

The primary table schema (minus existing indexes):

CREATE TABLE IF NOT EXISTS `prism_data` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `epoch` int(10) unsigned NOT NULL,
  `action_id` int(10) unsigned NOT NULL,
  `player_id` int(10) unsigned NOT NULL,
  `world_id` int(10) unsigned NOT NULL,
  `x` int(11) NOT NULL,
  `y` int(11) NOT NULL,
  `z` int(11) NOT NULL,
  `block_id` mediumint(5) DEFAULT NULL,
  `block_subid` mediumint(5) DEFAULT NULL,
  `old_block_id` mediumint(5) DEFAULT NULL,
  `old_block_subid` mediumint(5) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;

WHERE conditions most often include:

  • world_id/x/y/z coordinates (queries all default to a radius around the user, so coordinates are almost always used)
  • epoch (all queries default to the last three days; users need to override this for longer timeframes)
  • action_id and/or player_id (half the time, users are looking for who performed specific actions, or what actions a specific player caused)
  • Remaining queries could be any combination: block_id values combined with player or action, etc. They are fairly random (a combined example is sketched below).
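
For illustration, a default query combining these conditions might look like the following; the world_id, coordinate, and action values here are hypothetical:

SELECT *
  FROM prism_data
 WHERE world_id = 2                                       -- the user's world
   AND (x BETWEEN 668 AND 868)                            -- radius around the user
   AND (y BETWEEN -33 AND 167)
   AND (z BETWEEN 358 AND 558)
   AND epoch >= UNIX_TIMESTAMP(NOW() - INTERVAL 3 DAY)    -- default: last three days
   AND action_id = 5;                                     -- optional: a specific action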

GROUP BY - By default the application groups by certain fields so that the user doesn't see 100 duplicate events for the same player/action/block; instead they see a single record with a count.

action_id, player_id, block_id, DATE(FROM_UNIXTIME(epoch))

ORDER BY is always prism_data.epoch DESC, x ASC, z ASC, y ASC, id DESC. The epoch is so that the user sees the most recent events first. The rest are so that a "rollback" engine gets things in the right order.
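
Put together, the default grouping and ordering would look roughly like this (WHERE conditions elided; note that ordering by non-grouped columns such as x/y/z relies on MySQL's lenient pre-ONLY_FULL_GROUP_BY behavior):

SELECT action_id, player_id, block_id,
       DATE(FROM_UNIXTIME(epoch)) AS day,
       COUNT(*) AS event_count
  FROM prism_data
 WHERE ...  -- location/epoch/action conditions as described above
 GROUP BY action_id, player_id, block_id, DATE(FROM_UNIXTIME(epoch))
 ORDER BY epoch DESC, x ASC, z ASC, y ASC, id DESC;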

Here is an example query without order/group:

SELECT *
FROM prism_data 
INNER JOIN prism_players p ON p.player_id = prism_data.player_id
INNER JOIN prism_actions a ON a.action_id = prism_data.action_id
INNER JOIN prism_worlds w ON w.world_id = prism_data.world_id 
LEFT JOIN prism_data_extra ex ON ex.data_id = prism_data.id 
WHERE w.world = 'DeuxTiersMondes'
AND (prism_data.x BETWEEN 668 AND 868)
AND (prism_data.y BETWEEN -33 AND 167)
AND (prism_data.z BETWEEN 358 AND 558)
LIMIT 1000;

Using an index, INDEX location (world_id, x, z, y), it still takes 15 seconds to find 1000 rows (or 50 seconds to find all 64735).

The explain for that query:

+----+-------------+------------+--------+---------------+----------+---------+--------------------------------+------+--------------------------+
| id | select_type | table      | type   | possible_keys | key      | key_len | ref                            | rows | Extra                    |
+----+-------------+------------+--------+---------------+----------+---------+--------------------------------+------+--------------------------+
|  1 | SIMPLE      | w          | ref    | PRIMARY,world | world    | 767     | const                          |    1 | Using where; Using index |
|  1 | SIMPLE      | prism_data | ref    | location      | location | 4       | minecraft.w.world_id           | 6155 | Using index condition    |
|  1 | SIMPLE      | a          | eq_ref | PRIMARY       | PRIMARY  | 4       | minecraft.prism_data.action_id |    1 | NULL                     |
|  1 | SIMPLE      | p          | eq_ref | PRIMARY       | PRIMARY  | 4       | minecraft.prism_data.player_id |    1 | NULL                     |
|  1 | SIMPLE      | ex         | ref    | data_id       | data_id  | 4       | minecraft.prism_data.id        |    1 | NULL                     |
+----+-------------+------------+--------+---------------+----------+---------+--------------------------------+------+--------------------------+

It just seems to me that looking for values this specific ought to be much faster. We're not even sorting/grouping in this query.

My questions:

I assume it makes the most sense to design an index for each of the common conditions I listed above, i.e. one index that combines world_id/x/y/z, one that combines action_id/player_id, and one for epoch. For certain queries this works well, but for others it doesn't. For a query that used world_id, player_id, and epoch, it only chose the world_id/x/y/z index.

  • Can I/should I include a column in multiple indices? Maybe one index for the full location, and one for world_id/player_id/epoch (both sketched after this list)? I can't really tell what logic MySQL uses to choose which index fits best, but I assume that if an index covers more of the columns MySQL needs, it'll choose that one. A slight perf hit on write is worth it if that will help my queries.
  • Should I create an index which contains all of the fields I group by / order by? My EXPLAINs often show Using filesort, which I know is a major pain point for performance.
  • Is there any benefit to using individual indexes on most fields even if they're in a combined index?
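
For reference, the two candidate indexes from the first bullet expressed as DDL; the index names are illustrative, and the existing location index is left untouched:

ALTER TABLE prism_data
  ADD INDEX idx_full_location (world_id, x, y, z),
  ADD INDEX idx_world_player_time (world_id, player_id, epoch);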

Sorry for the long read.

I'm doing a lot of profiling for 5 of our most common queries with different index setups, but I have a feeling that I may be missing some basics. I'd rather have some true experts school me in something I'm missing before I continue.

MySQL (and other RDBMSs) makes good use of covering indexes. So if, to use your example, you're looking up:

SELECT prism_data.id,
       prism_data.action_id,
       prism_data.world_id
  FROM prism_data 
 INNER JOIN prism_worlds w ON w.world_id = prism_data.world_id 
 WHERE w.world = 'DeuxTiersMondes'
   AND (prism_data.x BETWEEN 668 AND 868)
   AND (prism_data.y BETWEEN -33 AND 167)
   AND (prism_data.z BETWEEN 358 AND 558)
 ORDER BY prism_data.id DESC
 LIMIT 1000;

The following BTREE index on prism_data will probably help a bunch with query performance (almost all indexes are BTREE indexes):

 (world_id, x, y, z, id, action_id)

The whole of this query on prism_data can be satisfied just from the index. It's called a covering index because the server can find everything it needs to satisfy -- to cover -- the query in the index, and so doesn't have to bounce over to the data table itself. It'll do an index equality lookup on world_id, then a range scan on x, and then look at the y and z values to match the rest of your query. It will then pull out the id values, order them, and return the LIMIT 1000 partial result set.
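
As DDL, the suggested covering index might be created like this (the index name covering_idx is just a placeholder):

ALTER TABLE prism_data
  ADD INDEX covering_idx (world_id, x, y, z, id, action_id);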

You should absolutely stop using SELECT *. When you say SELECT * you deny MySQL any knowledge of what columns of data you actually need, so you defeat the optimizer's logic that chooses covering index queries over raw table queries.
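
One way to check that the optimizer is actually covering the query: with explicit columns, EXPLAIN should report "Using index" in the Extra column for prism_data. A sketch, with the world join replaced by a hypothetical literal world_id:

EXPLAIN
SELECT prism_data.id, prism_data.action_id, prism_data.world_id
  FROM prism_data
 WHERE world_id = 2
   AND (x BETWEEN 668 AND 868)
   AND (y BETWEEN -33 AND 167)
   AND (z BETWEEN 358 AND 558);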

If your data are fairly evenly distributed over x and y, and you can use MyISAM, you may want to look into using geospatial indexes. These do a better job of random-accessing x/y ranges than ordinary indexes.
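
A minimal sketch of that approach, assuming a separate MyISAM table (here called prism_data_geo) with a POINT column derived from (x, z); all names are illustrative:

-- Add a nullable geometry column, backfill it, then make it NOT NULL
-- (SPATIAL indexes require NOT NULL geometry columns):
ALTER TABLE prism_data_geo ADD COLUMN pt POINT NULL;

UPDATE prism_data_geo
   SET pt = GeomFromText(CONCAT('POINT(', x, ' ', z, ')'));

ALTER TABLE prism_data_geo
  MODIFY pt POINT NOT NULL,
  ADD SPATIAL INDEX sp_pt (pt);

-- A bounding-box lookup replacing the two coordinate BETWEENs:
SELECT id
  FROM prism_data_geo
 WHERE MBRContains(
         GeomFromText('POLYGON((668 358, 868 358, 868 558, 668 558, 668 358))'),
         pt);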

Elaborate index setups do slow down insertion and update; it's definitely a tradeoff.