且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

关于PostgreSQL中的多字段索引之三(gin篇)

更新时间:2022-06-07 01:59:50

前两篇讲的btree和gist的多字段索引,本篇顺理成章地讲一下gin的多字段索引。
前两篇请参考这里:
http://blog.chinaunix.net/uid-20726500-id-5088527.html
http://blog.chinaunix.net/uid-20726500-id-5090166.html

1. gin多字段索引的特征

不像gist和btree,gin天生就适合做多字段索引,不管查询条件覆盖所有索引字段还是仅仅覆盖一个子集,它都可以胜任。
http://www.postgres.cn/docs/9.3/indexes-multicolumn.html
---------------------------------------------------------------
一个多字段的 GIN 索引可以用于那些查询条件包含索引字段子集的查询中。 不像B-tree 或 GiST,除了查询条件使用的索引字段外,索引的搜索效率是相同的。
---------------------------------------------------------------

2. gin多字段索引的原理

gin多字段索引的原理是,索引中的每个key由字段编号加上元素值构成,这样的key作为一个整体构成倒排索引树。

参考:src/backend/access/gin/README
---------------------------------------------------------------
...
* In a single-column index, a key tuple just contains the key datum, but
in a multi-column index, a key tuple contains the pair (column number,
key datum) where the column number is stored as an int2.  This is needed
to support different key data types in different columns.  This much of
the tuple is built by index_form_tuple according to the usual rules.
The column number (if present) can never be null, but the key datum can
be, in which case a null bitmap is present as usual.  (As usual for index
tuples, the size of the null bitmap is fixed at INDEX_MAX_KEYS.)
...
---------------------------------------------------------------

3. 性能对比测试

为了和前面的btree和gist有个可比性,测试环境和数据还采用上一篇的标量数据,而不是采用gin特有的集合数据,所以要使用btree-gin扩展。
测试环境详见
http://blog.chinaunix.net/uid-20726500-id-5088527.html

3.1 创建c1+c2多字段gin索引

点击(此处)折叠或打开

  1. postgres=# create extension btree_gin;
  2. CREATE EXTENSION
  3. postgres=# \timing
  4. Timing is on.
  5. postgres=# create index tb1_idx5 on tb1 using gin(c1,c2);
  6. CREATE INDEX
  7. Time: 23119.722 ms
  8. postgres=# select pg_size_pretty(pg_relation_size('tb1_idx5'));
  9.  pg_size_pretty
  10. ----------------
  11.  129 MB
  12. (1 row)


3.2 c1+c2多字段gin索引处理多字段查询

点击(此处)折叠或打开

  1. postgres=# explain (analyze,buffers) select count(*) from tb1 where c1=99 and c2=999;
  2.                                                          QUERY PLAN
  3. ----------------------------------------------------------------------------------------------------------------------------
  4.  Aggregate (cost=404.70..404.71 rows=1 width=0) (actual time=15.797..15.798 rows=1 loops=1)
  5.    Buffers: shared hit=179 read=2
  6.    -> Bitmap Heap Scan on tb1 (cost=21.01..404.45 rows=99 width=0) (actual time=14.545..15.769 rows=92 loops=1)
  7.          Recheck Cond: ((c1 = 99) AND (c2 = 999))
  8.          Buffers: shared hit=179 read=2
  9.          -> Bitmap Index Scan on tb1_idx5 (cost=0.00..20.99 rows=99 width=0) (actual time=14.144..14.144 rows=92 loops=1)
  10.                Index Cond: ((c1 = 99) AND (c2 = 999))
  11.                Buffers: shared hit=87 read=2
  12.  Total runtime: 16.199 ms
  13. (9 rows)

这个速度没有btree和gist的多字段索引快,因为它的原理和BitmapAnd组合索引相似,要扫描2次索引再取交集,但比组合索引的50ms快。

3.3 c1+c2多字段gist索引处理单字段查询

点击(此处)折叠或打开

  1. postgres=# explain (analyze,buffers) select count(*) from tb1 where c1=99;
  2.                                                              QUERY PLAN
  3. -------------------------------------------------------------------------------------------------------------------------------------
  4.  Aggregate (cost=46880.08..46880.09 rows=1 width=0) (actual time=1569.289..1569.290 rows=1 loops=1)
  5.    Buffers: shared hit=142 read=39638
  6.    -> Bitmap Heap Scan on tb1 (cost=1097.08..46624.25 rows=102333 width=0) (actual time=37.484..1548.405 rows=99886 loops=1)
  7.          Recheck Cond: (c1 = 99)
  8.          Rows Removed by Index Recheck: 7363157
  9.          Buffers: shared hit=142 read=39638
  10.          -> Bitmap Index Scan on tb1_idx5 (cost=0.00..1071.50 rows=102333 width=0) (actual time=35.034..35.034 rows=99886 loops=1)
  11.                Index Cond: (c1 = 99)
  12.                Buffers: shared hit=78
  13.  Total runtime: 1569.548 ms
  14. (10 rows)


  15. postgres=# explain (analyze,buffers) select count(*) from tb1 where c2=999;
  16.                                                           QUERY PLAN
  17. -------------------------------------------------------------------------------------------------------------------------------
  18.  Aggregate (cost=23488.40..23488.41 rows=1 width=0) (actual time=79.362..79.362 rows=1 loops=1)
  19.    Buffers: shared hit=1974 read=6910
  20.    -> Bitmap Heap Scan on tb1 (cost=110.81..23464.27 rows=9652 width=0) (actual time=7.228..77.332 rows=9892 loops=1)
  21.          Recheck Cond: (c2 = 999)
  22.          Buffers: shared hit=1974 read=6910
  23.          -> Bitmap Index Scan on tb1_idx5 (cost=0.00..108.39 rows=9652 width=0) (actual time=4.468..4.468 rows=9892 loops=1)
  24.                Index Cond: (c2 = 999)
  25.                Buffers: shared hit=1 read=11
  26.  Total runtime: 79.408 ms
  27. (9 rows)
查询速度和gist的多字段索引差不多。

4. PostgreSQL9.4的测试结果 

PostgreSQL 9.4对gin的性能做了很大提升,下面再在9.4上重复一下上面的测试。

4.1 数据准备

点击(此处)折叠或打开

  1. chenhj=# create table tb1(c1 int,c2 int);
  2. CREATE TABLE
  3. chenhj=# insert into tb1 select round(random()*100),round(random()*1000) from generate_series(1,10000000);
  4. INSERT 0 10000000
  5. chenhj=# select pg_size_pretty(pg_table_size('tb1'));
  6.  pg_size_pretty
  7. ----------------
  8.  346 MB
  9. (1 row)
  10. chenhj=# create extension btree_gin;
  11. CREATE EXTENSION
  12. chenhj=# create index tb1_idx5 on tb1 using gin(c1,c2);
  13. CREATE INDEX
  14. chenhj=# select pg_size_pretty(pg_relation_size('tb1_idx5'));
  15.  pg_size_pretty
  16. ----------------
  17.  47 MB
  18. (1 row)
9.4里gin索引的大小比9.3小了很多,只有9.3的三分之一。

4.2 性能测试


查询条件包含c1+c2

点击(此处)折叠或打开

  1. chenhj=# explain (analyze,buffers) select count(*) from tb1 where c1=99 and c2=999;
  2.                                                          QUERY PLAN
  3. ----------------------------------------------------------------------------------------------------------------------------
  4.  Aggregate (cost=423.95..423.96 rows=1 width=0) (actual time=3.399..3.399 rows=1 loops=1)
  5.    Buffers: shared hit=166
  6.    -> Bitmap Heap Scan on tb1 (cost=25.05..423.69 rows=103 width=0) (actual time=3.151..3.375 rows=117 loops=1)
  7.          Recheck Cond: ((c1 = 99) AND (c2 = 999))
  8.          Heap Blocks: exact=117
  9.          Buffers: shared hit=166
  10.          -> Bitmap Index Scan on tb1_idx5 (cost=0.00..25.03 rows=103 width=0) (actual time=3.125..3.125 rows=117 loops=1)
  11.                Index Cond: ((c1 = 99) AND (c2 = 999))
  12.                Buffers: shared hit=49
  13.  Planning time: 0.099 ms
  14.  Execution time: 3.448 ms
  15. (11 rows)
比9.3快了很多,但仍然没有btree和gist的多字段索引快,不过差别也不是太大。

查询条件包含c1

点击(此处)折叠或打开

  1. chenhj=# explain (analyze,buffers) select count(*) from tb1 where c1=99;
  2.                                                              QUERY PLAN
  3. ------------------------------------------------------------------------------------------------------------------------------------
  4.  Aggregate (cost=46819.50..46819.51 rows=1 width=0) (actual time=590.436..590.437 rows=1 loops=1)
  5.    Buffers: shared hit=105 read=39583 written=700
  6.    -> Bitmap Heap Scan on tb1 (cost=981.50..46554.50 rows=106000 width=0) (actual time=49.895..570.300 rows=99790 loops=1)
  7.          Recheck Cond: (c1 = 99)
  8.          Heap Blocks: exact=39665
  9.          Buffers: shared hit=105 read=39583 written=700
  10.          -> Bitmap Index Scan on tb1_idx5 (cost=0.00..955.00 rows=106000 width=0) (actual time=31.225..31.225 rows=99790 loops=1)
  11.                Index Cond: (c1 = 99)
  12.                Buffers: shared hit=23
  13.  Planning time: 0.112 ms
  14.  Execution time: 590.496 ms
  15. (11 rows)
速度是9.3的2倍多,也比gist和btree都快。

查询条件包含c2

点击(此处)折叠或打开

  1. chenhj=# explain (analyze,buffers) select count(*) from tb1 where c2=999;
  2.                                                           QUERY PLAN
  3. -------------------------------------------------------------------------------------------------------------------------------
  4.  Aggregate (cost=23539.39..23539.40 rows=1 width=0) (actual time=92.894..92.894 rows=1 loops=1)
  5.    Buffers: shared hit=1950 read=7130 written=3
  6.    -> Bitmap Heap Scan on tb1 (cost=95.12..23515.16 rows=9692 width=0) (actual time=5.672..90.431 rows=10178 loops=1)
  7.          Recheck Cond: (c2 = 999)
  8.          Heap Blocks: exact=9073
  9.          Buffers: shared hit=1950 read=7130 written=3
  10.          -> Bitmap Index Scan on tb1_idx5 (cost=0.00..92.69 rows=9692 width=0) (actual time=2.671..2.671 rows=10178 loops=1)
  11.                Index Cond: (c2 = 999)
  12.                Buffers: shared hit=6 read=1
  13.  Planning time: 0.187 ms
  14.  Execution time: 94.261 ms
  15. (11 rows)
和9.3差不多,和gist也差不多。

4.总结

1)当查询条件中同时出现多字段索引的所有条件时,多字段索引的速度优于多个字段单独建索引然后通过BitmapAnd组合起来。btree和gist的多字段索引又快于gin多字段索引。
2)多字段索引用在只包含全部索引字段的子集时,除了btree在查询条件中多字段的第一个字段缺席时效率很差外,其它都表现良好。
3)9.4的gin索引的大小比btree和gist小了太多。(但gin有别的问题,在做比较查询时有时效率不高,后面准备再通过例子讲解
4)9.4的gin索引比9.3的gin优化了太多了

最后引用一下手册中关于使用多字段索引还是多索引组合的说明。加黑的那个限制似乎只对btree的多字段索引有效。
-----------------------------------------------------------------------------------
在大多数最简单的应用里,可能有多种索引组合都是有用的,数据库开发人员必须在使用哪个索引之间作出平衡。 有时候多字段索引是***的,有时候创建一个独立索引并依靠索引组合是***的。比如, 假如你的查询有时候只涉及字段x,有时候只涉及字段y,有时候两个字段都涉及, 那么你可能会选择在xy上创建两个独立的索引, 然后依靠索引组合来处理同时使用两个字段的查询。你也可以在(x, y)上创建一个多字段索引, 它在同时使用两个字段的查询通常比索引组合更高效,但是,正如我们在Section 11.3 里面讨论的,它对那些只包含y的查询几乎没有用,因此它不能是唯一一个索引。 一个多字段索引和y上的独立索引可能会更好。因为对那些只涉及x的查询, 可以使用多字段索引,但是它会更大,因此也比只在x上的索引更慢。最后一个选择是创建三个索引, 但是这种方法只有在表的更新远比查询少得多,并且所有三种查询都很普遍的情况下才是合理的。 如果其中一种查询比其它的少很多,那么你可能更愿意仅仅创建两种匹配更常见查询的索引。
-----------------------------------------------------------------------------------