且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

用于数据库索引的排序字符串表 (SSTable) 或 B+ 树?

更新时间:2022-06-22 20:50:08

在比较 BTree 索引和 SSTable 索引时,应该考虑写入复杂度:

When comparing a BTree index to an SSTable index, you should consider the write complexity:

  • 当随机写入写入时复制的 BTree 时,您将产生随机读取(以进行叶节点和路径的复制).因此,虽然写入在磁盘上是顺序的,但对于大于 RAM 的数据集,这些随机读取将很快成为瓶颈.对于类似 SSTable 的索引,写入时不会发生此类读取 - 只有顺序写入.

  • When writing randomly to a copy-on-write BTree, you will incur random reads (to do the copy of the leaf node and path). So while the writes my be sequential on disk, for datasets larger than RAM, these random reads will quickly become the bottle neck. For a SSTable-like index, no such read occurs on write - there will only be the sequential writes.

您还应该考虑在最坏的情况下,对 BTree 的每次更新都可能导致 log_b N 次 IO - 也就是说,您最终可能会为每个键写入 3 或 4 个块.如果密钥大小远小于块大小,这是非常昂贵的.对于类似SSTable的索引,每次写IO都会包含尽可能多的新键,所以每个键的IO成本更像是1/B.

You should also consider that in the worse case, every update to a BTree could incur log_b N IOs - that is, you could end up writing 3 or 4 blocks for every key. If key size is much less than block size, this is extremely expensive. For an SSTable-like index, each write IO will contain as many fresh keys as it can, so the IO cost for each key is more like 1/B.

在实践中,这使得类似 SSTable 的速度比 BTrees 快数千倍(对于随机写入).

In practice, this make SSTable-like thousands of times faster (for random writes) than BTrees.

在考虑实现细节时,我们发现实现类似 SSTable 的索引(几乎)无锁要容易得多,而 BTree 的锁策略变得相当复杂.

When considering implementation details, we have found it a lot easier to implement SSTable-like indexes (almost) lock-free, where as locking strategies for BTrees has become quite complicated.

您还应该重新考虑您的阅读成本.您是正确的,因为 BTree 是 O(log_b N) 随机点读取的随机 IO,但类似 ​​SSTable 的索引实际上是 O(#sstables . log_b N).如果没有合适的合并方案,#sstables 与 N 成正比.有各种技巧可以解决这个问题(例如,使用布隆过滤器),但这些对小型随机范围查询没有帮助.这是我们在 Cassandra 中发现的:

You should also re-consider your read costs. You are correct than a BTree is O(log_b N) random IOs for random point reads, but a SSTable-like index is actually O(#sstables . log_b N). Without an decent merge scheme, #sstables is proportional to N. There are various tricks to get round this (using Bloom Filters, for instance), but these don't help with small, random range queries. This is what we found with Cassandra:

重写下的Cassandra加载

这就是为什么我们的 (GPL) 存储引擎 Castle 的合并方式略有不同,并且可以在写入性能(O(log^2 N/B)).在实践中,我们发现它也比 Cassandra 的 SSTable 写入索引更快.

This is why Castle, our (GPL) storage engine, does merges slightly differently, and can achieve a lot better (O(log^2 N)) range queries performance with a slight trade off in write performance (O(log^2 N / B)). In practice we find it to be quicker than Cassandra's SSTable index for writes as well.

如果你想了解更多,我已经讨论了它是如何工作的:

If you want to know more about this, I've given a talk about how it works: