且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

为什么在考虑可扩展性时联接不好?

更新时间:2023-12-04 16:13:28

可扩展性是关于预计算(缓存)、分散或削减重复工作到基本要素,以最大限度地减少每个工作的资源使用单元.要想实现良好的规模化,您不需要大量做任何不需要的事情,而是确保您实际做的事情尽可能高效地完成.

Scalability is all about pre-computing (caching), spreading out, or paring down the repeated work to the bare essentials, in order to minimize resource use per work unit. To scale well, you don't do anything you don't need to in volume, and the things you actually do you make sure are done as efficiently as possible.

在这种情况下,当然,加入两个独立的数据源相对较慢,至少与不加入它们相比,因为这是您需要在用户请求时实时完成的工作.

In that context, of course joining two separate data sources is relatively slow, at least compared to not joining them, because it's work you need to do live at the point where the user requests it.

但请记住,替代方案根本不再有两个单独的数据;您必须将两个不同的数据点放在同一个记录中.您不能在某处合并两个不同的数据而没有结果,因此请确保您了解权衡.

But remember the alternative is no longer having two separate pieces of data at all; you have to put the two disparate data points in the same record. You can't combine two different pieces of data without a consequence somewhere, so make sure you understand the trade-off.

好消息是现代关系数据库擅长连接.如果使用良好的数据库,您不应该真的认为连接很慢.有许多可扩展性友好的方法来获取原始连接并使它们大大:

The good news is modern relational databases are good at joins. You shouldn't really think of joins as slow with a good database used well. There are a number of scalability-friendly ways to take raw joins and make them much faster:

  • 加入代理键(自动编号/身份列)而不是自然键.这意味着在连接操作期间进行更小(因此更快)的比较
  • 索引
  • 物化/索引视图(将此视为预先计算的连接或托管非规范化)
  • 计算列.您可以使用它来散列或以其他方式预先计算连接的键列,这样对于连接的复杂比较现在要小得多,并且可能已预先索引.
  • 表分区(通过将负载分散到多个磁盘来帮助处理大型数据集,或将可能是表扫描的内容限制为分区扫描)
  • OLAP(预先计算某些类型的查询/连接的结果.这不太正确,但您可以将其视为通用非规范化)
  • 复制、可用性组、日志传送或其他机制,让多台服务器回答对同一数据库的读取查询,从而在多台服务器之间扩展您的工作负载.
  • 使用像 Redis 这样的缓存层来避免重新运行需要复杂连接的查询.

我想说关系数据库存在的主要原因是让您可以高效地进行连接*.这当然不仅仅是存储结构化数据(您可以使用像 csv 或 xml 这样的平面文件结构来做到这一点).我列出的一些选项甚至可以让您提前完全构建您的连接,因此在您发出查询之前已经完成了结果 - 就像您对数据进行了非规范化一样(无可否认,以较慢的写入操作为代价).

I would go as far as saying the main reason relational databases exist at all is to allow you do joins efficiently*. It's certainly not just to store structured data (you could do that with flat file constructs like csv or xml). A few of the options I listed will even let you completely build your join in advance, so the results are already done before you issue the query — just as if you had denormalized the data (admittedly at the cost of slower write operations).

如果您的联接速度较慢,则您可能没有正确使用数据库.

只有在这些其他技术失败后才应进行反规范化.而你真正判断失败"的唯一方法是是设定有意义的绩效目标并根据这些目标进行衡量.如果你还没有测量过,现在考虑去规范化还为时过早.

De-normalization should be done only after these other techniques have failed. And the only way you can truly judge "failure" is to set meaningful performance goals and measure against those goals. If you haven't measured, it's too soon to even think about de-normalization.

* 也就是说,作为不同于单纯的表格集合的实体存在.真正的 rdbms 的另一个原因是安全的并发访问.