且构网

WHERE and ORDER BY clauses in Cassandra CQL

Updated: 2023-08-27 11:41:58

There are two parts to a PRIMARY KEY in Cassandra:

  • partition key
  • clustering key

PRIMARY KEY (partitionKey1,clusteringKey1,clusteringKey2)

PRIMARY KEY ((partitionKey1,partitionKey2),clusteringKey1,clusteringKey2)

The partition key determines which node(s) your data is stored on. The clustering key determines the order of the data within each partition.

In CQL, the ORDER BY clause is really only used to reverse the defined sort direction of your clustering order. As for the columns themselves, you can only specify the columns defined in your CLUSTERING ORDER BY clause at table creation time (and in that exact order, no skipping). So you cannot pick arbitrary columns to order your result set at query time.
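
As a rough illustration of that restriction, here is a minimal sketch. The table events_by_user and its columns are hypothetical names invented for this example; they do not come from the original question.

```cql
-- Hypothetical table: one partition per user, rows stored newest first.
CREATE TABLE events_by_user (
  user_id text,
  event_time timestamp,
  event_type text,
  PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Returns rows in the stored order (newest first); no ORDER BY needed.
SELECT * FROM events_by_user WHERE user_id = 'u1';

-- Reverses the stored order (oldest first).
SELECT * FROM events_by_user WHERE user_id = 'u1' ORDER BY event_time ASC;

-- Rejected: event_type is not a clustering column, so it cannot be sorted on.
-- SELECT * FROM events_by_user WHERE user_id = 'u1' ORDER BY event_type ASC;
```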

Cassandra achieves performance by using the clustering keys to sort your data on disk, so that ordered rows can be returned in a single sequential read (no random reads). This is why you must take a query-based modeling approach with Cassandra (often duplicating your data into multiple query tables). Know your queries ahead of time, and build your tables to serve them.

Select * from emp order by empno;

First of all, you need a WHERE clause. It's OK to query without one if you're working with a relational database. With Cassandra, you should do your best to avoid unbound SELECT queries. Besides, Cassandra can only enforce a sort order within a partition, so querying without a WHERE clause won't return data in the order you want anyway.

Secondly, as I mentioned above, you need to define clustering keys. If you want to order your result set by empno, then you must find another column to define as your partition key. Try something like this:

CREATE TABLE emp_by_dept (
  empno text,
  dept text,
  name text,
  PRIMARY KEY (dept,empno)
) WITH CLUSTERING ORDER BY (empno ASC);

Now, I can query employees by department, and they will be returned to me ordered by empno:

SELECT * FROM emp_by_dept WHERE dept='IT';
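
To make the ordering concrete, here is a small sketch against the emp_by_dept table above. The sample rows are made up for illustration. Note that empno is declared as text, so it sorts as a string rather than as a number.

```cql
-- Illustrative rows only; the names and numbers are invented.
INSERT INTO emp_by_dept (dept, empno, name) VALUES ('IT', '1002', 'Bob');
INSERT INTO emp_by_dept (dept, empno, name) VALUES ('IT', '1001', 'Ann');
INSERT INTO emp_by_dept (dept, empno, name) VALUES ('IT', '999', 'Cid');

-- Stored (ascending) order. Because empno is text, the comparison is
-- lexical: '1001', '1002', '999'.
SELECT * FROM emp_by_dept WHERE dept = 'IT';

-- Same partition, reversed direction: '999', '1002', '1001'.
SELECT * FROM emp_by_dept WHERE dept = 'IT' ORDER BY empno DESC;
```

If numeric ordering matters, declaring empno as int instead of text would be one way to get it.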

But to be clear, you will not be able to query every row in your table and have it ordered by a single column. The only way to get meaningful order into your result sets is to first partition your data in a way that makes sense for your business case. Running an unbound SELECT will return all of your rows (assuming that the query doesn't time out while trying to query every node in your cluster), but result set ordering can only be enforced within a partition. So you have to restrict by partition key in order for that to make any sense.
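
A short sketch of the difference, again using emp_by_dept (the 'HR' value is just an example):

```cql
-- Rejected: ORDER BY needs the partition key to be restricted.
-- SELECT * FROM emp_by_dept ORDER BY empno ASC;

-- Accepted, but it touches every partition. Rows are sorted by empno only
-- inside each dept; the departments themselves come back in token order,
-- not alphabetical order.
SELECT * FROM emp_by_dept;

-- Restricted to one partition, so the empno ordering is meaningful.
SELECT * FROM emp_by_dept WHERE dept = 'HR';
```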

My apologies for self-promoting, but last year I wrote an article for DataStax called We Shall Have Order!, in which I addressed how to solve these types of problems. Give it a read and see if it helps.

Edit for additional questions:

From your answer I concluded 2 things about Cassandra:

(1) There is no way of getting a result set which is ordered only by a column that has been defined as unique.

(2) When we define a PK (partition key + clustering key), the results will always be ordered by the clustering columns within any fixed partition key (we must restrict to one partition key value). That means there is no need for an ORDER BY clause, since it can never change the order of rows (the order in which rows are actually stored), i.e. ORDER BY is useless.

1) All PRIMARY KEYs in Cassandra are unique. There's no way to order your result set by your partition key. In my example, I order by empno (after partitioning by dept).

2) Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC.

I created an index on the "empno" column of the "emp" table, and it still does not allow ORDER BY empno. So what are indexes for? Are they only for finding records with a specific value of the index key?

You cannot order a result set by an indexed column. Secondary indexes (which are not the same as their relational counterparts) are really only useful for edge-case, analytics-based queries. They don't scale, so the general recommendation is not to use secondary indexes.
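
For reference, a minimal sketch of what a secondary index does allow, using the emp_by_dept table from earlier. The index name emp_name_idx and the value 'Ann' are assumptions made for this example.

```cql
-- Hypothetical index on a non-key column.
CREATE INDEX emp_name_idx ON emp_by_dept (name);

-- The index enables an equality lookup without restricting the partition key.
SELECT * FROM emp_by_dept WHERE name = 'Ann';

-- It does not make the column sortable; this is still rejected.
-- SELECT * FROM emp_by_dept WHERE dept = 'IT' ORDER BY name ASC;
```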

OK, that simply means that one table cannot be used for getting different result sets with different conditions and different sort orders.

Correct.

Hence for each new requirement, we need to create a new table. It means that if we have a billion rows in a table (say a Sales table), and we need the sum of sales (1) product-wise and (2) region-wise, then we will duplicate all those billion rows in 2 tables, one in clustering order of Product, the other in clustering order of Region. And if we also need to sum sales per Salesman_id, then we build a 3rd table, again putting in all those billion rows? Is that sensible?

It's really up to you to decide how sensible it is. But lack of query flexibility is a drawback of Cassandra. To get around it you can keep creating query tables (i.e., trading disk space for performance). But if it gets to a point where it becomes ungainly or difficult to manage, then it's time to think about whether or not Cassandra is really the right solution.
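
As a sketch of what "creating query tables" could look like for the sales example, assuming hypothetical table and column names (sales_by_product, sales_by_region, sale_id, amount) and assuming each table is partitioned by the column it is queried on:

```cql
-- The same sales rows, stored twice: once per query pattern.
CREATE TABLE sales_by_product (
  product_id text,
  sale_id timeuuid,
  region text,
  salesman_id text,
  amount decimal,
  PRIMARY KEY (product_id, sale_id)
);

CREATE TABLE sales_by_region (
  region text,
  sale_id timeuuid,
  product_id text,
  salesman_id text,
  amount decimal,
  PRIMARY KEY (region, sale_id)
);

-- Each aggregate then reads a single partition of the matching table.
-- (The built-in sum() aggregate requires Cassandra 2.2 or later.)
SELECT sum(amount) FROM sales_by_product WHERE product_id = 'p-42';
SELECT sum(amount) FROM sales_by_region WHERE region = 'EMEA';
```

Every sale has to be written to both tables, which is the disk-for-performance trade described above. A third query by salesman_id would need a third table following the same pattern.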

Edit 20160321

Hi Aaron, you said above "Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC."

But I found that even that is not correct. Cassandra only allows ORDER BY in the same direction as we define in the "CLUSTERING ORDER BY" clause of CREATE TABLE. If we define ASC in that clause, it allows only ORDER BY ASC, and vice versa.

Without seeing an error message, it's hard to know what to tell you on that one. Although I have heard of queries with ORDER BY failing when you have too many rows stored in a partition.

ORDER BY also behaves a little oddly if you specify multiple columns to sort by. If I have two clustering columns defined, I can use ORDER BY on the first column in either direction. But as soon as I add the second column to the ORDER BY clause, my query only works if both sort directions match the CLUSTERING ORDER BY definition or both are reversed from it. If I mix and match, I get this:

InvalidRequest: code=2200 [Invalid query] message="Unsupported order by relation"

I think that has to do with how the data is stored on disk. Otherwise Cassandra would have more work to do in preparing result sets. Whereas if it requires everything to either match or mirror the direction(s) specified in the CLUSTERING ORDER BY, it can just relay a sequential read from disk. So it's probably best to only use a single column in your ORDER BY clause, for more predictable results.
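
Here is a sketch of the match-or-mirror behavior with two clustering columns. The table emp_by_dept_hired and the hire_year column are hypothetical, added only to illustrate the rule.

```cql
-- Hypothetical table with two clustering columns in opposite directions.
CREATE TABLE emp_by_dept_hired (
  dept text,
  hire_year int,
  empno text,
  name text,
  PRIMARY KEY (dept, hire_year, empno)
) WITH CLUSTERING ORDER BY (hire_year DESC, empno ASC);

-- Matches the defined clustering order: accepted.
SELECT * FROM emp_by_dept_hired WHERE dept = 'IT'
  ORDER BY hire_year DESC, empno ASC;

-- Mirrors it (every direction flipped): accepted, read in reverse.
SELECT * FROM emp_by_dept_hired WHERE dept = 'IT'
  ORDER BY hire_year ASC, empno DESC;

-- Mixed (one column flipped, one not): rejected with
-- "Unsupported order by relation".
-- SELECT * FROM emp_by_dept_hired WHERE dept = 'IT'
--   ORDER BY hire_year DESC, empno DESC;
```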