
Deleting records from a Cassandra table by time range

Updated: 2023-02-04 20:16:42


It's an interesting question...


All columns that aren't part of the primary key have a so-called WriteTime, which can be retrieved using the writetime(column_name) function of CQL (warning: it doesn't work with collection columns, and returns null for UDTs!). But because CQL has no nested queries, you will need to write a program that fetches the data, filters entries by WriteTime, and deletes the entries whose WriteTime is older than your threshold. (Note that the value of writetime is in microseconds, not milliseconds as in CQL's timestamp type.)
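The unit mismatch is an easy place to go wrong, so here is a minimal sketch of the conversion in plain Scala (the cutoff date is a hypothetical example; substitute your own):

```scala
import java.time.Instant

// Hypothetical cutoff date -- adjust to your own threshold.
val someDate = Instant.parse("2023-01-01T00:00:00Z")

// CQL's timestamp type counts milliseconds since the epoch,
// while writetime(...) returns microseconds since the epoch.
val thresholdMillis: Long = someDate.toEpochMilli
val thresholdMicros: Long = someDate.toEpochMilli * 1000L

// A writetime value fetched from Cassandra must therefore be divided
// by 1000 before comparing it against a millisecond timestamp, or the
// millisecond threshold multiplied by 1000, as above.
assert(thresholdMicros / 1000L == thresholdMillis)
```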


The easiest way is to use Spark Cassandra Connector's RDD API, something like this:

val timestamp = someDate.toInstant.toEpochMilli * 1000L  // writetime is in microseconds
val oldData = sc.cassandraTable(srcKeyspace, srcTable)
      .select("prk1", "prk2", "reg_col".writeTime as "writetime")
      .filter(row => row.getLong("writetime") < timestamp)
oldData.deleteFromCassandra(srcKeyspace, srcTable,
      keyColumns = SomeColumns("prk1", "prk2"))


where prk1, prk2, ... are all components of the primary key (documentId and sequenceNo in your case), and reg_col is any "regular" column of the table that isn't a collection or UDT (for example, clientId). It's important that the list of primary key columns in select and deleteFromCassandra is the same.
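Before running the Spark job, it can help to spot-check a few writetime values directly in cqlsh. A sketch against the table described above (the table name docs is an assumption; substitute your own keyspace and table):

```sql
-- Inspect writetimes (microseconds since epoch) for a regular column.
-- writetime() can't target primary key columns, collections, or UDTs,
-- and CQL offers no way to filter on it -- hence the Spark job above.
SELECT documentid, sequenceno, writetime(clientid)
FROM docs
LIMIT 10;
```

If the returned writetime values look plausible against your threshold (remember: microseconds, not milliseconds), the Spark deletion job should pick up the intended rows.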