且构网

Cassandra: handling partitions and buckets to deal with large amounts of data

Updated: 2023-02-02 21:18:23

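From the discussion in the comments, it looks like you are trying to use Cassandra as a queue, and that is a big anti-pattern. A queue workload deletes rows constantly, and in Cassandra every delete leaves a tombstone that later reads have to scan past, so performance degrades over time.
While you could store data about the operations you've performed in Cassandra, you should use something like Kafka or RabbitMQ for the queuing itself.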

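It could look something like this:

  1. App 1 copies/generates record A;
  2. App 1 adds the path of A to a queue;
  3. App 1 writes a row to Cassandra in a partition based on the file ID/path (other columns could hold things like the date, the time taken to copy, the file hash, etc.);
  4. App 2 reads the queue, finds A, processes it, and determines whether it failed or completed;
  5. App 2 updates Cassandra with information about the processing, including the status. You could also record things like the reason for a failure;
  6. If it failed, you can write the path/ID to another topic.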
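
To make the flow above concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: `work_topic` and `failed_topic` are plain `queue.Queue` objects standing in for Kafka topics or RabbitMQ queues, `file_log` is a dict standing in for a Cassandra table partitioned by file path, and the column names and `demo_handler` are hypothetical.

```python
import hashlib
import queue
from datetime import datetime, timezone

# Stand-ins (assumptions): in a real system these two would be Kafka topics
# or RabbitMQ queues, and file_log would be a Cassandra table keyed by path.
work_topic = queue.Queue()
failed_topic = queue.Queue()
file_log = {}  # path -> row (dict of columns)


def producer_copy(path, content):
    """App 1, steps 1-3: enqueue the path, then log the copy."""
    work_topic.put(path)
    file_log[path] = {
        "copied_at": datetime.now(timezone.utc),
        "file_hash": hashlib.sha256(content).hexdigest(),
        "status": "PENDING",
        "failure_reason": None,
    }


def consumer_process(handler):
    """App 2, steps 4-6: drain the queue, update the log, re-route failures."""
    while not work_topic.empty():
        path = work_topic.get()
        try:
            handler(path)
            file_log[path].update(status="DONE")
        except Exception as exc:
            file_log[path].update(status="FAILED", failure_reason=str(exc))
            failed_topic.put(path)


def demo_handler(path):
    # Hypothetical processing step: reject anything that is not a .csv file.
    if not path.endswith(".csv"):
        raise ValueError("unsupported file type")


producer_copy("/data/a.csv", b"id,value\n1,2\n")
producer_copy("/data/b.bin", b"\x00\x01")
consumer_process(demo_handler)
print({p: row["status"] for p, row in file_log.items()})
```

The point of the split is that the queue only ever holds pending work, while the log is written once per file and then updated in place, so nothing has to be deleted from Cassandra to mark an item as consumed.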

So, to sum up: don't try to use Cassandra as a queue; that is a widely recognized anti-pattern. You can and should use Cassandra to persist a log of what you have done, including, where applicable, how each file was processed and what the result was.
Depending on how you later need to read and use the data in Cassandra, you could partition and bucket by things like the source of the file, the type of file, etc. If you have no such access pattern, you could keep the table partitioned by a unique value, like the UUID I've seen in your table, and look up information about a file by that key.
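
As a rough sketch of what "partitions and buckets" could mean here, the snippet below builds a composite partition key from the file source, file type, day, and a hash bucket. The column choice, `NUM_BUCKETS`, and both function names are assumptions for illustration, not something taken from your schema.

```python
import zlib
from datetime import date

NUM_BUCKETS = 16  # assumed value; tune so each partition stays a manageable size


def partition_key(source, file_type, day, file_id):
    """Build a composite partition key (source, file_type, day, bucket)."""
    # crc32 is stable across processes, unlike the built-in hash() on strings.
    bucket = zlib.crc32(file_id.encode("utf-8")) % NUM_BUCKETS
    return (source, file_type, day.isoformat(), bucket)


def keys_to_query(source, file_type, day):
    """Every partition a reader must visit to list one day's files."""
    return [(source, file_type, day.isoformat(), b) for b in range(NUM_BUCKETS)]


# In CQL this would roughly correspond to a table declared with
# PRIMARY KEY ((source, file_type, day, bucket), file_id).
key = partition_key("sftp", "csv", date(2023, 2, 2), "example-file-id")
print(key in keys_to_query("sftp", "csv", date(2023, 2, 2)))
```

The trade-off is that writes for a busy source spread over `NUM_BUCKETS` partitions instead of piling into one, while a reader that wants a whole day has to issue one query per bucket.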

Hope this helped,
Cheers!