
What is the difference between nodes, clusters, and data centers in the Cassandra NoSQL database?

Updated: 2023-02-08 19:14:37


A node is a single machine that runs Cassandra. A collection of nodes holding related data is grouped into what is known as a "ring" or cluster.


Sometimes, if you have a lot of data, or if you are serving data in different geographical areas, it makes sense to group the nodes of your cluster into different data centers. A good use case for this is an e-commerce website, which may have many frequent customers on both the east coast and the west coast. That way, your customers on the east coast connect to your east coast DC (for faster performance), but ultimately have access to the same dataset as the west coast customers (both DCs are in the same cluster).
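A multi-DC layout like that is expressed per keyspace with NetworkTopologyStrategy, listing a replication factor for each data center. A minimal sketch (the DC names `east_dc` and `west_dc` and the keyspace name `orders` are hypothetical; the names must match what the nodes report via their snitch):

```sql
CREATE KEYSPACE orders
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'east_dc': '3',
                      'west_dc': '3'};
```

With this definition, every write is replicated to 3 nodes in each data center, so clients on either coast can read a full copy of the data locally.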


More information on this can be found here: About Apache Cassandra - How does Cassandra work?



And all the nodes that contain the same (duplicated) data compose a data center. Is that right?


Close, but not necessarily. The level of data duplication you have is determined by your replication factor, which is set on a per-keyspace basis. For instance, let's say that I have 3 nodes in my single DC, all storing 600GB of product data. My products keyspace definition might look like this:

CREATE KEYSPACE products
WITH replication = {'class': 'NetworkTopologyStrategy', 'MyDC': '3'};


This will ensure that my product data is replicated equally to all 3 nodes. The size of my total dataset is 600GB, duplicated on all 3 nodes.
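The disk-space arithmetic used here and below can be sketched as a one-line formula: each of the RF copies of the dataset is spread evenly across the ring, so per-node usage is roughly dataset size × RF ÷ node count (a simplification that ignores token imbalance and compaction overhead; the function name is my own):

```python
def per_node_disk_gb(dataset_gb: float, replication_factor: int, num_nodes: int) -> float:
    """Approximate disk usage per node: RF full copies of the dataset,
    spread evenly across all nodes in the data center."""
    return dataset_gb * replication_factor / num_nodes

# RF=3 on 3 nodes: every node holds a full 600GB copy
print(per_node_disk_gb(600, 3, 3))  # 600.0

# RF=2 on 3 nodes with the larger 900GB dataset discussed below:
# still 600GB per node, since each node holds 2/3 of the data
print(per_node_disk_gb(900, 2, 3))  # 600.0
```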


But let's say that we're rolling out a new, fairly large product line, and I estimate that we're going to have another 300GB of data coming, which may start pushing the max capacity of our hard drives. If we can't afford to upgrade all of our hard drives right now, I can alter the replication factor like this:

ALTER KEYSPACE products
WITH replication = {'class': 'NetworkTopologyStrategy', 'MyDC': '2'};


This will create 2 copies of all of our data and store them in our current cluster of 3 nodes. The size of our dataset is now 900GB, but since there are only two copies of it (each node is essentially responsible for 2/3 of the data), our size on disk per node is still 600GB. The drawback here is that (assuming I read and write at a consistency level of ONE) I can only afford to lose 1 node. Whereas with 3 nodes and an RF of 3 (again reading and writing at consistency ONE), I could lose 2 nodes and still serve requests.
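The fault-tolerance trade-off at the end can be sketched the same way: at a given consistency level, an operation succeeds as long as the required number of replicas for each key is still up, so you can tolerate losing RF minus CL replicas (a simplification that ignores which token ranges the lost nodes own; the function name is my own):

```python
def tolerable_node_losses(replication_factor: int, required_replicas: int) -> int:
    """Replicas of any given key we can lose while still satisfying the
    consistency level (ONE requires 1 replica, QUORUM requires RF//2 + 1)."""
    return replication_factor - required_replicas

# RF=3 at consistency ONE: can lose 2 nodes and still serve requests
print(tolerable_node_losses(3, 1))  # 2

# RF=2 at consistency ONE: can only afford to lose 1 node
print(tolerable_node_losses(2, 1))  # 1
```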