且构网

Is it possible to create nested RDDs in Apache Spark?

Updated: 2023-11-18 22:34:28

No, it is not possible. The items of an RDD must be serializable, and an RDD cannot be used as a serializable item inside another RDD. This makes sense: otherwise you might transfer a whole RDD over the network, which is a problem if it contains a lot of data. And if it does not contain a lot of data, you can, and should, use an array or a similar local collection instead.

That said, I don't know how you are implementing K-nearest neighbors, but be careful: if you do something like computing the distance between every pair of points, it will not scale with the dataset size, because it is O(n^2).
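To make the O(n^2) concern concrete, here is a minimal plain-Python sketch (no Spark involved) of a brute-force K-nearest-neighbors search. It is only an illustration, not your implementation: the function names `pairwise_distance_count` and `brute_force_knn` and the sample points are made up for this example. It evaluates the distance once per unordered pair, so n points cost n(n-1)/2 distance computations.

```python
import math
from itertools import combinations


def pairwise_distance_count(n: int) -> int:
    """Number of unordered pairs among n points: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def brute_force_knn(points, k):
    """Return (neighbors, evaluations) for a brute-force k-NN search.

    neighbors maps each point index to the indices of its k nearest
    other points; evaluations counts the distance computations done.
    """
    candidates = {i: [] for i in range(len(points))}
    evaluations = 0
    # Visit every unordered pair exactly once: this is the quadratic part.
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])  # Euclidean distance
        evaluations += 1
        candidates[i].append((d, j))
        candidates[j].append((d, i))
    # Keep the k closest candidates for each point.
    neighbors = {
        i: [j for _, j in sorted(cands)[:k]] for i, cands in candidates.items()
    }
    return neighbors, evaluations


if __name__ == "__main__":
    sample = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
    neighbors, evaluations = brute_force_knn(sample, k=1)
    print(neighbors)    # {0: [1], 1: [0], 2: [3], 3: [2]}
    print(evaluations)  # 6 distance computations for 4 points
    print(pairwise_distance_count(1_000_000))  # 499999500000
```

Four points need only 6 distance computations, but a million points already need about 5 * 10^11, which is why the all-pairs approach stops being practical as the dataset grows.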