Updated: 2023-11-18 22:38:52
Starting with Spark 1.0 there are two methods you can use to solve this easily:
- RDD.zipWithIndex is just like Seq.zipWithIndex: it adds contiguous (Long) numbers. This needs to count the elements in each partition first, so your input will be evaluated twice. Cache your input RDD if you want to use this.

- RDD.zipWithUniqueId also gives you unique Long IDs, but they are not guaranteed to be contiguous. (They will only be contiguous if each partition has the same number of elements.) The upside is that this does not need to know anything about the input, so it will not cause double-evaluation.