且构网

Joining two RDD[String] - Spark Scala

Updated: 2023-11-18 22:21:28

As noted in the comments, you have to convert your RDDs to PairRDDs before joining them, which means each RDD must be of type RDD[(key, value)]. Only then can you perform a join by key. In your case, the key is composed of (name, address), so you would have to do something like this:

// First, we create the first PairRDD, with (name, address) as key and zipcode as value:
val pairRDD1 = rdd1.map { case (name, address, zipcode) => ((name, address), zipcode) }
// Then, we create the second PairRDD, with (name, address) as key and landmark as value:
val pairRDD2 = rdd2.map { case (name, address, landmark) => ((name, address), landmark) }

// Now we can join them.
// fullOuterJoin keeps keys that appear in either RDD, so each side of the value is an Option.
// The result is an RDD of ((name, address), (Option[zipcode], Option[landmark])), which we map to a flat tuple:
val joined = pairRDD1.fullOuterJoin(pairRDD2).map {
  case ((name, address), (zipcodeOpt, landmarkOpt)) => (name, address, zipcodeOpt, landmarkOpt)
}
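If your inputs really are RDD[String], as the title suggests, each line has to be parsed into a key/value pair first. The following is only a minimal sketch under assumptions that are not in the question: each line is assumed to look like "name,address,value" with commas as separators, and the names JoinSketch, toPair and joinLines, the sample records, and the empty-string default for a missing side are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object JoinSketch {
  // Assumed line format: "name,address,value". The limit of 3 keeps any
  // further commas inside the last field instead of splitting on them.
  def toPair(line: String): ((String, String), String) =
    line.split(",", 3).map(_.trim) match {
      case Array(name, address, value) => ((name, address), value)
      case _ =>
        throw new IllegalArgumentException(s"Expected 3 comma-separated fields: $line")
    }

  // Full outer join on (name, address). A side with no match arrives as None
  // and is replaced here by an empty string (an arbitrary choice for the sketch).
  def joinLines(rdd1: RDD[String], rdd2: RDD[String]): RDD[(String, String, String, String)] =
    rdd1.map(toPair).fullOuterJoin(rdd2.map(toPair)).map {
      case ((name, address), (zipcodeOpt, landmarkOpt)) =>
        (name, address, zipcodeOpt.getOrElse(""), landmarkOpt.getOrElse(""))
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data: one key on both sides, one key on each side only.
      val rdd1 = sc.parallelize(Seq("Alice,1 Main St,10001", "Bob,2 Oak Ave,20002"))
      val rdd2 = sc.parallelize(Seq("Alice,1 Main St,City Hall", "Carol,3 Pine Rd,Library"))
      joinLines(rdd1, rdd2).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

If you only want records that exist in both RDDs, join can be used instead of fullOuterJoin; its values are plain (zipcode, landmark) pairs rather than Options, so no default is needed.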

For more information, see Spark's Scala API documentation.