且构网

Spark MLlib: how to convert string categorical features to Int so that Rating accepts them

Updated: 2022-11-29 08:41:07

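Let me give it a try. Assume data is an RDD[(String, String, Double)]; the sample uses Double literals, and Double is also the type Rating expects for the rating value.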

import org.apache.spark.mllib.recommendation.Rating

val data = sc.parallelize(Array(("StringName1", "StringProduct1", 1.0), ("StringName2", "StringProduct2", 2.0), ("StringName3", "StringProduct3", 3.0)))

//get distinct names and products and create maps from them
val names = data.map(_._1).distinct.sortBy(x => x).zipWithIndex.collectAsMap
val products = data.map(_._2).distinct.sortBy(x => x).zipWithIndex.collectAsMap

//convert to Rating format
val data_rating = data.map(r => Rating(names(r._1).toInt, products(r._2).toInt, r._3))

That should do it. Basically, you build a map from each distinct string to a Long index (zipWithIndex on an RDD yields Long) and then convert that Long to an Int, because Rating takes Int user and product IDs.
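The same mapping idea can be sketched without a Spark cluster, using plain Scala collections. This is only an illustration under stated assumptions: the local Rating case class is a stand-in for org.apache.spark.mllib.recommendation.Rating, and buildIndex, StringIdSketch, and nameById are hypothetical names that are not part of the original answer. Note that zipWithIndex on a Scala Seq already yields Int, whereas on an RDD it yields Long, which is why the Spark version needs .toInt.

```scala
// Stand-in for org.apache.spark.mllib.recommendation.Rating (assumption),
// so this sketch runs without a Spark dependency.
final case class Rating(user: Int, product: Int, rating: Double)

object StringIdSketch {
  // Assign each distinct string an Int index, in sorted order,
  // so the mapping is deterministic across runs.
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  // Same sample rows as in the answer above.
  val data: Seq[(String, String, Double)] = Seq(
    ("StringName1", "StringProduct1", 1.0),
    ("StringName2", "StringProduct2", 2.0),
    ("StringName3", "StringProduct3", 3.0)
  )

  val names: Map[String, Int] = buildIndex(data.map(_._1))
  val products: Map[String, Int] = buildIndex(data.map(_._2))

  // Convert each row to the Rating shape using the two lookup maps.
  val ratings: Seq[Rating] =
    data.map { case (name, product, score) =>
      Rating(names(name), products(product), score)
    }

  // Reverse lookup: recover the original string from an Int ID,
  // e.g. to label recommendations after the model has run.
  val nameById: Map[Int, String] = names.map(_.swap)
}
```

Keeping a reverse map such as nameById around is a design choice rather than a requirement; it is handy when the model returns Int IDs that need to be turned back into the original strings.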