Updated: 2023-08-26 20:48:58
There are at least two options:
On an existing DataFrame, you can use the as method with a metadata argument:
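import org.apache.spark.ml.attribute._
import org.apache.spark.ml.linalg.Vectors

val rdd = sc.parallelize(Seq(
  (1, Vectors.dense(1.0, 2.0, 3.0))
))
val df = rdd.toDF("label", "features")

// Assumed definition of attrGroup; the attribute names are illustrative.
val attrs = Array("feat1", "feat2", "feat3")
  .map(name => NumericAttribute.defaultAttr.withName(name): Attribute)
val attrGroup = new AttributeGroup("features", attrs)

df.withColumn("features", $"features".as("_", attrGroup.toMetadata))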
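To check that the metadata really ended up on the column, it can be read back from the schema. This is only a minimal sketch, assuming Spark 2.0 or later and a local SparkSession; the helper attributeNames and the attribute names feat1, feat2, feat3 are hypothetical and chosen for illustration.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructField

object MetadataRoundTrip {
  // Hypothetical helper: lists the attribute names stored in a vector
  // column's ML metadata (empty if no attributes are attached).
  def attributeNames(field: StructField): Seq[String] =
    AttributeGroup.fromStructField(field).attributes
      .map(_.toSeq.flatMap(_.name.toSeq))
      .getOrElse(Seq.empty)

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("metadata-round-trip")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, Vectors.dense(1.0, 2.0, 3.0))).toDF("label", "features")

    // Illustrative attribute names, one per vector element.
    val attrs = Array("feat1", "feat2", "feat3")
      .map(name => NumericAttribute.defaultAttr.withName(name): Attribute)
    val attrGroup = new AttributeGroup("features", attrs)

    // Attach the metadata through Column.as(alias, metadata).
    val withMeta = df.withColumn("features", $"features".as("features", attrGroup.toMetadata))

    // Read the names back from the resulting schema.
    println(attributeNames(withMeta.schema("features")).mkString(", "))

    spark.stop()
  }
}
```

Because attributeNames only needs a StructField, it works the same way on a column produced by VectorAssembler.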
When you create a new DataFrame, convert the AttributeGroup with toStructField and use the result as the schema for the given column:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val schema = StructType(Array(
  StructField("label", IntegerType, false),
  attrGroup.toStructField()
))

spark.createDataFrame(
  rdd.map(row => Row.fromSeq(row.productIterator.toSeq)),
  schema)
If the vector column was created with VectorAssembler, column metadata describing the parent columns should already be attached.
import org.apache.spark.ml.feature.VectorAssembler
val raw = sc.parallelize(Seq(
  (1, 1.0, 2.0, 3.0)
)).toDF("id", "feat1", "feat2", "feat3")

val assembler = new VectorAssembler()
  .setInputCols(Array("feat1", "feat2", "feat3"))
  .setOutputCol("features")
val dfWithMeta = assembler.transform(raw).select($"id", $"features")
dfWithMeta.schema.fields(1).metadata
// org.apache.spark.sql.types.Metadata = {"ml_attr":{"attrs":{"numeric":[
//   {"idx":0,"name":"feat1"},{"idx":1,"name":"feat2"},
//   {"idx":2,"name":"feat3"}]},"num_attrs":3}}
Vector fields are not directly accessible using dot syntax (like $features.feat1), but they can be used by specialized tools like VectorSlicer:
import org.apache.spark.ml.feature.VectorSlicer
val slicer = new VectorSlicer()
  .setInputCol("features")
  .setOutputCol("featuresSubset")
  .setNames(Array("feat1", "feat3"))

slicer.transform(dfWithMeta).show
// +---+-------------+--------------+
// | id|     features|featuresSubset|
// +---+-------------+--------------+
// |  1|[1.0,2.0,3.0]|     [1.0,3.0]|
// +---+-------------+--------------+
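If a single element is needed as an ordinary column rather than as a sub-vector, one possible approach is to resolve its name to an index through the column metadata and then index into the vector. The sketch below is not part of the original answer: it assumes Spark 3.0 or later (for org.apache.spark.ml.functions.vector_to_array) and a local SparkSession, and elementByName is a hypothetical helper name.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object VectorElementByName {
  // Hypothetical helper: looks up `attrName` in the ML attribute metadata of
  // `vectorCol` and returns that vector element as a Double column.
  def elementByName(df: DataFrame, vectorCol: String, attrName: String): Column = {
    val group = AttributeGroup.fromStructField(df.schema(vectorCol))
    require(group.hasAttr(attrName), s"No attribute named $attrName on $vectorCol")
    vector_to_array(col(vectorCol)).getItem(group.indexOf(attrName)).as(attrName)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("vector-element-by-name")
      .getOrCreate()
    import spark.implicits._

    val raw = Seq((1, 1.0, 2.0, 3.0)).toDF("id", "feat1", "feat2", "feat3")

    // VectorAssembler attaches the input column names as attribute metadata.
    val assembled = new VectorAssembler()
      .setInputCols(Array("feat1", "feat2", "feat3"))
      .setOutputCol("features")
      .transform(raw)
      .select($"id", $"features")

    // Select the element named feat3 as a plain column next to id.
    assembled.select($"id", elementByName(assembled, "features", "feat3")).show()

    spark.stop()
  }
}
```

The lookup happens once on the driver, against the schema, so it fails early with a clear message when the name is missing. VectorSlicer remains the better fit when the result should stay a vector.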
For PySpark, see "How can I declare a Column as a categorical feature in a DataFrame for use in ml".