Python Spark DataFrame: replacing null with a SparseVector

Updated: 2023-08-23 21:09:10

You can use a udf:

from pyspark.sql.functions import udf, lit
from pyspark.ml.linalg import SparseVector, VectorUDT

# Return the value unchanged, or an empty SparseVector of the given size when it is null
fill_with_vector = udf(
    lambda x, i: x if x is not None else SparseVector(i, {}),
    VectorUDT()
)

# Example DataFrame with two vector columns and one row of nulls
df = sc.parallelize([
    (SparseVector(5, {1: 1.0}), SparseVector(10, {1: -1.0})), (None, None)
]).toDF(["features1", "features2"])

# Pass each column's dimension via lit() so the replacement vector has the right size
(df
    .withColumn("features1", fill_with_vector("features1", lit(5)))
    .withColumn("features2", fill_with_vector("features2", lit(10)))
    .show())

# +-------------+---------------+
# |    features1|      features2|
# +-------------+---------------+
# |(5,[1],[1.0])|(10,[1],[-1.0])|
# |    (5,[],[])|     (10,[],[])|
# +-------------+---------------+
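
If you have several vector columns, you can avoid repeating withColumn by looping over a mapping of column names to dimensions. A minimal sketch, assuming the column names and sizes below are placeholders for your own schema:

# Hypothetical mapping of vector column name -> vector dimension
vector_sizes = {"features1": 5, "features2": 10}

for col_name, size in vector_sizes.items():
    # Replace nulls in each column with an empty SparseVector of the matching size
    df = df.withColumn(col_name, fill_with_vector(col_name, lit(size)))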