
Removing spaces from DataFrame column values in Spark

Updated: 2023-11-18 23:00:52


While the problem you've described is not reproducible with the provided code, using Python UDFs for simple tasks like this is rather inefficient. If you simply want to remove spaces from the text, use regexp_replace:

from pyspark.sql.functions import regexp_replace, col

# Sample data: values with an inner space, a trailing space, and only whitespace
df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, "   ")
]).toDF(["k", "v"])

# Replace every space character with an empty string
df.select(regexp_replace(col("v"), " ", ""))


If you want to normalize empty lines, use trim:

from pyspark.sql.functions import trim

# Strip leading and trailing whitespace from each value
df.select(trim(col("v")))


If you want to keep leading/trailing spaces and only blank out whitespace-only values, you can adjust regexp_replace:

df.select(regexp_replace(col("v"), r"^\s+$", ""))
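The three transformations above can be sketched in plain Python. This is a hypothetical illustration using the standard re module, whose regex semantics are close to (though not identical with) the Java regexes used by Spark's regexp_replace:

```python
import re

# The same sample values as the DataFrame above
values = ["foo bar", "foobar ", "   "]

# regexp_replace(col("v"), " ", "") -- drop every space character
print([re.sub(" ", "", v) for v in values])       # ['foobar', 'foobar', '']

# trim(col("v")) -- strip leading/trailing whitespace only
print([v.strip() for v in values])                # ['foo bar', 'foobar', '']

# regexp_replace(col("v"), r"^\s+$", "") -- blank out whitespace-only
# strings, leaving leading/trailing spaces in other values untouched
print([re.sub(r"^\s+$", "", v) for v in values])  # ['foo bar', 'foobar ', '']
```

Note how only the last pattern leaves "foobar " intact: the anchors ^ and $ require the entire string to be whitespace before anything is replaced.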