且构网 - Sharing all things programming and development

Filtering a PySpark DataFrame by checking whether a string appears in a column

Updated: 2023-11-28 18:56:22

When the column is an array of strings, you can use the pyspark.sql.functions.array_contains function:

df.filter(array_contains(df['authors'], 'Some Author'))


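from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructField, StructType
from pyspark.sql.functions import array_contains

# Reuse the active session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Each row has a single column holding an array of author names
lst = [(["author 1", "author 2"],), (["author 2"],), (["author 1"],)]
schema = StructType([StructField("authors", ArrayType(StringType()), True)])
df = spark.createDataFrame(lst, schema)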
df.show()
+--------------------+
|             authors|
+--------------------+
|[author 1, author 2]|
|          [author 2]|
|          [author 1]|
+--------------------+

df.printSchema()
root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)

df.filter(array_contains(df.authors, "author 1")).show()
+--------------------+
|             authors|
+--------------------+
|[author 1, author 2]|
|          [author 1]|
+--------------------+