更新时间:2022-04-09 18:03:09
您正在使用逻辑连词 (AND).这意味着所有列都必须与 'null'
不同才能包含行.让我们以使用 filter
版本为例来说明:
Your are using logical conjunction (AND). It means that all columns have to be different than 'null'
for row to be included. Lets illustrate that using filter
version as an example:
numeric = sqlContext.createDataFrame([
('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'), ('null', '38.0', 'null'),
('null', 'null', 'null'), ('1.0', 'null', '4.0')],
('low', 'high', 'normal'))
numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0| null|
## | 2.0|14.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
numeric_filtered_2 = numeric_filtered_1.where(
numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()
## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null| 4.0|
## +---+----+------+
numeric_filtered_3 = numeric_filtered_2.where(
numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()
## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+
您尝试过的所有剩余方法都遵循完全相同的架构.您在这里需要的是逻辑分离 (OR).
All remaining methods you've tried follow exactly the same schema. What you need here is a logical disjunction (OR).
from pyspark.sql.functions import col
numeric_filtered = df.where(
(col('LOW') != 'null') |
(col('NORMAL') != 'null') |
(col('HIGH') != 'null'))
numeric_filtered.show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0| null|
## | 2.0|14.0| null|
## |null|38.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
或使用原始 SQL:
numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0| null|
## | 2.0|14.0| null|
## |null|38.0| null|
## | 1.0|null| 4.0|
## +----+----+------+