Updated: 2022-06-23 03:07:59
Your condition can be simplified to checking whether the hour part of your time column is between 16 and 23.
You can get the hour by using pyspark.sql.functions.split to tokenize the time column on the : character. Extract the token at index 0 to get the hour, and make the comparison using pyspark.sql.Column.between() (which is inclusive of the bounds).
from pyspark.sql.functions import split
df.where(split("time", ":")[0].between(16, 23)).show()
#+--------+
#| time|
#+--------+
#|22:20:54|
#|21:46:07|
#+--------+
Note that even though split returns a string, there is an implicit conversion to int to do the between comparison.
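To make the semantics concrete without a running Spark session, here is a plain-Python sketch of what split("time", ":")[0].between(16, 23) does to each row; the helper name hour_in_range is hypothetical, and the int() call makes explicit the cast that Spark performs implicitly:

```python
def hour_in_range(time_str, low=16, high=23):
    # Take the token before the first ":" (index 0 of the split),
    # cast it to int, and compare inclusively, mirroring between().
    hour = int(time_str.split(":")[0])
    return low <= hour <= high

print(hour_in_range("22:20:54"))  # True
print(hour_in_range("16:00:00"))  # True -- bounds are inclusive
print(hour_in_range("08:28:24"))  # False
```

If you prefer not to rely on the implicit conversion in Spark, you can cast explicitly, e.g. split("time", ":")[0].cast("int").between(16, 23), which behaves the same.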
Of course, this could be extended if you had more complicated filtering criteria that also involved looking at minutes or seconds:
df.select(
"*",
split("time", ":")[0].cast("int").alias("hour"),
split("time", ":")[1].cast("int").alias("minute"),
split("time", ":")[2].cast("int").alias("second")
).show()
#+--------+----+------+------+
#| time|hour|minute|second|
#+--------+----+------+------+
#|08:28:24| 8| 28| 24|
#|22:20:54| 22| 20| 54|
#|12:59:38| 12| 59| 38|
#|21:46:07| 21| 46| 7|
#+--------+----+------+------+
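For example, once hour, minute, and second are available as integers, a boundary such as "between 16:30:00 and 23:00:00" is just a lexicographic comparison on the three values. A plain-Python sketch of that logic (the helper names parse_hms and in_window are hypothetical, not Spark API):

```python
def parse_hms(time_str):
    # Equivalent of the three cast("int") columns above.
    hour, minute, second = (int(tok) for tok in time_str.split(":"))
    return hour, minute, second

def in_window(time_str):
    # Keep rows between 16:30:00 and 23:00:00 inclusive;
    # tuple comparison orders by hour, then minute, then second.
    return (16, 30, 0) <= parse_hms(time_str) <= (23, 0, 0)

rows = ["08:28:24", "22:20:54", "12:59:38", "21:46:07"]
print([t for t in rows if in_window(t)])  # ['22:20:54', '21:46:07']
```

In Spark the same window could be expressed with boolean column arithmetic on the cast columns, or by comparing the raw strings, since zero-padded HH:mm:ss strings also sort chronologically.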