且构网


PySpark: check whether an HH:mm:ss time falls within a range

Updated: 2022-06-23 03:07:59

Your condition can be simplified to checking whether the hour part of your time column is between 16 and 23.

You can get the hour by using pyspark.sql.functions.split to tokenize the time column on the : character. Extract the token at index 0 to get the hour, then make the comparison with pyspark.sql.Column.between(), which includes both bounds.

from pyspark.sql.functions import split
df.where(split("time", ":")[0].between(16, 23)).show()
#+--------+
#|    time|
#+--------+
#|22:20:54|
#|21:46:07|
#+--------+

Note that split returns an array of strings, so the token at index 0 is a string; Spark implicitly converts it to a numeric type for the between comparison.
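To sanity-check which rows the hour filter should keep, the same logic can be sketched in plain Python, where the string-to-int conversion has to be written out explicitly. This is only an illustration, not PySpark code: the helper name hour_in_range is hypothetical, and the sample values are copied from the example output on this page.

```python
# Sample time strings taken from the example output on this page.
times = ["08:28:24", "22:20:54", "12:59:38", "21:46:07"]


def hour_in_range(time_str, low=16, high=23):
    """Return True if the hour part of an HH:mm:ss string is in [low, high].

    hour_in_range is a hypothetical helper used only for illustration.
    """
    hour = int(time_str.split(":")[0])  # explicit conversion, e.g. "08" -> 8
    return low <= hour <= high          # both bounds included, like between()


kept = [t for t in times if hour_in_range(t)]
print(kept)  # ['22:20:54', '21:46:07']
```

The two surviving values match the rows returned by the where() call, including the boundary behaviour: an hour of exactly 16 or 23 is kept.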

Of course, this can be extended if your filtering criteria are more complicated and also involve the minutes or seconds:

df.select(
    "*",
    split("time", ":")[0].cast("int").alias("hour"),
    split("time", ":")[1].cast("int").alias("minute"),
    split("time", ":")[2].cast("int").alias("second")
).show()
#+--------+----+------+------+
#|    time|hour|minute|second|
#+--------+----+------+------+
#|08:28:24|   8|    28|    24|
#|22:20:54|  22|    20|    54|
#|12:59:38|  12|    59|    38|
#|21:46:07|  21|    46|     7|
#+--------+----+------+------+
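Once hour, minute, and second are available as integers, one possible way to filter on a range with minute or second precision is to collapse them into seconds since midnight and compare that single number. The sketch below shows the arithmetic in plain Python so it can be checked without a Spark session. The helper name to_seconds and the window 12:30:00 to 22:00:00 are assumptions chosen for illustration; they do not come from the original question.

```python
# Sample time strings taken from the example output on this page.
times = ["08:28:24", "22:20:54", "12:59:38", "21:46:07"]


def to_seconds(time_str):
    """Convert an HH:mm:ss string to seconds since midnight.

    to_seconds is a hypothetical helper used only for illustration.
    """
    hour, minute, second = (int(part) for part in time_str.split(":"))
    return hour * 3600 + minute * 60 + second


# Assumed window for the example; both bounds are included.
low = to_seconds("12:30:00")
high = to_seconds("22:00:00")

in_window = [t for t in times if low <= to_seconds(t) <= high]
print(in_window)  # ['12:59:38', '21:46:07']
```

In PySpark the same expression could be built from the hour, minute, and second columns and passed to between(), which keeps the comparison in a single numeric column instead of three separate conditions.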