且构网

How do I generate hourly timestamps between two dates in PySpark?

Updated: 2023-01-29 16:54:42

This is how I finally solved it.

Input data

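import datetime as dt
from pyspark.sql import SparkSession, functions as fn

# Reuse the active session if one exists (e.g. in a notebook or shell)
spark = SparkSession.builder.getOrCreate()

data = [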
    (dt.datetime(2000,1,1,15,20,37), dt.datetime(2000,1,1,19,12,22)),
    (dt.datetime(2001,1,1,15,20,37), dt.datetime(2001,1,1,18,12,22))
]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()

Result

+-------------------+-------------------+
|            minDate|            maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|
|2001-01-01 15:20:37|2001-01-01 18:12:22|
+-------------------+-------------------+

Transforming the data

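# Count the hour boundaries crossed between minDate and maxDate.
# Rounding the elapsed time up instead would miscount when, for example,
# both dates fall inside the same clock hour.
df = df.withColumn(
    'hour_diff',
    fn.floor(fn.col('maxDate').cast('long') / 3600) - fn.floor(fn.col('minDate').cast('long') / 3600)
)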

# Emit hour_diff + 1 rows per input row: splitting a string of n commas
# yields n + 1 empty elements, so idx runs from 0 to hour_diff
df = df.withColumn("repeat", fn.expr("split(repeat(',', hour_diff), ',')"))\
    .select("*", fn.posexplode("repeat").alias("idx", "val"))\
    .drop("repeat", "val")\
    .withColumn('hour_add', (fn.col('minDate').cast('long') + fn.col('idx')*3600).cast('timestamp'))

# Create the new start and end date according to the boundaries
df = (df
.withColumn(
    'start_dt', 
    fn.when(
        fn.col('idx') > 0,
        (fn.floor(fn.col('hour_add').cast('long') / 3600)*3600).cast('timestamp')
    ).otherwise(fn.col('minDate'))
).withColumn(
    'end_dt', 
    fn.when(
        fn.col('idx') != fn.col('hour_diff'),
        # floor + 1 (not ceil) so that an hour_add exactly on the hour still ends in its own hour
        ((fn.floor(fn.col('hour_add').cast('long') / 3600) + 1)*3600-60).cast('timestamp')
    ).otherwise(fn.col('maxDate'))
).drop('hour_diff', 'idx', 'hour_add'))

df.show()

This results in

+-------------------+-------------------+-------------------+-------------------+
|            minDate|            maxDate|           start_dt|             end_dt|
+-------------------+-------------------+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 15:20:37|2000-01-01 15:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 16:00:00|2000-01-01 16:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 17:00:00|2000-01-01 17:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 18:00:00|2000-01-01 18:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 19:00:00|2000-01-01 19:12:22|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 15:20:37|2001-01-01 15:59:00|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 16:00:00|2001-01-01 16:59:00|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 17:00:00|2001-01-01 17:59:00|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 18:00:00|2001-01-01 18:12:22|
+-------------------+-------------------+-------------------+-------------------+
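
To sanity-check the Spark result on a few rows, the same splitting logic can be sketched in plain Python. This is only an illustrative reference, not part of the original answer: the function name hourly_intervals is hypothetical, naive datetimes are treated as if they were UTC, and the number of rows is taken from the hour boundaries crossed (an assumption made here so that a range such as 15:10 to 15:50 yields a single row). The loop over idx plays the role of the posexplode step, which yields hour_diff + 1 positions because a string of n commas splits into n + 1 elements.

```python
import datetime as dt

HOUR = 3600  # seconds per hour


def hourly_intervals(min_date, max_date):
    """Split [min_date, max_date] into per-hour (start, end) pairs."""
    # Treat naive datetimes as UTC so the arithmetic is timezone-free.
    epoch = dt.datetime(1970, 1, 1)
    lo = int((min_date - epoch).total_seconds())
    hi = int((max_date - epoch).total_seconds())

    # Number of hour boundaries crossed between the two timestamps.
    hour_diff = hi // HOUR - lo // HOUR

    rows = []
    for idx in range(hour_diff + 1):  # idx runs 0..hour_diff, like posexplode
        hour_add = lo + idx * HOUR
        # The first row starts at min_date; later rows start on the hour.
        start = lo if idx == 0 else (hour_add // HOUR) * HOUR
        # The last row ends at max_date; earlier rows end one minute
        # before the next full hour, matching the output shown above.
        end = hi if idx == hour_diff else (hour_add // HOUR + 1) * HOUR - 60
        rows.append((epoch + dt.timedelta(seconds=start),
                     epoch + dt.timedelta(seconds=end)))
    return rows


# Usage: print the intervals for the first input row
for start, end in hourly_intervals(dt.datetime(2000, 1, 1, 15, 20, 37),
                                   dt.datetime(2000, 1, 1, 19, 12, 22)):
    print(start, end)
```

Comparing this list with the start_dt and end_dt columns collected from a small DataFrame is a quick way to spot edge cases, such as a minDate that falls exactly on the hour.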