Casting a large number of struct fields to string using Pyspark

Updated: 2022-06-24 14:51:54

I tried this with my own test dataset; check whether it works for you. The answer is inspired by "Pyspark - Looping through structType and ArrayType to do typecasting in the structfield"; refer to it for more details.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Create the test DataFrame (sqlContext is assumed to exist already in the session)
tst = sqlContext.createDataFrame([(1,1,2,11),(1,3,4,12),(1,5,6,13),(1,7,8,14),(2,9,10,15),(2,11,12,16),(2,13,14,17)], schema=['col1','col2','x','y'])
tst_struct = tst.withColumn("str_col", F.struct('x','y'))
old_schema = tst_struct.schema

# Return a copy of a struct schema in which every field is a string
def transform(schema):
    res = []
    for f in schema.fields:
        res.append(StructField(f.name, StringType(), f.nullable))
    return StructType(res)

# Walk the existing schema and swap in the string version wherever a struct is found
new_schema = []
for f in old_schema.fields:
    if isinstance(f.dataType, StructType):
        new_schema.append(StructField(f.name, transform(f.dataType), f.nullable))
    else:
        new_schema.append(StructField(f.name, f.dataType, f.nullable))

# Cast every column to its type in the new schema
tst_trans = tst_struct.select([F.col(f.name).cast(f.dataType) for f in new_schema])

This is the schema of the test dataset:

tst_struct.printSchema()
root
 |-- col1: long (nullable = true)
 |-- col2: long (nullable = true)
 |-- x: long (nullable = true)
 |-- y: long (nullable = true)
 |-- str_col: struct (nullable = false)
 |    |-- x: long (nullable = true)
 |    |-- y: long (nullable = true)

This is the schema after the transformation:

tst_trans.printSchema()
root
 |-- col1: long (nullable = true)
 |-- col2: long (nullable = true)
 |-- x: long (nullable = true)
 |-- y: long (nullable = true)
 |-- str_col: struct (nullable = false)
 |    |-- x: string (nullable = true)
 |    |-- y: string (nullable = true)

If you need to expand the struct column into separate columns, you can do the following (see: How to unwrap nested Struct column into multiple columns?):

tst_exp = tst_trans.select(tst_trans.columns+[F.col('str_col.*')])

So, finally:

tst_exp.show()
+----+----+---+---+--------+---+---+
|col1|col2|  x|  y| str_col|  x|  y|
+----+----+---+---+--------+---+---+
|   1|   1|  2| 11| [2, 11]|  2| 11|
|   1|   3|  4| 12| [4, 12]|  4| 12|
|   1|   5|  6| 13| [6, 13]|  6| 13|
|   1|   7|  8| 14| [8, 14]|  8| 14|
|   2|   9| 10| 15|[10, 15]| 10| 15|
|   2|  11| 12| 16|[12, 16]| 12| 16|
|   2|  13| 14| 17|[14, 17]| 14| 17|
+----+----+---+---+--------+---+---+
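
Note that the expanded result contains two columns named x and two named y, so referring to `x` by name afterwards is ambiguous. One possible way around this is to alias each struct field with the struct column's name as a prefix. The sketch below only builds the (path, alias) pairs from the plain-dict form of the schema, again assumed to match what `schema.jsonValue()` returns. The function name `flattened_struct_columns` and the `_` separator are illustrative choices, not part of the original answer.

```python
def flattened_struct_columns(schema_json, struct_col, sep="_"):
    """Return (path, alias) pairs for each field of one top-level struct column."""
    for field in schema_json["fields"]:
        if field["name"] == struct_col:
            col_type = field["type"]
            if not (isinstance(col_type, dict) and col_type.get("type") == "struct"):
                raise ValueError(f"{struct_col!r} is not a struct column")
            # "str_col.x" selects the nested field; "str_col_x" is its new name.
            return [(f"{struct_col}.{sub['name']}", f"{struct_col}{sep}{sub['name']}")
                    for sub in col_type["fields"]]
    raise KeyError(struct_col)


# Hand-written schema dict mirroring the transformed test dataset.
trans_schema = {
    "type": "struct",
    "fields": [
        {"name": "x", "type": "long", "nullable": True, "metadata": {}},
        {"name": "str_col", "nullable": False, "metadata": {},
         "type": {"type": "struct", "fields": [
             {"name": "x", "type": "string", "nullable": True, "metadata": {}},
             {"name": "y", "type": "string", "nullable": True, "metadata": {}},
         ]}},
    ],
}

pairs = flattened_struct_columns(trans_schema, "str_col")
```

In a Spark session the pairs could then be applied with something like `tst_trans.select(tst_trans.columns + [F.col(p).alias(a) for p, a in pairs])`, using `tst_trans.schema.jsonValue()` as the input. Field names that themselves contain dots would need extra quoting, which this sketch does not handle.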