且构网

Sharing what programmers build...

PySpark DataFrame: add a column if it does not exist

Updated: 2022-12-12 09:14:10

I have JSON data in several JSON files, and the keys can differ from line to line, for example:

{"a":1 , "b":"abc", "c":"abc2", "d":"abc3"}
{"a":1 , "b":"abc2", "d":"abc"}
{"a":1 ,"b":"abc", "c":"abc2", "d":"abc3"}

I want to aggregate the data on columns 'b', 'c', 'd' and 'f'. Column 'f' is not present in the given JSON file but could be present in other files, so when it is missing we can use an empty string for it.

I am reading the input file and aggregating the data like this:

import pyspark.sql.functions as f
df = spark.read.json(inputfile)
df2 = df.groupby("b", "c", "d", "f").agg(f.sum(df["a"]))

This is the final output I want:

{"a":2 , "b":"abc", "c":"abc2", "d":"abc3","f":"" }
{"a":1 , "b":"abc2", "c":"" ,"d":"abc","f":""}

Can anyone please help? Thanks in advance!

You can check whether the column is present in the DataFrame and modify df only if necessary:

if 'f' not in df.columns:
    df = df.withColumn('f', f.lit(''))

For nested schemas you may need to use df.schema, as shown below:

>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 |    |-- b: long (nullable = true)

>>> 'b' in df.schema['a'].dataType.names
True
>>> 'x' in df.schema['a'].dataType.names
False