Nested dynamic schema not working when parsing JSON with PySpark

Updated: 2023-10-17 18:24:22

Your schema is not mapped properly. If you want to manually construct the schema (which is recommended if the data doesn't change), please see these posts:

PySpark: How do I update a nested column?

https://docs.databricks.com/_static/notebooks/complex-nested-structured.html
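
For reference, a hand-built schema for the sample nested.json used below might look like this sketch (field names and types are taken from the example data; adjust as needed):

from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType

# Manually constructed schema matching the sample nested.json below;
# defining it by hand avoids a schema-inference pass on every read.
manualSchema = StructType([
    StructField("array", ArrayType(LongType(), True), True),
    StructField("dict", StructType([
        StructField("extra_key", StringType(), True),
        StructField("key", StringType(), True),
    ]), True),
    StructField("int", LongType(), True),
    StructField("string", StringType(), True),
])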

Also, if your JSON is multi-line (like your example), then you can:

  1. Read the JSON with the multiline option so that Spark infers the schema
  2. Then save the nested schema
  3. Then read the data back with the correct schema mapping applied, to avoid triggering another Spark job

! cat nested.json

[
    {"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
    {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
    {
        "string": "string3",
        "int": 3,
        "array": [
            3,
            6,
            9
        ],
        "dict": {
            "key": "value3",
            "extra_key": "extra_value3"
        }
    }
]

getSchema = spark.read.option("multiline", "true").json("nested.json")

extractSchema = getSchema.schema
print(extractSchema)
StructType(List(StructField(array,ArrayType(LongType,true),true),StructField(dict,StructType(List(StructField(extra_key,StringType,true),StructField(key,StringType,true))),true),StructField(int,LongType,true),StructField(string,StringType,true)))
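
Step 2 above mentions saving the nested schema. One way to persist it between runs is to serialize it to JSON and rebuild it with StructType.fromJson; this is a minimal sketch, and the schema.json file name is just an assumption:

import json
from pyspark.sql.types import StructType

# Persist the inferred schema so later reads can skip inference entirely.
with open("schema.json", "w") as f:
    f.write(extractSchema.json())

# Later: rebuild the schema object from the saved JSON.
with open("schema.json") as f:
    savedSchema = StructType.fromJson(json.load(f))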

loadJson = spark.read.option("multiline", "true").schema(extractSchema).json("nested.json")

loadJson.printSchema()
root
 |-- array: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- dict: struct (nullable = true)
 |    |-- extra_key: string (nullable = true)
 |    |-- key: string (nullable = true)
 |-- int: long (nullable = true)
 |-- string: string (nullable = true)

loadJson.show(truncate=False)
+---------+----------------------+---+-------+
|array    |dict                  |int|string |
+---------+----------------------+---+-------+
|[1, 2, 3]|[, value1]            |1  |string1|
|[2, 4, 6]|[, value2]            |2  |string2|
|[3, 6, 9]|[extra_value3, value3]|3  |string3|
+---------+----------------------+---+-------+

Once you have the data loaded with the correct mapping, you can start to transform it into a normalized schema via "dot" notation for nested columns, "explode" to flatten arrays, etc.

loadJson\
.selectExpr("dict.key as key", "dict.extra_key as extra_key").show()

+------+------------+
|   key|   extra_key|
+------+------------+
|value1|        null|
|value2|        null|
|value3|extra_value3|
+------+------------+
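
And as a sketch of the "explode" part, flattening the array column into one row per element might look like this (column names come from the sample data above):

from pyspark.sql.functions import col, explode

# One output row per array element, alongside the nested dict.key value.
loadJson.select(
    col("string"),
    explode(col("array")).alias("array_element"),
    col("dict.key").alias("key"),
).show()

This yields one row per (string, array element) pair while keeping dict.key as a flat column.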