Retrieving JSON objects from a text file (using Python)

Updated: 2023-01-17 16:05:18

I have thousands of text files containing multiple JSON objects, but unfortunately there is no delimiter between the objects. Objects are stored as dictionaries and some of their fields are themselves objects. Each object might have a variable number of nested objects. Concretely, an object might look like this:

{field1: {}, field2: "some value", field3: {}, ...} 

and hundreds of such objects are concatenated without a delimiter in a text file. This means that I can neither use json.load() nor json.loads().

Any suggestions on how I can solve this problem? Is there a known parser that does this?

This decodes your "list" of JSON Objects from a string:

from json import JSONDecoder

def loads_invalid_obj_list(s):
    decoder = JSONDecoder()
    s_len = len(s)

    objs = []
    end = 0
    while end != s_len:
        obj, end = decoder.raw_decode(s, idx=end)
        objs.append(obj)

    return objs

The bonus here is that you play nicely with the parser, so it keeps telling you exactly where it found an error.

Examples

>>> loads_invalid_obj_list('{}{}')
[{}, {}]

>>> loads_invalid_obj_list('{}{\n}{')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "decode.py", line 9, in loads_invalid_obj_list
    obj, end = decoder.raw_decode(s, idx=end)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 2 column 2 (char 5)
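On Python 3 the same failure is raised as json.JSONDecodeError (a subclass of ValueError), which exposes the error position as attributes instead of only in the message text. A minimal sketch, assuming Python 3.5 or later; describe_failure is a hypothetical helper name used for illustration, not part of the json module:

```python
import json
from json import JSONDecoder


def loads_invalid_obj_list(s):
    """Decode JSON values written back to back, as in the function above."""
    decoder = JSONDecoder()
    objs = []
    end = 0
    while end != len(s):
        obj, end = decoder.raw_decode(s, idx=end)
        objs.append(obj)
    return objs


def describe_failure(s):
    """Hypothetical helper: return (pos, lineno, colno) of the first
    parse error in s, or None if every value decodes cleanly."""
    try:
        loads_invalid_obj_list(s)
    except json.JSONDecodeError as exc:
        # pos is a 0-based character offset; lineno and colno are 1-based.
        return exc.pos, exc.lineno, exc.colno
    return None
```

The exact offset reported for a truncated object can differ between Python versions, so treat the numbers as a pointer to the neighbourhood of the problem rather than a stable value.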

Clean Solution (added later)

import json
import re

#shameless copy paste from json/decoder.py
FLAGS = re.VERBOSE | re.MULTILINE | re.DOTALL
WHITESPACE = re.compile(r'[ \t\n\r]*', FLAGS)

class ConcatJSONDecoder(json.JSONDecoder):
    def decode(self, s, _w=WHITESPACE.match):
        s_len = len(s)

        objs = []
        end = 0
        while end != s_len:
            obj, end = self.raw_decode(s, idx=_w(s, end).end())
            end = _w(s, end).end()  # skip whitespace after the value so the loop can end at s_len
            objs.append(obj)
        return objs

Examples

>>> print json.loads('{}', cls=ConcatJSONDecoder)
[{}]

>>> print json.load(open('file'), cls=ConcatJSONDecoder)
[{}]

>>> print json.loads('{}{} {', cls=ConcatJSONDecoder)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "decode.py", line 15, in decode
    obj, end = self.raw_decode(s, idx=_w(s, end).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 1 column 5 (char 5)
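The transcripts above come from Python 2.7. If you would rather not subclass the decoder, or you want to handle objects one at a time instead of building the whole list, a generator around raw_decode is one option. This is only a sketch, assuming Python 3; iter_concat_json is an illustrative name, and the whitespace pattern is the same character set used in the clean solution:

```python
import json
import re

# Same whitespace characters as in the clean solution above.
WHITESPACE = re.compile(r'[ \t\n\r]*')


def iter_concat_json(text):
    """Yield each JSON value from text, where values are concatenated
    with optional whitespace (but no other delimiter) between them."""
    decoder = json.JSONDecoder()
    # Skip leading whitespace; raw_decode expects a value to start at pos.
    pos = WHITESPACE.match(text, 0).end()
    while pos < len(text):
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
        # Skip whitespace after the value so trailing blanks end the loop.
        pos = WHITESPACE.match(text, pos).end()
```

For many files, you could read each one into a string with open(path, encoding='utf-8').read() and loop over iter_concat_json(text). A malformed file still raises json.JSONDecodeError at the offending position, but the objects before it have already been yielded.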