Python - Extracting text from a PDF on a web page

Updated: 2023-02-12 22:16:01

You can download the file as a byte stream with requests and wrap it in io.BytesIO(), like so:

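import io

import requests
from pyPdf import PdfFileReader

url = 'http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf'

# Download the PDF; r.content holds the raw bytes of the response body.
r = requests.get(url)
f = io.BytesIO(r.content)  # wrap the bytes in an in-memory file-like object

# Parse the in-memory file and split the first page's text into lines.
reader = PdfFileReader(f)
contents = reader.getPage(0).extractText().split('\n')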

f is a file-like object that you can use just as if you had opened a PDF file from disk. This way the file exists only in memory and is never saved locally.
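
To make "file-like" concrete, here is a minimal sketch using only the standard library. The bytes in data are a made-up stand-in for r.content and are not a valid PDF; the point is only that io.BytesIO supports the read, tell and seek calls a PDF parser relies on.

```python
import io

# Stand-in payload (an assumption for illustration); in the answer above
# these bytes would come from r.content.
data = b"%PDF-1.4 example payload"

f = io.BytesIO(data)

header = f.read(5)     # read the first five bytes, as with a file opened in 'rb' mode
position = f.tell()    # the current offset is now 5
f.seek(0)              # rewind to the start; parsers seek around the file like this
everything = f.read()  # read the whole buffer again from the beginning
```

Nothing here touches the disk: the buffer lives in memory and is released once f is garbage collected.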

To get the text out of the PDF file, you can use pyPdf.
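
The snippet above is written against pyPdf. If that package is not available, the same idea can be sketched with the pypdf package instead, where the reader class is PdfReader and pages are accessed through reader.pages. This is an assumption about your environment, not part of the original answer; the helper names split_page_text, first_page_lines and fetch_first_page_lines are hypothetical and exist only for this illustration.

```python
import io


def split_page_text(text):
    """Split extracted page text into stripped, non-empty lines."""
    return [line.strip() for line in text.split('\n') if line.strip()]


def first_page_lines(pdf_bytes):
    """Return the text lines of the first page of an in-memory PDF."""
    # Assumption: pypdf is installed. It is imported here so that the
    # text helper above stays usable without it.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    return split_page_text(reader.pages[0].extract_text())


def fetch_first_page_lines(url):
    """Download a PDF into memory and return the first page's text lines."""
    import requests  # assumption: requests is installed

    r = requests.get(url, timeout=30)
    r.raise_for_status()  # stop early on 4xx/5xx instead of parsing an error page
    return first_page_lines(r.content)
```

Calling fetch_first_page_lines(url) with the url from the answer would then give a list of lines, roughly what contents holds above, except that blank lines are dropped and surrounding whitespace is trimmed.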