Updated: 2023-09-04 09:51:58
You go through every PDF in your directory with the outer loop. Then, in the inner loop, you go through all pages of the PDF and extract the text. For every document, you want to check whether at least one page contains "school", "gym", or "swimming pool". The return value you want is a vector whose length equals the number of PDF documents, with each element being either "Present" or "Not present". Right?
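The check described above can be sketched without any PDFs, assuming the OCR text is already in memory. The sample texts and the names doc_a, doc_b, pattern, and status are hypothetical and only illustrate the logic; this is not the full solution.

```r
# Hypothetical page texts standing in for OCR output (one string per page)
texts <- list(
  doc_a = c("Welcome to our School", "opening hours"),
  doc_b = c("annual report", "financial statement")
)
strings <- c("school", "gym", "swimming pool")

# One regular expression that matches any of the search terms
pattern <- paste(strings, collapse = "|")

# For each document: does at least one page contain a search term?
status <- vapply(texts, function(doc_pages) {
  if (any(grepl(pattern, tolower(doc_pages)))) "Present" else "Not present"
}, character(1))

print(status)
```

The result is a named character vector with one element per document, which matches the shape of the return value described in the question.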
You can skip a couple of steps in your loop, especially when converting the PDFs to TIFFs and reading the text from them with ocr():
library(pdftools)   # pdf_convert()
library(tesseract)  # ocr()

all_files <- Sys.glob("*.pdf")
strings <- c("school", "gym", "swimming pool")

# Read text from the PDFs: one lower-cased string per page
texts <- lapply(all_files, function(x) {
  img_file <- pdf_convert(x, format = "tiff", dpi = 400)
  tolower(ocr(img_file))
})

# Check for the presence of each word in strings
pages <- words <- vector("list", length(texts))
for (d in seq_along(texts)) {
  for (w in seq_along(strings)) {
    # Indices of the pages that contain the current word
    intermed <- grep(strings[w], texts[[d]], fixed = TRUE)
    words[[d]] <- c(words[[d]], strings[w][length(intermed) > 0])
    pages[[d]] <- unique(c(pages[[d]], intermed))
  }
}

# Organize the data so that it suits your wanted output
fileName <- tools::file_path_sans_ext(basename(all_files))
# Use "-" only when no page matched; a single matching page is kept
Page <- mapply(function(f, p) {
  if (length(p) == 0) "-" else paste0(f, "_", p, collapse = ", ")
}, fileName, pages, USE.NAMES = FALSE)
Words <- sapply(words, paste0, collapse = ", ")
Status <- ifelse(nchar(Words) > 0, "Present", "Not present")
data.frame(row.names = fileName, Status = Status, Page = Page, Words = Words)
#       Status                                   Page                      Words
# pdf1 Present                         pdf1_1, pdf1_2         gym, swimming pool
# pdf2 Present pdf2_2, pdf2_5, pdf2_8, pdf2_3, pdf2_6 school, gym, swimming pool
It's not as readable as I'd like it to be, probably because a few small requirements on the output call for minor intermediate steps that make the code look a bit chaotic. It works well, though.
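One way to reduce the intermediate steps might be to move the per-document matching into a small helper that returns one row of the result. The sketch below assumes the OCR text is already available as a named list with one lower-cased character vector per document. The helper find_terms and the sample texts are hypothetical. Unlike the code above, the helper sorts the page numbers.

```r
# Hypothetical helper: summarise one document's matches as a one-row data frame.
# `doc_pages` is a character vector with one lower-cased string per page.
find_terms <- function(name, doc_pages, strings) {
  # Page indices on which each search term occurs (literal matching)
  hits <- lapply(strings, grep, x = doc_pages, fixed = TRUE)
  found <- lengths(hits) > 0
  pages <- sort(unique(unlist(hits)))
  data.frame(
    row.names = name,
    Status = if (any(found)) "Present" else "Not present",
    Page = if (length(pages) > 0) paste0(name, "_", pages, collapse = ", ") else "-",
    Words = paste(strings[found], collapse = ", "),
    stringsAsFactors = FALSE
  )
}

# Hypothetical sample input standing in for the OCR output
texts <- list(
  pdf1 = c("the gym is closed", "swimming pool rules"),
  pdf2 = c("nothing relevant here")
)
strings <- c("school", "gym", "swimming pool")

# Build one row per document and stack the rows
rows <- lapply(names(texts), function(n) find_terms(n, texts[[n]], strings))
result <- do.call(rbind, rows)
print(result)
```

Keeping the formatting of Status, Page, and Words inside one function means the special case of a document without matches is handled in a single place.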