且构网

Check whether any of several words match in a string, for searching text in R

Updated: 2023-09-04 09:51:58

The outer loop goes through every PDF in your directory. The inner loop then goes through all pages of that PDF and extracts the text. For every document, you want to check whether at least one page contains school, gym, or swimming pool. The values you want returned are:

  1. A vector with one element per PDF document, each either Present or Not present.
  2. Three vectors of strings, containing information on which word occurs where and when.

Is that right?
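To make the check concrete before any OCR is involved, here is a minimal sketch on made-up page text. The `docs` list and the `has_match()` helper are hypothetical stand-ins introduced for illustration, not part of the original answer; each document is assumed to be a character vector with one lowercase string per page.

```r
# Hypothetical page texts for two documents (stand-ins for OCR output)
docs <- list(
  pdf1 = c("the gym opens at nine", "nothing relevant here"),
  pdf2 = c("annual report", "budget overview")
)
strings <- c("school", "gym", "swimming pool")

# TRUE if at least one page of a document contains at least one search word
has_match <- function(pages, words) {
  any(vapply(words,
             function(w) any(grepl(w, pages, fixed = TRUE)),
             logical(1)))
}

# One status per document
status <- ifelse(vapply(docs, has_match, logical(1), words = strings),
                 "Present", "Not present")
print(status)
```

Using `fixed = TRUE` treats the search words as literal text rather than regular expressions, which seems the safer default for plain keywords.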

You can skip a couple of steps in your loop, especially when converting the PDFs to TIFFs and reading the text from them with ocr:

library(pdftools)   # provides pdf_convert()
library(tesseract)  # provides ocr()

all_files <- Sys.glob("*.pdf")
strings   <- c("school", "gym", "swimming pool")

# Read text from PDFs: each element of texts holds one lowercase string per page
texts <- lapply(all_files, function(x){
                img_file <- pdf_convert(x, format="tiff", dpi=400)
                return( tolower(ocr(img_file)) )
                })

# Check for presence of each word in strings
pages <- words <- vector("list", length(texts))
for(d in seq_along(texts)){
  for(w in seq_along(strings)){
    intermed   <- grep(strings[w], texts[[d]])
    words[[d]] <- c(words[[d]], 
                    strings[w][ (length(intermed) > 0) ])
    pages[[d]] <- unique(c(pages[[d]], intermed))
  }
}

# Organize data so that it suits your wanted output
fileName <- tools::file_path_sans_ext(basename(all_files))

Page <- unlist(Map(paste0, fileName, "_", pages, collapse=", "), use.names=FALSE)
Page[lengths(pages) == 0] <- "-"   # document without any matching page

Words    <- sapply(words, paste0, collapse=", ")
Status   <- ifelse(nchar(Words) > 0, "Present", "Not present")

data.frame(row.names=fileName, Status=Status, Page=Page, Words=Words)        
#       Status                                   Page                      Words
# pdf1 Present                         pdf1_1, pdf1_2         gym, swimming pool
# pdf2 Present pdf2_2, pdf2_5, pdf2_8, pdf2_3, pdf2_6 school, gym, swimming pool

It's not as readable as I'd like it to be, probably because a few small requirements on the output format need minor intermediate steps that make the code look a bit chaotic. It works well, though.
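If readability matters more than brevity, one option is to move the per-document work into a small function. The following is only a sketch under assumptions: `summarise_doc()` is a hypothetical helper, the `texts` list is made-up page text standing in for the OCR result, and pages are sorted here, which the code above does not do.

```r
# Summarise one document; page_text is a character vector, one element per page
summarise_doc <- function(name, page_text, words) {
  # For each word, the page numbers on which it occurs
  hits  <- lapply(words, function(w) grep(w, page_text, fixed = TRUE))
  found <- words[lengths(hits) > 0]
  pages <- sort(unique(unlist(hits)))
  data.frame(
    Status = if (length(found) > 0) "Present" else "Not present",
    Page   = if (length(pages) > 0) {
      paste0(name, "_", pages, collapse = ", ")
    } else {
      "-"
    },
    Words  = paste(found, collapse = ", "),
    row.names = name,
    stringsAsFactors = FALSE
  )
}

# Made-up page texts standing in for the OCR output
texts <- list(
  pdf1 = c("gym timetable", "swimming pool rules"),
  pdf2 = c("no keywords on this page")
)
strings <- c("school", "gym", "swimming pool")

# One row per document, stacked into a single data frame
rows   <- lapply(names(texts), function(n) summarise_doc(n, texts[[n]], strings))
result <- do.call(rbind, rows)
print(result)
```

Each document becomes one row, so the formatting rules for Status, Page, and Words live in a single place instead of being spread over several post-processing steps.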