


Updated: 2023-02-26 09:20:28




This can be made to work by escaping each accented character in each company name before the name is used in the grep command.


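So, you'll need to prefix each one of those characters (i.e. those which have an accent) with a backslash. In AppleScript source this is written as a double backslash (i.e. \\), because the backslash itself has to be escaped inside a string literal. For example: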

  • The ü in Württemberg will need to become \\ü
  • The ö in Königsberg will need to become \\ö
  • The ß in Einbahnstraße will need to become \\ß


These accented characters, such as a u with diaeresis, are certainly getting encoded differently. Which type of encoding they receive is difficult to ascertain. My assumption is that the encoding pattern used begins with a backslash, which would explain why escaping those characters with backslashes fixes the issue. Consider the u with diaeresis in the previous link: it shows that in C/C++ source code the ü is written as \u00FC.
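Rather than guessing, the bytes of a name can be inspected directly. The following is a minimal sketch, assuming a POSIX shell with od and tr available; hex_bytes is a hypothetical helper and is not part of the script below. It shows that the same visible ü can be stored in two different ways in UTF-8: as one precomposed code point (U+00FC), or as a plain u followed by a combining diaeresis (U+0308). Whether that difference is what affects the folder names here is an assumption, not something the original answer confirms.

```shell
# hex_bytes is a hypothetical helper: print the bytes of a string as hex.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# ü as a single precomposed code point (U+00FC), given as octal UTF-8 bytes.
composed=$(printf '\303\274')
# u followed by the combining diaeresis (U+0308), given as octal UTF-8 bytes.
decomposed=$(printf 'u\314\210')

hex_bytes "$composed"; echo      # prints c3bc
hex_bytes "$decomposed"; echo    # prints 75cc88
```

Running the same helper on a folder name and on a line taken from /tmp/output.txt would show whether the two sides use the same byte sequence.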


In the complete script below you'll notice the following:


  1. set accentedChars to {"ü", "ö", "ß", "á", "ė"} has been added to hold a list of all characters that need to be escaped. You'll need to list each one explicitly, as there doesn't seem to be a way to infer whether a character has an accent.
  2. Before assigning the grep command to the theCommand variable, we first escape the necessary characters via the line reading:

set company to escapeChars(company, accentedChars)


As you can see here, we are passing two arguments to the escapeChars sub-routine: the non-escaped company variable and the list of accented characters.


In the escapeChars sub-routine we iterate over each char in the accentedChars list and invoke the findAndReplace sub-routine. This prefixes every instance of those characters found in the company variable with a backslash.


set searchFile to "/tmp/output.txt"
set accentedChars to {"ü", "ö", "ß", "á", "ė"}

set theCommand to "/usr/local/bin/pdftotext -enc UTF-8 some.pdf" & ¬
  space & searchFile
do shell script theCommand

tell application "Finder"
  set companies to get name of folders of folder ("/path/" as POSIX file)
end tell

repeat with company in companies
  set company to escapeChars(company, accentedChars)

  set theCommand to "grep -c " & quoted form of company & ¬
    space & quoted form of searchFile

  try
    do shell script theCommand
    set CompanyName to company as string
    return CompanyName
  on error
    -- grep exits with a non-zero status when nothing matches, so move on to the next name.
  end try
end repeat

return false

(**
 * Checks each character of a given word. If any characters of the word
 * match a character in the given list of characters they will be escaped.
 * @param {text} searchWord - The word to check the characters of.
 * @param {list} charactersList - List of characters to be escaped.
 * @returns {text} The new text with the item(s) replaced.
 *)
on escapeChars(searchWord, charactersList)
  repeat with char in charactersList
    set searchWord to findAndReplace(char, ("\\" & char), searchWord)
  end repeat
  return searchWord
end escapeChars

(**
 * Replaces all occurrences of findString with replaceString
 * @param {text} findString - The text string to find.
 * @param {text} replaceString - The replacement text string.
 * @param {text} searchInString - Text string to search.
 * @returns {text} The new text with the item(s) replaced.
 *)
on findAndReplace(findString, replaceString, searchInString)
  set oldTIDs to text item delimiters of AppleScript
  set text item delimiters of AppleScript to findString
  set searchInString to text items of searchInString
  set text item delimiters of AppleScript to replaceString
  set searchInString to "" & searchInString
  set text item delimiters of AppleScript to oldTIDs
  return searchInString
end findAndReplace



Note about current counts:

Currently your grep command (which uses the -c option) only reports the number of lines on which the word was found, not how many instances of the word were found.


If you want the actual number of instances of the word then use the -o option with grep to output each occurrence. Then pipe that to wc with the -l option to count the number of lines. For example:

grep -o 'Württemberg' /tmp/output.txt | wc -l


and in your AppleScript that would be:

set theCommand to "grep -o " & quoted form of company & space & ¬
  quoted form of searchFile & "| wc -l"


Tip: If you want to remove the leading spaces in the count that gets logged, pipe it to sed to strip the spaces. For example, via your script:

set theCommand to "grep -o " & quoted form of company & space & ¬
  quoted form of searchFile & "| wc -l | sed -e 's/ //g'"


and the equivalent via the command line:

grep -o 'Württemberg' /tmp/output.txt | wc -l | sed -e 's/ //g'
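
To see the difference between the two counts side by side, here is a small sketch that assumes a POSIX shell with grep, wc, sed and mktemp available. The helper names count_lines and count_matches and the sample text are made up for illustration, and an ASCII name is used so the result does not depend on the locale.

```shell
# Count the lines that contain the pattern (what grep -c reports).
count_lines() {
  grep -c "$1" "$2"
}

# Count every occurrence: -o prints each match on its own line,
# wc -l counts those lines, and sed strips the padding that some
# wc implementations add in front of the number.
count_matches() {
  grep -o "$1" "$2" | wc -l | sed -e 's/ //g'
}

# Throwaway sample file: "Acme" appears three times on two lines.
sample=$(mktemp)
printf 'Acme and Acme again\nno match here\nAcme\n' > "$sample"

echo "lines:   $(count_lines Acme "$sample")"     # lines:   2
echo "matches: $(count_matches Acme "$sample")"   # matches: 3

rm -f "$sample"
```

The first line of the sample contains the name twice, which is why the second count is higher than the first.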
