Remove words from subtitle files that are not in a word list (of common words)

Updated: 2023-11-28 18:29:28

The following processes the 3rd line only of every '.srt' file. It can be easily adapted to process other lines and/or other files.

import os
import re
from glob import glob

# Load the vocabulary: one word per line, lowercased, stored in a set.
with open('words.txt') as f:
    keep_words = {line.strip().lower() for line in f}

for filename_in in glob('*.srt'):
    # 'name.srt' -> 'name_new.srt'
    filename_out = f'{os.path.splitext(filename_in)[0]}_new.srt'
    with open(filename_in) as fin, open(filename_out, 'w') as fout:
        for i, line in enumerate(fin):
            if i == 2:  # third line only (enumerate starts at 0)
                # Odd indices of parts are words, even indices are separators.
                parts = re.split(r"([\w']+)", line.strip())
                parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                line = ''.join(parts) + '\n'
            fout.write(line)

Result (for the subtitle.srt you gave as an example):

! cat subtitle_new.srt
2
00:00:13,000 --> 00:00:15,000
People with * * are good.

Alternative: just add a '*' next to out-of-vocabulary words:

# replace:
#                 parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                parts[1::2] = [w if w.lower() in keep_words else f'{w}*' for w in parts[1::2]]

The output is then:

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.
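As noted at the top, the same idea can be adapted to lines other than the third. Here is a minimal sketch of one way to do that, assuming the usual SRT layout of a cue number, a timestamp line, one or more text lines, and a blank line. The names `TIMESTAMP`, `mask_line` and `mask_subtitles` are hypothetical helpers, not part of the code above. One caveat: a subtitle text line consisting only of digits would be mistaken for a cue number and left unchanged.

```python
import re

# Matches SRT timestamp lines such as "00:00:13,000 --> 00:00:15,000".
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")


def mask_line(line, keep_words):
    """Replace words not in keep_words with '*', keeping separators intact."""
    parts = re.split(r"([\w']+)", line)
    parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
    return ''.join(parts)


def mask_subtitles(lines, keep_words):
    """Mask only the text lines; cue numbers, timestamps and blanks pass through."""
    out = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.isdigit() or TIMESTAMP.match(stripped):
            out.append(line)
        else:
            out.append(mask_line(line, keep_words))
    return out


# Sample data (assumed): one cue and a small vocabulary.
sample = [
    "2\n",
    "00:00:13,000 --> 00:00:15,000\n",
    "People with cleidocranial dysplasia are good.\n",
    "\n",
]
vocabulary = {"people", "with", "are", "good"}
print(''.join(mask_subtitles(sample, vocabulary)), end='')
```

In the main loop, this would replace the `if i == 2:` test: read the lines of each file, pass them through `mask_subtitles`, and write the result out.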

Explanation:

The first open is used to read in all wanted words, make sure they are in lowercase, and put them into a set (for fast membership test).
We use glob to find all filenames ending in '.srt'.
For each such file, we construct a new filename derived from it as '..._new.srt'.
We read in all lines, but modify only line i == 2 (i.e. the 3rd line, since enumerate by default starts at 0).
line.strip() removes the trailing newline.
We could have used line.strip().split() to split the line into words, but it would have left 'good.' as the last word; not good. The regex used is often used to split words (in particular, it leaves in single quotes such as "don't"; it may or may not be what you want, adapt at will of course).
We use a capturing group split r"([\w']+)" instead of splitting on non-word chars, so that we have both words and what separates them in parts. For example, 'People, who are good.' becomes ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.'].
The words themselves are every other element of parts, starting at index 1.
We replace the words by '*' if their lowercase form is not in keep_words.
Finally we re-assemble that line, and generally output all lines to the new file.
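The split-and-replace steps above can be tried in isolation. This is a small illustrative sketch; the `keep_words` set here is an assumed sample vocabulary, not the contents of any real words.txt.

```python
import re

keep_words = {'people', 'are', 'good'}  # assumed sample vocabulary

line = 'People, who are good.'

# The capturing group keeps the words in the result alongside the separators.
parts = re.split(r"([\w']+)", line)
print(parts)
# ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.']

# Words sit at the odd indices; separators sit at the even ones.
words = parts[1::2]
print(words)
# ['People', 'who', 'are', 'good']

# Slice assignment swaps the words in place and leaves the separators alone.
parts[1::2] = [w if w.lower() in keep_words else '*' for w in words]
masked = ''.join(parts)
print(masked)
# People, * are good.
```

Because the separators are kept verbatim, punctuation and spacing in the rebuilt line match the original exactly.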
