Updated: 2023-11-28 18:29:28
The following processes only the 3rd line of every '.srt' file. It can easily be adapted to process other lines and/or other files.
import os
import re
from glob import glob

# Load the wanted words as a lowercase set for fast membership tests.
with open('words.txt') as f:
    keep_words = {line.strip().lower() for line in f}

for filename_in in glob('*.srt'):
    filename_out = f'{os.path.splitext(filename_in)[0]}_new.srt'
    with open(filename_in) as fin, open(filename_out, 'w') as fout:
        for i, line in enumerate(fin):
            if i == 2:  # the 3rd line, since enumerate starts at 0
                # Odd indices of parts hold words, even indices hold separators.
                parts = re.split(r"([\w']+)", line.strip())
                parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                line = ''.join(parts) + '\n'
            fout.write(line)
Result (for the subtitle.srt you gave as an example):
! cat subtitle_new.srt
2
00:00:13,000 --> 00:00:15,000
People with * * are good.
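To adapt this to other lines, one option is to mask every subtitle text line rather than only the 3rd line of the file. The sketch below is only an illustration under stated assumptions: the helper name `mask_subtitle_lines`, the `TIMESTAMP` pattern, and the second cue in the sample data are hypothetical and not part of the original answer. It also assumes that a text line never consists solely of digits, since such a line would be mistaken for a cue counter.

```python
import re

# Assumed shape of an SRT timestamp line, e.g. "00:00:13,000 --> 00:00:15,000".
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def mask_subtitle_lines(lines, keep_words):
    """Return a copy of lines with out-of-vocabulary words on text lines replaced by '*'."""
    out = []
    for line in lines:
        stripped = line.strip()
        # Leave blank lines, cue counters, and timestamp lines untouched.
        if not stripped or stripped.isdigit() or TIMESTAMP.match(stripped):
            out.append(line)
            continue
        parts = re.split(r"([\w']+)", stripped)
        parts[1::2] = [w if w.lower() in keep_words else "*" for w in parts[1::2]]
        out.append("".join(parts) + "\n")
    return out


# Hypothetical sample: the first cue mirrors the example above, the second is made up.
sample = [
    "2\n",
    "00:00:13,000 --> 00:00:15,000\n",
    "People with cleidocranial dysplasia are good.\n",
    "\n",
    "3\n",
    "00:00:16,000 --> 00:00:18,000\n",
    "They are people.\n",
]
keep = {"people", "with", "are", "good"}
result = mask_subtitle_lines(sample, keep)
print("".join(result), end="")
```

To use it on a real file, the lines read from `fin` could be passed in and the returned lines written to `fout`.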
Alternative: just add a '*' next to out-of-vocabulary words:
# replace:
# parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
parts[1::2] = [w if w.lower() in keep_words else f'{w}*' for w in parts[1::2]]
Then the output is:
2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.
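The two variants differ only in what replaces an out-of-vocabulary word, so they could be folded into one small helper. This is a sketch, not part of the original answer: the function name `mask_line` and the `keep_original` flag are hypothetical.

```python
import re


def mask_line(line, keep_words, keep_original=False):
    """Mask words not in keep_words; optionally keep the word and append '*'."""
    parts = re.split(r"([\w']+)", line.strip())
    parts[1::2] = [
        w if w.lower() in keep_words else (f"{w}*" if keep_original else "*")
        for w in parts[1::2]
    ]
    return "".join(parts) + "\n"


keep = {"people", "with", "are", "good"}  # hypothetical vocabulary
text = "People with cleidocranial dysplasia are good.\n"

replaced = mask_line(text, keep)
flagged = mask_line(text, keep, keep_original=True)
print(replaced, end="")
print(flagged, end="")
```

With the sample vocabulary, the first call reproduces the original result line and the second reproduces the alternative one.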
Explanation:
1. open is used to read in all wanted words, make sure they are in lowercase, and put them into a set (for fast membership test).
2. glob is used to find all filenames ending in '.srt'.
3. For each input file, the output filename is derived as '..._new.srt'.
4. Only the line with i == 2 is modified (i.e. the 3rd line, since enumerate by default starts at 0).
5. line.strip() removes the trailing newline.
6. We could have used line.strip().split() to split the line into words, but it would have left 'good.' as the last word; not good. The regex used is often used to split words (in particular, it keeps apostrophes inside words such as "don't"; it may or may not be what you want, adapt at will of course).
7. We split on the capturing group r"([\w']+)" instead of splitting on non-word chars, so that we have both words and what separates them in parts. For example, 'People, who are good.' becomes ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.'].
8. The words themselves are every other element of parts, starting at index 1.
9. Words are replaced by '*' if their lowercase form is not in keep_words.
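The difference between a plain whitespace split and a split on a capturing group can be checked on the example sentence. In this short sketch the vocabulary `keep_words` is a hypothetical sample chosen only for illustration.

```python
import re

sentence = "People, who are good."

# A plain whitespace split leaves punctuation glued to the words.
words_only = sentence.split()
print(words_only)
# ['People,', 'who', 'are', 'good.']

# Splitting on a capturing group keeps the separators as well:
# even indices hold separators, odd indices hold words.
parts = re.split(r"([\w']+)", sentence)
print(parts)
# ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.']

keep_words = {"people", "are", "good"}  # hypothetical vocabulary
masked_parts = parts.copy()
masked_parts[1::2] = [w if w.lower() in keep_words else "*" for w in parts[1::2]]
masked = "".join(masked_parts)
print(masked)
# People, * are good.
```

Because the separators are preserved in `parts`, joining the list back together restores the original punctuation and spacing exactly.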