Regular expression for extracting text from an RTF string

Updated: 2022-10-18 17:53:16

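In RTF, { and } mark a group. Groups can be nested. \ marks the beginning of a control word. A control word ends with a space or a non-alphabetic character. A control word can be followed by a numeric parameter, with no delimiter in between. Some control words also take text parameters, separated by ';'. Those control words are usually in their own groups.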
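The syntax rules above can be sketched as a tiny tokenizer. This is only an illustration of groups, control words, and numeric parameters; the names TOKEN and tokenize and the sample string are hypothetical, and the pattern is not the stripping pattern given in this answer.

```python
import re

# Hypothetical token pattern, for illustration only.
TOKEN = re.compile(
    r"(?P<group_open>\{)"                           # { opens a group
    r"|(?P<group_close>\})"                         # } closes a group
    r"|\\(?P<word>[A-Za-z]+)(?P<param>-?\d+)? ?"    # \word, optional number, optional space
    r"|(?P<text>[^{}\\]+)"                          # anything else is plain text
)

def tokenize(rtf):
    """Yield (kind, value) pairs for a small subset of RTF syntax."""
    for m in TOKEN.finditer(rtf):
        if m.group("group_open"):
            yield ("open", "{")
        elif m.group("group_close"):
            yield ("close", "}")
        elif m.group("word"):
            # The numeric parameter is None when the control word has none.
            yield ("control", (m.group("word"), m.group("param")))
        elif m.group("text"):
            yield ("text", m.group("text"))

sample = r"{\rtf1{\fonttbl{\f0 Arial;}}\fs24 Hello}"
tokens = list(tokenize(sample))
for token in tokens:
    print(token)
```

Here \rtf1 is read as the control word rtf with parameter 1, and \f0 is followed by the text parameter "Arial;" inside its own group.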

I think I have managed to put together a pattern that handles most cases.

\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?

It does leave a few spaces behind when run on your sample, though.
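As a usage sketch, the pattern can be applied with re.sub to delete every match. The sample string below is a made-up minimal document, not the one from the question, and the names PATTERN and stripped are only illustrative.

```python
import re

# The pattern from this answer, compiled unchanged.
PATTERN = re.compile(r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?")

# Hypothetical minimal RTF document.
sample = r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}\f0\fs24 Hello, world!\par}"

# Remove every group, brace, and control word that the pattern matches.
stripped = PATTERN.sub("", sample)
print(repr(stripped))
```

On this sample the font table group, the braces, and the control words are all removed, leaving only the text.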

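Going through the RTF specification (some of it), I see that pure regex-based strippers have a lot of pitfalls. The most obvious one is that some groups should be ignored (headers, footers, etc.), while others should be rendered (formatting).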
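A small sketch of that pitfall, using the regex from this answer on a hypothetical sample: the header text should be dropped and the bold text should be kept, but the pattern cannot tell the two kinds of group apart.

```python
import re

PATTERN = re.compile(r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?")

# Hypothetical sample: a header group (should be ignored) and a bold
# formatting group (should be rendered).
sample = r"{\rtf1{\header Draft {\i v2}}Body {\b bold} text}"

result = PATTERN.sub("", sample)
print(repr(result))

# "Draft" leaks out of the header, while "bold" is deleted together
# with its formatting group, leaving a doubled space behind.
```

The innermost groups {\i v2} and {\b bold} are removed whole because they start with a control word, whereas the header text that sits outside a nested group survives.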

I have written a Python script that should work better than my regex above:

import re

try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3: chr() covers every code point

def striprtf(text):
    pattern = re.compile(r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)", re.I)
    # Control words that specify a "destination".
    destinations = frozenset((
        'aftncn','aftnsep','aftnsepc','annotation','atnauthor','atndate','atnicn','atnid',
        'atnparent','atnref','atntime','atrfend','atrfstart','author','background',
        'bkmkend','bkmkstart','blipuid','buptim','category','colorschememapping',
        'colortbl','comment','company','creatim','datafield','datastore','defchp','defpap',
        'do','doccomm','docvar','dptxbxtext','ebcend','ebcstart','factoidname','falt',
        'fchars','ffdeftext','ffentrymcr','ffexitmcr','ffformat','ffhelptext','ffl',
        'ffname','ffstattext','field','file','filetbl','fldinst','fldrslt','fldtype',
        'fname','fontemb','fontfile','fonttbl','footer','footerf','footerl','footerr',
        'footnote','formfield','ftncn','ftnsep','ftnsepc','g','generator','gridtbl',
        'header','headerf','headerl','headerr','hl','hlfr','hlinkbase','hlloc','hlsrc',
        'hsv','htmltag','info','keycode','keywords','latentstyles','lchars','levelnumbers',
        'leveltext','lfolevel','linkval','list','listlevel','listname','listoverride',
        'listoverridetable','listpicture','liststylename','listtable','listtext',
        'lsdlockedexcept','macc','maccPr','mailmerge','maln','malnScr','manager','margPr',
        'mbar','mbarPr','mbaseJc','mbegChr','mborderBox','mborderBoxPr','mbox','mboxPr',
        'mchr','mcount','mctrlPr','md','mdeg','mdegHide','mden','mdiff','mdPr','me',
        'mendChr','meqArr','meqArrPr','mf','mfName','mfPr','mfunc','mfuncPr','mgroupChr',
        'mgroupChrPr','mgrow','mhideBot','mhideLeft','mhideRight','mhideTop','mhtmltag',
        'mlim','mlimloc','mlimlow','mlimlowPr','mlimupp','mlimuppPr','mm','mmaddfieldname',
        'mmath','mmathPict','mmathPr','mmaxdist','mmc','mmcJc','mmconnectstr',
        'mmconnectstrdata','mmcPr','mmcs','mmdatasource','mmheadersource','mmmailsubject',
        'mmodso','mmodsofilter','mmodsofldmpdata','mmodsomappedname','mmodsoname',
        'mmodsorecipdata','mmodsosort','mmodsosrc','mmodsotable','mmodsoudl',
        'mmodsoudldata','mmodsouniquetag','mmPr','mmquery','mmr','mnary','mnaryPr',
        'mnoBreak','mnum','mobjDist','moMath','moMathPara','moMathParaPr','mopEmu',
        'mphant','mphantPr','mplcHide','mpos','mr','mrad','mradPr','mrPr','msepChr',
        'mshow','mshp','msPre','msPrePr','msSub','msSubPr','msSubSup','msSubSupPr','msSup',
        'msSupPr','mstrikeBLTR','mstrikeH','mstrikeTLBR','mstrikeV','msub','msubHide',
        'msup','msupHide','mtransp','mtype','mvertJc','mvfmf','mvfml','mvtof','mvtol',
        'mzeroAsc','mzeroDesc','mzeroWid','nesttableprops','nextfile','nonesttables',
        'objalias','objclass','objdata','object','objname','objsect','objtime','oldcprops',
        'oldpprops','oldsprops','oldtprops','oleclsid','operator','panose','password',
        'passwordhash','pgp','pgptbl','picprop','pict','pn','pnseclvl','pntext','pntxta',
        'pntxtb','printim','private','propname','protend','protstart','protusertbl','pxe',
        'result','revtbl','revtim','rsidtbl','rxe','shp','shpgrp','shpinst',
        'shppict','shprslt','shptxt','sn','sp','staticval','stylesheet','subject','sv',
        'svb','tc','template','themedata','title','txe','ud','upr','userprops',
        'wgrffmtfilter','windowcaption','writereservation','writereservhash','xe','xform',
        'xmlattrname','xmlattrvalue','xmlclose','xmlname','xmlnstbl',
        'xmlopen',
    ))
    # Translation of some special characters.
    specialchars = {
        'par': '\n',
        'sect': '\n\n',
        'page': '\n\n',
        'line': '\n',
        'tab': '\t',
        'emdash': u'\u2014',
        'endash': u'\u2013',
        'emspace': u'\u2003',
        'enspace': u'\u2002',
        'qmspace': u'\u2005',
        'bullet': u'\u2022',
        'lquote': u'\u2018',
        'rquote': u'\u2019',
        'ldblquote': u'\u201C',
        'rdblquote': u'\u201D',
    }
    stack = []
    ignorable = False       # Whether this group (and everything inside it) is "ignorable".
    ucskip = 1              # Number of ASCII characters to skip after a Unicode character.
    curskip = 0             # Number of ASCII characters left to skip.
    out = []                # Output buffer.
    for match in pattern.finditer(text):
        word, arg, hexval, char, brace, tchar = match.groups()
        if brace:
            curskip = 0
            if brace == '{':
                # Push state
                stack.append((ucskip, ignorable))
            elif brace == '}':
                # Pop state
                ucskip, ignorable = stack.pop()
        elif char:  # \x (not a letter)
            curskip = 0
            if char == '~':
                if not ignorable:
                    out.append(u'\xA0')
            elif char in '{}\\':
                if not ignorable:
                    out.append(char)
            elif char == '*':
                ignorable = True
        elif word:  # \foo
            curskip = 0
            if word in destinations:
                ignorable = True
            elif ignorable:
                pass
            elif word in specialchars:
                out.append(specialchars[word])
            elif word == 'uc':
                ucskip = int(arg)
            elif word == 'u':
                c = int(arg)
                # \u takes a signed 16-bit value; map negatives to 0x8000-0xFFFF.
                if c < 0: c += 0x10000
                if c > 127: out.append(unichr(c))
                else: out.append(chr(c))
                curskip = ucskip
        elif hexval:  # \'xx
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                c = int(hexval, 16)
                if c > 127: out.append(unichr(c))
                else: out.append(chr(c))
        elif tchar:
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                out.append(tchar)
    return ''.join(out)

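It works by parsing the RTF code and skipping any group that has a "destination" specified, as well as all "ignorable" groups ({\*...}). I also added handling of some special characters.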
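The group-skipping idea can be shown in isolation with a much smaller sketch. It assumes a five-entry destination set and handles only \par as a special character; strip_groups, TOKEN, and DESTINATIONS are illustrative names, and the sample document is made up. It also assumes balanced braces.

```python
import re

# Small illustrative subset; a real stripper needs a far longer list.
DESTINATIONS = frozenset(("fonttbl", "colortbl", "header", "footer", "info"))

TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|\\([^a-z])|([{}])|[\r\n]+|(.)", re.I)

def strip_groups(rtf):
    stack = []          # saved `ignorable` flags, one per open group
    ignorable = False   # True while inside a group that must not be rendered
    out = []
    for m in TOKEN.finditer(rtf):
        word, _arg, symbol, brace, char = m.groups()
        if brace == "{":
            stack.append(ignorable)      # entering a group: remember outer state
        elif brace == "}":
            ignorable = stack.pop()      # leaving a group: restore outer state
        elif symbol == "*" or (word and word in DESTINATIONS):
            ignorable = True             # {\*...} or a destination: skip this group
        elif word == "par" and not ignorable:
            out.append("\n")
        elif char and not ignorable:
            out.append(char)             # plain text in a rendered group
    return "".join(out)

sample = (r"{\rtf1{\fonttbl{\f0 Arial;}}{\*\generator Foo;}"
          r"{\header Draft}Body {\b bold} text\par}")
text = strip_groups(sample)
print(repr(text))
```

Because the flag is saved on { and restored on }, a destination such as \header hides everything nested inside it, while a formatting group such as {\b bold} inherits the rendered state of its parent and keeps its text.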

The striprtf script is missing a lot of features that a full parser would need, but it should be sufficient for simple documents.

UPDATE: A version of this script updated to run on Python 3.x is available at this URL: