且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

在 Python 中将字符串分解为单个单词

更新时间:2023-02-09 22:21:31

我们尝试从一组已知词中将域名 (s) 拆分为任意数量的词(不仅仅是 2 个)(words).递归ftw!

We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!

def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest

如果它在words 中,这个迭代器函数首先产生它被调用的字符串.然后它以各种可能的方式将字符串分成两部分.如果第一部分不在 words 中,它会尝试下一个拆分.如果是,则第一部分将添加到第二部分调用自身的所有结果之前(可能没有,例如 ["example", "cart", ...])

This iterator function first yields the string it is called with if it is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (which may be none, like in ["example", "cart", ...])

然后我们建立英语词典:

Then we build the english dictionary:

# Assuming Linux. Word list may also be at /usr/dict/words. 
# If not on Linux, grab yourself an enlish word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())

# The above english dictionary for some reason lists all single letters as words.
# Remove all except "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")

# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))

# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast")) 

现在我们可以把事情放在一起:

Now we can put things together:

count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk", 
    "exampledeals.org", "examplesummeroffers.com"]

# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match

结果:{'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, '交易': 1, '例子': 4}

使用 set 来包含英语词典可以进行快速的成员资格检查.-= 从集合中删除项目,|= 添加.

Using a set to contain the english dictionary makes for fast membership checks. -= removes items from the set, |= adds to it.

all 函数与 生成器表达式 一起使用可提高效率,因为 all 返回第一个 False.

Using the all function together with a generator expression improves efficiency, since all returns on the first False.

某些子字符串可能是一个完整的或拆分的有效单词,例如example"/ex"+ample".对于某些情况,我们可以通过排除不需要的词来解决问题,例如上面代码示例中的ex".对于其他的,比如pensions"/pens"+ions",这可能是不可避免的,当发生这种情况时,我们需要防止字符串中的所有其他单词被多次计数(一次为pensions",一次为对于笔"+离子").我们通过跟踪在一个集合中找到的每个域名的词来做到这一点——集合忽略重复——然后在找到所有词后对这些词进行计数.

Some substrings may be a valid word both as either a whole or split, such as "example" / "ex" + "ample". For some cases we can solve the problem by excluding unwanted words, such as "ex" in the above code example. For others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens, we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and then count the words once all have been found.

重组并添加了大量评论.强制字符串小写,以避免因大写而遗漏.还添加了一个列表来跟踪没有匹配单词组合的域名.

Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.

NECROMANCY 更改了子字符串函数,使其扩展性更好.对于超过 16 个字符左右的域名,旧版本的速度慢得离谱.仅使用上述四个域名,我自己的运行时间从 3.6 秒提高到 0.2 秒!

NECROMANCY Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!