且构网

Removing punctuation from a list

Last updated: 2023-11-13 19:07:40

I'm working on taking a sample of the Declaration of Independence and calculating the frequency of the length of words in it.

Sample text from file:

"When in the Course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires 
that they should declare the causes which impel them to the separation."

Note: The word length must not count any punctuation, e.g. any character from string.punctuation.

Expected Outcome (sample):

Length Count
1 16
2 267
3 267
4 169
5 140
6 112
7 99
8 68
9 61
10 56
11 35
12 13
13 9
14 7
15 2

I'm currently stuck on removing punctuation from the file that I've converted into a list.

Here is what I've tried so far:

import sys
import string

def format_text(fname):
        punc = set(string.punctuation)
        words = fname.read().split()
        return ''.join(word for word in words if word not in punc)

try:
    with open(sys.argv[1], 'r') as file_arg:
        file_arg.read()
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

fname = open(sys.argv[1], 'r')
formatted_text = format_text(fname)
print(formatted_text)

You can strip the punctuation from each word and also avoid reading the whole file into memory by iterating over the file line by line:

def format_text(fname):
    punc = string.punctuation
    # Iterating over the open file yields one line at a time.
    return ' '.join(word.strip(punc) for line in fname for word in line.split())
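To see why the filter in the question leaves the text unchanged, here is a small sketch (the sample words are made up for illustration). The test word not in punc only drops a token that is itself a single punctuation character, whereas strip trims punctuation from both ends of every token.

```python
import string

punc = string.punctuation
words = "the earth, entitle them, (a) separation. -".split()

# The question's test compares each whole token against the set of
# punctuation characters, so only the lone "-" is dropped.
filtered = [word for word in words if word not in set(punc)]

# strip() removes leading and trailing punctuation from every token.
# The lone "-" becomes an empty string, which is filtered out here.
stripped = [word.strip(punc) for word in words]
stripped = [word for word in stripped if word]

print(filtered)  # ['the', 'earth,', 'entitle', 'them,', '(a)', 'separation.']
print(stripped)  # ['the', 'earth', 'entitle', 'them', 'a', 'separation']
```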

If you also want to remove the ' from Nature's, you will need str.translate, because strip only removes leading and trailing characters:

from string import punctuation

# Keys are the code points (ord) of the characters to replace;
# values are what to replace them with (here, nothing).
tbl = {ord(k): "" for k in punctuation}

def format_text(fname):
    return ' '.join(line.translate(tbl) for line in fname)
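The difference between the two approaches shows up on a single word. The sketch below compares them on a made-up token and also builds an equivalent table with str.maketrans, which is an alternative to the hand-written dict rather than something the answer itself uses.

```python
from string import punctuation

# Table built by hand: code point -> replacement string.
tbl = {ord(k): "" for k in punctuation}
# str.maketrans with three arguments builds an equivalent table; the third
# argument lists the characters to delete (each is mapped to None).
tbl2 = str.maketrans("", "", punctuation)

word = "Nature's,"
print(word.strip(punctuation))  # Nature's  (only the trailing comma goes)
print(word.translate(tbl))      # Natures
print(word.translate(tbl2))     # Natures
```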

To get the frequency, use a Counter dict:

from collections import Counter
freq = Counter(len(word.translate(tbl)) for line in fname for word in line.split())

Or, depending on your approach:

freq = Counter(len(word.strip(punc)) for line in fname for word in line.split())
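A small, self-contained sketch of both variants, using io.StringIO as a stand-in for the open file (the two-line sample text is made up). The only difference in the result is how Nature's is counted: 7 characters with translate, 8 with strip.

```python
import io
from collections import Counter
from string import punctuation

# io.StringIO stands in for an open text file; iterating it yields lines.
fname = io.StringIO("the Laws of Nature\nand of Nature's God\n")

tbl = {ord(k): "" for k in punctuation}
by_translate = Counter(len(word.translate(tbl)) for line in fname for word in line.split())

fname.seek(0)  # rewind; a file object is exhausted after one full pass
by_strip = Counter(len(word.strip(punctuation)) for line in fname for word in line.split())

print(by_translate)
print(by_strip)
```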

Using the lines in your question above as an example:

lines = """When in the Course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires
that they should declare the causes which impel them to the separation."""

from collections import Counter
from string import punctuation

freq = Counter(len(word.strip(punctuation)) for line in lines.splitlines() for word in line.split())
print(freq.most_common())

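This prints a list of (length, count) tuples ordered from the most common word length to the least common; the first element of each tuple is the word length and the second is its frequency. Lengths with equal counts may appear in a different order depending on your Python version: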

[(3, 15), (2, 12), (4, 9), (5, 9), (6, 9), (7, 7), (8, 5), (9, 3), (1, 1), (10, 1)]

If you want to output the frequencies in order of word length, starting from one-letter words, without sorting:

# The keys are word lengths, so the largest key is the longest word.
mx = max(freq)
for i in range(1, mx + 1):
    v = freq[i]
    if v:
        print("length {} words appeared {} time/s.".format(i, v))

Output:

length 1 words appeared 1 time/s.
length 2 words appeared 12 time/s.
length 3 words appeared 15 time/s.
length 4 words appeared 9 time/s.
length 5 words appeared 9 time/s.
length 6 words appeared 9 time/s.
length 7 words appeared 7 time/s.
length 8 words appeared 5 time/s.
length 9 words appeared 3 time/s.
length 10 words appeared 1 time/s.
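If sorting is acceptable after all, a short alternative is to sort the (length, count) pairs by key, which avoids looping over lengths that never occur. This is a swapped-in technique, not the approach shown above; a minimal sketch with made-up counts:

```python
from collections import Counter

# Hypothetical counts, keyed by word length.
freq = Counter({3: 15, 2: 12, 10: 1, 1: 1})

# sorted() on (length, count) pairs orders them by length first.
rows = sorted(freq.items())
for length, count in rows:
    print("length {} words appeared {} time/s.".format(length, count))
```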

Unlike a normal dict, a Counter does not raise a KeyError for a missing key; it returns 0, so the if v check is only true for word lengths that appeared in the file.
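The missing-key behaviour can be checked directly. In this small sketch (the values are arbitrary), looking up an absent length returns 0 and does not add the key to the Counter, while the same lookup on a plain dict raises KeyError.

```python
from collections import Counter

freq = Counter([3, 3, 2])  # two words of length 3, one of length 2
plain = dict(freq)

print(freq[11])    # 0
print(11 in freq)  # False: the lookup did not insert the key

try:
    plain[11]
    raised = False
except KeyError:
    raised = True
print(raised)      # True
```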

If you also want to print the cleaned data, put all the logic in functions:

import sys
import string
from collections import Counter


def clean_text(fname):
    punc = string.punctuation
    return [word.strip(punc) for line in fname for word in line.split()]


def get_freq(cleaned):
    return Counter(len(word) for word in cleaned)


def freq_output(d):
    # The keys are word lengths, so the largest key is the longest word.
    mx = max(d, default=0)
    for i in range(1, mx + 1):
        v = d[i]
        if v:
            print("length {} words appeared {} time/s.".format(i, v))

try:
    path = sys.argv[1]
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

# Read inside a with block so the file is closed once the words are collected.
with open(path, 'r') as fname:
    formatted_text = clean_text(fname)

print(" ".join(formatted_text))
print()
freq = get_freq(formatted_text)

freq_output(freq)

Run on the snippet from your question, this outputs:

~$ python test.py test.txt
When in the Course of human events it becomes necessary for one people  
to dissolve the political bands which have connected them with another
and to assume among the powers of the earth the separate and equal station 
 to which the Laws of Nature and of Nature's God entitle them a decent 
respect to the opinions of mankind requires that they should declare 
the causes which impel them to the separation

length 1 words appeared 1 time/s.
length 2 words appeared 12 time/s.
length 3 words appeared 15 time/s.
length 4 words appeared 9 time/s.
length 5 words appeared 9 time/s.
length 6 words appeared 9 time/s.
length 7 words appeared 7 time/s.
length 8 words appeared 5 time/s.
length 9 words appeared 3 time/s.
length 10 words appeared 1 time/s.

If you only care about the frequency output, do it all in one pass:

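import sys
from collections import Counter
from string import punctuation


def freq_output(fname):
    tbl = {ord(k): "" for k in punctuation}
    # Use only one of the two lines below: a file can be iterated just once,
    # so a second pass over fname would see no lines.
    d = Counter(len(word.translate(tbl)) for line in fname for word in line.split())
    # d = Counter(len(word.strip(punctuation)) for line in fname for word in line.split())
    # The keys are word lengths, so the largest key is the longest word.
    mx = max(d, default=0)
    for i in range(1, mx + 1):
        v = d[i]
        if v:
            print("length {} words appeared {} time/s.".format(i, v))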


try:
    path = sys.argv[1]
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

with open(path, 'r') as fname:
    freq_output(fname)

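Keep only the assignment to d that matches the approach you want. The file can be iterated only once, so the two cannot both run on the same open file.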
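The question asks for a two-column Length/Count table rather than sentences. A minimal sketch of producing that layout, assuming the translate approach, might look like this; the name length_table and the sample lines are hypothetical.

```python
from collections import Counter
from string import punctuation


def length_table(lines):
    """Return 'Length Count' rows for an iterable of text lines (hypothetical helper)."""
    tbl = str.maketrans("", "", punctuation)
    counts = Counter(len(word.translate(tbl)) for line in lines for word in line.split())
    counts.pop(0, None)  # tokens made only of punctuation end up with length 0
    rows = ["Length Count"]
    rows.extend("{} {}".format(length, counts[length]) for length in sorted(counts))
    return "\n".join(rows)


table = length_table(["the Laws of Nature", "and of Nature's God -"])
print(table)
```

Any iterable of lines works here, including an open file, so the function can be called as length_table(fname) inside a with block.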