Counting specific occurrences in a CSV file in Python

Updated: 2022-12-10 09:41:14

Since someone's already posted a defaultdict solution, I'm going to give a pandas one, just for variety. pandas is a very handy library for data processing. Among other nice features, it can handle this counting problem in one line, depending on what kind of output is required. Really:

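import pandas as pd
df = pd.read_csv("cluster.csv")
# Count rows for each (Cluster_id, User, Quality) combination
counted = df.groupby(["Cluster_id", "User", "Quality"]).size()
counted.to_csv("counted.csv")  # save the counts, not the original df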
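
To make the one-liner concrete without needing the real file, here is a minimal sketch on a tiny in-memory CSV. The sample rows and the `count` column name are made up for illustration; only the column layout follows cluster.csv.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for cluster.csv
csv_text = """Tag,User,Quality,Cluster_id
bbb,u001,bad,39
bag,u003,good,11
bag,u003,good,11
bag,u003,bad,11
"""
df = pd.read_csv(io.StringIO(csv_text))

# size() counts the rows in each (Cluster_id, User, Quality) group
counted = df.groupby(["Cluster_id", "User", "Quality"]).size()

# reset_index turns the group keys back into ordinary columns and
# gives the counts a column name (an arbitrary choice here)
counted_df = counted.reset_index(name="count")
print(counted_df)
```

Writing `counted_df` with `to_csv(..., index=False)` would then give a flat file with one row per combination, which may be easier to read back than the indexed Series.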

-

Just to give a trailer for what pandas makes easy, we can load the file -- the main data storage object in pandas is called a "DataFrame":

>>> import pandas as pd
>>> df = pd.read_csv("cluster.csv")
>>> df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 499999
Data columns:
Tag           500000  non-null values
User          500000  non-null values
Quality       500000  non-null values
Cluster_id    500000  non-null values
dtypes: int64(1), object(3)

We can check that the first few rows look okay:

>>> df[:5]
   Tag  User Quality  Cluster_id
0  bbb  u001     bad          39
1  bbb  u002     bad          36
2  bag  u003    good          11
3  bag  u004    good           9
4  bag  u005     bad          26

Then we can group by Cluster_id and User, and do work on each group:

>>> for name, group in df.groupby(["Cluster_id", "User"]):
...     print('group name:', name)
...     print('group rows:')
...     print(group)
...     print('counts of Quality values:')
...     print(group["Quality"].value_counts())
...     input()  # pause until Enter is pressed
...     
group name: (1, 'u003')
group rows:
        Tag  User Quality  Cluster_id
372002  xxx  u003     bad           1
counts of Quality values:
bad    1

group name: (1, 'u004')
group rows:
           Tag  User Quality  Cluster_id
126003  ground  u004     bad           1
348003  ground  u004    good           1
counts of Quality values:
good    1
bad     1

group name: (1, 'u005')
group rows:
           Tag  User Quality  Cluster_id
42004   ground  u005     bad           1
258004  ground  u005     bad           1
390004  ground  u005     bad           1
counts of Quality values:
bad    3
[etc.]
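
If a table is more convenient than stepping through the groups one by one, the same per-group Quality counts can probably be laid out with one column per Quality value. This is only a sketch: the rows below are hypothetical, chosen to mimic the three groups shown above rather than taken from the real file.

```python
import pandas as pd

# Hypothetical rows mimicking the structure of cluster.csv
df = pd.DataFrame({
    "Tag": ["xxx", "ground", "ground", "ground", "ground", "ground"],
    "User": ["u003", "u004", "u004", "u005", "u005", "u005"],
    "Quality": ["bad", "bad", "good", "bad", "bad", "bad"],
    "Cluster_id": [1, 1, 1, 1, 1, 1],
})

# Count rows per (Cluster_id, User, Quality), then pivot Quality into
# columns; fill_value=0 covers combinations that never occur
table = (
    df.groupby(["Cluster_id", "User", "Quality"])
      .size()
      .unstack("Quality", fill_value=0)
)
print(table)
```

Each row of `table` corresponds to one (Cluster_id, User) group, so it carries the same information as the `value_counts()` output in the loop, just in a single frame.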

If you're going to be doing a lot of processing of csv files, it's definitely worth having a look at.