How to group and sum when all elements of one list are in another list

Updated: 2023-11-28 15:28:10


I have a data frame df1. "transactions" column has an array of int.

id     transactions
1      [1,2,3]
2      [2,3]

data frame df2. "items" column has an array of int.

items  cost
[1,2]  2.0
[2]    1.0
[2,4]  4.0

I need to check whether all elements of items are in each transaction; if so, sum up the costs.

Expected Result

id    transactions    score
1     [1,2,3]         3.0
2     [2,3]           1.0
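For reference, the example frames can be built like this (assuming pandas):

```python
import pandas as pd

# example data from the question
df1 = pd.DataFrame({"id": [1, 2], "transactions": [[1, 2, 3], [2, 3]]})
df2 = pd.DataFrame({"items": [[1, 2], [2], [2, 4]], "cost": [2.0, 1.0, 4.0]})
print(df1)
print(df2)
```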

I did the following

# cross join
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()],
                         right.values[ib2.ravel()]]))

out = cartesian_product_simplified(df1, df2)

# column names assigning
out.columns = ['id', 'transactions', 'cost', 'items']

# converting pandas series to list
t = out["transactions"].tolist()
item = out["items"].tolist()

# check list present in another list
def check(trans, itm):
    out_list = list()
    for row in trans:
        ret = np.all(np.in1d(itm, row))
        out_list.append(ret)
    return out_list

# if true: group and sum
a = check(t, item)
for i in a:
    if i:
        print(out.groupby(['id','transactions']))['cost'].sum()
    else:
        print("no")

Throws TypeError: 'NoneType' object is not subscriptable.

I am new to Python and don't know how to put all these together. How do I group by and sum the cost when all items of one list are in another list?

The simplest way is just to check all items for all transactions:

# df1 and df2 are initialized

def sum_score(transaction):
    score = 0
    for _, row in df2.iterrows():
        # add the cost if every item of this row is in the transaction
        if all(item in transaction for item in row["items"]):
            score += row["cost"]
    return score

df1["score"] = df1["transactions"].map(sum_score)
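With the example frames from the question, this brute-force version already gives the expected result; a self-contained check (pandas assumed):

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "transactions": [[1, 2, 3], [2, 3]]})
df2 = pd.DataFrame({"items": [[1, 2], [2], [2, 4]], "cost": [2.0, 1.0, 4.0]})

def sum_score(transaction):
    score = 0
    for _, row in df2.iterrows():
        # add the cost if every item of this row is in the transaction
        if all(item in transaction for item in row["items"]):
            score += row["cost"]
    return score

df1["score"] = df1["transactions"].map(sum_score)
print(df1["score"].tolist())  # [3.0, 1.0]
```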

It will be extremely slow at scale. If that is a problem, we should not iterate over every row of df2 but preselect only the possible ones. If you have enough memory, it can be done like this: for each item we remember all the row numbers of df2 where it appears, so for each transaction we collect its items, gather all possibly matching rows, and check only those.

import collections

# df1 and df2 are initialized

def get_sum_score_precalculated_func(items_cost_df):

    # create a dict of possible indexes to search for an item
    items_search_dict = collections.defaultdict(set)
    for i, (_, row) in enumerate(items_cost_df.iterrows()):
        for item in row["items"]:
            items_search_dict[item].add(i)

    def sum_score(transaction):
        # union of all row numbers that could possibly match
        possible_indexes = set()
        for i in transaction:
            possible_indexes |= items_search_dict[i]

        score = 0
        for i in possible_indexes:
            row = items_cost_df.iloc[i]
            if all(item in transaction for item in row["items"]):
                score += row["cost"]
        return score

    return sum_score

df1["score"] = df1["transactions"].map(get_sum_score_precalculated_func(df2))

Here I use set, which is an unordered store of unique values (it helps to join possible row numbers and avoid double counting), and collections.defaultdict, which is a usual dict except that accessing a missing key fills it with the given default (a blank set in my case); that avoids writing if x not in my_dict: my_dict[x] = set(). I also use a so-called "closure", which means the sum_score function keeps access to items_cost_df and items_search_dict, which were in scope where sum_score was declared, even after sum_score is returned and get_sum_score_precalculated_func has finished.
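A minimal, standalone illustration of those building blocks (defaultdict filling in a blank set, sets joined with |=, and a closure remembering its enclosing scope; the names here are made up for the demo):

```python
import collections

index = collections.defaultdict(set)   # missing keys get a blank set
index["apple"].add(0)
index["apple"].add(0)                  # duplicates are ignored by the set
index["pear"] |= {1, 2}                # sets are joined with |=
print(index["apple"])                  # {0}
print(sorted(index["pear"]))           # [1, 2]

def make_scorer(bonus):
    def score(x):
        # 'bonus' is still visible here: a closure over make_scorer's scope
        return x + bonus
    return score

scorer = make_scorer(10)
print(scorer(5))                       # 15
```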

That should be much faster when the items are quite unique and each can be found in only a few rows of df2.

If you have quite a few unique items but many identical transactions, you'd better calculate the score for each unique transaction first and then just join the results.

transaction_score = []
for transaction in df1["transactions"].unique():
    score = sum_score(transaction)
    transaction_score.append([transaction, score])
transaction_score = pd.DataFrame(
    transaction_score,
    columns=["transactions", "score"])
df1 = df1.merge(transaction_score, on="transactions", how="left")

Here I use sum_score from the first code example.
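One caveat: .unique() and merge need hashable values, and Python lists are not hashable, so transactions stored as lists must be converted to tuples first. A sketch with the question's data (the extra "key" column is my own helper name, not from the original):

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3],
                    "transactions": [[1, 2, 3], [2, 3], [2, 3]]})
df2 = pd.DataFrame({"items": [[1, 2], [2], [2, 4]], "cost": [2.0, 1.0, 4.0]})

def sum_score(transaction):
    score = 0
    for _, row in df2.iterrows():
        if all(item in transaction for item in row["items"]):
            score += row["cost"]
    return score

# hashable key: tuples instead of lists
keys = df1["transactions"].map(tuple)
transaction_score = pd.DataFrame(
    [[t, sum_score(t)] for t in keys.unique()],
    columns=["key", "score"])
df1 = (df1.assign(key=keys)
          .merge(transaction_score, on="key", how="left")
          .drop(columns="key"))
print(df1["score"].tolist())  # [3.0, 1.0, 1.0]
```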

P.S. The Python error message should come with a line number, which helps a lot in understanding the problem. In your snippet the cause is that ['cost'] is applied to the result of print, which returns None, hence the TypeError.