且构网

Consistent ColumnTransformer for intersecting lists of columns

Updated: 2023-02-01 22:50:45

I want to use sklearn.compose.ColumnTransformer sequentially (not in parallel, so the second transformer should run only after the first) on intersecting lists of columns, like this:

import numpy as np
import pandas as pd
from sklearn import compose, impute, pipeline
from sklearn import preprocessing as p

log_transformer = p.FunctionTransformer(np.log)
df = pd.DataFrame({'a': [1, 2, np.nan, 4], 'b': [1, np.nan, 3, 4], 'c': [1, 2, 3, 4]})
compose.ColumnTransformer(n_jobs=1,
                         transformers=[
                             ('num', impute.SimpleImputer(), ['a', 'b']),
                             ('log', log_transformer, ['b', 'c']),
                             ('scale', p.StandardScaler(), ['a', 'b', 'c'])
                         ]).fit_transform(df)

So, I want to use SimpleImputer for 'a', 'b', then log for 'b', 'c', and then StandardScaler for 'a', 'b', 'c'.

But:

  1. I get an array of shape (4, 7).
  2. I still get NaN in the a and b columns.

So, how can I use ColumnTransformer for different columns in the manner of Pipeline?

UPD:

pipe_1 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=42)),
])

pipe_2 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=24)),
])

pipe_3 = pipeline.Pipeline(steps=[
    ('scl', p.StandardScaler()),
])

# in the real situation I don't know exactly what cols these arrays contain, so they are not static: 
cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']

proc = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('1', pipe_1, cols_1),
    ('2', pipe_2, cols_2),
    ('3', pipe_3, cols_3),
])
proc.fit_transform(df).T

Output:

array([[ 1.        ,  2.        , 42.        ,  4.        ],
       [ 1.        , 24.        ,  3.        ,  4.        ],
       [-1.06904497, -0.26726124,         nan,  1.33630621],
       [-1.33630621,         nan,  0.26726124,  1.06904497],
       [-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079]])

I understand why I get duplicated columns, NaNs and unscaled values, but how can I fix this properly when the column lists are not static?

UPD2:

A problem may arise when the columns change their order. So I want to use a FunctionTransformer for column selection:

def select_col(X, cols=None):
    return X[cols]

ct1 = compose.make_column_transformer(
    (p.OneHotEncoder(), p.FunctionTransformer(select_col, kw_args=dict(cols=['a', 'b']))),
    remainder='passthrough'
)

ct1.fit(df)

But get this output:

ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed

How can I fix it?
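One possible way around this error, sketched under assumptions: the column specifier of a ColumnTransformer has to be a column selection, not a transformer, but it may also be a callable that receives the input X and returns such a selection. The `select_cols` helper and its `wanted` parameter below are hypothetical names, and the data has no NaNs so that OneHotEncoder runs on any reasonably recent scikit-learn. Note that string column names are already resolved by name on a DataFrame, so a plain list like ['a', 'b'] should also survive a change in column order.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"a": [1, 2, 2, 4], "b": [1, 3, 3, 4], "c": [1, 2, 3, 4]})


def select_cols(X, wanted=("a", "b")):
    """Hypothetical selector: names of the wanted columns that exist in X."""
    return [c for c in X.columns if c in wanted]


ct = make_column_transformer(
    (OneHotEncoder(), select_cols),  # a callable, not a FunctionTransformer
    remainder="passthrough",
)

out = ct.fit_transform(df)
# Same selection when the incoming frame has its columns in another order.
out_reordered = ct.fit_transform(df[["c", "b", "a"]])
print(out.shape, out_reordered.shape)
```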

The intended usage of ColumnTransformer is that the different transformers are applied in parallel, not sequentially. To accomplish your desired outcome, three approaches come to mind:

First approach:

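from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# log_transformer is the FunctionTransformer defined in the question.
# One pipeline per column, each listing only the steps that column needs.
pipe_a = Pipeline(steps=[('imp', SimpleImputer()),
                         ('scale', StandardScaler())])
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
                         ('log', log_transformer),
                         ('scale', StandardScaler())])
pipe_c = Pipeline(steps=[('log', log_transformer),
                         ('scale', StandardScaler())])
proc = ColumnTransformer(transformers=[
    ('a', pipe_a, ['a']),
    ('b', pipe_b, ['b']),
    ('c', pipe_c, ['c'])]
)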
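When the column lists are not known in advance, the per-column pipelines can be built programmatically. The following is only a sketch under assumptions: `stages` and `build_per_column_transformer` are hypothetical names, and the stage lists mirror the ones from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

df = pd.DataFrame({"a": [1, 2, np.nan, 4],
                   "b": [1, np.nan, 3, 4],
                   "c": [1, 2, 3, 4]})

# Ordered stages: (step name, factory for a fresh transformer, columns).
# The column lists may overlap and may be computed at runtime.
stages = [
    ("imp", SimpleImputer, ["a", "b"]),
    ("log", lambda: FunctionTransformer(np.log), ["b", "c"]),
    ("scale", StandardScaler, ["a", "b", "c"]),
]


def build_per_column_transformer(stages, columns):
    """Hypothetical helper: one Pipeline per column, keeping the stage order."""
    transformers = []
    for col in columns:
        steps = [(name, make()) for name, make, cols in stages if col in cols]
        if steps:  # columns without any stage fall through to the remainder
            transformers.append((col, Pipeline(steps), [col]))
    return ColumnTransformer(transformers, remainder="passthrough")


proc = build_per_column_transformer(stages, df.columns)
out = proc.fit_transform(df)
print(out.shape)
```

Each column gets exactly the stages it appears in, in the declared order, so the result has one output column per input column instead of one per (transformer, column) pair.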

Second approach: chain one ColumnTransformer per stage inside a Pipeline. This one actually won't work as written, because each ColumnTransformer rearranges the columns (transformed columns first, then the remainder) and forgets the names*, so the later steps will fail or apply to the wrong columns. When sklearn finalizes how to pass along dataframes or feature names, this may be salvaged, or you may be able to tweak it for your specific use case now. (* ColumnTransformer does already have a get_feature_names, renamed get_feature_names_out in newer releases, but the actual data passed through the pipeline doesn't carry that information.)

imp_tfm = ColumnTransformer(
    transformers=[('num', SimpleImputer(), ['a', 'b'])],
    remainder='passthrough'
)
log_tfm = ColumnTransformer(
    transformers=[('log', log_transformer, ['b', 'c'])],
    remainder='passthrough'
)
scl_tfm = ColumnTransformer(
    transformers=[('scale', StandardScaler(), ['a', 'b', 'c'])]
)
proc = Pipeline(steps=[
    ('imp', imp_tfm),
    ('log', log_tfm),
    ('scale', scl_tfm)]
)

Third, you could use the Pipeline slicing feature to keep one "master" pipeline and cut it down for each feature. This works mostly like the first approach and might save some coding for larger pipelines, but it seems a little hacky. For example, here you could write:

from sklearn.base import clone

pipe_c = clone(pipe_b)[1:]                  # drop the imputer: log + scale
pipe_a = clone(pipe_b)
pipe_a.steps[1] = ('nolog', 'passthrough')  # disable the log step: imp + scale

(Without cloning or otherwise deep-copying pipe_b, the last line would change both pipe_a and pipe_b. Slicing returns a new Pipeline, so pipe_c doesn't strictly need to be cloned, but the slice is shallow and shares its step estimators with pipe_b, so I've left the clone in to be safe. Unfortunately you can't provide a discontinuous slice, so pipe_a = pipe_b[0,2] doesn't work, but you can set individual steps to "passthrough", as I've done above, to disable them.)
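A self-contained sketch of the same idea, with hypothetical names `no_impute` and `no_log` for the two cut-down pipelines. It uses set_params to switch the log step off, which is an alternative to assigning into steps and keeps the step name unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

df = pd.DataFrame({"a": [1, 2, np.nan, 4]})

# The "master" pipeline with every stage.
pipe_b = Pipeline(steps=[("imp", SimpleImputer()),
                         ("log", FunctionTransformer(np.log)),
                         ("scale", StandardScaler())])

# Contiguous slice of a clone: log + scale, independent of pipe_b.
no_impute = clone(pipe_b)[1:]

# Clone, then replace the log step by name: imp + passthrough + scale.
no_log = clone(pipe_b).set_params(log="passthrough")

out = no_log.fit_transform(df)
print([name for name, _ in no_impute.steps], no_log.steps[1], out.shape)
```

Because both derived pipelines start from a clone, pipe_b itself keeps its log step.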