Updated: 2023-02-01 22:50:45
I want to use sklearn.compose.ColumnTransformer sequentially (not in parallel — the second transformer should be executed only after the first) on intersecting lists of columns, in this way:
log_transformer = p.FunctionTransformer(lambda x: np.log(x))
df = pd.DataFrame({'a': [1, 2, np.NaN, 4], 'b': [1, np.NaN, 3, 4], 'c': [1, 2, 3, 4]})
compose.ColumnTransformer(n_jobs=1,
                          transformers=[
                              ('num', impute.SimpleImputer(), ['a', 'b']),
                              ('log', log_transformer, ['b', 'c']),
                              ('scale', p.StandardScaler(), ['a', 'b', 'c'])
                          ]).fit_transform(df)
So, I want to apply SimpleImputer to 'a' and 'b', then log to 'b' and 'c', and then StandardScaler to 'a', 'b', and 'c'.
But I get output of shape (4, 7), with NaN still left in the a and b columns. So, how can I use ColumnTransformer on different columns in the manner of a Pipeline?
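A quick check of that shape, with the imports the snippet above assumes spelled out:

```python
import numpy as np
import pandas as pd
from sklearn import compose, impute
from sklearn import preprocessing as p

log_transformer = p.FunctionTransformer(lambda x: np.log(x))
df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [1, np.nan, 3, 4],
                   'c': [1, 2, 3, 4]})

out = compose.ColumnTransformer(n_jobs=1, transformers=[
    ('num', impute.SimpleImputer(), ['a', 'b']),
    ('log', log_transformer, ['b', 'c']),
    ('scale', p.StandardScaler(), ['a', 'b', 'c']),
]).fit_transform(df)

# the three transformers run in parallel on the raw df and their
# outputs are hstacked: 2 imputed + 2 logged + 3 scaled columns
print(out.shape)  # (4, 7)
```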
UPD:
pipe_1 = pipeline.Pipeline(steps=[
('imp', impute.SimpleImputer(strategy='constant', fill_value=42)),
])
pipe_2 = pipeline.Pipeline(steps=[
('imp', impute.SimpleImputer(strategy='constant', fill_value=24)),
])
pipe_3 = pipeline.Pipeline(steps=[
('scl', p.StandardScaler()),
])
# in the real situation I don't know exactly what cols these arrays contain, so they are not static:
cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']
proc = compose.ColumnTransformer(remainder='passthrough', transformers=[
('1', pipe_1, cols_1),
('2', pipe_2, cols_2),
('3', pipe_3, cols_3),
])
proc.fit_transform(df).T
Output:
array([[ 1. , 2. , 42. , 4. ],
[ 1. , 24. , 3. , 4. ],
[-1.06904497, -0.26726124, nan, 1.33630621],
[-1.33630621, nan, 0.26726124, 1.06904497],
[-1.34164079, -0.4472136 , 0.4472136 , 1.34164079]])
I understand why I get duplicated columns, NaNs, and unscaled values, but how can I fix this in the correct way when the column lists are not static?
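For reference, the duplication can be seen from the shape alone: ColumnTransformer fits each (pipeline, columns) pair on the original df independently and hstacks the results, so 'a' comes out once imputed (from pipe_1) and again scaled with its NaN intact (from pipe_3). A self-contained sketch reusing the definitions above:

```python
import numpy as np
import pandas as pd
from sklearn import compose, impute, pipeline
from sklearn import preprocessing as p

df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [1, np.nan, 3, 4],
                   'c': [1, 2, 3, 4]})

pipe_1 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=42))])
pipe_2 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=24))])
pipe_3 = pipeline.Pipeline(steps=[('scl', p.StandardScaler())])

proc = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('1', pipe_1, ['a']),
    ('2', pipe_2, ['b']),
    ('3', pipe_3, ['a', 'b', 'c']),
])

# each pipeline is fit on the *original* df, then the blocks are
# hstacked: imputed 'a', imputed 'b', then scaled 'a', 'b', 'c'
out = proc.fit_transform(df)
print(out.shape)  # (4, 5), not (4, 3)
```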
UPD2:
A problem may arise when the columns change their order, so I want to use FunctionTransformer for column selection:
def select_col(X, cols=None):
return X[cols]
ct1 = compose.make_column_transformer(
(p.OneHotEncoder(), p.FunctionTransformer(select_col, kw_args=dict(cols=['a', 'b']))),
remainder='passthrough'
)
ct1.fit(df)
But get this output:
ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed
How can I fix it?
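For what it's worth, the columns argument of ColumnTransformer also accepts a callable that is evaluated on the data at fit time, which may cover this dynamic-selection use case without a FunctionTransformer. A sketch, selecting 'c' here since, unlike 'a' and 'b', it has no missing values for OneHotEncoder to deal with:

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn import compose
from sklearn import preprocessing as p

df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [1, np.nan, 3, 4],
                   'c': [1, 2, 3, 4]})

def select_col(X, cols=None):
    # return the column *names* present in X, not the sliced data itself
    return [c for c in X.columns if c in cols]

ct1 = compose.make_column_transformer(
    # the column spec is a callable, called on df when fit() runs
    (p.OneHotEncoder(), partial(select_col, cols=['c'])),
    remainder='passthrough',
)
out = ct1.fit_transform(df)
print(out.shape)  # (4, 6): 4 one-hot columns for 'c' plus 'a' and 'b'
```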
The intended usage of ColumnTransformer
is that the different transformers are applied in parallel, not sequentially. To accomplish your desired outcome, three approaches come to mind:
First approach:
pipe_a = Pipeline(steps=[('imp', SimpleImputer()),
('scale', StandardScaler())])
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
('log', log_transformer),
('scale', StandardScaler())])
pipe_c = Pipeline(steps=[('log', log_transformer),
('scale', StandardScaler())])
proc = ColumnTransformer(transformers=[
('a', pipe_a, ['a']),
('b', pipe_b, ['b']),
('c', pipe_c, ['c'])]
)
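Pulled together with the question's data, this first approach runs end-to-end; a minimal runnable sketch (using plain imports rather than the question's p./compose. aliases, and default mean imputation):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [1, np.nan, 3, 4],
                   'c': [1, 2, 3, 4]})
log_transformer = FunctionTransformer(np.log)

pipe_a = Pipeline(steps=[('imp', SimpleImputer()),
                         ('scale', StandardScaler())])
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
                         ('log', log_transformer),
                         ('scale', StandardScaler())])
pipe_c = Pipeline(steps=[('log', log_transformer),
                         ('scale', StandardScaler())])
proc = ColumnTransformer(transformers=[('a', pipe_a, ['a']),
                                       ('b', pipe_b, ['b']),
                                       ('c', pipe_c, ['c'])])

# each column gets its own impute/log/scale chain; the output is one
# standardized column per input column
out = proc.fit_transform(df)
print(out.shape)  # (4, 3)
```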
Second approach: this one actually won't work, because the ColumnTransformer will rearrange the columns and forget the names*, so the later steps will fail or apply to the wrong columns. When sklearn finalizes how to pass along dataframes or feature names, this may be salvaged, or you may be able to tweak it for your specific use case now. (* ColumnTransformer does already have a get_feature_names, but the actual data passed through the pipeline doesn't carry that information.)
imp_tfm = ColumnTransformer(
transformers=[('num', impute.SimpleImputer() , ['a', 'b'])],
remainder='passthrough'
)
log_tfm = ColumnTransformer(
transformers=[('log', log_transformer, ['b', 'c'])],
remainder='passthrough'
)
scl_tfm = ColumnTransformer(
    transformers=[('scale', StandardScaler(), ['a', 'b', 'c'])]
)
proc = Pipeline(steps=[
('imp', imp_tfm),
('log', log_tfm),
('scale', scl_tfm)]
)
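The failure can be reproduced directly: the first ColumnTransformer emits a plain ndarray, so the next one can no longer look columns up by name (a sketch; the exact error message depends on the sklearn version):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [1, np.nan, 3, 4],
                   'c': [1, 2, 3, 4]})
log_transformer = FunctionTransformer(np.log)

imp_tfm = ColumnTransformer(
    transformers=[('num', SimpleImputer(), ['a', 'b'])],
    remainder='passthrough')
log_tfm = ColumnTransformer(
    transformers=[('log', log_transformer, ['b', 'c'])],
    remainder='passthrough')
proc = Pipeline(steps=[('imp', imp_tfm), ('log', log_tfm)])

try:
    proc.fit_transform(df)
    failed = False
except ValueError as err:
    # selecting columns by string requires a DataFrame, but step 1
    # returned an ndarray, so step 2 raises here
    failed = True
    print('second step failed:', err)
```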
Third approach: there may be a way to use the Pipeline slicing feature to have one "master" pipeline that you cut down for each feature... this would work mostly like the first approach, might save some coding in the case of larger pipelines, but seems a little hacky. For example:
pipe_c = clone(pipe_b)[1:]
pipe_a = clone(pipe_b)
pipe_a.steps[1] = ('nolog', 'passthrough')
(Note the slice drops the imputer, leaving log + scale, which is what column 'c' needs; replacing the middle step gives the impute + scale pipeline for column 'a'. Without cloning or otherwise deep-copying pipe_b, the last line would change both pipe_a and pipe_b. The slicing mechanism returns a copy, so pipe_c doesn't strictly need to be cloned, but I've left it in to feel safer. Unfortunately you can't provide a discontinuous slice, so pipe_a = pipe_b[0, 2] doesn't work, but you can set individual steps to "passthrough" as I've done above to disable them.)
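For completeness, a runnable sketch of this slicing approach (the slice clone(pipe_b)[1:] keeps the log and scale steps, so it serves the log-only column, while overwriting the middle step with 'passthrough' yields the impute-only pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [1, np.nan, 3, 4],
                   'c': [1, 2, 3, 4]})
log_transformer = FunctionTransformer(np.log)

# the "master" pipeline containing every step
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
                         ('log', log_transformer),
                         ('scale', StandardScaler())])
pipe_c = clone(pipe_b)[1:]   # log + scale, for the NaN-free column 'c'
pipe_a = clone(pipe_b)       # impute + scale, for the no-log column 'a'
pipe_a.steps[1] = ('nolog', 'passthrough')

proc = ColumnTransformer(transformers=[('a', pipe_a, ['a']),
                                       ('b', pipe_b, ['b']),
                                       ('c', pipe_c, ['c'])])
out = proc.fit_transform(df)
print(out.shape)  # (4, 3)
```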