Keras ImageDataGenerator validation split not selected from shuffled dataset

When specifying the validation_split argument in Keras' ImageDataGenerator, the split is performed before the data is shuffled, so only the last x samples are taken. The issue is that the last samples selected as validation data may not be representative of the training data, so training can fail. This is an especially common dead end when your image data is stored in a common directory with each sub-folder named by class. This has been noted in several posts:

Choose random validation data set

As you mentioned, Keras simply takes the last x samples of the dataset, so if you want to keep using it, you need to shuffle your dataset in advance.
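
As a minimal sketch of why this matters and what "shuffle in advance" buys you (toy arrays and hypothetical variable names, not the poster's data): with labels sorted by class, the tail that validation_split takes is a single class; permuting features and labels together fixes that.

import numpy as np
import tensorflow as tf

# Toy data: 100 samples sorted by class (50 of class 0 followed by 50 of class 1).
x = np.random.rand(100, 8).astype("float32")
y = np.array([0] * 50 + [1] * 50)

# fit(validation_split=0.2) would take the LAST 20 samples before any shuffling,
# so the validation set here would be 100% class 1.
print(y[-20:])  # all ones

# Shuffle features and labels with the same permutation so the pairs stay aligned.
rng = np.random.default_rng(seed=42)
indices = rng.permutation(len(x))
x, y = x[indices], y[indices]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The tail that validation_split slices off is now a random mix of both classes.
model.fit(x, y, validation_split=0.2, epochs=1, verbose=0)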

High training accuracy but low validation accuracy?

Please check whether you have shuffled the data before training. Because the validation split in Keras is performed before shuffling, you may have chosen an unbalanced dataset as your validation set, and thus got low accuracy.
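
One quick way to diagnose this (a hypothetical helper, not from any of the quoted posts) is to compare the class distribution of the tail Keras would take with the distribution of the full label set:

from collections import Counter

def tail_class_balance(labels, validation_split=0.2):
    """Compare class counts in the would-be validation tail vs. the full label set."""
    n_val = int(len(labels) * validation_split)
    print("all labels:", Counter(labels))
    print(f"last {n_val} labels:", Counter(labels[-n_val:]))

# With the sorted toy labels from the sketch above, the tail is all one class.
tail_class_balance([0] * 50 + [1] * 50)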

Does "validation split" randomly select the validation samples?

The validation data is picked as the last 10% (for instance, if validation_split=0.1) of the input. The training data (the remainder) can optionally be shuffled at every epoch (the shuffle argument in fit). That doesn't affect the validation data, obviously: it has to be the same set from epoch to epoch.

This answer points to sklearn's train_test_split() as a solution, but I want to propose a different solution that keeps consistency in the Keras workflow.

With the split-folders package you can randomly split your main data directory into training, validation, and testing (or just training and validation) directories. The class-specific subfolders are copied over automatically.

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

In order to give you this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/            # optional
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...

From the documentation:

import split_folders

# Split with a ratio.
# To split into only a training and validation set, pass a two-element tuple to `ratio`, e.g. `(.8, .2)`.
split_folders.ratio('input_folder', output="output", seed=1337, ratio=(.8, .1, .1)) # default values

# Split val/test with a fixed number of items, e.g. 100 for each set.
# To split into only a training and validation set, pass a single number to `fixed`, e.g. `10`.
split_folders.fixed('input_folder', output="output", seed=1337, fixed=(100, 100), oversample=False) # default values
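
A note on versions: in newer releases of the split-folders package the import name is splitfolders (import splitfolders, then splitfolders.ratio(...)), so if import split_folders fails, check which version you have installed.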

With this new folder structure you can easily use Keras data generators to divide your data into training and validation sets and ultimately train your model.

import tensorflow as tf
import split_folders
import os

main_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/Data'
output_dir = '/Volumes/WMEL/Independent Research Project/Data/test_train/output'

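# Randomly split the class folders in main_dir into output_dir/train and output_dir/val (70/30).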
split_folders.ratio(main_dir, output=output_dir, seed=1337, ratio=(.7, .3))

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255)  # normalize pixel values from [0, 255] to [0, 1]

train_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'train'),
                                                    class_mode='categorical',
                                                    batch_size=32,
                                                    target_size=(224,224),
                                                    shuffle=True)

validation_generator = train_datagen.flow_from_directory(os.path.join(output_dir,'val'),
                                                        target_size=(224, 224),
                                                        batch_size=32,
                                                        class_mode='categorical',
                                                        shuffle=True) # set as validation data

IMG_SHAPE = (224, 224, 3)  # matches the generators' target_size, plus 3 colour channels

base_model = tf.keras.applications.ResNet50V2(
    input_shape=IMG_SHAPE,
    include_top=False,
    weights=None)  # train from scratch rather than loading ImageNet weights

maxpool_layer = tf.keras.layers.GlobalMaxPooling2D()
prediction_layer = tf.keras.layers.Dense(4, activation='softmax')

model = tf.keras.Sequential([
    base_model,
    maxpool_layer,
    prediction_layer
])

opt = tf.keras.optimizers.Adam(learning_rate=0.004)
model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

model.fit(
    train_generator,
    steps_per_epoch = train_generator.samples // 32,
    validation_data = validation_generator,
    validation_steps = validation_generator.samples // 32,
    epochs = 20)
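
Because split-folders shuffles and splits the files within each class folder, the resulting validation directory is a random, class-balanced sample, and the fixed seed makes the same train/validation partition reproducible across runs.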