且构网

Loading big data into TensorFlow 2.0 without loading it into RAM

Updated: 2023-12-02 18:58:52


I have processed and saved a large dataset of video and audio files (about 8 to 9 GB of data). The data is saved as two numpy arrays, one per modality. The shape of each file is (number_of_examples, maximum_time_length, feature_length).

I want to use this data to train my neural network for a classification task. I am using the TensorFlow 2.0 Beta version, and I am running all the code on Google Colab (after installing tf-2.0 beta). Each time I load the data into tf.data, the entire RAM of the virtual machine is used and the session is forced to restart.

Previous Approaches:

I tried two approaches:

1) Loading both variables entirely into RAM and converting them to tensors

2) Loading the data as memory-mapped arrays (from disk) and passing them to tf.data

However, both approaches filled the RAM and forced the VM to restart.

Code:

import numpy as np
import pickle as pkl

# Access the audio data on disk without loading it into RAM
X_audio = np.memmap('gdrive/My Drive/Codes/audio_data.npy', dtype='float32', mode='r').reshape(2198,3860,74)

# Access the video data on disk without loading it into RAM
X_video = np.memmap('gdrive/My Drive/Codes/video_data.npy', dtype='float32', mode='r').reshape(2198,1158,711)

# Load labels
with open('gdrive/My Drive/Codes/label_data_3','rb') as f:
    Y = pkl.load(f)

dataset = tf.data.Dataset.from_tensor_slices((X_audio, X_video, Y)).shuffle(2198).batch(32)

Error: Your session crashed after using all available RAM.
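A side note on the memmap calls above, separate from the RAM issue: `np.memmap` does not parse the `.npy` header, so mapping a file produced by `np.save` this way misreads the header bytes as data and shifts every element. `np.load` with `mmap_mode='r'` (which the answer below uses) parses the header and maps only the data section. A minimal demonstration:

```python
import numpy as np

# Save a small array the same way the large arrays were presumably saved.
arr = np.arange(6, dtype='float32').reshape(2, 3)
np.save('demo.npy', arr)

# np.load with mmap_mode parses the .npy header and maps only the data section.
mapped = np.load('demo.npy', mmap_mode='r')
assert np.array_equal(mapped, arr)

# A raw np.memmap starts at byte 0 of the file, so the .npy header bytes
# are misread as float32 values and every real element is shifted.
raw = np.memmap('demo.npy', dtype='float32', mode='r')
assert raw.size > arr.size  # the extra "elements" are the header bytes
```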

With the TensorFlow 2.x.x dataset API you can use tf.data.Dataset.from_generator to create a dataset from a generator function. This generator function does its reading via a numpy memmap.

The code below creates a dummy data file, then reads one example at a time from the file on disk. It can easily be updated to read multiple examples at a time to increase IO throughput (let me know if you want that in the code example below).

# imports
import numpy as np
import pathlib
import tensorflow as tf

# create huge numpy array and save it to disk
file = pathlib.Path("huge_data.npy")
examples = 5000
example_shape = (256, 256)
huge_data_shape = (examples, *example_shape)
huge_data_dtype = np.float64

# create the file if it does not exist
if not file.is_file():
    print("creating file with random data and saving to disk")
    numpy_data = np.random.rand(*huge_data_shape).astype(huge_data_dtype)
    np.save(file, numpy_data)

# memmap the file
numpy_data_memmap = np.load(file, mmap_mode='r')


# generator function
def data_generator():
    return iter(numpy_data_memmap)


# create tf dataset from generator fn
dataset = tf.data.Dataset.from_generator(
    generator=data_generator,
    output_types=huge_data_dtype,
    output_shapes=example_shape,
)

# consume huge dataset
for i, ex in enumerate(dataset):
    print(i, ex.shape, ex.dtype)

Output:

0 (256, 256) <dtype: 'float64'>
1 (256, 256) <dtype: 'float64'>
2 (256, 256) <dtype: 'float64'>
3 (256, 256) <dtype: 'float64'>
...
4995 (256, 256) <dtype: 'float64'>
4996 (256, 256) <dtype: 'float64'>
4997 (256, 256) <dtype: 'float64'>
4998 (256, 256) <dtype: 'float64'>
4999 (256, 256) <dtype: 'float64'>