且构网

Converting a large CSV to a sparse matrix for use in sklearn

Updated: 2022-02-16 09:44:31

You can build a sparse matrix in memory row by row fairly easily:

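import numpy as np
import scipy.sparse as sps

input_file_name = "something.csv"
sep = "\t"  # field separator; this example assumes a tab-delimited file

def _process_data(row_array):
    # Placeholder for any per-row preprocessing; returns the row unchanged.
    return row_array

sp_data = []
with open(input_file_name) as csv_file:
    for row in csv_file:
        # Parse one line of text into a 1-D float array.
        data = np.fromstring(row, sep=sep)
        data = _process_data(data)
        # Keep only the non-zero entries of this row.
        data = sps.coo_matrix(data)
        sp_data.append(data)

# Stack the single-row sparse matrices into one sparse matrix.
sp_data = sps.vstack(sp_data)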
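
As a quick sanity check of the loop above, the sketch below runs the same steps on a small in-memory sample instead of a file. The sample text and the `build_sparse` helper name are made up for illustration, and it parses each line with `str.split` rather than `np.fromstring`. The final `tocsr()` call is an assumption about what downstream code wants, since CSR is the layout most scikit-learn estimators work with.

```python
import io

import numpy as np
import scipy.sparse as sps


def build_sparse(lines, sep="\t"):
    """Stack one sparse row per text line into a single CSR matrix.

    Hypothetical helper, not part of the original answer.
    """
    rows = []
    for line in lines:
        # Drop the trailing newline, then parse the fields as floats.
        values = np.array(line.rstrip("\n").split(sep), dtype=float)
        # Reshape to an explicit 1 x n row before making it sparse.
        rows.append(sps.coo_matrix(values.reshape(1, -1)))
    return sps.vstack(rows).tocsr()


# Made-up sample: three rows, three columns, two non-zero values.
sample = io.StringIO("0\t0\t3\n1\t0\t0\n0\t0\t0\n")
matrix = build_sparse(sample)
```

Only the non-zero entries are stored, so memory use grows with the number of non-zeros rather than with rows times columns.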

The result is also easier to write to HDF5, which is a much better way to store numbers at this scale than a text file.
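
One way to do the HDF5 step is sketched below. The answer does not say which library or file layout to use, so this assumes the `h5py` package and stores the three arrays of a CSR matrix (`data`, `indices`, `indptr`) plus its shape. The dataset names and the `save_csr` / `load_csr` helpers are hypothetical, and the random matrix merely stands in for the one built from the CSV.

```python
import os
import tempfile

import h5py
import scipy.sparse as sps


def save_csr(path, matrix):
    """Write the CSR arrays and the shape to an HDF5 file (assumed layout)."""
    csr = sps.csr_matrix(matrix)
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=csr.data)
        f.create_dataset("indices", data=csr.indices)
        f.create_dataset("indptr", data=csr.indptr)
        f.attrs["shape"] = csr.shape


def load_csr(path):
    """Rebuild the CSR matrix from the arrays written by save_csr."""
    with h5py.File(path, "r") as f:
        shape = tuple(int(n) for n in f.attrs["shape"])
        return sps.csr_matrix(
            (f["data"][:], f["indices"][:], f["indptr"][:]), shape=shape
        )


# Stand-in for the matrix built from the CSV file.
original = sps.random(50, 20, density=0.1, format="csr", random_state=0)
path = os.path.join(tempfile.mkdtemp(), "matrix.h5")
save_csr(path, original)
restored = load_csr(path)
```

Storing the three CSR arrays keeps the file size proportional to the number of non-zero values, and the matrix can be reloaded without re-parsing any text.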