Fastest way to load a huge .dat file into an array

Updated: 2023-12-02 19:03:40

Looking at the source, it appears that numpy.loadtxt contains a lot of code to handle many different formats. If your input file is well-defined, it is not too difficult to write your own function optimized for that particular format. Something like this (untested):

import numpy as np

def load_big_file(fname):
    '''Only works for a well-formed text file of space-separated doubles.'''
    rows = []  # unknown number of lines, so accumulate in a list
    with open(fname) as f:
        for line in f:
            values = [float(s) for s in line.split()]
            rows.append(np.array(values, dtype=np.double))
    return np.vstack(rows)  # stack the list of row vectors into a 2D array
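
As a quick sanity check, here is a minimal usage sketch; the file name data.dat and the random test data are illustrative assumptions, not part of the original question:

# Minimal usage sketch; 'data.dat' is a hypothetical file created here for the test.
np.savetxt('data.dat', np.random.rand(1000, 5))  # writes space-separated doubles
a = load_big_file('data.dat')
print(a.shape, a.dtype)  # expected: (1000, 5) float64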


An alternative solution, if the number of rows and columns is known beforehand, might be:

def load_known_size(fname, nrow, ncol):
    '''Preallocate the output when the number of rows and columns is known.'''
    x = np.empty((nrow, ncol), dtype=np.double)
    with open(fname) as f:
        for irow, line in enumerate(f):
            for icol, s in enumerate(line.split()):
                x[irow, icol] = float(s)
    return x


In this way, you don't have to allocate all the intermediate lists.


EDIT: It seems the second solution is a bit slower; the list comprehension is probably faster than the explicit for loop. Combining the two solutions, and using the trick that NumPy does implicit conversion from string to float (which I only just discovered), this might be faster:

def load_known_size(fname, nrow, ncol):
    '''Preallocated output; NumPy converts the split strings on assignment.'''
    x = np.empty((nrow, ncol), dtype=np.double)
    with open(fname) as f:
        for irow, line in enumerate(f):
            x[irow, :] = line.split()  # implicit str-to-float conversion
    return x
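
The assignment line relies on NumPy casting a list of strings to floats when it is written into a float array; here is a tiny sketch of just that behavior in isolation:

x = np.empty(3, dtype=np.double)
x[:] = '1.5 2.5 3.5'.split()  # NumPy converts the strings to floats on assignment
print(x)  # [1.5 2.5 3.5]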


To get any further speedup, you would probably have to use some code written in C or Cython. I would be interested to know how much time these functions take to load your files.
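
As a rough way to measure that, here is a hedged timing sketch; the file name big.dat, the array dimensions, and the comparison against np.loadtxt are my assumptions, and it times whichever load_known_size variant was defined last:

# Rough timing sketch; 'big.dat' and its dimensions are assumptions for illustration.
import time

nrow, ncol = 100_000, 10
np.savetxt('big.dat', np.random.rand(nrow, ncol))  # space-separated test file

for loader, args in [(np.loadtxt, ('big.dat',)),
                     (load_big_file, ('big.dat',)),
                     (load_known_size, ('big.dat', nrow, ncol))]:
    t0 = time.perf_counter()
    loader(*args)
    print(f'{loader.__name__}: {time.perf_counter() - t0:.2f} s')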