Numpy loading csv too slow compared to Matlab

I posted this question because I was wondering whether I did something terribly wrong to get this result.

I have a medium-size csv file and I tried to use numpy to load it. For illustration, I made the file using Python:

import timeit
import numpy as np

# 1.5 million rows, 3 columns of random floats in [0, 10)
my_data = np.random.rand(1500000, 3)*10
np.savetxt('./test.csv', my_data, delimiter=',', fmt='%.2f')

And then I tried two methods, numpy.genfromtxt and numpy.loadtxt:

setup_stmt = 'import numpy as np'
stmt1 = """
my_data = np.genfromtxt('./test.csv', delimiter=',')
"""
stmt2 = """
my_data = np.loadtxt('./test.csv', delimiter=',')
"""

t1 = timeit.timeit(stmt=stmt1, setup=setup_stmt, number=3)
t2 = timeit.timeit(stmt=stmt2, setup=setup_stmt, number=3)

And the result shows that t1 = 32.159652940464184 and t2 = 52.00093725634724 (total seconds for the three runs).
However, when I tried Matlab:

tic
for i = 1:3
    my_data = dlmread('./test.csv');
end
toc

The result shows: Elapsed time is 3.196465 seconds.

I understand that there may be some differences in the loading speed, but:

  1. This is much more than I expected;
  2. Shouldn't np.loadtxt be faster than np.genfromtxt?
  3. I haven't tried the python csv module yet, because loading csv files is something I do very frequently, and with the csv module the code gets a little verbose (a minimal sketch follows this list)... But I'd be happy to try it if that's the only way. Right now I'm mostly concerned about whether I'm doing something wrong.
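
For reference, a bare-bones csv-module loader doesn't have to be long. Here is a minimal sketch (load_csv is just an illustrative name, assuming a headerless all-float file like the one above):

import csv
import numpy as np

def load_csv(path):
    # Parse rows with the stdlib csv module, then stack them into a float array.
    with open(path, newline='') as f:
        return np.array([[float(x) for x in row] for row in csv.reader(f)])

my_data = load_csv('./test.csv')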

Any input would be appreciated. Thanks a lot in advance!

Yeah, reading csv files into numpy is pretty slow. There's a lot of pure Python along the code path. These days, even when I'm using pure numpy I still use pandas for IO:

>>> import numpy as np, pandas as pd
>>> %time d = np.genfromtxt("./test.csv", delimiter=",")
CPU times: user 14.5 s, sys: 396 ms, total: 14.9 s
Wall time: 14.9 s
>>> %time d = np.loadtxt("./test.csv", delimiter=",")
CPU times: user 25.7 s, sys: 28 ms, total: 25.8 s
Wall time: 25.8 s
>>> %time d = pd.read_csv("./test.csv", delimiter=",").values
CPU times: user 740 ms, sys: 36 ms, total: 776 ms
Wall time: 780 ms
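
One caveat with read_csv: it treats the first row as a header by default, so for a headerless numeric file like this one you would normally pass header=None to keep the first data row:

>>> d = pd.read_csv("./test.csv", delimiter=",", header=None).values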

Alternatively, in a simple enough case like this one, you could use something like what Joe Kington wrote here:

>>> %time data = iter_loadtxt("test.csv")
CPU times: user 2.84 s, sys: 24 ms, total: 2.86 s
Wall time: 2.86 s
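
The idea behind iter_loadtxt, roughly (a sketch from memory rather than the verbatim linked answer), is to parse one value at a time in a generator and hand it to np.fromiter, so the parse never builds intermediate Python lists:

import numpy as np

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    # Generator that yields one parsed scalar at a time.
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                items = line.rstrip().split(delimiter)
                for item in items:
                    yield dtype(item)
            # Remember the row width so the flat result can be reshaped.
            iter_loadtxt.rowlength = len(items)

    data = np.fromiter(iter_func(), dtype=dtype)
    return data.reshape((-1, iter_loadtxt.rowlength))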

There's also Warren Weckesser's textreader library, in case pandas is too heavy a dependency:

>>> import textreader
>>> %time d = textreader.readrows("test.csv", float, ",")
readrows: numrows = 1500000
CPU times: user 1.3 s, sys: 40 ms, total: 1.34 s
Wall time: 1.34 s