且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

从MySQL将数字数据加载到python/pandas/numpy数组的最快方法

更新时间:2021-11-02 08:59:19

问题"似乎是从MySQL的十进制类型到python的十进制类型转换.MySQLdb,pymysql和pyodbc对数据进行十进制转换.通过将MySQLdb中的converters.py文件更改为:

The "problem" seems to have been the type conversion which occurs from MySQL's decimal type to python's decimal.Decimal that MySQLdb, pymysql and pyodbc does on the data. By changing the converters.py file (at the very last lines) in MySQLdb to have:

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float

而不是十进制.十进制似乎可以完全解决问题,现在可以解决以下代码:

instead of decimal.Decimal seems to completely solve the problem and now the following code:

import MySQLdb
import numpy
import time

t = time.time()
conn = MySQLdb.connect(host='',...)
curs = conn.cursor()
curs.execute("select x,y from TABLENAME")
data = numpy.array(curs.fetchall(),dtype=float)
print(time.time()-t)

运行不到一秒钟! 有趣的是,十进制.在探查器中从来没有十进制出现问题.

Runs in less than a second! What is funny, decimal.Decimal never appeared to be the problem in the profiler.

类似的解决方案应该在pymysql包中工作. pyodbc比较棘手:它们都是用C ++编写的,因此您必须重新编译整个程序包.

Similar solution should work in pymysql package. pyodbc is more tricky: it is all written in C++, hence you would have to recompile the entire package.

更新

以下是不需要修改MySQLdb源代码的解决方案: Python MySQLdb返回datetime.date和十进制 然后,将数字数据加载到熊猫中的解决方案:

Here is a solution not requiring to modify the MySQLdb source code: Python MySQLdb returns datetime.date and decimal The solution then to load numeric data into pandas:

import MySQLdb
import pandas.io.sql as psql
from MySQLdb.converters import conversions
from MySQLdb.constants import FIELD_TYPE

conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = MySQLdb.connect(host='',user='',passwd='',db='')
sql = "select * from NUMERICTABLE"
df = psql.read_frame(sql, conn)

在加载200k x 9的表中,MATLAB的性能降低了约4倍!

Beats MATLAB by a factor of ~4 in loading 200k x 9 table!