How can I efficiently compute a huge matrix multiplication (tfidf features) in Python?

You may want to look at the random_projection module in scikit-learn. The Johnson-Lindenstrauss lemma says that a random projection matrix is guaranteed to preserve pairwise distances up to some tolerance eta (called eps in scikit-learn); that tolerance is the hyperparameter you choose when computing how many random projection components you need.

To cut a long story short, the scikit-learn class SparseRandomProjection (seen here) is a transformer that does this for you. If you run it on X after vec.fit_transform, you should see a fairly large reduction in the number of features.
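
As a rough sketch of how that might look (assuming vec is a TfidfVectorizer, docs is your corpus, and eps=0.1 and the variable names are just placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.random_projection import SparseRandomProjection

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)  # sparse matrix, n_samples x n_features

    # n_components='auto' derives the target dimension from the
    # Johnson-Lindenstrauss bound for the chosen tolerance eps.
    srp = SparseRandomProjection(n_components='auto', eps=0.1, random_state=0)
    X_small = srp.fit_transform(X)  # far fewer columns, pairwise distances roughly preserved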

The formula behind sklearn.random_projection.johnson_lindenstrauss_min_dim shows that to preserve distances up to a 10% tolerance, you only need johnson_lindenstrauss_min_dim(350363, .1) = 10942 features. This is an upper bound, so you may be able to get away with far fewer. Even a 1% tolerance would only need johnson_lindenstrauss_min_dim(350363, .01) = 1028192 features, which is still significantly fewer than you have right now.
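
You can check these numbers yourself; a minimal sketch (the printed values are the ones quoted above):

    from sklearn.random_projection import johnson_lindenstrauss_min_dim

    # Minimum number of components needed to keep pairwise distances
    # within the given tolerance, for 350363 samples.
    print(johnson_lindenstrauss_min_dim(350363, eps=0.1))   # 10942
    print(johnson_lindenstrauss_min_dim(350363, eps=0.01))  # 1028192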

A simple thing to try: if your data is dtype='float64', try using 'float32' instead. That alone can save a massive amount of space (float32 takes half the bytes of float64), especially if you do not need double precision.
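
For example (assuming X is the tfidf matrix from above; .astype works the same way on dense arrays and scipy sparse matrices):

    import numpy as np

    # Casting to float32 halves the memory taken by the stored values.
    X = X.astype(np.float32)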

If the issue is that you cannot store the "final matrix" in memory either, I would recommend working with the data in an HDF5Store (as seen in pandas using pytables). This link has some good starter code, and you could iteratively calculate chunks of your dot product and write to disk. I have been using this extensively in a recent project on a 45GB dataset, and could provide more help if you decide to go this route.
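
A minimal sketch of the chunked idea, here using h5py rather than the pandas HDFStore (X, Y, the chunk size, and the file name are all placeholders):

    import h5py
    import numpy as np

    n, m = X.shape[0], Y.shape[1]
    chunk = 10_000  # rows per block; tune to the RAM you have

    # Compute X @ Y one block of rows at a time and stream it to disk,
    # so the full dense product never has to fit in memory.
    with h5py.File('dot_product.h5', 'w') as f:
        out = f.create_dataset('dot', shape=(n, m), dtype='float32')
        for start in range(0, n, chunk):
            stop = min(start + chunk, n)
            block = X[start:stop].dot(Y)
            if hasattr(block, 'toarray'):  # densify sparse blocks before writing
                block = block.toarray()
            out[start:stop] = block.astype(np.float32)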