Switching from MySQL to MongoDB: 200 million rows

Updated: 2023-11-19 16:03:28


We are trying to move from MySQL to MongoDB. The MySQL structure is:

id_src int
id_dest int
unique key: (id_src, id_dest)

There are about 200 million rows in MySQL.

Data example: {id_src, id_dest} {1,2} {1,3} {1,10} {2,3} {2,10} {4,3}

We need to retrieve the data as: {id_dest, count} {3,3} {10,2} {2,1}

I started by reproducing the MySQL structure in MongoDB. Insert performance was huge (very good): about 1 hour to insert 200 million rows.

But I needed to use map/reduce to get the group by, and the map/reduce took about 1 hour.
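For reference, the group-by being described maps naturally onto map/reduce: map emits (id_dest, 1) per row and reduce sums the ones. Here is a sketch in plain JavaScript, run over an in-memory array rather than through MongoDB's mapReduce, just to make the shape of the computation concrete:

```javascript
// Map/reduce for {id_dest, count}: map emits (id_dest, 1), reduce sums.
// Plain Node.js sketch over an in-memory array, not MongoDB's mapReduce.
const rows = [
  { id_src: 1, id_dest: 2 },
  { id_src: 1, id_dest: 3 },
  { id_src: 1, id_dest: 10 },
  { id_src: 2, id_dest: 3 },
  { id_src: 2, id_dest: 10 },
  { id_src: 4, id_dest: 3 },
];

// map phase: emit one (key, 1) pair per row
const emitted = rows.map((row) => [row.id_dest, 1]);

// reduce phase: sum the emitted values per key
const counts = {};
for (const [dest, one] of emitted) {
  counts[dest] = (counts[dest] || 0) + one;
}

console.log(counts); // { '2': 1, '3': 3, '10': 2 }
```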

So I tried to create another MongoDB structure: {id_dest, {id_src1, id_src2}}

Each document can have hundreds of thousands of id_src values.

Here is my insert.php code:

$res = mysql_unbuffered_query("select * from ids limit 10000100");
while ($tab = mysql_fetch_array($res)) {
    // upsert: push each id_src onto the 'src' array of the id_dest document
    $collection->update(
        array('_id' => (int)$tab['id_dest']),
        array('$push' => array('src' => (int)$tab['id_src'])),
        array("upsert" => true)
    );
}

But in that case performance is very bad: only a few updates per second.

Am I doing something wrong?

First things first, Map/Reduce is not designed for real-time analysis. Additionally, MongoDB is currently limited to one core for M/R, which will slow things down even more.

So if you're going to use M/R to get the data, it will not be "real-time"; it will be updated every X minutes (or hours).

There are two efficient approaches here:

  1. Incremental M/R
  2. Real-time counters

Option #1: Incremental M/R

For this option you run the M/R once for all of the data. Then, going forward, you only run M/R on the modified data. If you have 200M documents now, you may have 210M documents next (which means that M/R gets even slower). But if you only have to run new/changed documents, then it should take much less than 1 hour.

Take a look at the reduce output option in the docs here.

Again the premise is that you only M/R the relevant data and the system "re-reduces" against the existing data. In this way you get an "incremental" M/R.
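The "re-reduce" step means the stored value for a key is fed back through your reduce function together with the newly mapped values (this is what mapReduce's { out: { reduce: ... } } output mode does). The merge itself can be sketched in plain JavaScript; the new documents here are made-up example data:

```javascript
// Sketch of the "re-reduce" merge: the stored count for a key is reduced
// together with values emitted from the new/changed documents only.
function reduceFn(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Existing summary after the initial full M/R (from the example data):
const stored = { 3: 3, 10: 2, 2: 1 };

// Hypothetical new/changed documents since the last run:
const newDocs = [{ src: 5, dest: 2 }, { src: 6, dest: 3 }];

// Map only the new docs (emit 1 per doc), then re-reduce against stored values.
for (const doc of newDocs) {
  const existing = stored[doc.dest] !== undefined ? [stored[doc.dest]] : [];
  stored[doc.dest] = reduceFn(doc.dest, existing.concat([1]));
}

console.log(stored); // { '2': 2, '3': 4, '10': 2 }
```

Because only the new documents pass through the map phase, the cost of each run tracks the volume of changes, not the 200M-row total.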

Option #2: Real-Time counters

In this method, you have two collections: one for your data, a second for the result of the "summary". When you insert into data, you also do an increment on the summary.

Assume you have this data:

Main Collection
{src: 1, dest: 2}
{src: 1, dest: 3}
{src: 1, dest: 10}
{src: 2, dest: 3}
{src: 2, dest: 10}
{src: 4, dest: 3}

Summary Collection
{dest: 3, count: 3}
{dest: 10, count: 2}
{dest: 2, count: 1}

You receive a new piece of data {src: 5, dest: 2}. You would do two updates:

db.main.insert({src: 5, dest: 2});
db.summary.update({dest: 2}, { $inc : { count: 1 } }, true); // upsert true

Here's your new data:

Main Collection
{src: 1, dest: 2}
...
{src: 4, dest: 3}
{src: 5, dest: 2}

Summary Collection
{dest: 3, count: 3}
{dest: 10, count: 2}
{dest: 2, count: 2}

You'll notice that we've updated our summary: {dest: 2, count: 2}.
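In memory, the pair of writes behaves like this (a plain JavaScript sketch of the insert plus the $inc upsert, not actual MongoDB calls):

```javascript
// Sketch of the two-write pattern: append to main, then upsert-$inc the summary.
const main = [];
const summary = {}; // dest -> count

function record(doc) {
  main.push(doc);                                   // db.main.insert(doc)
  summary[doc.dest] = (summary[doc.dest] || 0) + 1; // db.summary.update({dest: ...}, {$inc: {count: 1}}, true)
}

// Replay the example data, then the new piece of data {src: 5, dest: 2}:
const docs = [
  { src: 1, dest: 2 }, { src: 1, dest: 3 }, { src: 1, dest: 10 },
  { src: 2, dest: 3 }, { src: 2, dest: 10 }, { src: 4, dest: 3 },
  { src: 5, dest: 2 },
];
docs.forEach(record);

console.log(summary);     // { '2': 2, '3': 3, '10': 2 }
console.log(main.length); // 7
```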

Obviously, there are trade-offs here. You need more updates/inserts (2x), but you get real-time counters. Now, there are no transactions in MongoDB, so you'll have to decide on a strategy for ensuring that both updates happen. There are lots of ways to do this which I cannot go into here (see message queues for one method).