A ZooKeeper Migration Failure

Updated: 2022-10-03 18:48:39

A while ago a colleague migrated the ZooKeeper ensemble used by Kafka and Flume. Afterwards, every Flume producer and consumer started reporting the same errors:

07 Jan 2014 01:19:32,571 INFO  [conf-file-poller-0-SendThread(xxx:2181)] (org.apache.zookeeper.ClientCnxn$SendThread.startConnect:1058)  - Opening socket connection to server xxx:2181
07 Jan 2014 01:19:32,572 INFO  [conf-file-poller-0-SendThread(xxx:2181)] (org.apache.zookeeper.ClientCnxn$SendThread.primeConnection:947)  - Socket connection established to xxx:2181, initiating session
07 Jan 2014 01:19:32,573 INFO  [conf-file-poller-0-SendThread(xxx:2181)] (org.apache.zookeeper.ClientCnxn$SendThread.run:1183)  - Unable to read additional data from server sessionid 0x142f42b91871911, likely server has closed socket, closing socket connection and attempting reconnect
07 Jan 2014 01:19:32,845 INFO  [conf-file-poller-0-SendThread(xxx:2181)] (org.apache.zookeeper.ClientCnxn$SendThread.startConnect:1058)  - Opening socket connection to server xxx:2181

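The clients kept looping: connect, fail, retry. My colleague said network connectivity had been verified long before, so I checked the server-side log and found: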

2014-01-06 23:59:59,987 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /xxx:45282
2014-01-06 23:59:59,987 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@793] - Connection request from old client xxx:45282; will be dropped if server is in r-o mode
2014-01-06 23:59:59,987 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@812] - Refusing session request for client xxx:45282 as it has seen zxid 0x60fd15564 our last zxid is 0x10000000f client must try another server
2014-01-06 23:59:59,987 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client xxx:45282 (no session established for client)
2014-01-06 23:59:59,989 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from xxx:45285

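It turned out that Flume was still holding the zxid it had last seen on the old cluster (0x60fd15564), while the new cluster had started over and its last zxid was only 0x10000000f. Because the client's zxid was ahead of the server's, the server threw an exception and closed the connection. This is the check in the ZooKeeper server code: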

        // Refuse any client that has already seen a zxid newer than this server's.
        if (connReq.getLastZxidSeen() > zkDb.dataTree.lastProcessedZxid) {
            String msg = "Refusing session request for client "
                + cnxn.getRemoteSocketAddress()
                + " as it has seen zxid 0x"
                + Long.toHexString(connReq.getLastZxidSeen())
                + " our last zxid is 0x"
                + Long.toHexString(getZKDatabase().getDataTreeLastProcessedZxid())
                + " client must try another server";
            LOG.info(msg);
            throw new CloseRequestException(msg);
        }
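
To make the numbers in the log concrete, here is a minimal standalone sketch that decodes the two zxids and repeats the comparison. It relies on the fact that a zxid is a 64-bit value whose high 32 bits are the leader epoch and whose low 32 bits are a counter within that epoch. The class ZxidCheck and its helper methods are hypothetical names made up for this illustration; they are not part of the ZooKeeper API.

```java
// Illustrative sketch only: ZxidCheck and its helpers are hypothetical, not ZooKeeper classes.
class ZxidCheck {
    // High 32 bits of a zxid: the leader epoch.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // Low 32 bits of a zxid: the transaction counter within that epoch.
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Same comparison as the server-side check: refuse a client that is ahead of the server.
    static boolean serverRefuses(long clientLastZxidSeen, long serverLastProcessedZxid) {
        return clientLastZxidSeen > serverLastProcessedZxid;
    }

    public static void main(String[] args) {
        long clientZxid = 0x60fd15564L; // "has seen zxid" in the server log
        long serverZxid = 0x10000000fL; // "our last zxid" in the server log

        System.out.printf("client: epoch=%d counter=0x%x%n", epoch(clientZxid), counter(clientZxid));
        System.out.printf("server: epoch=%d counter=0x%x%n", epoch(serverZxid), counter(serverZxid));
        System.out.println("refused=" + serverRefuses(clientZxid, serverZxid));
    }
}
```

With the values from the log, the client is in epoch 6 while the new cluster is still in epoch 1, so the comparison can never come out in the client's favour until the client forgets its old zxid.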

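Later I asked my colleague how the migration had been done. They had started a brand-new cluster, shut down the old one, and run haproxy on one of the old cluster's IPs at port 2181 to proxy to the new cluster, expecting the switch to be transparent. In fact this triggered ZooKeeper bug 832 (ZOOKEEPER-832), which makes the client retry the connection forever. Only restarting Flume fixes it, because a restarted client no longer carries the old zxid.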

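The correct way to migrate is to join the new nodes to the old cluster, update the Flume configuration, wait a while (Flume reloads its configuration automatically), and only then shut down the old nodes. That avoids the problem.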
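
The difference between the two migration paths can be sketched with a toy model. It assumes that a node joining an existing ensemble synchronizes its state from that ensemble before serving clients, whereas a brand-new cluster starts its own transaction history. MigrationModel, Node, joinAndSync, and the assumption that the old cluster's zxid equals the client's are all hypothetical and exist only for this illustration; real ZooKeeper synchronization is considerably more involved.

```java
// Toy model only: these classes are hypothetical and do not use the ZooKeeper API.
class MigrationModel {
    // Stand-in for a ZooKeeper server that only tracks its last processed zxid.
    static final class Node {
        final long lastProcessedZxid;

        Node(long lastProcessedZxid) {
            this.lastProcessedZxid = lastProcessedZxid;
        }

        // A session is accepted only if the client is not ahead of this server.
        boolean acceptsClient(long clientLastZxidSeen) {
            return clientLastZxidSeen <= lastProcessedZxid;
        }
    }

    // A node that joins an existing ensemble catches up to that ensemble's state.
    static Node joinAndSync(Node existingMember) {
        return new Node(existingMember.lastProcessedZxid);
    }

    public static void main(String[] args) {
        long clientZxid = 0x60fd15564L;        // zxid Flume last saw on the old cluster
        Node oldNode = new Node(0x60fd15564L); // assumption: old cluster is at least this far
        Node freshNode = new Node(0x10000000fL); // separate new cluster, as in the incident
        Node joinedNode = joinAndSync(oldNode);  // new node added to the old cluster

        System.out.println("fresh cluster accepts client: " + freshNode.acceptsClient(clientZxid));
        System.out.println("joined node accepts client: " + joinedNode.acceptsClient(clientZxid));
    }
}
```

In this model the separately started cluster rejects the existing client, while the node that joined the old cluster accepts it, which is why the old nodes can be retired safely once clients have moved over.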



This article is reposted from MIKE老毕's 51CTO blog. Original link: http://blog.51cto.com/boylook/1365364. To republish it, please contact the original author.