且构网

Sharing what programmers run into in development...

Cassandra hangs on arbitrary commands

Updated: 2023-12-01 22:32:46



We're hosting a Cassandra 2.0.2 cluster on AWS. We recently started upgrading from normal to SSD drives by bootstrapping new nodes and decommissioning old ones. It went fairly well, aside from two nodes hanging forever on decommission. Now that the 6 new nodes are operational, we noticed that some of our old tools that use phpcassa have stopped working. Nothing has changed in the security groups, all TCP/UDP ports are open, telnet can connect on 9160, and cqlsh can 'use' a cluster and select data. However, 'describe cluster' fails, and in the CLI 'show keyspaces' also fails; by fail, I mean the command never returns to the prompt and never returns any results. The queries work perfectly from the new nodes, but even the old nodes waiting to be decommissioned cannot perform them. The production system, which also uses phpcassa, performs normal data requests and works fine.

All Cassandra nodes have the same config and the same version, and were installed from the same package. All nodes were recently restarted because of a seed node change.

Versions:

Connected to ### at ####.compute-1.amazonaws.com:9160. [cqlsh 4.1.0 | Cassandra 2.0.2 | CQL spec 3.1.1 | Thrift protocol 19.38.0]

I've run out of ideas. Any hints would be greatly appreciated.

Update:

After some further investigation, here is a more detailed description.

If I cassandra-cli to any machine, and do "show keyspaces", it works.

If I cassandra-cli to a remote machine, and do "show keyspaces", it hangs indefinitely.

If I cqlsh to a remote Cassandra and run describe keyspaces, it hangs. If I press Ctrl+C and repeat the same query, it responds instantly.

If I cqlsh to a local cassandra, and do a describe keyspaces, it works.

If I cqlsh to a local Cassandra and do a select * from Keyspace limit x, it returns data only up to a certain limit. I was able to return data with limit 760, but limit 761 would fail.

If I do a consistency all, and select the same, it hangs.

If I do a trace, different machines return the data, though sometimes source_elapsed is "null".

Not to forget, applications querying the cluster sometimes do get results, after several attempts.

Update 2

Further experimenting led to two failed node bootstraps: one hung on bootstrap for 4 days and eventually failed, possibly due to a rolling restart, and the other simply failed after 2 days. Repairs wouldn't work and produced "Stream failed" errors, as well as "Exception in thread Thread[StorageServiceShutdownHook,5,main] java.lang.NullPointerException". After running repair, we also started getting "Read an invalid frame size of 0. Are you using tframedtransport on the client side?".

Solution

Switch rpc_server_type from hsha to sync. All problems gone. We worked with hsha for months without issues.

If someone also stumbles here: http://planetcassandra.org/blog/post/hsha-thrift-server-corruption-cassandra-2-0-2-5/

In cassandra.yaml, switch rpc_server_type from hsha to sync.
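As a minimal sketch, the relevant line in cassandra.yaml would look roughly like this after the change. The comments are illustrative and not copied from the original poster's file, and the new value presumably only takes effect once each node has been restarted:

```yaml
# cassandra.yaml (sketch for Cassandra 2.0.x; surrounding settings omitted)

# Previous value, under which the hangs were observed:
# rpc_server_type: hsha

# Thrift RPC server implementation to use instead:
rpc_server_type: sync
```

Since the question notes that all nodes share the same config, the same edit would need to be applied on every node in the cluster.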