Cassandra WriteTimeoutException in CounterMutationStage - node dies eventually
I'm getting the following exception in the Cassandra system.log:
WARN [CounterMutationStage-25] 2017-07-25 13:25:35,874 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[CounterMutationStage-25,5,main]: {}
java.lang.RuntimeException: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2490) ~[apache-cassandra-3.9.jar:3.9]
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[na:1.8.0_112]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136) [apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
    at java.lang.Thread.run(Unknown Source) [na:1.8.0_112]
Caused by: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.db.CounterMutation.grabCounterLocks(CounterMutation.java:150) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.db.CounterMutation.applyCounterMutation(CounterMutation.java:122) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.StorageProxy$9.runMayThrow(StorageProxy.java:1473) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2486) ~[apache-cassandra-3.9.jar:3.9]
    ... 5 common frames omitted
Whenever this happens, CPU usage drops to 0% for a minute or so, the node becomes unresponsive, and then it recovers. Eventually, the node dies (i.e. the process keeps running but does not respond to commands any more, shutdown does not work, and I have to kill the process).
Some more information:
- Cassandra 3.9
- G1 garbage collector
- Single node on Windows Server 2012 R2 (20 cores, 256 GB RAM)
- Using a lot of counters and counter mutations
Things I have tried:
- Eliminated the other warnings in the log. I used to have warnings about counter batches being too large, so I rewrote the code to not use batching at all. That eliminated the warning, but not the exception or the problem.
- Migrated to a bigger machine, used a bigger heap, and fine-tuned GC to make sure the problem is not the machine being overstressed. CPU load is < 20%.
Does anyone have an idea what else I could do? My main concern is the node dying completely. I'm not sure the exception is what is causing it, but it is the only hint I have...
Update 1:
I updated to Cassandra 3.11, and the node does not seem to die any more. However, the write timeouts persist, and the node is still unresponsive for several minutes at a time, but at least it recovers now.
Update 2:
We solved the problem (with the help of a professional consultant). Disk I/O speed on our node was terrible, leading to a growing queue of flush writers. The reason is unknown, since I/O speed tests on the drive (RAID 1 SSDs) were super good. Moving the node from Windows to Linux (and configuring it according to http://docs.datastax.com/en/landing_page/doc/landing_page/recommendedSettings.html) solved the problem.
The real reason for the problem remains unknown; it might have been Windows per se or a freak incompatibility with the RAID setup. In any case, Cassandra is better tested on Linux, and it is far easier to find help for Linux setups. Lesson learned.
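For anyone chasing a similar symptom, a growing flush-writer queue of the kind described above can usually be spotted in the Pending column of `nodetool tpstats`. The following is only a minimal sketch: the sample text is a hypothetical excerpt with invented numbers (not output from the affected node), and the helper names and the threshold of 10 are assumptions made for illustration.

```python
import re

# Hypothetical excerpt in the column layout printed by `nodetool tpstats`.
# The numbers are invented for illustration only.
SAMPLE_TPSTATS = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0         812345         0                 0
CounterMutationStage             32      1875         450321         0                 0
MemtableFlushWriter               2        41           1290         0                 0
"""


def pending_by_pool(tpstats_text):
    """Map each thread pool name to the value in its Pending column."""
    pending = {}
    for line in tpstats_text.splitlines():
        # A data row is a pool name followed by five integer columns:
        # Active, Pending, Completed, Blocked, All time blocked.
        match = re.match(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$", line)
        if match:
            pending[match.group(1)] = int(match.group(3))
    return pending


def backed_up_pools(tpstats_text,
                    pools=("MemtableFlushWriter", "CounterMutationStage"),
                    threshold=10):
    """Return the watched pools whose Pending count exceeds the (arbitrary) threshold."""
    counts = pending_by_pool(tpstats_text)
    return {name: counts[name] for name in pools if counts.get(name, 0) > threshold}


if __name__ == "__main__":
    for name, count in backed_up_pools(SAMPLE_TPSTATS).items():
        print(f"{name}: {count} pending tasks")
```

In practice the text would come from running `nodetool tpstats` periodically; a Pending count for MemtableFlushWriter that keeps climbing between runs suggests flushes cannot keep up with writes.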
It sounds like a beefy machine, with 20 cores and 256 GB RAM. Cassandra is a distributed system designed to scale horizontally. Rather than pushing all the load at a single node, try adding more commodity hardware and scaling horizontally. You can even run multiple Cassandra nodes within the same box.
At least try running a couple of nodes within the box to address the unresponsiveness. CPU is not the bottleneck for Cassandra here; the limit is the I/O that a single node can perform.
- Check the value of concurrent_writes in cassandra.yaml; going by the usual recommendation, I guess it should be 160 for 20 cores (20 * 8).
- If feasible, try putting the commitlog directory and the data directory on separate storage drives.
- Your best bet for scaling writes is to add more boxes (which can be smaller in configuration).
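As a rough illustration of the first point, the sketch below compares the concurrent_writes value found in a cassandra.yaml fragment with the 8-per-core rule of thumb. The YAML fragment is a hypothetical sample, and the helper names are made up for this example; the parsing is deliberately simple and only handles top-level integer settings.

```python
import re


def suggested_concurrent_writes(cores):
    """Rule of thumb quoted above: 8 writer threads per CPU core."""
    return 8 * cores


def read_int_setting(yaml_text, key):
    """Return a top-level integer setting from cassandra.yaml text.

    Returns None if the key is absent, commented out, or not an integer.
    """
    pattern = rf"^{re.escape(key)}:[ \t]*(\d+)[ \t]*(?:#.*)?$"
    match = re.search(pattern, yaml_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


# Hypothetical cassandra.yaml fragment, for illustration only.
SAMPLE_YAML = """\
concurrent_reads: 32
concurrent_writes: 32
concurrent_counter_writes: 32
# commitlog_directory: /var/lib/cassandra/commitlog
"""

if __name__ == "__main__":
    current = read_int_setting(SAMPLE_YAML, "concurrent_writes")
    target = suggested_concurrent_writes(20)
    print(f"concurrent_writes is {current}, rule of thumb suggests {target}")
```

Treat the computed number as a starting point only. If disk I/O is the real bottleneck, as it turned out to be in the question, raising write concurrency will not help by itself.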