Cassandra WriteTimeoutException exception in CounterMutationStage - node dies eventually


I'm getting the following exception in the Cassandra system.log:

    WARN  [CounterMutationStage-25] 2017-07-25 13:25:35,874 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[CounterMutationStage-25,5,main]: {}
    java.lang.RuntimeException: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received 0 responses.
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2490) ~[apache-cassandra-3.9.jar:3.9]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[na:1.8.0_112]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136) [apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]
        at java.lang.Thread.run(Unknown Source) [na:1.8.0_112]
    Caused by: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received 0 responses.
        at org.apache.cassandra.db.CounterMutation.grabCounterLocks(CounterMutation.java:150) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.db.CounterMutation.applyCounterMutation(CounterMutation.java:122) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy$9.runMayThrow(StorageProxy.java:1473) ~[apache-cassandra-3.9.jar:3.9]
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2486) ~[apache-cassandra-3.9.jar:3.9]
        ... 5 common frames omitted

Whenever this happens, CPU usage drops to 0% for a minute or so and the node becomes unresponsive, then recovers after that. Eventually, the node dies (i.e. the process keeps running but does not respond to any commands anymore, shutdown does not work, and I have to kill the process).

Some more information:

  • Cassandra 3.9
  • G1 garbage collector
  • Single node on Windows Server 2012 R2 (20 cores, 256 GB RAM)
  • Using a lot of counters and counter mutations

Things I have tried:

  • Eliminated the other warnings in the log. I used to have warnings about counter batches being too large, so I rewrote the code not to use batching at all. That eliminated the warning, but not the exception or the problem.
  • Migrated to a bigger machine, used a bigger heap, and fine-tuned the GC to make sure the problem is not the machine being overstressed. CPU load is < 20%.

Does anyone have an idea what else I could do? My main concern is the node dying completely. I'm not sure the exception is what is causing it, but it is the only hint I have...

Update 1:

I updated to Cassandra 3.11, and the node does not seem to die anymore. However, the write timeouts persist, and the node is still unresponsive for several minutes at a time, but at least it recovers now.

Update 2:

We solved the problem (with the help of a professional consultant). Disk I/O speed on our node was terrible, leading to a growing queue of flush writers. The reason is unknown, since I/O speed tests on the drive (RAID 1 SSDs) were super good. Moving the node from Windows to Linux (and configuring it according to http://docs.datastax.com/en/landing_page/doc/landing_page/recommendedsettings.html) solved the problem.

The real reason for the problem remains unknown; it might have been Windows per se or some freak incompatibility with the RAID setup. In any case, Cassandra is tested mainly on Linux, and it is far easier to find help for Linux setups. Lesson learned.
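For anyone wanting to spot a growing flush-writer queue like this on their own node, here is a rough Python sketch, not what the consultant actually used. It assumes `nodetool tpstats` prints one row per thread pool as a name followed by five integer columns (Active, Pending, Completed, Blocked, All time blocked), and that the pools of interest are called MemtableFlushWriter and CounterMutationStage. The function names and the threshold are made up for illustration.

```python
import subprocess


def read_tpstats():
    """Run `nodetool tpstats` and return its text output (assumes nodetool is on PATH)."""
    result = subprocess.run(
        ["nodetool", "tpstats"], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_tpstats(text):
    """Parse thread pool rows into {pool_name: {"active", "pending", "blocked"}}.

    Assumed row layout: name, Active, Pending, Completed, Blocked, All time blocked.
    Header lines and the dropped-messages section do not match and are skipped.
    """
    pools = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue
        try:
            active, pending, _completed, blocked, _all_blocked = map(int, parts[1:])
        except ValueError:
            continue  # not a thread pool row
        pools[parts[0]] = {"active": active, "pending": pending, "blocked": blocked}
    return pools


def backlogged(pools, names=("MemtableFlushWriter", "CounterMutationStage"), threshold=0):
    """Return the watched pools whose pending count exceeds the threshold."""
    return [n for n in names if n in pools and pools[n]["pending"] > threshold]
```

Calling `backlogged(parse_tpstats(read_tpstats()))` every few seconds and logging the result should show whether pending flushes keep climbing while the counter writes time out.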

It sounds like a beefy machine with 20 cores and 256 GB of RAM. Cassandra is a distributed system designed to scale horizontally. Rather than pushing all the load at a single node, try adding more commodity hardware and scale horizontally. You can also run multiple Cassandra nodes within the same box.

At least try running a couple of nodes within the box to see how the unresponsiveness scales. CPU is usually not the bottleneck for Cassandra; the limit is the I/O that a single node can perform.

  • Check the value of concurrent_writes in cassandra.yaml. I guess that, based on the recommendation of 8 per core, it should be 160 for 20 cores (20 * 8).
  • If feasible, try putting the commitlog directory and the data directory on separate storage drives.
  • The best bet for scaling writes is to add more boxes (which can be smaller in configuration).
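As a quick sanity check for the first two points, something like the following Python sketch could be used. The 8-per-core multiplier comes from the recommendation above; the function names are made up for illustration, and the device check only compares filesystem device IDs, so two partitions on the same physical disk would still show up as "different".

```python
import os

WRITES_PER_CORE = 8  # multiplier from the recommendation above


def suggested_concurrent_writes(cores=None):
    """Return 8 * cores, using the local core count when none is given."""
    if cores is None:
        cores = os.cpu_count() or 1
    return WRITES_PER_CORE * cores


def on_different_devices(path_a, path_b):
    """True if the two paths live on different filesystem devices.

    This is an approximation: it cannot tell whether two devices
    share the same physical drive or RAID array.
    """
    return os.stat(path_a).st_dev != os.stat(path_b).st_dev
```

For the 20-core machine in the question, `suggested_concurrent_writes(20)` gives 160. Passing the commitlog and data directory paths from cassandra.yaml to `on_different_devices` indicates whether they at least sit on separate volumes.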
