pyspark - How to do a Running (Streaming) reduceByKey in Spark Streaming


I'm using the textFileStream() method in the Python API for Spark Streaming to read in XML files as they're created, map them to xml.etree.ElementTree objects, take the "interesting" items out of each ElementTree, flatMap them into (key: value) pairs, and then reduceByKey() to aggregate the counts for each key.

So, if the key is a string such as a network name, the value might be a packet count. Upon reducing, I'm left with the total packet count for each network (key) in the dictionary.
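A minimal sketch of that pipeline might look like the following. The monitored path, the batch interval, and the XML element and attribute names (record, network, packets) are all assumptions for illustration. It also assumes each line of the incoming files is a self-contained XML record, since textFileStream() delivers files line by line:

```python
from xml.etree import ElementTree

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PacketCounts")
ssc = StreamingContext(sc, 10)  # 10-second batch interval (an assumption)

def parse_record(line):
    # Each line is assumed to be a self-contained XML record, e.g.
    # <record network="home-wifi" packets="42"/>; the element and
    # attribute names here are hypothetical.
    try:
        elem = ElementTree.fromstring(line)
        yield (elem.get("network"), int(elem.get("packets", 0)))
    except ElementTree.ParseError:
        return  # skip lines that are not complete XML records

# Monitor a directory for newly created files; the path is an assumption.
lines = ssc.textFileStream("hdfs:///incoming/xml")

# Per-batch totals only: each batch is reduced in isolation,
# which is exactly the behaviour the question describes.
counts = lines.flatMap(parse_record).reduceByKey(lambda a, b: a + b)

counts.pprint()
ssc.start()
ssc.awaitTermination()
```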

My problem is that I'm having trouble streaming this: instead of keeping a running total, it re-does the computation from scratch each batch. I think it's a paradigmatic issue on my part, so I'm wondering if someone can help me structure this streaming analytic correctly. Thanks!

Ah, the solution is to use updateStateByKey. Per the docs, it lets you merge the results of the previous step with the data in the current step. In other words, it lets you keep a running calculation without having to store the entire RDD and recompute it every time new data is received.
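Here's a sketch of how updateStateByKey could replace the reduceByKey step from the pipeline above (reusing the lines DStream and parse_record function from that sketch). Checkpointing must be enabled for stateful transformations; the checkpoint directory path is an assumption:

```python
# Checkpointing is required for stateful transformations like
# updateStateByKey; the directory path is an assumption.
ssc.checkpoint("hdfs:///checkpoints/packet-counts")

def update_total(new_values, running_total):
    # new_values: this batch's counts for one key;
    # running_total: the previous state (None the first time a key is seen).
    return sum(new_values) + (running_total or 0)

# Carries each key's total forward across batches instead of
# reducing every batch in isolation.
running_counts = lines.flatMap(parse_record).updateStateByKey(update_total)

running_counts.pprint()
```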

