pyspark - How to do a Running (Streaming) reduceByKey in Spark Streaming


I'm using the textFileStream() method in the Python API for Spark Streaming to read in XML files as they're created, map them to xml.etree.ElementTree objects, take the "interesting" items out of each ElementTree, flatMap them into (key, value) pairs, and then reduceByKey() to aggregate counts for each key.

So, if the key is a network name (a string), the value might be a packet count. After reducing, I'm left with the total packet count for each network (key).

My problem is that I'm having trouble streaming this: instead of keeping a running total, it re-computes the aggregation from scratch on each batch. I think it's a paradigm issue on my end, so I'm wondering if anyone can help me stream this analytic correctly. Thanks!
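The per-batch pipeline described above can be sketched as follows. The XML layout and the helper name `extract_counts` are illustrative assumptions, not the asker's actual schema; only the parse helper is standalone, while the Spark wiring is shown in comments since it needs a running StreamingContext.

```python
import xml.etree.ElementTree as ET

def extract_counts(xml_string):
    """Parse one XML document and yield (network_name, packet_count) pairs.

    Assumes elements like <network name="wifi" packets="3"/>; adjust the
    tag and attribute names to match the real files.
    """
    root = ET.fromstring(xml_string)
    for node in root.iter("network"):
        yield (node.get("name"), int(node.get("packets", "0")))

# Inside the streaming job this would be wired up roughly as:
#   ssc = StreamingContext(sc, batch_interval)
#   lines = ssc.textFileStream("/path/to/xml/dir")
#   counts = lines.flatMap(extract_counts).reduceByKey(lambda a, b: a + b)
#   counts.pprint()
```

Note that reduceByKey() here operates on each batch's RDD independently, which is exactly why the totals reset every batch interval.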

Ah, the solution is to use updateStateByKey. Per the docs, it allows you to merge the results of the previous step with the data in the current step. In other words, it lets you keep a running calculation without having to store the entire RDD and recompute it every time new data is received.
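A minimal sketch of the updateStateByKey approach, assuming the same (key, count) pairs as above; the update function itself is plain Python, and the checkpoint path is a placeholder:

```python
def update_total(new_values, running_total):
    """Merge this batch's counts into the running total for one key.

    new_values: list of counts seen for this key in the current batch.
    running_total: the previous state for this key, or None on first sight.
    """
    return sum(new_values) + (running_total or 0)

# In the streaming job:
#   ssc.checkpoint("/tmp/spark-checkpoint")  # updateStateByKey requires checkpointing
#   running = lines.flatMap(extract_counts).updateStateByKey(update_total)
#   running.pprint()
```

Each batch, Spark calls update_total once per key with that batch's new values and the previously stored total, so the emitted DStream carries cumulative counts rather than per-batch ones.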

