pyspark - How to do a Running (Streaming) reduceByKey in Spark Streaming
I'm using the textFileStream() method in the Python API for Spark Streaming to read in XML files as they're created, map them to XML ElementTrees, take the "interesting" items out of each ElementTree and flatMap them to (key, value) pairs, then use reduceByKey() to aggregate the counts for each key.
So, if the key is a string network name, the value might be a packet count. Upon reducing, I'm left with the total packet count for each network (key).
My problem is that I'm having trouble streaming this: instead of keeping a running total, it re-computes the aggregation from scratch each time. I think it's a paradigm issue on my part, so I'm wondering if someone can help me structure this streaming analytic correctly. Thanks!
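Here's a minimal sketch of what the pipeline looks like. The directory path and XML structure are hypothetical, and I'm assuming each line read by textFileStream() is a self-contained XML snippet of the form <network name="..." packets="..."/>:

```python
import xml.etree.ElementTree as ET
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="NetworkPacketCounts")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

def to_pairs(line):
    """Parse one XML snippet and emit (network_name, packet_count) pairs."""
    elem = ET.fromstring(line)
    return [(elem.get("name"), int(elem.get("packets")))]

lines = ssc.textFileStream("/path/to/xml/dir")  # hypothetical monitored directory
counts = lines.flatMap(to_pairs).reduceByKey(lambda a, b: a + b)
counts.pprint()  # prints totals for the current batch only -- not a running total

ssc.start()
ssc.awaitTermination()
```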
Ah, the solution is to use updateStateByKey
(doc), which allows you to merge the results of the previous step with the data in the current step. In other words, it lets you keep a running calculation without having to store the entire RDD and recompute it every time new data is received.
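A minimal sketch of how that looks, assuming the same hypothetical XML snippets and paths as in the question. Note that updateStateByKey() requires a checkpoint directory:

```python
import xml.etree.ElementTree as ET
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="RunningNetworkPacketCounts")
ssc = StreamingContext(sc, batchDuration=10)
ssc.checkpoint("/path/to/checkpoint/dir")  # required for stateful operations

def to_pairs(line):
    """Parse one XML snippet and emit (network_name, packet_count) pairs."""
    elem = ET.fromstring(line)
    return [(elem.get("name"), int(elem.get("packets")))]

def update_total(new_values, running_total):
    # Merge this batch's counts for a key into the total from previous batches.
    return sum(new_values) + (running_total or 0)

running_counts = (ssc.textFileStream("/path/to/xml/dir")
                     .flatMap(to_pairs)
                     .updateStateByKey(update_total))
running_counts.pprint()  # cumulative total per network, updated each batch

ssc.start()
ssc.awaitTermination()
```

With this, pprint() shows the cumulative total per network after every batch, rather than just the counts from the most recent batch.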