pyspark - How to do a Running (Streaming) reduceByKey in Spark Streaming


I'm using the textFileStream() method in the Spark Streaming Python API to read in XML files as they're created, map them to XML ElementTrees, take the "interesting" items out of each ElementTree, flatMap them to (key, value) pairs, and then reduceByKey() to aggregate the counts for each key.

So, if the key is a string such as a network name, the value might be a packet count. Upon reducing, I'm left with the total packet count for each network (key).

My problem is that I'm having trouble streaming this: instead of keeping a running total, it re-computes the counts from scratch on each batch. I think it's a paradigm issue on my part, so I'm wondering if someone can please help me structure this streaming analytic correctly. Thanks!
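For reference, here is a minimal sketch of the pipeline I described. The monitored directory, the batch interval, and the XML element/attribute names (network, name, packets) are placeholders, and I'm assuming each line of the incoming files holds one complete XML record, since textFileStream delivers lines rather than whole files:

```python
import xml.etree.ElementTree as ET
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PacketCounts")
ssc = StreamingContext(sc, 10)  # 10-second batches

def extract_counts(xml_line):
    """Parse one XML record and yield (network_name, packet_count) pairs."""
    root = ET.fromstring(xml_line)
    for node in root.iter("network"):        # hypothetical element name
        name = node.get("name")
        count = int(node.get("packets", 0))  # hypothetical attribute name
        if name is not None:
            yield (name, count)

lines = ssc.textFileStream("/path/to/xml/dir")   # watches for new files
counts = (lines.flatMap(extract_counts)           # (network, count) pairs
               .reduceByKey(lambda a, b: a + b))  # totals for THIS batch only
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

As the comment notes, reduceByKey here only aggregates within a single batch, which is exactly the behavior I want to get away from.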

Ah, the solution is to use updateStateByKey (see the Spark Streaming docs), which allows you to merge the results of the previous step with the data in the current step. In other words, it lets you keep a running calculation without having to store the entire RDD and recompute it every time new data is received.
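A minimal sketch of how that looks with the pipeline from the question (the checkpoint directory is a placeholder; checkpointing must be enabled for stateful operations):

```python
# Stateful transformations require a checkpoint directory.
ssc.checkpoint("/path/to/checkpoint/dir")

def update_total(new_counts, running_total):
    """Merge this batch's counts for a key into the running total."""
    return sum(new_counts) + (running_total or 0)

running = (lines.flatMap(extract_counts)
                .updateStateByKey(update_total))
running.pprint()  # prints the cumulative total per network each batch
```

The update function receives the list of new values for a key in the current batch plus the previously stored state (None the first time a key appears), and whatever it returns becomes the new state for that key.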

