pyspark - How to do a Running (Streaming) reduceByKey in Spark Streaming


I'm using the textFileStream() method in the Python API for Spark Streaming to read in XML files as they're created, map them to xml.etree.ElementTree objects, take the "interesting" items out of each ElementTree, flatMap them into (key: value) pairs, and then reduceByKey() to aggregate the counts for each key.

So, if the key is a string such as a network name, the value might be a packet count. Upon reducing, I'm left with the total packet count for each network (key) in the dictionary.
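A minimal sketch of that pipeline might look like the following. The monitored path, the batch interval, and the XML element and attribute names (record, network, packets) are all assumptions for illustration. It also assumes each line of the incoming files is a self-contained XML record, since textFileStream() delivers files line by line:

```python
from xml.etree import ElementTree

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PacketCounts")
ssc = StreamingContext(sc, 10)  # 10-second batch interval (an assumption)

def parse_record(line):
    # Each line is assumed to be a self-contained XML record, e.g.
    # <record network="home-wifi" packets="42"/>; the element and
    # attribute names here are hypothetical.
    try:
        elem = ElementTree.fromstring(line)
        yield (elem.get("network"), int(elem.get("packets", 0)))
    except ElementTree.ParseError:
        return  # skip lines that are not complete XML records

# Monitor a directory for newly created files; the path is an assumption.
lines = ssc.textFileStream("hdfs:///incoming/xml")

# Per-batch totals only: each batch is reduced in isolation,
# which is exactly the behaviour the question describes.
counts = lines.flatMap(parse_record).reduceByKey(lambda a, b: a + b)

counts.pprint()
ssc.start()
ssc.awaitTermination()
```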

My problem is that I'm having trouble streaming this: instead of keeping a running total, it re-does the computation from scratch each batch. I think it's a paradigmatic issue on my part, so I'm wondering if someone can help me structure this streaming analytic correctly. Thanks!

Ah, the solution is to use updateStateByKey. Per the docs, it lets you merge the results of the previous step with the data in the current step. In other words, it lets you keep a running calculation without having to store the entire RDD and recompute it every time new data is received.
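Here's a sketch of how updateStateByKey could replace the reduceByKey step from the pipeline above (reusing the lines DStream and parse_record function from that sketch). Checkpointing must be enabled for stateful transformations; the checkpoint directory path is an assumption:

```python
# Checkpointing is required for stateful transformations like
# updateStateByKey; the directory path is an assumption.
ssc.checkpoint("hdfs:///checkpoints/packet-counts")

def update_total(new_values, running_total):
    # new_values: this batch's counts for one key;
    # running_total: the previous state (None the first time a key is seen).
    return sum(new_values) + (running_total or 0)

# Carries each key's total forward across batches instead of
# reducing every batch in isolation.
running_counts = lines.flatMap(parse_record).updateStateByKey(update_total)

running_counts.pprint()
```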

