Optimizing pandas groupby python -


i have dataframe contracts of products (with product_id). these contracts opened @ date (startdate) , closed @ moment in time (closedate). possible contract active @ moment in time, , therefore not have closedate.

there multiple clients have contracts, referenced id. these customers fill in surveys @ moments in time, moment in time indicated date (key_date).

what want calculate several number of features, example focus on number of unique products. want know how many unique products customer has, on moment of filling out survey.

we have dataframe df_result, has id of customer , date filled in survey. in dataframe append calculated feature:

import pandas pd import numpy np np.random.seed(256) df_result = pd.dataframe({'id' : np.random.randint(3, size=(10)),                       'key_date' : pd.date_range(start=pd.datetime(2015, 5, 21), periods=10, freq='m')}) df_result.head()      id  key_date 0   0   2015-05-31 1   2   2015-06-30 2   1   2015-07-31 3   0   2015-08-31 4   1   2015-09-30 

we have dataframe different contracts/products, named df_products:

np.random.seed(321) df_products = pd.dataframe({'id' : np.random.randint(5, size=(10)),                         'product_id' : np.random.randint(low = 101, high = 104, size=10),                       'startdate' : pd.date_range(start=pd.datetime(2015, 3, 1), periods=10, freq='m'),                        'closedate' : pd.date_range(start=pd.datetime(2016, 1, 1), periods=10, freq='m')}) df_products.head()      closedate   startdate   id  product_id 0   2016-01-31  2015-03-31  4   102 1   2016-02-29  2015-04-30  2   101 2   2016-03-31  2015-05-31  4   102 3   2016-04-30  2015-06-30  1   102 4   2016-05-31  2015-07-31  0   103 

i made function count unique products of customer filled in survey, contract still active on time of filling out, key_date (so start date of contract (startdate) before date, , end date (closedate) after date). want able give range before filling in date, unique products have been active in past year example. closed contracts of 11 months ago included. giving parameter timeperiod subtract of filling in date (making new date: low_date). then, closedate have later low_date, instead of key_date.

def unique_products(df,timeperiod,id,key_date):     low_date = key_date - relativedelta(months=timeperiod)     data = df.loc[(df['startdate'] <= key_date) &                    (df['closedate'] >= low_date) &               (df['id'] == id)].groupby(['id'], as_index = false)['product_id'].nunique().reset_index()     if 'product_id' in list(data):         try:             return float(data['product_id'])         except:             return np.nan 

after append these values in new column named unique_products in df_result:

df_result['unique_products'] = df_result.apply(lambda row: unique_products(df_products, 3, row['id'], row['key_date']), axis=1) df_result.head()       id  key_date    unique_products 0   0   2015-05-31  nan 1   2   2015-06-30  1.0 2   1   2015-07-31  1.0 3   0   2015-08-31  1.0 4   1   2015-09-30  2.0 

however when applying whole dateset, becomes quite slow due fact each surveyrow must evaluated since have different times. there way improve this?

thanks input :)

df_result['low_date'] = df_result['key_date'] - relativedelta(months=timeperiod) #creating low_date column df_result2 = pandas.merge(df_result,df_products,how = "outer",on = "id") #join both tables df_result2 = df_result2[(df_result2['startdate'] <= df_result2['key_date']) & (df_result2['closedate'] >= df_result2['low_date'])] # filter conditions df_result2 = df_result2.groupby(['id','key_date'])['product_id'].nunique().reset_index() 

try out once using cross join instead of kind of loop using.


Comments

Popular posts from this blog

networking - Vagrant-provisioned VirtualBox VM is not reachable from Ubuntu host -

c# - ASP.NET Core - There is already an object named 'AspNetRoles' in the database -

android - IllegalStateException: Cannot call this method while RecyclerView is computing a layout or scrolling -