dataframe - Choosing a different number of elements from each group in R


I am working on the Kaggle Instacart competition and, being quite new to R, I have run into a problem I cannot figure out.

I have a dataset with 4 columns. The first column is an order id (id1). The second column is a product id (id2). The third column is the probability that I want to select product id2 for order id1; it can be treated as a ranking, where a higher probability is selected over a smaller one. Finally, the fourth column is the number of products I want to select for the given order (a feature of the order). As an example, here are the first 12 rows of my dataframe df:

   id1   id2      prob num
1   17 13107 0.4756982   3
2   17 21463 0.3724126   3
3   17 38777 0.3534422   3
4   17 21709 0.3364623   3
5   17 47766 0.3364623   3
6   17 39275 0.3165896   3
7   34 16083 0.4093785   4
8   34 39475 0.3892882   4
9   34 47766 0.3892882   4
10  34  2596 0.3837562   4
11  34 21137 0.3762758   4
12  34 47792 0.3737032   4

We can see that for id1 = 17 I want to choose 3 elements, and for id1 = 34 I want to choose 4 elements. The result should be

id1    id2
17     13107, 21463, 38777
34     16083, 39475, 47766, 2596

or something similar to this.

At the moment I have tried using

df %>% group_by(id1) %>% top_n(n = num) 

but I get the error

Selecting by num
Error in is_scalar_integerish(n) : object 'num' not found

Does anyone know how to go about doing this?

Thanks

You can pipe the grouped data directly into a summarise statement:

library(dplyr)
df %>% group_by(id1) %>% summarise(id2 = toString(id2[seq_len(first(num))]))
# A tibble: 2 x 2
#     id1                       id2
#   <int>                     <chr>
# 1    17       13107, 21463, 38777
# 2    34 16083, 39475, 47766, 2596

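In this statement, id2[seq_len(first(num))] works in three steps: first(num) extracts the first num value of each group, seq_len creates a sequence from 1 to that value, and the sequence is used to subset the first num id2 values.

toString then collapses those values into a single string per id1 group.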
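To see what each piece of that expression does, here is a minimal sketch on the id1 = 17 group alone, using plain vectors copied from the question's sample data. It uses num[1] in place of dplyr's first(num) so that it runs without any packages; that substitution and the names k, idx, picked and result are illustrative, not part of the answer above.

```r
# Values for the id1 = 17 group, copied from the sample data in the question
id2 <- c(13107, 21463, 38777, 21709, 47766, 39275)
num <- c(3, 3, 3, 3, 3, 3)

# num is constant within a group, so its first element is the group's quota
# (num[1] plays the role of dplyr::first(num) here)
k <- num[1]

# seq_len(k) builds the index sequence 1, 2, ..., k
idx <- seq_len(k)

# Subsetting with that sequence keeps the first k product ids
picked <- id2[idx]

# toString collapses the vector into one comma-separated string
result <- toString(picked)
print(result)
# [1] "13107, 21463, 38777"
```

One thing to watch: if num is larger than the number of rows in a group, indexing with seq_len pads the result with NA. Using head(id2, k) instead would simply return the rows that exist.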


Here's a base R option using aggregate:

aggregate(id2 ~ id1, data = subset(df, ave(id1, id1, FUN = seq_along) <= num), FUN = toString)
#   id1                       id2
# 1  17       13107, 21463, 38777
# 2  34 16083, 39475, 47766, 2596

Please note that I assumed the data is ordered (as in the example) by decreasing probability.
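If the rows are not already sorted, one way to drop that assumption is to order the data by id1 and by decreasing prob before applying the same selection. The sketch below rebuilds the question's 12 sample rows, reverses them to simulate unsorted input, and uses only base R. The reversal and the names df_sorted, rank_in_group and res are my own, for illustration.

```r
# Sample rows from the question
df <- data.frame(
  id1  = rep(c(17L, 34L), each = 6),
  id2  = c(13107L, 21463L, 38777L, 21709L, 47766L, 39275L,
           16083L, 39475L, 47766L, 2596L, 21137L, 47792L),
  prob = c(0.4756982, 0.3724126, 0.3534422, 0.3364623, 0.3364623, 0.3165896,
           0.4093785, 0.3892882, 0.3892882, 0.3837562, 0.3762758, 0.3737032),
  num  = rep(c(3L, 4L), each = 6)
)

# Reverse the rows to simulate unsorted input (illustrative only)
df <- df[rev(seq_len(nrow(df))), ]

# Sort by order id, then by decreasing probability within each order
df_sorted <- df[order(df$id1, -df$prob), ]

# Number the rows within each order and keep those numbered <= num
rank_in_group <- ave(df_sorted$id1, df_sorted$id1, FUN = seq_along)
res <- aggregate(id2 ~ id1,
                 data = df_sorted[rank_in_group <= df_sorted$num, ],
                 FUN = toString)
print(res)
```

Rows with tied prob values keep their input order after sorting, so tied products (39475 and 47766 for id1 = 34 here) may come out in a different order than in the output shown above, although the same products are selected.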
