R - Creating a unique ID variable as a combination of variables
I have a data frame (df) or data table (dt) with, let's say, 1000 variables and 1000 observations. I checked that there are no duplicates in the observations: dt[!duplicated(dt)] has the same length as the original file.

I would like to create an ID variable for each observation as a combination of some of the 1000 variables I have. Unlike in other questions, I don't know which variables are most suitable for creating the ID, and it is likely that I need a combination of at least 3 or 4 variables.

Is there any package/function in R that could give me the most efficient combination of variables to create the ID variable? In my real example I am struggling to create the ID manually, and it is probably not the best combination of variables.
Example with mtcars:
    require(data.table)
    example <- data.table(mtcars)
    rownames(example) <- NULL # delete mtcars row names
    example <- example[!duplicated(example),]
    example[,id_var_wrong := paste0(mpg,"_",cyl)]
    length(unique(example$id_var_wrong)) # wrong ID: there are 27 different values for this variable despite 32 observations
    example[,id_var_good := paste0(wt,"_",qsec)]
    length(unique(example$id_var_good)) # good ID: the number of unique values equals the number of observations
Is there any function to find wt and qsec automatically, and not manually?
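To make the manual check above reusable, here is a minimal sketch of a helper that tests whether a given set of columns identifies every row. The name is_key is hypothetical (it is not from any package), and the sketch assumes a plain data.frame rather than a data.table.

```r
# Hypothetical helper: TRUE if the combination of `cols` has no repeated rows.
# Assumes `df` is a plain data.frame (base R column indexing).
is_key <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

is_key(mtcars, c("mpg", "cyl"))  # the "wrong" ID from the example
is_key(mtcars, c("wt", "qsec"))  # the "good" ID from the example
```

This only verifies a candidate combination; it does not search for one.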
A homemade algorithm: the principle is to greedily take the variable with the highest number of distinct elements, then keep only the remaining rows that still have duplicates, and iterate. This doesn't give the best solution, but it's an easy way to get a rather good solution quickly.
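    set.seed(1)
    mat <- replicate(1000, sample(c(letters, LETTERS), size = 100, replace = TRUE))
    library(dplyr)
    columnsid <- function(mat) {
      df <- df0 <- as_data_frame(mat)
      vars <- c()
      # stop once no rows are duplicated on the selected variables
      while(nrow(df) > 0) {
        # variable with the most distinct values among the still-duplicated rows
        var_best <- names(which.max(lapply(df, n_distinct)))[[1]]
        vars <- append(vars, var_best)
        # keep only the rows that the selected variables cannot tell apart yet
        df <- group_by_at(df0, vars) %>% filter(n() > 1)
      }
      vars
    }
    columnsid(mat)
    # [1] "V68" "V32"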
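For comparison with the greedy approach, a brute-force search can find a smallest combination when the number of columns is small. The following is only a sketch in base R: smallest_key and its max_size argument are hypothetical names, and it assumes a plain data.frame. It tries every combination of 1 column, then 2, and so on, and returns the first one without duplicated rows.

```r
# Hypothetical helper: exhaustive search for a smallest set of columns
# whose combined values are unique for every row.
# Assumes `df` is a plain data.frame; returns NULL if nothing is found
# within `max_size` columns.
smallest_key <- function(df, max_size = 3) {
  for (k in seq_len(min(max_size, ncol(df)))) {
    # all combinations of k column names, as a list of character vectors
    combos <- combn(names(df), k, simplify = FALSE)
    for (cols in combos) {
      if (anyDuplicated(df[, cols, drop = FALSE]) == 0) {
        return(cols)
      }
    }
  }
  NULL
}

key <- smallest_key(mtcars)

# Build the ID by pasting the selected columns together
id_var <- do.call(paste, c(unname(as.list(mtcars[key])), sep = "_"))
length(unique(id_var)) == nrow(mtcars)
```

The cost grows quickly with the number of columns: with 1000 variables there are already about 166 million combinations of size 3, so for data of that width the greedy approach is likely the more practical choice.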