dataframe - R: replacing columns by lookup to a dictionary
In this question I need to be able to look up a value for a dataframe's column based not on one attribute, but on more attributes and a range compared against a dictionary. (Yes, this is a continuation of the story in "R conditional replace more columns lookup".)
It should be an easy question for people who know R, because I already provide a working solution using basic indexing that just needs to be upgraded; it is possibly hard for me, because I am still in the process of learning R.
From the start:
When I want to replace missing values in the columns testcolnames of the (big) table df1 according to the column default of the (small) dictionary testdefs (the row being selected by making testdefs$labmet_id equal to the column name from testcolnames), I use this code:
testcolnames <- c("80", "116")  #... result of a regexp on colnames(df1), longer in reality
df1[, testcolnames] <- lapply(testcolnames, function(x) {
  tmpcol <- df1[, x]
  tmpcol[is.na(tmpcol)] <- testdefs$default[match(x, testdefs$labmet_id)]
  tmpcol
})
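Just to illustrate what the match() lookup does (using the example testdefs shown further below):

match("80", testdefs$labmet_id)                    # 3: index of the first row with labmet_id 80
testdefs$default[match("80", testdefs$labmet_id)]  # 0.03: the default from that row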
Where I need to go:
Now I need to upgrade that solution. The table testdefs has (example below) multiple rows with the same labmet_id, differing in two new columns called lower and upper ... these need to bound the variable df1$rngvalue when selecting the value to replace with.
In words: the upgraded solution should not simply select the row of testdefs where testdefs$labmet_id equals the column name, but should select from those rows the one where df1$rngvalue lies within the bounds of testdefs$lower and testdefs$upper (if no such row exists, take the closest range, either the lowest or the highest; if the dictionary doesn't have the labmet_id at all, the NA can be left in the original data).
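Expressed as code (just a sketch to make the requirement concrete; lookup_default is a made-up helper name, and I assume the rows for each labmet_id come sorted by lower), the rule would be roughly:

lookup_default <- function(id, rngvalue, defs = testdefs) {
  rows <- defs[defs$labmet_id == id, ]
  if (nrow(rows) == 0) return(NA)                  # unknown labmet_id: leave NA in the data
  hit <- which(rows$lower <= rngvalue & rngvalue <= rows$upper)
  if (length(hit) == 0)                            # outside all ranges: take the closest one
    hit <- if (rngvalue < min(rows$lower)) 1 else nrow(rows)
  rows$default[hit[1]]
}
lookup_default("116", 600000)  # 0.105 with the example testdefs below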
An example:
testdefs:
"labmet_id","lower","upper","default","notuse","notuse2"
30,0,54750,25,80,2        #... many columns I don't care about
46,0,54750,1.45,3.5,0.2
80,0,54750,0.03,0.1,0.01
116,0,30,0.09,0.5,0.01
116,31,365,0.135,0.7,0.01
116,366,5475,0.11,0.7,0.01
116,5476,54750,0.105,0.7,0.02
df1:
"rngvalue","80","116" 36,na,na 600000,na,na 367,5,na 90,na,6
To be transformed into:
"rngvalue","80","116" 36,0.03,0.135 #col80 replaced 0.03 600000,0.03,0.105 #col116 needs decided on range, value bigger in dictionary take last 1 367,5,0.11 #5 not replaced, second column nicely looks 0.11 90,0.03,6 #6 not replaced
Since the intervals don't have gaps, you can use findInterval. First change the lookup table into a list containing the break points and defaults for each value, using dlply from the plyr package.
## transform the lookup table into a list containing break points and defaults for each value
library(plyr)
lookup <- dlply(testdefs, .(labmet_id), function(x)
  list(breaks  = c(rbind(x$lower, x$upper), x$upper[length(x$upper)])[c(TRUE, FALSE)],
       default = x$default))
So, the lookups look like:
lookup[["116"]] # $breaks # [1] 0 31 366 5476 54750 # # $default # [1] 0.090 0.135 0.110 0.105
Then, you can do the lookup as follows:
testcolnames <- c("80", "116")
df1[, testcolnames] <- lapply(testcolnames, function(x) {
  tmpcol <- df1[, x]
  defaults <- with(lookup[[x]], {
    default[pmax(pmin(length(breaks) - 1, findInterval(df1$rngvalue, breaks)), 1)]
  })
  tmpcol[is.na(tmpcol)] <- defaults[is.na(tmpcol)]
  tmpcol
})

df1
#   rngvalue   80   116
# 1       36 0.03 0.135
# 2   600000 0.03 0.105
# 3      367 5.00 0.110
# 4       90 0.03 6.000
findInterval returns 0 for values below the first break and length(breaks) for values above the last one, so the index can fall outside 1..length(default) when rngvalue is outside the range. That is the reason for the pmin and pmax in the code above.
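A quick check of that edge behaviour, using the lookup list built above:

findInterval(c(36, 600000), lookup[["116"]]$breaks)  # 2 5: 5 is one past the last interval
findInterval(-1, lookup[["116"]]$breaks)             # 0: below the first break
# clamping into 1..length(default) before indexing:
pmax(pmin(length(lookup[["116"]]$breaks) - 1,
          findInterval(c(36, 600000), lookup[["116"]]$breaks)), 1)  # 2 4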