Scala - How to calculate the mean of a DataFrame column and find the top 10%
I'm new to Scala and Spark, and I'm working through some self-made exercises using baseball statistics. I'm using a case class to create an RDD and assign a schema to the data, then turning it into a DataFrame so I can use Spark SQL to select groups of players whose stats meet certain criteria.

Once I have the subset of players I'm interested in looking at further, I'd like to find the mean of a column, e.g. batting average or RBIs. From there I'd like to break all the players into percentile groups based on their performance relative to the average: top 10%, bottom 10%, 40-50%, and so on.

I've been able to use the DataFrame.describe() function to return a summary of the desired column (mean, stddev, count, min, and max), but everything comes back as strings. Is there a better way to get the mean and stddev as Doubles, and what's the best way of breaking players into groups of 10 percentiles?

So far my thought is to find the values that bookend the percentile ranges and write a function that groups players via comparators, but that feels like it's bordering on reinventing the wheel.
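On the first part of the question, the mean and standard deviation can be pulled out as Doubles with agg() instead of describe(). A minimal sketch, assuming a DataFrame named `players` with a numeric column `battingAvg` (both hypothetical names):

```scala
import org.apache.spark.sql.functions.{mean, stddev}

// agg() keeps the numeric types, unlike describe(), which
// returns every summary statistic as a string.
val stats = players.agg(
  mean("battingAvg").as("avg"),
  stddev("battingAvg").as("sd")
).first()

// first() returns a Row; getDouble extracts the values as Doubles.
val avg: Double = stats.getDouble(0)
val sd: Double  = stats.getDouble(1)
```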
I was able to get the percentiles by using window functions and applying ntile() and cume_dist() over the window. ntile() creates a grouping based on an input number: if I want things grouped by 10%, I enter ntile(10); for 5%, ntile(20). For a more fine-tuned result, cume_dist() applied over the window outputs a new column with the cumulative distribution, which can then be filtered through select(), where(), or a SQL query.
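The window-function approach above can be sketched as follows, again assuming a DataFrame named `players` with a `battingAvg` column (hypothetical names):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{ntile, cume_dist, col}

// Order all players by batting average, best first. With no
// partitionBy, the window spans the entire DataFrame (Spark will
// warn that this moves all data to a single partition).
val w = Window.orderBy(col("battingAvg").desc)

val ranked = players
  .withColumn("decile", ntile(10).over(w))     // 1 = top 10%, 10 = bottom 10%
  .withColumn("cumeDist", cume_dist().over(w)) // cumulative distribution in (0, 1]

// Top 10% via ntile:
val top10pct = ranked.where(col("decile") === 1)

// Or a finer-grained band, e.g. the 40-50% slice via cume_dist:
val midBand = ranked.where(col("cumeDist") > 0.4 && col("cumeDist") <= 0.5)
```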