Scala - How to calculate the mean of a DataFrame column and find the top 10%
I'm new to Scala and Spark, and I'm working through some self-made exercises using baseball statistics. I'm using a case class to create an RDD and assign a schema to the data, then turning it into a DataFrame so I can use Spark SQL to select groups of players whose stats meet certain criteria.

Once I have the subset of players I'm interested in looking at further, I'd like to find the mean of a column, e.g. batting average or RBIs. From there I'd like to break all the players into percentile groups based on their performance relative to the average: top 10%, bottom 10%, 40-50%, and so on.

I've been able to use the DataFrame.describe() function to return a summary of the desired column (mean, stddev, count, min, and max), but everything comes back as strings. Is there a better way to get the mean and stddev as Doubles, and what's the best way of breaking players into groups of 10 percentiles?

So far my thought is to find the values that bookend the percentile ranges and write a function that groups players via comparators, but that feels like it's bordering on reinventing the wheel.
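On the first part of the question, the mean and standard deviation can be pulled out as Doubles with agg() instead of describe(). A minimal sketch, assuming a DataFrame named `players` with a numeric column `battingAvg` (both hypothetical names):

```scala
import org.apache.spark.sql.functions.{mean, stddev}

// agg() keeps the numeric types, unlike describe(), which
// returns every summary statistic as a string.
val stats = players.agg(
  mean("battingAvg").as("avg"),
  stddev("battingAvg").as("sd")
).first()

// first() returns a Row; getDouble extracts the values as Doubles.
val avg: Double = stats.getDouble(0)
val sd: Double  = stats.getDouble(1)
```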
I was able to get the percentiles by using window functions and applying ntile() and cume_dist() over the window. ntile() creates a grouping based on an input number: if I want things grouped by 10%, I enter ntile(10); for 5%, ntile(20). For a more fine-tuned result, cume_dist() applied over the window outputs a new column with the cumulative distribution, which can then be filtered through select(), where(), or a SQL query.
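The window-function approach above can be sketched as follows, again assuming a DataFrame named `players` with a `battingAvg` column (hypothetical names):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{ntile, cume_dist, col}

// Order all players by batting average, best first. With no
// partitionBy, the window spans the entire DataFrame (Spark will
// warn that this moves all data to a single partition).
val w = Window.orderBy(col("battingAvg").desc)

val ranked = players
  .withColumn("decile", ntile(10).over(w))     // 1 = top 10%, 10 = bottom 10%
  .withColumn("cumeDist", cume_dist().over(w)) // cumulative distribution in (0, 1]

// Top 10% via ntile:
val top10pct = ranked.where(col("decile") === 1)

// Or a finer-grained band, e.g. the 40-50% slice via cume_dist:
val midBand = ranked.where(col("cumeDist") > 0.4 && col("cumeDist") <= 0.5)
```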