Scala - Does this code take full advantage of Spark's distributed computing and functional programming paradigms?
I have an array of n tuples: (key, data), both custom-defined objects. Each data object has a string field called query.
A query counts as valid if the string is not equal to the default value "".
What I want is the proportion of the array that has valid queries.
This is the code I have right now:
val proportion = (inputarr map { case (key, data) => (data map (d => if (d.query != "") 1 else 0)).sum }).sum.toDouble / inputarr.length
Is this optimal, or is there a better way of accomplishing what I want to do? Also, I know Spark automatically parallelizes reduce; my IDE suggested I replace reduce(_ + _) with sum — is that still automatically distributed and computed? Also, since I'm new to the functional approach, please let me know if I'm doing anything inefficiently.
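For reference, here is a minimal self-contained sketch of the same computation on plain Scala collections (the Key and Data case classes and the sample values are hypothetical stand-ins for the custom objects). Note the .toDouble: without it the division is integer division and the result silently truncates. Also note that with this formula the value can exceed 1 when a single tuple holds several valid queries, since it divides the total number of valid queries by the number of tuples.

```scala
// Hypothetical stand-ins for the custom key/data objects.
case class Key(id: Int)
case class Data(query: String)

val inputarr: Seq[(Key, Seq[Data])] = Seq(
  (Key(1), Seq(Data("find"), Data(""))),
  (Key(2), Seq(Data(""), Data(""))),
  (Key(3), Seq(Data("search"), Data("lookup")))
)

// Total number of valid (non-default) queries across all tuples,
// divided by the number of tuples. count(...) replaces the
// map-to-0/1-then-sum pattern and reads more directly.
val proportion: Double =
  inputarr.map { case (_, datas) => datas.count(_.query != "") }.sum.toDouble / inputarr.length
```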
Update:
val betterproportion = (inputarr count { case (key, listdatas) => (listdatas count (data => data.query != "")) > 0 }).toDouble / inputarr.length
Is this better code?
Update:
val betterproportion = inputs.filter { case (key, listdatas) => listdatas.count(data => data.query != "") > 0 }.count().toDouble / inputs.count()
I updated this code to use the original RDD instead of the subset array, to address the issue that there is no length function on RDDs.
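The updated version has different semantics from the first one: it measures the fraction of tuples containing at least one valid query, rather than the total valid-query count per tuple. A sketch of that version on plain collections (same hypothetical Key/Data stand-ins as above) is below; exists short-circuits on the first match, so it is cheaper than counting every match and comparing with > 0. On an RDD, count() returns Long, so the same .toDouble conversion is needed to avoid integer division there too.

```scala
// Hypothetical stand-ins for the custom key/data objects.
case class Key(id: Int)
case class Data(query: String)

val inputs: Seq[(Key, Seq[Data])] = Seq(
  (Key(1), Seq(Data("find"), Data(""))),
  (Key(2), Seq(Data(""), Data(""))),
  (Key(3), Seq(Data("search"), Data("lookup")))
)

// Fraction of tuples whose list contains at least one valid query.
// exists stops at the first non-default query instead of scanning
// the whole list the way count(...) > 0 does.
val betterProportion: Double =
  inputs.count { case (_, listdatas) => listdatas.exists(_.query != "") }.toDouble / inputs.length
```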