spark-scala: groupBy throws java.lang.ArrayIndexOutOfBoundsException (null)
I started with the following RDD, shown here as printed records:
((26468ede20e38,239394055),(7665710590658745,-414963169),0,f1,1420276980302)
((26468ede20e38,239394055),(8016905020647641,183812619),1,f4,1420347885727)
((26468ede20e38,239394055),(6633110906332136,294201185),1,f2,1420398323110)
((26468ede20e38,239394055),(6633110906332136,294201185),0,f1,1420451687525)
((26468ede20e38,239394055),(7722056727387069,1396896294),0,f1,1420537469065)
((26468ede20e38,239394055),(7722056727387069,1396896294),1,f1,1420623297340)
((26468ede20e38,239394055),(8045651092287275,-4814845),0,f1,1420720722185)
((26468ede20e38,239394055),(5170029699836178,-1332814297),0,f2,1420750531018)
((26468ede20e38,239394055),(7722056727387069,1396896294),0,f1,1420807545137)
((26468ede20e38,239394055),(4784119468604853,1287554938),0,f1,1421050087824)
To give a high-level description of the data: the first element of the main tuple (the first inner tuple) can be thought of as the user identification, the second inner tuple as the product identification, and the third element as the user's preference for that product. (For future reference I am going to refer to the above data set as val userData.)
Here is how I construct userData:
val userData = data.map(x => {
  val userId = x._1.replace("_", "")
  val programId = x._2
  val timeFeature = someMethod(x._5, x._4)
  val userHashTuple = (userId, userId.hashCode)
  val programHashTuple = (programId, programId.hashCode)
  (userHashTuple, programHashTuple, x._3, timeFeature, x._4)
})
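For context, data is an RDD of raw (userId, programId, preference, featureLabel, timestamp) tuples and someMethod derives a time feature from the timestamp and feature label. Here is a minimal, self-contained stand-in that can be pasted into spark-shell to reproduce the shape of the data (the body of someMethod and the sample values are placeholders, not my real implementation):

// Placeholder for the real time-feature computation.
def someMethod(timestamp: Long, feature: String): String = s"$feature@$timestamp"

// Raw records: (userId, programId, preference, featureLabel, timestamp)
val data = sc.parallelize(Seq(
  ("2646_8ede20e38", "7722056727387069", 0, "f1", 1420537469065L),
  ("2646_8ede20e38", "7722056727387069", 1, "f1", 1420623297340L)
))

val userData = data.map { x =>
  val userId = x._1.replace("_", "")
  val programId = x._2
  val timeFeature = someMethod(x._5, x._4)
  ((userId, userId.hashCode), (programId, programId.hashCode), x._3, timeFeature, x._4)
}
userData.collect().foreach(println)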
My goal: if a user has cast both a positive (1) and a negative (0) preference for the same product, I want to keep only the record with the positive preference. For example, out of:
((26468ede20e38,239394055),(7722056727387069,1396896294),0,f1,1420537469065)
((26468ede20e38,239394055),(7722056727387069,1396896294),1,f1,1420623297340)
I want to keep only:
((26468ede20e38,239394055),(7722056727387069,1396896294),1,f1,1420623297340)
Based on a previously provided answer I did the following:
val grpData = userData.groupBy(x => (x._1, x._2)).mapValues(_.maxBy(_._3)).values
But I am getting an error at the groupBy stage:

java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1]

and it keeps repeating the same error while incrementing the duplicate number. The exception does not point to my code, so I am having a tough time figuring out what is going on.
At the end I get the following:
java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Does this mean that a key is empty? If a key were empty, would the groupBy still be carried out on ""? I am a bit confused.
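To test whether null or empty ids could be the cause, the quick diagnostic I am considering looks like this (just a sketch over the userData RDD defined above):

// Count records whose user id or program id is null or empty (sketch only).
val badKeys = userData.filter { x =>
  x._1._1 == null || x._1._1.isEmpty || x._2._1 == null || x._2._1.isEmpty
}
println(s"records with empty/null ids: ${badKeys.count()}")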
After digging around I found the following: https://issues.apache.org/jira/browse/spark-6772 but I don't think that issue is the same as the one I've got.
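For what it's worth, I understand the same de-duplication could also be expressed with reduceByKey, which avoids materializing whole groups the way groupBy does. A sketch of what I would try instead, assuming userData as constructed above:

// Key by (user tuple, program tuple), keep the record with the highest preference,
// then drop the key again (equivalent in intent to the groupBy version).
val grpData = userData
  .map(x => ((x._1, x._2), x))
  .reduceByKey((a, b) => if (a._3 >= b._3) a else b)
  .values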