python - Exception trying to collect records from parquet file in pyspark
I don't understand why I can't read data from a Parquet file. I created the Parquet file from a JSON file and read it into a data frame:
df.printSchema()

root
 |-- param: struct (nullable = true)
 |    |-- form: string (nullable = true)
 |    |-- url: string (nullable = true)
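Roughly the steps that produced this (the file paths here are placeholders, not my real ones):

df = sqlContext.read.json("input.json")          # load the original JSON data
df.write.parquet("output.parquet")               # convert it to Parquet
df = sqlContext.read.parquet("output.parquet")   # read it back into a data frame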
When I try to read a record, I get this error:
df.select("param").first() 15/07/22 13:06:15 error executor: exception in task 0.0 in stage 8.0 (tid 4) java.lang.illegalargumentexception: problem reading type: type = group, name = param, original type = null @ parquet.schema.messagetypeparser.addgrouptype(messagetypeparser.java:132) @ parquet.schema.messagetypeparser.addtype(messagetypeparser.java:106) @ parquet.schema.messagetypeparser.addgrouptypefields(messagetypeparser.java:96) @ parquet.schema.messagetypeparser.parse(messagetypeparser.java:89) @ parquet.schema.messagetypeparser.parsemessagetype(messagetypeparser.java:79) @ parquet.hadoop.parquetrecordreader.initializeinternalreader(parquetrecordreader.java:189) @ parquet.hadoop.parquetrecordreader.initialize(parquetrecordreader.java:138) @ org.apache.spark.sql.sources.sqlnewhadooprdd$$anon$1.<init>(sqlnewhadooprdd.scala:153) @ org.apache.spark.sql.sources.sqlnewhadooprdd.compute(sqlnewhadooprdd.scala:124) @ org.apache.spark.sql.sources.sqlnewhadooprdd.compute(sqlnewhadooprdd.scala:66) @ org.apache.spark.rdd.rdd.computeorreadcheckpoint(rdd.scala:277) @ org.apache.spark.rdd.rdd.iterator(rdd.scala:244) @ org.apache.spark.rdd.mappartitionsrdd.compute(mappartitionsrdd.scala:35) @ org.apache.spark.rdd.rdd.computeorreadcheckpoint(rdd.scala:277) @ org.apache.spark.rdd.rdd.iterator(rdd.scala:244) @ org.apache.spark.rdd.mappartitionsrdd.compute(mappartitionsrdd.scala:35) @ org.apache.spark.rdd.rdd.computeorreadcheckpoint(rdd.scala:277) @ org.apache.spark.rdd.rdd.iterator(rdd.scala:244) @ org.apache.spark.scheduler.shufflemaptask.runtask(shufflemaptask.scala:70) @ org.apache.spark.scheduler.shufflemaptask.runtask(shufflemaptask.scala:41) @ org.apache.spark.scheduler.task.run(task.scala:70) @ org.apache.spark.executor.executor$taskrunner.run(executor.scala:213) @ java.util.concurrent.threadpoolexecutor.runworker(threadpoolexecutor.java:1142) @ java.util.concurrent.threadpoolexecutor$worker.run(threadpoolexecutor.java:617) @ java.lang.thread.run(thread.java:745) caused by: java.lang.illegalargumentexception: expected 1 of [required, optional, repeated] got utm_medium @ line 29: optional binary amp;utm_medium @ parquet.schema.messagetypeparser.asrepetition(messagetypeparser.java:203) @ parquet.schema.messagetypeparser.addtype(messagetypeparser.java:101) @ parquet.schema.messagetypeparser.addgrouptypefields(messagetypeparser.java:96) @ parquet.schema.messagetypeparser.addgrouptype(messagetypeparser.java:130) ... 24 more caused by: java.lang.illegalargumentexception: no enum constant parquet.schema.type.repetition.utm_medium @ java.lang.enum.valueof(enum.java:238) @ parquet.schema.type$repetition.valueof(type.java:70) @ parquet.schema.messagetypeparser.asrepetition(messagetypeparser.java:201) ... 
27 more 15/07/22 13:06:15 warn tasksetmanager: lost task 0.0 in stage 8.0 (tid 4, localhost): java.lang.illegalargumentexception: problem reading type: type = group, name = param, original type = null @ parquet.schema.messagetypeparser.addgrouptype(messagetypeparser.java:132) @ parquet.schema.messagetypeparser.addtype(messagetypeparser.java:106) @ parquet.schema.messagetypeparser.addgrouptypefields(messagetypeparser.java:96) @ parquet.schema.messagetypeparser.parse(messagetypeparser.java:89) @ parquet.schema.messagetypeparser.parsemessagetype(messagetypeparser.java:79) @ parquet.hadoop.parquetrecordreader.initializeinternalreader(parquetrecordreader.java:189) @ parquet.hadoop.parquetrecordreader.initialize(parquetrecordreader.java:138) @ org.apache.spark.sql.sources.sqlnewhadooprdd$$anon$1.<init>(sqlnewhadooprdd.scala:153) @ org.apache.spark.sql.sources.sqlnewhadooprdd.compute(sqlnewhadooprdd.scala:124) @ org.apache.spark.sql.sources.sqlnewhadooprdd.compute(sqlnewhadooprdd.scala:66) @ org.apache.spark.rdd.rdd.computeorreadcheckpoint(rdd.scala:277) @ org.apache.spark.rdd.rdd.iterator(rdd.scala:244) @ org.apache.spark.rdd.mappartitionsrdd.compute(mappartitionsrdd.scala:35) @ org.apache.spark.rdd.rdd.computeorreadcheckpoint(rdd.scala:277) @ org.apache.spark.rdd.rdd.iterator(rdd.scala:244) @ org.apache.spark.rdd.mappartitionsrdd.compute(mappartitionsrdd.scala:35) @ org.apache.spark.rdd.rdd.computeorreadcheckpoint(rdd.scala:277) @ org.apache.spark.rdd.rdd.iterator(rdd.scala:244) @ org.apache.spark.scheduler.shufflemaptask.runtask(shufflemaptask.scala:70) @ org.apache.spark.scheduler.shufflemaptask.runtask(shufflemaptask.scala:41) @ org.apache.spark.scheduler.task.run(task.scala:70) @ org.apache.spark.executor.executor$taskrunner.run(executor.scala:213) @ java.util.concurrent.threadpoolexecutor.runworker(threadpoolexecutor.java:1142) @ java.util.concurrent.threadpoolexecutor$worker.run(threadpoolexecutor.java:617) @ java.lang.thread.run(thread.java:745) caused by: java.lang.illegalargumentexception: expected 1 of [required, optional, repeated] got utm_medium @ line 29: optional binary amp;utm_medium @ parquet.schema.messagetypeparser.asrepetition(messagetypeparser.java:203) @ parquet.schema.messagetypeparser.addtype(messagetypeparser.java:101) @ parquet.schema.messagetypeparser.addgrouptypefields(messagetypeparser.java:96) @ parquet.schema.messagetypeparser.addgrouptype(messagetypeparser.java:130) ... 24 more caused by: java.lang.illegalargumentexception: no enum constant parquet.schema.type.repetition.utm_medium @ java.lang.enum.valueof(enum.java:238) @ parquet.schema.type$repetition.valueof(type.java:70) @ parquet.schema.messagetypeparser.asrepetition(messagetypeparser.java:201) ... 
15/07/22 13:06:15 ERROR TaskSetManager: Task 0 in stage 8.0 failed 1 times; aborting job
15/07/22 13:06:15 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
15/07/22 13:06:15 INFO TaskSchedulerImpl: Cancelling stage 8
15/07/22 13:06:15 INFO DAGScheduler: ShuffleMapStage 8 (first at <ipython-input-8-5cb9a7b45630>:1) failed in 0.083 s
15/07/22 13:06:15 INFO DAGScheduler: Job 4 failed: first at <ipython-input-8-5cb9a7b45630>:1, took 0.159103 s
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-8-5cb9a7b45630> in <module>()
----> 1 df.select("param").first()

/home/vagrant/spark/python/pyspark/sql/dataframe.pyc in first(self)
    676         Row(age=2, name=u'Alice')
    677         """
--> 678         return self.head()
    679
    680     @ignore_unicode_prefix

/home/vagrant/spark/python/pyspark/sql/dataframe.pyc in head(self, n)
    664         """
    665         if n is None:
--> 666             rs = self.head(1)
    667             return rs[0] if rs else None
    668         return self.take(n)

/home/vagrant/spark/python/pyspark/sql/dataframe.pyc in head(self, n)
    666             rs = self.head(1)
    667             return rs[0] if rs else None
--> 668         return self.take(n)
    669
    670     @ignore_unicode_prefix

/home/vagrant/spark/python/pyspark/sql/dataframe.pyc in take(self, num)
    338         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    339         """
--> 340         return self.limit(num).collect()
    341
    342     @ignore_unicode_prefix

/home/vagrant/spark/python/pyspark/sql/dataframe.pyc in collect(self)
    312         """
    313         with SCCallSiteSync(self._sc) as css:
--> 314             port = self._sc._jvm.PythonRDD.collectAndServe(self._jdf.javaToPython().rdd())
    315         rs = list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
    316         cls = _create_cls(self.schema)

/home/vagrant/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/home/vagrant/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 4, localhost): java.lang.IllegalArgumentException: problem reading type: type = group, name = param, original type = null
	[same stack trace and "Caused by" chain as above]
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Can anyone help me with this issue? What am I doing wrong?
Any chance you renamed some columns? I got the same error when I used a "sum(x)" column, even after renaming it. Changing the column name before saving the Parquet file solved it for me, cf. http://trustedanalytics.github.io/atk/versions/v0.4.1/errata.html.
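For example, here is a minimal sketch of that workaround in PySpark (the sanitize helper and file names are my own illustration, not from your code). The "amp;utm_medium" line in your stack trace suggests a field name containing characters Parquet's schema parser rejects. Note that withColumnRenamed only fixes top-level columns; nested struct fields like the ones under param would need to be rebuilt with select/alias or cleaned up in the source JSON before loading:

import re

def sanitize(name):
    # Replace anything outside [a-zA-Z0-9_], since characters like
    # '&', ';' or spaces break Parquet's schema parser on read.
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)

df = sqlContext.read.json("input.json")   # hypothetical input path
for old in df.columns:
    new = sanitize(old)
    if new != old:
        df = df.withColumnRenamed(old, new)
df.write.parquet("clean.parquet")         # hypothetical output path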