python - Exception trying to collect records from Parquet file in PySpark


I don't understand why I can't read data from a Parquet file. I made the Parquet file from a JSON file and read it into a DataFrame:

df.printSchema()

root
 |-- param: struct (nullable = true)
 |    |-- form: string (nullable = true)
 |    |-- url: string (nullable = true)
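For reference, a minimal sketch of the JSON-to-Parquet flow described above (the paths and the SparkContext/SQLContext setup are assumptions for illustration, not from the original post):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-repro")  # hypothetical app name
sqlContext = SQLContext(sc)

# Load the source JSON and write it back out as a Parquet file
# (paths are placeholders).
json_df = sqlContext.read.json("data/events.json")
json_df.write.parquet("data/events.parquet")

# Read the Parquet file into a DataFrame and inspect its schema.
df = sqlContext.read.parquet("data/events.parquet")
df.printSchema()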

When I try to read a record, I get this error:

df.select("param").first()  15/07/22 13:06:15 error executor: exception in task 0.0 in stage 8.0 (tid 4) java.lang.illegalargumentexception: problem reading type: type = group, name = param, original type = null         @ parquet.schema.messagetypeparser.addgrouptype(messagetypeparser.java:132)         @ parquet.schema.messagetypeparser.addtype(messagetypeparser.java:106)         @ parquet.schema.messagetypeparser.addgrouptypefields(messagetypeparser.java:96)         @ parquet.schema.messagetypeparser.parse(messagetypeparser.java:89)         @ parquet.schema.messagetypeparser.parsemessagetype(messagetypeparser.java:79)         @ parquet.hadoop.parquetrecordreader.initializeinternalreader(parquetrecordreader.java:189)         @ parquet.hadoop.parquetrecordreader.initialize(parquetrecordreader.java:138)         @ org.apache.spark.sql.sources.sqlnewhadooprdd$$anon$1.<init>(sqlnewhadooprdd.scala:153)         @ org.apache.spark.sql.sources.sqlnewhadooprdd.compute(sqlnewhadooprdd.scala:124)         @ org.apache.spark.sql.sources.sqlnewhadooprdd.compute(sqlnewhadooprdd.scala:66)         @ org.apache.spark.rdd.rdd.computeorreadcheckpoint(rdd.scala:277)         @ org.apache.spark.rdd.rdd.iterator(rdd.scala:244)         @ org.apache.spark.rdd.mappartitionsrdd.compute(mappartitionsrdd.scala:35)         @ org.apache.spark.rdd.rdd.computeorreadcheckpoint(rdd.scala:277)         @ org.apache.spark.rdd.rdd.iterator(rdd.scala:244)         @ org.apache.spark.rdd.mappartitionsrdd.compute(mappartitionsrdd.scala:35)         @ org.apache.spark.rdd.rdd.computeorreadcheckpoint(rdd.scala:277)         @ org.apache.spark.rdd.rdd.iterator(rdd.scala:244)         @ org.apache.spark.scheduler.shufflemaptask.runtask(shufflemaptask.scala:70)         @ org.apache.spark.scheduler.shufflemaptask.runtask(shufflemaptask.scala:41)         @ org.apache.spark.scheduler.task.run(task.scala:70)         @ org.apache.spark.executor.executor$taskrunner.run(executor.scala:213)         @ java.util.concurrent.threadpoolexecutor.runworker(threadpoolexecutor.java:1142)         @ java.util.concurrent.threadpoolexecutor$worker.run(threadpoolexecutor.java:617)         @ java.lang.thread.run(thread.java:745) caused by: java.lang.illegalargumentexception: expected 1 of [required, optional, repeated] got utm_medium @ line 29:     optional binary amp;utm_medium         @ parquet.schema.messagetypeparser.asrepetition(messagetypeparser.java:203)         @ parquet.schema.messagetypeparser.addtype(messagetypeparser.java:101)         @ parquet.schema.messagetypeparser.addgrouptypefields(messagetypeparser.java:96)         @ parquet.schema.messagetypeparser.addgrouptype(messagetypeparser.java:130)         ... 24 more caused by: java.lang.illegalargumentexception: no enum constant parquet.schema.type.repetition.utm_medium         @ java.lang.enum.valueof(enum.java:238)         @ parquet.schema.type$repetition.valueof(type.java:70)         @ parquet.schema.messagetypeparser.asrepetition(messagetypeparser.java:201)         ... 
27 more 15/07/22 13:06:15 warn tasksetmanager: lost task 0.0 in stage 8.0 (tid 4, localhost): java.lang.illegalargumentexception: problem reading type: type = group, name = param, original type = null         @ parquet.schema.messagetypeparser.addgrouptype(messagetypeparser.java:132)         @ parquet.schema.messagetypeparser.addtype(messagetypeparser.java:106)         @ parquet.schema.messagetypeparser.addgrouptypefields(messagetypeparser.java:96)         @ parquet.schema.messagetypeparser.parse(messagetypeparser.java:89)         @ parquet.schema.messagetypeparser.parsemessagetype(messagetypeparser.java:79)         @ parquet.hadoop.parquetrecordreader.initializeinternalreader(parquetrecordreader.java:189)         @ parquet.hadoop.parquetrecordreader.initialize(parquetrecordreader.java:138)         @ org.apache.spark.sql.sources.sqlnewhadooprdd$$anon$1.<init>(sqlnewhadooprdd.scala:153)         @ org.apache.spark.sql.sources.sqlnewhadooprdd.compute(sqlnewhadooprdd.scala:124)         @ org.apache.spark.sql.sources.sqlnewhadooprdd.compute(sqlnewhadooprdd.scala:66)         @ org.apache.spark.rdd.rdd.computeorreadcheckpoint(rdd.scala:277)         @ org.apache.spark.rdd.rdd.iterator(rdd.scala:244)         @ org.apache.spark.rdd.mappartitionsrdd.compute(mappartitionsrdd.scala:35)         @ org.apache.spark.rdd.rdd.computeorreadcheckpoint(rdd.scala:277)         @ org.apache.spark.rdd.rdd.iterator(rdd.scala:244)         @ org.apache.spark.rdd.mappartitionsrdd.compute(mappartitionsrdd.scala:35)         @ org.apache.spark.rdd.rdd.computeorreadcheckpoint(rdd.scala:277)         @ org.apache.spark.rdd.rdd.iterator(rdd.scala:244)         @ org.apache.spark.scheduler.shufflemaptask.runtask(shufflemaptask.scala:70)         @ org.apache.spark.scheduler.shufflemaptask.runtask(shufflemaptask.scala:41)         @ org.apache.spark.scheduler.task.run(task.scala:70)         @ org.apache.spark.executor.executor$taskrunner.run(executor.scala:213)         @ java.util.concurrent.threadpoolexecutor.runworker(threadpoolexecutor.java:1142)         @ java.util.concurrent.threadpoolexecutor$worker.run(threadpoolexecutor.java:617)         @ java.lang.thread.run(thread.java:745) caused by: java.lang.illegalargumentexception: expected 1 of [required, optional, repeated] got utm_medium @ line 29:     optional binary amp;utm_medium         @ parquet.schema.messagetypeparser.asrepetition(messagetypeparser.java:203)         @ parquet.schema.messagetypeparser.addtype(messagetypeparser.java:101)         @ parquet.schema.messagetypeparser.addgrouptypefields(messagetypeparser.java:96)         @ parquet.schema.messagetypeparser.addgrouptype(messagetypeparser.java:130)         ... 24 more caused by: java.lang.illegalargumentexception: no enum constant parquet.schema.type.repetition.utm_medium         @ java.lang.enum.valueof(enum.java:238)         @ parquet.schema.type$repetition.valueof(type.java:70)         @ parquet.schema.messagetypeparser.asrepetition(messagetypeparser.java:201)         ... 
15/07/22 13:06:15 ERROR TaskSetManager: Task 0 in stage 8.0 failed 1 times; aborting job
15/07/22 13:06:15 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
15/07/22 13:06:15 INFO TaskSchedulerImpl: Cancelling stage 8
15/07/22 13:06:15 INFO DAGScheduler: ShuffleMapStage 8 (first at <ipython-input-8-5cb9a7b45630>:1) failed in 0.083 s
15/07/22 13:06:15 INFO DAGScheduler: Job 4 failed: first at <ipython-input-8-5cb9a7b45630>:1, took 0.159103 s
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-8-5cb9a7b45630> in <module>()
----> 1 df.select("param").first()

/home/vagrant/spark/python/pyspark/sql/dataframe.pyc in first(self)
    676         Row(age=2, name=u'Alice')
    677         """
--> 678         return self.head()
    679
    680     @ignore_unicode_prefix

/home/vagrant/spark/python/pyspark/sql/dataframe.pyc in head(self, n)
    664         """
    665         if n is None:
--> 666             rs = self.head(1)
    667             return rs[0] if rs else None
    668         return self.take(n)

/home/vagrant/spark/python/pyspark/sql/dataframe.pyc in head(self, n)
    666             rs = self.head(1)
    667             return rs[0] if rs else None
--> 668         return self.take(n)
    669
    670     @ignore_unicode_prefix

/home/vagrant/spark/python/pyspark/sql/dataframe.pyc in take(self, num)
    338         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    339         """
--> 340         return self.limit(num).collect()
    341
    342     @ignore_unicode_prefix

/home/vagrant/spark/python/pyspark/sql/dataframe.pyc in collect(self)
    312         """
    313         with SCCallSiteSync(self._sc) as css:
--> 314             port = self._sc._jvm.PythonRDD.collectAndServe(self._jdf.javaToPython().rdd())
    315         rs = list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
    316         cls = _create_cls(self.schema)

/home/vagrant/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/home/vagrant/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 4, localhost): java.lang.IllegalArgumentException: problem reading type: type = group, name = param, original type = null
        (same stack trace and caused-by chain as above; the root cause is again
        "expected one of [REQUIRED, OPTIONAL, REPEATED] got utm_medium at line 29:     optional binary &utm_medium")

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Can anyone help me with this issue? What am I doing wrong?
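One way to pinpoint the culprit before digging further: walk the DataFrame schema and flag any field name containing characters Parquet's schema parser rejects. A diagnostic sketch (the find_bad_names helper is my own illustration, not from the post):

from pyspark.sql.types import StructType, ArrayType

# Recursively walk a Spark schema and collect the paths of fields whose
# names contain characters other than letters, digits, or underscores.
def find_bad_names(schema, prefix=""):
    bad = []
    for field in schema.fields:
        path = prefix + field.name
        if not all(ch.isalnum() or ch == "_" for ch in field.name):
            bad.append(path)
        dtype = field.dataType
        if isinstance(dtype, ArrayType):   # descend into array elements
            dtype = dtype.elementType
        if isinstance(dtype, StructType):  # descend into nested structs
            bad.extend(find_bad_names(dtype, path + "."))
    return bad

print(find_bad_names(df.schema))  # would flag a name like '&utm_medium'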

Any chance your columns were renamed? I got the same error with a "sum(x)" column even after renaming it; changing the column name before saving the Parquet file is what solved it for me, cf. http://trustedanalytics.github.io/atk/versions/v0.4.1/errata.html. In your case the root-cause line in the trace, "optional binary &utm_medium", points at a field whose name contains an "&", which Parquet's schema parser cannot parse, so rename that field before writing the file, as sketched below.
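A minimal sketch of that fix in PySpark (the sanitize helper, its regex, and the paths are my own illustration, not from the answer; note that a field nested inside a struct, like &utm_medium here, has to be renamed before the struct is written, e.g. by fixing the keys in the source JSON):

import re

# Map any character outside [0-9a-zA-Z_] to an underscore so the name
# is safe for Parquet's schema parser (regex is an assumption).
def sanitize(name):
    return re.sub(r"[^0-9a-zA-Z_]", "_", name)

# Rename all top-level columns before writing the Parquet file.
clean_df = json_df.select(
    [json_df[c].alias(sanitize(c)) for c in json_df.columns]
)
clean_df.write.parquet("data/events_clean.parquet")  # placeholder path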

