XML parsing in Python reads file incorrectly

- February 15, 2012

I am trying to parse an XML file online and obtain the data I need from that file. My code is displayed below:

import urllib2
from xml.dom.minidom import parse
import pandas as pd
import time

page = urllib2.urlopen('http://www.wrh.noaa.gov/mesowest/getobextxml.php?sid=kbfi&num=360')
page_content = page.read()
with open('kbfi.xml', 'w') as fid:
    fid.write(page_content)

data = []

xml = parse('kbfi.xml')
percp = 0
for station in xml.getElementsByTagName('station'):
    for ob in xml.getElementsByTagName('ob'):
        # convert time string to time_struct, ignoring last 4 chars ' PDT'
        ob_time = time.strptime(ob.getAttribute('time')[:-4], '%d %b %I:%M %p')
        for variable in xml.getElementsByTagName('variable'):
            if variable.getAttribute('var') == 'pcp1h':
                percp = True
                # unindent if you want all variables
                if variable.getAttribute('value') == 't':
                    data.append([ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 0])
                elif variable.getAttribute('value') >= 0:
                    data.append((ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 variable.getAttribute('value')))
        if not percp:
            # if pcp1h wasn't found, add 0
            data.append([ob_time.tm_mday,
                         ob_time.tm_hour,
                         ob_time.tm_min,
                         0])
print data

Unfortunately I cannot post an image of the XML file, but a version of it will be saved to the current directory if the script is run.

I would like the code to check for the existence of the 'variable' pcp1h and print its 'value' if it exists (only one entry per 'ob'). If it doesn't exist, or provides a value of 't', the code should print '0' for that particular hour. Currently the output (the script provided can be run to see the output) contains incorrect values, and there are 6 entries per hour instead of one. What is wrong with my code?

The main issue in your code is that, in each for loop, you are getting all the elements using:

xml.getElementsByTagName('ob')

This starts the search for ob elements from xml, which in your case is the root of the document. The same happens with xml.getElementsByTagName('variable'): it also starts the search at the root, so each time you get every element with the tag variable in the whole file. That is why you are getting 6 entries per hour instead of 1 (since there are 6 of them in the complete XML).

You should instead be getting the variable elements using:

ob.getElementsByTagName('variable')

and the ob elements using:

station.getElementsByTagName('ob')

That way you only search inside the particular element you are iterating over, not the complete XML document.
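The difference between the two searches can be seen on a small made-up document. This is only a sketch: the XML string below is a hypothetical stand-in that assumes the same station/ob/variable nesting, not the real content of the NOAA feed.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data, assumed to mirror the station/ob/variable nesting.
SAMPLE = """<station id="kbfi">
  <ob time="15 Feb 1:53 pm PDT">
    <variable var="temp" value="45"/>
    <variable var="pcp1h" value="0.02"/>
  </ob>
  <ob time="15 Feb 2:53 pm PDT">
    <variable var="temp" value="44"/>
  </ob>
</station>"""

doc = parseString(SAMPLE)
first_ob = doc.getElementsByTagName('ob')[0]

# Searching from the document finds the variables of every ob.
all_variables = doc.getElementsByTagName('variable')
# Searching from one ob finds only the variables nested inside it.
own_variables = first_ob.getElementsByTagName('variable')

print(len(all_variables))  # 3
print(len(own_variables))  # 2
```

With the document-wide search, every pass through the ob loop walks all three variable elements, which is the same effect that produces the repeated entries in the question.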

Also, as a side issue, you are doing:

elif variable.getAttribute('value') >= 0:

If I am not wrong, getAttribute() returns a string, so this check is always true, irrespective of what the actual value is. In the XML, I see that value holds strings as well as numbers, so I am not sure this is the condition you want (though this is not the main issue; the main issue is the one described above).
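One way to make that condition meaningful is to convert the attribute to a number before comparing. In Python 2 a string compared with an integer always orders as greater, which is why the check never fails; in Python 3 the same comparison raises a TypeError. The helper below is a hypothetical sketch: the name to_precip, and the rule that 't', negative and unparsable values all count as 0, are assumptions based on the question rather than part of the original code.

```python
def to_precip(raw):
    """Convert a 'value' attribute string to a non-negative float.

    Assumed rule: 't' (in either case), negative numbers and anything
    that does not parse as a number are all reported as 0.0.
    """
    if raw.strip().lower() == 't':
        return 0.0
    try:
        amount = float(raw)
    except ValueError:
        return 0.0
    return amount if amount >= 0 else 0.0


print(to_precip('0.05'))
print(to_precip('t'))
print(to_precip(''))
```

Whether a negative or unparsable value should really count as 0 depends on what the feed means by it, so that part may need adjusting.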

Example code changes:

import urllib2
from xml.dom.minidom import parse
import pandas as pd
import time

page = urllib2.urlopen('http://www.wrh.noaa.gov/mesowest/getobextxml.php?sid=kbfi&num=360')
page_content = page.read()
with open('kbfi.xml', 'w') as fid:
    fid.write(page_content.decode())

data = []

xml = parse('kbfi.xml')
percp = 0
for station in xml.getElementsByTagName('station'):
    # search for ob elements only inside the current station
    for ob in station.getElementsByTagName('ob'):
        # convert time string to time_struct, ignoring last 4 chars ' PDT'
        ob_time = time.strptime(ob.getAttribute('time')[:-4], '%d %b %I:%M %p')
        # search for variable elements only inside the current ob
        for variable in ob.getElementsByTagName('variable'):
            if variable.getAttribute('var') == 'pcp1h':
                percp = True
                # unindent if you want all variables
                if variable.getAttribute('value') == 't':
                    data.append([ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 0])
                elif variable.getAttribute('value') >= 0:
                    data.append((ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 variable.getAttribute('value')))
        if not percp:
            # if pcp1h wasn't found, add 0
            data.append([ob_time.tm_mday,
                         ob_time.tm_hour,
                         ob_time.tm_min,
                         0])
print data
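One further thing that may be worth checking: percp is set once before the loops and never reset, so after the first ob that contains pcp1h the "not found" branch will not run again for later observations. The sketch below shows one possible way to get exactly one row per ob, using a per-observation default instead of a flag. It runs on a small hypothetical XML string rather than the downloaded file. The sample data and the function name hourly_precip are assumptions, and time parsing is left out for brevity, so each row keeps the time text as-is.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data, assumed to mirror the station/ob/variable nesting.
SAMPLE = """<station id="kbfi">
  <ob time="15 Feb 1:53 pm PDT">
    <variable var="temp" value="45"/>
    <variable var="pcp1h" value="0.02"/>
  </ob>
  <ob time="15 Feb 2:53 pm PDT">
    <variable var="pcp1h" value="t"/>
  </ob>
  <ob time="15 Feb 3:53 pm PDT">
    <variable var="temp" value="44"/>
  </ob>
</station>"""


def hourly_precip(doc):
    """Return one (time_text, amount) row per ob element."""
    rows = []
    for station in doc.getElementsByTagName('station'):
        for ob in station.getElementsByTagName('ob'):
            stamp = ob.getAttribute('time')[:-4]  # drop the trailing ' PDT'
            amount = 0.0  # fresh default for every ob: missing or 't' means 0
            for variable in ob.getElementsByTagName('variable'):
                if variable.getAttribute('var') == 'pcp1h':
                    raw = variable.getAttribute('value')
                    if raw != 't':
                        # assumes any other value is numeric text
                        amount = float(raw)
                    break
            rows.append((stamp, amount))
    return rows


rows = hourly_precip(parseString(SAMPLE))
for row in rows:
    print(row)
```

Because amount is reassigned at the top of every ob, no separate flag is needed, and each observation contributes a single row whether or not pcp1h is present.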
