XML parsing in Python reads file incorrectly
I am trying to parse an XML file online and obtain the data I need from the file. My code is displayed below:
import urllib2
from xml.dom.minidom import parse
import pandas as pd
import time

page = urllib2.urlopen('http://www.wrh.noaa.gov/mesowest/getobextxml.php?sid=kbfi&num=360')
page_content = page.read()
with open('kbfi.xml', 'w') as fid:
    fid.write(page_content)

data = []
xml = parse('kbfi.xml')
percp = 0
for station in xml.getElementsByTagName('station'):
    for ob in xml.getElementsByTagName('ob'):
        # Convert time string to time_struct, ignoring the last 4 chars ' PDT'
        ob_time = time.strptime(ob.getAttribute('time')[:-4], '%d %b %I:%M %p')
        for variable in xml.getElementsByTagName('variable'):
            if variable.getAttribute('var') == 'PCP1H':
                percp = True
                # Unindent if you want all variables
                if variable.getAttribute('value') == 'T':
                    data.append([ob_time.tm_mday, ob_time.tm_hour, ob_time.tm_min, 0])
                elif variable.getAttribute('value') >= 0:
                    data.append((ob_time.tm_mday, ob_time.tm_hour, ob_time.tm_min, variable.getAttribute('value')))
        if not percp:
            # If PCP1H wasn't found, add it as 0
            data.append([ob_time.tm_mday, ob_time.tm_hour, ob_time.tm_min, 0])
print data
Unfortunately I cannot post an image of the XML file, but a version of it will be saved to your current directory if the script is run.
I would like the code to check for the existence of the 'variable' PCP1H and print its 'value' if it exists (only 1 entry per 'ob'). If it doesn't exist or provides a value of 'T', it should print '0' for that particular hour. Currently the output (the script provided can be run to see the output) contains incorrect values, and there are 6 entries per hour instead of one. What is wrong with my code?
The main issue in your code is that in each for loop you are getting the elements using -
xml.getElementsByTagName('ob')
This starts the search at the xml element, which in your case is the root element. The same happens with xml.getElementsByTagName('variable'): it starts the search at the root element, so each time you get all the elements with the tag variable. That is why you are getting 6 entries per hour instead of 1 (since there are 6 of them in the complete XML).
You should instead be using -
ob.getElementsByTagName('variable')
and to get the ob elements -
station.getElementsByTagName('ob')
so that the search only looks inside the particular element you are iterating on (not the complete XML document).
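To make the difference concrete, here is a minimal sketch in Python 3 that compares a search from the document with a search from a single ob element. The SAMPLE string is a made-up stand-in, not the real feed, and its layout (station, ob, variable) is only assumed from the tag names used in the question.

```python
from xml.dom.minidom import parseString

# Hypothetical sample; the real feed's layout and values may differ.
SAMPLE = """<station id="KBFI">
  <ob time="01 Jul 10:53 am PDT">
    <variable var="T" value="65"/>
    <variable var="PCP1H" value="0.02"/>
  </ob>
  <ob time="01 Jul 11:53 am PDT">
    <variable var="T" value="67"/>
  </ob>
</station>"""

doc = parseString(SAMPLE)
first_ob = doc.getElementsByTagName('ob')[0]

# Searching from the document returns every <variable> in the file.
all_variables = doc.getElementsByTagName('variable')

# Searching from one <ob> returns only that observation's <variable> elements.
ob_variables = first_ob.getElementsByTagName('variable')

print(len(all_variables), len(ob_variables))
```

With this sample, the document-wide search finds three variable elements while the search scoped to the first ob finds two, which is the same effect that multiplies the entries in the question's output.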
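Also, a side issue: you are doing -
elif variable.getAttribute('value') >= 0:
If I am not wrong, getAttribute() returns a string, so this check would be true irrespective of what the actual value is (in Python 2 a string always compares greater than an integer). In your XML, I see that value holds strings as well as numbers, so I am not sure what you want from this condition (though this is not the main issue; the main issue is the one described above).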
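One way to make that comparison meaningful is to convert the attribute text to a number before using it. The helper below is only a sketch in Python 3: the name parse_precip is hypothetical, and treating 'T' and any unparsable text as 0.0 is an assumption about what the question wants, not something the feed defines.

```python
def parse_precip(raw):
    """Convert an attribute string to a float.

    Assumption: 'T' (trace) and text that is not a number both count as 0.0.
    """
    if raw.strip().upper() == 'T':
        return 0.0
    try:
        return float(raw)
    except ValueError:
        # Empty or non-numeric text falls back to 0.0 (assumed behaviour).
        return 0.0


print(parse_precip('0.02'), parse_precip('T'), parse_precip(''))
```

After the conversion, a check such as parse_precip(value) >= 0 compares two numbers instead of a string and an integer.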
Example code with the changes -
import urllib2
from xml.dom.minidom import parse
import pandas as pd
import time

page = urllib2.urlopen('http://www.wrh.noaa.gov/mesowest/getobextxml.php?sid=kbfi&num=360')
page_content = page.read()
with open('kbfi.xml', 'w') as fid:
    fid.write(page_content.decode())

data = []
xml = parse('kbfi.xml')
percp = 0
for station in xml.getElementsByTagName('station'):
    # Search only inside the current station
    for ob in station.getElementsByTagName('ob'):
        # Convert time string to time_struct, ignoring the last 4 chars ' PDT'
        ob_time = time.strptime(ob.getAttribute('time')[:-4], '%d %b %I:%M %p')
        # Search only inside the current ob
        for variable in ob.getElementsByTagName('variable'):
            if variable.getAttribute('var') == 'PCP1H':
                percp = True
                # Unindent if you want all variables
                if variable.getAttribute('value') == 'T':
                    data.append([ob_time.tm_mday, ob_time.tm_hour, ob_time.tm_min, 0])
                elif variable.getAttribute('value') >= 0:
                    data.append((ob_time.tm_mday, ob_time.tm_hour, ob_time.tm_min, variable.getAttribute('value')))
        if not percp:
            # If PCP1H wasn't found, add it as 0
            data.append([ob_time.tm_mday, ob_time.tm_hour, ob_time.tm_min, 0])
print data
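For readers on Python 3, the same idea can be sketched as a small function that yields exactly one row per ob. This is an illustration under assumptions, not the original code: hourly_precip and SAMPLE are hypothetical, the time format is inferred from the strptime pattern in the question, and the default of 0.0 is set per observation so that one ob with PCP1H cannot affect the next.

```python
import time
from xml.dom.minidom import parseString

# Hypothetical sample; the real feed's layout and values may differ.
SAMPLE = """<station id="KBFI">
  <ob time="01 Jul 10:53 am PDT">
    <variable var="T" value="65"/>
    <variable var="PCP1H" value="0.02"/>
  </ob>
  <ob time="01 Jul 11:53 am PDT">
    <variable var="T" value="67"/>
  </ob>
</station>"""


def hourly_precip(xml_text):
    """Return one (day, hour, minute, precipitation) tuple per <ob>."""
    rows = []
    doc = parseString(xml_text)
    for station in doc.getElementsByTagName('station'):
        for ob in station.getElementsByTagName('ob'):
            # Drop the trailing 4 characters (' PDT') before parsing the time.
            ob_time = time.strptime(ob.getAttribute('time')[:-4], '%d %b %I:%M %p')
            value = 0.0  # default when PCP1H is missing or reported as 'T'
            for variable in ob.getElementsByTagName('variable'):
                if variable.getAttribute('var') == 'PCP1H':
                    raw = variable.getAttribute('value')
                    if raw != 'T':
                        value = float(raw)
                    break  # at most one PCP1H entry is expected per ob
            rows.append((ob_time.tm_mday, ob_time.tm_hour, ob_time.tm_min, value))
    return rows


for row in hourly_precip(SAMPLE):
    print(row)
```

Because the default is assigned inside the ob loop, an observation without PCP1H still gets a row with 0.0 even when an earlier observation had a value.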