Python - Parsing: How do I strip out Unicode characters?
I wrote code to grab the text in between the break elements on the webpage http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&id=10478
I think I am on the right track, but right now I am getting bad values. Below are the results: [u'2133 craigs store road', u'afton,\r\n\t\tva \xa0\r\n\t\t22920', u'contact person:', u'email address:', u'website:', u'phone: 434-882-3150', u'']
I need to figure out how to strip the unicode out of the result values. Can anyone help?
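import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&id=10478')
soup = BeautifulSoup(r.content, 'lxml')
tbl = soup.findAll('table')[2]      # third table on the page
contact = tbl.findAll('p')[0]       # first paragraph inside that table
list = []
for br in contact.findAll('br'):
    next = br.nextSibling           # the text node that follows each <br>
    text = next.strip()
    list.append(text)
print list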
You can use the built-in replace method that the str type has.
text = next.strip().replace("\n", "").replace("\t", "").replace("\r", "")
That way you can replace \n, \t, and \r, i.e. replace them with nothing.
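One caveat with chaining replace calls: applied to the second value from the question, it gives u'afton,va \xa022920', so the words run together and the non-breaking space (\xa0) is still there. The u prefix itself is only how Python 2 displays unicode strings and is not part of the data. A minimal sketch of an alternative, assuming the goal is to collapse every run of whitespace into a single space, is below. The helper name clean_text is hypothetical, and the sample list is copied from the question instead of being fetched from the page.

```python
# Sample values copied from the question's output (not fetched live).
raw_values = [
    u'2133 craigs store road',
    u'afton,\r\n\t\tva \xa0\r\n\t\t22920',
    u'contact person:',
    u'email address:',
    u'website:',
    u'phone: 434-882-3150',
    u'',
]


def clean_text(value):
    # Hypothetical helper: split() with no arguments splits on any run of
    # Unicode whitespace (\r, \n, \t, spaces and \xa0) and discards empty
    # pieces, so joining with one space normalizes the whole string.
    return u' '.join(value.split())


# Clean each value, then drop the ones that end up empty.
cleaned = [clean_text(v) for v in raw_values]
cleaned = [v for v in cleaned if v]

print(cleaned)
```

With this approach the second value becomes 'afton, va 22920', and the trailing empty string is dropped. If the empty entries are meaningful for your use case, skip the final filtering step.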