XML parsing - In Perl, extract text from related nodes, using XML::Twig
I have the following XML file that I want to parse:
    <?xml version="1.0" encoding="utf-8"?>
    <topic id="yerus5" xmlns:ditaarch="http://dita.oasis-open.org/architecture/2005/">
      <title/>
      <shortdesc/>
      <body>
        <p><b>ccu_cnt_addr: (address=0x004 reset=32'h1)</b><table id="table_r5b_1xj_ts">
            <tgroup cols="4">
              <colspec colnum="1" colname="col1"/>
              <colspec colnum="2" colname="col2"/>
              <colspec colnum="3" colname="col3"/>
              <colspec colnum="4" colname="col4"/>
              <tbody>
                <row>
                  <entry>field</entry>
                  <entry>offset</entry>
                  <entry>r/w access</entry>
                  <entry>description</entry>
                </row>
                <row>
                  <entry>reg2sm_cnt</entry>
                  <entry>15:0</entry>
                  <entry>r/w</entry>
                  <entry>count value increment in extenral memory @ specified location. default value of 1. count value of 0 clear counter value</entry>
                </row>
                <row>
                  <entry>ccu2bus_endianess</entry>
                  <entry>24</entry>
                  <entry>r/w</entry>
                  <entry>endianess of data structure bit</entry>
                </row>
              </tbody>
            </tgroup>
          </table><b>ccu_stat_addr: (address=0x008 reset=32'h0)</b><table id="table_mcc_1xj_ts">
            <tgroup cols="4">
              <colspec colnum="1" colname="col1"/>
              <colspec colnum="2" colname="col2"/>
              <colspec colnum="3" colname="col3"/>
              <colspec colnum="4" colname="col4"/>
              <tbody>
                <row>
                  <entry>field</entry>
                  <entry>offset</entry>
                  <entry>r/w access</entry>
                  <entry>description</entry>
                </row>
                <row>
                  <entry>fifo_cnt</entry>
                  <entry>1:0</entry>
                  <entry>r</entry>
                  <entry>status. 0x0 indicates engine free. 0x1 on write address</entry>
                </row>
                <row>
                  <entry>rfifo_cnt</entry>
                  <entry>3:2</entry>
                  <entry>r</entry>
                  <entry>status. 0x0 indicates there no pending read values ccu engine.</entry>
                </row>
              </tbody>
            </tgroup>
          </table></p>
      </body>
    </topic>
After running the following code (available at "In Perl, XML::Simple not able to dereference multi dimensional associative array parsed with Data::Dumper"):
    use strict;
    use warnings;
    use XML::Twig;
    use Data::Dumper;

    my @headers;
    my $column_to_show = 'field';

    sub process_row {
        my %entries;
        my ( $twig, $row ) = @_;
        my @row_entries = map { $_->text } $row->children;
        if (@headers) {
            @entries{@headers} = @row_entries;
            print $column_to_show, " => ", $entries{$column_to_show}, "\n";
        }
        else {
            @headers = @row_entries;
        }
    }

    my $twig = XML::Twig->new(
        'pretty_print' => 'indented_a',
        twig_handlers  => { 'row' => \&process_row }
    )->parsefile('your_file.xml');
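I am able to access the data of each <entry></entry>. However, I am not able to extract the details that belong to each <b></b> text in particular. Yes, I am able to extract the <b></b> text itself, but I am not able to extract the <row></row> elements for each <b></b> separately. The following is the sample output I want: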
    name: ccu_cnt_addr: (address=0x004 reset=32'h1)
    field: reg2sm_cnt
    offset: 15:0
    access: r/w
    description: count value increment in extenral memory @ specified location. default value of 1. count value of 0 clear counter value
    field: ccu2bus_endianess
    offset: 24
    access: r/w
    description: endianess of data structure bit
    . . . . . . .
    name: ccu_stat_addr: (address=0x008 reset=32'h0)
    field: fifo_cnt
    . . . . . . .
I tried the following, but it is not working:
    foreach my $b ( $twig->get_xpath("//b") ) {    # extract text of <b></b>
        print $b->text, "\n";
        foreach my $row ( $twig->get_xpath("//row") ) {
            print $row->text, "\n";
        }
    }
OK, given your example: it's irritating, because the XML doesn't explicitly associate the 'heading' with the 'table' (e.g. by encapsulating them within one XML node).
However, you can use the prev_sibling method to get the previous element at the same level.
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new->parsefile('your_file.xml');

    foreach my $table ( $twig->get_xpath('//table') ) {
        # the element immediately before the table is assumed to be its heading
        my $header = $table->prev_sibling->text;
        print "name: $header\n";
        my @headers;
        foreach my $row ( $table->get_xpath("tgroup/tbody/row") ) {
            my %entries;
            # strip each linefeed and the whitespace that follows it
            my @row_entries = map { $_->text =~ s/\n\s+//rg; } $row->children;
            if (@headers) {
                @entries{@headers} = @row_entries;
                foreach my $field (@headers) {
                    print "$field: $entries{$field}\n";
                }
            }
            else {
                # the first row of each table holds the column names
                @headers = @row_entries;
            }
        }
        print "----\n";
    }
Note: this assumes that the 'element before the table' is the header. It works in your specific case, but it will only work if there is always an element directly preceding <table> that you want to display.
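One way to loosen that assumption is to pass a condition to prev_sibling, so that anything other than a <b> element is skipped, and to check for a missing heading. A minimal sketch, assuming a trimmed-down inline document: the $xml string, the <note> element and the heading_for helper are all made up for illustration and are not part of the original file or of XML::Twig.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical, trimmed-down input: a <note> sits between the first
# heading and its table, so a bare prev_sibling would return the <note>.
my $xml = <<'XML';
<body><p><b>first_reg</b><note>ignore me</note><table id="t1"/><b>second_reg</b><table id="t2"/></p></body>
XML

# heading_for is an illustrative helper, not an XML::Twig method.
# prev_sibling('b') walks backwards over the siblings and returns the
# nearest one named <b>, or undef when there is none.
sub heading_for {
    my ($table) = @_;
    my $heading = $table->prev_sibling('b');
    return defined $heading ? $heading->text : '(no heading)';
}

my $twig = XML::Twig->new->parse($xml);
foreach my $table ( $twig->get_xpath('//table') ) {
    print $table->att('id'), ': ', heading_for($table), "\n";
}
```

With this input, t1 is paired with first_reg and t2 with second_reg. The trade-off is that a table with no heading of its own would pick up the <b> of an earlier table, so this only helps when every table really does have a <b> somewhere before it.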
- We run a 'foreach' loop, picking out the elements called table (of which there are 2 in your sample).
- For each table, we assume the previous sibling element is the header. In your case, that's the <b> elements. I would be wary of that though, as <b> denotes bold in HTML and is a formatting tag.
- Otherwise we do the same thing as before: for each table, decompose the rows such that we have a header and a bunch of columns, and print them one per line.
- As part of doing this, we use a regex to remove 'linefeed and whitespace' (s/\n\s+//gr) because the formatting on the description looked a bit 'off'. You can remove it if it's undesired. (Note: this only works on newer Perl versions, 5.14+ IIRC.)
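The two idioms the loop relies on, the hash slice and the non-destructive s///r substitution, can be tried in isolation without any XML. A small sketch with hard-coded sample values (the clean_text helper and the data are illustrative only); it needs Perl 5.14 or later because of the /r flag.

```perl
use strict;
use warnings;

# clean_text is an illustrative helper: with /r the substitution returns
# the modified copy and leaves the original string untouched.
sub clean_text {
    my ($text) = @_;
    return $text =~ s/\n\s+//gr;
}

my @headers     = ( 'field', 'offset', 'r/w access', 'description' );
my @row_entries = map { clean_text($_) }
    ( 'fifo_cnt', '1:0', 'r', "status.\n    0x0 indicates engine free." );

# Hash slice: assign every value to its header key in one statement.
my %entries;
@entries{@headers} = @row_entries;

print "$_: $entries{$_}\n" for @headers;
```

Because the pattern deletes the linefeed together with the indentation that follows it, two sentences split across lines end up joined with no space between them. Replacing the match with a single space (s/\n\s+/ /gr) would be one way to avoid that, if it matters for your output.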
This produces:
    name: ccu_cnt_addr: (address=0x004 reset=32'h1)
    field: reg2sm_cnt
    offset: 15:0
    r/w access: r/w
    description: count value increment in extenral memory @ specified location.default value of 1. count value of 0 clear counter value
    field: ccu2bus_endianess
    offset: 24
    r/w access: r/w
    description: endianess of data structure bit
    ----
    name: ccu_stat_addr: (address=0x008 reset=32'h0)
    field: fifo_cnt
    offset: 1:0
    r/w access: r
    description: status. 0x0 indicates engine free. 0x1 on write toaddress
    field: rfifo_cnt
    offset: 3:2
    r/w access: r
    description: status. 0x0 indicates there no pending read values ccu engine.
    ----