How to configure the Solr DataImportHandler to parse a Wikipedia XML document?


This is what I have done so far.

I have added a request handler in solrconfig.xml as follows:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">wiki-data-config.xml</str>
    </lst>
</requestHandler>

In the same configuration directory I have created a file, wiki-data-config.xml, which contains the following:

<dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
        <entity name="page"
                pk="id"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/home/tanny/downloads/data/wiki/enwiki-20150702-stub-articles8.xml"
                flatten="true" >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userid"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
        </entity>
    </document>
</dataConfig>

And schema.xml contains the following:

<!-- tanny edit starts -->
<field name="id"        type="int"     indexed="true" stored="true" required="true"/>
<field name="title"     type="string"  indexed="true" stored="false"/>
<field name="revision"  type="int"     indexed="true" stored="true"/>
<field name="user"      type="string"  indexed="true" stored="true"/>
<field name="userid"    type="int"     indexed="true" stored="true"/>
<field name="text"      type="text_en" indexed="true" stored="false"/>
<field name="timestamp" type="date"    indexed="true" stored="true"/>
<field name="titletext" type="text_en" indexed="true" stored="true"/>
<uniqueKey>id</uniqueKey>
<copyField source="title" dest="titletext"/>
<!-- tanny edit ends -->

Now, after restarting Solr, I try to post the Wikimedia XML data using the ./bin/post script in the following way:

tanny@localhost:~/binaries/solr-5.2.1$ ./bin/post -c core-base-wiki /home/tanny/downloads/data/wiki/enwiki-20150702-stub-articles8.xml 

It prints the following in the console:

/usr/lib/jvm/java-7-oracle-cloudera//bin/java -classpath /home/tanny/binaries/solr-5.2.1/dist/solr-core-5.2.1.jar -Dauto=yes -Dc=core-base-wiki -Ddata=files org.apache.solr.util.SimplePostTool /home/tanny/downloads/data/wiki/enwiki-20150702-stub-articles8.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/core-base-wiki/update...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file enwiki-20150702-stub-articles8.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/core-base-wiki/update...
Time spent: 0:00:00.863

However, when I go to the admin UI and check the core overview, it says 0 documents are indexed. I am at a loss to understand what configuration I am missing. Any help or guidance is highly appreciated.

P.S.: The dataset enwiki-20150702-stub-articles8.xml was downloaded from the Wikimedia page. A few sample lines of the document are as follows:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/main_page</base>
    <generator>mediawiki 1.26wmf11</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">media</namespace>
      <namespace key="829" case="first-letter">module talk</namespace>
      ...
      ...
      <namespace key="2600" case="first-letter">topic</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>700 (number)</title>
    <ns>0</ns>
    <id>465001</id>
    <revision>
      <id>663854862</id>
      <parentid>655386821</parentid>
      <timestamp>2015-05-24T21:01:24Z</timestamp>
      <contributor>
        <username>cnwilliams</username>
        <id>10190671</id>
      </contributor>
      <comment>disambiguated: [[tintin]] → [[the adventures of tintin]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="669059875" bytes="12464" />
      <sha1>q15fslnvlsrgbeo8f6mcyrg00l2d2a5</sha1>
    </revision>
  </page>
  <page>
    <title>canadian federal election, 1957</title>
    <ns>0</ns>
    <id>465004</id>
    <revision>
      <id>666418811</id>
      <parentid>666417048</parentid>
      <timestamp>2015-06-11T01:38:05Z</timestamp>
      <contributor>
        <username>wehwalt</username>
        <id>458237</id>
      </contributor>
      <comment>/* impact */ clarify</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="671713242" bytes="77788" />
      <sha1>05g14m9sfavo7buuirpr8lx4c6vfwee</sha1>
    </revision>
  </page>
  ...
  ...
  <page>
    <title>professional players tournament (snooker)</title>
    <ns>0</ns>
    <id>665001</id>
    <redirect title="world open (snooker)" />
    <revision>
      <id>359952698</id>
      <parentid>25566787</parentid>
      <timestamp>2010-05-03T23:48:34Z</timestamp>
      <contributor>
        <username>xqbot</username>
        <id>8066546</id>
      </contributor>
      <minor/>
      <comment>robot: fixing double redirect [[world open (snooker)]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="360810125" bytes="34" />
      <sha1>lxtjwcda9vk58fphj8ie2logjm607mv</sha1>
    </revision>
  </page>
</mediawiki>
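To make the field mapping concrete, here is a rough Python sketch of what the xpath entries in wiki-data-config.xml are meant to pull out of each page element. It is only an illustration using the standard library, not how the DataImportHandler works internally. The names SAMPLE, FIELD_PATHS and extract_pages are made up for this example, and the sample is a trimmed copy of the first page above. Note that ElementTree needs the namespace declared on the root element spelled out in every path.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URI copied from the root element of the dump.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.10/"}

# Trimmed-down sample in the shape of the dump (hypothetical test input).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>700 (number)</title>
    <id>465001</id>
    <revision>
      <id>663854862</id>
      <timestamp>2015-05-24T21:01:24Z</timestamp>
      <contributor>
        <username>cnwilliams</username>
        <id>10190671</id>
      </contributor>
      <text id="669059875" bytes="12464" />
    </revision>
  </page>
</mediawiki>"""

# Column name -> path relative to <page>, mirroring the xpath attributes.
FIELD_PATHS = {
    "id": "mw:id",
    "title": "mw:title",
    "revision": "mw:revision/mw:id",
    "user": "mw:revision/mw:contributor/mw:username",
    "userid": "mw:revision/mw:contributor/mw:id",
    "text": "mw:revision/mw:text",
    "timestamp": "mw:revision/mw:timestamp",
}

# Same pattern as the $skipDoc rule in the config.
REDIRECT = re.compile(r"^#REDIRECT .*")


def extract_pages(xml_text):
    """Return one dict per <page>, skipping pages whose text is a redirect."""
    root = ET.fromstring(xml_text)
    docs = []
    for page in root.findall("mw:page", NS):
        doc = {}
        for column, path in FIELD_PATHS.items():
            # findtext gives "" for an element with no text, None if it is absent.
            doc[column] = page.findtext(path, default=None, namespaces=NS)
        if doc["text"] and REDIRECT.match(doc["text"]):
            continue
        docs.append(doc)
    return docs
```

In the stub dump the text elements are empty, so the redirect check never fires on this data and the text column comes back as an empty string.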

The data got indexed after I tried to ingest it using the command: "curl http://localhost:8983/solr/core-base-wiki/dataimport?command=full-import".

Somehow ./bin/post is not able to do the same. I didn't research this any further; if anyone else has figured out how to do it, please share your findings.
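The same request can be scripted. Below is a small Python sketch, assuming Solr is running on localhost:8983 as in the command above. The helper names dataimport_url and run_command are made up for this example; command and wt are ordinary request parameters.

```python
import json
import urllib.parse
import urllib.request

SOLR_BASE = "http://localhost:8983/solr"  # host and port as used in the post


def dataimport_url(core, command, **params):
    """Build the /dataimport URL for a core; extra params are appended as-is."""
    query = {"command": command, "wt": "json"}
    query.update(params)
    return "{}/{}/dataimport?{}".format(
        SOLR_BASE, core, urllib.parse.urlencode(query)
    )


def run_command(core, command, **params):
    """Send the command and return the parsed JSON reply (needs a running Solr)."""
    with urllib.request.urlopen(dataimport_url(core, command, **params)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling run_command("core-base-wiki", "full-import") should start the import, and run_command("core-base-wiki", "status") can then be polled. As far as I know the reply carries a status field that reads busy while the import is running and idle afterwards, but check that against your Solr version.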
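One plausible explanation, which I have not verified on this setup: the console output shows that bin/post sends the file to the /update handler, and for XML that handler expects Solr's own add/doc message format rather than a raw MediaWiki dump. The data import configuration is only consulted when /dataimport is called. If you want to go through bin/post anyway, the pages would have to be converted first. A minimal Python sketch of that conversion follows; to_solr_add_xml is a made-up helper name and the input is a list of plain field-name to value dicts.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Build a Solr <add> update message from a list of field -> value dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            if value is None or value == "":
                continue  # leave out fields that have no value
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(add, encoding="unicode")
```

The resulting string can be written to a file and posted to the core with ./bin/post as before. For a full-size dump it would be better to stream and write in batches than to build one big document in memory.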

