How to configure the Solr DataImportHandler to parse a Wikipedia XML document?
Here is what I have done so far.

I have added a request handler in solrconfig.xml as follows:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">wiki-data-config.xml</str>
  </lst>
</requestHandler>
In the same configuration directory I have created the file wiki-data-config.xml, which contains the following:
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="page"
            pk="id"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="/home/tanny/downloads/data/wiki/enwiki-20150702-stub-articles8.xml"
            flatten="true">
      <field column="id"        xpath="/mediawiki/page/id" />
      <field column="title"     xpath="/mediawiki/page/title" />
      <field column="revision"  xpath="/mediawiki/page/revision/id" />
      <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
      <field column="userid"    xpath="/mediawiki/page/revision/contributor/id" />
      <field column="text"      xpath="/mediawiki/page/revision/text" />
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
      <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text" />
    </entity>
  </document>
</dataConfig>
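As a sanity check of the field-to-XPath mapping, the same paths can be evaluated against one sample <page> element outside of Solr. This is purely illustrative: Python's ElementTree stands in for Solr's streaming XPathEntityProcessor, and the sample below omits the dump's default namespace for brevity (ElementTree would otherwise require namespace-qualified paths).

```python
import xml.etree.ElementTree as ET

# One <page> from the dump, trimmed and with the default namespace omitted.
sample = """<mediawiki>
  <page>
    <title>700 (number)</title>
    <ns>0</ns>
    <id>465001</id>
    <revision>
      <id>663854862</id>
      <timestamp>2015-05-24T21:01:24Z</timestamp>
      <contributor>
        <username>Cnwilliams</username>
        <id>10190671</id>
      </contributor>
    </revision>
  </page>
</mediawiki>"""

root = ET.fromstring(sample)
page = root.find("page")

# Mirror the data-config field mappings (paths here are relative to <page>).
doc = {
    "id":        page.findtext("id"),
    "title":     page.findtext("title"),
    "revision":  page.findtext("revision/id"),
    "user":      page.findtext("revision/contributor/username"),
    "userid":    page.findtext("revision/contributor/id"),
    "timestamp": page.findtext("revision/timestamp"),
}
print(doc)
```

If a field comes back empty here, the corresponding xpath in wiki-data-config.xml is the first place to look.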
And my schema.xml contains the following:
<!-- tanny edit starts -->
<field name="id" type="int" indexed="true" stored="true" required="true"/>
<field name="title" type="string" indexed="true" stored="false"/>
<field name="revision" type="int" indexed="true" stored="true"/>
<field name="user" type="string" indexed="true" stored="true"/>
<field name="userid" type="int" indexed="true" stored="true"/>
<field name="text" type="text_en" indexed="true" stored="false"/>
<field name="timestamp" type="date" indexed="true" stored="true"/>
<field name="titletext" type="text_en" indexed="true" stored="true"/>

<uniqueKey>id</uniqueKey>
<copyField source="title" dest="titletext"/>
<!-- tanny edit ends -->
Now, after restarting Solr, I try to post the Wikimedia XML data using the ./bin/post script in the following way:
tanny@localhost:~/binaries/solr-5.2.1$ ./bin/post -c core-base-wiki /home/tanny/downloads/data/wiki/enwiki-20150702-stub-articles8.xml
and it prints the following in the console:
/usr/lib/jvm/java-7-oracle-cloudera//bin/java -classpath /home/tanny/binaries/solr-5.2.1/dist/solr-core-5.2.1.jar -Dauto=yes -Dc=core-base-wiki -Ddata=files org.apache.solr.util.SimplePostTool /home/tanny/downloads/data/wiki/enwiki-20150702-stub-articles8.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/core-base-wiki/update...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file enwiki-20150702-stub-articles8.xml (application/xml)
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/core-base-wiki/update...
Time spent: 0:00:00.863
However, when I go to the UI and check the Overview, it says 0 documents are indexed. I am at a loss to understand what I am missing in the configuration. Any help/guidance is highly appreciated.
P.S.: The dataset enwiki-20150702-stub-articles8.xml was downloaded from the Wikimedia dumps page. A few sample lines from the document are as follows:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.26wmf11</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="829" case="first-letter">Module talk</namespace>
      ...
      ...
      <namespace key="2600" case="first-letter">Topic</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>700 (number)</title>
    <ns>0</ns>
    <id>465001</id>
    <revision>
      <id>663854862</id>
      <parentid>655386821</parentid>
      <timestamp>2015-05-24T21:01:24Z</timestamp>
      <contributor>
        <username>Cnwilliams</username>
        <id>10190671</id>
      </contributor>
      <comment>Disambiguated: [[Tintin]] → [[The Adventures of Tintin]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="669059875" bytes="12464" />
      <sha1>q15fslnvlsrgbeo8f6mcyrg00l2d2a5</sha1>
    </revision>
  </page>
  <page>
    <title>Canadian federal election, 1957</title>
    <ns>0</ns>
    <id>465004</id>
    <revision>
      <id>666418811</id>
      <parentid>666417048</parentid>
      <timestamp>2015-06-11T01:38:05Z</timestamp>
      <contributor>
        <username>Wehwalt</username>
        <id>458237</id>
      </contributor>
      <comment>/* Impact */ clarify</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="671713242" bytes="77788" />
      <sha1>05g14m9sfavo7buuirpr8lx4c6vfwee</sha1>
    </revision>
  </page>
  ...
  ...
  <page>
    <title>Professional Players Tournament (snooker)</title>
    <ns>0</ns>
    <id>665001</id>
    <redirect title="World Open (snooker)" />
    <revision>
      <id>359952698</id>
      <parentid>25566787</parentid>
      <timestamp>2010-05-03T23:48:34Z</timestamp>
      <contributor>
        <username>Xqbot</username>
        <id>8066546</id>
      </contributor>
      <minor/>
      <comment>Robot: Fixing double redirect to [[World Open (snooker)]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="360810125" bytes="34" />
      <sha1>lxtjwcda9vk58fphj8ie2logjm607mv</sha1>
    </revision>
  </page>
</mediawiki>
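Note the last sample page above is a redirect, which the $skipDoc rule in wiki-data-config.xml is meant to drop (redirect wikitext starts with "#REDIRECT"). A quick check of that pattern, using Python's re only as a stand-in for Solr's RegexTransformer (and note that DIH applies regex/replaceWith fields only when the entity declares a transformer, typically transformer="RegexTransformer"):

```python
import re

# The $skipDoc rule from the data-config: flag pages whose text is a redirect.
skip_redirect = re.compile(r"^#REDIRECT .*")

# Redirect wikitext matches, so the document would be skipped.
print(bool(skip_redirect.match("#REDIRECT [[World Open (snooker)]]")))  # True

# Ordinary article text does not match, so the document is kept.
print(bool(skip_redirect.match("700 is the natural number following 699.")))  # False
```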
Update: The data got indexed after I tried ingesting it with the command:

curl "http://localhost:8983/solr/core-base-wiki/dataimport?command=full-import"

Somehow ./bin/post was not able to do the same. I haven't researched this further; if anyone has figured out how, please share your findings.
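For completeness, the DataImportHandler can also be triggered and polled from code rather than curl. A minimal sketch, assuming the /dataimport handler and the core name core-base-wiki used above (the URL-building helper is hypothetical, not part of Solr):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_CORE = "http://localhost:8983/solr/core-base-wiki"

def dih_url(command, **params):
    """Build a DataImportHandler request URL for a command
    such as full-import, delta-import, or status."""
    query = urlencode({"command": command, "wt": "json", **params})
    return f"{SOLR_CORE}/dataimport?{query}"

def run_full_import(clean=True, commit=True):
    """Kick off a full import. clean=true wipes the index first;
    commit=true commits once the import finishes."""
    url = dih_url("full-import",
                  clean=str(clean).lower(),
                  commit=str(commit).lower())
    with urlopen(url) as resp:  # requires a running Solr instance
        return resp.read()

# Example of the URLs this produces (no server needed to build them):
print(dih_url("full-import", clean="true", commit="true"))
print(dih_url("status"))  # poll this to watch import progress
```

The status command is worth polling, since full-import returns immediately and runs asynchronously; the 0-documents symptom above is exactly what an import that was never triggered looks like.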