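1. Goal
Build a simple search engine with Nutch, MongoDB, and ElasticSearch: the Nutch crawler fetches web pages and stores them in MongoDB, mongo-connector syncs them to ElasticSearch, which builds the index, and content is retrieved from ElasticSearch through its RESTful API.
2. Environment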
CentOS7 Linux x86_64
JDK 1.8.0_161
apache-ant-1.9.4
apache-nutch-2.3.1
mongodb 2.6.12-6
elasticsearch 6.2.2
mongo-connector 6.2.2
3. Install the Oracle JDK
3.1 Remove OpenJDK
yum list installed | grep java
yum remove java-1.8.0-openjdk-headless
yum remove javapackages-tools
yum remove python-javapackages
yum remove tzdata-java
…
3.2 Download, install, and configure Oracle JDK 1.8.0_161
Download address: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
rpm -ivh jdk-8u161-linux-x64.rpm
/etc/profile:
export JAVA_HOME=/usr/java/jdk1.8.0_161/
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
source /etc/profile
4. Install and configure MongoDB
yum install mongodb-server
yum install mongodb
/etc/mongodb/shard1.conf:
port=47017
replSet=rs1
fork=true
dbpath=/data/mongodb/shard1
logpath=/data/mongodb/logs/shard1.log
shardsvr=true
directoryperdb=true
/etc/mongodb/shard2.conf:
port=47018
replSet=rs1
fork=true
dbpath=/data/mongodb/shard2
logpath=/data/mongodb/logs/shard2.log
shardsvr=true
directoryperdb=true
/etc/mongodb/shard3.conf:
port=47019
replSet=rs1
fork=true
dbpath=/data/mongodb/shard3
logpath=/data/mongodb/logs/shard3.log
shardsvr=true
directoryperdb=true
/etc/mongodb/config1.conf:
port=37017
fork=true
dbpath=/data/mongodb/config1
logpath=/data/mongodb/logs/config1.log
configsvr=true
directoryperdb=true
/etc/mongodb/config2.conf:
port=37018
fork=true
dbpath=/data/mongodb/config2
logpath=/data/mongodb/logs/config2.log
configsvr=true
directoryperdb=true
/etc/mongodb/config3.conf:
port=37019
fork=true
dbpath=/data/mongodb/config3
logpath=/data/mongodb/logs/config3.log
configsvr=true
directoryperdb=true
/etc/mongodb/router1.conf:
port = 27017
fork = true
logpath = /data/mongodb/logs/router1.log
configdb = vminger:37017,vminger:37018,vminger:37019
maxConns = 1000000
logappend = true
/etc/mongodb/router2.conf:
port = 27018
fork = true
logpath = /data/mongodb/logs/router2.log
configdb = vminger:37017,vminger:37018,vminger:37019
maxConns = 1000000
logappend = true
/etc/mongodb/router3.conf:
port = 27019
fork = true
logpath = /data/mongodb/logs/router3.log
configdb = vminger:37017,vminger:37018,vminger:37019
maxConns = 1000000
logappend = true
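The nine files above differ only in role, index, and port, so they can also be generated from one template. The following is a minimal Python sketch of that idea, not part of the original setup. The function names (render_conf, build_configs, write_configs) and the idea of writing to a scratch directory first are assumptions made for illustration; the keys, ports, and paths are copied from the listings above.

```python
from pathlib import Path


def render_conf(settings, sep="="):
    """Render a dict as one 'key<sep>value' line per entry, in insertion order."""
    return "\n".join(f"{key}{sep}{value}" for key, value in settings.items()) + "\n"


def build_configs(data_root="/data/mongodb", host="vminger"):
    """Return {file name: file content} for shard1-3, config1-3, and router1-3."""
    # All three routers point at the same three config servers.
    configdb = ",".join(f"{host}:{37017 + i}" for i in range(3))
    configs = {}
    for i in range(3):
        n = i + 1
        configs[f"shard{n}.conf"] = render_conf({
            "port": 47017 + i,
            "replSet": "rs1",
            "fork": "true",
            "dbpath": f"{data_root}/shard{n}",
            "logpath": f"{data_root}/logs/shard{n}.log",
            "shardsvr": "true",
            "directoryperdb": "true",
        })
        configs[f"config{n}.conf"] = render_conf({
            "port": 37017 + i,
            "fork": "true",
            "dbpath": f"{data_root}/config{n}",
            "logpath": f"{data_root}/logs/config{n}.log",
            "configsvr": "true",
            "directoryperdb": "true",
        })
        configs[f"router{n}.conf"] = render_conf({
            "port": 27017 + i,
            "fork": "true",
            "logpath": f"{data_root}/logs/router{n}.log",
            "configdb": configdb,
            "maxConns": 1000000,
            "logappend": "true",
        }, sep=" = ")
    return configs


def write_configs(target_dir, configs):
    """Write each generated file into target_dir (created if missing)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name, content in configs.items():
        (target / name).write_text(content)


if __name__ == "__main__":
    # Print one file for inspection; nothing is written unless write_configs is called.
    print(build_configs()["shard1.conf"])
```

Calling write_configs("./mongodb-conf", build_configs()) would place the files in a scratch directory, from which they can be reviewed and copied to /etc/mongodb.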
Start shard1-3, config1-3, and router1-3:
mongod -f /etc/mongodb/shard1.conf
mongod -f /etc/mongodb/shard2.conf
mongod -f /etc/mongodb/shard3.conf
mongod -f /etc/mongodb/config1.conf
mongod -f /etc/mongodb/config2.conf
mongod -f /etc/mongodb/config3.conf
mongos -f /etc/mongodb/router1.conf
mongos -f /etc/mongodb/router2.conf
mongos -f /etc/mongodb/router3.conf
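Because every process is started with fork=true, a failed start is easy to miss. A quick way to confirm that all nine processes are listening is to try a TCP connection to each port. The sketch below is an illustration rather than part of the original procedure; the helper names are hypothetical, and the host and port list are taken from the configuration files above.

```python
import socket

# Ports from the shard, config server, and router configuration files.
EXPECTED_PORTS = [47017, 47018, 47019, 37017, 37018, 37019, 27017, 27018, 27019]


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def closed_ports(host, ports, timeout=1.0):
    """Return the subset of ports that do not accept connections."""
    return [port for port in ports if not is_port_open(host, port, timeout)]


if __name__ == "__main__":
    missing = closed_ports("vminger", EXPECTED_PORTS)
    print("not listening:", missing if missing else "none")
```

An open port only shows that the process is up; it does not show that sharding or the replica set has been configured.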
Enable sharding:
mongo --port 27017
> use admin
> db.runCommand({addshard: "vminger:47017", allowLocal: true})
> db.runCommand({addshard: "vminger:47018", allowLocal: true})
> db.runCommand({addshard: "vminger:47019", allowLocal: true})
Enable the replica set (mongo-connector needs it to sync data, because it reads the oplog, which exists only on replica set members):
mongo --port 47017
> config = {_id: "rs1", members: [{_id: 0, host: "vminger:47017"}]}
> rs.initiate(config)
> rs.add("vminger:47018")
> rs.add("vminger:47019")
Create the database and user:
mongo --port 27017
> use test1
> db.createUser({user: "root", pwd: "root", roles: [{role: "dbOwner", db: "test1"}]})
5. Install and configure ElasticSearch
Download elasticsearch-6.2.2.tar.gz from https://www.elastic.co/downloads and extract it.
Edit config/elasticsearch.yml:
cluster.name: vminger
node.name: node-1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
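Install the Chinese word-segmentation plugin elasticsearch-analysis-ik-6.2.2.zip: https://github.com/medcl/elasticsearch-analysis-ik
Extract it into the elasticsearch-6.2.2/plugins directory.
Enter the elasticsearch-6.2.2 directory and run ./bin/elasticsearch -d (note: start it as a non-root user).
Create the test1 index with the ik Chinese analyzer enabled. (Tip: when mongo-connector creates the index during sync, it does not specify ik, so the default standard analyzer is used, and ElasticSearch does not allow changing it afterwards. One workaround is to create the index in advance with ik configured.)
PUT http://192.168.132.33:9200/test1
{
"settings" : {
"index" : {
"analysis.analyzer.default.type": "ik_max_word"
}
}
}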
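The same index-creation request can be sent from a script. The sketch below uses only the Python standard library and is an illustration under stated assumptions: the function names are hypothetical, and the address 192.168.132.33:9200 and the index name test1 are the ones used in this article. It sends PUT, the method the ElasticSearch 6.x create-index API accepts, and sets the Content-Type header, which 6.x requires for requests with a body.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, analyzer="ik_max_word"):
    """Build (but do not send) the request that creates `index` with a default analyzer."""
    body = {"settings": {"index": {"analysis.analyzer.default.type": analyzer}}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_index(base_url, index):
    """Send the request; urllib raises HTTPError if the server replies with an error status."""
    request = build_create_index_request(base_url, index)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the request; call create_index(...) against a running node to send it.
    req = build_create_index_request("http://192.168.132.33:9200", "test1")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

The index has to exist before mongo-connector starts syncing; otherwise the index is created automatically with the standard analyzer.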
6. Install and configure mongo-connector
pip install mongo-connector
pip install elastic2-doc-manager
mongo-connector -m vminger:27017 -t vminger:9200 -d elastic2_doc_manager
7. Install and configure Nutch
Download apache-ant-1.9.4-bin.tar.gz and extract it. Download address:
/etc/profile:
export ANT_HOME=/home/vminger/workspace/sysapp/ant/apache-ant-1.9.4
export PATH=$PATH:$ANT_HOME/bin
source /etc/profile
Download apache-nutch-2.3.1-src.tar.gz and extract it. Download address:
http://nutch.apache.org/downloads.html
/etc/profile:
export NUTCH_HOME=/home/vminger/workspace/sysapp/nutch/apache-nutch-2.3.1/runtime/local
export PATH=$PATH:$NUTCH_HOME/bin
source /etc/profile
conf/nutch-site.xml:
<configuration>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.mongodb.store.MongoStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>Hist Crawler</value>
</property>
</configuration>
ivy/ivy.xml:
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
conf/gora.properties:
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=vminger:27017
gora.mongodb.db=test1
gora.mongodb.login=root
gora.mongodb.secret=root
Build Nutch: ant runtime
Set the URL filter rule for crawling:
conf/regex-urlfilter.txt:
+^http://([a-z0-9]*\.)*sina\.com\.cn/
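To see which URLs a rule of this shape lets through, the filter logic can be approximated in a few lines of Python. This is a rough emulation, not Nutch code: it assumes the usual regex-urlfilter behaviour (rules are tried top to bottom, the first pattern found in the URL decides, and a URL that matches no rule is rejected), it uses Python's re module rather than Java's regex engine, and it writes the rule with the dots escaped so that they match literally. The function names are hypothetical.

```python
import re

# One rule per line: '+' accepts, '-' rejects; '#' starts a comment line.
RULES_TEXT = r"""
+^http://([a-z0-9]*\.)*sina\.com\.cn/
"""


def parse_rules(text):
    """Parse rule lines into (accept, compiled pattern) pairs, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url, else False."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in ["http://news.sina.com.cn/china/",
                "https://news.sina.com.cn/",
                "http://www.example.com/"]:
        print(url, accepts(rules, url))
```

With this rule only plain http URLs are accepted, and host labels containing a hyphen are rejected because the character class is [a-z0-9].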
Set the seed URLs:
runtime/local/urls/seed.ini
Enter the runtime/local directory and start crawling with crawl ID id1 and a depth of 3 (three crawl rounds):
./bin/crawl urls/ id1 3
Query the content through the RESTful API:
POST
{
"query" : { "match" : { "text" : "中国" }},
"highlight" : {
"pre_tags" : ["<tag1>", "<tag2>"],
"post_tags" : ["</tag1>", "</tag2>"],
"fields" : {
"text" : {}
}
}
}
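The query above can also be issued from a script. The following standard-library Python sketch builds the same request body and pulls the highlighted fragments out of a response. It rests on several assumptions: the function names are hypothetical, the search URL is not given in the original (http://192.168.132.33:9200/test1/_search is assumed from the index created earlier), and extract_highlights relies on the usual hits.hits[].highlight layout of an ElasticSearch search response.

```python
import json
import urllib.request


def build_search_body(text, field="text"):
    """Build a match query on `field` with highlighting, mirroring the JSON above."""
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "pre_tags": ["<tag1>", "<tag2>"],
            "post_tags": ["</tag1>", "</tag2>"],
            "fields": {field: {}},
        },
    }


def extract_highlights(response, field="text"):
    """Collect the highlighted fragments for `field` from a parsed search response."""
    fragments = []
    for hit in response.get("hits", {}).get("hits", []):
        fragments.extend(hit.get("highlight", {}).get(field, []))
    return fragments


def search(search_url, text):
    """POST the query to search_url (assumed to end in /_search) and return parsed JSON."""
    request = urllib.request.Request(
        url=search_url,
        data=json.dumps(build_search_body(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints the request body only; call search(...) against a running node to query.
    print(json.dumps(build_search_body("中国"), ensure_ascii=False, indent=2))
```

For example, extract_highlights(search("http://192.168.132.33:9200/test1/_search", "中国")) would return the matching text fragments with the search term wrapped in the configured tags.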
Query result: