Building your own search engine with Nutch on Hadoop - Samuel

OS: Linux (Ubuntu Server 9.10)

This builds on the two-node cluster set up in the previous post.

1. On the local machine, append the following to the end of /etc/bash.bashrc

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/nutch
export HADOOP_CONF_DIR=/opt/nutch/conf
export HADOOP_SLAVES=$HADOOP_CONF_DIR/slaves
export HADOOP_LOG_DIR=/tmp/hadoop/logs
export HADOOP_PID_DIR=/tmp/hadoop/pid
export NUTCH_HOME=/opt/nutch
export NUTCH_CONF_DIR=/opt/nutch/conf
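
Before moving on, it can help to confirm that a freshly opened shell actually sees these variables. The following is a minimal sketch and not part of the original setup: `missing_vars` is a hypothetical helper that prints the names of any step 1 variables that are unset or empty.

```shell
# missing_vars: print the step 1 variables that are unset or empty,
# separated by spaces. Prints an empty line when everything is set.
missing_vars() {
  missing=""
  for name in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR HADOOP_SLAVES \
              HADOOP_LOG_DIR HADOOP_PID_DIR NUTCH_HOME NUTCH_CONF_DIR; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  # Strip the leading space left by the concatenation above.
  echo "${missing# }"
}

missing_vars
```

An empty line of output means all eight variables are visible; otherwise the listed names still need to be exported.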

2. Download the nutch-1.0.tar.gz package into /opt, extract it, and rename the directory

mv nutch-1.0 nutch

3. Copy everything under /opt/hadoop into /opt/nutch

cp -rf /opt/hadoop/* /opt/nutch

4. Copy the jar files under /opt/nutch/ into /opt/nutch/lib

cd /opt/nutch

cp -rf *.jar lib/

5. Edit the configuration file hadoop-env.sh under /opt/nutch/conf and add the environment values

Load these environment values (run from /opt/nutch/conf)

source ./hadoop-env.sh

6. Edit the configuration file nutch-site.xml under /opt/nutch/conf

<configuration>
<property>
  <name>http.agent.name</name>
  <value>nutch</value>
  <description>HTTP 'User-Agent' request header. </description>
</property>
<property>
  <name>http.agent.description</name>
  <value>MyTest</value>
  <description>Further description</description>
</property>
<property>
  <name>http.agent.url</name>
  <value>localhost</value>
  <description>A URL to advertise in the User-Agent header. </description>
</property>
<property>
  <name>http.agent.email</name>
  <value>test@test.org.tw</value>
  <description>An email address
  </description>
</property>
<property>
  <name>plugin.folders</name>
  <value>/opt/nutch/plugins</value>
  <description>Directories where nutch plugins are located. </description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-regex|parse-(text|html|js|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|swf|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description> Regular expression naming plugin directory names</description>
</property>
<property>
  <name>parse.plugin.file</name>
  <value>parse-plugins.xml</value>
  <description>The name of the file that defines the associations between
  content-types and parsers.</description>
</property>
<property>
   <name>db.max.outlinks.per.page</name>
   <value>-1</value>
   <description> </description>
</property>
<property>
   <name>http.content.limit</name>
   <value>-1</value>
</property>
<property>
  <name>indexer.mergeFactor</name>
  <value>500</value>
  <description>The factor that determines the frequency of Lucene segment
  merges. This must not be less than 2, higher values increase indexing
  speed but lead to increased RAM usage, and increase the number of
  open file handles (which may lead to "Too many open files" errors).
  NOTE: the "segments" here have nothing to do with Nutch segments, they
  are a low-level data unit used by Lucene.
  </description>
</property>
<property>
  <name>indexer.minMergeDocs</name>
  <value>500</value>
  <description>This number determines the minimum number of Lucene
  Documents buffered in memory between Lucene segment merges. Larger
  values increase indexing speed and increase RAM usage.
  </description>
</property>
</configuration>

7. Edit the configuration file crawl-urlfilter.txt under /opt/nutch/conf and change the following

-^(ftp|mailto):
-[*!@]
# accept anything else (add this line)
+.*
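
To see what these three rules do, here is a rough emulation in shell. It assumes the behaviour of Nutch's regex URL filter: rules are tried in file order, the first pattern found anywhere in the URL decides (`+` accepts, `-` rejects), and a URL matching no rule is rejected. `filter_url` is a name made up for this sketch, and `grep -E` only approximates Java regular expressions.

```shell
# filter_url URL: print "accept" or "reject" according to the step 7 rules.
filter_url() {
  url=$1
  while IFS= read -r rule; do
    # The first character is the sign, the rest is the regular expression.
    pattern=${rule#?}
    case $rule in
      +*) verdict=accept ;;
      -*) verdict=reject ;;
      *)  continue ;;          # skip blank lines and comments
    esac
    # The first rule whose pattern is found in the URL decides.
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      echo "$verdict"
      return 0
    fi
  done <<'EOF'
-^(ftp|mailto):
-[*!@]
+.*
EOF
  # No rule matched: treat the URL as rejected.
  echo reject
}

filter_url "http://tw.yahoo.com"
filter_url "mailto:test@test.org.tw"
```

Under these assumptions the first call prints `accept` and the second prints `reject`. A URL with a query string such as `http://example.com/search?q=1` is also accepted, because `?` and `=` are not in the second rule's character class.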

8. Copy the whole /opt/nutch/ directory to the other machine

scp -r /opt/nutch host2-IP:/opt/

9. On the local machine, start Hadoop from /opt/nutch

bin/hadoop namenode -format

bin/start-all.sh

10. Create a urls directory and seed file under /opt/nutch

cd /opt/nutch

mkdir urls

echo "http://tw.yahoo.com" >> ./urls/urls.txt

This creates urls.txt, which lists the sites the engine should crawl.
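
Because `>>` appends, rerunning the echo above adds the same URL again. As a small sketch (the helper name `add_seed` is hypothetical, not something Nutch provides), the seed file can be built so that each URL appears only once:

```shell
# add_seed FILE URL: append URL to FILE unless that exact line is already there.
add_seed() {
  seed_file=$1
  url=$2
  mkdir -p "$(dirname "$seed_file")"
  touch "$seed_file"
  # -x matches whole lines only, -F treats the URL as a fixed string, not a regex.
  if ! grep -qxF -- "$url" "$seed_file"; then
    printf '%s\n' "$url" >> "$seed_file"
  fi
}

# Example (run from /opt/nutch):
#   add_seed urls/urls.txt "http://tw.yahoo.com"
```

Calling it twice with the same URL leaves a single line in the file.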

11. Upload it to the DFS file system (run the dfs command from the nutch directory)

bin/hadoop dfs -put urls urls

12. Start crawling

bin/nutch crawl urls -dir search -threads 2 -depth 3 -topN 100000
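
In this command, `urls` is the seed directory uploaded in step 11, `-dir` names the output directory on DFS, `-threads` sets the number of fetcher threads, `-depth` is the number of crawl rounds followed out from the seeds, and `-topN` caps how many pages are fetched in each round. The sketch below only composes the command as a dry run and estimates the upper bound on fetched pages; `crawl_command` and `max_pages` are hypothetical helper names, not part of Nutch.

```shell
# crawl_command URL_DIR OUT_DIR THREADS DEPTH TOPN: print the crawl command
# without running it, so the parameters can be reviewed first.
crawl_command() {
  echo "bin/nutch crawl $1 -dir $2 -threads $3 -depth $4 -topN $5"
}

# max_pages DEPTH TOPN: at most TOPN pages are fetched in each of DEPTH rounds.
max_pages() {
  echo $(( $1 * $2 ))
}

crawl_command urls search 2 3 100000
max_pages 3 100000
```

With the values used here, the crawl fetches at most 300000 pages in total, and usually far fewer when starting from a single seed.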

13. Serve the site with Tomcat

Download apache-tomcat-6.0.18.tar.gz to /opt

Extract it and rename the directory

tar -xzvf apache-tomcat-6.0.18.tar.gz

mv apache-tomcat-6.0.18 tomcat

Edit /opt/tomcat/conf/server.xml so that the site uses Unicode and listens on port 8080
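
The post does not show the exact change. A common way to do this, assumed here, is to set `URIEncoding="UTF-8"` on the HTTP `<Connector>` element that already has `port="8080"`. The sketch below only checks that both attributes are present somewhere in the file; `check_connector` is a hypothetical helper and does not parse XML.

```shell
# check_connector FILE: succeed if FILE mentions both port="8080" and
# URIEncoding="UTF-8". This is a plain text search, not an XML parse.
check_connector() {
  grep -q 'port="8080"' "$1" && grep -q 'URIEncoding="UTF-8"' "$1"
}

# Example:
#   check_connector /opt/tomcat/conf/server.xml && echo "connector looks OK"
```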

Download the search directory that Hadoop computed and stored on DFS

cd /opt/nutch

bin/hadoop dfs -get search /opt/search

Make Nutch the root web application of Tomcat

mv /opt/tomcat/webapps/ROOT /opt/tomcat/webapps/ROOT-backup

cd /opt/nutch

mkdir web

cd web

jar -xvf ../nutch-1.0.war

cd /opt/nutch

mv /opt/nutch/web /opt/tomcat/webapps/ROOT

Edit /opt/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml so that searcher.dir points to /opt/search

<configuration>
   <property>
   <name>searcher.dir</name>
   <value>/opt/search</value>
   </property>
</configuration>
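
If the search page later returns no results, one thing worth checking is whether the directory named in searcher.dir has the expected layout. The sketch below assumes the crawl in step 12 produced the usual crawldb, linkdb, segments, and index subdirectories; `check_search_dir` is a hypothetical helper name.

```shell
# check_search_dir DIR: report any expected crawl subdirectory missing from DIR.
# Returns 0 when all are present, 1 otherwise.
check_search_dir() {
  dir=$1
  rc=0
  for sub in crawldb linkdb segments index; do
    if [ ! -d "$dir/$sub" ]; then
      echo "missing: $dir/$sub"
      rc=1
    fi
  done
  return $rc
}

# Example:
#   check_search_dir /opt/search && echo "search directory looks complete"
```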

Start Tomcat

/opt/tomcat/bin/startup.sh

Then browse to http://local-machine-IP:8080

Alternatively, nutchez can be used for quick deployment

32-bit
http://trac.nchc.org.tw/cloud/export/107/package/nutchez_0.1-3_i386.deb
64-bit
http://trac.nchc.org.tw/cloud/export/107/package/nutchez_0.1-3_amd64.deb

Install it with

dpkg -i nutchez_0.1-*.deb

Type nutchez to start it

caramels

Samuel
