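OS: Linux, Ubuntu Server 9.10
This builds on the two-node cluster configured in the previous post.
1. On the local machine, append the following to the end of /etc/bash.bashrc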
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/nutch
export HADOOP_CONF_DIR=/opt/nutch/conf
export HADOOP_SLAVES=$HADOOP_CONF_DIR/slaves
export HADOOP_LOG_DIR=/tmp/hadoop/logs
export HADOOP_PID_DIR=/tmp/hadoop/pid
export NUTCH_HOME=/opt/nutch
export NUTCH_CONF_DIR=/opt/nutch/conf
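Before moving on, it can help to confirm that these variables are actually set in the current shell (open a new shell or source /etc/bash.bashrc first). The following is only a minimal sketch and not part of the original setup; the function name check_nutch_env is hypothetical.

```shell
# Hypothetical helper (not part of Nutch or Hadoop): list any of the
# variables from step 1 that are unset or empty in the current shell.
check_nutch_env() {
  missing=0
  for name in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR HADOOP_SLAVES \
              HADOOP_LOG_DIR HADOOP_PID_DIR NUTCH_HOME NUTCH_CONF_DIR; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_nutch_env prints one line per missing variable and returns a non-zero status if any are missing, so it can be used as a guard in a script.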
2. Download the nutch-1.0.tar.gz package to /opt, extract it, and rename the directory
cd /opt
tar -xzvf nutch-1.0.tar.gz
mv nutch-1.0 nutch
3. Copy everything under /opt/hadoop into /opt/nutch
cp -rf /opt/hadoop/* /opt/nutch
4. Copy all the jar files in /opt/nutch/ into /opt/nutch/lib
cd /opt/nutch
cp -rf *.jar lib/
5. Edit the configuration file hadoop-env.sh under /opt/nutch/conf and add
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/nutch
export HADOOP_CONF_DIR=/opt/nutch/conf
export HADOOP_SLAVES=$HADOOP_CONF_DIR/slaves
export HADOOP_LOG_DIR=/tmp/hadoop/logs
export HADOOP_PID_DIR=/tmp/hadoop/pid
export NUTCH_HOME=/opt/nutch
export NUTCH_CONF_DIR=/opt/nutch/conf
Load these environment variables into the current shell
source /opt/nutch/conf/hadoop-env.sh
6. Edit the configuration file nutch-site.xml under /opt/nutch/conf
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch</value>
<description>HTTP 'User-Agent' request header. </description>
</property>
<property>
<name>http.agent.description</name>
<value>MyTest</value>
<description>Further description</description>
</property>
<property>
<name>http.agent.url</name>
<value>localhost</value>
<description>A URL to advertise in the User-Agent header. </description>
</property>
<property>
<name>http.agent.email</name>
<value>test@test.org.tw</value>
<description>An email address
</description>
</property>
<property>
<name>plugin.folders</name>
<value>/opt/nutch/plugins</value>
<description>Directories where nutch plugins are located. </description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(text|html|js|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|swf|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description> Regular expression naming plugin directory names</description>
</property>
<property>
<name>parse.plugin.file</name>
<value>parse-plugins.xml</value>
<description>The name of the file that defines the associations between
content-types and parsers.</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description> </description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
<property>
<name>indexer.mergeFactor</name>
<value>500</value>
<description>The factor that determines the frequency of Lucene segment
merges. This must not be less than 2, higher values increase indexing
speed but lead to increased RAM usage, and increase the number of
open file handles (which may lead to "Too many open files" errors).
NOTE: the "segments" here have nothing to do with Nutch segments, they
are a low-level data unit used by Lucene.
</description>
</property>
<property>
<name>indexer.minMergeDocs</name>
<value>500</value>
<description>This number determines the minimum number of Lucene
Documents buffered in memory between Lucene segment merges. Larger
values increase indexing speed and increase RAM usage.
</description>
</property>
</configuration>
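After editing a file this long, it is easy to lose track of which value a property ended up with. The sketch below reads one property back out of the file. It assumes the layout used above, with <name> and <value> on separate lines, and the function name get_property is hypothetical; it is a quick check, not a real XML parser.

```shell
# Hypothetical helper: print the <value> that follows a given <name>.
# Assumes <name> and <value> sit on separate lines, as in this post.
get_property() {
  # $1 = config file, $2 = property name
  awk -v want="<name>$2</name>" '
    index($0, want) { found = 1; next }
    found && match($0, /<value>.*<\/value>/) {
      # Strip the 7-character <value> and 8-character </value> tags.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}
```

For example, get_property /opt/nutch/conf/nutch-site.xml http.agent.name should print nutch for the file above. The function prints nothing when the property is absent.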
7. Edit the configuration file crawl-urlfilter.txt under /opt/nutch/conf and change the following
-^(ftp|mailto):
-[*!@]
# accept anything else (added)
+.*
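The rules in this file are read from top to bottom, and the first rule whose pattern matches a URL decides whether it is accepted (+) or rejected (-); a URL that matches no rule is rejected. The sketch below imitates that behaviour in shell so a rule set can be tried out before crawling. It uses grep -E as a stand-in for the Java regular expressions Nutch actually uses, which is an assumption that holds for the simple patterns above but not for every pattern. The function name url_allowed is hypothetical.

```shell
# Hypothetical helper: approximate first-match-wins URL filtering.
# grep -E stands in for Java regex, so only simple patterns are safe.
url_allowed() {
  # $1 = URL to test, $2 = rules file in crawl-urlfilter.txt format
  while IFS= read -r rule || [ -n "$rule" ]; do
    case "$rule" in
      '+'*) verdict=0 ;;   # accept rule
      '-'*) verdict=1 ;;   # reject rule
      *) continue ;;       # blank lines and comments
    esac
    pattern=${rule#?}      # drop the leading + or -
    if printf '%s\n' "$1" | grep -Eq -- "$pattern"; then
      return "$verdict"
    fi
  done < "$2"
  return 1                 # no rule matched: reject
}
```

With the rules from this step, url_allowed "http://tw.yahoo.com" /opt/nutch/conf/crawl-urlfilter.txt succeeds, while an ftp: URL or a URL containing @ fails.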
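8. Copy the entire /opt/nutch/ directory to the other machine (replace HOST2_IP with the second machine's IP address)
scp -r /opt/nutch HOST2_IP:/opt/
9. On the local machine, start Hadoop from the /opt/nutch directory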
bin/hadoop namenode -format
bin/start-all.sh
10. Create a urls directory and seed file under /opt/nutch
cd /opt/nutch
mkdir urls
echo "http://tw.yahoo.com" >> ./urls/urls.txt
This creates a urls.txt file listing the sites the search engine should crawl.
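Because the command above appends with >>, running it a second time leaves the same URL in the seed file twice. A small sketch of a helper that only appends URLs not already present is shown below; the function name add_seed is hypothetical.

```shell
# Hypothetical helper: append a URL to the seed file only if that exact
# line is not already there.
add_seed() {
  # $1 = seed file, $2 = URL
  mkdir -p "$(dirname "$1")"
  touch "$1"
  # -F fixed string, -x whole line, -q quiet
  grep -Fxq -- "$2" "$1" || echo "$2" >> "$1"
}
```

For example, add_seed /opt/nutch/urls/urls.txt "http://tw.yahoo.com" can be repeated safely.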
11. Upload it to the DFS file system (using the dfs command from the nutch directory)
bin/hadoop dfs -put urls urls
12. Start the crawl
bin/nutch crawl urls -dir search -threads 2 -depth 3 -topN 100000
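The options on that command line are easy to mix up, so here is a sketch of a wrapper that names each one and can print the command instead of running it. The wrapper run_crawl and the DRY_RUN variable are hypothetical; the underlying command is the same one used above.

```shell
# Hypothetical wrapper: name each crawl option, then either print or run
# the same command as in step 12. Set DRY_RUN=1 to print only.
run_crawl() {
  seeds=$1     # DFS directory holding the seed list (urls)
  outdir=$2    # -dir: where the crawl output is written (search)
  threads=$3   # -threads: number of fetcher threads
  depth=$4     # -depth: link depth to follow from the seed pages
  topn=$5      # -topN: maximum pages fetched at each depth level
  set -- bin/nutch crawl "$seeds" -dir "$outdir" \
         -threads "$threads" -depth "$depth" -topN "$topn"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Running it from /opt/nutch as run_crawl urls search 2 3 100000 is equivalent to the command in this step. A smaller -depth and -topN make a quicker first test run.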
13. Serve the search site with Tomcat
Download apache-tomcat-6.0.18.tar.gz to /opt
Extract it and rename the directory
tar -xzvf apache-tomcat-6.0.18.tar.gz
mv apache-tomcat-6.0.18 tomcat
Edit /opt/tomcat/conf/server.xml so the site uses Unicode (UTF-8) and listens on port 8080
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" URIEncoding="UTF-8"
useBodyEncodingForURI="true" />
Download the search folder that Hadoop produced on the DFS file system to the local disk
cd /opt/nutch
bin/hadoop dfs -get search /opt/search
Replace Tomcat's root web application with Nutch
mv /opt/tomcat/webapps/ROOT /opt/tomcat/webapps/ROOT-backup
cd /opt/nutch
mkdir web
cd web
jar -xvf ../nutch-1.0.war
cd /opt/nutch
mv /opt/nutch/web /opt/tomcat/webapps/ROOT
Edit /opt/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml and set searcher.dir to /opt/search
<configuration>
<property>
<name>searcher.dir</name>
<value>/opt/search</value>
</property>
</configuration>
Start Tomcat
/opt/tomcat/bin/startup.sh
Then open http://LOCAL_IP:8080 in a browser, where LOCAL_IP is this machine's IP address.
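If the search page comes up but returns no results, one thing worth checking is whether the directory named in searcher.dir really holds a complete crawl. The sketch below assumes the Nutch 1.0 crawl command leaves the subdirectories crawldb, linkdb, segments, indexes and index in its output directory; the function name check_search_dir is hypothetical.

```shell
# Hypothetical helper: report which expected crawl subdirectories are
# missing from the directory that searcher.dir points to.
# The list of subdirectories is an assumption about Nutch 1.0 output.
check_search_dir() {
  # $1 = crawl directory, e.g. /opt/search
  status=0
  for sub in crawldb linkdb segments indexes index; do
    if [ ! -d "$1/$sub" ]; then
      echo "missing: $1/$sub"
      status=1
    fi
  done
  return "$status"
}
```

Running check_search_dir /opt/search prints nothing and returns success when all five subdirectories are present.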
Alternatively, nutchez can be used for a quick deployment.
32-bit
http://trac.nchc.org.tw/cloud/export/107/package/nutchez_0.1-3_i386.deb
64-bit
http://trac.nchc.org.tw/cloud/export/107/package/nutchez_0.1-3_amd64.deb
Install the package
dpkg -i nutchez_0.1-*.deb
Then type nutchez to start it.