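Running MapReduce with Java @ Samuel

OS: Linux, Ubuntu Server 9.10

Once the namenode is set up,

the downloaded Hadoop distribution includes a hadoop-*-examples.jar.

Following the earlier setup notes, it should be in /opt/hadoop.

To try it out: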

cd /opt/hadoop

The next command uploads the local folder myinput to the input directory on HDFS. If input does not exist yet, it is created automatically.

bin/hadoop fs -put myinput input

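The next command runs the bundled wordcount example (it counts words, using whitespace as the separator) on the data in input and writes the results to output. The job creates output itself, so that directory must not already exist.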

bin/hadoop jar hadoop-*-examples.jar wordcount input output
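To make the result of wordcount concrete, here is a minimal local sketch of the same counting rule (tokens separated by whitespace). It uses no Hadoop classes and runs in a single JVM; the class and method names are made up for illustration and are not part of the examples jar.

```java
import java.util.Map;
import java.util.TreeMap;

// Local illustration only: no Hadoop classes are involved.
class WordCountSketch {
    // Split the text on runs of whitespace and count each token.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {            // blank input yields one empty token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints each distinct word with its count, sorted by word.
        System.out.println(countWords("hello hadoop hello"));
    }
}
```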

Many other examples are bundled; see http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/examples/package-summary.html

To run a program you wrote yourself:

cd /opt/hadoop

The steps below compile mycode.java against hadoop-*-core.jar (e.g. hadoop-0.18.3-core.jar).

Normally mycode.java contains two classes, a mapper and a reducer.
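As a rough picture of how the mapper and the reducer divide the work, the sketch below imitates the two roles with plain Java methods. It is an assumption-laden stand-in: a real mycode.java would implement Hadoop's Mapper and Reducer interfaces and leave the grouping step to the framework, whereas here the hypothetical run method does the grouping in memory.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the mapper/reducer split; not Hadoop API code.
class MapReduceShapeSketch {
    // Mapper role: emit one (word, 1) pair per token of an input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Reducer role: combine all values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Framework role (hypothetical stand-in): group pairs by key, then reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b", "b c b")));
    }
}
```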

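1. Compile: mkdir -p Myjava && javac -classpath hadoop-*-core.jar -d Myjava mycode.java

Package the compiled classes into myjar.jar.

2. Package: jar -cvf myjar.jar -C Myjava .

Run it using the name of the main class (e.g. mycode).

3. Run: bin/hadoop jar myjar.jar mycode input output

The results then appear in the output directory.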

Eclipse can also be used for editing. First install Sun Java 6:

apt-get purge java-gcj-compat

apt-get install java-common sun-java6-bin sun-java6-jdk sun-java6-jre

Download the JDK 6 documentation (jdk-6u10-docs.zip) and place it in /tmp.

apt-get install sun-java6-doc

Create a link to the html docs in /usr/lib/jvm/java-6-sun/docs/.

ln -sf /usr/share/doc/sun-java6-jdk/html /usr/lib/jvm/java-6-sun/docs

Get the eclipse-SDK-3.3.2-linux-gtk.tar.gz package, extract it, and move it under /opt.

tar -zxvf eclipse-SDK-3.3.2-linux-gtk.tar.gz

mv eclipse /opt

Create a link to the eclipse executable in /usr/local/bin/.

ln -sf /opt/eclipse/eclipse /usr/local/bin/

Copy the Eclipse plugin shipped in /opt/hadoop into Eclipse's plugins directory.

cd /opt/hadoop

cp /opt/hadoop/contrib/eclipse-plugin/hadoop-0.18.3-eclipse-plugin.jar /opt/eclipse/plugins

Check that /opt/eclipse/eclipse.ini contains the following:

-showsplash
org.eclipse.platform
-vmargs
-Xms40m
-Xmx256m

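Launch Eclipse.

Set the workspace directory.

Under Window -> Open Perspective -> Other, select Map/Reduce.

Click DFS Locations on the left; a panel with a yellow elephant icon appears at the bottom of the editing window.

File -> New -> Project, then select Map/Reduce Project.

Enter a project name (e.g. temp), click the blue "Configure Hadoop install directory" link, enter /opt/hadoop, and click Finish.

In the left pane, right-click the temp project and choose Properties -> Java Build Path -> Libraries.

For each entry with hadoop in its name, set the following: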

source attachment: /opt/hadoop/src/core

javadoc location: file:/opt/hadoop/docs/api/

Then set Properties -> Javadoc Location:

javadoc location path: file:/usr/lib/jvm/java-6-sun/docs/api/

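Back in the main window, click DFS Locations on the left. The panel at the bottom of the editing window shows the yellow elephant, and at the top right corner of that panel there is a small blue elephant.

Click it and fill in the settings:

Location name: hadoop

Map/Reduce Master: host: localhost port: 9001

DFS Master: host: localhost port: 9000

User name: your user name

Once this is configured, the DFS file structure becomes visible, matching what http://localhost:50070 shows.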

Choose File -> New -> Mapper

File -> New -> Reducer

File -> New -> Map/Reduce Driver (the main class)

The main function of the Map/Reduce Driver must contain the following settings:


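// Uses the org.apache.hadoop.mapred API plus org.apache.hadoop.fs.Path.
JobConf conf = new JobConf(WordCount.class);
// Register the mapper and reducer classes created above.
conf.setMapperClass(mapper.class);
conf.setReducerClass(reducer.class);
// Input and output locations on HDFS; replace <username> with your user name.
FileInputFormat.setInputPaths(conf, new Path("/user/<username>/input"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
// Submit the job and wait for it to finish.
JobClient.runJob(conf);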

These lines wire up the mapper and the reducer, set the input and output paths, and start the job.
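The driver lines mix an absolute input path with the relative output path "output". On HDFS a relative path is normally resolved against the user's home directory, /user/ followed by the user name. The sketch below is a simplified model of that rule, not Hadoop's own resolution code; the class name, method name, and the user name alice are hypothetical.

```java
// Simplified model of how a relative HDFS path maps to an absolute one.
// Assumption: the working directory is the default home, /user/<userName>.
class HdfsPathSketch {
    static String resolve(String userName, String path) {
        if (path.startsWith("/")) {
            return path;                          // already absolute
        }
        return "/user/" + userName + "/" + path;  // relative to the home directory
    }

    public static void main(String[] args) {
        System.out.println(resolve("alice", "output"));
        System.out.println(resolve("alice", "/user/alice/input"));
    }
}
```

Under this model, "output" and "/user/alice/output" name the same directory for the user alice, which is why the job's results show up next to the input folder in the DFS Locations view.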

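The project contains src and bin directories.

src holds the .java sources; bin holds the compiled .class files.

To run, right-click the project and choose Run As -> Run on Hadoop.

Alternatively, use File -> Export -> Java -> JAR file to produce a jar and run it from the command line.

A Makefile can also be used: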

JarFile="test1.jar"
MainFunc="Sample.test"
LocalOutDir="/tmp/output"
HADOOP_BIN="/opt/hadoop/bin"

.PHONY: all jar run output clean

all: jar run output clean

jar:
	jar -cvf ${JarFile} -C bin/ .

run:
	${HADOOP_BIN}/hadoop jar ${JarFile} ${MainFunc} input output

clean:
	${HADOOP_BIN}/hadoop fs -rmr output

output:
	rm -rf ${LocalOutDir}
	${HADOOP_BIN}/hadoop fs -get output ${LocalOutDir}
	gedit ${LocalOutDir}/part-r-00000 &
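
Instead of opening the fetched part file in an editor, it can be read programmatically. The sketch below assumes the job wrote plain text with one key, a tab, and an integer count per line (the default text output layout for a word count); the class name is hypothetical, and the default file path is simply the one used in the Makefile above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Reads "word<TAB>count" lines from a locally fetched part file.
class PartFileReaderSketch {
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue;                          // skip lines without a tab
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass another path as the first argument.
        String file = args.length > 0 ? args[0] : "/tmp/output/part-r-00000";
        parse(Files.readAllLines(Paths.get(file)))
            .forEach((word, count) -> System.out.println(count + " " + word));
    }
}
```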