Oozie 4.2.0 Configuration and Installation in Practice


Source: http://blog.csdn.net/fansy1990/article/details/50570518

Tasks run on Hadoop sometimes need several Map/Reduce jobs chained together to get the work done. The Hadoop ecosystem includes a relatively new component called Oozie, which lets us combine multiple Map/Reduce jobs into a single logical unit of work and thus tackle larger tasks. This article introduces Oozie and some of the ways it can be used.

What is Oozie?

Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store the following:

  • Workflow definitions
  • Currently running workflow instances, including their state and variables

An Oozie workflow is a collection of actions (for example Hadoop Map/Reduce jobs, Pig jobs, and so on) arranged in a control-dependency DAG (Directed Acyclic Graph), which specifies the order in which the actions run. The graph is described with hPDL, an XML process definition language.
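As a rough illustration of the shape of an hPDL definition (the node names here are made up; complete, working workflow.xml files appear in the examples later in this post):

<workflow-app xmlns="uri:oozie:workflow:0.2" name="sketch-wf">
    <start to="first-action"/>
    <action name="first-action">
        <!-- a map-reduce, pig, hive, spark, ... action element goes here -->
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>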

Software versions:

Oozie 4.2.0, Hadoop 2.6.0, Spark 1.4.1, Hive 0.14, Pig 0.15.0, Maven 3.2, JDK 1.7, ZooKeeper 3.4.6, HBase 1.1.2, MySQL 5.6

Cluster deployment:

node1~4.centos.com (node1~4), 192.168.0.31~34; four virtual machines, each with 1 GB of RAM and 1 CPU core

node1: NameNode, ResourceManager

node2: SecondaryNameNode, Master, HMaster, HistoryServer, JobHistoryServer

node3: oozie-server (Tomcat), DataNode, NodeManager, HRegionServer, Worker, QuorumPeerMain

node4: DataNode, NodeManager, HRegionServer, Worker, Pig client, Hive client, HiveServer2, QuorumPeerMain, MySQL

1. Building Oozie 4.2.0

This section draws on http://oozie.apache.org/docs/4.2.0/DG_QuickStart.html#Building_Oozie and http://blog.csdn.net/u014729236/article/details/47188631

1.1 Preparing the build environment

We build against Tomcat 7 rather than Tomcat 6, so the Tomcat download URL in the distro module must be changed:
1) Download oozie-4.2.0.tar.gz and extract it to /usr/local/oozie.
2) Edit pom.xml:
/usr/local/oozie/oozie-4.2.0/distro/pom.xml
<get src="http://archive.apache.org/dist/tomcat/tomcat-6    ==>
<get src="http://archive.apache.org/dist/tomcat/tomcat-7

3) Edit Maven's settings.xml to use the OSChina (open-source China) mirror:
<mirror>
      <id>nexus-osc</id>
      <name>OSChina Central</name>                                                                             
      <url>http://maven.oschina.net/content/groups/public/</url>
      <mirrorOf>*</mirrorOf>
</mirror>

1.2 Building

From the extracted Oozie directory, run:
bin/mkdistro.sh -DskipTests -Phadoop-2 -Dhadoop.auth.version=2.6.0 -Ddistcp.version=2.6.0 -Dspark.version=1.4.1 -Dpig.version=0.15.0 -Dtomcat.version=7.0.52
If HBase or Hive is added and pinned to a newer version, the build fails, for example:
#bin/mkdistro.sh -DskipTests -Phadoop-2 -Dhadoop.auth.version=2.6.0 -Ddistcp.version=2.6.0 -Dspark.version=1.4.1 -Dpig.version=0.15.0 -Dtomcat.version=7.0.52 #-Dhive.version=0.14.0 -Dhbase.version=1.1.2 ## pinning hive and hbase to newer versions does not compile

1.3 Modifying the HDFS configuration

Edit Hadoop's core-site.xml and add the following:
<property>
    <name>hadoop.proxyuser.[USER].hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.[USER].groups</name>
    <value>*</value>
  </property>
Here [USER] must be replaced with the user that will later start the Oozie Tomcat server.
To make the configuration take effect without restarting the Hadoop cluster:
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration
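For example, if Oozie's Tomcat will be started by root (as in the rest of this walkthrough), the two properties become:

<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>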

1.4 Configuring Oozie

(Oozie is deployed on node3, so copy the distribution tarball below to node3.)

1) Get the distribution tarball:
oozie-4.2.0/distro/target/oozie-4.2.0-distro.tar.gz
2) Extract it:
tar -zxf oozie-4.2.0-distro.tar.gz

3) Create a libext directory under oozie-4.2.0, copy ext-2.2.zip into it, and also copy the Hadoop jars into it:
cp $HADOOP_HOME/share/hadoop/*/*.jar libext/
cp $HADOOP_HOME/share/hadoop/*/lib/*.jar libext/

Remove (rename) the Hadoop jars that conflict with Tomcat (run these inside libext):
mv servlet-api-2.5.jar servlet-api-2.5.jar.bak
mv jsp-api-2.1.jar jsp-api-2.1.jar.bak
mv jasper-compiler-5.5.23.jar jasper-compiler-5.5.23.jar.bak
mv jasper-runtime-5.5.23.jar jasper-runtime-5.5.23.jar.bak

Copy the MySQL JDBC driver into the same directory (we use MySQL as the database instead of the default Derby):
scp mysql-connector-java-5.1.25-bin.jar node3:/usr/oozie/oozie-4.2.0/libext/

4) Configure the database connection in conf/oozie-site.xml:
<property>
    <name>oozie.service.JPAService.create.db.schema</name>
    <value>true</value>
</property>
<property>
    <name>oozie.service.JPAService.jdbc.driver</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>oozie.service.JPAService.jdbc.url</name>
    <value>jdbc:mysql://node4:3306/oozie?createDatabaseIfNotExist=true</value>
</property>

<property>
    <name>oozie.service.JPAService.jdbc.username</name>
    <value>root</value>
</property>

<property>
    <name>oozie.service.JPAService.jdbc.password</name>
    <value>root</value>
</property>
<property>
    <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
    <value>*=/usr/hadoop/hadoop-2.6.0/etc/hadoop</value>
</property>


The last property is required; without it, jobs later fail with the error "File /user/root/share/lib does not exist".
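Once the sharelib has been created (step 5d below), you can confirm that the directory this error refers to exists; a quick check, assuming the root user and the default sharelib location, is:

hdfs dfs -ls /user/root/share/lib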


5) Initialization before starting
a. Build the war file:
bin/oozie-setup.sh prepare-war

b. Initialize the database:
bin/ooziedb.sh create -sqlfile oozie.sql -run


c. Edit the oozie-4.2.0/oozie-server/conf/server.xml file and comment out the following listener:
<!--<Listener className="org.apache.catalina.mbeans.ServerLifecycleListener" />-->

d. Upload the sharelib jars to HDFS:
bin/oozie-setup.sh sharelib create -fs hdfs://node1:8020 

1.5 Starting Oozie

bin/oozied.sh start
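To confirm the server came up, query its status (the URL assumes Oozie runs on node3 on the default port 11000); it should report something like "System mode : NORMAL":

bin/oozie admin -oozie http://node3:11000/oozie -status

The web console is then available at http://node3:11000/oozie.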



2. Workflow examples

The data set is bank.csv, already uploaded to hdfs://node1:8020/user/root/bank.csv; it can be downloaded from http://zeppelin-project.org/docs/tutorial/tutorial.html
(When running the Hive and Pig examples, delete the first line of the data, i.e. the header row.)
By default all operations are performed as the root user; for a different user you may need to adjust the corresponding directories.
Set the environment variable: export OOZIE_URL=http://node3:11000/oozie

2.1 MapReduce workflow

1. job.properties:
oozie.wf.application.path=hdfs://node1:8020/user/root/workflow/mr_demo/wf
# Hadoop ResourceManager
jobTracker=node1:8032
# Hadoop fs.default.name
nameNode=hdfs://node1:8020/
# Hadoop mapred.queue.name
queueName=default

2. workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/${wf:user()}/workflow/mr_demo/output"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapreduce.mapper.class</name>
                    <value>org.apache.hadoop.examples.WordCount$TokenizerMapper</value>
                </property>
                <property>
                    <name>mapreduce.reducer.class</name>
                    <value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
                </property>
                <property>
                    <name>mapred.map.tasks</name>
                    <value>1</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/${wf:user()}/bank.csv</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/${wf:user()}/workflow/mr_demo/output</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

3. Running:
1) Copy workflow.xml to HDFS as hdfs://node1:8020/user/root/workflow/mr_demo/wf/workflow.xml (a shell sketch of steps 1 and 2 follows this list);
2) On node3 (which acts as both the Oozie server and client), run bin/oozie job -config job.properties -run to submit the job; submission returns a jobId, for example:
0000004-160123180442501-oozie-root-W
3) Use bin/oozie job -info 0000004-160123180442501-oozie-root-W to check the workflow status;
4) When the workflow finishes, check its status and look in the corresponding output directory for the results.
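A sketch of steps 1) and 2) from the shell, assuming job.properties and workflow.xml are in the current local directory:

hdfs dfs -mkdir -p /user/root/workflow/mr_demo/wf
hdfs dfs -put -f workflow.xml /user/root/workflow/mr_demo/wf/
bin/oozie job -config job.properties -run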

2.2 Pig workflow

1. job.properties
oozie.wf.application.path=hdfs://node1:8020/user/root/workflow/pig_demo/wf
# Pig workflows must set this option
oozie.use.system.libpath=true
# Hadoop ResourceManager
resourceManager=node1:8032
# Hadoop fs.default.name
nameNode=hdfs://node1:8020/
# Hadoop mapred.queue.name
queueName=default

2.  workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="whitehouse-workflow">
    <start to="transform_job"/>
    <action name="transform_job">
        <pig>
            <job-tracker>${resourceManager}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="/user/root/workflow/pig_demo/output"/>
            </prepare>
            <script>transform_job.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Job failed, error
            message[${wf:errorMessage(wf:lastErrorNode())}]
        </message>
    </kill>
    <end name="end"/>
</workflow-app>

3. transform_job.pig, the script used by the Pig action
bank_data= LOAD '/user/root/bank.csv' USING PigStorage(';') AS
(age:int, job:chararray, marital:chararray,education:chararray,
 default:chararray,balance:int,housing:chararray,loan:chararray,
contact:chararray,day:int,month:chararray,duration:int,campaign:int,
pdays:int,previous:int,poutcom:chararray,y:chararray);

age_gt_30 = FILTER bank_data BY age >= 30;

store age_gt_30 into '/user/root/workflow/pig_demo/output' using PigStorage(',');
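Optionally, the Pig client (installed on node4 in this deployment) can parse the script before it is wired into the workflow; -check only validates the syntax and does not launch a job:

pig -check transform_job.pig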
4. Running
1) Copy transform_job.pig and workflow.xml to the hdfs://node1:8020/user/root/workflow/pig_demo/wf/ directory;
2) Run bin/oozie job -config job.properties -run
3) Run bin/oozie job -info jobId to check the job's progress and status, or view all jobs in a browser at node3:11000.

2.3 Hive workflow

Note: after the Hive job completes, bank.csv disappears from its original location (presumably moved into Hive's warehouse directory by LOAD DATA INPATH), so the file has to be re-uploaded before running the other examples or re-running this one.
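Re-uploading it is a one-liner, assuming bank.csv is in the current local directory:

hdfs dfs -put -f bank.csv /user/root/bank.csv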
1. job.properties
nameNode=hdfs://node1:8020
jobTracker=node1:8032
queueName=default
maxAge=30
input=/user/root/bank.csv
output=/user/root/workflow/hive_demo/output
oozie.use.system.libpath=true

oozie.wf.application.path=${nameNode}/user/${user.name}/workflow/hive_demo/wf
2. workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
    <start to="hive-node"/>

    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${output}/hive"/>
                <mkdir path="${output}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>script.hive</script>
            <param>INPUT=${input}</param>
            <param>OUTPUT=${output}/hive</param>
            <param>maxAge=${maxAge}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
3. script.hive, the script used by the Hive action
DROP TABLE IF EXISTS bank;

CREATE TABLE bank(
    age int,
    job string,
    marital string,education string,
 default string,balance int,housing string,loan string,
contact string,day int,month string,duration int,campaign int,
pdays int,previous int,poutcom string,y string
) 
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073'
 STORED AS TEXTFILE;

 LOAD DATA INPATH '${INPUT}' INTO TABLE bank;

INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM bank where age > '${maxAge}';
Note: '\073' is the octal escape for the semicolon, which is the field delimiter in bank.csv.
4. Run it as described above.


2.4 Hive 2 workflow

1. job.properties 
nameNode=hdfs://node1:8020
jobTracker=node1:8032
queueName=default
# set this option when using HiveServer2
jdbcURL=jdbc:hive2://node4:10000/default
maxAge=30
input=/user/root/bank.csv
output=/user/root/workflow/hive2_demo/output
oozie.use.system.libpath=true

oozie.wf.application.path=${nameNode}/user/${user.name}/workflow/hive2_demo/wf

2. workflow.xml 
<workflow-app xmlns="uri:oozie:workflow:0.5" name="hive2-wf">
    <start to="hive2-node"/>

    <action name="hive2-node">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${output}/hive"/>
                <mkdir path="${output}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>

            <jdbc-url>${jdbcURL}</jdbc-url>
            <script>script2.hive</script>
            <param>INPUT=${input}</param>
            <param>OUTPUT=${output}/hive</param>
            <param>maxAge=${maxAge}</param>
        </hive2>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Hive2 failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
3. script2.hive, the script used by the Hive 2 action
DROP TABLE IF EXISTS bank2;

CREATE TABLE bank2(
    age int,
    job string,
    marital string,education string,
 default string,balance int,housing string,loan string,
contact string,day int,month string,duration int,campaign int,
pdays int,previous int,poutcom string,y string
) 
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073'
 STORED AS TEXTFILE;

 LOAD DATA INPATH '${INPUT}' INTO TABLE bank2;

INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM bank2 where age > '${maxAge}';
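Before submitting through Oozie, the HiveServer2 connection and the script can be exercised directly with beeline (the user name is an assumption; the JDBC URL matches jdbcURL above, and the variables mirror the <param> values in workflow.xml):

beeline -u jdbc:hive2://node4:10000/default -n root \
    --hivevar INPUT=/user/root/bank.csv \
    --hivevar OUTPUT=/user/root/workflow/hive2_demo/output/hive \
    --hivevar maxAge=30 \
    -f script2.hive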

4. Run it as described above.

2.5 Spark workflow

1. job.properties :
nameNode=hdfs://node1:8020
jobTracker=node1:8032
#master=spark://node2:7077 
master=spark://node2:6066
sparkMode=cluster
queueName=default
oozie.use.system.libpath=true
input=/user/root/bank.csv
output=/user/root/workflow/spark_demo/output
# the jar file must be local
jarPath=${nameNode}/user/root/workflow/spark_demo/lib/oozie-examples.jar
oozie.wf.application.path=${nameNode}/user/${user.name}/workflow/spark_demo/wf
Because sparkMode is set to cluster, the master URL has to use port 6066 (the standalone master's REST submission port) as shown above.
I did not manage to get sparkMode=client to work.
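For intuition, the spark action below does roughly what a manual cluster-mode submission against the standalone master's REST port would do (a sketch only; the local jar path is an assumption):

spark-submit --master spark://node2:6066 --deploy-mode cluster \
    --class org.apache.oozie.example.SparkFileCopy \
    /path/to/oozie-examples.jar \
    hdfs://node1:8020/user/root/bank.csv \
    hdfs://node1:8020/user/root/workflow/spark_demo/output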

2. workflow.xml
<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy'>
    <start to='spark-node' />

    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${output}"/>
            </prepare>
            <master>${master}</master>
            <mode>${sparkMode}</mode>
            <name>Spark-FileCopy</name>
            <class>org.apache.oozie.example.SparkFileCopy</class>
            <jar>${jarPath}</jar>
            <arg>${input}</arg>
            <arg>${output}</arg>
        </spark>
        <ok to="end" />
        <error to="fail" />
    </action>

    <kill name="fail">
        <message>Workflow failed, error
            message[${wf:errorMessage(wf:lastErrorNode())}]
        </message>
    </kill>
    <end name='end' />
</workflow-app>

3. Running:
1) The oozie-examples.jar used here comes from the examples/apps/spark/lib directory of the extracted oozie-examples.tar.gz;
2) Upload oozie-examples.jar to hdfs://node1:8020/user/root/workflow/spark_demo/lib/oozie-examples.jar and workflow.xml to hdfs://node1:8020/user/root/workflow/spark_demo/wf/workflow.xml;
3) Run bin/oozie job -config job.properties -run.

4. Notes:
1) Submitting this way starts a task through YARN, which then hands the job to the Spark cluster to run; the job is not executed directly by the Spark cluster, as the screenshots show:
First, the 8088 web UI shows the task started by YARN:

Then the Spark monitoring UI shows the same job:


The timing looks odd at this point, though, so check the logs:

You can see that after connecting to YARN's ResourceManager it immediately connects to the Spark master and submits the job; the YARN task then goes straight to SUCCEEDED and YARN returns.
The Spark logs agree with these timestamps:

Finally the output file is saved and the driver is shut down:

2.6 Spark on YARN workflow

Following the hints in the official documentation:


1. job.properties:
nameNode=hdfs://node1:8020
jobTracker=node1:8032
#master=spark://node2:7077
#master=spark://node2:6066
master=yarn-cluster
#sparkMode=cluster
queueName=default
oozie.use.system.libpath=true
input=/user/root/bank.csv
output=/user/root/workflow/sparkonyarn_demo/output

jarPath=${nameNode}/user/root/workflow/sparkonyarn_demo/lib/oozie-examples.jar
oozie.wf.application.path=${nameNode}/user/${user.name}/workflow/sparkonyarn_demo
2. workflow.xml:
<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkFileCopy_on_yarn'>
    <start to='spark-node' />

    <action name='spark-node'>
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${output}"/>
            </prepare>
            <master>${master}</master>
            <name>Spark-FileCopy-on-yarn</name>
            <class>org.apache.oozie.example.SparkFileCopy</class>
            <jar>${jarPath}</jar>
            <spark-opts>--conf spark.yarn.historyServer.address=http://node2:18080 --conf spark.eventLog.dir=hdfs://node1:8020/spark-log --conf spark.eventLog.enabled=true</spark-opts>
            <arg>${input}</arg>
            <arg>${output}</arg>
        </spark>
        <ok to="end" />
        <error to="fail" />
    </action>

    <kill name="fail">
        <message>Workflow failed, error
            message[${wf:errorMessage(wf:lastErrorNode())}]
        </message>
    </kill>
    <end name='end' />
</workflow-app>

3. Running:
1) Prepare the environment: copy workflow.xml to hdfs://node1:8020/user/root/workflow/sparkonyarn_demo/workflow.xml;
2) Copy oozie-examples.jar to hdfs://node1:8020/user/root/workflow/sparkonyarn_demo/lib/oozie-examples.jar;
3) Copy $SPARK_HOME/lib/spark-assembly-1.4.1-hadoop2.6.0.jar to hdfs://node1:8020/user/root/workflow/sparkonyarn_demo/lib/spark-assembly-1.4.1-hadoop2.6.0.jar;
4) Run bin/oozie job -config job.properties -run
5) Check the job status:

4. Notes

1) The difference between the plain Spark submission and the Spark on YARN mode:
Spark on YARN also submits the job through YARN, but there is no job on the Spark standalone cluster; everything runs on YARN. Compare the logs:
On the 8088 UI:


For the 0000003-160123180442501-oozie-root-W job there is only one YARN application from start to finish, plus one job on the Spark cluster (node2:8080); compare the timestamps.
With the Spark on YARN mode:

You can see that the 0000009-160123180442501-oozie-root-W job is actually made up of two YARN applications.

Looking at the Oozie job monitoring:

So in the plain Spark mode, YARN starts a task, the job is then run by the Spark standalone cluster, and then it finishes; this requires the Spark cluster to be running (as well as the YARN cluster).
In the Spark on YARN mode, YARN starts task A, which in turn launches another YARN task B; when B completes, control returns to A and A finishes. No Spark standalone cluster is needed (the figure below makes this clear).






Share, grow, enjoy

Stay grounded, stay focused

When reposting, please credit the blog: http://blog.csdn.net/fansy1990

