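Hadoop: Integrating Hive with HBase
HBase serves as the database here, but because HBase has no SQL-like query interface, working with and computing over its data is inconvenient. Integrating Hive lets Hive provide HQL queries on top of the HBase storage layer, so that Hive acts as the data warehouse.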
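1. Querying massive data sets with a Hadoop + Hive architecture: http://blog.csdn.net/kunshan_shenbin/article/details/7105319
2. Integrating HBase 0.90.5 with Hadoop 1.0.0: http://blog.csdn.net/kunshan_shenbin/article/details/7209990
The goal of this article is to show how to make HBase and Hive accessible to each other, so that Hadoop, HBase, and Hive work together as one system.
The test steps in this article mainly follow: http://running.iteye.com/blog/898399
That post, in turn, follows the steps on the official wiki: http://wiki.apache.org/hadoop/Hive/HBaseIntegration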
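1. Copy hbase-0.90.5.jar and zookeeper-3.3.2.jar into hive/lib.
Note: if hive/lib already contains other versions of these two files (for example zookeeper-3.3.1.jar), it is best to delete them and use the versions shipped with HBase.
2. Edit hive-site.xml under hive/conf and add the following at the bottom: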
<!--
<property>
  <name>hive.exec.scratchdir</name>
  <value>/usr/local/hive/tmp</value>
</property>
-->
<property>
  <name>hive.querylog.location</name>
  <value>/usr/local/hive/logs</value>
</property>
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/local/hive/lib/hive-hbase-handler-0.8.0.jar,file:///usr/local/hive/lib/hbase-0.90.5.jar,file:///usr/local/hive/lib/zookeeper-3.3.2.jar</value>
</property>
Note: if hive-site.xml does not exist, create it yourself, or rename hive-default.xml.template and use that.
For details see: http://blog.csdn.net/kunshan_shenbin/article/details/7210020
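The hive.aux.jars.path value is a comma-separated list of file:// URIs with no spaces between entries. As a rough sketch, assuming the jars live in /usr/local/hive/lib as in the snippet above, the value could be generated rather than typed by hand. The helper name aux_jars_value is made up for this illustration and is not part of Hive.

```python
def aux_jars_value(lib_dir, jar_names):
    """Build a comma-separated list of file:// URIs for hive.aux.jars.path.

    Hypothetical helper for illustration only; lib_dir must be an absolute path.
    """
    base = lib_dir.rstrip("/")
    # Join with a bare comma: no spaces between the entries.
    return ",".join(f"file://{base}/{name}" for name in jar_names)


# Jar names taken from the configuration shown in this article.
JARS = [
    "hive-hbase-handler-0.8.0.jar",
    "hbase-0.90.5.jar",
    "zookeeper-3.3.2.jar",
]

print(aux_jars_value("/usr/local/hive/lib", JARS))
```

The printed string can be pasted into the value element of the hive.aux.jars.path property.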
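3. Copy hbase-0.90.5.jar into hadoop/lib on every Hadoop node (including the master).
4. Copy hbase-site.xml from hbase/conf into hadoop/conf on every Hadoop node (including the master).
Note: for the contents of hbase-site.xml, see: http://blog.csdn.net/kunshan_shenbin/article/details/7209990
Note: if steps 3 and 4 are skipped, Hive is likely to fail at runtime with the following error: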
org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information.
    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.
Reference: http://blog.sina.com.cn/s/blog_410d18710100vlbq.html
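One way to double-check step 4 is to confirm that the hbase-site.xml copied into hadoop/conf really defines the ZooKeeper quorum you expect. The sketch below reads a property from a Hadoop-style configuration document using only the Python standard library. The function name read_property and the sample document are assumptions made for this illustration; the quorum value slave simply mirrors the one used in the startup command in this article.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text, name):
    """Return the <value> of the <property> whose <name> matches, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Hypothetical sample; a real hbase-site.xml will contain more properties.
SAMPLE = """<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>slave</value>
  </property>
</configuration>"""

print(read_property(SAMPLE, "hbase.zookeeper.quorum"))
```

For a file on disk, ET.parse(path).getroot() yields the same kind of root element that the loop iterates over.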
You can now try starting Hive.
Single-node startup:
> bin/hive -hiveconf hbase.master=master:60000
Cluster startup:
> bin/hive -hiveconf hbase.zookeeper.quorum=slave
If hive.aux.jars.path is not configured in hive-site.xml, start Hive as follows instead (the jar paths must be separated by commas with no spaces):
> bin/hive --auxpath /usr/local/hive/lib/hive-hbase-handler-0.8.0.jar,/usr/local/hive/lib/hbase-0.90.5.jar,/usr/local/hive/lib/zookeeper-3.3.2.jar -hiveconf hbase.zookeeper.quorum=slave
Now we can run a few tests.
1. Create a table that HBase recognizes:
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
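hbase.table.name sets the name of the table in HBase.
hbase.columns.mapping maps the Hive columns, in order, to the HBase row key and columns: here :key is the row key and cf1:val is the column val in column family cf1.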
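To make the positional nature of hbase.columns.mapping concrete, here is a small sketch that pairs each Hive column with its mapping entry. It only illustrates the idea and is not how Hive parses the property internally; the function name describe_mapping is invented for this example.

```python
def describe_mapping(hive_columns, mapping):
    """Pair Hive columns with hbase.columns.mapping entries, position by position.

    Illustrative only. Entries are deliberately not trimmed: the mapping
    string should contain no whitespace between entries.
    """
    entries = mapping.split(",")
    if len(entries) != len(hive_columns):
        raise ValueError(
            f"{len(hive_columns)} Hive columns but {len(entries)} mapping entries"
        )
    pairs = []
    for column, entry in zip(hive_columns, entries):
        if entry == ":key":
            pairs.append((column, "row key"))
        else:
            family, _, qualifier = entry.partition(":")
            pairs.append((column, f"family={family}, qualifier={qualifier}"))
    return pairs


# Columns and mapping of hbase_table_1 from the statement above.
for column, target in describe_mapping(["key", "value"], ":key,cf1:val"):
    print(f"{column} -> {target}")
```

For hbase_table_1 this pairs the Hive column key with the row key and the Hive column value with qualifier val in family cf1.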
2. Import data using SQL
a) Create a Hive table
hive> CREATE TABLE pokes (foo INT, bar STRING);
b) Bulk-load data
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
c) Import into hbase_table_1 using SQL
hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=86;
3. View the data
hive> select * from hbase_table_1;
You can now log in to HBase to inspect the data.
> /usr/local/hbase/bin/hbase shell
hbase(main):001:0> describe 'xyz'
hbase(main):002:0> scan 'xyz'
hbase(main):003:0> put 'xyz','100','cf1:val','www.360buy.com'
The row just inserted through HBase is now visible in Hive.
hive> select * from hbase_table_1;
4. Accessing an existing HBase table from Hive
Use CREATE EXTERNAL TABLE:
CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
TBLPROPERTIES("hbase.table.name" = "some_existing_table");
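Multiple Columns and Families
1. Create the table (if you created the external table hbase_table_2 in the previous step, drop it or pick a different name first, since the name is reused here)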
CREATE TABLE hbase_table_2(key int, value1 string, value2 int, value3 int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,a:b,a:c,d:e"
);
2. Insert data
INSERT OVERWRITE TABLE hbase_table_2 SELECT foo, bar, foo+1, foo+2
FROM pokes WHERE foo=98 OR foo=100;
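Besides the row key, this table has three Hive columns (value1, value2, and value3) and two HBase column families (a and d).
Two of the Hive columns (value1 and value2) map to a single HBase column family (a, with HBase column names b and c); the remaining Hive column (value3) maps to column e in column family d.
3. Log in to HBase and inspect the table structure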
hbase(main):003:0> describe "hbase_table_2"
DESCRIPTION                                                          ENABLED
 {NAME => 'hbase_table_2', FAMILIES => [{NAME => 'a',                true
 COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
 BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
 {NAME => 'd', COMPRESSION => 'NONE', VERSIONS => '3',
 TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
 BLOCKCACHE => 'true'}]}
1 row(s) in 1.0630 seconds
4. View the data in HBase
hbase(main):004:0> scan 'hbase_table_2'
ROW          COLUMN+CELL
 100         column=a:b, timestamp=1297695262015, value=val_100
 100         column=a:c, timestamp=1297695262015, value=101
 100         column=d:e, timestamp=1297695262015, value=102
 98          column=a:b, timestamp=1297695242675, value=val_98
 98          column=a:c, timestamp=1297695242675, value=99
 98          column=d:e, timestamp=1297695242675, value=100
2 row(s) in 0.0380 seconds
5. View the data in Hive
hive> select * from hbase_table_2;
OK
100 val_100 101 102
98 val_98 99 100
Time taken: 3.238 seconds
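The relationship between the Hive rows and the HBase cells in this section can be sketched in a few lines of Python. This is only a model of the mapping ":key,a:b,a:c,d:e", not code that talks to Hive or HBase; the function name to_hbase_cells is invented for the illustration, and values are rendered as strings because that is how they appear in the scan output.

```python
def to_hbase_cells(mapping, row):
    """Split one Hive row into (row_key, {"family:qualifier": value}).

    Illustrative model only; assumes one mapping entry per Hive column.
    """
    entries = mapping.split(",")
    if len(entries) != len(row):
        raise ValueError("row length does not match the column mapping")
    row_key = None
    cells = {}
    for entry, value in zip(entries, row):
        if entry == ":key":
            row_key = str(value)
        else:
            cells[entry] = str(value)
    if row_key is None:
        raise ValueError("mapping has no :key entry")
    return row_key, cells


MAPPING = ":key,a:b,a:c,d:e"

# The two rows selected by the INSERT in this section (foo=98 OR foo=100),
# written as (foo, bar, foo+1, foo+2).
for hive_row in [(98, "val_98", 99, 100), (100, "val_100", 101, 102)]:
    key, cells = to_hbase_cells(MAPPING, hive_row)
    for column, value in cells.items():
        print(f"{key}  column={column}, value={value}")
```

Each Hive row becomes one HBase row whose key is the first column, with one cell per remaining column, which is the shape of the scan output shown in this section.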
References:
http://running.iteye.com/blog/898399
http://heipark.iteye.com/blog/1150648
http://www.javabloger.com/article/apache-hadoop-hive-hbase-integration.html