Hadoop 2.6 + Hive 1.2.1 + spark-1.4.1(3)

Posted by jopen 10 years ago | 9K reads | Distributed / Cloud Computing / Big Data

1. Creating tables

1) Create a table structure

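create table user_table(
id int,
userid bigint,
name string,
-- DESCRIBE is a reserved keyword in Hive 1.2, so the column name is quoted with backticks
`describe` string comment 'description of the user'
)
comment 'user information table'
-- Partitions: each partition is simply a subdirectory of the table directory
partitioned by (country string, city string)
-- Rows are assigned to one of 32 buckets by hashing id; within each bucket rows are sorted by userid.
-- Bucketing lets the useful data be loaded into limited memory piece by piece
-- (a performance optimization that also helps join, group by and distinct).
clustered by (id) sorted by (userid) into 32 buckets
-- Delimiters used to parse the data. Hive does not process the raw data that is loaded into the table.
-- It only applies these delimiters when reading.
row format delimited
fields terminated by '\001' -- separator between fields
collection items terminated by '\002' -- separator between the elements of an array field
map keys terminated by '\003' -- separator between a key and its value in a map field
-- Storage format (rcfile / textfile / sequencefile). For raw data, textfile is enough.
stored as textfile;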
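The statement above declares delimiters for collection items and map keys, yet user_table has no array or map column. As a minimal sketch of where those delimiters come into play, assume a hypothetical table user_tags (the table, its columns and the 'city' key are illustrative and not part of the original):

```sql
-- Hypothetical table. In the raw file, array elements and map entries
-- are separated by '\002', and each map key is separated from its value by '\003'.
create table user_tags (
  id int,
  tags array<string>,
  props map<string,string>
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
stored as textfile;

-- Read the first array element and one map value ('city' is an assumed key)
select id, tags[0], props['city'] from user_tags;
```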

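Summary:

Compared with textfile and SequenceFile, rcfile stores data by column, which makes loading more expensive but gives a better compression ratio and faster query response. A data warehouse is typically written once and read many times, so overall rcfile has a clear advantage over the other two formats.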
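LOAD DATA only moves or copies files and does not convert their format, so a plain text file cannot simply be loaded into an rcfile table. One common approach, sketched here with a hypothetical table user_table_rc (the name is an assumption), is to create the rcfile table and fill it from a textfile table with insert ... select:

```sql
-- Hypothetical rcfile copy of user_table (name is illustrative only)
create table user_table_rc (
  id int,
  userid bigint,
  name string
)
stored as rcfile;

-- Rewrites the rows in rcfile format. This is where the extra load cost is paid.
insert overwrite table user_table_rc
select id, userid, name from user_table;
```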

a) Table: managed (internal) table (names are case-insensitive)

Create:

create table t1(id string);

create table t2(id string, name string) row format delimited fields terminated by '\t';

Load:

load data local inpath '/root/Downloads/seq100w.txt' into table t1;

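load data inpath '/seq100w.txt' into table t1; -- the file is moved within HDFS into the /hive/t1 directory

(For the same reason, moving a file in HDFS directly into the directory that belongs to the table also makes its data readable.)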

load data local inpath '/root/Downloads/seq100w.txt' overwrite into table t1;
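To see the effect of the load statements above, the following sketch can be run from the Hive CLI. It assumes the warehouse directory is /hive, as the note on the second statement suggests:

```sql
-- List the files that now sit in the table directory
dfs -ls /hive/t1;

-- Count the rows Hive can read from those files
select count(*) from t1;
```

After the overwrite variant, the listing should contain only the newly loaded file, because overwrite replaces whatever was in the table directory.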

b) Partition: partitioned table

Create:

create table t3(id string) partitioned by (province string);

Load:

load data local inpath '/root/Downloads/seq100w.txt' into table t3 partition(province ='beijing');

List all partitions of a table:

hive> show partitions table_name;
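Because each partition is a directory, a filter on the partition column lets Hive read only the matching directory. A minimal sketch using t3 from above; the second partition value 'shanghai' is an assumption added for illustration:

```sql
-- Load a second partition (the same local file is reused for illustration)
load data local inpath '/root/Downloads/seq100w.txt'
into table t3 partition (province = 'shanghai');

-- Only the province=beijing directory needs to be scanned
select count(*) from t3 where province = 'beijing';

-- The partition column can be selected like an ordinary column
select province, count(*) from t3 group by province;
```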

c) Bucket Table: bucketed table

Create: create table t4(id string) clustered by (id) into 4 buckets; -- bucket by id

create table t4(id string) clustered by (id) sorted by (id asc) into 4 buckets; -- sorts the rows in each bucket in ascending order, so joining matching buckets becomes an efficient merge-sort, which further speeds up map-side joins

Enforce bucketing on insert (so that rows are written to the correct bucket files): set hive.enforce.bucketing = true;

Load: insert into table t4 select id from t3 where province='beijing';

Overwrite: insert overwrite table bucket_table select name from stu;

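Sampling queries (the notes assume bucket_table has 4 buckets on id):

select * from bucket_table tablesample(bucket 1 out of 4 on id); -- returns the rows of bucket 1, about a quarter of the table

select * from bucket_table tablesample(bucket 1 out of 2 on id); -- groups the rows into 2 by the hash of id and returns the first group, i.e. 2 of the 4 buckets, about half of the table

select * from bucket_table tablesample(bucket 1 out of 4 on rand()); -- assigns rows to 4 groups at random and returns one group; because rand() is not the bucketing column, the whole table is scanned

When data is loaded into a bucketed table, Hive hashes the bucketing column and takes the hash modulo the number of buckets, then writes the row to the corresponding file. Each bucket therefore holds an effectively random subset of the users.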
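The hash-and-modulo rule can be approximated in a query. This is only a sketch: it assumes that the built-in hash() function agrees with the hash Hive uses when it assigns rows to buckets, and it uses t4 with its 4 buckets from above.

```sql
-- 0-based bucket index each id is expected to map to, with row counts
select pmod(hash(id), 4) as bucket_index, count(*) as row_count
from t4
group by pmod(hash(id), 4);

-- Files the rows were actually written to
-- (INPUT__FILE__NAME is a Hive virtual column)
select INPUT__FILE__NAME, count(*) as row_count
from t4
group by INPUT__FILE__NAME;
```

If bucketing was enforced during the insert, the two result sets should show the same distribution of row counts. Note that tablesample numbers buckets from 1, so bucket_index 0 corresponds to bucket 1.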

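d) External Table: external table

(t5 does not have to live in the warehouse; its storage location can be chosen freely, here the /wlan directory.)

Create: create external table t5(id string) location '/wlan'; -- /wlan is a directory

The EXTERNAL keyword creates an external table. The data is controlled by the external location rather than by Hive; Hive controls only the metadata (that is, the table structure). Hive therefore does not move the data into its warehouse directory; loaded data goes to the external location instead. When you run drop table on the table, the metadata (table structure) is deleted, but the data stays in the external location and is not deleted by Hive.

hive> create external table t1(id string) row format delimited fields terminated by '\t' location '/wlan'; -- declaring the delimiter lets Hive parse the data, so queries do not return NULL (\t is the field separator of the data). /wlan is a directory; ideally give it the same name as the table, which makes it easier to inspect and manage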

create external table hadoop_1(id int,name string) row format delimited fields terminated by '\t' location '/wenjianjia';

load data inpath '/wenjianjia/hello' into table hadoop_1 ;
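The difference between managed and external tables becomes visible when the table is dropped. A sketch using hadoop_1 from above, run from the Hive CLI:

```sql
-- The Table Type field in the output should read EXTERNAL_TABLE
describe formatted hadoop_1;

-- Removes only the metadata
drop table hadoop_1;

-- The data files should still be listed under the external location
dfs -ls /wenjianjia;
```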

2) Copy an existing table structure

-- create new_table with the same structure as user_table

create table new_table like user_table;

3) Rename a table

hive> alter table new_table rename to new_table_1;

4) Create table partitions

Create:

create table t3(id string) partitioned by (province string);

Load:

load data local inpath '/root/Downloads/seq100w.txt' into table t3 partition(province ='beijing');

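2. Deleting tables

1) Clear the data in a table

hadoop fs -rmr /… (simply deletes the data files the table stores in HDFS; on Hadoop 2, -rmr is deprecated in favor of -rm -r)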

Take care not to delete the table's own directory in HDFS by accident.
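Besides deleting files in HDFS, a managed table can also be emptied from within Hive, which avoids touching the table directory by hand. A sketch using t1 and t3 from section 1; TRUNCATE TABLE applies to managed (non-external) tables only:

```sql
-- Remove all rows but keep the table definition
truncate table t1;

-- Remove the rows of a single partition only
truncate table t3 partition (province = 'beijing');
```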

2) Drop a table

drop table test1;

3) Drop a table partition (removes the partition and the data in it)

hive> alter table dm_newuser_active_month drop partition (batch_date="201404");

When dropping a partition, the batch_date value must be enclosed in quotes.
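A sketch of dropping the partition defensively and then confirming the result. IF EXISTS avoids an error when the partition is already gone:

```sql
alter table dm_newuser_active_month
drop if exists partition (batch_date = '201404');

-- The dropped partition should no longer be listed
show partitions dm_newuser_active_month;
```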

3. Modifying table information

1) Add a column to a table

hive> alter table test1 add columns(name string);

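2) Modify one column of a table

Note: change replaces the named column of the existing table. It modifies the table schema, not the data.

alter table table_name change old_column_name new_column_name new_type comment 'comment text';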
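A concrete instance of the template above, using test1 and the name column added in step 1). The new column name user_name and the comment texts are illustrative assumptions:

```sql
-- Rename column name to user_name, keeping the string type
alter table test1 change name user_name string comment 'user name';

-- Change only the comment by repeating the column name and type
alter table test1 change user_name user_name string comment 'display name of the user';

-- Check the result
desc test1;
```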

3) Replace all columns of a table

Note: replace replaces all columns of the existing table. It modifies the table schema, not the data.

alter table table_name replace columns(age int comment 'only keep the first column');
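Since replace columns rewrites the whole column list, every column that should survive has to be listed again. A sketch, assuming a hypothetical table people that is not part of the original:

```sql
-- Hypothetical table used only for this illustration
create table people (age int, name string, city string);

-- Keep age and name, and drop city from the schema.
-- The data files themselves are left untouched.
alter table people replace columns (
  age int comment 'age in years',
  name string
);

desc people;
```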

4) Add a table partition

hive> alter table ods_smail_mx_201404 add partition (day=20140401); -- adds a single partition on its own

create table user_table_2(

id int,

name string

)

comment 'user information table'

partitioned by(dt string)

stored as textfile;

insert overwrite table user_table_2

partition(dt='2015-11-01')

select id, col2 name

from table_4;
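After the insert, the new partition can be checked with a sketch like the following. It uses the names from the example above; the second date is an assumption for illustration:

```sql
-- The partition dt=2015-11-01 should be listed
show partitions user_table_2;

-- Read a few rows back from that partition
select id, name from user_table_2 where dt = '2015-11-01' limit 10;

-- An empty partition can also be registered ahead of time
alter table user_table_2 add if not exists partition (dt = '2015-11-02');
```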

4. Viewing tables

1) Show the create-table statement

show create table tmp_jzl_20150310_diff;

2) Show the table structure

desc tmp_jzl_20150310_diff;
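Plain desc lists the columns with their types and comments. For more detail, Hive also accepts the formatted and extended variants, sketched here on the same table:

```sql
-- Adds location, table type, storage format and table parameters in a readable layout
describe formatted tmp_jzl_20150310_diff;

-- Similar information printed as one serialized line
describe extended tmp_jzl_20150310_diff;
```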

3) Show the table's partitions

show partitions tmp_jzl_20150310_diff;

4) List the table names in a database

hive> use tmp;

List all tables in the tmp database:

hive> show tables;

List the tables in tmp whose names start with tmp_jzl_20150504:

hive> show tables 'tmp_jzl_20150504*';

tmp_jzl_20150504_1

tmp_jzl_20150504_2

tmp_jzl_20150504_3

tmp_jzl_20150504_4
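The pattern accepts * for any run of characters and | to separate alternatives. A sketch; that tables with the second prefix also live in the tmp database is an assumption:

```sql
-- Tables matching either prefix
show tables 'tmp_jzl_20150504*|tmp_jzl_20150310*';

-- The same kind of listing without switching databases first
show tables in tmp 'tmp_jzl_20150504*';
```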

Source: http://my.oschina.net/repine/blog/552428


Article URL: http://www.baiduhome.net/lib/view/open1451371883339.html
