Last updated: the evening of July 23, 2021

Hive

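Introduction

Things to note

In Hive, every database corresponds to a directory on HDFS.
Hive does not enforce primary keys, so a table definition does not need one.
The field delimiter is specified when the table is created. Data files that are already loaded are not rewritten if the table definition changes later, so treat the delimiter as fixed once the table holds data.
When inserting data, insert into appends rows, while insert overwrite clears the existing data and then writes the new rows.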

Basic operations

SQL	Explanation
show databases;	List all databases
create database hivedemo;	Create a database
drop database demo;	Drop a database
use hivedemo;	Switch to a database
create table person (id int, name string, age int, gender string);	Create the person table
insert into person values(1, 'Sam', 19, 'male');	Insert a row
select * from person;	Query the data
load data local inpath '/opt/hivedemo/person' into table person;	Load data from a local file
drop table person;	Drop the table
create table person (id int, name string, age int, gender string) row format delimited fields terminated by ' ';	Specify the field delimiter when creating the table
desc person;	Describe the table structure
create table p2 like person;	Create table p2 with the same structure as person
insert into table p2 select * from person where age >= 18;	Copy the rows of person with age >= 18 into p2
from person insert overwrite table p2 select * where gender = 'male' insert into table p3 select * where age < 18;	Write the male rows of person into p2 and, in the same pass, the rows with age < 18 into p3
insert overwrite local directory '/opt/hivedata' row format delimited fields terminated by '\t' select * from person where age >= 18;	Export the rows of person with age >= 18 to a local directory
insert overwrite directory '/person' row format delimited fields terminated by ',' select * from person where gender = 'female';	Export the female rows of person to the /person directory on HDFS
alter table person rename to p1;	Rename the table
alter table p1 add columns(height double);	Add a column to an existing table
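
The difference between insert into and insert overwrite is easiest to see by running both against the same target. The following is a minimal sketch, assuming the person and p2 tables from the table above exist, person already holds data, and person has not yet been renamed to p1.

```sql
-- Sketch only: person and p2 are the tables from the list above.
insert into table p2 select * from person where age >= 18;       -- appends the matching rows
insert into table p2 select * from person where age >= 18;       -- appends them again, so p2 now holds duplicates
select count(*) from p2;                                          -- twice the number of matching rows (if p2 started empty)

insert overwrite table p2 select * from person where age >= 18;  -- clears p2 first, then writes the result
select count(*) from p2;                                          -- back to a single copy of the matching rows
```

Rerunning a job with insert overwrite gives the same result each time, which is why it is often the safer choice for scheduled loads.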

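Data types

Overview

Hive provides a fairly rich set of data types, which fall into two broad groups: primitive types and complex types.
Primitive types

Hive type	Java type
tinyint	byte
smallint	short
int	int
bigint	long
float	float
double	double
boolean	boolean
string	String
timestamp	Timestamp
binary	byte[]

Complex types: array, map, struct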

Complex types

array: an array type, corresponding to a Java array or collection
Raw data

1 lucy,lily david,evan
2 adair,bruce,lee simon,tony,tom,rose
3 bob,alex,cindy frank,fred
4 henry,william kite,job,thomas

Create the table

create table battles (id int, groupa array<string>, groupb array<string>) row format delimited fields terminated by ' ' collection items terminated by ',';

Load the data

load data local inpath '/opt/hivedemo/battles' into table battles;

Query an element and skip the rows where it is null (indexes start at 0, so groupa[2] is the third element)

select groupa[2] from battles where groupa[2] is not null;
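
Besides indexing, Hive's built-in collection functions can inspect an array column. The sketch below runs against the battles table above and uses size, array_contains and explode; the aliases groupa_size, t and member are made up for illustration.

```sql
-- Sketch only: assumes the battles table defined and loaded above.
select id, size(groupa) as groupa_size from battles;           -- number of elements in each groupa array
select id from battles where array_contains(groupb, 'tony');   -- rows whose groupb contains 'tony'

-- Turn each array element into its own row.
select id, member
from battles
lateral view explode(groupa) t as member;
```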

map: a mapping type, corresponding to Java's Map
Raw data

1 tom,15 sam,17
2 lily,16 lucy,16
3 david,14 danny,15
4 frank,19 fred,19
5 henry,17 hack,18

Create the table

create table groups (groupid int, membera map<string,int>, memberb map<string,int>) row format delimited fields terminated by ' ' map keys terminated by ',';

Load the data

load data local inpath '/opt/hivedemo/groups' into table groups;

Query the data

select membera['frank'] from groups where membera['frank'] is not null;
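
When the keys are not known in advance, the built-in functions map_keys, map_values and size can be used instead of a literal key lookup. A small sketch against the groups table above; the column aliases are illustrative only.

```sql
-- Sketch only: assumes the groups table defined and loaded above.
select groupid,
       map_keys(membera)   as a_names,   -- array of the keys in membera
       map_values(membera) as a_ages,    -- array of the values in membera
       size(memberb)       as b_count    -- number of entries in memberb
from groups;
```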

struct: a structure type, corresponding to a Java object
Raw data

1 tom,19,male,182.5,68.7
2 tony,18,male,181.3,70.2
3 thomas,18,male,183.6,79.1
4 vincent,17,female,165.9,50.1

Create the table

create table infos (id int, info struct<name:string, age:int, gender:string, height:double, weight:double>) row format delimited fields terminated by ' ' collection items terminated by ',';

Load the data

load data local inpath '/opt/hivedemo/infos' into table infos;

Query the data

select info.age from infos where info.name = 'vincent';

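Table structure

Managed (internal) tables and external tables

A table that is created in Hive and then populated through Hive (with insert or load) is called a managed (internal) table.
A table that is created in Hive to manage data that already exists on HDFS is called an external table.
Creating an external table

`create external table orders (orderid int, orderdate string, productid int, num int) row format delimited fields terminated by ' ' location '/orders';`

The following commands show whether a table is managed or external

desc extended p1;
-- or
desc formatted p1;

If the Table Type property is MANAGED_TABLE, the table is a managed table; if it is EXTERNAL_TABLE, the table is an external table.

When a managed table is dropped, its directory on HDFS is deleted along with it; when an external table is dropped, its directory on HDFS is left in place.
In production, external tables are typically used for the early collection and management of data, while managed tables are used most of the time for later processing and analysis.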
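
The drop behaviour of an external table can be checked directly. This is a minimal sketch, assuming the orders table above exists with data files under /orders and that the statements are run from the Hive CLI, where the dfs command is available.

```sql
-- Sketch only: assumes the external table orders from above.
drop table orders;      -- removes the table definition from the metastore
dfs -ls /orders;        -- the data files should still be listed on HDFS

-- Re-attach the same files by declaring the external table again.
create external table orders (orderid int, orderdate string, productid int, num int)
row format delimited fields terminated by ' '
location '/orders';
select * from orders limit 5;
```

Running the same drop against a managed table would remove its warehouse directory as well, so there would be nothing left to re-attach.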

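Partitioned tables

A partitioned table groups data into categories.
Creating a partitioned table

`create table cities (id int, name string) partitioned by (province string) row format delimited fields terminated by ' ';`

Load the data

`load data local inpath '/opt/hivedemo/hebei' into table cities partition(province = 'hebei');`
`load data local inpath '/opt/hivedemo/henan' into table cities partition(province = 'henan');`

In Hive, every partition becomes a separate directory on HDFS.
When a query filters on the partition column, a partitioned table is faster than an unpartitioned one because only the matching directories are read; when a query spans partitions, the unpartitioned table can be the faster of the two.
Add a partition manually

`alter table cities add partition(province = 'guangdong') location '/user/hive/warehouse/hivedemo.db/cities/province=guangdong';`

Repair the table (registers partition directories that exist on HDFS but are missing from the metastore)

`msck repair table cities; -- this command may fail`

Rename a partition

`alter table cities partition(province = 'shanxi') rename to partition(province = 'test');`

Drop a partition

`alter table cities drop partition(province = 'test');`

Hive requires that the partition column is not present in the raw data: it is not declared as a regular column, and its value comes from the partition directory rather than from the data files.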
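
To see how partitions affect queries, the sketch below lists the partitions of cities and contrasts a filtered query with one that spans partitions. It assumes the hebei and henan partitions were loaded as shown above.

```sql
-- Sketch only: assumes the cities table and the partitions loaded above.
show partitions cities;                          -- one line per partition, in the form province=hebei

select * from cities where province = 'hebei';   -- reads only the province=hebei directory

select province, count(*)                        -- no partition filter: every partition is read
from cities
group by province;
```

Although province is not stored in the data files, it can be selected and filtered like any other column.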
Dynamic partitioning
Raw data

1 hebei 邢台
2 hebei 承德
3 shanxi 太原
4 shanxi 大同
5 liaoning 沈阳
6 liaoning 大连
7 jilin 长春
8 liaoning 鞍山
9 shanxi 阳泉
10 liaoning 抚顺

Create a staging table in Hive to hold the raw data

create table cities_tmp (tid int, tprovince string, tname string) row format delimited fields terminated by ' ';

Load the data into the staging table

load data local inpath '/opt/hivedemo/cities' into table cities_tmp;

Turn off strict mode for dynamic partitioning

set hive.exec.dynamic.partition.mode=nonstrict;

Select from the unpartitioned table and insert into the partitioned table (the partition column must come last in the select list)

insert into table cities partition(province) select tid, tname, tprovince from cities_tmp distribute by tprovince;

Hive supports partitioning on multiple columns. The directory for an earlier partition column contains the directories for the next one, which produces a multi-level directory tree. In practice, multi-column partitioning is used for multi-level classification, for example grade and class, or province, city and county.
Raw data

1 1 1 tom
1 1 2 sam
1 1 3 bob
1 1 4 alex
1 2 1 bruce
1 2 2 cindy
1 2 3 jack
1 2 4 john
2 1 1 tex
2 1 2 helen
2 1 3 charles
2 1 4 frank
2 2 1 david
2 2 2 simon
2 2 3 lucy
2 2 4 lily

Create a table in Hive to hold the data

create table students_tmp (tgrade int, tclass int, tid int, tname string) row format delimited fields terminated by ' ';

Load the data

load data local inpath '/opt/hivedemo/students' into table students_tmp;

Sample the data to confirm that it loaded correctly

select * from students_tmp tablesample(5 rows);

Create the partitioned table

create table students (id int, name string) partitioned by (grade int, class int) row format delimited fields terminated by '\t';

Turn off strict mode for dynamic partitioning

set hive.exec.dynamic.partition.mode=nonstrict;

Dynamic partitioning, this time on multiple columns

insert into table students partition(grade, class) select tid, tname, tgrade, tclass from students_tmp distribute by tgrade, tclass;
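
Once the dynamic insert has finished, the resulting directory levels can be inspected and queried. A small sketch, assuming the students table was populated by the statement above.

```sql
-- Sketch only: assumes students was filled by the dynamic-partition insert above.
show partitions students;                                      -- entries have the form grade=1/class=1

select id, name from students where grade = 1 and class = 2;   -- reads a single leaf directory
select id, name from students where grade = 2;                 -- reads every class directory under grade=2
```

Filtering on the leading partition column alone still narrows the scan, because the class directories are nested inside each grade directory.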

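Bucketed tables

A bucketed table is used here to sample data.
The more buckets the data is split into, the more memory the job uses at execution time.
Bucketing is not enforced by default and has to be switched on. (In Hive 2.0 and later this property was removed and bucketing is always enforced, so the setting is only needed on older versions.)

`set hive.enforce.bucketing = true;`

Create a bucketed table

`create table cities_bucket (id int, name string) clustered by (name) into 3 buckets row format delimited fields terminated by '\t';`

Add data to the bucketed table. Note that data should be added with insert rather than load, because the rows have to be hashed into buckets as they are written.

`insert overwrite table cities_bucket select id, name from cities;`

Sample the data in the buckets

`select * from cities_bucket tablesample(bucket 1 out of 2 on name);`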
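
In tablesample(bucket x out of y on col), rows are hashed on col into y groups and group x is returned, so changing y changes the fraction sampled. The sketch below varies x and y against cities_bucket; the fractions in the comments are approximate and depend on how evenly the names hash.

```sql
-- Sketch only: assumes cities_bucket (3 buckets on name) was filled as shown above.
select * from cities_bucket tablesample(bucket 1 out of 3 on name);   -- about one third of the rows
select * from cities_bucket tablesample(bucket 2 out of 3 on name);   -- a different third
select * from cities_bucket tablesample(bucket 1 out of 6 on name);   -- about one sixth of the rows
```

Sampling on the same column the table is clustered by lets Hive use the existing bucket files instead of rehashing the whole table.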

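Functions

Overview

Hive is built for data analysis, so it ships with a very rich set of functions. Use

`show functions;`

to list all the functions in Hive.

Use

`desc function xxx;`

to describe how a given function is used.

In Hive, a function cannot be run on its own; it has to appear inside a statement such as select.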
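
As a concrete illustration of the commands above, the sketch below looks up one built-in function and then calls a few inside select statements. It assumes a person table like the one in the basic operations section exists; the alias label is made up for illustration.

```sql
-- Sketch only: assumes a person table (id, name, age, gender) as created earlier.
desc function upper;             -- short description of upper
desc function extended upper;    -- description plus a usage example

select upper(name), length(name) from person;             -- functions applied per row
select concat(name, ':', gender) as label from person;    -- combining columns with a built-in
select count(*), avg(age) from person;                    -- aggregate functions over the whole table
```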


Unless otherwise stated, all posts on this blog are licensed under CC BY-SA 4.0. Please credit the source when republishing!
