load使用:mysql创建表元数据,通过load命令加载本地文件,
load data local inpath '/home/hadoop/emp.txt' into table emp_1;
load data local inpath 文件路径 [overwrite] into table 表名
hive> create table emp_1 > (empno int, > ename string, > job string, > mgr int, > hirdate string, > sal double, > comm double, > deptno int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';OKTime taken: 0.157 seconds
hive> load data local inpath '/home/hadoop/emp.txt' overwrite into table emp_1; Loading data to table default.emp_1Table default.emp_1 stats: [numFiles=1, numRows=0, totalSize=700, rawDataSize=0]OKTime taken: 0.444 seconds
hive> select * from emp_1;OK7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 207499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 307521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 307566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 207654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 307698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 307782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 107788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 207839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 107844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 307876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 207900 JAMES CLERK 7698 1981-12-3 950.0 NULL 307902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 207934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 108888 HIVE PROGRAM 7839 1988-1-23 10300.0 NULL NULLTime taken: 0.125 seconds, Fetched: 15 row(s)
yarn查看任务进度:
http://192.168.83.11:8088/cluster,
要指定分隔符,所以创建表的时候要指定分隔符(默认分隔符是\001 ^A); 空格、制表符(\t)
使用load加载hdfs文件到表中;
hadoop上传文件emp1.txt
hadoop fs -put /home/hadoop/emp1.txt /
create table emp1 as select * from emp1 where 1>1;
load data inpath '/emp1.txt' into table emp_2;
这里需要注意下,采用如上方法并没有创建分隔符语法。
hive> show create table emp_2;OKCREATE TABLE `emp_2`( `empno` int, `ename` string, `job` string, `mgr` int, `hirdate` string, `sal` double, `comm` double, `deptno` int)ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'LOCATION 'hdfs://hd1:9000/user/hive/warehouse/emp_2'TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='false', 'numFiles'='2', 'numRows'='-1', 'rawDataSize'='-1', 'totalSize'='700', 'transient_lastDdlTime'='1540480415')Time taken: 0.138 seconds, Fetched: 24 row(s)
load data inpath '/home/hadoop/emp1.txt' into table emp1;
创建外部表:
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
create external table exter_emp
(empno int, ename string, job string, mgr int, hirdate string, sal double, comm double, deptno int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/external/';location '/external' 为hdfs路径,把文件丢在里面,表会按照分隔符自动加载数据。如果 '/external/'下面有多个文件,那么外部表 exter-emp都会把数据加载进来。
当然外部表也可以加载本地数据, load data local inpath '/home/hadoop/emp1.txt' into table exter_emp;
分区表:
Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.
create table part_emp(
empno int, ename string, job string, hirdate string, sal double, comm double, deptno int) partitioned by(mgr int) row format delimited fields terminated by '\t';load data local inpath '/home/hadoop/emp.txt' into table part_emp PARTITION (mgr=10,mgr=20,mgr=30);
需要注意一点:分区表的分区字段不能出现在表字段里面。如果通过load加载数据就需要把txt文档中的mgr这个列的数据删除,不然加载数据会错位,导致txt数据跟分区表数据不一致。
使用insert插入数据
Usage:INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 [IF NOT EXISTS]] select_statement1 FROM from_statement;
注意:select跟insert列必须一一对应
hive> insert overwrite table t1 select * from emp_1;
hive> insert overwrite table t2 partition(mgr=10) select empno,ename,job,hirdate,sal,comm,deptno from part_emp;
注意:t2也必须为分区表且select列跟分区表T2列数量一致。
多重插入:
hive> create table emp2 (j string,s double,c double);
OK Time taken: 0.114 seconds hive> from emp_1 > insert into emp1 select empno,ename > insert into emp2 select job,sal,comm ; Query ID = hadoop_20181026015757_f2abf70b-7249-44e5-895d-ad22eddfcd7a Total jobs = 5 Launching Job 1 out of 5 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_1540475767311_0010, Tracking URL = http://hd1:8088/proxy/application_1540475767311_0010/ Kill Command = /home/hadoop/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1540475767311_0010 Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 0 2018-10-26 02:02:47,297 Stage-2 map = 0%, reduce = 0% 2018-10-26 02:02:54,835 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.51 sec MapReduce Total cumulative CPU time: 1 seconds 510 msec Ended Job = job_1540475767311_0010 Stage-5 is selected by condition resolver. Stage-4 is filtered out by condition resolver. Stage-6 is filtered out by condition resolver. Stage-11 is selected by condition resolver. Stage-10 is filtered out by condition resolver. Stage-12 is filtered out by condition resolver. Moving data to: hdfs://hd1:9000/user/hive/warehouse/emp1/.hive-staging_hive_2018-10-26_02-02-34_121_1621965597337759090-1/-ext-10000 Moving data to: hdfs://hd1:9000/user/hive/warehouse/emp2/.hive-staging_hive_2018-10-26_02-02-34_121_1621965597337759090-1/-ext-10002 Loading data to table default.emp1 Loading data to table default.emp2 Table default.emp1 stats: [numFiles=1, numRows=0, totalSize=164, rawDataSize=0] Table default.emp2 stats: [numFiles=1, numRows=0, totalSize=278, rawDataSize=0] MapReduce Jobs Launched: Stage-Stage-2: Map: 1 Cumulative CPU: 1.51 sec HDFS Read: 4567 HDFS Write: 580 SUCCESS Total MapReduce CPU Time Spent: 1 seconds 510 msec OK Time taken: 22.873 seconds
hive的基本操作
create database if not exists 库名; #创建库
alter database dbname set dbproperties('edited-by'='joe'); #修改库(不能删除或“重置”数据库属性) describe database extended dbname; #查询库 drop database [if exists] dbname; #删除库 desc database extended 库; #显示库的扩展信息hive>create external table dat0204(filmname string ,filmdate date ,filmscore string)
>comment '别名' >row format delimited >fields terminated by '\t' >lines terminated by '\n' >stored as textfile; #创建外部表,外部表相对来说更加安全些,数据组织也更加灵活,方便共享源数据。hive>create table if not exists dat0204(id int , name string , age int)
>comment '别名' >row format delimited >fields terminated by '\t' >lines terminated by '\n' >stored as textfile; #创建内部表
desc 表; #表的描述
desc formatted 表; #查询表的结构 desc extended 表; #显示表的扩展信息 select * from 表; #查询表的信息 create table 库名1.表名1 like 库名2.表名2; #复制表(表结构+数据)alter table hive1.test2 add partition(province='hebei',city='baoding') #添加分区
show partitions hive1; #查看表的分区
insert overwrite table test2 partition(provice='hebei',city='shijiazhuang') select id , name , age from test1; #增加数据
drop table 表; #删除空表
drop table 表 cascade; #删除非空表show tables like '*name*'; #模糊搜索表
插入数据(加载到HDFS)
hive>load data local inpath 'path/filename' overwrite into table 表名; #从本地数据导入Hive表
hive>load data inpath 'path/filename' into table 表名; #HDFS上导入数据到Hive表
hive> insert overwrite directory "hodoop目录" select user, login_time from user_login; #将查询数据输出hdfs目录(hdfs目录不存在)
$ hive -e "sql语句" > /tmp/out.txt #保存sql语句查询信息到本地文件
hive命令模型
hive>dfs -lsr / //显示dfs下文件:路径/库/表/文件
hive>dfs -rmr /目录 //dfs命令,删除目录 hive>!clear ; //hive中执行shell命令 hive>!dfs -lsr / ; //hive中执行hdfs命令元数据都储存在mysql
use hive用户库;
select * from VERSION; #查看hive版本 select * from TBLS \G; #查看有哪些表,易区分各表。 select * from SDS \G; #查看表对应的hdfs目录的metedata select * from PARTITIONS where TBL_ID=1 \G; #查看某个表的partitions: select * from COLUMNS_V2; #查看某个表的列: select * from PARTITION_KEYS; #查看某个表的partition select * from DBS; #查看数据仓库信息
调优
1.explain———解释执行计划 explain select sum(*) from test2 ;
hive的起源与应用
1、起源:由Facebook开源用于解决海量结构化日志的数据统计;
2、结构:Hive是基于hadoop的一个数据仓库工具,可以将结构化的数据文件映射成一张表,并提供类SQL查询功能; (使用HQL作为查询接口;使用HDFS存储;使用MapReduce计算;)本质是:将HQL转化成MapReduce程序。 3、应用:适合离线数据处理。 4、schema(模式,元信息存放到数据库中) 5、数据库和表都是路径。
6、hive在写操作是不校验,读时校验。
在hive里面创建了这么多表,前面有提到过 hive的元数据都是存储在mysql中,那么hive中的元数据在哪里呢?
use hive用户库;
select * from VERSION; #查看hive版本
select * from TBLS \G; #查看有哪些表,易区分各表。 select * from SDS \G; #查看表对应的hdfs目录的metedata select * from PARTITIONS where TBL_ID=1 \G; #查看某个表的partitions: select * from COLUMNS_V2; #查看某个表的列: select * from PARTITION_KEYS; #查看某个表的partition select * from DBS; #查看数据仓库信息调优
1.explain———解释执行计划 explain select sum(*) from test2 ;mysql> select * from TBLS;+--------+-------------+-------+------------------+--------+-----------+-------+-----------+----------------+--------------------+--------------------+| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |+--------+-------------+-------+------------------+--------+-----------+-------+-----------+----------------+--------------------+--------------------+| 1 | 1540322901 | 1 | 0 | hadoop | 0 | 1 | test | MANAGED_TABLE | NULL | NULL || 6 | 1540378940 | 6 | 0 | hadoop | 0 | 6 | t1 | MANAGED_TABLE | NULL | NULL || 13 | 1540477772 | 1 | 0 | hadoop | 0 | 13 | emp_1 | MANAGED_TABLE | NULL | NULL || 16 | 1540481399 | 1 | 0 | hadoop | 0 | 16 | emp_2 | MANAGED_TABLE | NULL | NULL || 17 | 1540481750 | 1 | 0 | hadoop | 0 | 17 | emp_3 | MANAGED_TABLE | NULL | NULL || 21 | 1540482372 | 1 | 0 | hadoop | 0 | 21 | exter_emp | EXTERNAL_TABLE | NULL | NULL || 26 | 1540484035 | 1 | 0 | hadoop | 0 | 26 | part_emp | MANAGED_TABLE | NULL | NULL |+--------+-------------+-------+------------------+--------+-----------+-------+-----------+----------------+--------------------+--------------------+7 rows in set (0.00 sec)
都在HIVE.TBLS表中。
desc 表; #表的描述
desc formatted 表; #查询表的结构 desc extended 表; #显示表的扩展信息 select * from 表; #查询表的信息修改表字段:
hive> alter table test change id id string;
OK Time taken: 0.388 seconds hive> desc test; OK id string Time taken: 0.179 seconds, Fetched: 1 row(s)like 创建表并查看表结构:
hive> create table t1 like emp_1;OKTime taken: 0.146 secondshive> desc formatted t1l;FAILED: SemanticException [Error 10001]: Table not found t1lhive> desc formatted emp_1;OK# col_name data_type comment empno int ename string job string mgr int hirdate string sal double comm double deptno int # Detailed Table Information Database: default Owner: hadoop CreateTime: Thu Oct 25 22:29:32 CST 2018 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://hd1:9000/user/hive/warehouse/emp_1 Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATE true numFiles 1 numRows 0 rawDataSize 0 totalSize 700 transient_lastDdlTime 1540477792 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets: -1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: field.delim \t serialization.format \t Time taken: 0.214 seconds, Fetched: 39 row(s)