Hive Data Storage Formats - 白红宇

Published: 2019-06-09

1. The default storage format is plain text

　　stored as textfile;
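
A quick way to confirm the default, sketched under the assumption of a stock configuration: `hive.default.fileformat` is the setting Hive consults when a `create table` statement has no `stored as` clause. The table name `t_default` and its columns are hypothetical.

```sql
-- Print the current default file format (TextFile unless it has been changed).
SET hive.default.fileformat;

-- Hypothetical table: no STORED AS clause, so the default format applies.
CREATE TABLE t_default (id INT, name STRING);

-- The InputFormat line in the output shows which format was actually used.
DESCRIBE FORMATTED t_default;
```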

2. Binary storage formats

    SequenceFile, Avro, Parquet, RCFile, and ORC.
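
Each of these formats is selected with its own `stored as` keyword. A minimal sketch: the table names and the single column are hypothetical, and `stored as avro` assumes Hive 0.14 or later (`stored as parquet` assumes 0.13 or later).

```sql
-- One hypothetical table per binary format listed above.
CREATE TABLE t_seq     (id INT) STORED AS SEQUENCEFILE;
CREATE TABLE t_avro    (id INT) STORED AS AVRO;
CREATE TABLE t_parquet (id INT) STORED AS PARQUET;
CREATE TABLE t_rc      (id INT) STORED AS RCFILE;
CREATE TABLE t_orc     (id INT) STORED AS ORC;
```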

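3. Converting to Parquet

    hive> create table hive.stocks_parquet stored as parquet as select * from stocks;

        Note: the source table stocks (about 400,000 rows) occupies 21 MB. After conversion to Parquet, the data files on HDFS total 6 MB, a compression ratio of roughly 3.5x.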

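4. Converting to RCFile

    hive> create table hive.stocks_rcfile stored as rcfile as select * from stocks;

        Note: the source table stocks (about 400,000 rows) occupies 21 MB. After conversion to RCFile, the data files on HDFS total 16 MB, a compression ratio of only about 1.3x (the output is roughly 0.76 times the original size).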

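5. Converting to ORC

    hive> create table hive.stocks_orcfile stored as orcfile as select * from stocks;

        Note: the source table stocks (about 400,000 rows) occupies 21 MB. After conversion to ORC, the data files on HDFS total 5 MB, a compression ratio of roughly 4x.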
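
The sizes quoted above can be checked from inside the Hive CLI. This is only a sketch: it assumes the tables live under the default warehouse directory `/user/hive/warehouse` and that the `hive` database maps to `hive.db` there, so adjust the paths to your installation. The final statement reproduces the ratio arithmetic (21/6 = 3.5, 21/16 ≈ 1.31, 21/5 = 4.2) and assumes a Hive version that allows `select` without a `from` clause (0.13 or later).

```sql
-- Summarized on-disk size of each converted table (paths are assumptions).
dfs -du -s -h /user/hive/warehouse/hive.db/stocks_parquet;
dfs -du -s -h /user/hive/warehouse/hive.db/stocks_rcfile;
dfs -du -s -h /user/hive/warehouse/hive.db/stocks_orcfile;

-- Compression ratio = original size / converted size, both in MB.
SELECT round(21 / 6, 2)  AS parquet_ratio,
       round(21 / 16, 2) AS rcfile_ratio,
       round(21 / 5, 2)  AS orc_ratio;
```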

6. Testing execution time

    hive> select count(*) from stocks;

        Execution time (exec/fetch): 0.227/1.580 sec

    hive> select count(*) from hive.stocks_parquet;

        Execution time (exec/fetch): 0.144/2.846 sec

    hive> select count(*) from hive.stocks_rcfile;

        Execution time (exec/fetch): 0.114/1.238 sec

    hive> select count(*) from hive.stocks_orcfile;

        Execution time (exec/fetch): 0.129/2.027 sec
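
When repeating such a test, it may help to make sure the query really scans the data. A hedged sketch: `hive.compute.query.using.stats` is the Hive setting that lets simple aggregates such as `count(*)` be answered from table statistics, and the column name `price_close` is hypothetical because the original does not list the schema of stocks.

```sql
-- Force count(*) to scan the data instead of being answered from table statistics.
SET hive.compute.query.using.stats=false;

SELECT count(*) FROM hive.stocks_orcfile;

-- Hypothetical column: aggregating a single column lets a columnar format
-- read only that column rather than whole rows.
SELECT max(price_close) FROM hive.stocks_orcfile;
```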

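User-defined functions (UDF)

    1. Create a Java class that extends the UDF class (org.apache.hadoop.hive.ql.exec.UDF).

    2. Implement the evaluate() method. The UDF base class does not declare it; Hive looks it up by reflection, so it can be overloaded for different argument types.

    3. Package the class into a jar.

    4. Load the jar that contains the function and register it:

        hive> add jar /home/hyxy/XXX.jar;

        hive> create temporary function {function_name} as 'com.hyxy.hive.udf.xxx';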

    5. Types of user-defined functions

        a. UDF: one row in --> one row out

        b. UDAF: many rows in --> one row out

        c. UDTF: one row in --> many rows out
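
Steps 4 and 5 can be sketched as a short session. The function name `my_func` and the column `symbol` are hypothetical; the jar path and class name are the placeholders used in step 4. The three built-in calls are standard Hive functions, shown here only to illustrate the three categories.

```sql
-- Register the jar and bind a temporary (session-scoped) function name to the class.
ADD JAR /home/hyxy/XXX.jar;
CREATE TEMPORARY FUNCTION my_func AS 'com.hyxy.hive.udf.xxx';

-- Call it like any built-in function (symbol is a hypothetical column).
SELECT my_func(symbol) FROM stocks LIMIT 5;

-- Built-in examples of the three categories:
SELECT upper('hive');               -- UDF:  one row in, one row out
SELECT count(*) FROM stocks;        -- UDAF: many rows in, one row out
SELECT explode(array(1, 2, 3));     -- UDTF: one row in, many rows out

-- Remove the temporary function when it is no longer needed.
DROP TEMPORARY FUNCTION IF EXISTS my_func;
```

A temporary function exists only for the current session, so the `add jar` and `create temporary function` statements have to be repeated in every new session.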

Reposted from: https://www.cnblogs.com/lyr999736/p/9474005.html
