How to Build Optimal Hive Tables Using ORC, Partitions and Metastore Statistics

Landon Robinson · Published in Hadoopsters · Dec 19, 2017

Creating Hive tables is a common task for all of us who use Hadoop. It lets us mix and merge datasets into unique, customized tables. And there are many ways to do it.

We have some recommended tips for Hive table creation that can increase your query speeds and shrink the storage footprint of your tables. And it’s simpler than you might think.

For analysts and data scientists, implementing these methods means you can spend more time learning from the data than learning how to use it, or waiting for queries to return. For engineers and developers, this means you can lower your block count and file sizes, and make your analysts and data scientists happy at the same time.

In a nutshell, the steps are:

  • Create Your Tables in ORC Format
  • Partition Your Tables
  • Analyze Your Tables When You Make Changes To Them
  • Use ORC, Partitioning and Analyzing for a Powerful Combo

Quick Primer on the ORC Format

ORC is a file format designed for use with Hive, Hadoop and Spark. Its features are listed below; in a nutshell, they’re what make it such an efficient format for querying with SQL/HQL.

  • columnar-style storage (beneficial to the big data querying lifestyle): columns are stored separately from each other, so Hive can skip reading, decompressing and processing data that the query never touches
  • indexing: lightweight indexes are built on the stripes of a file, recording where in the file certain rows (and their value ranges) live
  • stripes: a stripe is a chunk of data within an ORC file, tracked in the file’s index; queries can read only the necessary stripe(s) instead of scanning the entire data file
  • reduced file sizes (super beneficial to Hadoop clusters and managing block counts): this not only solves the problem of space for your data, it also reduces the number of files and blocks the Hadoop NameNode has to keep track of and index
  • network-friendly compression: ZLIB is the more intense compression algorithm, producing the smallest files for tremendously large datasets (at the cost of extra CPU to compress and decompress), while SNAPPY splits the difference: its files are a bit larger than ZLIB’s but still far smaller than text, and they compress and decompress quickly, which keeps network transfers (shuffling among nodes) fast
  • predicate pushdown: uses the indexes to determine which stripes (chunks) of a file to read for a particular query. This can narrow a scan down to row groups of 10,000 rows, and keeps Hive from reading entire files (see the settings sketch after this list)
  • support for complex types: structs, lists, maps, and unions are all supported in ORC
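To actually benefit from predicate pushdown, it’s worth confirming these Hive settings are enabled (a quick sketch; these are stock Hive properties, and their defaults vary by version):

set hive.optimize.ppd=true;          -- push filter predicates down toward the data
set hive.optimize.index.filter=true; -- let ORC use its indexes to skip stripes/row groups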

Anyway, onto the list!

Create Your Table in ORC Format

This is the standard way of creating a basic Hive table. The only difference? Instead of using the default storage format of TEXT, this table uses ORC, a columnar file format in Hive/Hadoop that uses compression, indexing, and separated-column storage to optimize your Hive queries and data storage. Once the table is created, data can be freely inserted into it, and it will be converted to ORC on the fly!

CREATE TABLE my_database.my_table
(
column_1 string,
column_2 int,
column_3 double
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY'); -- ensure SNAPPY is uppercase; a lowercase value triggers a nasty bug in older Hive versions (since fixed)
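For example, loading it from an existing text-format table is a plain INSERT … SELECT, and Hive writes the rows out as compressed ORC. (my_text_table is a hypothetical source table here; any SELECT will do.)

INSERT INTO TABLE my_database.my_table
SELECT column_1, column_2, column_3
FROM my_database.my_text_table; -- hypothetical text-format source; rows land as ORC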

You can even do it through a convenient CTAS (CREATE TABLE AS SELECT) statement. Note that CTAS will not work if you want to create a partitioned, external, or list-bucketing table. Use the CTAS method for simple datasets like lookups, or quick-and-dirty temporary tables, not enormous ones you plan to keep (though even small tables gain from ORC’s benefits).

CREATE TABLE my_database.my_table
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY')
AS SELECT * FROM my_database.my_other_table
WHERE year=2017 AND month=11 AND day=30;

Partition Your Table

Partitioning your tables is a fantastic way to improve the processing times of queries on your table. It works through partition pruning: each partition is stored as its own directory, so a query that filters on the partition columns reads only the matching directories and skips the rest. Logically partitioned data also assists the job planner in executing tasks in parallel.

CREATE TABLE my_database.my_table
(
column_1 string,
column_2 int,
column_3 double
)
PARTITIONED BY
(
year int,
month smallint,
day smallint,
hour smallint
);
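With that in place, a query that filters on the partition columns only touches the matching directories. For example (the filter values are just illustrative):

SELECT column_1, COUNT(*)
FROM my_database.my_table
WHERE year=2017 AND month=11 AND day=30 -- prunes the scan to a single day's partitions
GROUP BY column_1;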

If you want a turbo-boost to your queries, use partitioning and the ORC format on your tables. It’s highly recommended. If we had a Hadoopsters Approves stamp, we’d use it here.

CREATE TABLE my_database.my_table
(
column_1 string,
column_2 int,
column_3 double
)
PARTITIONED BY
(
year int,
month smallint,
day smallint,
hour smallint
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY'); -- ensure SNAPPY is uppercase; a lowercase value triggers a nasty bug in older Hive versions (since fixed)

Analyze Your Table When You Make Changes To It

Analyzing a table (also known as computing statistics) is a built-in Hive operation that collects metadata on your table and stores it in the Hive Metastore (metadata store). This can vastly improve query times, because it hands the query planner the row count, file count, and total data size (in bytes) of the table before execution, making future queries on the table more optimal. Every time the contents of the table change, stats should be recollected.

ANALYZE TABLE my_database.my_table COMPUTE STATISTICS;
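To confirm the stats landed in the metastore, inspect the table’s parameters; look for numRows, numFiles, and totalSize in the output:

DESCRIBE FORMATTED my_database.my_table;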

You can (and really should) analyze partitioned tables as well. In fact, it’s particularly helpful for partitioned tables, since stats are collected (and used by the query planner) per partition.

ANALYZE TABLE my_database.my_table PARTITION (year=2017, month=11, day=30) COMPUTE STATISTICS;

You can also analyze the columns of your table and/or partitions. This is a more intense stat-collecting operation that gathers metadata on the columns you specify, and stores that information in the Hive Metastore for query optimization. Per column, that information includes:

  • the number of distinct values
  • the number of NULL values
  • the minimum or maximum K values, where K can be given by the user
  • a histogram: frequency and height balanced
  • the average size of the column
  • the average or sum of all values in the column, if the type is numerical
  • percentiles of the values

-- column stats for a non-partitioned table
ANALYZE TABLE my_database.my_table COMPUTE STATISTICS FOR COLUMNS column_1, column_2, column_3;

-- column stats for a single hour of a partitioned table
ANALYZE TABLE my_database.my_table PARTITION (year=2017, month=11, day=30, hour=0) COMPUTE STATISTICS FOR COLUMNS column_1, column_2, column_3;

-- column stats for a single day of a partitioned table (all hours)
ANALYZE TABLE my_database.my_table PARTITION (year=2017, month=11, day=30, hour) COMPUTE STATISTICS FOR COLUMNS column_1, column_2, column_3;
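To see what was collected for a given column, newer Hive versions let you describe the column itself (the output includes the distinct count, null count, min/max, and so on):

DESCRIBE FORMATTED my_database.my_table column_1;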

Make sure (when you’re ready to query your table) that you have these Hive settings enabled, so the planner actually makes use of your computed stats:

set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;

In Summary: Use ORC, Partitioning and Analyzing for a Powerful Combo

  1. Build your table with partitions, ORC format, and SNAPPY compression.
  2. Analyze your table when you make changes or add a partition, and analyze the partition.
  3. Analyze the columns you use most often (or all of them) at the partition level when you add a partition.
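Putting it all together, here’s a minimal end-to-end sketch of the combo, assuming a hypothetical raw source table my_database.my_raw_table to load from:

-- 1. Build the table with partitions, ORC format, and SNAPPY compression
CREATE TABLE my_database.my_table
(
column_1 string,
column_2 int,
column_3 double
)
PARTITIONED BY (year int, month smallint, day smallint, hour smallint)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- 2. Load a partition (rows are converted to ORC on the fly), then analyze it
INSERT INTO TABLE my_database.my_table PARTITION (year=2017, month=11, day=30, hour=0)
SELECT column_1, column_2, column_3 FROM my_database.my_raw_table; -- hypothetical source table

ANALYZE TABLE my_database.my_table PARTITION (year=2017, month=11, day=30, hour=0) COMPUTE STATISTICS;

-- 3. Collect column stats for the partition you just added
ANALYZE TABLE my_database.my_table PARTITION (year=2017, month=11, day=30, hour=0) COMPUTE STATISTICS FOR COLUMNS column_1, column_2, column_3;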

Originally published at http://hadoopsters.com on December 19, 2017.
