Creating Hive tables is a common task for anyone who uses Hadoop. It lets us mix and merge datasets into unique, customized tables. And there are many ways to do it.

We have some recommended tips for Hive table creation that can increase your query speeds and reduce the storage footprint of your tables. And it’s simpler than you might think.

For analysts and data scientists, implementing these methods means you can spend more time learning from the data than learning how to use it, or waiting for queries to return. For engineers and developers, this means you can lower your block count and file sizes, and make your analysts and data scientists happy at the same time.

In a nutshell, the steps are:

  • Create Your Tables in ORC Format
  • Partition Your Tables
  • Analyze Your Tables When You Make Changes To Them
  • Use ORC, Partitioning and Analyzing for a Powerful Combo

**Quick Primer on the ORC Format**

ORC is a file format designed for use with Hive, Hadoop, and Spark. Its features are listed below; in a nutshell, they make it an efficient format for querying with SQL/HQL.

  • columnar-style storage (beneficial to the big data querying lifestyle): columns are stored separately from each other, reducing the read, decompression, and processing of data that isn’t even queried
  • indexing: lightweight indexes are built on the stripes of a file, recording where in the file certain rows exist
  • stripes: related to the columnar storage aspect; a stripe is a chunk of data within a file that is listed in an index, allowing queries to read only the necessary stripe(s) instead of scanning the entire data file
  • reduced file sizes (super beneficial to Hadoop clusters and managing block counts): not only does this solve the problem of space for your data, it also reduces the number of files the Hadoop NameNode has to track and index
  • network-friendly compression: ZLib is a more intense compression algorithm that produces the smallest files for tremendously large datasets (but you compromise on CPU time to compress and decompress), while SNAPPY splits the difference: its files are larger than ZLib’s, but it gives you fast network transfers (shuffling among nodes) and still saves far more space than text
  • predicate pushdown: uses the indexes to determine which stripes (chunks) of a file to read for a particular query. This can narrow a search to fewer than 10,000 rows, and keeps Hive from reading entire files.
  • support for complex types: structs, lists, maps, and unions are all supported in ORC

Anyway, onto the list!

  1. Create Your Table in ORC Format

    This is the standard way of creating a basic Hive table. The only difference? Instead of using the default storage format of TEXT, this table uses ORC, a columnar file format in Hive/Hadoop that uses compression, indexing, and separated-column storage to optimize your Hive queries and data storage. With this created, data can be freely inserted into it, and data will be converted to this ORC format on-the-fly!

    Create Table Statement
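    A basic ORC-backed table might look like the following sketch (the database, table, and column names are hypothetical):

    ```sql
    -- A simple table stored as ORC instead of the default text format.
    -- Names here are placeholders; substitute your own schema.
    CREATE TABLE my_database.my_table (
      id        INT,
      name      STRING,
      event_ts  TIMESTAMP
    )
    STORED AS ORC;
    ```

    Once created, INSERT statements against this table write ORC files automatically.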

    You can even do it through a convenient CTAS (Create Table As Select) statement. Note that the CTAS method will not work if you want to create a partitioned, external, or list-bucketing table. Use the CTAS method for simple datasets like lookups, or quick-and-dirty temporary tables, rather than enormous ones you plan to keep; even small tables can gain from ORC’s benefits.

    Create Table As SELECT
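    As a sketch, a CTAS that copies an existing text-format table into ORC might look like this (table names are hypothetical):

    ```sql
    -- Create an ORC copy of an existing table in one statement.
    CREATE TABLE my_database.my_lookup_orc
    STORED AS ORC
    AS
    SELECT * FROM my_database.my_lookup_text;
    ```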
  2. Partition Your Table

    Partitioning your tables is a fantastic way to improve the processing times of queries on your table. This is because of how Hive scans partitions to execute job tasks in parallel — partitioning your data logically assists the job planner in that process.

    Partitioning a Table
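    A minimal partitioned-table sketch (the partition column and table names are assumptions):

    ```sql
    -- Partition by a load date: each distinct value becomes its own
    -- directory on disk, which the query planner can prune.
    CREATE TABLE my_database.my_events (
      id    INT,
      name  STRING
    )
    PARTITIONED BY (load_date STRING);
    ```

    Queries that filter on load_date then scan only the matching partitions.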

    If you want a turbo-boost to your queries, use partitioning and the ORC format on your tables. It’s highly recommended. If we had a Hadoopsters Approves stamp, we’d use it here.

    Partitioning + ORC
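    Combining the two might look like the following (names and the SNAPPY choice are illustrative):

    ```sql
    -- Partitioned, ORC-backed, SNAPPY-compressed.
    CREATE TABLE my_database.my_events (
      id    INT,
      name  STRING
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    TBLPROPERTIES ("orc.compress" = "SNAPPY");
    ```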
  3. Analyze Your Table When You Make Changes To It

    Analyzing a table (also known as computing statistics) is a built-in Hive operation that you can execute to collect metadata on your table. This can vastly improve query times on the table because it collects the row count, file count, and file size (bytes) that make up the data in the table and gives that to the query planner before execution. By running this query, you collect that information and store it in the Hive Metastore (metadata store), which will make future queries on this table more optimal. Every time the contents of the table change, stats should be recollected.

    Analyze a Table
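    The basic form, with a hypothetical table name:

    ```sql
    -- Collects row count, file count, and data size into the Hive Metastore.
    ANALYZE TABLE my_database.my_table COMPUTE STATISTICS;
    ```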

    You can (and really should) analyze partitioned tables as well. In fact, it’s particularly helpful for tables with partitions, as, again, it assists in query planning and optimization when querying the table.

    Analyze a Partitioned Table
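    For a partitioned table, you can target a single partition; the partition value below is a placeholder:

    ```sql
    -- Recompute stats for just the partition that changed.
    ANALYZE TABLE my_database.my_events
    PARTITION (load_date = '2020-01-01')
    COMPUTE STATISTICS;
    ```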
    You can also analyze the columns of your table and/or partitions. This is a more intense stat-collecting function that collects metadata on columns you specify, and stores that information in the Hive Metastore for query optimization. Per column, that information includes:

      • the number of distinct values
      • the number of NULL values
      • minimum or maximum K values, where K could be given by a user
      • histogram: frequency and height balanced
      • average size of the column
      • average or sum of all values in the column, if the type is numerical
      • percentiles of the values
    Analyzing Table Columns
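    Column statistics use the FOR COLUMNS clause; as a sketch (names and partition value are hypothetical):

    ```sql
    -- Collect per-column stats for one partition.
    ANALYZE TABLE my_database.my_events
    PARTITION (load_date = '2020-01-01')
    COMPUTE STATISTICS FOR COLUMNS;
    ```

    You can also list specific columns after FOR COLUMNS if you only care about a few.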
    Make sure (when you’re ready to query your table) that you have these Hive settings enabled, to ensure you get the most out of your calculated stats!
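    The settings in question are most likely the standard stats-related flags below; defaults vary by Hive version, so verify them against your own cluster:

    ```sql
    -- Answer eligible queries (e.g. COUNT(*)) directly from stored stats.
    SET hive.compute.query.using.stats=true;
    -- Let the planner fetch column- and partition-level stats.
    SET hive.stats.fetch.column.stats=true;
    SET hive.stats.fetch.partition.stats=true;
    ```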

  4. Use ORC, Partitioning and Analyzing for a Powerful Combo

    1. Build your table with partitions, ORC format, and SNAPPY compression.
    2. Analyze your table when you make changes or add a partition, and analyze the partition.
    3. Analyze the columns you use most often (or all of them) at the partition level when you add a partition.
