How many times have you been querying your data in Hive, only to come across some gnarly looking (or mostly null) records? Maybe it’s only one or two entries, perhaps it’s thousands – either way, you have some (seemingly) busted data, and you’re not happy about it.
Let’s talk about some ways to pinpoint problems in data in Hive, leveraging the tools available on our stack.
Pinpointing the Issue
First of all, let's fix our mindset: the data itself might not be the problem – it could be how Hive interprets it. After all, Hive is an RDBMS-like system built on the schema-on-read philosophy that Hadoop champions.
So, it could be that your trouble data has a newline embedded in it, or that a string column contains the very delimiter used to separate the fields themselves, and so on. I've seen data with carriage returns embedded in string descriptions because previous users copy-pasted text from an editor on Windows, resulting in Hive showing nulls for those records even though the text content looked perfectly fine at first glance.
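To make that concrete, here's a minimal Python sketch (the three-column layout and the values are hypothetical) of why an embedded newline produces nulls: a delimited-text reader splits the file on newlines first, so one logical record becomes two short "rows", and schema-on-read pads the missing columns with NULL.

```python
# Hypothetical record: three comma-separated columns, but the description
# contains an embedded newline.
row = "42,WIDGET,A description with an\nembedded newline"

# A delimited-text reader splits on newlines first...
lines = row.split("\n")
for line in lines:
    fields = line.split(",")
    # ...then pads missing trailing columns, the way Hive fills in NULLs.
    padded = fields + [None] * (3 - len(fields))
    print(padded)
# The second fragment has only one real field; the other two come back None,
# which is exactly the "mostly null" record you see in Hive.
```

The data isn't gone – it's just been cut in half at the newline, and the second half no longer matches the schema.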
All we have to do is leverage some cool Hive features to guide us to the exact location of the issue in the raw data. In short, we're going to let Hive tell us exactly where the trouble records live in HDFS, right down to the byte.
Leveraging Virtual Columns in Hive
Hive has a concept of virtual columns, which are metadata-like pieces of information that can be included with any old query. Here’s an example: say I have a bunch of comma-separated records in a Hive table, and four (4) of them are showing up as nulls. Let’s say I can, for the sake of brevity, see those four (4) records by doing something as simple as this:
select * from my_table limit 10;
Let's say six (6) of these records are good, and four (4) are filled with bad, bad nulls. Obviously this is a dead-simple query, and in reality I'd be using WHERE clauses to find my trouble records, but just roll with it.
Say that query shows me those records. If the data lives in a single text file, I know I can find them by opening the file in an editor and searching, or by grepping for them on the command line.
But what if this data is the result of a map-reduce operation and comes in multiple files? What if it’s just spread across multiple files in general? Well, you’re in luck. Hive has a virtual column (the first of two I’ll show you) called INPUT__FILE__NAME. Yes, those are double underscores between the words input, file and name. If you spell it any other way, Hive will throw an error.
Using this column in your select query will return the absolute path to the location of the file in which the record you’re viewing lives. See and believe:
hive> select INPUT__FILE__NAME, * from my_table limit 10;
OK
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00004  3    MILL   MILLWORK
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000  90   MGMT   MANAGEMENT
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000  19   STOR   HOME ORGANIZATION
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00001  103  ADJ3   ADJ BUDGET BALANCING - 3
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000  99   INCHR  INCORRECTLY ASSIGNED HOURS
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00002  9    FURN   FURNITURE
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00002  107  ADJ7   ADJ BUDGET BALANCING - 7
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000  5    ELCR   ROUGH ELECTRICAL
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000  80   SUPP   STORE SUPPORT
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00003  50   SERV   STORE SERVICES
Now that’s super handy. But what if I’m crazy, and I want to see TO THE BYTE where each record I’m selecting lives? Well I am crazy, and I DO want to see the byte offset of those troublesome little records – which I can achieve with another virtual column called BLOCK__OFFSET__INSIDE__FILE.
hive> select INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, * from my_table limit 10;
OK
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00004  31   3    MILL   MILLWORK
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000  62   90   MGMT   MANAGEMENT
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000  94   19   STOR   HOME ORGANIZATION
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00001  125  103  ADJ3   ADJ BUDGET BALANCING - 3
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000  156  99   INCHR  INCORRECTLY ASSIGNED HOURS
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00002  188  9    FURN   FURNITURE
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00002  219  107  ADJ7   ADJ BUDGET BALANCING - 7
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000  250  5    ELCR   ROUGH ELECTRICAL
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00000  282  80   SUPP   STORE SUPPORT
hdfs://mycluster.mycompany.com/mycompany/data/raw/rdbms/my_table/part-m-00003  313  50   SERV   STORE SERVICES
Isn't that cool? That new column shows the byte at which each record starts, also known as its byte offset within the file. And while you could peruse the file byte by byte (character by character), that's painful.
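Once you have a file path and a byte offset, jumping straight to the suspect record is a one-liner in most languages. Here's a minimal Python sketch: it simulates the HDFS file with a tiny local copy (in practice you'd pull the real file down first, e.g. with `hdfs dfs -get`), and the file name, contents, and offset are all hypothetical stand-ins.

```python
def record_at(path, offset, length=200):
    """Read up to `length` raw bytes starting at the given byte offset."""
    with open(path, "rb") as f:
        f.seek(offset)        # jump straight to the record's start byte
        return f.read(length)

# Tiny stand-in for the real HDFS part file, so the sketch is self-contained.
with open("part-m-00000", "wb") as f:
    f.write(b"90,MGMT,MANAGEMENT\n19,STOR,HOME ORGANIZATION\n")

# In this toy file, the second record starts at byte 19 -- the value
# BLOCK__OFFSET__INSIDE__FILE would have reported for it.
raw = record_at("part-m-00000", 19)
print(raw)  # b'19,STOR,HOME ORGANIZATION\n'
```

Reading in binary mode matters here: it shows you the bytes exactly as Hive sees them, stray carriage returns and all.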
Pro tip: use Hue's file viewer to view the file itself (if it's raw text), and you can navigate to the very byte you want to inspect. From there, pull the row out into your favorite text editor, like Sublime, and look for newlines, carriage returns, or whatever else might be causing Hive to throw nulls in your face.
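If you'd rather stay on the command line than reach for an editor, printing the repr() of the suspect bytes makes invisible characters jump out. A small sketch, with a hypothetical record standing in for your real data:

```python
# A hypothetical trouble record with a carriage return hiding in the
# description column.
suspect = b"103,ADJ3,ADJ BUDGET\rBALANCING - 3\n"

# A plain print would hide the \r; repr() spells out every escape.
print(repr(suspect))
# b'103,ADJ3,ADJ BUDGET\rBALANCING - 3\n'
```

The `\r` stands out immediately, which is usually all the confirmation you need before fixing the upstream data or the table's SerDe settings.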
It doesn’t solve every problem, but it does help pinpoint it. I hope you find a way to use Hive’s virtual columns to find issues in your data (but I hope more that you simply have no issues in your data)!