If your Hadoop cluster allows you to connect to Hive through the command line interface (CLI), you can easily export a Hive table to a CSV file.
It only takes a few lines of code, which I’ve written into a couple of bash/shell scripts:
Approach One (Hive Insert Overwrite a Directory):
#!/bin/bash
hive -e "insert overwrite local directory '/path/in/local/'
row format delimited fields terminated by ','
select * from my_database.my_table"
cat /path/in/local/* > /another/path/in/local/my_table.csv
This approach writes the contents of a Hive table to a local (Linux) path in as many files as it needs. It then uses the Linux “cat” command to merge all of those files into one .csv.
Here’s what happened:
- shebang line (optional)
- Command issued to Hive that selects all records from the table, separates the fields/columns by a comma, and writes the output to a local directory (wiping anything previously in that path).
- Cat command issued to get/merge all part files (remember, the output was from a Map/Reduce job) in that directory into a single .csv file.
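If you want to sanity-check the merged file, a couple of standard Linux commands will do (the paths here are just the placeholders from the script above):

head -n 5 /another/path/in/local/my_table.csv   # peek at the first few rows
wc -l /another/path/in/local/my_table.csv       # count the exported rows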
Approach Two (Hive CSV Dump Internal Table):
#!/bin/bash
hive -e "drop table if exists csv_dump;
create table csv_dump ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION '/temp/storage/path' as
select * from my_data_table;"
hadoop fs -getmerge /temp/storage/path/ /local/path/my.csv
This approach writes the table’s contents to an internal Hive table called csv_dump, delimited by commas and stored in HDFS as usual. It then uses a Hadoop filesystem command called “getmerge” that does the equivalent of Linux “cat”: it merges all files in a given directory and produces a single file in another given directory (it can even be the same directory).
In either approach, that .csv now lives on your local edge node, and can be placed into HDFS, used in other scripts, or SCP’d to your local desktop. It’s a very efficient and easy way to get the contents of a Hive table into an easily human- and application-readable format.
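For example, assuming the placeholder paths from above and a hypothetical edge node host and HDFS directory of your own, those follow-up steps might look like this:

# Put the merged CSV into HDFS
hdfs dfs -put /another/path/in/local/my_table.csv /user/yourname/exports/

# Or, from your desktop, pull the file down over SSH
scp yourname@your-edge-node:/another/path/in/local/my_table.csv .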
For more Hadoop tutorials, subscribe to this blog (button in sidebar).
In approach one, I believe the cat command should be “cat /path/in/local/* > /another/path/in/local/my_table.csv”.
Nice catch! I guess WordPress’s HTML chewed it up. It’s not the best for code snippets. Thanks!
this is not working
Well, what errors are you seeing?
If I want to export the data with the table headers, how can I do that?
There are a few ways. If you do option 1, what I’ve done is create a file on the edge node that has the file headers already in it, and then do an append instead of an overwrite on the last command: cat /path/in/local/* >> /another/path/in/local/my_table.csv
Of course this is static. You can also insert the header with a bash command using awk/sed/cut/etc. This is “more” dynamic, but not enormously so.
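A rough sketch of both ideas (the column names and paths are placeholders you’d swap for your own; the sed form assumes GNU sed, which is what you’ll find on most Linux edge nodes):

# Static: write the known header once, then append the part files
echo "col1,col2,col3" > /another/path/in/local/my_table.csv
cat /path/in/local/* >> /another/path/in/local/my_table.csv

# Slightly more dynamic: prepend a header line to an already-merged file
sed -i '1i col1,col2,col3' /another/path/in/local/my_table.csv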