You have to be a bit surgical.

As you progress in your big data journey with Hadoop, you may find that your datanodes’ hard drives are gradually getting more and more full. A tempting thing to do is simply plug in more hard drives to your servers: you’ve got extra slots on your racks and adding entirely new nodes is an expensive (and a little tedious) task. This is particularly relevant when hard drives start failing on your data nodes.

Unless you want to spend a long time fixing your cluster’s data distribution, I urge you,

Don’t Just Plug That Disk In.

Here’s what’ll happen:
You’ve got a data node with three disks that you use as your HDFS storage:

FS      Size Used Avail Use% Mounted on
/dev/vdb 100G 70G 30G 70% /data/disk01
/dev/vdc 100G 70G 30G 70% /data/disk02
/dev/vdd 100G 70G 30G 70% /data/disk03

You say to yourself: “Hey, I don’t like that my HDFS storage is getting kind of high and I’ve got three spare hard drives just lying around. I’ll increase storage by plugging them in!”

So you plug them in, create a file system, and mount the drives to new directories:

FS      Size Used Avail Use% Mounted on
/dev/vdb 100G 70G 30G 70% /data/disk01
/dev/vdc 100G 70G 30G 70% /data/disk02
/dev/vdd 100G 70G 30G 70% /data/disk03
/dev/vde 100G 0G 100G 0% /data/disk04
/dev/vdf 100G 0G 100G 0% /data/disk05
/dev/vdg 100G 0G 100G 0% /data/disk06

And just like that you’ve doubled your cluster’s capacity!

You’re also about to have a bad time.

Three Weeks Later

You get a phone call in the middle of the night saying that every job running on the cluster is failing with “NoSpaceLeftOnDevice” errors.

“That’s not possible!”, you cry, exasperated. “We just doubled our storage three weeks ago!”

You log in to DataNode01 and look at the disks:

FS      Size Used Avail Use% Mounted on
/dev/vdb 100G 99G 1G 99% /data/disk01
/dev/vdc 100G 99G 1G 99% /data/disk02
/dev/vdd 100G 99G 1G 99% /data/disk03
/dev/vde 100G 29G 71G 29% /data/disk04
/dev/vdf 100G 29G 71G 29% /data/disk05
/dev/vdg 100G 29G 71G 29% /data/disk06

The offending disks are the first three: disk01 through disk03. You’ll notice they’re the original disks. And despite your best efforts, they’re basically full.
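If you would rather not wait for the phone call, a check like the one you just did by hand is easy to script. Here is a minimal sketch in Python, assuming you pass it the mount points yourself; the function name `fullest_volume` and its injectable `usage` parameter are made up for illustration and are not part of any Hadoop tooling.

```python
import shutil


def fullest_volume(mounts, usage=shutil.disk_usage):
    """Return (mount, percent_used) for the most-utilized mount point.

    `usage` defaults to shutil.disk_usage and is injectable so the
    function can be exercised without real disks.
    """
    def percent_used(mount):
        stats = usage(mount)
        return 100.0 * stats.used / stats.total

    worst = max(mounts, key=percent_used)
    return worst, percent_used(worst)


if __name__ == "__main__":
    # "." stands in for real mount points such as /data/disk01.
    mount, pct = fullest_volume(["."])
    print(f"{mount} is {pct:.0f}% used")
```

The point is to alert on the fullest disk rather than on the node's average, since the average is exactly what hid the problem here.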

What Happened:
By default, an HDFS DataNode picks the disk for each new block in round-robin fashion. So if you’ve got 7 blocks to write, they’ll actually be written like this:

  1. Block1 –> Disk01
  2. Block2 –> Disk02
  3. Block3 –> Disk03
  4. Block4 –> Disk04
  5. Block5 –> Disk05
  6. Block6 –> Disk06
  7. Block7 –> Disk01

The important thing to see here is that Blocks 1 and 7 are BOTH written to Disk01: the old disks receive new blocks at the same rate as the empty ones. Adding those new disks certainly increased your total HDFS storage. Your old disks, though, are still nearing 100%. Once they fill up, a job that tries to write a block to one of them runs out of space and fails.

The Fullest Disk is the Weakest Link. 
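To see how the numbers in this story come about, here is a toy simulation in Python. It is a sketch of the behavior as described in this post, not HDFS's actual implementation: it assumes 1 GB blocks, six 100 GB disks, and a strict rotation that fails as soon as the next disk in line has no room. The function name `round_robin_fill` is made up for illustration.

```python
def round_robin_fill(used, capacity, block_size, blocks):
    """Write up to `blocks` blocks, one disk at a time in strict rotation.

    Returns (per-disk usage, number of blocks written). Stops at the
    first block whose target disk has no room, mirroring the failure
    described in this post (a simplification, not HDFS's real logic).
    """
    used = list(used)
    for written in range(blocks):
        disk = written % len(used)
        if used[disk] + block_size > capacity:
            return used, written
        used[disk] += block_size
    return used, blocks


start = [70, 70, 70, 0, 0, 0]  # GB used on disk01..disk06
print(round_robin_fill(start, 100, 1, 174))
# ([99, 99, 99, 29, 29, 29], 174)
print(round_robin_fill(start, 100, 1, 1000))
# ([100, 100, 100, 30, 30, 30], 180)
```

Under these assumptions the node stalls after only 180 GB of new data, even though the three new disks still have 210 GB free between them.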

This phenomenon and its fix are discussed in full on an Apache JIRA ticket.

TL;DR: HDFS can choose disks either round-robin or based on available space. Switch to the latter by setting the config “dfs.datanode.fsdataset.volume.choosing.policy” to the value “org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy” in your hdfs-site.xml file.
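For intuition about why an available-space policy helps, here is another toy model in Python. It is a deliberately crude stand-in: it always writes to the emptiest disk, whereas the real AvailableSpaceVolumeChoosingPolicy is, as far as I understand it, more nuanced than that. The name `most_free_fill`, the 1 GB block size, and the equal 100 GB disk capacities are all assumptions for illustration.

```python
def most_free_fill(used, capacity, block_size, blocks):
    """Write up to `blocks` blocks, each to the currently emptiest disk.

    Assumes every disk has the same capacity, so "least used" and
    "most free" are the same thing. Returns the per-disk usage.
    """
    used = list(used)
    for _ in range(blocks):
        disk = min(range(len(used)), key=used.__getitem__)
        if used[disk] + block_size > capacity:
            break  # even the emptiest disk is full
        used[disk] += block_size
    return used


print(most_free_fill([70, 70, 70, 0, 0, 0], 100, 1, 174))
# [70, 70, 70, 58, 58, 58]
```

The same 174 GB of writes that pushed the old disks to 99% under strict rotation leaves them untouched here, because the new disks absorb everything until they catch up.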

So how do you fix it?

You’ve got a few options, none of them particularly pleasant:

  1. Create a new node (really only available in a cloud environment), decommission the old node, and then recommission the old node once everything’s balanced out.
  2. Decommission the data node, format all the disks, and recommission it allowing HDFS to gradually replace ‘under-replicated blocks’ on the newly formatted disks.
  3. If you like building things from scratch, there are a few disk balancers out there.
  4. MANUALLY (aka dangerously) move the data blocks from disk to disk with the following command. I take no responsibility for lost or corrupt data. In fact, I recommend against this option because you need to know exactly what you’re doing to avoid messing up. Don’t forget to turn off your DataNode process first!
    rsync --remove-source-files -av disk01/hadoop/hdfs/data/current/<BlockPool String>/current/finalized/<subdir0-N>/* disk06/hadoop/hdfs/data/current/<BlockPool String>/current/finalized/<subdir0-N>/


Avoid the situation

You can, of course, avoid this situation entirely by making a few key changes to how you add space to your cluster. Adding fresh, new nodes to the cluster instead of plugging your unused hard drives into existing nodes will circumvent the round-robin issue. You will have to rebalance HDFS once the nodes have been stood up.

This is the recommended practice as cluster hardware will get old, fail, or fall out of warranty. By maintaining a healthy cycle of servers flowing in and out of your cluster, you can protect yourself against sudden failures or full disks.

Got other ideas? Let me know below!

2 thoughts

    1. Unless you’re working with some sort of elastic storage medium like AWS EBS or EFS, the individual storage devices that are mounted on each node would be static in size. There are ways to eke a few more bytes out of disks (removing any reserved space for `root`, for example). Alternatively, I suppose you could mount a new storage device to the server and do a hot-swap of sorts.

      The steps might look like: 1. turn off the DataNode process, 2. mount the new device, 3. copy the contents of the old device to the new device, 4. unmount the old device from its original location (keep it around if you need to revert these changes), 5. unmount the new device and remount it at the old device’s original location, 6. turn the DataNode process back on. That would get you “more” disk space but only because you plugged in a new device and played a trick on the OS.

      By far though, the easiest way to increase disk space in Hadoop clusters is by adding new nodes.
