Hadoop should be a ‘Data Buffet’ not a ‘Data Lake’.

The nicest buffet I have ever seen.

Here’s why Hadoop is called a ‘data lake’ at all the conferences you’ll ever go to:

A lake is a big body of water that has been collecting for years and years. The water flows in from everywhere into one specific, defined area.

Everyone needs water (ask Californians).

Everyone needs data.

Not pictured: Malformed XML

The ‘data lake’ is a nice metaphor that seems peaceful and has a certain mystery about it. Big business executives like the notion of a peaceful, mysterious collection of data that answers all the deepest and darkest questions one might have about whatever.

It’s very clever and honestly, I do like the metaphor.

But anyone who actually works with Loch Data knows it’s not particularly peaceful, and knows how much of a Monster it can be.

The data retains its mystery, however.

I mean, think about it:

  1. You NEVER know what’s beneath you in a lake *shudder*
  2. There’s really no control over what is actually going into a lake
  3. The ‘fish-to-water’ ratio is incredibly small (more on this later)

So here’s why Hadoop should be called a data buffet:

Everything is clearly labeled

With a Data Buffet, you always know exactly what you’re putting on your plate. The food (data) has been put there because someone will want that weird-looking morsel eventually.

He’s been waiting hours for this query to finish

More importantly, when everything is labeled clearly, people don’t spend a long time fishing for data. This is the fish-to-water ratio I was talking about earlier. Analysts are always after the ‘fish’ in the data. In a Data Buffet, the data is not hiding in the depths of a lake somewhere, waiting to come out only for a sparkly lure. It’s right there for the taking.

Anyone should be able to look at your data architecture and tell you exactly what is going on.
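
If you want to picture what ‘clearly labeled’ could look like in practice, here is a minimal sketch in Python. Everything in it is made up for illustration: the `DatasetLabel` and `Catalog` names, the fields, and the sample datasets are assumptions, not part of Hadoop or any real catalog tool. The only idea it shows is that nothing goes on the buffet without a label card, and that anyone can search the cards.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetLabel:
    """Hypothetical 'label card' sitting in front of one dish on the buffet."""
    name: str
    description: str
    owner: str
    columns: dict = field(default_factory=dict)  # column name -> plain-English meaning


class Catalog:
    """Hypothetical in-memory catalog; a real one would live in a shared service."""

    def __init__(self):
        self._labels = {}

    def register(self, label):
        # Refuse unlabeled dishes: a dataset with no description never reaches the buffet.
        if not label.description.strip():
            raise ValueError(f"{label.name}: a description is required")
        self._labels[label.name] = label

    def search(self, keyword):
        # Match the keyword against dataset names, descriptions, and column names.
        kw = keyword.lower()
        return sorted(
            label.name
            for label in self._labels.values()
            if kw in label.name.lower()
            or kw in label.description.lower()
            or any(kw in column.lower() for column in label.columns)
        )


# Sample entries, invented for this example.
catalog = Catalog()
catalog.register(DatasetLabel(
    name="web_logs",
    description="Raw clickstream events from the storefront",
    owner="platform-team",
    columns={"user_id": "visitor identifier", "url": "page requested"},
))
catalog.register(DatasetLabel(
    name="orders",
    description="One row per completed order",
    owner="sales-team",
    columns={"user_id": "buyer identifier", "total": "order total"},
))

print(catalog.search("user_id"))
print(catalog.search("clickstream"))
```

The design choice worth noticing is the check in `register`: the label is enforced when the data arrives, not reconstructed later by whoever goes fishing for it.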

Mix whatever you want together

A little chicken, a little pizza and some cocktail shrimp. Yeah, it’s weird. But you need that combination at the moment. That’s the whole point of it all anyways! What’s the use in collecting all this data in one big environment if you can’t tie it in with other stuff later down the road?

Now I’ve made myself hungry

Some of the biggest discoveries in science have come from mixing two seemingly unconnected ideas together.

So do a little experimenting! Look for correlations, dig around in logs, peek at something that’s covered in database-dust.

Insight is there; you just have to try new things.
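
To make the ‘mix whatever you want together’ idea a bit more concrete, here is a small Python sketch of putting two unrelated datasets on the same plate: join them on a shared key, then check whether they move together. The function names and the sample numbers are invented for illustration, and a correlation like this is only a hint about where to dig next, not a finding.

```python
from math import sqrt


def join_on_key(left, right):
    """Inner-join two {key: value} mappings and return the paired values."""
    shared_keys = sorted(left.keys() & right.keys())
    return [(left[key], right[key]) for key in shared_keys]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    spread_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    spread_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one side never varies")
    return covariance / sqrt(spread_x * spread_y)


# Made-up daily numbers from two sources that normally never meet.
page_views_by_day = {"mon": 120, "tue": 150, "wed": 90, "thu": 200}
support_tickets_by_day = {"mon": 4, "tue": 6, "wed": 3, "thu": 9, "fri": 5}

pairs = join_on_key(page_views_by_day, support_tickets_by_day)
print(len(pairs), "days in common")
print("correlation:", round(pearson(pairs), 3))
```

Only the days present in both sources survive the join, which is why the stray `fri` entry is quietly left off the plate.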

It takes work to maintain

Lakes just come into being. They’ve been there for a while and they’ll be there longer still. Even man-made lakes gradually fill up with no particular schedule in mind.

Apache Blur will be the next super-fast M/R framework

Buffets, on the other hand, are created and maintained by the hands of cooks and servers. If you start to run out of something, it’s already on the burner in the kitchen getting ready to go. It’s their job to keep the food clean and ready to be served. Without the cooks and the servers, the whole buffet grinds to a halt.

There’s prep work, there’s clean up, and there’s a massive number of steps between open and close each day. But each step and each member of the team is essential for keeping the buffet spectacular and abundant for each customer.

Data will never be self-organizing. It will always take a huge amount of hard work to use data effectively.


