Spark SQL supports an incredibly useful feature: predicate subqueries. The documentation on the Databricks website defines them as follows:

Predicate subqueries are predicates in which the operand is a subquery. Spark 2.0 supports both the EXISTS and IN based forms. Spark 2.0 currently only supports predicate subqueries in WHERE clauses.

That means you can conveniently filter rows in your main query using the contents of another query. This can be incredibly helpful when you need to select only rows from a table where matching contents exist in another table.

Predicate subqueries might be the solution for you if you need, say, to provide an otherwise massive list of values for an IN statement in a WHERE clause.

A good example use case is selecting only web traffic from a click-stream table where the users are defined in another table. For instance, you might want to see only rows related to certain users, but your list of users (stored in another table) is dozens, hundreds, or even millions of users long.

That is where predicate subqueries in Spark shine. Take, for example, the query below, which uses a predicate subquery to filter the rows in clickstream based on the users found in another table, interesting_users.

select count(1)
from some_db.clickstream
where user in (select distinct user from some_db.interesting_users)

Note: Using select * for your base table might cause errors (I’ve only seen it in Zeppelin, but your results may vary), and you might need to explicitly select the columns you need. A small price to pay for this level of convenience.
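The IN-subquery (and the EXISTS-based) filtering described above follows standard SQL semantics, so it can be sketched with any SQL engine. Below is a minimal, runnable illustration using Python's built-in sqlite3 module; the table and column names mirror the Spark example, but the data is made up for illustration:

```python
import sqlite3

# In-memory database standing in for some_db
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A toy clickstream table and a table of interesting users
cur.execute("create table clickstream (user text, url text)")
cur.execute("create table interesting_users (user text)")
cur.executemany("insert into clickstream values (?, ?)", [
    ("alice", "/home"), ("alice", "/cart"),
    ("bob", "/home"), ("carol", "/about"),
])
cur.executemany("insert into interesting_users values (?)",
                [("alice",), ("carol",)])

# IN-based predicate subquery: keep only clickstream rows whose
# user appears in interesting_users
cur.execute("""
    select count(1)
    from clickstream
    where user in (select distinct user from interesting_users)
""")
print(cur.fetchone()[0])  # 3 rows: two for alice, one for carol

# EXISTS-based form of the same filter (the other form Spark 2.0
# supports), written as a correlated subquery
cur.execute("""
    select count(1)
    from clickstream c
    where exists (select 1 from interesting_users i where i.user = c.user)
""")
print(cur.fetchone()[0])  # same 3 rows
```

Both queries return the same rows; the EXISTS form is simply the correlated equivalent of the IN form.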

Source: Databricks Blog
