Spark SQL supports an incredibly useful feature: predicate subqueries. The documentation on the Databricks website defines them as:
Predicate subqueries are predicates in which the operand is a subquery. Spark 2.0 supports both the EXISTS and IN based forms. Spark 2.0 currently only supports predicate subqueries in WHERE clauses.
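As a rough sketch, the two forms look like this (the table and column names here are hypothetical):

-- IN-based form: keep rows whose user appears in the subquery's result
select user, page from events where user in (select user from allowed_users)

-- EXISTS-based form: keep rows for which the correlated subquery returns at least one row
select e.user, e.page from events e where exists (select 1 from allowed_users a where a.user = e.user)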
That means you can conveniently filter rows in your main query using the contents of another query. This is incredibly helpful when you need to select only the rows from a table that have matching contents in another table.
Predicate subqueries might be the solution for you if you need, say, to provide an otherwise massive list of values for an IN statement in a WHERE clause.
A good example use case is selecting only the web traffic from a click-stream table for users defined in another table. You might want to see rows related only to certain users, but your list of users (stored in another table) is dozens, hundreds, or even millions of users long.
That is where predicate subqueries in Spark shine. Take, for example, the query below, which uses a predicate subquery to filter users in clickstream based on the users found in another table, interesting_users.
select count(1) from some_db.clickstream where user in (select distinct user from some_db.interesting_users)
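The same filter can also be written with the EXISTS form. A sketch of an equivalent query (same tables, correlated on the user column) might look like this:

-- EXISTS-based equivalent: count clickstream rows that have a matching user in interesting_users
select count(1) from some_db.clickstream c where exists (select 1 from some_db.interesting_users u where u.user = c.user)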
Note: Using select * for your base table might cause errors (I’ve only seen it in Zeppelin, but your results may vary), and you might need to explicitly select the columns you need. A small price to pay for this level of convenience.
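For instance, instead of select *, you could name the columns you actually need (the column names below are made up for illustration):

-- Explicitly listing columns instead of select *
select user, event_time, url from some_db.clickstream where user in (select distinct user from some_db.interesting_users)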
Source: Databricks Blog