Correlated Subqueries and Why You Need To Know Them

Queries are fun when you understand them!

Introduction

Let's say you have 2 tables:

  1. table1: Having a column col1

  2. table2: Having a column col2

create table table1 (col1 int);
create table table2 (col2 int);

Let's discuss a query:

select * from table1 where col1 in (select col1 from table2);

On a first look, it seems like this query should error out, because duh, col1 does not exist in table2 . Congratulations, that's NOT what will happen.

I'll spoil the answer: This query would return all rows from table1 if table2 is non-empty.

Try the following 2 cases for yourself.

Case 1: table2 is non-empty

postgres@localhost:akjn> select * from table1; -- table1 has some rows
+------+
| col1 |
|------|
| 1    |
| 2    |
| 3    |
| 4    |
+------+

postgres@localhost:akjn> select * from table2; -- table2 is non-empty
+------+
| col2 |
|------|
| 5    |
+------+

postgres@localhost:akjn> select * from table1 where col1 in (select col1 from table2); -- This returns all rows from table1
+------+
| col1 |
|------|
| 1    |
| 2    |
| 3    |
| 4    |
+------+

Case 2: table2 is empty

postgres@localhost:akjn> delete from table2; -- Delete all rows from table2
DELETE 1

postgres@localhost:akjn> select * from table1 where col1 in (select col1 from table2); -- This returns nothing since table2 is empty
+------+
| col1 |
|------|
+------+

Now that we've gone over "what" we have to wrap our heads around, we will try to understand why it works the way it works, and then we will try to understand why understanding this is important using an actual incident of my life that happened a couple of weeks back.

The Query Explained :)

To understand the query, we need to understand a couple of things first:

  1. Scope Resolution

  2. Correlated Queries

Scope Resolution

How does the query engine know which column are we talking about? We could have a bunch of tables in a complex query, with common column names. In such cases, it tries to resolve the column names (as in, it finds out which table the column name is referring to) based on some rules.

To keep it simple, let's discuss our example query:

select * from table1 where col1 in (select col1 from table2);

In a subquery, the column names are resolved by looking at the innermost scope and then moving out. The first scope where the column is resolved is used.

There can be 3 cases:

  • Case 1: When table2 has a column named col1

    • col1 would be resolved as table2.col1

    • The query would be run as select * from table1 where col1 in (select table2.col1 from table2);

  • Case 2: When table2 does not have a column named col1, but table1 has (This is our case!!!!)

    • col1 would be resolved as table1.col1

    • The query would be run as select * from table1 where col1 in (select table1.col1 from table2);

  • Case 3: When neither table2 nor table1 have a column named col1

    • It will error out saying column "col1" does not exist

Correlated Queries

What's a subquery?

A subquery (or nested query) is a query nested inside another query.

Example:

select * from table1 where col1 in (select col2 from table2);

In the above example, the subquery select col2 from table2 is executed first, and the output of the subquery is used by the outer query. Hence, steps of execution:

  1. Subquery is executed (only once)

  2. Outer query is executed using output of step 1 (only once)

What's a Correlated Subquery?

A correlated subquery is a subquery that contains a reference to a table from the outer query. The outer query uses the result of the inner query.

Example:

select * from table1 where col1 in (select col1 from table2);

This is the query we have been looking from the very start!

Why is this a correlated subquery? It's because the subquery (select col1 from table2) is referencing a column (col1) from a table of the outer query (table1) --- thanks to scope resolution that we learned in the previous section. Note that this is a correlated subquery only because col1 does not exist in table2, otherwise scope resolution would've resolved select col1 from table2 as select table2.col1 from table2.

Now let's discuss how correlated subqueries are executed, in general.

For correlated subqueries, outer query executes first, and for every outer query row returned in the execution, the inner query is executed using the value of the outer query row.

Let's try to understand the above confusing statement with our query.

Coming back to our query

Our query:

select * from table1 where col1 in (select col1 from table2);

I've changed the data in the two tables to make the example easier to understand, and to avoid any confusion. Current state:

postgres@localhost:akjn> select * from table1;
+------+
| col1 |
|------|
| 1    |
| 4    |
| 42   |
+------+
postgres@localhost:akjn> select * from table2;
+------+
| col2 |
|------|
| 777  |
+------+

Step 1: Execute outer query.

postgres@localhost:akjn> select * from table1;
+------+
| col1 |
|------|
| 1    |
| 4    |
| 42   |
+------+

Note that the above is just an intermittent state that I've highlighted to demonstrate how it works.

Step 2: For each row returned by the outer query, run the inner query using the value of the outer query row.

Our inner query was select col1 from table2

Which means, step 2, in layman terms, can be considered as:

  1. select 1 from table2

  2. select 4 from table2

  3. select 42 from table2

Now if we try to consider the entire query, it becomes:

  1. select * from table1 where col1 in (select 1 from table2);

  2. select * from table1 where col1 in (select 4 from table2);

  3. select * from table1 where col1 in (select 42 from table2);

Which essentially becomes:

postgres@localhost:akjn> select * from table1 where col1 in (select 1 from table2);
+------+
| col1 |
|------|
| 1    |
+------+

postgres@localhost:akjn> select * from table1 where col1 in (select 4 from table2);
+------+
| col1 |
|------|
| 4    |
+------+

postgres@localhost:akjn> select * from table1 where col1 in (select 42 from table2);
+------+
| col1 |
|------|
| 42   |
+------+

And then the combined result set is returned, which is:

postgres@localhost:akjn> select * from table1 where col1 in (select col1 from table2);
+------+
| col1 |
|------|
| 1    |
| 4    |
| 42   |
+------+

Remember when we said that this is only the case when table2 is NOT empty? Can you guess why? Hint:

postgres@localhost:akjn> select * from table2; -- table2 is non-empty
+------+
| col2 |
|------|
| 777  |
+------+

postgres@localhost:akjn> select 1 from table2; -- returns 1
+----------+
| ?column? |
|----------|
| 1        |
+----------+

postgres@localhost:akjn> delete from table2; -- Delete all rows from table2
DELETE 1

postgres@localhost:akjn> select 1 from table2; -- returns nothing because table2 is empty :D
+----------+
| ?column? |
|----------|
+----------+

Why do I need to know all this?

TLDR: Whatever we discussed also applies to DELETE queries :)

Instead of selecting all rows of outer table, you could delete them (all of them!), like I did 2 weeks back :)

Story of my life

A couple of weeks back, I ran a DELETE query in our staging environment, which was something like:

delete from table1 where table1_column in (select table1_column from table2 where table2_column='something');

Result: All rows from table1 were deleted.

Hey Akshat, why did you run such a query without hesitation?

  • I had faith in my PostgreSQL skills.

  • I had faith in the where clause. What's the worst that can happen? The rows not matching the where clause can't be affected, right? Right? But unfortunately that was not the case and I learned it the hard way.

Fortunately this was just a table in our staging environment, so it didn't escalate much. But this could've been worse. A LOT WORSE.

Murphy's law: Anything that can go wrong will go wrong.

Learnings

  1. Always use transactions when running such queries, so that you have the option to rollback any changes made.

  2. This also taught me why aliases are important to be explicit about things, instead of letting it resolve the column names itself. Aliases are assigned during query execution. Ever faced query execution errors using ORMs in your application? Notice them the next time, they ALWAYS have aliases.

  3. While ORMs take care of aliases if properly defined, if any of your applications are directly raw querying the DB, please use aliases. Generally such layers don't have any info on the DB schema, and blindly run the raw queries. Imagine a scenario where the table columns have changed in the future 🤷‍♂️

Conclusion

That's all folks!

Hope this was an informative article, and you got some takeaways from it. Thanks for your time!

PS:

  1. I'd highly appreciate any feedback on this article - the good, the bad. How was the writing style? Was it easy to understand? Was it too long? Anything else?

  2. I'll be trying to write more blogs now. So, if you liked this one, make sure you follow me on Hashnode to tag along in the journey :)