Correlated Subqueries and Why You Need To Know Them
Queries are fun when you understand them!
Introduction
Let's say you have 2 tables:
table1
: Having a columncol1
table2
: Having a columncol2
create table table1 (col1 int);
create table table2 (col2 int);
Let's discuss a query:
select * from table1 where col1 in (select col1 from table2);
On a first look, it seems like this query should error out, because duh, col1
does not exist in table2
. Congratulations, that's NOT what will happen.
I'll spoil the answer: This query would return all rows from table1
if table2
is non-empty.
Try the following 2 cases for yourself.
Case 1: table2
is non-empty
postgres@localhost:akjn> select * from table1; -- table1 has some rows
+------+
| col1 |
|------|
| 1 |
| 2 |
| 3 |
| 4 |
+------+
postgres@localhost:akjn> select * from table2; -- table2 is non-empty
+------+
| col2 |
|------|
| 5 |
+------+
postgres@localhost:akjn> select * from table1 where col1 in (select col1 from table2); -- This returns all rows from table1
+------+
| col1 |
|------|
| 1 |
| 2 |
| 3 |
| 4 |
+------+
Case 2: table2
is empty
postgres@localhost:akjn> delete from table2; -- Delete all rows from table2
DELETE 1
postgres@localhost:akjn> select * from table1 where col1 in (select col1 from table2); -- This returns nothing since table2 is empty
+------+
| col1 |
|------|
+------+
Now that we've gone over "what" we have to wrap our heads around, we will try to understand why it works the way it works, and then we will try to understand why understanding this is important using an actual incident of my life that happened a couple of weeks back.
The Query Explained :)
To understand the query, we need to understand a couple of things first:
Scope Resolution
Correlated Queries
Scope Resolution
How does the query engine know which column are we talking about? We could have a bunch of tables in a complex query, with common column names. In such cases, it tries to resolve the column names (as in, it finds out which table the column name is referring to) based on some rules.
To keep it simple, let's discuss our example query:
select * from table1 where col1 in (select col1 from table2);
In a subquery, the column names are resolved by looking at the innermost scope and then moving out. The first scope where the column is resolved is used.
There can be 3 cases:
Case 1: When
table2
has a column namedcol1
col1
would be resolved astable2.col1
The query would be run as
select * from table1 where col1 in (select table2.col1 from table2);
Case 2: When
table2
does not have a column namedcol1
, buttable1
has (This is our case!!!!)col1
would be resolved astable1.col1
The query would be run as
select * from table1 where col1 in (select table1.col1 from table2);
Case 3: When neither
table2
nortable1
have a column namedcol1
- It will error out saying
column "col1" does not exist
- It will error out saying
Correlated Queries
What's a subquery?
A subquery (or nested query) is a query nested inside another query.
Example:
select * from table1 where col1 in (select col2 from table2);
In the above example, the subquery select col2 from table2
is executed first, and the output of the subquery is used by the outer query. Hence, steps of execution:
Subquery is executed (only once)
Outer query is executed using output of step 1 (only once)
What's a Correlated Subquery?
A correlated subquery is a subquery that contains a reference to a table from the outer query. The outer query uses the result of the inner query.
Example:
select * from table1 where col1 in (select col1 from table2);
This is the query we have been looking from the very start!
Why is this a correlated subquery? It's because the subquery (select col1 from table2
) is referencing a column (col1
) from a table of the outer query (table1
) --- thanks to scope resolution that we learned in the previous section. Note that this is a correlated subquery only because col1
does not exist in table2
, otherwise scope resolution would've resolved select col1 from table2
as select table2.col1 from table2
.
Now let's discuss how correlated subqueries are executed, in general.
For correlated subqueries, outer query executes first, and for every outer query row returned in the execution, the inner query is executed using the value of the outer query row.
Let's try to understand the above confusing statement with our query.
Coming back to our query
Our query:
select * from table1 where col1 in (select col1 from table2);
I've changed the data in the two tables to make the example easier to understand, and to avoid any confusion. Current state:
postgres@localhost:akjn> select * from table1;
+------+
| col1 |
|------|
| 1 |
| 4 |
| 42 |
+------+
postgres@localhost:akjn> select * from table2;
+------+
| col2 |
|------|
| 777 |
+------+
Step 1: Execute outer query.
postgres@localhost:akjn> select * from table1;
+------+
| col1 |
|------|
| 1 |
| 4 |
| 42 |
+------+
Note that the above is just an intermittent state that I've highlighted to demonstrate how it works.
Step 2: For each row returned by the outer query, run the inner query using the value of the outer query row.
Our inner query was select col1 from table2
Which means, step 2, in layman terms, can be considered as:
select 1 from table2
select 4 from table2
select 42 from table2
Now if we try to consider the entire query, it becomes:
select * from table1 where col1 in (select 1 from table2);
select * from table1 where col1 in (select 4 from table2);
select * from table1 where col1 in (select 42 from table2);
Which essentially becomes:
postgres@localhost:akjn> select * from table1 where col1 in (select 1 from table2);
+------+
| col1 |
|------|
| 1 |
+------+
postgres@localhost:akjn> select * from table1 where col1 in (select 4 from table2);
+------+
| col1 |
|------|
| 4 |
+------+
postgres@localhost:akjn> select * from table1 where col1 in (select 42 from table2);
+------+
| col1 |
|------|
| 42 |
+------+
And then the combined result set is returned, which is:
postgres@localhost:akjn> select * from table1 where col1 in (select col1 from table2);
+------+
| col1 |
|------|
| 1 |
| 4 |
| 42 |
+------+
Remember when we said that this is only the case when table2
is NOT empty? Can you guess why? Hint:
postgres@localhost:akjn> select * from table2; -- table2 is non-empty
+------+
| col2 |
|------|
| 777 |
+------+
postgres@localhost:akjn> select 1 from table2; -- returns 1
+----------+
| ?column? |
|----------|
| 1 |
+----------+
postgres@localhost:akjn> delete from table2; -- Delete all rows from table2
DELETE 1
postgres@localhost:akjn> select 1 from table2; -- returns nothing because table2 is empty :D
+----------+
| ?column? |
|----------|
+----------+
Why do I need to know all this?
TLDR: Whatever we discussed also applies to DELETE queries :)
Instead of selecting all rows of outer table, you could delete them (all of them!), like I did 2 weeks back :)
Story of my life
A couple of weeks back, I ran a DELETE query in our staging environment, which was something like:
delete from table1 where table1_column in (select table1_column from table2 where table2_column='something');
Result: All rows from table1
were deleted.
Hey Akshat, why did you run such a query without hesitation?
I had faith in my PostgreSQL skills.
I had faith in the
where
clause. What's the worst that can happen? The rows not matching the where clause can't be affected, right? Right? But unfortunately that was not the case and I learned it the hard way.
Fortunately this was just a table in our staging environment, so it didn't escalate much. But this could've been worse. A LOT WORSE.
Murphy's law: Anything that can go wrong will go wrong.
Learnings
Always use transactions when running such queries, so that you have the option to rollback any changes made.
This also taught me why aliases are important to be explicit about things, instead of letting it resolve the column names itself. Aliases are assigned during query execution. Ever faced query execution errors using ORMs in your application? Notice them the next time, they ALWAYS have aliases.
While ORMs take care of aliases if properly defined, if any of your applications are directly raw querying the DB, please use aliases. Generally such layers don't have any info on the DB schema, and blindly run the raw queries. Imagine a scenario where the table columns have changed in the future 🤷♂️
Conclusion
That's all folks!
Hope this was an informative article, and you got some takeaways from it. Thanks for your time!
PS:
I'd highly appreciate any feedback on this article - the good, the bad. How was the writing style? Was it easy to understand? Was it too long? Anything else?
I'll be trying to write more blogs now. So, if you liked this one, make sure you follow me on Hashnode to tag along in the journey :)