<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Akshat Jain's blog]]></title><description><![CDATA[Akshat Jain's blog]]></description><link>https://blog.akjn.dev</link><generator>RSS for Node</generator><lastBuildDate>Mon, 20 Apr 2026 06:33:20 GMT</lastBuildDate><atom:link href="https://blog.akjn.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Explaining and Analysing Postgres EXPLAIN ANALYSE]]></title><description><![CDATA[Introduction
SQL is a declarative query language. That means, you tell it what you want, you do not tell it how it's supposed to get it.
This blogs aims to give an intuition about how a query is internally run (on a very high-level). Understanding th...]]></description><link>https://blog.akjn.dev/explaining-and-analysing-postgres-explain-analyse</link><guid isPermaLink="true">https://blog.akjn.dev/explaining-and-analysing-postgres-explain-analyse</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[SQL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Akshat Jain]]></dc:creator><pubDate>Tue, 24 Oct 2023 12:27:04 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>SQL is a declarative query language. That means, you tell it what you want, you do not tell it how it's supposed to get it.</p>
<p>This blog aims to give an intuition for <em>how</em> a query is run internally (at a very high level). Understanding this should help you structure better queries, and debug queries that take unexpectedly long to execute.</p>
<p>(PS: You can skip to section <code>Time to dig into a lot of examples!</code> if you are familiar with reading query plans)</p>
<p>In a typical database system like PostgreSQL, a query goes through four stages:</p>
<ol>
<li><p>Parsing: The system checks the query for syntax errors and ensures it is correctly structured.</p>
</li>
<li><p>Transformation: The query is converted into an internal representation for optimization and execution. This involves tasks like alias resolution and object existence checks.</p>
</li>
<li><p>Planning: An execution plan is created, determining the most efficient way to retrieve data based on various factors.</p>
</li>
<li><p>Execution: The system executes the plan generated in the previous step.</p>
</li>
</ol>
<p>This blog will go over the following aspects:</p>
<ol>
<li><p>What a Query Plan looks like</p>
</li>
<li><p>An overview of the "Planning" step using <code>EXPLAIN ANALYSE</code></p>
</li>
<li><p>Understand the kind of decisions the query planner takes to generate an optimised execution plan. Since this isn't a topic that can be exhaustively summed up in a blog, we will go over a bunch of examples to develop an intuition about the kind of factors that contribute to a query plan.</p>
</li>
</ol>
<p>This blog will NOT go over:</p>
<ol>
<li><p><em>How</em> the query planner comes up with the optimised plan</p>
</li>
<li><p>How the generated plan is actually executed</p>
</li>
</ol>
<blockquote>
<p>Note: All examples mentioned ahead are based on PostgreSQL 14.5, and there may be minor differences in other database systems or versions. However, the high-level concepts are consistent.</p>
</blockquote>
<p>With that said, let's jump right into it!</p>
<h1 id="heading-hey-akshat-what-is-a-query-plan">Hey Akshat, what is a Query Plan?</h1>
<p>A Query Plan is essentially a detailed, step-by-step "execution plan" that the database system generates for every query it executes.</p>
<p>Let's see an example. I have a table <code>table1</code> with 1000 rows:</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">from</span> table1;
+<span class="hljs-comment">-------+</span>
| count |
|<span class="hljs-comment">-------|</span>
| 1000  |
+<span class="hljs-comment">-------+</span>
</code></pre>
<p>Let's try to find out "what" it did behind the scenes to give us this output.</p>
<h1 id="heading-explain-and-analyse">EXPLAIN and ANALYSE</h1>
<p>Before we begin to understand the query plan of specific queries, we need to understand <code>EXPLAIN</code> and <code>EXPLAIN ANALYSE</code>.</p>
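<p>In simple terms, <code>EXPLAIN</code> shows the execution plan along with the planner's estimated "cost" without running the query, whereas <code>EXPLAIN ANALYSE</code> actually executes the query and reports the real timings alongside those estimates. (<code>ANALYSE</code> is an accepted spelling of <code>ANALYZE</code> in PostgreSQL.)</p>
<p>Let's use these for our simple query that counts the number of rows in <code>table1</code>.</p>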
<p><code>EXPLAIN</code> gives:</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> <span class="hljs-keyword">select</span> <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">from</span> table1;
+<span class="hljs-comment">----------------------------------------------------------------+</span>
| QUERY PLAN                                                     |
|<span class="hljs-comment">----------------------------------------------------------------|</span>
| Aggregate  (cost=17.50..17.51 rows=1 width=8)                  |
|   -&gt;  Seq Scan on table1  (cost=0.00..15.00 rows=1000 width=0) |
+<span class="hljs-comment">----------------------------------------------------------------+</span>
</code></pre>
<p>Whereas <code>EXPLAIN ANALYSE</code> gives:</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">from</span> table1;
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                                  |
|<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------|</span>
| Aggregate  (cost=17.50..17.51 rows=1 width=8) (actual time=0.188..0.189 rows=1 loops=1)                     |
|   -&gt;  Seq Scan on table1  (cost=0.00..15.00 rows=1000 width=0) (actual time=0.010..0.110 rows=1000 loops=1) |
| Planning Time: 0.028 ms                                                                                     |
| Execution Time: 0.203 ms                                                                                    |
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------+</span>
</code></pre>
<p>Let's ignore the complicated terms in the above outputs for a bit, and focus on the differences. On a quick look, the output of <code>EXPLAIN ANALYSE</code> has this additional info:</p>
<ol>
<li><p>A second set of parentheses holding some "actual" values (like <code>(actual time=0.188..0.189 rows=1 loops=1)</code> and <code>(actual time=0.010..0.110 rows=1000 loops=1)</code>)</p>
</li>
<li><p>Planning time</p>
</li>
<li><p>Execution time</p>
</li>
</ol>
<p>This is because <code>EXPLAIN</code> only gives an "estimate" of the cost. You need to actually run the query to get the actual time, which is done by <code>EXPLAIN ANALYSE</code>.</p>
<h1 id="heading-hey-akshat-how-to-read-the-complicated-explain-analyse-query-plan">Hey Akshat, how to read the complicated <code>EXPLAIN ANALYSE</code> query plan?</h1>
<p>Let's analyse the query plan we were discussing in the previous section:</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">from</span> table1;
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                                  |
|<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------|</span>
| Aggregate  (cost=17.50..17.51 rows=1 width=8) (actual time=0.188..0.189 rows=1 loops=1)                     |
|   -&gt;  Seq Scan on table1  (cost=0.00..15.00 rows=1000 width=0) (actual time=0.010..0.110 rows=1000 loops=1) |
| Planning Time: 0.028 ms                                                                                     |
| Execution Time: 0.203 ms                                                                                    |
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------+</span>
</code></pre>
<p>A query plan is a <strong>Tree structure of "Plan Nodes"</strong>, with one root node, and every other node introduced by a <code>-&gt;</code> marker. So, for the above query plan, we have the following nodes:</p>
<ol>
<li><p>Aggregate (also the Root Node)</p>
</li>
<li><p>Seq Scan</p>
</li>
</ol>
<h3 id="heading-question-1-how-to-read-this-how-does-postgres-execute-this">Question 1: How to read this? / How does Postgres execute this?</h3>
<p>Answer: From inside out.</p>
<p>That is, for the above example, in very simple terms, it will try to do <code>Seq Scan on table1</code> first, get the output, and give that output to the parent node (that is, <code>Aggregate</code>) for computation.</p>
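<p>To make the "inside out" reading order concrete, here is a minimal Python sketch. The <code>PlanNode</code> class and <code>execution_order</code> function are hypothetical names made up for illustration; they only model the order in which you read the nodes, not how the Postgres executor is actually implemented.</p>

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """A hypothetical, simplified stand-in for one node of a query plan."""
    name: str
    children: list = field(default_factory=list)


def execution_order(node):
    """Return node names children-first (post-order), i.e. from inside out."""
    order = []
    for child in node.children:
        order.extend(execution_order(child))
    order.append(node.name)
    return order


# The plan from above: Aggregate is the root, Seq Scan is its only child.
plan = PlanNode("Aggregate", [PlanNode("Seq Scan on table1")])
print(execution_order(plan))  # ['Seq Scan on table1', 'Aggregate']
```

<p>The same function works for deeper trees: the most indented nodes come out first, and the root comes out last.</p>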
<p>We will look at more complex examples later, which should give more intuition on this. Let's focus on the other parts of the query plan:</p>
<ol>
<li><p><code>Planning Time</code> is the time taken to generate the query plan.</p>
</li>
<li><p><code>Execution Time</code> is the time taken to execute the query plan.</p>
</li>
<li><p>Some weird looking parentheses.</p>
</li>
</ol>
<h3 id="heading-question-2-what-are-the-weird-looking-parenthesis">Question 2: What are the weird looking parentheses?</h3>
<p>We have some weird looking parentheses of the format <code>(cost=0.00..15.00 rows=1000 width=0)</code>. Let's dig into the 3 parts inside them:</p>
<ol>
<li><p><code>cost</code>: This has 2 components - Startup Cost and Total Cost. For <code>cost=0.00..15.00</code>, <code>0.00</code> is the startup cost and <code>15.00</code> is the total cost.</p>
<ol>
<li><p>Startup Cost: This is the estimated cost (in cost units, not time) incurred before the node can return its first row. For a node that has to consume all of its input first, such as <code>Aggregate</code>, this includes the total cost of its child nodes.</p>
</li>
<li><p>Total Cost: This is the estimated cost for the node to return all of its rows, including the cost of its child nodes.</p>
</li>
</ol>
</li>
<li><p><code>rows</code>: This is the estimated number of rows returned by this node to the parent node.</p>
</li>
<li><p><code>width</code>: This is the estimated average width (in bytes) of rows returned by this node.</p>
</li>
</ol>
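<p>If you want to pull these fields out of a plan programmatically, a small regular expression is enough for lines shaped like the ones in this post. This is only a sketch: <code>NODE_RE</code> and <code>parse_node</code> are hypothetical names, and the pattern assumes the text format shown here (PostgreSQL 14.5), not every possible plan line.</p>

```python
import re

# Matches the estimate group, optionally followed by the "actual" group.
NODE_RE = re.compile(
    r"\(cost=(\d+\.\d+)\.\.(\d+\.\d+) rows=(\d+) width=(\d+)\)"
    r"(?: \(actual time=(\d+\.\d+)\.\.(\d+\.\d+) rows=(\d+) loops=(\d+)\))?"
)


def parse_node(line):
    """Return the numbers from one plan-node line, or None if it has none."""
    match = NODE_RE.search(line)
    if match is None:
        return None
    groups = match.groups()
    parsed = {
        "startup_cost": float(groups[0]),
        "total_cost": float(groups[1]),
        "estimated_rows": int(groups[2]),
        "width": int(groups[3]),
    }
    if groups[4] is not None:  # only present for EXPLAIN ANALYSE output
        parsed["actual_startup_ms"] = float(groups[4])
        parsed["actual_total_ms"] = float(groups[5])
        parsed["actual_rows"] = int(groups[6])
        parsed["loops"] = int(groups[7])
    return parsed


line = ("Seq Scan on table1  (cost=0.00..15.00 rows=1000 width=0) "
        "(actual time=0.010..0.110 rows=1000 loops=1)")
print(parse_node(line))
```

<p>Lines such as <code>Planning Time</code> contain no parenthesised groups, so <code>parse_node</code> returns <code>None</code> for them.</p>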
<h3 id="heading-question-3-what-is-the-unit-of-cost">Question 3: What is the unit of cost?</h3>
<p>The costs are in an arbitrary unit, defined relative to a set of planner cost settings. I'm highlighting some of those settings here for better intuition:</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> <span class="hljs-keyword">name</span>, setting, short_desc <span class="hljs-keyword">from</span> pg_settings <span class="hljs-keyword">where</span> <span class="hljs-keyword">name</span> <span class="hljs-keyword">in</span> (<span class="hljs-string">'seq_page_cost'</span>, <span class="hljs-string">'cpu_tuple_cost'</span>, <span class="hljs-string">'cpu_operator_cost'</span>);
+<span class="hljs-comment">-------------------+---------+---------------------------------------------------------------------------------------+</span>
| name              | setting | short_desc                                                                            |
|<span class="hljs-comment">-------------------+---------+---------------------------------------------------------------------------------------|</span>
| seq_page_cost     | 1       | Sets the planner's estimate of the cost of a sequentially fetched disk page.          |
| cpu_tuple_cost    | 0.01    | Sets the planner's estimate of the cost of processing each tuple (row).               |
| cpu_operator_cost | 0.0025  | Sets the planner's estimate of the cost of processing each operator or function call. |
+<span class="hljs-comment">-------------------+---------+---------------------------------------------------------------------------------------+</span>
</code></pre>
<p>So if we are fetching 100 pages sequentially, it would contribute to <code>100*1 = 100</code> cost units. If we are processing 100 rows, it would contribute to <code>100*0.01 = 1</code> cost units.</p>
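<p>As a worked example, the sketch below applies this arithmetic to the <code>Seq Scan</code> node from earlier. It assumes the sequential scan cost is simply pages times <code>seq_page_cost</code> plus rows times <code>cpu_tuple_cost</code>, and that <code>table1</code> occupies 5 pages. The page count is my assumption, chosen because it reproduces the plan's number; the function name <code>seq_scan_cost</code> is made up for illustration.</p>

```python
# Default planner settings shown in the table above.
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01


def seq_scan_cost(pages, rows):
    """Estimated total cost of reading every page and processing every row."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST


# Assumption: table1 (1000 single-integer rows) fits in 5 pages.
cost = seq_scan_cost(pages=5, rows=1000)
print(f"{cost:.2f}")  # 15.00
```

<p>That matches the <code>cost=0.00..15.00</code> reported for the sequential scan: 5 units for the pages plus 10 units for the rows.</p>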
<h3 id="heading-lets-see-if-we-can-understand-the-query-plan-now">Let's see if we can understand the query plan now!</h3>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> <span class="hljs-keyword">count</span>(*) <span class="hljs-keyword">from</span> table1;
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                                  |
|<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------|</span>
| Aggregate  (cost=17.50..17.51 rows=1 width=8) (actual time=0.188..0.189 rows=1 loops=1)                     |
|   -&gt;  Seq Scan on table1  (cost=0.00..15.00 rows=1000 width=0) (actual time=0.010..0.110 rows=1000 loops=1) |
| Planning Time: 0.028 ms                                                                                     |
| Execution Time: 0.203 ms                                                                                    |
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------+</span>
</code></pre>
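<p>We wanted to find the count of rows in <code>table1</code>. For that, Postgres first does a Sequential Scan on <code>table1</code>, which has no startup cost (0.00) and an estimated total cost of 15.00 units to return 1000 rows to the parent node. The parent Aggregate node has an estimated startup cost of 17.50 and a total cost of 17.51, and is estimated to return 1 row.</p>
<p>In terms of the actual time, the first number is the time (in ms) until the node returned its first row, and the second is the time until it returned its last row. The Sequential Scan on <code>table1</code> produced its first row at 0.010 ms and finished at 0.110 ms. The Aggregate node produced its only row at 0.188 ms and finished at 0.189 ms; it could not return anything earlier because it has to consume all of its input first.</p>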
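<p>The Aggregate node's numbers can be reproduced with similar arithmetic. The sketch below is one reading that happens to match this plan, under the assumption that counting costs one <code>cpu_operator_cost</code> per input row and that emitting the single result row costs one <code>cpu_tuple_cost</code>. The function name <code>plain_aggregate_cost</code> is hypothetical.</p>

```python
# Default planner settings shown earlier.
CPU_TUPLE_COST = 0.01
CPU_OPERATOR_COST = 0.0025


def plain_aggregate_cost(child_total_cost, input_rows):
    """Return (startup, total) cost for an aggregate over all input rows."""
    # Nothing can be returned until every input row has been counted.
    startup = child_total_cost + input_rows * CPU_OPERATOR_COST
    # Emitting the single result row adds one per-tuple charge.
    total = startup + CPU_TUPLE_COST
    return startup, total


startup, total = plain_aggregate_cost(child_total_cost=15.00, input_rows=1000)
print(f"{startup:.2f}..{total:.2f}")  # 17.50..17.51
```

<p>Notice how the child's total cost (15.00) is folded into the parent's startup cost, which is why the Aggregate's startup cost is higher than the scan's total cost.</p>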
<h1 id="heading-time-to-dig-into-a-lot-of-examples">Time to dig into a lot of examples!</h1>
<h3 id="heading-example-1-selecting-everything-from-table-having-1000-rows-vs-10000000-rows">Example 1: Selecting everything from table having 1000 rows vs 10000000 rows</h3>
<pre><code class="lang-sql"><span class="hljs-comment">-- table1 has 1000 rows</span>
postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1;
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                            |
|<span class="hljs-comment">-------------------------------------------------------------------------------------------------------|</span>
| Seq Scan on table1  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.008..0.107 rows=1000 loops=1) |
| Planning Time: 0.042 ms                                                                               |
| Execution Time: 0.179 ms                                                                              |
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------+</span>

<span class="hljs-comment">-- Insert 10000000 rows</span>
postgres@localhost:akjn&gt; <span class="hljs-keyword">delete</span> <span class="hljs-keyword">from</span> table1; <span class="hljs-keyword">insert</span> <span class="hljs-keyword">into</span> table1 (<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> generate_series(<span class="hljs-number">1</span>, <span class="hljs-number">10000000</span>));

<span class="hljs-comment">-- table1 has 10000000 rows</span>
postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1;
+<span class="hljs-comment">---------------------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                                          |
|<span class="hljs-comment">---------------------------------------------------------------------------------------------------------------------|</span>
| Seq Scan on table1  (cost=0.00..132759.00 rows=8850600 width=4) (actual time=0.015..1126.625 rows=10000000 loops=1) |
| Planning Time: 0.023 ms                                                                                             |
| Execution Time: 1668.256 ms                                                                                         |
+<span class="hljs-comment">---------------------------------------------------------------------------------------------------------------------+</span>
</code></pre>
<p>Since we need to get all rows, there's no option other than sequentially scanning all rows in both cases and returning them as output.</p>
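<p>One detail worth noticing in the second plan is that the estimate (<code>rows=8850600</code>) differs from the actual count (<code>rows=10000000</code>), presumably because the planner's statistics had not caught up with the bulk insert yet. As a rough sketch, and assuming the sequential scan cost is pages times <code>seq_page_cost</code> plus rows times <code>cpu_tuple_cost</code> with default settings, we can back out the number of pages the planner believed the table had. The function name <code>implied_pages</code> is made up for illustration.</p>

```python
# Default planner settings shown earlier.
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01


def implied_pages(total_cost, estimated_rows):
    """Solve total = pages * seq_page_cost + rows * cpu_tuple_cost for pages."""
    return round((total_cost - estimated_rows * CPU_TUPLE_COST) / SEQ_PAGE_COST)


pages = implied_pages(total_cost=132759.00, estimated_rows=8850600)
print(pages)  # 44253

# How far off the row estimate was from what the scan actually returned.
estimate_ratio = 8850600 / 10000000
print(f"{estimate_ratio:.3f}")  # 0.885
```

<p>So most of the 132759.00 cost units come from the row processing charge (about 88506 units), with the remainder coming from reading pages.</p>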
<p>Let's make things more interesting now.</p>
<h3 id="heading-example-2-selecting-a-particular-row-from-table-having-1000-rows-vs-10000000-rows">Example 2: Selecting a particular row from table having 1000 rows vs 10000000 rows</h3>
<pre><code class="lang-sql"> <span class="hljs-comment">-- table1 has 1000 rows </span>
postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 = <span class="hljs-number">50</span>;
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                      |
|<span class="hljs-comment">-------------------------------------------------------------------------------------------------|</span>
| Seq Scan on table1  (cost=0.00..26.50 rows=1 width=4) (actual time=0.020..0.199 rows=1 loops=1) |
|   Filter: (col1 = 50)                                                                           |
|   Rows Removed by Filter: 999                                                                   |
| Planning Time: 0.054 ms                                                                         |
| Execution Time: 0.208 ms                                                                        |
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------+</span>

 <span class="hljs-comment">-- Insert 10000000 rows</span>
postgres@localhost:akjn&gt; <span class="hljs-keyword">delete</span> <span class="hljs-keyword">from</span> table1; <span class="hljs-keyword">insert</span> <span class="hljs-keyword">into</span> table1 (<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> generate_series(<span class="hljs-number">1</span>, <span class="hljs-number">10000000</span>));

<span class="hljs-comment">-- table1 has 10000000 rows</span>
postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 = <span class="hljs-number">50</span>;
+<span class="hljs-comment">-----------------------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                                            |
|<span class="hljs-comment">-----------------------------------------------------------------------------------------------------------------------|</span>
| Gather  (cost=1000.00..61721.76 rows=1 width=4) (actual time=25.289..437.156 rows=1 loops=1)                          |
|   Workers Planned: 2                                                                                                  |
|   Workers Launched: 2                                                                                                 |
|   -&gt;  Parallel Seq Scan on table1  (cost=0.00..60721.66 rows=1 width=4) (actual time=289.218..425.408 rows=0 loops=3) |
|         Filter: (col1 = 50)                                                                                           |
|         Rows Removed by Filter: 3333333                                                                               |
| Planning Time: 0.042 ms                                                                                               |
| Execution Time: 437.171 ms                                                                                            |
+<span class="hljs-comment">-----------------------------------------------------------------------------------------------------------------------+</span>
</code></pre>
<p>See the difference? When the table had a small amount of data (1000 rows), the planner decided that the optimal plan was a plain sequential scan over all the rows. But when the table had a large amount of data (10000000 rows), it decided that the optimal plan was to launch 2 worker processes and parallelise the scan across the 2 workers + 1 leader process.</p>
<p>Observations:</p>
<ol>
<li><p>The leader and each worker ran the <code>Parallel Seq Scan</code> node, which can be seen in <code>loops=3</code> (indicating that this node was run 3 times in total).</p>
</li>
<li><p>Rows Removed by Filter for the <code>Parallel Seq Scan</code> node is reported as <code>3333333</code>, which is one-third of the total number of rows. With <code>loops=3</code>, this figure is an average per loop rather than an estimate. Hence each process on average filtered out one-third of the total number of rows (which is almost all of the rows each process was working on, which makes sense!)</p>
</li>
</ol>
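<p>The per-loop averaging also explains why the <code>Parallel Seq Scan</code> node shows <code>rows=0</code> even though the query returned a row. The sketch below mimics that display, assuming the value shown is the total across all loops divided by the number of loops and printed without decimals (as in the PostgreSQL 14 output above). The function name <code>displayed_rows</code> is hypothetical.</p>

```python
def displayed_rows(total_rows, loops):
    """Average a row count over loops and format it without decimals."""
    return f"{total_rows / loops:.0f}"


# One matching row, found by one of the three processes.
print(displayed_rows(total_rows=1, loops=3))        # 0
# The other 9999999 rows were filtered out across the three processes.
print(displayed_rows(total_rows=9999999, loops=3))  # 3333333
```

<p>To recover an approximate total for a node, multiply the displayed value by <code>loops</code>.</p>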
<p>You can read more about parallel queries at <a target="_blank" href="https://www.postgresql.org/docs/current/how-parallel-query-works.html">https://www.postgresql.org/docs/current/how-parallel-query-works.html</a>.</p>
<h3 id="heading-example-3-having-a-unique-index-on-the-column-you-are-selecting">Example 3: Having a unique index on the column you are selecting</h3>
<pre><code class="lang-sql"><span class="hljs-comment">-- Add a UNIQUE index on col1</span>
postgres@localhost:akjn&gt; <span class="hljs-keyword">alter</span> <span class="hljs-keyword">table</span> table1 <span class="hljs-keyword">add</span> <span class="hljs-keyword">constraint</span> col1_unique <span class="hljs-keyword">unique</span>(col1);

 <span class="hljs-comment">-- table1 has 1000 rows </span>
postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 = <span class="hljs-number">50</span>;
+<span class="hljs-comment">--------------------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                                         |
|<span class="hljs-comment">--------------------------------------------------------------------------------------------------------------------|</span>
| Bitmap Heap Scan on table1  (cost=4.28..8.30 rows=1 width=4) (actual time=0.011..0.012 rows=1 loops=1)             |
|   Recheck Cond: (col1 = 50)                                                                                        |
|   Heap Blocks: exact=1                                                                                             |
|   -&gt;  Bitmap Index Scan on col1_unique  (cost=0.00..4.28 rows=1 width=0) (actual time=0.007..0.008 rows=1 loops=1) |
|         Index Cond: (col1 = 50)                                                                                    |
| Planning Time: 0.051 ms                                                                                            |
| Execution Time: 0.027 ms                                                                                           |
+<span class="hljs-comment">--------------------------------------------------------------------------------------------------------------------+</span>

<span class="hljs-comment">-- If we wait for a while</span>
postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 = <span class="hljs-number">50</span>;
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                                              |
|<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------------------|</span>
| Index Only Scan using col1_unique on table1  (cost=0.28..4.29 rows=1 width=4) (actual time=0.010..0.011 rows=1 loops=1) |
|   Index Cond: (col1 = 50)                                                                                               |
|   Heap Fetches: 0                                                                                                       |
| Planning Time: 0.047 ms                                                                                                 |
| Execution Time: 0.024 ms                                                                                                |
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------------------+</span>
</code></pre>
<p>This is interesting. Running the same query gives us "Bitmap Index Scan + Bitmap Heap Scan" query plan initially (immediately after index creation), but gives us "Index Only Scan" after some time.</p>
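<p>As we discussed earlier, these are "estimates". They are based on the current state of the statistics and bookkeeping that the database system maintains for a given table/database. The system needs some time to update these in the background (in PostgreSQL this is the job of autovacuum), or you can run commands such as <code>VACUUM</code> and <code>ANALYZE</code> yourself to update them. An Index Only Scan in particular pays off only when Postgres knows it can skip visiting the table itself, which is what <code>Heap Fetches: 0</code> indicates. Once this information is up to date, the planner realises that doing an Index Only Scan is more efficient, and it switches to doing that.</p>
<h3 id="heading-example-4-having-a-unique-index-and-querying-for-more-than-one-values-for-the-column">Example 4: Having a unique index and querying for more than one value for the column</h3>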
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 = <span class="hljs-number">1</span> <span class="hljs-keyword">or</span> col1 = <span class="hljs-number">2</span>;
+<span class="hljs-comment">--------------------------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                                               |
|<span class="hljs-comment">--------------------------------------------------------------------------------------------------------------------------|</span>
| Bitmap Heap Scan on table1  (cost=8.87..16.86 rows=2 width=4) (actual time=0.014..0.015 rows=2 loops=1)                  |
|   Recheck Cond: ((col1 = 1) OR (col1 = 2))                                                                               |
|   Heap Blocks: exact=1                                                                                                   |
|   -&gt;  BitmapOr  (cost=8.87..8.87 rows=2 width=0) (actual time=0.011..0.011 rows=0 loops=1)                               |
|         -&gt;  Bitmap Index Scan on col1_unique  (cost=0.00..4.43 rows=1 width=0) (actual time=0.009..0.009 rows=1 loops=1) |
|               Index Cond: (col1 = 1)                                                                                     |
|         -&gt;  Bitmap Index Scan on col1_unique  (cost=0.00..4.43 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1) |
|               Index Cond: (col1 = 2)                                                                                     |
| Planning Time: 0.050 ms                                                                                                  |
| Execution Time: 0.030 ms                                                                                                 |
+<span class="hljs-comment">--------------------------------------------------------------------------------------------------------------------------+</span>
</code></pre>
<p>When we are querying for multiple values, it resorts to doing a "Bitmap Index Scan" for both of the individual filters, gets the result, and does a "BitmapOr" on the two results.</p>
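<p>Conceptually, each "Bitmap Index Scan" produces a set of row locations, and "BitmapOr" takes their union before the table is visited. The sketch below models this with plain Python sets. The <code>toy_index</code> data and the function names are hypothetical, and real Postgres bitmaps are far more compact than this.</p>

```python
# Hypothetical toy index: column value -> list of (heap block, offset) locations.
toy_index = {1: [(0, 1)], 2: [(0, 2)], 3: [(0, 3)]}


def bitmap_index_scan(index, value):
    """Collect the row locations matching one equality condition."""
    return set(index.get(value, []))


def bitmap_or(*bitmaps):
    """Union several bitmaps so that each location is visited only once."""
    combined = set()
    for bitmap in bitmaps:
        combined.update(bitmap)
    return combined


locations = bitmap_or(
    bitmap_index_scan(toy_index, 1),
    bitmap_index_scan(toy_index, 2),
)
heap_blocks = {block for block, _offset in locations}
print(sorted(locations))  # [(0, 1), (0, 2)]
print(len(heap_blocks))   # 1
```

<p>In this toy data both rows live in the same block, so the heap scan only has to read one block, similar to the <code>Heap Blocks: exact=1</code> line in the plan.</p>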
<p>However, if we write the condition with range comparisons, we get an "Index Only Scan" again:</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 &gt;= <span class="hljs-number">1</span> <span class="hljs-keyword">or</span> col1 &lt;=  <span class="hljs-number">2</span>;
+<span class="hljs-comment">------------------------------------------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                                                               |
|<span class="hljs-comment">------------------------------------------------------------------------------------------------------------------------------------------|</span>
| Index Only Scan using col1_unique on table1  (cost=0.42..30750.28 rows=988043 width=4) (actual time=0.009..113.048 rows=1000000 loops=1) |
|   Filter: ((col1 &gt;= 1) OR (col1 &lt;= 2))                                                                                                   |
|   Heap Fetches: 0                                                                                                                        |
| Planning Time: 0.128 ms                                                                                                                  |
| Execution Time: 169.281 ms                                                                                                               |
+<span class="hljs-comment">------------------------------------------------------------------------------------------------------------------------------------------+</span>
</code></pre>
<p>We know that the two queries are equivalent for the kind of restrictions we have on the table. However, the database system operates solely on the statistical information it has about the data. Therefore, it's not surprising to observe a difference.</p>
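<p>To get a feel for the statistical information the planner works from, a minimal sketch like the following can help. It assumes the <code>table1</code> and <code>col1</code> names used above; <code>pg_class</code> and <code>pg_stats</code> are standard Postgres catalogs, and the values you see will depend on your data and on when the statistics were last collected.</p>
<pre><code class="lang-sql">-- Refresh the planner statistics for table1
analyse table1;

-- The planner's idea of the table size (estimated rows and pages)
select relname, reltuples, relpages
from pg_class
where relname = 'table1';

-- Per-column statistics collected for col1
select attname, null_frac, n_distinct, histogram_bounds
from pg_stats
where tablename = 'table1' and attname = 'col1';
</code></pre>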
<h3 id="heading-example-5-join-operations">Example 5: Join operations</h3>
<p>Let's say we have 2 tables:</p>
<ul>
<li><p><code>table1</code> having an integer column <code>col1</code></p>
</li>
<li><p><code>table2</code> having an integer column <code>col2</code></p>
</li>
</ul>
<pre><code class="lang-sql"><span class="hljs-comment">-- Create the two tables and enter 1000 records (1 to 1000) in both tables</span>
<span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> table1 (col1 <span class="hljs-built_in">int</span>); <span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> table2 (col2 <span class="hljs-built_in">int</span>);
<span class="hljs-keyword">insert</span> <span class="hljs-keyword">into</span> table1 (<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> generate_series(<span class="hljs-number">1</span>, <span class="hljs-number">1000</span>));
<span class="hljs-keyword">insert</span> <span class="hljs-keyword">into</span> table2 (<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> generate_series(<span class="hljs-number">1</span>, <span class="hljs-number">1000</span>));
</code></pre>
<p>Now if we try to do a join operation:</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">join</span> table2 <span class="hljs-keyword">on</span> col1=col2;
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                                        |
|<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------------|</span>
| Merge Join  (cost=359.57..860.00 rows=32512 width=8) (actual time=0.535..1.059 rows=1000 loops=1)                 |
|   Merge Cond: (table1.col1 = table2.col2)                                                                         |
|   -&gt;  Sort  (cost=179.78..186.16 rows=2550 width=4) (actual time=0.283..0.370 rows=1000 loops=1)                  |
|         Sort Key: table1.col1                                                                                     |
|         Sort Method: quicksort  Memory: 71kB                                                                      |
|         -&gt;  Seq Scan on table1  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.052..0.176 rows=1000 loops=1) |
|   -&gt;  Sort  (cost=179.78..186.16 rows=2550 width=4) (actual time=0.249..0.323 rows=1000 loops=1)                  |
|         Sort Key: table2.col2                                                                                     |
|         Sort Method: quicksort  Memory: 71kB                                                                      |
|         -&gt;  Seq Scan on table2  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.016..0.136 rows=1000 loops=1) |
| Planning Time: 0.055 ms                                                                                           |
| Execution Time: 1.142 ms                                                                                          |
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------------+</span>
</code></pre>
<p>Now let's change the data in the two tables and try doing the same join operation:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- table1 has (1 to 1000), table2 has (901 to 1000)</span>
<span class="hljs-keyword">delete</span> <span class="hljs-keyword">from</span> table1; <span class="hljs-keyword">insert</span> <span class="hljs-keyword">into</span> table1 (<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> generate_series(<span class="hljs-number">1</span>, <span class="hljs-number">1000</span>));
<span class="hljs-keyword">delete</span> <span class="hljs-keyword">from</span> table2; <span class="hljs-keyword">insert</span> <span class="hljs-keyword">into</span> table2 (<span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> generate_series(<span class="hljs-number">901</span>, <span class="hljs-number">1000</span>));

postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">join</span> table2 <span class="hljs-keyword">on</span> col1=col2;
+<span class="hljs-comment">------------------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                                       |
|<span class="hljs-comment">------------------------------------------------------------------------------------------------------------------|</span>
| Hash Join  (cost=27.50..71.25 rows=1000 width=8) (actual time=0.412..0.445 rows=100 loops=1)                     |
|   Hash Cond: (table1.col1 = table2.col2)                                                                         |
|   -&gt;  Seq Scan on table1  (cost=0.00..27.00 rows=1800 width=4) (actual time=0.084..0.219 rows=1000 loops=1)      |
|   -&gt;  Hash  (cost=15.00..15.00 rows=1000 width=4) (actual time=0.096..0.096 rows=100 loops=1)                    |
|         Buckets: 1024  Batches: 1  Memory Usage: 12kB                                                            |
|         -&gt;  Seq Scan on table2  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.070..0.080 rows=100 loops=1) |
| Planning Time: 0.092 ms                                                                                          |
| Execution Time: 0.468 ms                                                                                         |
+<span class="hljs-comment">------------------------------------------------------------------------------------------------------------------+</span>
</code></pre>
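<p>The difference is quite interesting. It chose a "Merge Join" when the data in the 2 tables overlapped more, and a "Hash Join" when the data overlapped less. Notice that the estimated row counts differ between the two plans as well (2550 per table in the first, 1800 and 1000 in the second), so the planner's estimates are likely playing a role here too, not just the overlap.</p>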
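<p>One way to probe how much of this choice comes from the planner's row estimates is to refresh the statistics and toggle the join methods the planner may use. This is only a sketch, assuming the same <code>table1</code> and <code>table2</code>; <code>enable_hashjoin</code> is a standard planner setting, changed here for the current session only, and the plans you get may differ from mine.</p>
<pre><code class="lang-sql">-- Refresh statistics so the row estimates reflect the current data
analyse table1;
analyse table2;
explain analyse select * from table1 join table2 on col1 = col2;

-- Discourage hash joins for this session, then compare plan and cost
set enable_hashjoin = off;
explain analyse select * from table1 join table2 on col1 = col2;

-- Restore the default
reset enable_hashjoin;
</code></pre>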
<h3 id="heading-example-6-sorting">Example 6: Sorting</h3>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> col1;
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                                  |
|<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------|</span>
| Sort  (cost=68.83..71.33 rows=1000 width=4) (actual time=0.203..0.276 rows=1000 loops=1)                    |
|   Sort Key: col1                                                                                            |
|   Sort Method: quicksort  Memory: 71kB                                                                      |
|   -&gt;  Seq Scan on table1  (cost=0.00..19.00 rows=1000 width=4) (actual time=0.009..0.100 rows=1000 loops=1) |
| Planning Time: 0.029 ms                                                                                     |
| Execution Time: 0.351 ms                                                                                    |
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------------+</span>
</code></pre>
<p>We can see that it used the quicksort method, and that the whole sort fit in memory (71kB).</p>
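<p>The sort method is not fixed. As a sketch (assuming the same <code>table1</code>), shrinking <code>work_mem</code> for the session or adding a <code>LIMIT</code> will typically make Postgres pick a different strategy. I'm not showing the output here, since it varies with the version and the data.</p>
<pre><code class="lang-sql">-- work_mem caps the memory a sort may use; 64kB is the smallest allowed value
set work_mem = '64kB';
explain analyse select * from table1 order by col1;  -- check "Sort Method" in the output
reset work_mem;

-- With a LIMIT, only the first few rows of the sorted output are needed
explain analyse select * from table1 order by col1 limit 10;
</code></pre>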
<h3 id="heading-example-7-sorting-and-filtering">Example 7: Sorting and Filtering</h3>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 = <span class="hljs-number">50</span> <span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> col1;
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                      |
|<span class="hljs-comment">-------------------------------------------------------------------------------------------------|</span>
| Seq Scan on table1  (cost=0.00..21.50 rows=1 width=4) (actual time=0.014..0.078 rows=1 loops=1) |
|   Filter: (col1 = 50)                                                                           |
|   Rows Removed by Filter: 999                                                                   |
| Planning Time: 0.069 ms                                                                         |
| Execution Time: 0.087 ms                                                                        |
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------+</span>
</code></pre>
<p>It realises that there's no need to sort at all when the filter pins the sort column to a single value!</p>
<p>However, as soon as we start querying for more than one value of <code>col1</code>, we can see the difference:</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">explain</span> analyse <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> (col1 = <span class="hljs-number">50</span> <span class="hljs-keyword">or</span> col1 = <span class="hljs-number">500</span>) <span class="hljs-keyword">order</span> <span class="hljs-keyword">by</span> col1;
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------+</span>
| QUERY PLAN                                                                                            |
|<span class="hljs-comment">-------------------------------------------------------------------------------------------------------|</span>
| Sort  (cost=24.01..24.02 rows=2 width=4) (actual time=0.088..0.088 rows=2 loops=1)                    |
|   Sort Key: col1                                                                                      |
|   Sort Method: quicksort  Memory: 25kB                                                                |
|   -&gt;  Seq Scan on table1  (cost=0.00..24.00 rows=2 width=4) (actual time=0.014..0.084 rows=2 loops=1) |
|         Filter: ((col1 = 50) OR (col1 = 500))                                                         |
|         Rows Removed by Filter: 998                                                                   |
| Planning Time: 0.040 ms                                                                               |
| Execution Time: 0.100 ms                                                                              |
+<span class="hljs-comment">-------------------------------------------------------------------------------------------------------+</span>
</code></pre>
<p>We can also observe that scanning the rows happens first, sorting happens later.</p>
<h2 id="heading-concluding-the-examples">Concluding the examples</h2>
<p>We can go on and on and on and on with the examples, but I'll stop. I hope the diverse set of examples we went through gave some intuition about the kind of factors that contribute to the execution plan of a query.</p>
<p>I'd encourage you to try analysing query plans for the following:</p>
<ol>
<li><p>Correlated Subqueries: See if you can validate that I didn't lie in my <a target="_blank" href="https://blog.akjn.dev/correlated-subqueries-and-why-you-need-to-know-them#heading-coming-back-to-our-query">previous blog</a>!</p>
</li>
<li><p>Sorting by multiple columns. Also try combining it with filtering.</p>
</li>
<li><p>Try different types of joins. Try joins between more than 2 tables, and see if the data in the tables makes a difference to the order of the joins.</p>
</li>
<li><p>Creating an index on column1 and querying on (column1, column2). Tinker around with indexes!</p>
</li>
<li><p>Update queries. Delete queries. We didn't even touch those, so try combining them with everything else!</p>
</li>
<li><p>etc etc etc etc. The list is never-ending!</p>
</li>
</ol>
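<p>If you want a starting point, here is a rough sketch for a few of the items above. <code>table3</code>, its columns and the index name are hypothetical names made up for this sketch, and I'm deliberately not showing the plans so you can read them yourself.</p>
<pre><code class="lang-sql">-- Correlated subquery: look at how the subquery shows up in the plan
explain analyse select * from table1 where col1 in (select col1 from table2);

-- A hypothetical two-column table with an index on the first column only
create table table3 (column1 int, column2 int);
insert into table3 select i, i % 10 from generate_series(1, 100000) as i;
create index table3_column1_idx on table3 (column1);
analyse table3;

-- Filtering on both columns while only column1 is indexed
explain analyse select * from table3 where column1 = 42 and column2 = 2;

-- Sorting by multiple columns, combined with filtering
explain analyse select * from table3 where column1 &lt; 1000 order by column2, column1;
</code></pre>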
<h1 id="heading-hey-akshat-when-do-i-use-explain-vs-explain-analyse">Hey Akshat, when do I use EXPLAIN vs EXPLAIN ANALYSE?</h1>
<p>As you might have noticed, we completely switched to using <code>EXPLAIN ANALYSE</code> for all the examples. Does that mean <code>EXPLAIN</code> isn't useful? Absolutely not.</p>
<p>There might be times when you just want the estimated query plan, without actually running the query. <code>EXPLAIN ANALYSE</code> really executes the statement, so examples could be <code>UPDATE</code> or <code>DELETE</code> queries, when you don't want to actually modify the data. You can use plain <code>EXPLAIN</code> in such cases.</p>
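<p>If you do want actual timings for a data-modifying statement, a common approach is to wrap it in a transaction and roll it back. A minimal sketch, assuming the <code>table1</code> from the examples above (the <code>where</code> condition is just for illustration):</p>
<pre><code class="lang-sql">begin;
-- EXPLAIN ANALYSE really executes the statement, so the rows are deleted here...
explain analyse delete from table1 where col1 &lt;= 100;
-- ...and the rollback undoes the deletion
rollback;

-- Plain EXPLAIN only plans the statement; nothing is deleted
explain delete from table1 where col1 &lt;= 100;
</code></pre>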
<h1 id="heading-one-final-observation">One Final Observation</h1>
<p>In case you didn't notice, an underlying theme of the examples was to demonstrate that the query plan adapts itself depending on the state of the database. This is incredibly useful for an application developer, since you do not have to change the application logic to handle 1000 rows vs 10000000 rows (for example). You can specify your query requirements irrespective of the database state, and the database will pick the plan it estimates to be the cheapest, given the statistics it has.</p>
<h1 id="heading-resource-recommendations">Resource Recommendations</h1>
<p>I want to take a moment to share some awesome resources I stumbled upon while writing this.</p>
<ol>
<li><p><a target="_blank" href="https://postgrespro.com/blog/pgsql/">https://postgrespro.com/blog/pgsql/</a></p>
<ol>
<li><p>Excellent blogs on Postgres, with superb examples and crystal clear explanations!</p>
</li>
<li><p>They even have a book with the PDF version available for free: <a target="_blank" href="https://postgrespro.com/blog/pgsql/5970159">https://postgrespro.com/blog/pgsql/5970159</a>. I've already ordered mine, have you?</p>
</li>
</ol>
</li>
<li><p>Great podcast on Postgres Query Planner: <a target="_blank" href="https://www.youtube.com/watch?v=vjRuSjiSpbI">SE-Radio Episode 328: Bruce Momjian on the Postgres Query Planner</a></p>
<ol>
<li>My favorite excerpt from the podcast: Up until 12 tables, Postgres is going to try every possible way of returning the data, but it's going to prune it as it goes.</li>
</ol>
</li>
<li><p>There are tools available to visualise the query plan. I'd encourage you to get comfortable with the tree output syntax, but in case you want to check out the tools:</p>
<ol>
<li><p><a target="_blank" href="http://tatiyants.com/pev/#/plans/new">http://tatiyants.com/pev/#/plans/new</a></p>
</li>
<li><p><a target="_blank" href="https://explain.dalibo.com/">https://explain.dalibo.com/</a></p>
</li>
</ol>
</li>
</ol>
<h1 id="heading-conclusion">Conclusion</h1>
<p><em>That's all folks!</em></p>
<p>Hope this was an informative article, and you got some takeaways from it. Thanks for your time!</p>
<p>PS:</p>
<ol>
<li><p>I'd highly appreciate any feedback on this article - the good, the bad. How was the writing style? Was it easy to understand? Was it too long? Anything else?</p>
</li>
<li><p>I'm aiming to write more blogs now. So, if you liked this one, make sure you follow me on Hashnode to tag along in the journey :)</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Correlated Subqueries and Why You Need To Know Them]]></title><description><![CDATA[Introduction
Let's say you have 2 tables:

table1: Having a column col1

table2: Having a column col2


create table table1 (col1 int);
create table table2 (col2 int);

Let's discuss a query:
select * from table1 where col1 in (select col1 from table...]]></description><link>https://blog.akjn.dev/correlated-subqueries-and-why-you-need-to-know-them</link><guid isPermaLink="true">https://blog.akjn.dev/correlated-subqueries-and-why-you-need-to-know-them</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[SQL]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Akshat Jain]]></dc:creator><pubDate>Tue, 19 Sep 2023 10:26:51 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Let's say you have 2 tables:</p>
<ol>
<li><p><code>table1</code>: Having a column <code>col1</code></p>
</li>
<li><p><code>table2</code>: Having a column <code>col2</code></p>
</li>
</ol>
<pre><code class="lang-sql"><span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> table1 (col1 <span class="hljs-built_in">int</span>);
<span class="hljs-keyword">create</span> <span class="hljs-keyword">table</span> table2 (col2 <span class="hljs-built_in">int</span>);
</code></pre>
<p>Let's discuss a query:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> col1 <span class="hljs-keyword">from</span> table2);
</code></pre>
<p>At first glance, it seems like this query should error out, because duh, <code>col1</code> does not exist in <code>table2</code>. Congratulations, that's NOT what will happen.</p>
<p>I'll spoil the answer: This query would return all rows from <code>table1</code> (strictly speaking, all rows where <code>col1</code> is not null) if <code>table2</code> is non-empty.</p>
<p>Try the following 2 cases for yourself.</p>
<p><strong>Case 1:</strong> <code>table2</code> <strong>is non-empty</strong></p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1; <span class="hljs-comment">-- table1 has some rows</span>
+<span class="hljs-comment">------+</span>
| col1 |
|<span class="hljs-comment">------|</span>
| 1    |
| 2    |
| 3    |
| 4    |
+<span class="hljs-comment">------+</span>

postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table2; <span class="hljs-comment">-- table2 is non-empty</span>
+<span class="hljs-comment">------+</span>
| col2 |
|<span class="hljs-comment">------|</span>
| 5    |
+<span class="hljs-comment">------+</span>

postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> col1 <span class="hljs-keyword">from</span> table2); <span class="hljs-comment">-- This returns all rows from table1</span>
+<span class="hljs-comment">------+</span>
| col1 |
|<span class="hljs-comment">------|</span>
| 1    |
| 2    |
| 3    |
| 4    |
+<span class="hljs-comment">------+</span>
</code></pre>
<p><strong>Case 2:</strong> <code>table2</code> <strong>is empty</strong></p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">delete</span> <span class="hljs-keyword">from</span> table2; <span class="hljs-comment">-- Delete all rows from table2</span>
<span class="hljs-keyword">DELETE</span> <span class="hljs-number">1</span>

postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> col1 <span class="hljs-keyword">from</span> table2); <span class="hljs-comment">-- This returns nothing since table2 is empty</span>
+<span class="hljs-comment">------+</span>
| col1 |
|<span class="hljs-comment">------|</span>
+<span class="hljs-comment">------+</span>
</code></pre>
<p>Now that we've gone over "what" we have to wrap our heads around, we will try to understand why it works the way it does. After that, we will see why understanding this is important, using an actual incident from my life that happened a couple of weeks back.</p>
<h1 id="heading-the-query-explained">The Query Explained :)</h1>
<p>To understand the query, we need to understand a couple of things first:</p>
<ol>
<li><p>Scope Resolution</p>
</li>
<li><p>Correlated Queries</p>
</li>
</ol>
<h3 id="heading-scope-resolution">Scope Resolution</h3>
<p>How does the query engine know which column we are talking about? We could have a bunch of tables in a complex query, with common column names. In such cases, it tries to resolve the column names (as in, it finds out which table the column name is referring to) based on some rules.</p>
<p>To keep it simple, let's discuss our example query:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> col1 <span class="hljs-keyword">from</span> table2);
</code></pre>
<p>In a subquery, the column names are resolved by looking at the innermost scope and then moving out. The first scope where the column is resolved is used.</p>
<p>There can be 3 cases:</p>
<ul>
<li><p><strong>Case 1:</strong> When <code>table2</code> has a column named <code>col1</code></p>
<ul>
<li><p><code>col1</code> would be resolved as <code>table2.col1</code></p>
</li>
<li><p>The query would be run as <code>select * from table1 where col1 in (select table2.col1 from table2);</code></p>
</li>
</ul>
</li>
<li><p><strong>Case 2:</strong> When <code>table2</code> does not have a column named <code>col1</code>, but <code>table1</code> has (This is our case!!!!)</p>
<ul>
<li><p><code>col1</code> would be resolved as <code>table1.col1</code></p>
</li>
<li><p>The query would be run as <code>select * from table1 where col1 in (select table1.col1 from table2);</code></p>
</li>
</ul>
</li>
<li><p><strong>Case 3:</strong> When neither <code>table2</code> nor <code>table1</code> has a column named <code>col1</code></p>
<ul>
<li>It will error out saying <code>column "col1" does not exist</code></li>
</ul>
</li>
</ul>
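<p>One way to stay out of this trap is to qualify every column with its table alias, so that a wrong column name fails loudly instead of silently resolving to the outer table. A small sketch, assuming the same two tables; the aliases <code>t1</code> and <code>t2</code> are names I picked for illustration, and the second query is my assumption of what was intended:</p>
<pre><code class="lang-sql">-- Qualified version of our query: t2.col1 can only refer to table2
select * from table1 t1 where t1.col1 in (select t2.col1 from table2 t2);
-- This is expected to fail, since table2 has no col1, with an error along the lines of:
-- ERROR:  column t2.col1 does not exist

-- What was probably intended, written unambiguously
select * from table1 t1 where t1.col1 in (select t2.col2 from table2 t2);
</code></pre>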
<h3 id="heading-correlated-queries">Correlated Queries</h3>
<p><strong>What's a subquery?</strong></p>
<p>A subquery (or nested query) is a query nested inside another query.</p>
<p>Example:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> col2 <span class="hljs-keyword">from</span> table2);
</code></pre>
<p>In the above example, the subquery <code>select col2 from table2</code> is executed first, and the output of the subquery is used by the outer query. Hence, steps of execution:</p>
<ol>
<li><p>Subquery is executed (only once)</p>
</li>
<li><p>Outer query is executed using output of step 1 (only once)</p>
</li>
</ol>
<p><strong>What's a Correlated Subquery?</strong></p>
<p>A correlated subquery is a subquery that references a column of a table from the outer query. The outer query, in turn, uses the result of the inner query.</p>
<p>Example:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> col1 <span class="hljs-keyword">from</span> table2);
</code></pre>
<p>This is the query we have been looking at from the very start!</p>
<p>Why is this a correlated subquery? It's because the subquery (<code>select col1 from table2</code>) is referencing a column (<code>col1</code>) from a table of the outer query (<code>table1</code>) --- thanks to scope resolution that we learned in the previous section. Note that this is a correlated subquery only because <code>col1</code> does not exist in <code>table2</code>, otherwise scope resolution would've resolved <code>select col1 from table2</code> as <code>select table2.col1 from table2</code>.</p>
<p>Now let's discuss how correlated subqueries are executed, in general.</p>
<p>For correlated subqueries, conceptually, the outer query executes first, and for every row it returns, the inner query is executed using the values of that outer row.</p>
<p>Let's try to understand the above confusing statement with our query.</p>
<h3 id="heading-coming-back-to-our-query">Coming back to our query</h3>
<p>Our query:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> col1 <span class="hljs-keyword">from</span> table2);
</code></pre>
<p>I've changed the data in the two tables to make the example easier to understand, and to avoid any confusion. Current state:</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1;
+<span class="hljs-comment">------+</span>
| col1 |
|<span class="hljs-comment">------|</span>
| 1    |
| 4    |
| 42   |
+<span class="hljs-comment">------+</span>
postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table2;
+<span class="hljs-comment">------+</span>
| col2 |
|<span class="hljs-comment">------|</span>
| 777  |
+<span class="hljs-comment">------+</span>
</code></pre>
<p><strong>Step 1:</strong> Execute outer query.</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1;
+<span class="hljs-comment">------+</span>
| col1 |
|<span class="hljs-comment">------|</span>
| 1    |
| 4    |
| 42   |
+<span class="hljs-comment">------+</span>
</code></pre>
<p>Note that the above is just an intermediate state that I've highlighted to demonstrate how it works.</p>
<p><strong>Step 2:</strong> For each row returned by the outer query, run the inner query using the value of the outer query row.</p>
<p>Our inner query was <code>select col1 from table2</code></p>
<p>Which means, step 2, in layman's terms, can be considered as:</p>
<ol>
<li><p><code>select 1 from table2</code></p>
</li>
<li><p><code>select 4 from table2</code></p>
</li>
<li><p><code>select 42 from table2</code></p>
</li>
</ol>
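<p>The three inner queries above can also be seen side by side in a single statement. This is only a sketch to visualise the per-row evaluation, not how Postgres actually executes our query; it uses a <code>LATERAL</code> join, and the names <code>outer_value</code>, <code>inner_value</code> and <code>sub</code> are made up for illustration.</p>
<pre><code class="lang-sql">-- For every row of table1, run the inner query with that row's col1
select t1.col1 as outer_value, sub.inner_value
from table1 t1
cross join lateral (select t1.col1 as inner_value from table2) as sub;
-- With the data above (one row in table2), this yields one row per table1 row,
-- in some order: (1, 1), (4, 4), (42, 42)
</code></pre>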
<p>Now if we try to consider the entire query, it becomes:</p>
<ol>
<li><p><code>select * from table1 where col1 in (select 1 from table2);</code></p>
</li>
<li><p><code>select * from table1 where col1 in (select 4 from table2);</code></p>
</li>
<li><p><code>select * from table1 where col1 in (select 42 from table2);</code></p>
</li>
</ol>
<p>Which essentially becomes:</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> <span class="hljs-number">1</span> <span class="hljs-keyword">from</span> table2);
+<span class="hljs-comment">------+</span>
| col1 |
|<span class="hljs-comment">------|</span>
| 1    |
+<span class="hljs-comment">------+</span>

postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> <span class="hljs-number">4</span> <span class="hljs-keyword">from</span> table2);
+<span class="hljs-comment">------+</span>
| col1 |
|<span class="hljs-comment">------|</span>
| 4    |
+<span class="hljs-comment">------+</span>

postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> <span class="hljs-number">42</span> <span class="hljs-keyword">from</span> table2);
+<span class="hljs-comment">------+</span>
| col1 |
|<span class="hljs-comment">------|</span>
| 42   |
+<span class="hljs-comment">------+</span>
</code></pre>
<p>And then the combined result set is returned, which is:</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> col1 <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> col1 <span class="hljs-keyword">from</span> table2);
+<span class="hljs-comment">------+</span>
| col1 |
|<span class="hljs-comment">------|</span>
| 1    |
| 4    |
| 42   |
+<span class="hljs-comment">------+</span>
</code></pre>
<p>Remember when we said that this is only the case when <code>table2</code> is NOT empty? Can you guess why? Hint:</p>
<pre><code class="lang-sql">postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> * <span class="hljs-keyword">from</span> table2; <span class="hljs-comment">-- table2 is non-empty</span>
+<span class="hljs-comment">------+</span>
| col2 |
|<span class="hljs-comment">------|</span>
| 777  |
+<span class="hljs-comment">------+</span>

postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> <span class="hljs-number">1</span> <span class="hljs-keyword">from</span> table2; <span class="hljs-comment">-- returns 1</span>
+<span class="hljs-comment">----------+</span>
| ?column? |
|<span class="hljs-comment">----------|</span>
| 1        |
+<span class="hljs-comment">----------+</span>

postgres@localhost:akjn&gt; <span class="hljs-keyword">delete</span> <span class="hljs-keyword">from</span> table2; <span class="hljs-comment">-- Delete all rows from table2</span>
<span class="hljs-keyword">DELETE</span> <span class="hljs-number">1</span>

postgres@localhost:akjn&gt; <span class="hljs-keyword">select</span> <span class="hljs-number">1</span> <span class="hljs-keyword">from</span> table2; <span class="hljs-comment">-- returns nothing because table2 is empty :D</span>
+<span class="hljs-comment">----------+</span>
| ?column? |
|<span class="hljs-comment">----------|</span>
+<span class="hljs-comment">----------+</span>
</code></pre>
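<p>Carrying the hint through to the full query: once <code>table2</code> is empty, the inner query returns no rows for any outer row, so the <code>in</code> condition never matches. A minimal sketch, assuming <code>table2</code> is still empty after the <code>delete</code> above:</p>
<pre><code class="lang-sql">-- table2 is empty, so the subquery yields no rows for every row of table1
select * from table1 where col1 in (select col1 from table2);
-- returns no rows, even though table1 itself is unchanged
</code></pre>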
<h1 id="heading-why-do-i-need-to-know-all-this">Why do I need to know all this?</h1>
<p>TLDR: Whatever we discussed also applies to DELETE queries :)</p>
<p>Instead of selecting all rows of the outer table, you could delete them (all of them!), like I did a couple of weeks back :)</p>
<h3 id="heading-story-of-my-life">Story of my life</h3>
<p>A couple of weeks back, I ran a DELETE query in our staging environment, which was something like:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">delete</span> <span class="hljs-keyword">from</span> table1 <span class="hljs-keyword">where</span> table1_column <span class="hljs-keyword">in</span> (<span class="hljs-keyword">select</span> table1_column <span class="hljs-keyword">from</span> table2 <span class="hljs-keyword">where</span> table2_column=<span class="hljs-string">'something'</span>);
</code></pre>
<p>Result: All rows from <code>table1</code> were deleted.</p>
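<p>For comparison, here is a sketch of the same statement with every column qualified by a table alias (the aliases <code>t1</code> and <code>t2</code> are illustrative, not from the original query). Assuming, as in the walkthrough above, that <code>table2</code> has no <code>table1_column</code>, Postgres can no longer fall back to the outer table and rejects the statement instead of running it:</p>
<pre><code class="lang-sql">-- t2.table1_column can only refer to table2, which has no such column,
-- so this fails with a "column does not exist" error and deletes nothing
delete from table1 t1
where t1.table1_column in (
    select t2.table1_column
    from table2 t2
    where t2.table2_column = 'something'
);
</code></pre>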
<h3 id="heading-hey-akshat-why-did-you-run-such-a-query-without-hesitation">Hey Akshat, why did you run such a query without hesitation?</h3>
<ul>
<li><p>I had faith in my PostgreSQL skills.</p>
</li>
<li><p>I had faith in the <code>where</code> clause. <em>What's the worst that can happen? The rows not matching the where clause can't be affected, right? Right?</em> But unfortunately that was not the case and I learned it the hard way.</p>
</li>
</ul>
<p>Fortunately this was just a table in our staging environment, so it didn't escalate much. But this could've been worse. A LOT WORSE.</p>
<blockquote>
<p><strong>Murphy's law: Anything that can go wrong will go wrong.</strong></p>
</blockquote>
<h1 id="heading-learnings">Learnings</h1>
<ol>
<li><p>Always use transactions when running such queries, so that you have the option to roll back any changes made.</p>
</li>
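<li><p>This also taught me why aliases matter: qualifying every column with a table alias makes the query explicit, instead of letting Postgres resolve the column names itself. An unqualified column name is resolved against every table in scope, including the outer query's, whereas an aliased one must exist in the table it names. Ever faced query errors using ORMs in your application? Look at the generated SQL the next time; the columns ALWAYS have aliases.</p>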
</li>
<li><p>While ORMs take care of aliases if properly defined, if any of your applications run raw queries directly against the DB, please use aliases. Generally such layers don't have any info on the DB schema, and blindly run the raw queries. Imagine a scenario where the table columns change in the future 🤷‍♂️</p>
</li>
</ol>
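<p>A minimal sketch of the first point, reusing the statement from the story above: run the <code>delete</code> inside a transaction, check what it did, and only then decide whether to keep it.</p>
<pre><code class="lang-sql">begin;

delete from table1
where table1_column in (select table1_column from table2 where table2_column = 'something');
-- the command tag (e.g. DELETE n) reports how many rows were removed

-- inspect what is left before making the change permanent
select count(*) from table1;

rollback;  -- undo the delete; use commit instead once the result looks right
</code></pre>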
<h1 id="heading-conclusion">Conclusion</h1>
<p><em>That's all folks!</em></p>
<p>Hope this was an informative article, and you got some takeaways from it. Thanks for your time!</p>
<p>PS:</p>
<ol>
<li><p>I'd highly appreciate any feedback on this article - the good, the bad. How was the writing style? Was it easy to understand? Was it too long? Anything else?</p>
</li>
<li><p>I'll be trying to write more blogs now. So, if you liked this one, make sure you follow me on Hashnode to tag along in the journey :)</p>
</li>
</ol>
]]></content:encoded></item></channel></rss>