Optimizing PostgreSQL Queries with Empty IN Clauses

Optimizing PostgreSQL Queries with Empty IN Clauses

As a technical blogger, I’ve encountered numerous queries where an empty IN clause leads to slow performance. In this article, we’ll delve into the world of PostgreSQL indexing and explore ways to optimize these queries.

Understanding the Problem

The question at hand involves a PostgreSQL query that uses an IN clause with an empty subquery. The query is designed to retrieve 15 rows from the users-results.ai_algo table where the _at_timestamp column matches the same timestamp used in the filtered_users_pre subquery.

WITH filtered_users_pre AS (
  SELECT value as username, row_number() OVER (PARTITION BY value) AS rk
    FROM "user-stats".tag_table
    WHERE _at_timestamp = 1626955200 AND tag IN ('commercial','marketing')
  ),
  filtered_users AS (
    SELECT username
    FROM filtered_users_pre
    WHERE rk = 2
  ),
  valid_users AS (
    SELECT aa.username, aa.rank, aa.points, aa.version
    FROM "users-results".ai_algo aa
    WHERE aa._at_timestamp = 1626955200 AND aa.rank_timeframe = '7d' AND aa.username IN (SELECT * FROM filtered_users)
    ORDER BY aa.rank ASC
    LIMIT 15 OFFSET 0
  )
SELECT * FROM valid_users;

The query takes approximately 4 seconds to run when the filtered_users subquery returns no rows. However, once the subquery has at least one row, the execution time drops to around 400 milliseconds.

Analyzing the EXPLAIN Output

To gain a deeper understanding of the query’s behavior, let’s analyze the EXPLAIN (ANALYZE, BUFFERS) output for both scenarios:

Empty filtered_users Subquery

Limit  (cost=14592.13..15870.39 rows=15 width=35) (actual time=3953.945..3953.949 rows=0 loops=1)
  Buffers: shared hit=7456641
  ->  Nested Loop Semi Join  (cost=14592.13..1795382.62 rows=20897 width=35) (actual time=3953.944..3953.947 rows=0 loops=1)
        Join Filter: (aa.username = filtered_users_pre.username)
        Buffers: shared hit=7456641
        ->  Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa  (cost=0.56..1718018.61 rows=321495 width=35) (actual time=0.085..3885.547 rows=313611 loops=1)
"              Index Cond: (rank_timeframe = '7d'::""valid-users-timeframe"")"
              Filter: (_at_timestamp = 1626955200)
              Rows Removed by Filter: 7793096
              Buffers: shared hit=7456533
        ->  Materialize  (cost=14591.56..14672.51 rows=13 width=21) (actual time=0.000..0.000 rows=0 loops=313611)
              Buffers: shared hit=108
              ->  Subquery Scan on filtered_users_pre  (cost=14591.56..14672.44 rows=13 width=21) (actual time=3.543..3.545 rows=0 loops=1)
                    Filter: (filtered_users_pre.rk = 2)
                    Rows Removed by Filter: 2415
                    Buffers: shared hit=108
                    ->  WindowAgg  (cost=14591.56..14638.74 rows=2696 width=29) (actual time=1.996..3.356 rows=2415 loops=1)
                          Buffers: shared hit=108
                          ->  Sort  (cost=14591.56..14598.30 rows=2696 width=21) (actual time=1.990..2.189 rows=2415 loops=1)
                                Sort Key: tag_table_20210722.value
                                Sort Method: quicksort  Memory: 285kB
                                Buffers: shared hit=108
                                ->  Bitmap Heap Scan on tag_table_20210722  (cost=146.24..14437.94 rows=2696 width=21) (actual time=0.612..1.080 rows=2415 loops=1)
"                                      Recheck Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
                                      Filter: (_at_timestamp = 1626955200)
                                      Rows Removed by Filter: 2415
                                      Heap Blocks: exact=72
                                      Buffers: shared hit=105
                                      ->  Bitmap Index Scan on tag_table_20210722_tag_idx  (cost=0.00..145.57 rows=5428 width=0) (actual time=0.292..0.292 rows=4830 loops=1)
"                                            Index Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
                                            Buffers: shared hit=33
Planning Time: 0.914 ms
Execution Time: 3954.035 ms

Non-Empty filtered_users Subquery

Limit  (cost=14592.13..15870.39 rows=15 width=35) (actual time=15.958..300.759 rows=15 loops=1)
  Buffers: shared hit=11042
  ->  Nested Loop Semi Join  (cost=14592.13..1795382.62 rows=20897 width=35) (actual time=15.957..300.752 rows=15 loops=1)
        Join Filter: (aa.username = filtered_users_pre.username)
        Rows Removed by Join Filter: 1544611
        Buffers: shared hit=11042
        ->  Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa (cost=0.56..1718018.61 rows=321495 width=35) (actual time=0.075..10.455 rows=645 loops=1)
"              Index Cond: (rank_timeframe = '7d'::""valid-users-timeframe"")"
              Filter: (_at_timestamp = 1626955200)
              Rows Removed by Filter: 7793096
              Buffers: shared hit=7456533
        ->  Materialize  (cost=14591.56..14672.51 rows=13 width=21) (actual time=0.000..0.000 rows=0 loops=313611)
              Buffers: shared hit=108
              ->  Subquery Scan on filtered_users_pre  (cost=14591.56..14672.44 rows=13 width=21) (actual time=1.046..2.024 rows=13 loops=1)
                    Filter: (filtered_users_pre.rk = 2)
                    Rows Removed by Filter: 2415
                    Buffers: shared hit=105
                    ->  WindowAgg  (cost=14591.56..14638.74 rows=2696 width=29) (actual time=0.604..1.044 rows=13 loops=1)
                          Buffers: shared hit=105
                          ->  Sort  (cost=14591.56..14598.30 rows=2696 width=21) (actual time=0.287..0.287 rows=13 loops=1)
                                Sort Key: tag_table_20210722.value
                                Sort Method: quicksort  Memory: 285kB
                                Buffers: shared hit=105
                                ->  Bitmap Heap Scan on tag_table_20210722  (cost=146.24..14437.94 rows=2696 width=21) (actual time=0.106..0.138 rows=13 loops=1)
"                                      Recheck Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
                                      Filter: (_at_timestamp = 1626955200)
                                      Rows Removed by Filter: 2415
                                      Heap Blocks: exact=72
                                      Buffers: shared hit=105
                                      ->  Bitmap Index Scan on tag_table_20210722_tag_idx  (cost=0.00..145.57 rows=5428 width=0) (actual time=0.046..0.056 rows=13 loops=1)
"                                            Index Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
                                            Buffers: shared hit=33
Planning Time: 0.310 ms
Execution Time: 300.954 ms

Optimizing the Query

The main issue here is that PostgreSQL has to scan all rows in the users-results.ai_algo table when there are no matching rows in the filtered_users subquery.

To optimize this query, we can create an index on the columns used in the WHERE clause. In this case, creating an index on (rank_timeframe, _at_timestamp) would allow PostgreSQL to quickly find the desired rows without having to scan all rows in the table.

Here’s how you can modify your query:

CREATE INDEX ON "users-results".ai_algo (rank_timeframe, _at_timestamp);

Additionally, you can drop the old index ai_algo_rank_timeframe_rank_idx, as it is no longer needed:

DROP INDEX ON "users-results".ai_algo rank_timeframe_rank_idx;

With these changes in place, the query should execute much faster when there are no matching rows in the filtered_users subquery.

Conclusion

In this article, we explored a common problem with PostgreSQL queries where an empty IN clause leads to slow performance. We analyzed the EXPLAIN output and discovered that PostgreSQL has to scan all rows in the table without finding any matches. To optimize such queries, it is essential to create effective indexes on columns used in the WHERE clause.

By creating an index on (rank_timeframe, _at_timestamp), we can speed up the query and improve performance.

Stay tuned for more PostgreSQL tips and tricks!


Last modified on 2023-11-16