EdgeLimits

From code lines to orchestrated background task workflows at scale

Lessons from building and maintaining enterprise-scale systems with Django, PostgreSQL, Celery, and RabbitMQ — the hard problems nobody warns you about.

If you can count the rows of a table with a simple SELECT COUNT(*) FROM my_table, then you probably haven’t worked with large datasets yet.

This article is about my findings after working on a project for almost two years, which used Django, PostgreSQL, Celery, and RabbitMQ to tackle enterprise-level challenges.

In this article:

  • Why COUNT(*) will betray you on large PostgreSQL tables — and what to use instead
  • How to use EXPLAIN to spot query bottlenecks before they hit production
  • The indexing mistakes that quietly kill performance (and how to fix them)

If you’re a backend engineer who has outgrown toy datasets and is starting to feel the pain, this is for you.


How to count rows when you cannot count?

I recall the first time I got a request for a change on my PR, where I used LargeDjangoTable.objects.count(). I thought the reviewer was messing with me and making things up, as the .count() had never failed me … before.

As it turned out, when you run SELECT COUNT(*) in PostgreSQL, the planner may choose a sequential scan that reads every data page of the table until it reaches the end. Similar to counting sheep before you fall asleep :). For tables with millions of rows, the query may time out before you ever get a response from the database. So your beautiful code will “never” reach the end of the line.

NB! It is worth familiarizing yourself with EXPLAIN <your SQL query> (or EXPLAIN ANALYZE <your SQL query>, which actually executes the query) to see the execution plan, so you can identify potential bottlenecks before they hit production. Finding something like Seq Scan on tablename (cost=0.00..9876540.00 rows=120000000 width=128) is definitely a red flag.
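As a runnable stand-in (SQLite here, since it ships with Python; real PostgreSQL plan output looks different, but the habit is the same), you can ask the planner for its plan without executing anything:

```python
import sqlite3

# SQLite stand-in for illustration; in Postgres you'd run
# EXPLAIN (or EXPLAIN ANALYZE) on the real query instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, v TEXT)")

# Ask the planner how it would execute the query, without running it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM big WHERE v = 'x'"
).fetchall()
for row in plan:
    print(row[-1])  # a detail string containing "SCAN": SQLite's analog of a Seq Scan
```

Spotting a full-table scan in the plan output is exactly the check worth doing before shipping a query against a large table.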

So, how to count rows when you cannot count (in PostgreSQL)?

  1. Increase the DB tier (cores, RAM, etc.) and await an invitation from Jeff B. to a party on the new yacht you have bought him.
  2. If the count feeds data aggregation, like business metrics, or doesn’t require an immediate response, you can offload the aggregation to a materialized view or to an ETL pipeline such as Airflow with PySpark.
  3. If the count is needed to iterate over the table in chunks, use the “seek method”: SELECT * FROM my_table WHERE id > :last_seen_id ORDER BY id LIMIT 100;
  4. If you only need an estimate row count:
SELECT (CASE WHEN c.reltuples < 0 THEN NULL       -- never vacuumed
             WHEN c.relpages = 0 THEN float8 '0'  -- empty table
             ELSE c.reltuples / c.relpages END
     * (pg_catalog.pg_relation_size(c.oid)
      / pg_catalog.current_setting('block_size')::int)
       )::bigint AS estimate_count
FROM   pg_catalog.pg_class c
WHERE  c.oid = 'myschema.mytable'::regclass;
  5. If you need a precise count with an “immediate” response, then rethink whether you really, really need it, or consider using big boy tools like BigQuery.
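The arithmetic behind that catalog query is simple: rows-per-page as of the last ANALYZE, scaled by the table’s current size on disk. A plain-Python mirror (the function name and sample numbers are mine, for illustration):

```python
def estimate_rows(reltuples, relpages, relation_size_bytes, block_size=8192):
    """Mirror the pg_class estimate arithmetic in plain Python.

    reltuples and relpages come from pg_class; the names here just
    echo the SQL query above.
    """
    if reltuples < 0:   # table never vacuumed/analyzed
        return None
    if relpages == 0:   # empty table
        return 0
    rows_per_page = reltuples / relpages
    current_pages = relation_size_bytes // block_size
    return int(rows_per_page * current_pages)

# A table last analyzed at 1000 rows over 10 pages that has since
# grown to 12 pages on disk is estimated at ~1200 rows.
print(estimate_rows(1000, 10, 12 * 8192))  # 1200
```

This runs in constant time regardless of table size, which is why it is the go-to for dashboards and sanity checks.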

NB! The estimate query assumes VACUUM is keeping up. On tables with heavy updates or deletes, dead rows inflate the page count and the estimate can drift — run ANALYZE on the table if you need a fresher number.
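The seek method from item 3 above can be sketched as a runnable loop. I use an in-memory SQLite table as a stand-in for a large Postgres one; the WHERE id > ? ... ORDER BY id LIMIT ? shape is identical, and in Django the same query is LargeDjangoTable.objects.filter(id__gt=last_seen_id).order_by("id")[:100]:

```python
import sqlite3

# In-memory stand-in table; the SQL shape is the same in Postgres.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO my_table (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 1001)],
)

def iter_chunks(conn, chunk_size=100):
    """Yield rows in chunks via the seek method: no OFFSET, no COUNT(*)."""
    last_seen_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM my_table WHERE id > ? ORDER BY id LIMIT ?",
            (last_seen_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        last_seen_id = rows[-1][0]  # seek from the last id we saw

chunks = list(iter_chunks(conn))
print(len(chunks))  # 10 chunks of 100 rows
```

Unlike OFFSET pagination, each chunk is an index-backed range scan, so page 1,000 costs the same as page 1.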

Who are the unsung heroes of database performance?

Indices (or “indexes” in modern English)! Bear in mind that:

  • Group fields into a composite index when a lookup uses multiple conditions. For example, for Apples.objects.filter(color="green", size="medium"), you should (if you want :) ) add the following to the model’s Meta: indexes = [models.Index(fields=["color", "size"], name="color_size_idx")]. One gotcha: the order of fields in the index matters. ["color", "size"] will be used by Postgres when you filter by color, or by color + size, but not when filtering by size alone. Put the field you filter on most often first.
  • An index scan may not be used on smaller tables. By default, the Postgres query planner decides when an index scan is the more performant option. Of course, you can enforce it, but in my experience it wasn’t necessary.
  • Case sensitivity. By default, Postgres string comparisons and the standard btree index are case-sensitive: Email.Add@example.com will not match email.add@example.com. We fell into this trap before, where the query didn’t use an index scan, and the simplest solution was an expression index: CREATE INDEX email_lower_idx ON users (LOWER(email)); then query with SELECT * FROM users WHERE LOWER(email) = 'email.add@example.com';
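To make the trap reproducible, here is a small sketch using SQLite (which ships with Python and also supports expression indexes and case-sensitive text comparison); the LOWER(email) index is the same trick as the Postgres statement above:

```python
import sqlite3

# SQLite stand-in; the CREATE INDEX on LOWER(email) mirrors the
# Postgres expression index from the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX email_lower_idx ON users (LOWER(email))")
conn.execute("INSERT INTO users (email) VALUES ('Email.Add@example.com')")

# Plain comparison is case-sensitive: no match.
miss = conn.execute(
    "SELECT * FROM users WHERE email = ?", ("email.add@example.com",)
).fetchall()

# Normalizing both sides matches, and the expression index can back it.
hit = conn.execute(
    "SELECT * FROM users WHERE LOWER(email) = ?", ("email.add@example.com",)
).fetchall()
print(len(miss), len(hit))  # 0 1
```

On the Django side, a functional index achieves the same: models.Index(Lower("email"), name="email_lower_idx") in Meta.indexes (available since Django 3.2, using django.db.models.functions.Lower).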

What I Took Away

  • COUNT(*) on a large table can trigger a full sequential scan — always check with EXPLAIN first
  • When you need to iterate over millions of rows, the seek method (WHERE id > :last_seen_id) is your friend
  • For estimates, PostgreSQL’s own catalog (pg_class) is fast and good enough for most use cases
  • Composite indexes, index scans on small tables, and case sensitivity are all easy to get wrong and hard to notice until it’s too late