Postgres at Scale
Clustered Postgres for production workloads

A primary-replica PostgreSQL 18 cluster with streaming replication, read-routed pooling, and partition-aware storage. Built for workloads that have outgrown a single node — millions of rows per table, thousands of concurrent connections, and read traffic that should never wait on a write.

Architecture

Every cluster is provisioned as one writable primary plus N hot-standby replicas, kept in sync via native PostgreSQL streaming replication (wal_level=replica, synchronous_commit=on for the first standby, asynchronous beyond that). A pooler in front of the cluster terminates client connections, multiplexes them onto a small pool of backend sessions, and routes traffic by intent: writes and SELECT … FOR UPDATE go to the primary, read-only transactions are pinned to the nearest replica.

# Primary endpoint — accepts reads + writes
DATABASE_URL=postgres://u_9f2x:pw@primary.cluster.nyas.io:5432/app?sslmode=require

# Replica endpoint — load-balanced across hot standbys
DATABASE_REPLICA_URL=postgres://u_9f2x:pw@replica.cluster.nyas.io:5432/app?sslmode=require&target_session_attrs=read-only

# Pooler endpoint — transaction-mode pooling, ~10k client conns -> ~200 backends
DATABASE_POOL_URL=postgres://u_9f2x:pw@pool.cluster.nyas.io:6432/app?sslmode=require
// Node.js — split reads from writes at the driver level
import postgres from 'postgres'

const writer = postgres(process.env.DATABASE_URL,    { max: 20 })
const reader = postgres(process.env.DATABASE_REPLICA_URL, { max: 80 })

// hot path: feed listings off replicas, keep the primary for mutations
export async function listOrders(userId) {
  return reader`SELECT id, total, status FROM orders WHERE user_id = ${userId}`
}

export async function placeOrder(o) {
  return writer`INSERT INTO orders ${writer(o)} RETURNING id`
}
psql "postgres://u_9f2x:pw@primary.cluster.nyas.io:5432/app?sslmode=require"

-- Inspect replication lag from the primary
SELECT application_name,
       client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;

Scaling primitives

  • Streaming replication with physical WAL shipping. Synchronous on the first standby for durability, asynchronous fan-out to the rest for read capacity. Sub-second replay lag under steady-state load.
  • Read replicas exposed behind a single load-balanced hostname. Add or remove replicas without changing application config.
  • Transaction-mode connection pooling in front of every node. Compresses 10,000+ client sockets onto a few hundred backend processes — the difference between too many connections and a healthy cluster at peak.
  • Declarative partitioning (PARTITION BY RANGE / LIST / HASH) with automated partition rotation. Time-series tables stay query-fast as they cross the billion-row mark.
  • Logical replication slots (pgoutput, wal2json) for CDC into Kafka, ClickHouse, or downstream services without bolted-on triggers.
  • Point-in-time recovery with continuous WAL archival to object storage. Restore to any second within the retention window.
  • Automated failover driven by a consensus-based controller. The pooler reroutes writes to the new primary on the order of seconds; in-flight read transactions on healthy standbys are unaffected.

Partitioning a large table

For tables that grow without bound — events, orders, audit logs — declarative partitioning lets the planner prune to a single child table per query. Combined with BRIN indexes on the partition key, scans stay bounded as the dataset grows past what fits in shared buffers.

-- Range-partitioned event table, one partition per month
CREATE TABLE events (
    id          bigserial,
    occurred_at timestamptz NOT NULL,
    user_id     bigint      NOT NULL,
    payload     jsonb       NOT NULL,
    PRIMARY KEY (id, occurred_at)
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_2026_05 PARTITION OF events
    FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');

CREATE TABLE events_2026_06 PARTITION OF events
    FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');

-- The planner only touches partitions whose range overlaps the predicate.
EXPLAIN SELECT count(*) FROM events
 WHERE occurred_at >= '2026-05-15' AND occurred_at < '2026-05-20';
-- BRIN is ~1000x smaller than B-tree on monotonic columns
-- and is the right index for partition keys on append-only tables.
CREATE INDEX events_occurred_at_brin
    ON events USING brin (occurred_at)
    WITH (pages_per_range = 64);

-- Routine maintenance keeps the BRIN summary tight
SELECT brin_summarize_new_values('events_occurred_at_brin');

When to choose this

  • A single-node Postgres is hitting CPU or IOPS ceilings during read traffic spikes.
  • You have one or more tables past 100M rows and want predictable query latency without rewriting onto a different engine.
  • Your application can tolerate eventually consistent reads on a documented subset of endpoints, but writes must remain strongly consistent.
  • You need an SLA-grade RTO/RPO with automated failover, but want to keep the operational model of plain PostgreSQL — same SQL, same tools, same wire protocol.