Incremental Load Pattern: Stage + Base Table Design

May 7, 2026

A practical design pattern using Stage and Base tables with surrogate keys and checksum tracking for clean, auditable incremental loads

Is This How Everyone Does It?

Short answer: no. This is a design pattern — one approach that works well for a specific class of data engineering problems. Teams at companies like Meta or Amazon are dealing with petabyte-scale distributed systems and often build entirely different architectures optimized for their constraints. The right pattern depends on your data volume, team skill set, tooling, and SLA requirements.

What's described here is a clean, practical pattern for teams running SQL-based ETL pipelines against a traditional or cloud data warehouse.


The Two-Table Pattern: Stage + Base

The design rests on a strict separation of responsibilities between two layers:

Layer    Table             Responsibility
Stage    STG_CUSTOMERS     All ETL transformation logic runs here
Base     BASE_CUSTOMERS    Clean, versioned history that users query

The rule is simple: every piece of ETL logic belongs in the load into STG. Field mapping, type casting, deduplication, business rules, derived columns — all of it happens before you touch Base. STG is where you do the work.
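
As a sketch of what that can look like (the raw_customers landing table, the trimming rules, and the extracted_at extract timestamp here are all illustrative):

-- Illustrative STG load: cleanup and deduplication happen here,
-- before anything touches Base
INSERT INTO stg_customers (ssn, first_name, last_name, address, phone_number)
SELECT ssn, first_name, last_name, address, phone_number
FROM (
    SELECT
        TRIM(ssn)          AS ssn,
        TRIM(first_name)   AS first_name,
        TRIM(last_name)    AS last_name,
        TRIM(address)      AS address,
        TRIM(phone_number) AS phone_number,
        -- keep only the most recent record per natural key in this extract
        ROW_NUMBER() OVER (
            PARTITION BY TRIM(ssn)
            ORDER BY extracted_at DESC
        ) AS rn
    FROM raw_customers
) deduped
WHERE rn = 1;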

Once STG is loaded and validated, the move from STG → Base is a one-to-one copy driven by the incremental merge logic described below. This is sometimes loosely called a kill-and-fill on the current records, but unlike a true kill-and-fill (truncate and reload) you're not rebuilding from scratch: you're applying a clean diff.

Source System
     │
     │  (all ETL logic lives here)
     ▼
┌──────────┐
│   STG    │
│ (staging)│
└──────────┘
     │
     │  one-to-one incremental copy
     ▼
┌──────────┐
│   BASE   │ ◀── Users query here
│  (base)  │
└──────────┘

This separation keeps Base predictable. You always know exactly what's in it because the only way data gets there is through a controlled, well-defined merge from STG.


The Two Tracking Columns

This pattern implements SCD Type 2 (Slowly Changing Dimension Type 2) — a standard technique for maintaining full history as attributes change over time. It runs on two columns:

1. SURROGATE_KEY

A hash of the business/natural key — the attribute that uniquely identifies who this person or entity is. For customer data, that's typically something like SSN.

SURROGATE_KEY = MD5(ssn)

This never changes. The same person always hashes to the same surrogate key regardless of how many times their address or phone number changes. It's purely an identity question: is this person already in Base?
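
In SQL terms, that identity question is a simple existence check (a sketch; Jane's SSN is the running example below):

-- Is this person already in Base, in any version?
SELECT COUNT(*) AS versions_in_base
FROM base_customers
WHERE surrogate_key = MD5('111-22-3333');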

2. CHKSUMTXT

A hash of the attributes you care about tracking changes for — the things that describe the entity. For a customer, that might be:

  • First name
  • Last name
  • Address
  • Phone number

CHKSUMTXT = MD5(CONCAT(first_name, last_name, address, phone))

If any one of those attributes changes, CHKSUMTXT changes, and that triggers an SCD2 update: the old row gets expired and a new row is inserted capturing the new state. You decide which columns go into CHKSUMTXT based on what your business actually needs to track historically.

-- Building both columns in STG
SELECT
    MD5(ssn) AS surrogate_key,
    -- '|' delimiters stop adjacent values from colliding
    -- (e.g. 'ab' + 'c' vs 'a' + 'bc'), and COALESCE keeps a
    -- single NULL from nulling out the whole checksum
    MD5(CONCAT(COALESCE(first_name,   ''), '|',
               COALESCE(last_name,    ''), '|',
               COALESCE(address,      ''), '|',
               COALESCE(phone_number, ''))) AS chksumtxt,
    ssn,
    first_name,
    last_name,
    address,
    phone_number,
    CURRENT_TIMESTAMP AS load_ts
FROM raw_customers;
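
Pulling the columns used throughout this post together, the Base table might be declared like this (a sketch; the types are illustrative, and you may also carry the natural key itself if policy allows):

CREATE TABLE base_customers (
    surrogate_key  CHAR(32)     NOT NULL,  -- MD5 hex of the natural key
    chksumtxt      CHAR(32)     NOT NULL,  -- MD5 hex of the tracked attributes
    first_name     VARCHAR(100),
    last_name      VARCHAR(100),
    address        VARCHAR(200),
    phone          VARCHAR(20),
    is_current     CHAR(1)      NOT NULL,  -- 'Y' or 'N'
    effective_ts   TIMESTAMP    NOT NULL,
    expiry_ts      TIMESTAMP               -- NULL while the row is current
);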

The Three Scenarios

Take a concrete customer as the example throughout: Jane Doe, SSN 111-22-3333.

Scenario 1 — New Record (Surrogate Key Not In Base)

Jane is a brand new customer — her surrogate key (MD5(ssn)) doesn't exist anywhere in Base. Insert her as an active record.

STG incoming:

surrogate_key  chksumtxt  ssn          first_name  last_name  address      phone
sk_jane        cks_A      111-22-3333  Jane        Doe        123 Main St  555-0100

BASE before:

(no rows)

BASE after:

surrogate_key  chksumtxt  first_name  last_name  address      phone     is_current  effective_ts  expiry_ts
sk_jane        cks_A      Jane        Doe        123 Main St  555-0100  Y           2026-05-01    NULL

Scenario 2 — Changed Record (Surrogate Key Exists, CHKSUMTXT Different)

Jane moved to a new address. Her SSN hasn't changed (same person, same surrogate_key), but her address changed, so CHKSUMTXT is now different.

SCD2 kicks in:

  1. Expire the existing row — set is_current = 'N' and stamp expiry_ts
  2. Insert a new row with the updated attributes and is_current = 'Y'

STG incoming:

surrogate_key  chksumtxt  first_name  last_name  address      phone
sk_jane        cks_B      Jane        Doe        456 Oak Ave  555-0100

BASE before:

surrogate_key  chksumtxt  first_name  last_name  address      phone     is_current  effective_ts  expiry_ts
sk_jane        cks_A      Jane        Doe        123 Main St  555-0100  Y           2026-05-01    NULL

BASE after:

surrogate_key  chksumtxt  first_name  last_name  address      phone     is_current  effective_ts  expiry_ts
sk_jane        cks_A      Jane        Doe        123 Main St  555-0100  N           2026-05-01    2026-05-07
sk_jane        cks_B      Jane        Doe        456 Oak Ave  555-0100  Y           2026-05-07    NULL

You now have a full history of every address Jane ever had, without deleting a single row.

The same logic applies if her first name, last name, or phone number changes — any attribute inside CHKSUMTXT triggers this.


Scenario 3 — Unchanged Record (Surrogate Key Exists, CHKSUMTXT Same)

Jane's record comes through again in the next load, but nothing about her has changed. CHKSUMTXT is identical. Skip it entirely. No update, no insert.

STG incoming:

surrogate_key  chksumtxt  first_name  last_name  address      phone
sk_jane        cks_B      Jane        Doe        456 Oak Ave  555-0100

BASE (current row):

surrogate_key  chksumtxt  first_name  last_name  address      phone     is_current
sk_jane        cks_B      Jane        Doe        456 Oak Ave  555-0100  Y

→ No action taken. This is where incremental loading earns its name — if a million customers load unchanged, you process zero rows against Base.
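
Before the merge runs, you can bucket the incoming STG rows into the three scenarios, which makes a handy run-level audit (a sketch against the tables defined above):

-- Pre-merge audit: how many rows fall into each scenario?
SELECT scenario, COUNT(*) AS row_count
FROM (
    SELECT
        CASE
            WHEN base.surrogate_key IS NULL      THEN 'new'
            WHEN base.chksumtxt <> stg.chksumtxt THEN 'changed'
            ELSE 'unchanged'
        END AS scenario
    FROM stg_customers stg
    LEFT JOIN base_customers base
        ON base.surrogate_key = stg.surrogate_key
       AND base.is_current = 'Y'
) bucketed
GROUP BY scenario;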


The Merge Logic in SQL

MERGE INTO base_customers AS base
USING stg_customers AS stg
ON base.surrogate_key = stg.surrogate_key
   AND base.is_current = 'Y'

-- Scenario 2: record changed → expire the old row
WHEN MATCHED AND base.chksumtxt <> stg.chksumtxt THEN
    UPDATE SET
        base.is_current = 'N',
        base.expiry_ts  = CURRENT_TIMESTAMP

-- Scenario 3 (record unchanged, checksums match) needs no clause at all:
-- rows that match no WHEN branch are simply left alone

-- Scenario 1: new record → insert
WHEN NOT MATCHED THEN
    INSERT (surrogate_key, chksumtxt, first_name, last_name,
            address, phone, is_current, effective_ts, expiry_ts)
    VALUES (stg.surrogate_key, stg.chksumtxt, stg.first_name,
            stg.last_name, stg.address, stg.phone_number,
            'Y', CURRENT_TIMESTAMP, NULL);

-- Scenario 2 (continued): after expiring the old row, insert the updated
-- record. Changed keys now have no current row in Base, so an anti-join
-- on is_current = 'Y' picks out exactly the rows the MERGE just expired.
INSERT INTO base_customers
    (surrogate_key, chksumtxt, first_name, last_name,
     address, phone, is_current, effective_ts, expiry_ts)
SELECT
    stg.surrogate_key, stg.chksumtxt, stg.first_name,
    stg.last_name, stg.address, stg.phone_number,
    'Y', CURRENT_TIMESTAMP, NULL
FROM stg_customers stg
WHERE NOT EXISTS (
    SELECT 1
    FROM base_customers base
    WHERE base.surrogate_key = stg.surrogate_key
      AND base.is_current = 'Y'
);
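
A cheap post-run sanity check (a sketch, not part of the merge itself) is to assert that no key ever has more than one current row:

-- Should return zero rows after every successful run
SELECT surrogate_key, COUNT(*) AS current_rows
FROM base_customers
WHERE is_current = 'Y'
GROUP BY surrogate_key
HAVING COUNT(*) > 1;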

Why We Never Delete

Deletes are destructive and, within a pipeline run, usually irreversible. Hard deletes in Base would mean:

  • No audit trail of what the data looked like previously
  • Broken downstream reports that joined on those rows
  • No way to distinguish "was deleted from source" vs "pipeline bug"

Instead, a soft delete pattern (setting is_current = 'N') handles source-side deletions gracefully — you mark the row expired and let consumers decide whether to include inactive records.
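
A minimal sketch of that soft delete, assuming each STG load is a full snapshot of the source (with incremental extracts you'd need delete flags or a change feed instead):

-- Expire current rows whose key no longer appears in the source snapshot
UPDATE base_customers
SET is_current = 'N',
    expiry_ts  = CURRENT_TIMESTAMP
WHERE is_current = 'Y'
  AND NOT EXISTS (
      SELECT 1
      FROM stg_customers stg
      WHERE stg.surrogate_key = base_customers.surrogate_key
  );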


The Full Picture

Every pipeline run:

STG populated ─────────────────────────┐
                                       │
                                       ▼
                         ┌─────────────────────────┐
                         │   Compare STG vs BASE   │
                         └─────────────────────────┘
                                      │
              ┌───────────────────────┼───────────────────┐
              │                       │                   │
              ▼                       ▼                   ▼
        New record?             CHKSUMTXT             CHKSUMTXT
        (key missing            changed?              same?
         in BASE)               (key exists)          (key exists)
              │                       │                   │
              ▼                       ▼                   ▼
         INSERT new             Expire old +           No-op,
         row with               INSERT new row         skip it
         is_current='Y'         with is_current='Y'
Result: BASE always has valid data. Pipeline failures only affect how fresh it is, never whether it exists.


Summary

Condition                        Action
Key not in Base                  Insert new row (is_current = 'Y')
Key exists, CHKSUMTXT changed    Expire old row + insert updated row
Key exists, CHKSUMTXT same       Skip; no action
Source-side delete               Expire old row (is_current = 'N'); never hard delete

This pattern gives you resilience (Base survives Stage failures), auditability (full row history), and efficiency (only process what changed). It scales from a few thousand rows to hundreds of millions without changing the logic — only the compute underneath it.