Incremental Load Pattern: Stage + Base Table Design

May 7, 2026

A practical design pattern using Stage and Base tables with surrogate keys and checksum tracking for clean, auditable incremental loads

Is This How Everyone Does It?

Short answer: no. This is a design pattern — one approach that works well for a specific class of data engineering problems. Teams at companies like Meta or Amazon are dealing with petabyte-scale distributed systems and often build entirely different architectures optimized for their constraints. The right pattern depends on your data volume, team skill set, tooling, and SLA requirements.

What's described here is a clean, practical pattern for teams running SQL-based ETL pipelines against a traditional or cloud data warehouse.


The Two-Table Pattern: Stage + Base

The design rests on a strict separation of responsibilities between two layers:

Layer    Table             Responsibility
Stage    STG_CUSTOMERS     All ETL transformation logic runs here
Base     BASE_CUSTOMERS    Clean, versioned history that users query

The rule is simple: every piece of ETL logic belongs in the load into STG. Field mapping, type casting, deduplication, business rules, derived columns — all of it happens before you touch Base. STG is where you do the work.
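
As a sketch of what that can look like (the raw_customers landing table, the trimming rules, and the extracted_at extract timestamp here are all illustrative):

-- Illustrative STG load: cleanup and deduplication happen here,
-- before anything touches Base
INSERT INTO stg_customers (ssn, first_name, last_name, address, phone_number)
SELECT ssn, first_name, last_name, address, phone_number
FROM (
    SELECT
        TRIM(ssn)          AS ssn,
        TRIM(first_name)   AS first_name,
        TRIM(last_name)    AS last_name,
        TRIM(address)      AS address,
        TRIM(phone_number) AS phone_number,
        -- keep only the most recent record per natural key in this extract
        ROW_NUMBER() OVER (
            PARTITION BY TRIM(ssn)
            ORDER BY extracted_at DESC
        ) AS rn
    FROM raw_customers
) deduped
WHERE rn = 1;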

Once STG is loaded and validated, the move from STG → Base is a one-to-one copy driven by the incremental merge logic described below. This is sometimes loosely called a kill-and-fill on the current records, but unlike a true kill-and-fill (truncate and reload) you're not rebuilding from scratch: you're applying a clean diff.

Source System
     │
     │  (all ETL logic lives here)
     ▼
┌──────────┐
│   STG    │
│ (staging)│
└──────────┘
     │
     │  one-to-one incremental copy
     ▼
┌──────────┐
│   BASE   │ ◀── Users query here
│  (base)  │
└──────────┘

This separation keeps Base predictable. You always know exactly what's in it because the only way data gets there is through a controlled, well-defined merge from STG.


The Two Tracking Columns

This pattern implements SCD Type 2 (Slowly Changing Dimension Type 2) — a standard technique for maintaining full history as attributes change over time. It runs on two columns:

1. SURROGATE_KEY

A hash of the business/natural key — the attribute that uniquely identifies who this person or entity is. For customer data, that's typically something like SSN.

SURROGATE_KEY = MD5(ssn)

This never changes. The same person always hashes to the same surrogate key regardless of how many times their address or phone number changes. It's purely an identity question: is this person already in Base?
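
In SQL terms, that identity question is a simple existence check (a sketch; Jane's SSN is the running example below):

-- Is this person already in Base, in any version?
SELECT COUNT(*) AS versions_in_base
FROM base_customers
WHERE surrogate_key = MD5('111-22-3333');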

2. CHKSUMTXT

A hash of the attributes you care about tracking changes for — the things that describe the entity. For a customer, that might be:

  • First name
  • Last name
  • Address
  • Phone number

CHKSUMTXT = MD5(CONCAT(first_name, last_name, address, phone))

If any one of those attributes changes, CHKSUMTXT changes, and that triggers an SCD2 update: the old row gets expired and a new row is inserted capturing the new state. You decide which columns go into CHKSUMTXT based on what your business actually needs to track historically.

-- Building both columns in STG
SELECT
    MD5(ssn) AS surrogate_key,
    -- '|' delimiters stop adjacent values from colliding
    -- (e.g. 'ab' + 'c' vs 'a' + 'bc'), and COALESCE keeps a
    -- single NULL from nulling out the whole checksum
    MD5(CONCAT(COALESCE(first_name,   ''), '|',
               COALESCE(last_name,    ''), '|',
               COALESCE(address,      ''), '|',
               COALESCE(phone_number, ''))) AS chksumtxt,
    ssn,
    first_name,
    last_name,
    address,
    phone_number,
    CURRENT_TIMESTAMP AS load_ts
FROM raw_customers;
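
Pulling the columns used throughout this post together, the Base table might be declared like this (a sketch; the types are illustrative, and you may also carry the natural key itself if policy allows):

CREATE TABLE base_customers (
    surrogate_key  CHAR(32)     NOT NULL,  -- MD5 hex of the natural key
    chksumtxt      CHAR(32)     NOT NULL,  -- MD5 hex of the tracked attributes
    first_name     VARCHAR(100),
    last_name      VARCHAR(100),
    address        VARCHAR(200),
    phone          VARCHAR(20),
    is_current     CHAR(1)      NOT NULL,  -- 'Y' or 'N'
    effective_ts   TIMESTAMP    NOT NULL,
    expiry_ts      TIMESTAMP               -- NULL while the row is current
);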

The Three Scenarios

Take a concrete customer as the example throughout: Jane Doe, SSN 111-22-3333.

Scenario 1 — New Record (Surrogate Key Not In Base)

Jane is a brand new customer — her surrogate key (MD5(ssn)) doesn't exist anywhere in Base. Insert her as an active record.

STG incoming:

surrogate_key  chksumtxt  ssn          first_name  last_name  address      phone
sk_jane        cks_A      111-22-3333  Jane        Doe        123 Main St  555-0100

BASE before:

(no rows)

BASE after:

surrogate_key  chksumtxt  first_name  last_name  address      phone     is_current  effective_ts  expiry_ts
sk_jane        cks_A      Jane        Doe        123 Main St  555-0100  Y           2026-05-01    NULL

Scenario 2 — Changed Record (Surrogate Key Exists, CHKSUMTXT Different)

Jane moved to a new address. Her SSN hasn't changed (same person, same surrogate_key), but her address changed, so CHKSUMTXT is now different.

SCD2 kicks in:

  1. Expire the existing row — set is_current = 'N' and stamp expiry_ts
  2. Insert a new row with the updated attributes and is_current = 'Y'

STG incoming:

surrogate_key  chksumtxt  first_name  last_name  address      phone
sk_jane        cks_B      Jane        Doe        456 Oak Ave  555-0100

BASE before:

surrogate_key  chksumtxt  first_name  last_name  address      phone     is_current  effective_ts  expiry_ts
sk_jane        cks_A      Jane        Doe        123 Main St  555-0100  Y           2026-05-01    NULL

BASE after:

surrogate_key  chksumtxt  first_name  last_name  address      phone     is_current  effective_ts  expiry_ts
sk_jane        cks_A      Jane        Doe        123 Main St  555-0100  N           2026-05-01    2026-05-07
sk_jane        cks_B      Jane        Doe        456 Oak Ave  555-0100  Y           2026-05-07    NULL

You now have a full history of every address Jane ever had, without deleting a single row.

The same logic applies if her first name, last name, or phone number changes — any attribute inside CHKSUMTXT triggers this.


Scenario 3 — Unchanged Record (Surrogate Key Exists, CHKSUMTXT Same)

Jane's record comes through again in the next load, but nothing about her has changed. CHKSUMTXT is identical. Skip it entirely. No update, no insert.

STG incoming:

surrogate_key  chksumtxt  first_name  last_name  address      phone
sk_jane        cks_B      Jane        Doe        456 Oak Ave  555-0100

BASE (current row):

surrogate_key  chksumtxt  first_name  last_name  address      phone     is_current
sk_jane        cks_B      Jane        Doe        456 Oak Ave  555-0100  Y

→ No action taken. This is where incremental loading earns its name — if a million customers load unchanged, you process zero rows against Base.
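
Before the merge runs, you can bucket the incoming STG rows into the three scenarios, which makes a handy run-level audit (a sketch against the tables defined above):

-- Pre-merge audit: how many rows fall into each scenario?
SELECT scenario, COUNT(*) AS row_count
FROM (
    SELECT
        CASE
            WHEN base.surrogate_key IS NULL      THEN 'new'
            WHEN base.chksumtxt <> stg.chksumtxt THEN 'changed'
            ELSE 'unchanged'
        END AS scenario
    FROM stg_customers stg
    LEFT JOIN base_customers base
        ON base.surrogate_key = stg.surrogate_key
       AND base.is_current = 'Y'
) bucketed
GROUP BY scenario;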


The Merge Logic in SQL

MERGE INTO base_customers AS base
USING stg_customers AS stg
ON base.surrogate_key = stg.surrogate_key
   AND base.is_current = 'Y'

-- Scenario 2: record changed → expire the old row
WHEN MATCHED AND base.chksumtxt <> stg.chksumtxt THEN
    UPDATE SET
        base.is_current = 'N',
        base.expiry_ts  = CURRENT_TIMESTAMP

-- Scenario 3 (record unchanged, checksums match) needs no clause at all:
-- rows that match no WHEN branch are simply left alone

-- Scenario 1: new record → insert
WHEN NOT MATCHED THEN
    INSERT (surrogate_key, chksumtxt, first_name, last_name,
            address, phone, is_current, effective_ts, expiry_ts)
    VALUES (stg.surrogate_key, stg.chksumtxt, stg.first_name,
            stg.last_name, stg.address, stg.phone_number,
            'Y', CURRENT_TIMESTAMP, NULL);

-- Scenario 2 (continued): after expiring the old row, insert the updated
-- record. Changed keys now have no current row in Base, so an anti-join
-- on is_current = 'Y' picks out exactly the rows the MERGE just expired.
INSERT INTO base_customers
    (surrogate_key, chksumtxt, first_name, last_name,
     address, phone, is_current, effective_ts, expiry_ts)
SELECT
    stg.surrogate_key, stg.chksumtxt, stg.first_name,
    stg.last_name, stg.address, stg.phone_number,
    'Y', CURRENT_TIMESTAMP, NULL
FROM stg_customers stg
WHERE NOT EXISTS (
    SELECT 1
    FROM base_customers base
    WHERE base.surrogate_key = stg.surrogate_key
      AND base.is_current = 'Y'
);
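
A cheap post-run sanity check (a sketch, not part of the merge itself) is to assert that no key ever has more than one current row:

-- Should return zero rows after every successful run
SELECT surrogate_key, COUNT(*) AS current_rows
FROM base_customers
WHERE is_current = 'Y'
GROUP BY surrogate_key
HAVING COUNT(*) > 1;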

Why We Never Delete

Deletes are destructive and, within a pipeline run, usually irreversible. Hard deletes in Base would mean:

  • No audit trail of what the data looked like previously
  • Broken downstream reports that joined on those rows
  • No way to distinguish "was deleted from source" vs "pipeline bug"

Instead, a soft delete pattern (setting is_current = 'N') handles source-side deletions gracefully — you mark the row expired and let consumers decide whether to include inactive records.
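
A minimal sketch of that soft delete, assuming each STG load is a full snapshot of the source (with incremental extracts you'd need delete flags or a change feed instead):

-- Expire current rows whose key no longer appears in the source snapshot
UPDATE base_customers
SET is_current = 'N',
    expiry_ts  = CURRENT_TIMESTAMP
WHERE is_current = 'Y'
  AND NOT EXISTS (
      SELECT 1
      FROM stg_customers stg
      WHERE stg.surrogate_key = base_customers.surrogate_key
  );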


The Full Picture

Every pipeline run:

STG populated ─────────────────────────┐
                                       │
                                       ▼
                         ┌─────────────────────────┐
                         │   Compare STG vs BASE   │
                         └─────────────────────────┘
                                      │
              ┌───────────────────────┼───────────────────┐
              │                       │                   │
              ▼                       ▼                   ▼
        New record?             CHKSUMTXT             CHKSUMTXT
        (key missing            changed?              same?
         in BASE)               (key exists)          (key exists)
              │                       │                   │
              ▼                       ▼                   ▼
         INSERT new             Expire old +           No-op,
         row with               INSERT new row         skip it
         is_current='Y'         with is_current='Y'
Result: BASE always has valid data. Pipeline failures only affect how fresh it is, never whether it exists.


Summary

Condition                        Action
Key not in Base                  Insert new row (is_current = 'Y')
Key exists, CHKSUMTXT changed    Expire old row + insert updated row
Key exists, CHKSUMTXT same       Skip; no action
Source-side delete               Expire old row (is_current = 'N'); never hard delete

This pattern gives you resilience (Base survives Stage failures), auditability (full row history), and efficiency (only process what changed). It scales from a few thousand rows to hundreds of millions without changing the logic — only the compute underneath it.