SCD Type 2 Implementation — datamodelassistant.com

A Slowly Changing Dimension Type 2 (SCD2) preserves the full history of every change to a dimension row. Instead of overwriting the old value, you close out the previous version and insert a new one. Done correctly, an SCD2 dimension lets you answer "what did this customer look like at the moment of this sale?" months or years after the fact. Done incorrectly, it produces double-counted facts, misjoined history, and reports that disagree with themselves.

This guide walks through the implementation choices that actually matter in production: surrogate keys versus business keys, the columns that make SCD2 work, and the MERGE patterns that load new versions reliably.

What Type 2 Actually Does

A dimension with SCD2 behavior has one row per version of an entity. When a tracked attribute changes -- a customer moves to a new region, a product changes its category, an employee changes managers -- the existing row is closed off and a new row is inserted. The fact tables continue to point to whichever version was current at the time the fact occurred.

The contrast with Type 1 is the whole point. A Type 1 dimension overwrites the old value, and every historical fact suddenly looks like it happened under the new attribute value. That is fine for corrections (fixing a misspelled name) and disastrous for analytical history (treating last year's sales as if they happened under this year's territory boundaries). Type 2 keeps the history; Type 1 erases it.

This pattern is the Kimball-canonical way of handling history in dimensional modeling, and it is one of the main reasons star schemas outperform naive "current state" models for any reporting that spans time.

The Required Columns

A working SCD2 dimension needs five things on every row.

A surrogate key. This is a system-generated integer (or BIGINT) that uniquely identifies the row, not the entity. Two versions of the same customer have two different surrogate keys. Facts join to the surrogate key, which is what gives you the point-in-time correctness. Never use the business key as the join column on a Type 2 dimension; you will collapse history every time.

A business key (or natural key). The identifier from the source system -- the customer ID, the product SKU, the employee number. Multiple rows in the dimension can share the same business key, distinguished by their effective dates. Index this column; you will look up "the current version of business key X" constantly during loads.

An effective-from timestamp. When this version of the row became valid. Usually drawn from the source change timestamp or the ETL batch time, depending on how trustworthy the source's own timestamps are.

An effective-to timestamp. When this version stopped being valid. The currently active row uses a far-future sentinel like 9999-12-31 rather than NULL. This matters: range predicates like WHERE fact_date BETWEEN effective_from AND effective_to break when one side is NULL, and BI tools handle the sentinel more predictably.

A current-row flag. A boolean (is_current) that is true on exactly one row per business key. Strictly speaking the effective-to sentinel is sufficient, but the flag makes "join to the current version" queries trivial and lets you build a partial index on WHERE is_current for fast current-state lookups. Keep both -- the redundancy is cheap and the query ergonomics matter.

Some teams also add a version number, a hash of the tracked attributes, or a source-system batch ID. These are useful but optional. The five above are the working minimum.

Which Attributes Trigger a New Version

Not every column on a dimension should produce a new SCD2 row. A typo correction in a customer's last name is not analytically meaningful; promoting a customer from "standard" to "premium" tier almost certainly is. Decide explicitly, per attribute, which columns are tracked and which are overwritten.

The common pattern is to mix Type 1 and Type 2 behavior on the same dimension. Tracked columns (region, tier, segment) follow Type 2 rules; correction columns (display name, email address) follow Type 1 rules and overwrite in place across all versions of the same business key. Document this per column. The Kimball books call this a Type 6 hybrid; the name matters less than the practice of being explicit.

A robust implementation computes a hash of the tracked columns on each load and compares it against the hash on the current row. If the hash matches, nothing changed and the row stays as is. If it differs, you close out the current row and insert a new one.

The Canonical MERGE Pattern

In a SQL warehouse, the load is two operations against the dimension table for each changed business key: close the existing current row, then insert the new version. The standard pattern uses a single MERGE plus an insert, or two coordinated statements within a transaction.

A typical implementation, expressed against a Postgres-flavored warehouse:

-- Step 1: close out rows whose tracked attributes have changed
UPDATE dim_customer d
SET
  effective_to = s.effective_from,
  is_current   = false
FROM staging_customer s
WHERE d.business_key = s.business_key
  AND d.is_current = true
  AND d.tracked_hash <> s.tracked_hash;

-- Step 2: insert the new version for changed rows AND brand-new business keys
INSERT INTO dim_customer (
  business_key, region, tier, segment,
  display_name, email,
  effective_from, effective_to, is_current,
  tracked_hash
)
SELECT
  s.business_key, s.region, s.tier, s.segment,
  s.display_name, s.email,
  s.effective_from, TIMESTAMP '9999-12-31', true,
  s.tracked_hash
FROM staging_customer s
LEFT JOIN dim_customer d
  ON d.business_key = s.business_key
 AND d.is_current = true
WHERE d.business_key IS NULL
   OR d.tracked_hash <> s.tracked_hash;

Run both statements inside a single transaction so the dimension is never observed mid-update. The Type 1 overwrites (display_name, email) are applied separately against all rows with the same business key, regardless of version.

For warehouses that support MERGE with WHEN MATCHED THEN UPDATE and WHEN NOT MATCHED THEN INSERT (Snowflake, BigQuery, Redshift), you can fold both steps into one statement, but the two-statement form is easier to reason about, easier to test, and easier to roll back when something goes wrong.

Loading Facts Against an SCD2 Dimension

The fact-table load must look up the surrogate key that was current at the time the fact occurred. This is a range join, not an equality join:

INSERT INTO fact_sales (date_key, customer_key, ...)
SELECT
  s.date_key,
  d.customer_key,
  ...
FROM staging_sales s
JOIN dim_customer d
  ON d.business_key = s.customer_business_key
 AND s.event_timestamp >= d.effective_from
 AND s.event_timestamp <  d.effective_to;

The half-open interval (>= from, < to) is intentional: it prevents a single moment in time from matching two rows when one version closes exactly as another opens. Test this with a fact whose timestamp lands on a version boundary -- it is the most common source of double counting.

Common Pitfalls

A handful of mistakes account for most broken SCD2 implementations.

Joining facts to the business key. This collapses all history into the current row and is the single most common cause of "the historical numbers changed when nothing happened." Always join to the surrogate key. Always.

NULL effective-to columns. They break range predicates and force defensive COALESCE calls in every BI query. Use a far-future sentinel from the start.

Closing rows by date alone, not by hash. If the source system re-emits unchanged rows, a date-only check will produce new versions every batch. Hash the tracked columns and only close when the hash changes.

Overlapping versions. Two rows with the same business key whose effective ranges overlap will both match a fact in the overlap window and inflate measures. Add a unique constraint or assertion on (business_key, is_current) filtered to is_current = true, and a periodic check that no two rows for the same business key have overlapping effective ranges.

Forgetting Type 1 columns. Mixing Type 1 and Type 2 behavior on one dimension is fine, but you must apply Type 1 updates across all historical versions of the same business key, not just the current one. Otherwise the corrected value appears only on the latest row and historical reports still show the typo.

Late-arriving facts. A fact whose event timestamp predates the earliest dimension version has no version to join to. Either backfill an initial-version row with an effective-from of -infinity, or flag the fact for review.

When Type 2 Is Overkill

SCD2 is not free. Every changed attribute generates a new row, the dimension grows roughly with the rate of change, and every fact load pays the range-join cost. If you do not actually need to reproduce historical attribute values -- if the business only ever asks "what does this look like today" -- a Type 1 overwrite is cheaper, simpler, and analytically sufficient.

The right test is: does any report ever filter or group by the tracked attribute as it was at the time of the fact? If yes, you need Type 2. If no, save yourself the operational complexity.