Integration Layer (Silver) Modeling Standards

Status: Draft v1 — proposed standards, pending team + management review. Owner: Enterprise Data Analytics engineering team. Scope: All dbt models under models/integration_layer/ in this repo. Audience: QTS data engineers, AEs, external partners (e.g. PHdata), and AI coding assistants.

0. Relationship to `.github/dbt_style_guide.md`

A unified QTS dbt style guide already exists at .github/dbt_style_guide.md (last updated 2025-07-31). It is the authoritative source for shared SQL, Jinja, Python, and YAML conventions across the entire repo (IL, PL, snapshots, macros) and is what .github/copilot-instructions.md points GitHub Copilot at.

This document does not duplicate that guide. It defers to it for:

Lowercase keywords / functions / identifiers.
4-space indentation, trailing commas.
Jinja delimiter spacing.
YAML 2-space indentation.
General SQL formatting.

This document adds the IL-specific layer on top:

IL purpose and scope (§1, §2).
Folder structure inside models/integration_layer/ (§3).
Standardized cleansing rules: NULL/blank handling, soft deletes, audit columns, dedup, timestamp UTC normalization (§7–§10).
IL-specific naming refinements beyond the global guide (§4, §5).
Source declaration standards (§16).
Snapshot standards (§17).
Existing macros to reuse before writing inline logic (§18).
Required tests and documentation patterns for IL models (§12, §15).
IL Pull Request expectations (§19), which extend .github/pull_request_template.md.

If this document and .github/dbt_style_guide.md ever conflict, the global style guide wins on shared concerns (formatting, base naming) and this document wins on IL-specific concerns (cleansing, dedup, audit, soft delete).

1. Purpose

The Integration Layer (IL) is the standardized, cleansed contract between raw source data and downstream business modeling. It is what other teams, analysts, semantic models, and AI tools should depend on instead of raw tables.

The IL exists to:

Normalize column names, types, and casing across sources.
Apply consistent NULL, blank, and soft-delete handling.
Convert all timestamps to UTC.
Preserve source lineage and audit metadata (_fivetran_synced, etc.).
Provide deterministic, deduplicated rows keyed on a documented business key.
Give the Presentation Layer (PL) and downstream tooling a stable, tested base to build dimensional and business models on top of.

The IL is not the place for:

Business definitions, KPIs, or metric logic. → Presentation Layer.
Joining across multiple sources to produce business entities. → Presentation Layer.
Ad-hoc analyst views. → analyses/ or PL.

2. Scope and layer terminology

Layer	Repo path	Database pattern	Purpose
Raw (Bronze)	Snowflake `RAW_*` databases	`RAW_*_DB`	Untransformed source data, owned by ingestion.
Integration (Silver)	`models/integration_layer/`	`EDW_<TARGET>_IL_DB`	Standardized, cleansed, source-faithful.
Presentation (Gold)	`models/presentation_layer/`	`EDW_<TARGET>_PL_DB`	Business-modeled facts, dims, and bridges.

This document governs the Integration Layer only.

3. Folder structure

The current convention in this repo, formalized:

models/integration_layer/
├── <source_system>/                     ← one folder per source system
│   ├── <source_system>.yaml             ← model + column docs and tests
│   ├── cleansing_transformations/       ← cleansed, source-faithful models
│   │   └── <source_abbrev>_<entity>.sql
│   └── business_transformations/        ← cross-entity reshaping within one source
│       └── <source_abbrev>_<entity>.sql
└── dimensions/                          ← cross-source IL-level dim helpers (rare)

Rules:

Every source system gets its own top-level folder under integration_layer/.
The first level inside a source folder reflects subject-area or domain grouping when the source is large (e.g. workday_hcm/organization/). Otherwise it is cleansing_transformations/ and business_transformations/ directly.
cleansing_transformations/ is for models that are 1:1 with a single source table and apply renaming, casting, NULL handling, and dedup only.
business_transformations/ is for models that combine multiple cleansing_transformations outputs from the same source into a more useful shape. Cross-source joining stays in the Presentation Layer.

4. Model naming

IL model names

Format: <source_abbrev>_<entity> (singular, snake_case).

Source	Abbreviation	Example
Salesforce	`sf`	`sf_quote`
Workday HCM	`wd_hcm`	`wd_hcm_worker`
Workday Financials	`wd_fin`	`wd_fin_invoice_line`
Workday Adaptive Planning	`wd_ap`	`wd_ap_account`
Procore	`pc`	`pc_daily_log`
Smartsheet	`ss`	`ss_project`
Datahawk	`dh`	`dh_market`
8x8	`eb`	`eb_call`

Rules:

Models are singular (sf_quote, not sf_quotes). One row = one entity.
Lowercase snake_case, ASCII only.
No abbreviations beyond the established source abbreviation. Prefer wd_hcm_worker_position over wd_hcm_wkr_pos.
Deprecated suffixes: legacy models use a _bl suffix (e.g. wd_hcm_organization_bl). New IL models must not add this suffix. The pattern is being retired. When touching legacy _bl models, rename if it can be done as part of the same PR; otherwise leave for a focused rename PR.
Fivetran salesforce_formula_utils exception. Models generated by the fivetran/salesforce_formula_utils package end in _formula (e.g. sf_quote_formula). The suffix is package-defined, not arbitrary, and must not be edited by hand — regeneration will overwrite changes. These models are exempt from rules 1–4 above for the suffix only.

IL fact / dim naming

IL business_transformation models occasionally take fact/dim shape inside a single source (e.g. sf_quote_dim, sf_opportunity_fact). In active use today is the suffix style: <source_abbrev>_<entity>_<dim|fact>.

Fact: <source_abbrev>_<entity>_fact — e.g. sf_quote_fact, sf_opportunity_fact.
Dim: <source_abbrev>_<entity>_dim — e.g. sf_account_dim, sf_user_dim.

Presentation Layer naming (informational only)

This document does not govern Presentation Layer naming. PL conventions vary across the repo (dim_fin_company vs wd_ap_dim_account_pl) and are out of scope for the IL standards. See .ai/findings/repo-survey-2026-06-17.md §3 for the open decision.

File names match model names

models/integration_layer/workday_hcm/organization/cleansing_transformations/wd_hcm_worker.sql materializes as wd_hcm_worker.

5. Column naming

Lowercase snake_case. No camelCase, no PascalCase, no spaces, no hyphens.
Primary keys are descriptive, never id in the projected column list.
- ✅ worker_id, daily_log_id, quote_id
- ❌ id, pk, key
- When the source column is id, rename in the cleansing model: id as worker_id.
- Exceptions:
  - The incremental unique_key config may continue to reference id while the underlying column is being renamed end-to-end (transition window).
  - Models generated by salesforce_formula_utils keep unique_key="id" to stay compatible with the package.
  - Snapshots in snapshots/ may use unique_key: id because they read from source(...) directly, before any IL rename has happened.
Foreign keys carry the parent entity name + _id. A foreign key from wd_hcm_worker_position to wd_hcm_worker is worker_id in both tables.
Surrogate keys end in _key (Presentation Layer concept; rare in IL).
Do not abbreviate. customer_id, not cust_id. order_id, not o_id.
Do not use reserved words. Avoid select, from, group, order, class, type (use <entity>_type instead).
Booleans start with is_ or has_ and read as a statement.
- ✅ is_active, has_dependents, is_terminated, is_floorplan
- ❌ active, terminated, floorplan
- Rename at IL: active as is_active.
Timestamps end in _at and are stored in UTC.
- ✅ created_at, updated_at, terminated_at
- If a non-UTC timestamp is genuinely required, suffix the timezone: created_at_pt. UTC remains the default.
Dates end in _date.
- ✅ hire_date, invoice_date, effective_date
Event verbs are past tense. created, updated, deleted, submitted.
Generic columns get the entity prefix so they remain unambiguous after joins:
- ✅ quote_status, quote_type, quote_created_at
- ❌ status, type, created_at (when ambiguous post-join)

6. Data types

Domain	Snowflake type	Notes
Money / amounts	`NUMBER(38,2)`	Never `FLOAT`. Use even when the source is `FLOAT` — cast in IL.
Counts / quantities	`NUMBER(38,0)` or `NUMBER(38,n)`	Match the precision the business requires.
IDs	`VARCHAR` (string)	All IDs are strings, including numeric-looking ones.
Booleans	`BOOLEAN`	Cast `'Y'/'N'`, `1/0`, etc. to true booleans in IL.
Timestamps	`TIMESTAMP_TZ`	Stored in UTC.
Dates	`DATE`
Text	`VARCHAR`	Unbounded by default; let Snowflake handle it.
Semi-structured	`VARIANT` / `OBJECT` / `ARRAY`	Flatten in IL where the structure is known and stable.

7. NULL and blank handling

Empty strings '' become NULL at IL.
String-encoded nulls ('NULL', 'null', 'N/A', '-') become NULL at IL when you can confirm they represent missing data.
Trim leading and trailing whitespace from strings.
Source-system sentinel values (e.g. 9999-12-31 "high date") are kept as-is if the business uses them as sentinels; otherwise convert to NULL and document the choice in the model description.
Required PK columns must be NOT NULL after IL transformation. If a row would have a NULL PK, filter it out in IL and document why.

8. Soft deletes and Fivetran metadata

The current standard pattern (matches wd_hcm_worker.sql):

with
    source as (select * from {{ source('workday_hcm', 'worker') }}),

    renamed as (
        select
            id as worker_id,
            active as is_active,
            hire_date,
            termination_date,
            worker_code,
            user_id as worker_user_id,
            _fivetran_deleted,
            _fivetran_synced

        from source
    )

select * exclude (_fivetran_deleted)
from renamed
where _fivetran_deleted <> true or _fivetran_deleted is null

Rules:

Always preserve _fivetran_synced as the freshness indicator. Carry it through to the IL output column unchanged.
Filter out soft-deleted rows in IL using the where _fivetran_deleted <> true or _fivetran_deleted is null pattern, then exclude the column from the final select.
If a downstream consumer needs deleted rows for historical reasons, build a separate IL model with _deleted in the name; do not mix live and deleted rows in a single IL output.
For sources without Fivetran metadata, document the freshness column used instead (e.g. last_modified, etl_loaded_at).

9. Audit columns and source lineage

Every IL cleansing_transformations model must expose, in this order at the end of the column list:

_fivetran_synced     -- or equivalent source freshness column

Optional, encouraged for non-Fivetran sources or where the team agrees:

source_system        -- short identifier of the source, e.g. 'workday_hcm'
loaded_at            -- timestamp the data landed in Snowflake (UTC)

These are added at IL, not in PL. PL inherits them via ref().

10. Deduplication

When a source is versioned (multiple rows per business key with a freshness timestamp), deduplicate in IL using the canonical pattern:

with
    source as (select * from {{ source('foo', 'bar') }}),

    deduped as (
        select *,
               row_number() over (
                   partition by <business_key>
                   order by _fivetran_synced desc
               ) as _row_num
        from source
    )

select <columns>
from deduped
where _row_num = 1

Rules:

The business key is documented in the model YAML description.
The ordering column is _fivetran_synced for Fivetran sources; otherwise the source's last_modified / updated_at / equivalent.
The dedup CTE name is deduped for consistency.
Drop _row_num from the final select.

11. Materialization

Defaults are set in dbt_project.yml. Override only with a documented reason.

Situation	Materialization
Source is small, downstream usage is light	`view` (default)
Source is large or downstream queries are slow on a view	`table`
Source is large and grows append-only or update-by-key	`incremental`
Source is small but heavily reused across PL	`table`

For incremental models:

unique_key should reference the descriptive PK name (e.g. unique_key='daily_log_id'). The legacy 'id' is acceptable during a transition window per §5.2 exceptions; new hand-written incrementals should start with the descriptive name.
on_schema_change must be set explicitly. Default to 'append_new_columns' unless there is a reason to fail.

Incremental filters use >= not > to avoid missing records that share a timestamp with the high-water mark.

{% if is_incremental() %}
    where _fivetran_synced >= (select max(_fivetran_synced) from {{ this }})
{% endif %}

Salesforce `_formula` incremental pattern (do not edit by hand)

Salesforce cleansing models generated by salesforce_formula_utils follow a fixed pattern that the standards do not override:

{{ config(
    materialized="incremental",
    incremental_strategy="merge",
    unique_key="id",
    data-removed="append_new_columns"
) }}
{{ get_formula_model("<source_object>", incremental_this=(this if is_incremental() else none)) }}

Keep unique_key="id", keep incremental_strategy="merge", do not edit the file body — it will be overwritten by the package.

12. Required dbt tests

In the source-folder YAML, every IL model must have:

Primary key — unique and not_null on the PK column. Reference the descriptive PK name (worker_id), not id.
Foreign keys — relationships test to the parent IL model where the parent is also in the IL.
Enumerated columns — accepted_values for status / type columns whose set is small and stable.
Not-null on required business columns — at minimum the columns the model description claims are required.

Use the modern dbt syntax (flags: require_generic_test_arguments_property: true is set in dbt_project.yml):

columns:
  - name: worker_id
    data_type: varchar
    description: "Workday worker ID. Primary key."
    data_tests:
      - unique
      - not_null

13. SQL style

Most rules are enforced by .sqlfluff. The standards doc only states conventions not already covered there:

One CTE = one logical step. Name CTEs descriptively (renamed, deduped, joined_to_workers).
{{ ref() }} and {{ source() }} calls go in import CTEs at the top of the file, named after the table they reference.
Limit data scanned by import CTEs. Select only the columns used; apply where filters at the top, not at the bottom.
Last line is select * from <final_cte> so model output can be audited by swapping the final CTE name during development.
Use union all unless deduplication via union is genuinely intended.
Be explicit about join type: inner join, left join, full outer join. Never bare join.
Move left to right in joins. A right join usually means the FROM table is wrong.
No select * in final output. Enumerate columns. select * exclude (...) is allowed when the exclusion list is short and intentional (see the soft-delete pattern above).
No commented-out code in committed files. Use Jinja comments ({# … #}) for explanatory comments that should not appear in compiled SQL.

14. File header

Every .sql file in the IL begins with:

/*

Created By: <Author Name>
Created On: YYYY-MM-DD
Description: <one-sentence description of what this model is and its grain>

Updated By:
Updated On:
Description:

*/

When making non-trivial changes, append (do not overwrite) an Updated By / Updated On / Description block.

15. Documentation requirements

Where IL model YAML lives

The repo currently has IL model documentation split across two locations:

Co-located — e.g. models/integration_layer/workday_hcm/workday_hcm.yaml. Most descriptions are empty "".
Centralized — models/_model_yamls/<source>/<source>.yaml. Examples: _model_yamls/prowatch/prowatch.yaml, _model_yamls/workday/workday_financials.yaml, _model_yamls/salesforce/....

Standard for new work: put model documentation in models/_model_yamls/<source>/<source>.yaml. Migrate co-located YAML to that location as part of any meaningful PR that touches the affected models.

The gold-standard reference is models/_model_yamls/prowatch/prowatch.yaml. New IL YAML should match its quality bar.

Required content per IL model

A model description that includes the grain ("one row per …") and the business key.
A description on every column — not "". At minimum, describe the PK, FKs, status/type columns, all monetary columns, and all timestamp/date columns. Generic boilerplate is not acceptable.
data_type on every column where the type is not obvious from a sibling IL model.
The required tests from §12.

Why empty descriptions matter: dbt_project.yml sets +persist_docs: { relation: true, columns: true }, which pushes descriptions to Snowflake's COMMENT metadata. Empty descriptions surface as gaps in the Snowflake catalog and any AI/semantic tooling that reads from INFORMATION_SCHEMA.

Empty descriptions are blockers in code review.

16. Sources

Source declarations live in models/_source_yamls/<source>/. Each new source must declare:

version: 2

sources:
  - name: <source_name>
    database: edw_stage_db          # current default for Fivetran-landed sources
    schema: <schema>
    config:
      tags: ["<source_name>"]
      loaded_at_field: _fivetran_synced
      freshness:
        warn_after:  { count: 24, period: hour }
        error_after: { count: 72, period: hour }

    tables:
      - name: <table_name_in_dbt>
        identifier: <actual_table_in_snowflake>   # only when remapping

Rules:

Source files live under models/_source_yamls/<source>/, never co-located with models.
database and schema are set explicitly.
loaded_at_field is set: _fivetran_synced for Fivetran sources, otherwise the source's freshness column (e.g. last_modified).
freshness is set with warn_after and error_after thresholds calibrated to the connector's sync schedule.
Source-level tags are applied for orchestration.
Use identifier: to remap a source table when its raw name collides with another source (e.g. hcm_organization aliasing organization).
New sources are listed in the appropriate folder; do not put a new source into an existing source's YAML file.

17. Snapshots

Snapshots live in the top-level snapshots/ directory, in YAML files named _<source>_snapshots.yml (or a _<source>_snapshots/ directory of YAML fragments for very large source families).

Canonical snapshot config:

- name: wd_hcm_worker_snapshot
  relation: source('workday_hcm', 'worker')
  config:
    tags: <source>-snapshots
    schema: SNAPSHOT
    database: "{{ target.database }}"
    unique_key: id              # source PK as it lands; rename happens at IL
    strategy: timestamp
    updated_at: _fivetran_synced
    dbt_valid_to_current: "to_date('9999-12-31')"

Rules:

Snapshots read from source(...), not from IL models.
Snapshot model names are <source_abbrev>_<entity>_snapshot.
schema: SNAPSHOT and database: "{{ target.database }}".
Strategy is timestamp; updated_at is _fivetran_synced for Fivetran sources, otherwise the source's freshness column.
dbt_valid_to_current: "to_date('9999-12-31')" is the team's standard sentinel for the current row.
unique_key: id is acceptable here even though IL models rename id — snapshots predate the rename and operate on the source as it landed.
Composite keys are written as a YAML list: unique_key: [worker_id, manager_id, position_id].
Tags follow the pattern <source>-snapshots (e.g. hcm-snapshots).

18. Use existing macros before writing inline logic

The macros/ folder already contains utilities that should be reused instead of re-implementing inline in models. Always check macros/ first.

Macro	When to use
`norm_str(expr)`	Case-insensitive normalization for joins / comparisons. Returns `upper(trim(...))`.
`norm_key(parts)`	Building composite natural keys for dedup or surrogate-key inputs.
`json_value(column, path, cast_type)`	Extracting paths from VARIANT/JSON columns. Use this instead of writing raw `(col:path)::type` casts inline.
`with_unknown_row(model_ref, column_values)`	Adding an "Unknown" sentinel row to a dimension. PL use; mention in IL only when an IL dim genuinely needs one.
`quoted_column_not_null`	Custom dbt test for columns whose names need quoting (reserved words, special characters).
`convert_to_est(column_name)`	UTC → America/New_York conversion. Never use in IL — IL stays UTC. PL only, when a stakeholder explicitly wants ET.

From installed packages:

dbt_utils.generate_surrogate_key([...]) for surrogate keys (PL use).
dbt_utils.equal_rowcount, dbt_utils.expression_is_true, etc. for tests beyond unique / not_null / relationships / accepted_values.
dbt run-operation generate_model_yaml --args '{"model_names": ["<model>"]}' from the codegen package to scaffold YAML for a new model.

If you find yourself writing inline column-cleansing logic that feels repeatable across more than one model, raise a follow-up to add it as a macro rather than copy-pasting.

19. PR expectations

This section extends, not replaces, .github/pull_request_template.md. A PR that adds or modifies an IL model must include:

The .sql file with header and SQL conforming to this document.
The corresponding YAML entry in models/_model_yamls/<source>/<source>.yaml with description, columns, data_types, and tests (see §15).
A green dbt build --select state:modified+ locally.
No select * in final output (see §13.8).
A note in the PR description listing the model name, the grain, and the primary key.
The il_model_review_checklist.md is filled out and pasted into the PR description, or referenced as already reviewed.

20. What this document does not cover (yet)

The following are intentionally out of scope for v1 and will be addressed in follow-up PRs:

Presentation Layer naming. Multiple competing patterns exist in the repo; this is deferred to a separate PL standards effort.
Cross-source bridging patterns. PL-only.
A first-class macro library for column renaming, NULL normalization, dedup, and soft-delete — candidates for a macros/il_standards/ package.
Generated AI prompts for scaffolding new IL models.
Standardized source_system / loaded_at defaults applied via post-hooks.
Deeper CI rules — e.g. sqlfluff enforced in CI, doc-coverage thresholds, description-not-empty checks.
Migration plan for legacy _bl suffix renames and YAML migration from co-located to _model_yamls/.

21. Change history

Version	Date	Author	Notes
v1.1 (draft)	2026-06-17	Steve Farmer	Repo survey integration: added §0 (relationship to `.github/dbt_style_guide.md`), §16 Sources, §17 Snapshots, §18 Existing macros. Updated §4 (Salesforce `_formula` exception, IL fact/dim suffix style, PL naming deferred), §5 (`id` PK exception list), §11 (Salesforce formula incremental pattern), §15 (`_model_yamls/` location, prowatch as gold-standard, `+persist_docs` rationale), §19 (PR template cross-reference), §20 (sources/macros removed; PL naming added).
v1 (draft)	2026-06-16	Steve Farmer	Initial draft for team review.

Integration Layer (Silver) Modeling Standards

0. Relationship to .github/dbt_style_guide.md

1. Purpose

2. Scope and layer terminology

3. Folder structure

4. Model naming

IL model names

IL fact / dim naming

Presentation Layer naming (informational only)

File names match model names

5. Column naming

6. Data types

7. NULL and blank handling

8. Soft deletes and Fivetran metadata

9. Audit columns and source lineage

10. Deduplication

11. Materialization

Salesforce _formula incremental pattern (do not edit by hand)

12. Required dbt tests

13. SQL style

14. File header

15. Documentation requirements

Where IL model YAML lives

Required content per IL model

16. Sources

17. Snapshots

18. Use existing macros before writing inline logic

19. PR expectations

20. What this document does not cover (yet)

21. Change history

0. Relationship to `.github/dbt_style_guide.md`

Salesforce `_formula` incremental pattern (do not edit by hand)