Status: Draft v1 — proposed standards, pending team + management review. Owner: Enterprise Data Analytics engineering team. Scope: All dbt models under
models/integration_layer/in this repo. Audience: QTS data engineers, AEs, external partners (e.g. PHdata), and AI coding assistants.
.github/dbt_style_guide.mdA unified QTS dbt style guide already exists at
.github/dbt_style_guide.md (last updated
2025-07-31). It is the authoritative source for shared SQL, Jinja, Python,
and YAML conventions across the entire repo (IL, PL, snapshots, macros) and
is what .github/copilot-instructions.md points GitHub Copilot at.
This document does not duplicate that guide. It defers to it for:
This document adds the IL-specific layer on top:
§1, §2).models/integration_layer/ (§3).§7–§10).§4, §5).§16).§17).§18).§12, §15).§19), which extend
.github/pull_request_template.md.If this document and .github/dbt_style_guide.md ever conflict, the global
style guide wins on shared concerns (formatting, base naming) and this
document wins on IL-specific concerns (cleansing, dedup, audit, soft delete).
The Integration Layer (IL) is the standardized, cleansed contract between raw source data and downstream business modeling. It is what other teams, analysts, semantic models, and AI tools should depend on instead of raw tables.
The IL exists to:
_fivetran_synced, etc.).The IL is not the place for:
analyses/ or PL.| Layer | Repo path | Database pattern | Purpose |
|---|---|---|---|
| Raw (Bronze) | Snowflake RAW_* databases |
RAW_*_DB |
Untransformed source data, owned by ingestion. |
| Integration (Silver) | models/integration_layer/ |
EDW_<TARGET>_IL_DB |
Standardized, cleansed, source-faithful. |
| Presentation (Gold) | models/presentation_layer/ |
EDW_<TARGET>_PL_DB |
Business-modeled facts, dims, and bridges. |
This document governs the Integration Layer only.
The current convention in this repo, formalized:
models/integration_layer/
├── <source_system>/ ← one folder per source system
│ ├── <source_system>.yaml ← model + column docs and tests
│ ├── cleansing_transformations/ ← cleansed, source-faithful models
│ │ └── <source_abbrev>_<entity>.sql
│ └── business_transformations/ ← cross-entity reshaping within one source
│ └── <source_abbrev>_<entity>.sql
└── dimensions/ ← cross-source IL-level dim helpers (rare)
Rules:
integration_layer/.workday_hcm/organization/).
Otherwise it is cleansing_transformations/ and business_transformations/
directly.cleansing_transformations/ is for models that are 1:1 with a single
source table and apply renaming, casting, NULL handling, and dedup only.business_transformations/ is for models that combine multiple
cleansing_transformations outputs from the same source into a more
useful shape. Cross-source joining stays in the Presentation Layer.Format: <source_abbrev>_<entity> (singular, snake_case).
| Source | Abbreviation | Example |
|---|---|---|
| Salesforce | sf |
sf_quote |
| Workday HCM | wd_hcm |
wd_hcm_worker |
| Workday Financials | wd_fin |
wd_fin_invoice_line |
| Workday Adaptive Planning | wd_ap |
wd_ap_account |
| Procore | pc |
pc_daily_log |
| Smartsheet | ss |
ss_project |
| Datahawk | dh |
dh_market |
| 8x8 | eb |
eb_call |
Rules:
sf_quote, not sf_quotes). One row = one entity.wd_hcm_worker_position over wd_hcm_wkr_pos._bl suffix (e.g.
wd_hcm_organization_bl). New IL models must not add this suffix. The
pattern is being retired. When touching legacy _bl models, rename if it can
be done as part of the same PR; otherwise leave for a focused rename PR.salesforce_formula_utils exception. Models generated by the
fivetran/salesforce_formula_utils package end in _formula (e.g.
sf_quote_formula). The suffix is package-defined, not arbitrary, and
must not be edited by hand — regeneration will overwrite changes.
These models are exempt from rules 1–4 above for the suffix only.IL business_transformation models occasionally take fact/dim shape inside a
single source (e.g. sf_quote_dim, sf_opportunity_fact). In active use
today is the suffix style: <source_abbrev>_<entity>_<dim|fact>.
<source_abbrev>_<entity>_fact — e.g. sf_quote_fact, sf_opportunity_fact.<source_abbrev>_<entity>_dim — e.g. sf_account_dim, sf_user_dim.This document does not govern Presentation Layer naming. PL conventions
vary across the repo (dim_fin_company vs wd_ap_dim_account_pl) and are
out of scope for the IL standards. See
.ai/findings/repo-survey-2026-06-17.md
§3 for the open decision.
models/integration_layer/workday_hcm/organization/cleansing_transformations/wd_hcm_worker.sql
materializes as wd_hcm_worker.
id in the projected column list.worker_id, daily_log_id, quote_idid, pk, keyid, rename in the cleansing model:
id as worker_id.unique_key config may continue to reference id while
the underlying column is being renamed end-to-end (transition window).salesforce_formula_utils keep unique_key="id"
to stay compatible with the package.snapshots/ may use unique_key: id because they read
from source(...) directly, before any IL rename has happened._id. A foreign key from
wd_hcm_worker_position to wd_hcm_worker is worker_id in both tables._key (Presentation Layer concept; rare in IL).customer_id, not cust_id. order_id, not o_id.select, from, group, order,
class, type (use <entity>_type instead).is_ or has_ and read as a statement.is_active, has_dependents, is_terminated, is_floorplanactive, terminated, floorplanactive as is_active._at and are stored in UTC.created_at, updated_at, terminated_atcreated_at_pt. UTC remains the default._date.hire_date, invoice_date, effective_datecreated, updated, deleted, submitted.quote_status, quote_type, quote_created_atstatus, type, created_at (when ambiguous post-join)| Domain | Snowflake type | Notes |
|---|---|---|
| Money / amounts | NUMBER(38,2) |
Never FLOAT. Use even when the source is FLOAT — cast in IL. |
| Counts / quantities | NUMBER(38,0) or NUMBER(38,n) |
Match the precision the business requires. |
| IDs | VARCHAR (string) |
All IDs are strings, including numeric-looking ones. |
| Booleans | BOOLEAN |
Cast 'Y'/'N', 1/0, etc. to true booleans in IL. |
| Timestamps | TIMESTAMP_TZ |
Stored in UTC. |
| Dates | DATE |
|
| Text | VARCHAR |
Unbounded by default; let Snowflake handle it. |
| Semi-structured | VARIANT / OBJECT / ARRAY |
Flatten in IL where the structure is known and stable. |
'' become NULL at IL.'NULL', 'null', 'N/A', '-') become NULL at IL
when you can confirm they represent missing data.9999-12-31 "high date") are kept as-is
if the business uses them as sentinels; otherwise convert to NULL and
document the choice in the model description.NOT NULL after IL transformation. If a row
would have a NULL PK, filter it out in IL and document why.The current standard pattern (matches wd_hcm_worker.sql):
with
source as (select * from {{ source('workday_hcm', 'worker') }}),
renamed as (
select
id as worker_id,
active as is_active,
hire_date,
termination_date,
worker_code,
user_id as worker_user_id,
_fivetran_deleted,
_fivetran_synced
from source
)
select * exclude (_fivetran_deleted)
from renamed
where _fivetran_deleted <> true or _fivetran_deleted is null
Rules:
_fivetran_synced as the freshness indicator. Carry it
through to the IL output column unchanged.where _fivetran_deleted <> true or _fivetran_deleted is null pattern, then exclude the column from
the final select._deleted in the name; do not mix live and deleted
rows in a single IL output.last_modified, etl_loaded_at).Every IL cleansing_transformations model must expose, in this order at the
end of the column list:
_fivetran_synced -- or equivalent source freshness column
Optional, encouraged for non-Fivetran sources or where the team agrees:
source_system -- short identifier of the source, e.g. 'workday_hcm'
loaded_at -- timestamp the data landed in Snowflake (UTC)
These are added at IL, not in PL. PL inherits them via ref().
When a source is versioned (multiple rows per business key with a freshness timestamp), deduplicate in IL using the canonical pattern:
with
source as (select * from {{ source('foo', 'bar') }}),
deduped as (
select *,
row_number() over (
partition by <business_key>
order by _fivetran_synced desc
) as _row_num
from source
)
select <columns>
from deduped
where _row_num = 1
Rules:
_fivetran_synced for Fivetran sources; otherwise
the source's last_modified / updated_at / equivalent.deduped for consistency._row_num from the final select.Defaults are set in dbt_project.yml. Override only with a documented reason.
| Situation | Materialization |
|---|---|
| Source is small, downstream usage is light | view (default) |
| Source is large or downstream queries are slow on a view | table |
| Source is large and grows append-only or update-by-key | incremental |
| Source is small but heavily reused across PL | table |
For incremental models:
unique_key should reference the descriptive PK name (e.g.
unique_key='daily_log_id'). The legacy 'id' is acceptable during a
transition window per §5.2 exceptions; new hand-written incrementals should
start with the descriptive name.
on_schema_change must be set explicitly. Default to 'append_new_columns'
unless there is a reason to fail.
Incremental filters use >= not > to avoid missing records that share a
timestamp with the high-water mark.
{% if is_incremental() %}
where _fivetran_synced >= (select max(_fivetran_synced) from {{ this }})
{% endif %}
_formula incremental pattern (do not edit by hand)Salesforce cleansing models generated by salesforce_formula_utils follow a
fixed pattern that the standards do not override:
{{ config(
materialized="incremental",
incremental_strategy="merge",
unique_key="id",
data-removed="append_new_columns"
) }}
{{ get_formula_model("<source_object>", incremental_this=(this if is_incremental() else none)) }}
Keep unique_key="id", keep incremental_strategy="merge", do not edit the
file body — it will be overwritten by the package.
In the source-folder YAML, every IL model must have:
unique and not_null on the PK column. Reference the
descriptive PK name (worker_id), not id.relationships test to the parent IL model where the
parent is also in the IL.accepted_values for status / type columns whose
set is small and stable.Use the modern dbt syntax (flags: require_generic_test_arguments_property: true is set in dbt_project.yml):
columns:
- name: worker_id
data_type: varchar
description: "Workday worker ID. Primary key."
data_tests:
- unique
- not_null
Most rules are enforced by .sqlfluff. The standards doc
only states conventions not already covered there:
renamed, deduped, joined_to_workers).{{ ref() }} and {{ source() }} calls go in import CTEs at the top of
the file, named after the table they reference.where filters at the top, not at the bottom.select * from <final_cte> so model output can be audited
by swapping the final CTE name during development.union all unless deduplication via union is genuinely intended.inner join, left join, full outer join. Never bare join.right join usually means the FROM
table is wrong.select * in final output. Enumerate columns. select * exclude (...) is allowed when the exclusion list is short and intentional (see the
soft-delete pattern above).{# … #})
for explanatory comments that should not appear in compiled SQL.Every .sql file in the IL begins with:
/*
Created By: <Author Name>
Created On: YYYY-MM-DD
Description: <one-sentence description of what this model is and its grain>
Updated By:
Updated On:
Description:
*/
When making non-trivial changes, append (do not overwrite) an Updated By /
Updated On / Description block.
The repo currently has IL model documentation split across two locations:
models/integration_layer/workday_hcm/workday_hcm.yaml.
Most descriptions are empty "".models/_model_yamls/<source>/<source>.yaml. Examples:
_model_yamls/prowatch/prowatch.yaml,
_model_yamls/workday/workday_financials.yaml,
_model_yamls/salesforce/....Standard for new work: put model documentation in
models/_model_yamls/<source>/<source>.yaml. Migrate co-located YAML to that
location as part of any meaningful PR that touches the affected models.
The gold-standard reference is
models/_model_yamls/prowatch/prowatch.yaml.
New IL YAML should match its quality bar.
description that includes the grain ("one row per …") and the
business key.description on every column — not "". At minimum, describe the PK,
FKs, status/type columns, all monetary columns, and all timestamp/date
columns. Generic boilerplate is not acceptable.data_type on every column where the type is not obvious from a sibling
IL model.Why empty descriptions matter: dbt_project.yml sets +persist_docs: { relation: true, columns: true }, which pushes descriptions to Snowflake's
COMMENT metadata. Empty descriptions surface as gaps in the Snowflake catalog
and any AI/semantic tooling that reads from INFORMATION_SCHEMA.
Empty descriptions are blockers in code review.
Source declarations live in models/_source_yamls/<source>/. Each new source
must declare:
version: 2
sources:
- name: <source_name>
database: edw_stage_db # current default for Fivetran-landed sources
schema: <schema>
config:
tags: ["<source_name>"]
loaded_at_field: _fivetran_synced
freshness:
warn_after: { count: 24, period: hour }
error_after: { count: 72, period: hour }
tables:
- name: <table_name_in_dbt>
identifier: <actual_table_in_snowflake> # only when remapping
Rules:
models/_source_yamls/<source>/, never
co-located with models.database and schema are set explicitly.loaded_at_field is set: _fivetran_synced for Fivetran sources, otherwise
the source's freshness column (e.g. last_modified).freshness is set with warn_after and error_after thresholds calibrated
to the connector's sync schedule.identifier: to remap a source table when its raw name collides with
another source (e.g. hcm_organization aliasing organization).Snapshots live in the top-level snapshots/ directory, in YAML files named
_<source>_snapshots.yml (or a _<source>_snapshots/ directory of YAML
fragments for very large source families).
Canonical snapshot config:
- name: wd_hcm_worker_snapshot
relation: source('workday_hcm', 'worker')
config:
tags: <source>-snapshots
schema: SNAPSHOT
database: "{{ target.database }}"
unique_key: id # source PK as it lands; rename happens at IL
strategy: timestamp
updated_at: _fivetran_synced
dbt_valid_to_current: "to_date('9999-12-31')"
Rules:
source(...), not from IL models.<source_abbrev>_<entity>_snapshot.schema: SNAPSHOT and database: "{{ target.database }}".timestamp; updated_at is _fivetran_synced for Fivetran
sources, otherwise the source's freshness column.dbt_valid_to_current: "to_date('9999-12-31')" is the team's standard
sentinel for the current row.unique_key: id is acceptable here even though IL models rename id —
snapshots predate the rename and operate on the source as it landed.unique_key: [worker_id, manager_id, position_id].<source>-snapshots (e.g. hcm-snapshots).The macros/ folder already contains utilities that should be reused
instead of re-implementing inline in models. Always check macros/ first.
| Macro | When to use |
|---|---|
norm_str(expr) |
Case-insensitive normalization for joins / comparisons. Returns upper(trim(...)). |
norm_key(parts) |
Building composite natural keys for dedup or surrogate-key inputs. |
json_value(column, path, cast_type) |
Extracting paths from VARIANT/JSON columns. Use this instead of writing raw (col:path)::type casts inline. |
with_unknown_row(model_ref, column_values) |
Adding an "Unknown" sentinel row to a dimension. PL use; mention in IL only when an IL dim genuinely needs one. |
quoted_column_not_null |
Custom dbt test for columns whose names need quoting (reserved words, special characters). |
convert_to_est(column_name) |
UTC → America/New_York conversion. Never use in IL — IL stays UTC. PL only, when a stakeholder explicitly wants ET. |
From installed packages:
dbt_utils.generate_surrogate_key([...]) for surrogate keys (PL use).dbt_utils.equal_rowcount, dbt_utils.expression_is_true, etc. for tests
beyond unique / not_null / relationships / accepted_values.dbt run-operation generate_model_yaml --args '{"model_names": ["<model>"]}'
from the codegen package to scaffold YAML for a new model.If you find yourself writing inline column-cleansing logic that feels repeatable across more than one model, raise a follow-up to add it as a macro rather than copy-pasting.
This section extends, not replaces,
.github/pull_request_template.md.
A PR that adds or modifies an IL model must include:
.sql file with header and SQL conforming to this document.models/_model_yamls/<source>/<source>.yaml
with description, columns, data_types, and tests (see §15).dbt build --select state:modified+ locally.select * in final output (see §13.8).il_model_review_checklist.md
is filled out and pasted into the PR description, or referenced as already
reviewed.The following are intentionally out of scope for v1 and will be addressed in follow-up PRs:
macros/il_standards/ package.source_system / loaded_at defaults applied via
post-hooks._bl suffix renames and YAML migration from
co-located to _model_yamls/.| Version | Date | Author | Notes |
|---|---|---|---|
| v1.1 (draft) | 2026-06-17 | Steve Farmer | Repo survey integration: added §0 (relationship to .github/dbt_style_guide.md), §16 Sources, §17 Snapshots, §18 Existing macros. Updated §4 (Salesforce _formula exception, IL fact/dim suffix style, PL naming deferred), §5 (id PK exception list), §11 (Salesforce formula incremental pattern), §15 (_model_yamls/ location, prowatch as gold-standard, +persist_docs rationale), §19 (PR template cross-reference), §20 (sources/macros removed; PL naming added). |
| v1 (draft) | 2026-06-16 | Steve Farmer | Initial draft for team review. |