# Running dbt inside a Microsoft Fabric Python Notebook
This skill explains how to run a full dbt pipeline inside a Microsoft Fabric Python notebook using DuckDB as the compute engine and DuckLake for Delta Lake table management.
## Architecture

```text
Fabric Notebook (ephemeral Python session)
  -> pip install duckdb, dbt-duckdb
  -> dbt run (DuckDB in-memory + DuckLake extension)
       -> on-run-start: sets DuckLake options, downloads source data
       -> transforms with dbt models (incremental by file)
       -> DuckLake writes Parquet to abfss://.../<lakehouse>/Tables/
       -> on-run-end: CALL delta_export() writes Delta Lake _delta_log
  -> Fabric / Power BI reads Delta tables natively
```
Key insight: DuckDB runs in-memory inside the notebook. DuckLake stores data as Parquet on OneLake (abfss://) and keeps metadata in a local SQLite file. The delta_export DuckDB extension converts DuckLake metadata into Delta Lake transaction logs — no separate Python package needed. Everything runs inside dbt run.
## Notebook Template (4 cells)

### Cell 1 — Install dependencies

```python
!pip install -q duckdb==1.4.4
!pip install -q dbt-duckdb
import sys
sys.exit(0)  # Restart kernel to pick up new packages
```
### Cell 2 — Environment variables

```python
import os
os.environ['DBT_SCHEMA'] = 'my_schema'
os.environ['ROOT_PATH'] = 'abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse'
os.environ['download_limit'] = '100'
os.environ['process_limit'] = '100'
os.environ['METADATA_LOCAL_PATH'] = '/synfs/nb_resource/builtin/metadata.db'
```
| Variable | Purpose |
|---|---|
| `DBT_SCHEMA` | Target schema name in DuckLake |
| `ROOT_PATH` | `abfss://` path to the Fabric lakehouse root |
| `download_limit` | Max files to download per source per run |
| `process_limit` | Max files to process per model per run (default: 100) |
| `METADATA_LOCAL_PATH` | Local path for the DuckLake SQLite metadata DB. Use `/synfs/nb_resource/builtin/` in Fabric — this folder persists across notebook sessions |
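The limits above are read inside the dbt project with dbt's `env_var()` function. As a minimal Python sketch of its lookup semantics (a set value wins, else the supplied default, else an error):

```python
import os

# Sketch of dbt's env_var() lookup: environment value first, then default.
def env_var(name, default=None):
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"Env var '{name}' is required and has no default")
    return value

os.environ['process_limit'] = '250'
assert env_var('process_limit', '100') == '250'    # explicit setting wins
assert env_var('no_such_var_xyz', '100') == '100'  # default applies
```

This is why `process_limit` can be omitted from Cell 2 entirely: the default in the model code takes over.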
### Cell 3 — Clone dbt project

Public repo:

```python
!git clone --branch production --single-branch https://github.com/<your-repo>.git /tmp/dbt
```

Private repo (production):

```python
pat = mssparkutils.credentials.getSecret('https://<vault-name>.vault.azure.net/', 'github-pat')
!git clone --branch production --single-branch https://{pat}@github.com/<your-repo>.git /tmp/dbt
```
For private repos, store a GitHub fine-grained PAT (with Contents: Read permission) in Azure Key Vault. Grant the Fabric workspace identity Key Vault Secrets User access.
### Cell 4 — Run dbt

```python
!cd /tmp/dbt && dbt run && dbt test
```
Delta export happens automatically via the on-run-end hook — no separate step needed.
## profiles.yml Configuration

```yaml
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ':memory:'
      database: ducklake
      threads: 1
      schema: "{{ env_var('DBT_SCHEMA', 'default') }}"
      config_options:
        allow_unsigned_extensions: true
      settings:
        preserve_insertion_order: false
      extensions:
        - parquet
        - sqlite
        - azure
        - httpfs
        - json
        - ducklake
        - name: zipfs
          repo: community
        - name: delta_export
          repo: community
      attach:
        - path: "ducklake:sqlite:{{ env_var('METADATA_LOCAL_PATH', '/tmp/ducklake_metadata.db') }}"
          alias: ducklake
          options:
            data_path: "{{ env_var('ROOT_PATH') }}/Tables"
            data_inlining_row_limit: 0
```
Key points:

- `path: ':memory:'` — DuckDB runs fully in-memory, no local database file
- `database: ducklake` — tells dbt to target the attached DuckLake catalog
- `config_options.allow_unsigned_extensions: true` — must be set at connection creation time (not in `settings`) to allow community extensions
- `settings.preserve_insertion_order: false` — reduces memory usage by allowing DuckDB to reorder results
- `data_path` points to `abfss://.../<lakehouse>/Tables`, where the Parquet data lives
- `data_inlining_row_limit: 0` — disables data inlining (small writes go to Parquet, not the metadata DB)
- Extensions: `azure` for `abfss://`, `httpfs` for HTTP downloads, `zipfs` for reading CSVs from ZIPs
## dbt_project.yml — Delta Export & DuckLake Maintenance

```yaml
on-run-start:
  - "CALL ducklake.set_option('rewrite_delete_threshold', 0)"
  - "CALL ducklake.set_option('target_file_size', '128MB')"
  - "{{ download() }}"  # your data ingestion macro

on-run-end:
  - "CALL ducklake_rewrite_data_files('ducklake')"
  - "CALL ducklake_merge_adjacent_files('ducklake')"
  - "CALL delta_export()"
```
delta_export is now a community extension — installed via profiles.yml extensions list (no manual INSTALL/LOAD needed). At run end, DuckLake maintenance compacts data files before delta_export() writes Delta Lake transaction logs.
Why delta_export is mandatory: DuckLake is not natively supported by Spark, Power BI, or any other engine in Microsoft Fabric. Without delta_export, the Parquet files written by DuckLake are invisible to the rest of the platform. The Delta Lake transaction logs (_delta_log/) make the tables readable as standard Delta tables by Power BI, Spark, SQL Analytics, and any Delta-compatible tool.
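For intuition about what the export produces: a Delta Lake commit is just a newline-delimited JSON file under `_delta_log/`. Below is a hand-rolled sketch of the three core actions a first commit contains; all field values are illustrative, not what `delta_export` actually emits:

```python
import json

# Illustrative first commit (_delta_log/00000000000000000000.json):
# protocol, table metadata, and one "add" action per Parquet data file.
commit_actions = [
    {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
    {"metaData": {
        "id": "00000000-0000-0000-0000-000000000000",  # table GUID (illustrative)
        "format": {"provider": "parquet", "options": {}},
        "schemaString": json.dumps({"type": "struct", "fields": []}),
        "partitionColumns": [],
        "configuration": {},
    }},
    {"add": {"path": "part-00001.parquet", "partitionValues": {},
             "size": 1024, "modificationTime": 0, "dataChange": True}},
]

# Delta commits are newline-delimited JSON: one action per line.
commit_file = "\n".join(json.dumps(action) for action in commit_actions)
```

Readers (Power BI, Spark, SQL Analytics) replay these actions to discover which Parquet files make up the table, which is exactly the mapping `delta_export` derives from DuckLake's metadata.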
## DuckLake Maintenance (on-run-end)

| Call | Purpose |
|---|---|
| `ducklake_rewrite_data_files('ducklake')` | Rewrites files with deletes (threshold controlled by `rewrite_delete_threshold`) |
| `ducklake_merge_adjacent_files('ducklake')` | Merges small Parquet files into larger ones (target size from `target_file_size`) |
| `delta_export()` | Writes the Delta Lake `_delta_log/` so Fabric/Power BI read tables natively |
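The rewrite decision can be sketched as a simple rule (assumed semantics: a data file is rewritten when its deleted-row fraction exceeds `rewrite_delete_threshold`):

```python
def needs_rewrite(deleted_rows: int, total_rows: int, threshold: float) -> bool:
    """Sketch of the assumed rewrite rule: rewrite a data file when its
    fraction of deleted rows exceeds the configured threshold."""
    if total_rows == 0:
        return False
    return deleted_rows / total_rows > threshold

# With the threshold forced to 0 in on-run-start, any file containing
# at least one deleted row qualifies for rewriting:
assert needs_rewrite(1, 1_000_000, threshold=0.0)
assert not needs_rewrite(0, 1_000_000, threshold=0.0)
```

Forcing the threshold to 0 keeps tables fully compacted every run, at the cost of extra rewrite work.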
## DuckLake Options

| Option | Value | How to Set | Why |
|---|---|---|---|
| `data_inlining_row_limit` | 0 | ATTACH option in profiles.yml | Disable storing small inserts in the metadata DB |
| `rewrite_delete_threshold` | 0 | `set_option` in on-run-start (not an ATTACH param) | Rewrite all files with any deletes |
| `target_file_size` | 128MB | `set_option` in on-run-start (not an ATTACH param) | Target Parquet file size for inserts and compaction (default 512MB, reduced to prevent OOM) |
Note: target_file_size controls both merge_adjacent_files merge target and auto-splitting of large inserts. It cannot be set via ATTACH options — must use set_option.
## Schema Separation (raw + aemo)

Separate intermediate tables from Power BI-facing tables using dbt schema config:

- `aemo` schema (default from `DBT_SCHEMA`): `dim_calendar`, `dim_duid`, `fct_summary` — exposed to Power BI
- `raw` schema: `fct_scada`, `fct_scada_today`, `fct_price`, `fct_price_today`, `stg_csv_archive_log` — intermediate
dbt_project.yml:

```yaml
models:
  aemo_electricity:
    staging:
      +materialized: view
      +schema: raw
    dimensions:
      +materialized: table
    marts:
      +materialized: incremental
      +schema: raw
```
Override specific models back to `aemo` with `{{ config(schema='aemo') }}` in `fct_summary.sql`.
Custom schema macro required (`macros/generate_schema_name.sql`):

```sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```
Without this macro, dbt's default behavior prefixes the custom schema with the target schema (producing e.g. `aemo_raw` instead of `raw`).
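The difference between dbt's default naming and the override can be sketched in plain Python:

```python
def dbt_default_schema(target_schema, custom):
    # dbt's built-in generate_schema_name: <target>_<custom> when a custom
    # schema is configured, otherwise just the target schema.
    return target_schema if custom is None else f"{target_schema}_{custom.strip()}"

def overridden_schema(target_schema, custom):
    # The macro above: use the custom schema verbatim when provided.
    return target_schema if custom is None else custom.strip()

assert dbt_default_schema('aemo', 'raw') == 'aemo_raw'  # unwanted prefix
assert overridden_schema('aemo', 'raw') == 'raw'        # what we want
assert overridden_schema('aemo', None) == 'aemo'        # default unchanged
```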
## Semantic Model (model.bim)

Deploy via the Fabric REST API using TMSL format (model.bim) — TMDL create is NOT supported by the API.

Key requirements for Direct Lake on OneLake (no SQL endpoint):

- `PBI_ProTooling` annotation with `DirectLakeOnOneLakeCreatedInDesktop`
- `sourceLineageTag` on all tables (`[schema].[table]`) and columns
- `relyOnReferentialIntegrity` on all relationships
- `PBI_RemovedChildren` annotation on the expression, listing excluded tables with the correct schema prefix
- Partition `schemaName` must match the lakehouse schema folder name
## Key Patterns

### Persistent DuckLake metadata

DuckLake's SQLite metadata DB needs a local filesystem. In Fabric notebooks, use `/synfs/nb_resource/builtin/` — this is the notebook resource folder that persists across sessions. Set `METADATA_LOCAL_PATH` to a file in that folder.
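Because the metadata DB is plain SQLite, it can be sanity-checked between runs with Python's built-in `sqlite3` module. DuckLake's internal table names vary by version, so this sketch just enumerates whatever is there:

```python
import os
import sqlite3

# Open the persisted DuckLake metadata DB and list its tables.
path = os.environ.get('METADATA_LOCAL_PATH', '/tmp/ducklake_metadata.db')
con = sqlite3.connect(path)
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
con.close()
print(tables)
```

An empty list means the catalog has not been initialized yet (the first `dbt run` creates it via the ATTACH in profiles.yml).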
### Incremental by file with process_limit

Track which source files have been processed using a `file` column in each fact table. Pre-hooks use DuckDB `SET VARIABLE` to pass only unprocessed file paths, capped by `process_limit`:

```sql
{{ config(
    materialized='incremental',
    unique_key=['file', 'DUID', 'SETTLEMENTDATE'],
    pre_hook="SET VARIABLE paths = (
        SELECT list(...)
        FROM (
            SELECT source_filename
            FROM {{ ref('stg_archive_log') }}
            WHERE source_type = 'daily'
            {% if is_incremental() %}
              AND source_filename NOT IN (SELECT DISTINCT file FROM {{ this }})
            {% endif %}
            LIMIT {{ env_var('process_limit', '100') }}
        )
    )"
) }}
```
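The pre-hook's selection logic, stripped to plain Python (function and variable names here are illustrative):

```python
import os

def next_batch(archive_log, already_processed):
    """Pick files from the archive log that have not been processed yet,
    capped by the process_limit environment variable (default 100)."""
    limit = int(os.environ.get('process_limit', '100'))
    unprocessed = [f for f in archive_log if f not in already_processed]
    return unprocessed[:limit]

log = [f"file_{i}.zip" for i in range(250)]
os.environ['process_limit'] = '100'
batch = next_batch(log, already_processed={"file_0.zip"})
assert len(batch) == 100
assert "file_0.zip" not in batch  # already-loaded files are skipped
```

Each run therefore processes at most `process_limit` new files per model; repeated runs eventually drain the backlog, which keeps individual notebook sessions within memory and time budgets.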
### Reading CSVs from ZIP archives

DuckDB can read CSVs directly from ZIP files on remote storage via the `zipfs` extension:

```sql
SELECT * FROM read_csv(
    getvariable('paths'),
    ignore_errors=true,
    null_padding=true
)
```
## Reference

See `reference.md` for full code examples.