
Automating Data Refresh in Microsoft Fabric: Foundational Architecture Decision For a Successful AI Initiative

CodeCraft

1 week ago


How a hybrid refresh architecture in Microsoft Fabric makes data trustworthy at scale, and why it is the prerequisite for AI and data democratization

Gartner reported in January 2026 that at least half of all generative AI projects were abandoned after proof of concept by the end of last year. The leading cause was not the model but the data.

When AI underperforms, organizations tend to examine factors such as model selection, prompt design, or adoption gaps. They do not scrutinize the data infrastructure delivering inputs to these models.

Most organizations have invested considerably in data analysis tools to extract insights but inherited their data pipelines by default. Whether those pipelines deliver complete, up-to-date data is not discovered until something fails. And when a pipeline fails, the downstream outputs may not reflect it: errors such as a stale dataset, an incomplete load, or a failed validation check can pass through undetected while the results continue to look credible.

BCG’s Widening AI Value Gap (2025) report found that only 5% of companies worldwide are generating substantial value from AI, and building the right technology and data foundations is among the capabilities that define them. The architecture governing how enterprise data is updated, validated, and delivered is not incidental to AI performance; it is the deciding factor.

Organizations are extending data access in two directions: broader across the business through self-service reporting and democratized analytics, and deeper into automated systems through AI. Both movements rest on the assumption that the data being accessed is sound, and that assumption is rarely examined at the refresh and ingestion layer.

This article documents how we built a data pipeline in Microsoft Fabric to address this. 

The Refresh Problem at Scale  

A large US manufacturing organization’s reporting environment had grown considerably over the last several years. The production data, maintenance records, and operational metrics flowed from multiple source systems into a centralized analytics layer built on Microsoft Fabric, with Power BI as the primary consumption surface for business and operational leadership.

The refresh infrastructure had not kept pace. Data was being loaded without validation steps: it arrived from source systems and went directly into tables consumed by downstream reports. Upstream schema changes went undetected until a report failed, and late-arriving corrections to prior periods either overwrote records indiscriminately or were missed entirely.

The questions that the architecture could not answer reliably had become business problems:

  • How fast was new information available for decision-making? 
  • How were prior-period corrections absorbed without compromising historical reporting? 
  • How were upstream schema changes detected before they reached reported metrics? 
  • When failures occurred, how soon were they flagged?

The existing approach applied a uniform refresh strategy across datasets that behaved differently. Large tables with stable historical records were processed in full on every cycle alongside structurally sensitive datasets subject to upstream schema changes, with no distinction between them. The result was unnecessary compute exposure, accumulating lineage ambiguity, and duplication risk.

Hence, there was a need for an architecture that treated each dataset according to its own characteristics: one that aligned the refresh strategy to each dataset, embedded validation before publication, and contained failures.

Building Inside Fabric

As the organization was already in the Microsoft ecosystem with Power BI for reporting, SharePoint as a source, and Fabric as the platform, the native tooling was the obvious starting point, and it was adequate for simpler ingestion. However, the requirements here were not simple.

Power BI’s native incremental refresh operates at the semantic model layer and controls what it loads into memory. For an organization managing large operational datasets across multiple source systems, with validation requirements that needed to be enforced before data reached a reporting surface, this was the wrong layer to work at.

The decision to build a custom ELTL (Extract, Load, Transform, Load) pipeline inside Fabric came down to three requirements that native options could not satisfy together:

  • Pre-publication validation: Data had to be checked for structural conformity, completeness, and temporal boundary compliance before it reached an authoritative table.
  • Dataset-specific refresh strategies: Some tables required only their most recent periods to be rebuilt on each cycle, leaving historical records untouched. Others required a complete replacement when upstream structure changed. A uniform strategy applied across both created unnecessary compute exposure or unacceptable lineage risk.
  • Observability: Every run needed to produce a structured outcome (success or failure, rows processed, validation status) and communicate it immediately to the right stakeholders. In an environment where operational leadership consumed reports directly, a failure that passed without notification was a governance failure. In a democratized data environment, business users consuming reports have no visibility into whether the underlying data is validated; the notification layer and audit trail are what make broad access governable.

Fabric provided the orchestration layer, compute, and connectors, but lacked a pipeline designed to use these capabilities deliberately. The architecture that meets all three requirements is a hybrid refresh model that treats each dataset according to its behavior.

Architecture in Practice: Hybrid Refresh Model 

The architecture that addressed these requirements combines two distinct refresh strategies, partial refresh and full table replacement, applied to different datasets based on their behavior. A single refresh strategy applied uniformly across all datasets is one of the more common architectural decisions that looks reasonable at the outset and becomes a liability at scale.

The choice between the two strategies follows a clear set of criteria:

  • Partial refresh applies where change is concentrated in recent periods, historical records remain untouched, and late-arriving corrections are absorbed without destabilizing prior reporting. The reconstruction window, set at three calendar months, is calibrated to the correction lag of the source systems feeding this environment: most upstream corrections to operational data arrive within a two-month window, so the three-month boundary provides a deliberate buffer. Records older than that window are statistically stable, and reprocessing them on every cycle yields no measurable improvement in data quality at material compute cost.
  • Full table replacement applies where datasets are delivered as complete period-end extracts, or where upstream structural changes make incremental reconciliation impractical. When a source system redefines a field, introduces a new column, or reorganizes its output, the delta between old and new structure creates reconciliation edge cases that are difficult to test exhaustively. A clean replacement eliminates those edge cases, resets the table structure, and produces unambiguous audit lineage.  
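The selection criteria above reduce to a small decision rule. A minimal sketch, assuming a per-dataset configuration with illustrative flag names (these are not fields from the real pipeline):

```python
def choose_strategy(dataset: dict) -> str:
    """Map the selection criteria onto a per-dataset decision.

    Complete period-end extracts and schema-unstable sources get a clean
    replacement; datasets whose change is concentrated in recent periods
    get a partial refresh. Flag names are illustrative placeholders.
    """
    if dataset.get("full_period_extract") or dataset.get("schema_unstable"):
        return "full_replacement"
    return "partial_refresh"
```

In practice such a rule would live in per-dataset metadata, so adding a dataset means declaring its behavior rather than writing new branching logic.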

Both strategies operate under the same governance framework: the same validation controls run before data reaches a consumption-ready table, structured metadata is produced on every run, and the notification layer communicates the outcome to the relevant teams immediately.

Partial Refresh: How It Works

The default choice for keeping data current is to rebuild the entire table on every run. For large operational datasets with stable historical records, this is cost-intensive: when a full rebuild fails mid-cycle, the entire dataset is at risk, and as data volumes grow, the compute cost of reprocessing history far exceeds the marginal improvement in data quality.

The partial refresh approach reconstructs only the most recent three full calendar months and the current month on every run, leaving everything prior as it is. The result is a refresh window that is predictable in cost and bounded in failure exposure. The pipeline executes this in five steps: 

1. Validating the Staged Dataset

Most data quality failures are discovered downstream in a report or by a stakeholder. The pipeline addresses this by running four explicit checks against the staged dataset before the master table is touched:

  • Presence: The dataset contains records and is not empty. An empty file arriving from a source system passes through undetected under a standard append, but this check stops it before it reaches the master table. 
  • Schema conformity: The structure matches the expected column definitions. This check catches upstream field changes, such as a renamed column or a changed data type, at the staging boundary before they propagate into downstream models or reports. 
  • Completeness: Record counts fall within the expected range for the reconstruction window. A file that contains fewer records than expected is a partial extraction and is not published. 
  • Boundary compliance: All records fall within the correct temporal range for the cycle. Data landing outside the expected window, a common consequence of incorrect batch processing in source systems, is rejected before it reaches the master table.

If any check fails, the pipeline halts immediately and sends a failure notification; a validation failure routes directly to the notification connector without running any subsequent steps.
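The four checks can be sketched as a single gate function. This is a minimal illustration over a staged dataset modeled as a list of dicts, with assumed parameter names (`expected_columns`, `min_rows`, the window bounds); the article does not show the actual Notebook logic.

```python
from datetime import date

def validate_staged(rows, expected_columns, min_rows,
                    window_start, window_end, date_field="record_date"):
    """Run the four pre-publication checks against a staged dataset.

    Returns a list of failure messages; an empty list means the staged
    data may be published to the master table.
    """
    # 1. Presence: an empty extract must never reach the master table.
    if not rows:
        return ["presence: staged dataset is empty"]
    failures = []
    # 2. Schema conformity: catch renamed or re-typed columns at the boundary.
    actual = set(rows[0].keys())
    if actual != set(expected_columns):
        failures.append(f"schema: expected {sorted(expected_columns)}, got {sorted(actual)}")
    # 3. Completeness: a short row count signals a partial extraction.
    if len(rows) < min_rows:
        failures.append(f"completeness: {len(rows)} rows, expected at least {min_rows}")
    # 4. Boundary compliance: every record must fall inside the cycle's window.
    if date_field in actual:
        outside = sum(1 for r in rows if not (window_start <= r[date_field] <= window_end))
        if outside:
            failures.append(f"boundary: {outside} rows outside the refresh window")
    return failures
```

If the returned list is non-empty, the pipeline would halt and route the messages to the notification connector rather than touching the master table.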

2. Scoped Deletion of the Reconstruction Window

Once validation passes, the Notebook removes only the records that will be replaced: the last three full calendar months and the current partial month. The boundary is calculated at runtime on every cycle, which means it advances automatically and needs no manual updating. The result is a reconstruction window that is always rebuilt from a clean state, without duplicates or stale records.
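The runtime boundary calculation is plain date arithmetic. A sketch, assuming the window is anchored to the first day of the month three full calendar months back:

```python
from datetime import date

def reconstruction_window_start(today: date, months_back: int = 3) -> date:
    """First day of the calendar month `months_back` full months before `today`.

    Records dated on or after this boundary are deleted and rebuilt each
    cycle; everything earlier is left untouched. Because the boundary is
    derived from the run date, it advances automatically.
    """
    year, month = today.year, today.month - months_back
    while month < 1:  # roll back across a year boundary if needed
        month += 12
        year -= 1
    return date(year, month, 1)
```

For a run on 15 January 2026, for example, the boundary is 1 October 2025, so October through the current partial January are rebuilt and September and earlier stay fixed.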

3. Appending Cleaned and Validated Records

Validated rows from the staging table are appended to the master dataset, leaving the reconstruction window clean, accurate, and current, with all prior history unchanged. The logic is deliberately idempotent: interrupting and rerunning the pipeline at any point produces an identical final state, so the master table contains no duplicates or ambiguity after a successful run or a retry.
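The delete-then-append pair and its idempotence can be modeled compactly. A sketch over lists of dicts, assuming a `record_date` key; in the real Notebook the same pair would be a scoped Delta table delete plus an append, but the argument is identical.

```python
def partial_refresh(master, staged, window_start, key="record_date"):
    """Scoped delete-then-append over the reconstruction window.

    The delete and the append cover exactly the same window, so rerunning
    the pair after an interruption reproduces the same final state.
    """
    kept = [r for r in master if r[key] < window_start]       # history untouched
    rebuilt = [r for r in staged if r[key] >= window_start]   # clean window
    return kept + rebuilt
```

Running the function a second time over its own output changes nothing, which is precisely the retry guarantee the text describes.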

4. Generating Structured JSON Output

Every run produces a log in the following structure:

{
  "run_timestamp": "…",
  "operation_type": "partial_refresh",
  "rows_deleted": 000,
  "rows_inserted": 000,
  "status": "success",
  "comments": "…"
}

This log is an operational input: it is passed to the notification connector, written to logging systems for audit traceability, and available for downstream monitoring.

The row count delta between rows_deleted and rows_inserted is itself a signal: a consistent pattern of insertions significantly exceeding deletions may indicate upstream data duplication, and a sudden drop in rows_inserted relative to prior cycles may indicate a partial extraction from the source. These signals are available for anomaly detection without additional instrumentation because the pipeline produces them as a matter of course.
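Those two signals can be checked mechanically against the accumulated logs. A hedged sketch over run logs shaped like the JSON structure above, with illustrative thresholds:

```python
def flag_anomalies(runs, dup_ratio=1.5, drop_ratio=0.5):
    """Scan a chronological sequence of run logs for two warning signals.

    dup_ratio: insertions outrunning deletions suggests upstream duplication.
    drop_ratio: a sharp fall versus the prior cycle suggests a partial extract.
    Both thresholds are illustrative, not values from the real pipeline.
    """
    flags = []
    for i, run in enumerate(runs):
        if run["rows_deleted"] and run["rows_inserted"] / run["rows_deleted"] > dup_ratio:
            flags.append((run["run_timestamp"], "possible upstream duplication"))
        if i > 0 and runs[i - 1]["rows_inserted"] and \
                run["rows_inserted"] < drop_ratio * runs[i - 1]["rows_inserted"]:
            flags.append((run["run_timestamp"], "possible partial extraction"))
    return flags
```

Because every run emits the same fields, a check like this can run over the log store with no extra instrumentation in the pipeline itself.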

5. Automated Notification

The pipeline’s Outlook connector dispatches a structured result summary on every run with execution timestamp, success or failure status, rows deleted and inserted, and any validation errors encountered. The notification fires on both success and failure paths, ensuring the outcome of all runs is communicated. 

The relevant teams know the state of the data immediately, without checking pipeline logs, and if something went wrong they are notified before a dashboard has consumed the result.
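The summary the connector dispatches can be assembled directly from the run log. A sketch of the formatting only, using the field names from the JSON structure shown earlier; the Outlook connector's actual API is not shown here.

```python
def notification_message(log: dict):
    """Format a run log into an email subject line and plain-text body.

    Field names follow the pipeline's JSON log; the layout is illustrative.
    The same formatting serves both success and failure paths, so every
    run produces a notification.
    """
    subject = f"[{log['status'].upper()}] {log['operation_type']} at {log['run_timestamp']}"
    body = "\n".join([
        f"Rows deleted: {log['rows_deleted']}",
        f"Rows inserted: {log['rows_inserted']}",
        f"Comments: {log['comments']}",
    ])
    return subject, body
```

Putting the status first in the subject line lets recipients triage failures from the inbox without opening the message.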

Full Table Replacement: How It Works

Incremental logic does not apply to all datasets. When a source system delivers a complete file each period, or when upstream structural changes have been introduced, incremental reconciliation adds complexity without improving reliability and introduces failure modes of its own.

For these datasets, Dataflow Gen2 handles the entire process. It extracts raw data from SharePoint, applies the required transformations, and overwrites the target Lakehouse table. The pipeline branching is simpler than the partial refresh path, as it requires no Notebook execution, scoped deletion, or reconstruction window. The Dataflow either completes successfully or it does not, and the outcome is communicated through the same Outlook connector used in the partial refresh pipeline.

The governance standards are identical across both paths. The same metadata output is produced and notification logic applied. The audit trail is consistent regardless of the refresh strategy a dataset follows, which matters when governance stakeholders need to trace the lineage of a specific data state back to a pipeline run.

Two Strategies, One Framework

The two refresh strategies share Fabric Pipeline as a common orchestration layer but differ in how they handle ingestion. Complex ingestion requiring programmatic control, conditional logic, and validation is handled in a Python Notebook orchestrated via Fabric Pipeline; simple flat file ingestion from SharePoint is handled by Dataflow Gen2.

| Component | Partial Refresh | Full Replacement |
| --- | --- | --- |
| Ingestion tool | Notebook handles complex ingestion logic, orchestrated via Fabric Pipeline | Dataflow Gen2 handles simple flat file ingestion from SharePoint |
| Notebook | Yes: validation, scoped deletion, and reinsert | No |
| Deletion logic | Scoped to a definite time window | N/A |
| Structured log | Yes | Yes |
| Email notification | Yes, both outcomes | Yes, both outcomes |
| Idempotency | Delete/reinsert is deterministic | Full overwrite is deterministic |
| Best suited for | Daily-changing, historically stable datasets | Complete-period files, structural resets |

What the Architecture Delivers

The value of a refresh architecture reveals itself over time in the quality of decisions, the stability of the historical record, and its capacity to absorb change without requiring intervention. For this organization, the hybrid refresh model delivered across three dimensions that matter to leadership. 

  1. Data That Can Be Trusted

Validation runs before any permanent changes occur. Schema drift, empty loads, and out-of-range records are caught before they reach an authoritative table, governed by the architecture rather than by manual oversight. 

In an environment where operational decisions are based on reported metrics every day, the difference between validated and unvalidated data is the difference between a foundation and a risk.  

  2. Intact Historical Record

The rolling-window reconstruction absorbs corrections and late arrivals in recent periods while mature historical records remain fixed; everything outside the reconstruction window is untouched by a refresh cycle.

In a manufacturing environment where forecasting, audit processes, and operational reviews depend on the integrity of prior periods, this stability is a business requirement. The architecture enforces it by design rather than relying on operational discipline to preserve it.

  3. Early Failure Detection

Every run produces a structured record with timestamp, operation type, rows deleted and inserted, validation status, and execution outcome. The relevant teams are notified immediately. A failed or incomplete load cannot reach a downstream report without being flagged first.

The JSON metadata accumulates into a time series that can be monitored. The patterns in row count deltas, recurring validation failures at specific time windows, and anomalies in execution duration are detectable without additional instrumentation. 

Why This Matters for AI

Data democratization and AI adoption share a dependency that most organizations have not explicitly addressed: the quality and validity of the data being used. A refresh architecture that validates data, notifies on every run, and produces a complete audit trail makes broader access to data useful for producing better decisions.

Fabric’s AI capabilities consume what Lakehouse tables contain, yet organizations invest considerably more in selecting and configuring the model than in examining the data it consumes, which poses significant risk. The validation step in the partial refresh pipeline changes this sequencing: data passes structural checks before it reaches an authoritative table, which is the point at which a data problem can still be contained. Once it crosses that boundary, it becomes an input to a critical forecast or report.

The JSON metadata extends the architecture’s reach further. Row counts, operation types, and status signals accumulate into a structured time series that can be monitored; patterns such as recurring validation failures at specific time windows or anomalies in execution duration are detectable without additional tooling. The pipeline generates the evidence needed to govern it.

The organizations extracting real value from AI are the ones making a fundamental shift in how they store, access, and govern data, treating architecture as a strategic capability. If you are looking to build a data architecture that enables AI and business insights at scale, reach out to our team.

Talk to an Expert

Tags: AI/ML, Artificial intelligence, Data Refresh, GenerativeAI, Microsoft Fabric, Technology


