Loading Data the Cloud-Native Way: OPENROWSET from Blob Storage to Azure SQL at Scale
Hook: You’ve got terabytes of CSV/Parquet files landing in Azure Blob every hour. Your app teams want fresh rows in Azure SQL Database without babysitting SSIS, moving VMs, or bolting on brittle scripts. You need something cloud-native, cheap, and fast—ideally a single SQL statement you can run in parallel and automate. That’s OPENROWSET (BULK) with Azure Blob.
Why this matters
- Zero servers to manage. Read directly from Blob into Azure SQL—no middle tier required.
- Security-first. Use Managed Identity instead of scattering SAS tokens everywhere.
- Scales with patterns. Wildcards for batch loads, partitioned folders for parallelism, and idempotent MERGE patterns to keep data clean.
- Operationally sane. Built-in retries + small transaction batches = fewer 2 AM pages.
Architecture in a glance
- Data lands in Blob by partition (e.g., `container/events/dt=2025-11-21/part-*.parquet`).
- Azure SQL defines an EXTERNAL DATA SOURCE pointing to that container (auth via SAS or Managed Identity).
- An INSERT…SELECT FROM OPENROWSET(BULK …) ingests files using a wildcard pattern.
- Loads run in parallel by partition (hour/day), with small batches for predictable logging and lock durations.
- A tiny ingestion ledger table enforces idempotency and powers safe retries.
Authentication: SAS vs Managed Identity (MI)
| Aspect | SAS Token | Managed Identity (Recommended) |
|---|---|---|
| Secret handling | You pass a long secret string | No secrets; Azure AD token requested at runtime |
| Rotation | Manual; tokens expire | Automatic token issuance |
| Scope | Can be right-sized via stored access policy | Scope via RBAC (Storage Blob Data Reader) |
| Risk | Easy to leak in code/scripts | Lower risk; no token to leak |
| Setup in Azure SQL | CREATE DATABASE SCOPED CREDENTIAL ... SECRET = 'sas' | IDENTITY = 'Managed Identity' (no secret) |
Blunt take: If you can use MI, do it. SAS is fine for quick starts, bad for long-lived prod.
One-time setup (Azure SQL)
1) Create the credential
Managed Identity (best):
-- Azure SQL Database with a system-assigned managed identity enabled on the server
CREATE DATABASE SCOPED CREDENTIAL [BlobMI]
WITH IDENTITY = 'Managed Identity';
SAS token (fallback):
-- SAS must include at least r (read) permission on the container or path
CREATE DATABASE SCOPED CREDENTIAL [BlobSAS]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2025-01-01&ss=bf...&se=2025-12-31T23%3A59Z&sp=rl...'; -- no leading '?'
2) Create the external data source
CREATE EXTERNAL DATA SOURCE [BlobEvents]
WITH (
TYPE = BLOB_STORAGE,
LOCATION = 'https://<account>.blob.core.windows.net/<container>',
CREDENTIAL = BlobMI -- or BlobSAS
);
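Optional sanity check before the first load: confirm the credential and data source landed as expected using the standard catalog views.

SELECT name, credential_identity FROM sys.database_scoped_credentials;
SELECT name, type_desc, location, credential_id FROM sys.external_data_sources;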
Fast path: Ingest CSV or Parquet with OPENROWSET
CSV example (header row, inferred schema)
INSERT INTO dbo.events_raw (event_id, ts_utc, user_id, payload)
SELECT
TRY_CAST([event_id] AS BIGINT),
TRY_CAST([ts_utc] AS DATETIME2(3)),
TRY_CAST([user_id] AS INT),
[payload]
FROM OPENROWSET(
    BULK 'events/dt=2025-11-21/*.csv',
    DATA_SOURCE = 'BlobEvents',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    FIRSTROW = 2
) WITH (
    [event_id] VARCHAR(40),
    [ts_utc]   VARCHAR(30),
    [user_id]  VARCHAR(20),
    [payload]  NVARCHAR(MAX)
) AS src;
Parquet example (schema-on-read)
INSERT INTO dbo.events_raw (event_id, ts_utc, user_id, payload)
SELECT
event_id,
ts_utc,
user_id,
payload
FROM OPENROWSET(
BULK 'events/dt=2025-11-21/*.parquet',
DATA_SOURCE = 'BlobEvents',
FORMAT = 'PARQUET'
) AS src;
Tip: Use folder partitioning (`dt=YYYY-MM-DD/`) to isolate batches and to run loads concurrently per partition.
Idempotency that actually works
Create a tiny ingestion ledger that records which file (or partition) you’ve already applied.
CREATE TABLE dbo.ingestion_ledger (
batch_key VARCHAR(200) NOT NULL, -- e.g., 'dt=2025-11-21'
first_seen_utc DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
applied_utc DATETIME2(3) NULL,
checksum_sha256 CHAR(64) NULL,
CONSTRAINT PK_ingestion_ledger PRIMARY KEY (batch_key)
);
Pattern:
- Compute a `batch_key` (e.g., `dt=2025-11-21`).
- If not present, insert a pending row.
- Load with `INSERT…SELECT` from `OPENROWSET`, then `MERGE` into the final table using a natural/business key (or a composite of fields).
- Mark the ledger row as applied.
- On retries, your MERGE becomes a no-op, maintaining idempotency.
BEGIN TRY
SET XACT_ABORT ON;
BEGIN TRAN;
DECLARE @batch_key VARCHAR(200) = 'dt=2025-11-21';
IF NOT EXISTS (SELECT 1 FROM dbo.ingestion_ledger WITH (UPDLOCK, HOLDLOCK) WHERE batch_key = @batch_key)
INSERT dbo.ingestion_ledger(batch_key) VALUES (@batch_key);
;WITH src AS (
-- OPENROWSET's BULK path must be a string literal, not an expression like CONCAT(...),
-- so this demo hard-codes the path that matches @batch_key above. To parameterize it,
-- build the statement with dynamic SQL (see the dbo.load_partition sketch later).
SELECT * FROM OPENROWSET(
BULK 'events/dt=2025-11-21/*.parquet',
DATA_SOURCE = 'BlobEvents',
FORMAT = 'PARQUET'
) AS f
)
MERGE dbo.events_final AS T
USING (
SELECT DISTINCT
CAST(event_id AS BIGINT) AS event_id,
CAST(ts_utc AS DATETIME2) AS ts_utc,
CAST(user_id AS INT) AS user_id,
payload
FROM src
) AS S
ON T.event_id = S.event_id -- your natural key
WHEN MATCHED THEN
UPDATE SET
T.ts_utc = S.ts_utc,
T.user_id = S.user_id,
T.payload = S.payload
WHEN NOT MATCHED THEN
INSERT (event_id, ts_utc, user_id, payload)
VALUES (S.event_id, S.ts_utc, S.user_id, S.payload);
UPDATE dbo.ingestion_ledger
SET applied_utc = SYSUTCDATETIME()
WHERE batch_key = @batch_key;
COMMIT;
END TRY
BEGIN CATCH
IF XACT_STATE() <> 0 ROLLBACK;
THROW;
END CATCH;
Blunt take: If you aren’t tracking batch/file identity, your retries will double-load data sooner or later.
Batching & parallelism (without wrecking the log)
Goals: predictable transaction sizes, less lock pressure, steady log growth.
- Partitioned folders (e.g., day/hour) → one transaction per partition.
- Wildcards + small partitions beat “one huge wildcard”.
- Rowgroup sizing: If converting to clustered columnstore, load in 100k–1M row chunks to form healthy rowgroups.
- TABLOCK when inserting into a heap to help minimal logging (verify in your tier); a staging-insert sketch follows this list.
- MAXDOP is limited in Azure SQL; prefer multiple small sessions over one monster session.
- Service tier right-sizing: loads are I/O and log heavy; bump DTUs/vCores for the ingest window.
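As a minimal sketch of the TABLOCK point above (dbo.events_stage is a hypothetical staging heap; the post doesn't define one), a per-partition staging load might look like this:

-- Hypothetical staging heap; adjust columns/types to your real schema
CREATE TABLE dbo.events_stage (
    event_id BIGINT        NOT NULL,
    ts_utc   DATETIME2(3)  NOT NULL,
    user_id  INT           NULL,
    payload  NVARCHAR(MAX) NULL
);

-- TABLOCK on a heap is what makes minimal logging possible (subject to your
-- service tier and recovery settings; verify before relying on it)
INSERT INTO dbo.events_stage WITH (TABLOCK) (event_id, ts_utc, user_id, payload)
SELECT
    CAST(event_id AS BIGINT),
    CAST(ts_utc  AS DATETIME2(3)),
    CAST(user_id AS INT),
    payload
FROM OPENROWSET(
    BULK 'events/dt=2025-11-21/hh=00/*.parquet', -- one small partition per transaction
    DATA_SOURCE = 'BlobEvents',
    FORMAT = 'PARQUET'
) AS src;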
Example: loop by hour
DECLARE @d DATE = '2025-11-21';
DECLARE @h INT = 0;
WHILE @h < 24
BEGIN
DECLARE @batch_key VARCHAR(200) = CONCAT('dt=', CONVERT(CHAR(10), @d, 23), '/hh=', FORMAT(@h, '00'));
EXEC dbo.load_partition @batch_key; -- wrap the MERGE pattern above in a procedure
SET @h += 1;
END
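The loop calls dbo.load_partition without defining it, so here is a minimal sketch built on the objects used above (events_final, ingestion_ledger, BlobEvents); the #src temp table and the character guard are assumptions of this sketch. Dynamic SQL is used because OPENROWSET's BULK path must be a string literal:

CREATE OR ALTER PROCEDURE dbo.load_partition @batch_key VARCHAR(200)
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    -- Allow only the characters a partition key should contain (guards the dynamic SQL below)
    IF @batch_key LIKE '%[^a-zA-Z0-9=/_-]%'
        THROW 50001, 'Unexpected characters in @batch_key.', 1;

    -- Stage the partition into a temp table; the OPENROWSET statement is built
    -- dynamically per partition because the BULK path cannot be a variable.
    CREATE TABLE #src (
        event_id BIGINT        NOT NULL,
        ts_utc   DATETIME2(3)  NOT NULL,
        user_id  INT           NULL,
        payload  NVARCHAR(MAX) NULL
    );

    DECLARE @sql NVARCHAR(MAX) = N'
        INSERT INTO #src (event_id, ts_utc, user_id, payload)
        SELECT CAST(event_id AS BIGINT),
               CAST(ts_utc  AS DATETIME2(3)),
               CAST(user_id AS INT),
               payload
        FROM OPENROWSET(
            BULK ''events/' + @batch_key + N'/*.parquet'',
            DATA_SOURCE = ''BlobEvents'',
            FORMAT = ''PARQUET''
        ) AS f;';
    EXEC sys.sp_executesql @sql;

    -- Keep the transaction tight: only the ledger + MERGE run inside it
    BEGIN TRAN;

    IF NOT EXISTS (SELECT 1 FROM dbo.ingestion_ledger WITH (UPDLOCK, HOLDLOCK) WHERE batch_key = @batch_key)
        INSERT dbo.ingestion_ledger (batch_key) VALUES (@batch_key);

    MERGE dbo.events_final AS T
    USING (SELECT DISTINCT event_id, ts_utc, user_id, payload FROM #src) AS S
        ON T.event_id = S.event_id -- your natural key
    WHEN MATCHED THEN
        UPDATE SET T.ts_utc = S.ts_utc, T.user_id = S.user_id, T.payload = S.payload
    WHEN NOT MATCHED THEN
        INSERT (event_id, ts_utc, user_id, payload)
        VALUES (S.event_id, S.ts_utc, S.user_id, S.payload);

    UPDATE dbo.ingestion_ledger
    SET applied_utc = SYSUTCDATETIME()
    WHERE batch_key = @batch_key;

    COMMIT;
END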
Retries that won’t make things worse
Wrap your ingestion in a lightweight retry with exponential backoff and error code filters (e.g., throttling, transient storage/network hiccups). Keep the transaction boundaries tight.
CREATE OR ALTER PROCEDURE dbo.load_with_retry @batch_key VARCHAR(200)
AS
BEGIN
DECLARE @attempt INT = 1, @max_attempts INT = 5, @delay_ms INT = 500;
WHILE @attempt <= @max_attempts
BEGIN
BEGIN TRY
EXEC dbo.load_partition @batch_key; -- your idempotent loader (MERGE+ledger)
RETURN;
END TRY
BEGIN CATCH
-- Filter on transient errors you actually want to retry (demo keeps it simple)
IF @attempt = @max_attempts THROW;
-- WAITFOR DELAY accepts a constant or a variable, not an expression
DECLARE @delay VARCHAR(12) = CONVERT(VARCHAR(12), DATEADD(MILLISECOND, @delay_ms, CAST('00:00:00' AS TIME(3))));
WAITFOR DELAY @delay;
SET @delay_ms = @delay_ms * 2;
SET @attempt += 1;
END CATCH
END
END
Because the loader is idempotent, retries don’t duplicate rows.
Operational guardrails
- Schema-on-read validation: Use a `WITH (...)` schema in CSV to catch drift early (`TRY_CAST` + reject queue).
- Quarantine bad files: On conversion failures, write offending file names to a `rejects` table with error text (a sketch follows this list).
- Compression matters: Prefer Parquet. Less I/O, enforced schema, better end-to-end SLAs.
- File sizing: Target 128–512 MB per file for balanced parallelism. Thousands of 10 KB files will kill you with overhead.
- RBAC for MI: Grant the SQL server’s MI Storage Blob Data Reader on the container (least privilege).
- Observability: Track load duration, rows read, rows inserted/updated, and size per batch in the ledger (add columns).
- Backfills: Throttle concurrency during business hours; crank it off-peak. Maintain a backfill queue (table) to coordinate.
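The rejects table and the extra ledger columns mentioned above aren't defined elsewhere in this post, so here is a minimal sketch; the table and column names are assumptions you'd adapt to your schema:

-- Hypothetical quarantine target for files that fail conversion
CREATE TABLE dbo.ingestion_rejects (
    reject_id     BIGINT IDENTITY(1,1) PRIMARY KEY,
    batch_key     VARCHAR(200)   NOT NULL,
    file_name     NVARCHAR(400)  NULL,
    error_message NVARCHAR(4000) NULL,
    rejected_utc  DATETIME2(3)   NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Hypothetical observability columns on the ledger: duration, row counts, bytes per batch
ALTER TABLE dbo.ingestion_ledger ADD
    load_duration_ms INT    NULL,
    rows_read        BIGINT NULL,
    rows_inserted    BIGINT NULL,
    rows_updated     BIGINT NULL,
    bytes_read       BIGINT NULL;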
Common pitfalls (and how to dodge them)
- Leaking SAS tokens in scripts or views → Prefer MI; if you must use SAS, keep them short-lived via stored access policy.
- One mega-transaction → Split by partition; you’ll avoid lock timeouts and log growth spikes.
- Blind INSERT → Always MERGE or `INSERT … WHERE NOT EXISTS` keyed by a business id to ensure idempotency (a sketch follows this list).
- Schema drift surprises → Use explicit columns + `TRY_CAST`. Write rejects; don’t silently coerce.
- Tiny files everywhere → Consolidate upstream (Spark/ADF) to sane file sizes.
- No isolation of staging vs final → Land into staging, then MERGE to final to keep queries stable.
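A minimal sketch of the `INSERT … WHERE NOT EXISTS` alternative to MERGE; it assumes the staged rows sit in a #src temp table (as in the load_partition sketch earlier) and that event_id is the business key:

-- Insert-only idempotency: add rows whose business key is not yet in the final table
INSERT INTO dbo.events_final (event_id, ts_utc, user_id, payload)
SELECT s.event_id, s.ts_utc, s.user_id, s.payload
FROM (SELECT DISTINCT event_id, ts_utc, user_id, payload FROM #src) AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.events_final AS t
    WHERE t.event_id = s.event_id -- business key
);

Unlike MERGE, this never updates existing rows, which is often exactly what you want for append-only event data.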
Conclusion & Takeaways
- OPENROWSET (BULK) + Managed Identity gives you a secure, serverless-ish ingest path from Blob to Azure SQL.
- Partition your data and batch by folder for controllable parallelism and stable operations.
- Idempotent MERGE + ledger makes retries safe and boring (that’s good).
- Prefer Parquet, right-size files, and monitor with a few extra columns in your ledger.
Call to action: Start with one partitioned day in a lower environment, wire up the ledger and retry wrapper, then scale by adding concurrent partitions. If you hit a ceiling, I’ll help you profile log I/O and reshape batches.
Internal link ideas (add these on your site)
- “Managed Identity for Azure SQL & Blob: End-to-End Setup”
- “Idempotent Ingestion Patterns with MERGE: Ledger Schema Cheatsheet”
- “Parquet vs CSV in Azure SQL OPENROWSET: Performance Notes & Gotchas”
- “Blob Partitioning Strategies: dt/hour device shards”
- “When to outgrow Azure SQL ingestion: Synapse & Fabric choices”
Image prompt (for DALL·E / Midjourney)
“A clean, modern data architecture diagram showing Azure Blob Storage partitions feeding Azure SQL via OPENROWSET (BULK). Highlight Managed Identity auth, partitioned folders (dt/hour), parallel loaders per partition, a staging table, a MERGE into final, and an ingestion ledger. Minimalistic, high contrast, 3D isometric style.”
Tags
#AzureSQL #OPENROWSET #AzureBlob #ManagedIdentity #DataIngestion #Parquet #Idempotency #DataEngineering #BatchProcessing #SQL