Lakehouse at Lightspeed: Querying Iceberg & Hudi with VeloDB (Powered by Apache Doris)
Meta description (150 chars):
Learn how VeloDB supercharges Iceberg and Hudi in your lakehouse. Set up catalogs, run cross-catalog joins, and tune caches for interactive analytics.
Why this matters
You already have data sitting in Iceberg or Hudi. Analysts want BI-speed queries, engineers want zero copy, and security wants private networking. VeloDB (a managed, enterprise distro of Apache Doris) lets you query those lake tables directly—with sub-second interactivity—without rebuilding pipelines. Think: “warehouse-like speed over open tables.” (VeloDB)
What VeloDB actually is (and why it’s fast)
VeloDB is a real-time OLAP database built on Apache Doris. Doris added a mature “lakehouse” path with External Catalogs, vectorized execution, and optimizations for Iceberg/Hudi, including time travel, delete file handling, incremental read, cross-catalog joins, and write-back to Iceberg. VeloDB ships those capabilities with cloud management and bring-your-own-cloud (BYOC) networking. (Apache Doris)
Key ingredients you’ll use:
- Multi-Catalog to attach Iceberg/Hudi once and auto-sync metadata. (Apache Doris)
- Iceberg engine with V1/V2 support, position/equality deletes, system tables, branches/tags. (Apache Doris)
- Hudi COW/MOR snapshot & incremental queries + time travel. (Apache Doris)
- Data Cache (local NVMe) and Metadata Cache to avoid re-listing remote files and to stabilize SLAs. (Apache Doris)
- Private connectivity for BYOC with AWS PrivateLink-style endpoints. (VeloDB Docs)
Architecture at a glance
Analogy: Treat Iceberg/Hudi as your “open tables,” S3/HDFS as the pantry, and VeloDB as the sous-chef that brings hot data to the line and keeps the menu (metadata) cached for seconds-fast service.
Data path:
- Client (JDBC/MySQL protocol) → VeloDB.
- VeloDB uses the catalog to resolve table metadata (REST/HMS/Glue). (Apache Doris)
- Files read from object storage (S3/GCS/HDFS). Hot blocks stay in Data Cache on BE nodes. (Apache Doris)
- Vectorized execution, partition & file pruning, delete-file handling. (Apache Doris)
Quick start: wire Iceberg & Hudi into VeloDB
1) Create an Iceberg catalog (REST / Unity Catalog, Glue, HMS, Hadoop)
REST (Unity Catalog) example
CREATE CATALOG dbx_unity_catalog PROPERTIES (
"type" = "iceberg",
"iceberg.catalog.type" = "rest",
"uri" = "https://<dbc>.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest/",
"iceberg.rest.security.type" = "oauth2",
"iceberg.rest.oauth2.token" = "<token>",
"iceberg.rest.vended-credentials-enabled" = "true",
"warehouse" = "my-unity-catalog"
);
Then:
USE dbx_unity_catalog.`default`;
SELECT * FROM iceberg_table;
(Also supports OAuth2 client credentials and static S3 credentials when you don’t use vended creds.) (Apache Doris)
Hadoop / HMS / Glue options are supported as well; pick the one that matches your metastore. (VeloDB Docs)
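If your Iceberg tables are registered in a Hive Metastore rather than a REST catalog, the definition might look like the sketch below. The catalog name iceberg_hms and the thrift URI are placeholders, not values from this article, and depending on your storage you will likely also need the s3.* properties used in the Hudi example.

```sql
-- Sketch: Iceberg catalog backed by a Hive Metastore.
-- "iceberg_hms" and the thrift URI are hypothetical placeholders.
CREATE CATALOG iceberg_hms PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "hms",
    "hive.metastore.uris" = "thrift://hive-metastore:9083"
);

-- Confirm the catalog is registered, then browse it.
SHOW CATALOGS;
SWITCH iceberg_hms;
SHOW DATABASES;
```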
2) Create a Hudi catalog (via Hive Metastore)
CREATE CATALOG hudi PROPERTIES (
"type" = "hms",
"hive.metastore.uris" = "thrift://hive-metastore:9083",
"s3.endpoint" = "http://minio:9000",
"s3.access_key" = "minio",
"s3.secret_key" = "minio123",
"s3.region" = "us-east-1",
"use_path_style" = "true"
);
REFRESH CATALOG hudi;
Query both COW and MOR tables and use time travel or incremental read when needed. (Apache Doris)
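A minimal sketch of both read modes, assuming a Hudi table named hudi.retail.fact_orders (the same hypothetical table used in the join example) with order_id and amount columns. The timestamps are placeholders; check the Doris Hudi docs for the exact options your version accepts.

```sql
-- Time travel: read the table as of an earlier point in time.
-- The timestamp is a placeholder; it must fall on or after an existing commit.
SELECT order_id, amount
FROM hudi.retail.fact_orders FOR TIME AS OF '2024-01-01 09:00:00';

-- Incremental read: return only records committed after beginTime.
-- beginTime uses the Hudi instant format (yyyyMMddHHmmss); the value is a placeholder.
SELECT order_id, amount
FROM hudi.retail.fact_orders@incr('beginTime' = '20240101090000');
```

Incremental reads pair naturally with near-real-time dashboards: store the last instant you processed and pass it as the next beginTime.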
3) Cross-catalog joins (ZeroETL)
After attaching catalogs, you can join across them—e.g., Hudi fact to Iceberg dim:
SELECT f.order_id, d.category, f.amount
FROM hudi.retail.fact_orders f
JOIN iceberg.dim.dim_product d
ON f.product_id = d.product_id
WHERE f.order_ts >= NOW() - INTERVAL 1 DAY;
Cross-catalog queries are a first-class Doris feature. (Apache Doris)
Real example: Create and write Iceberg from VeloDB
Create an Iceberg catalog and table; then insert and query:
-- Catalog (REST example for demo)
CREATE CATALOG iceberg PROPERTIES (
"type" = "iceberg",
"iceberg.catalog.type" = "rest",
"warehouse" = "s3://warehouse/",
"uri" = "http://rest:8181",
"s3.access_key" = "admin",
"s3.secret_key" = "password",
"s3.endpoint" = "http://minio:9000"
);
-- Switch and create table with partition transforms
SWITCH iceberg;
CREATE DATABASE nyc;
CREATE TABLE nyc.taxis (
vendor_id BIGINT,
trip_id BIGINT,
trip_distance FLOAT,
fare_amount DOUBLE,
store_and_fwd_flag STRING,
ts DATETIME
)
PARTITION BY LIST (vendor_id, DAY(ts)) ()
PROPERTIES ("compression-codec"="zstd","write-format"="parquet");
INSERT INTO nyc.taxis VALUES
(1,1000371,1.8,15.32,'N','2024-01-01 09:15:23');
-- Query current snapshot
SELECT * FROM nyc.taxis;
These patterns—CREATE CATALOG, SWITCH, partition transforms, INSERT, and snapshot/time-travel queries—are built-in. (Apache Doris)
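The walkthrough above stops at the current snapshot, so here is a hedged sketch of a time-travel read against the same nyc.taxis table. The timestamp and snapshot ID are placeholders, and the iceberg_meta table function is one way to list snapshots; newer Doris releases may also expose system tables for this, so check the docs for your version.

```sql
-- List the table's snapshots to find a snapshot ID or commit time.
SELECT *
FROM iceberg_meta(
    "table" = "iceberg.nyc.taxis",
    "query_type" = "snapshots"
);

-- Read the table as of a point in time (placeholder timestamp).
SELECT * FROM iceberg.nyc.taxis FOR TIME AS OF '2024-01-02 00:00:00';

-- Read a specific snapshot (placeholder ID; use one returned by the query above).
SELECT * FROM iceberg.nyc.taxis FOR VERSION AS OF 1234567890123456789;
```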
Performance tuning for “interactive” lake queries
1) Turn on Data Cache (hot blocks on NVMe)
- Configure the BE file_cache_path with enough space.
- Enable at session/global level:
SET enable_file_cache = true;
Use it for hot time windows (today/last N days). Monitor hit rates in the FE Profile and the file_cache_statistics system table. (Apache Doris)
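As a rough sketch, enabling the cache and checking on it from SQL could look like this. The assumption here is that file_cache_statistics lives under information_schema; its exact location and columns vary by Doris version, so treat the last query as illustrative.

```sql
-- Enable the data cache for the current session only.
SET enable_file_cache = true;

-- Or make it the default for all new sessions.
SET GLOBAL enable_file_cache = true;

-- Inspect cache metrics per BE node.
-- Assumption: the table is exposed under information_schema in your version.
SELECT * FROM information_schema.file_cache_statistics;
```

Run a hot query twice and compare the profile: the second run should read mostly from local cache instead of remote storage.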
2) Right-size Metadata Cache
- Controls database/table lists, schemas, partitions, and file lists.
- Tune eviction/refresh to balance freshness vs. latency (external_cache_* params). (Apache Doris)
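When a cached entry is stale, you can refresh it by hand instead of waiting for expiry. The sketch below uses the hudi catalog from the quick start and the hypothetical fact_orders table. The metadata_refresh_interval_sec property is an assumption about your Doris version; confirm the property name in the Metadata Cache docs before relying on it.

```sql
-- Drop and reload all cached metadata for one catalog.
REFRESH CATALOG hudi;

-- Narrower: refresh a single table after an upstream schema or partition change.
REFRESH TABLE hudi.retail.fact_orders;

-- Assumption: periodic background refresh via a catalog property (value in seconds).
ALTER CATALOG hudi SET PROPERTIES ("metadata_refresh_interval_sec" = "600");
```

Refreshing a single table is cheaper than refreshing the whole catalog, which matters when the metastore is shared with other engines.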
3) Prune early, prune often
- Use partition transforms (day(ts), bucket, truncate) to minimize scanned files.
- Doris maps Iceberg transforms and supports delete-file semantics, so filters matter. (VeloDB Docs)
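To see pruning in practice, filter on the partition columns and project only what you need. This sketch reuses the nyc.taxis table created earlier, which is partitioned by vendor_id and DAY(ts); the date range is a placeholder.

```sql
-- EXPLAIN shows the plan without running the query, so you can check
-- that the scan is restricted to the matching partitions and files.
EXPLAIN
SELECT vendor_id, SUM(fare_amount) AS total_fare
FROM iceberg.nyc.taxis
WHERE ts >= '2024-01-01 00:00:00'
  AND ts <  '2024-01-02 00:00:00'   -- matches the DAY(ts) partition transform
  AND vendor_id = 1                 -- matches the identity partition on vendor_id
GROUP BY vendor_id;
```

A half-open range on ts (>= start, < end) lines up cleanly with day boundaries and avoids wrapping the column in a function, which can defeat pruning.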
4) Elastic compute
- Compute scales independently of storage; attach catalogs once and add compute nodes for peak hours. (Lakehouse overview + Elastic Compute Node docs). (Apache Doris)
5) Secure & stable connectivity
- In BYOC, set up private endpoints to the VeloDB Cloud control plane so traffic stays off the public Internet. (VeloDB Docs)
When to pick Iceberg vs. Hudi (for query serving)
| Scenario (common in BI/ops) | Prefer Iceberg | Prefer Hudi |
|---|---|---|
| Large batch analytics, schema evolution, multi-engine sharing | ✅ | |
| Real-time upserts with incremental pull for near-real-time dashboards | | ✅ |
| Time travel over stable snapshots | ✅ | ✅ |
| Merge-On-Read (row-level freshness with compaction later) | | ✅ |
Rationale: both work great through VeloDB; Iceberg brings rich snapshot/branching/system tables, while Hudi shines for incremental ingestion and MOR/COW table types. Query engines in VeloDB support both natively. (VeloDB Docs)
Best practices (and pitfalls to dodge)
Do this
- Start with catalog-linked access; don’t copy data unless you must. (Apache Doris)
- Partition on time + a low-cardinality dimension; add bucket for high-cardinality joins. (VeloDB Docs)
- Enable Data Cache on NVMe for the last N days; watch cache hit rate. (Apache Doris)
- Set sane metadata refresh windows to keep schemas/partitions fresh without hammering HMS/REST. (Apache Doris)
- Use cross-catalog joins to enrich lake facts with JDBC/operational dims without ETL. (Apache Doris)
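The last point can be sketched as follows, assuming a MySQL operational database. Everything here is hypothetical: the catalog name mysql_ops, the crm.customers table, its segment column, the customer_id join key, the host, and the driver JAR name. Only the shape of the statements is the point.

```sql
-- Attach an operational MySQL database as a JDBC catalog (placeholder values).
CREATE CATALOG mysql_ops PROPERTIES (
    "type" = "jdbc",
    "user" = "<user>",
    "password" = "<password>",
    "jdbc_url" = "jdbc:mysql://mysql-host:3306",
    "driver_url" = "mysql-connector-j-8.3.0.jar",
    "driver_class" = "com.mysql.cj.jdbc.Driver"
);

-- Enrich lake facts with an operational dimension, no ETL copy.
SELECT f.order_id, c.segment, f.amount
FROM hudi.retail.fact_orders f
JOIN mysql_ops.crm.customers c
  ON f.customer_id = c.customer_id
WHERE f.order_ts >= NOW() - INTERVAL 1 DAY;
```

Keep the JDBC side small and filtered; the join is only as fast as the operational database can serve its rows.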
Avoid this
- Tiny Parquet files (death by file-listing). Compact in your lake tool; Doris will still cache metadata, but physics wins. (Apache Doris)
- Over-wide MOR queries with no filters; add predicates and only read needed columns. (Apache Doris)
- Using public networking for BYOC; configure a private endpoint. (VeloDB Docs)
Summary & call to action
If your data already lives in Iceberg or Hudi, VeloDB lets you query it at warehouse speed—without duplicating tables. Attach catalogs, flip on data/metadata caches, and start serving BI and real-time analytics straight from your lake.
Next step:
- Attach your first catalog (REST/HMS/Glue).
- Enable Data Cache.
- Run a cross-catalog join and measure p95 latency.
Internal link ideas (for your site)
- “Designing Iceberg partition strategies for BI and ML”
- “Hudi incremental ingestion patterns with Flink → VeloDB”
- “Tuning Doris Data/Metadata Cache for S3”
- “Unity Catalog + VeloDB: secure patterns with vended credentials”
- “From parquet sprawl to performance: compaction strategies”
Image prompt
“A clean, modern lakehouse diagram showing VeloDB (Apache Doris) querying Iceberg and Hudi tables on S3 via REST/HMS, with data & metadata caches highlighted — minimalistic, high contrast, 3D isometric style.”
Tags
#VeloDB #ApacheDoris #Iceberg #Hudi #Lakehouse #DataEngineering #RealTimeAnalytics #Catalog #S3
References (official)
- VeloDB site & docs: product overview; Iceberg and Hudi best practices; catalogs and versions; BYOC private endpoints. (VeloDB)
- Apache Doris docs: Iceberg & Hudi integration, Multi-Catalog/Catalog Overview, Data/Metadata Cache, Unity Catalog REST integration, and example SQL. (Apache Doris)









