Lakehouse at Lightspeed

Lakehouse at Lightspeed: Querying Iceberg & Hudi with VeloDB (Powered by Apache Doris)

Meta description (156 chars):
Learn how VeloDB supercharges Iceberg and Hudi in your lakehouse. Set up catalogs, run cross-catalog joins, and tune caches for interactive analytics.


Why this matters

You already have data sitting in Iceberg or Hudi. Analysts want BI-speed queries, engineers want zero copy, and security wants private networking. VeloDB (a managed, enterprise distribution of Apache Doris) lets you query those lake tables in place, at interactive (often sub-second) latency, without rebuilding pipelines. Think: “warehouse-like speed over open tables.” (VeloDB)


What VeloDB actually is (and why it’s fast)

VeloDB is a real-time OLAP database built on Apache Doris. Doris added a mature “lakehouse” path with External Catalogs, vectorized execution, and optimizations for Iceberg/Hudi—including time travel, delete file handling, incremental read, cross-catalog joins, and write-back to Iceberg. VeloDB ships those capabilities with cloud management and BYOC networking. (Apache Doris)

Key ingredients you’ll use:

  • Multi-Catalog to attach Iceberg/Hudi once and auto-sync metadata. (Apache Doris)
  • Iceberg engine with V1/V2 support, position/equality deletes, system tables, branches/tags. (Apache Doris)
  • Hudi COW/MOR snapshot & incremental queries + time travel. (Apache Doris)
  • Data Cache (local NVMe) and Metadata Cache to avoid re-listing remote files and to stabilize SLAs. (Apache Doris)
  • Private connectivity for BYOC with AWS PrivateLink-style endpoints. (VeloDB Docs)

Architecture at a glance

Analogy: Treat Iceberg/Hudi as your “open tables,” S3/HDFS as the pantry, and VeloDB as the sous-chef that brings hot data to the line and keeps the menu (metadata) cached for seconds-fast service.

Data path:

  1. Client (JDBC/MySQL protocol) → VeloDB.
  2. VeloDB uses the catalog to resolve table metadata (REST/HMS/Glue). (Apache Doris)
  3. Files read from object storage (S3/GCS/HDFS). Hot blocks stay in Data Cache on BE nodes. (Apache Doris)
  4. Vectorized execution, partition & file pruning, delete-file handling. (Apache Doris)

Quick start: wire Iceberg & Hudi into VeloDB

1) Create an Iceberg catalog (REST / Unity Catalog, Glue, HMS, Hadoop)

REST (Unity Catalog) example

CREATE CATALOG dbx_unity_catalog PROPERTIES (
  "type" = "iceberg",
  "iceberg.catalog.type" = "rest",
  "uri" = "https://<dbc>.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest/",
  "iceberg.rest.security.type" = "oauth2",
  "iceberg.rest.oauth2.token" = "<token>",
  "iceberg.rest.vended-credentials-enabled" = "true",
  "warehouse" = "my-unity-catalog"
);

Then:

USE dbx_unity_catalog.`default`;
SELECT * FROM iceberg_table;

The REST catalog also supports OAuth2 client credentials, and static S3 credentials for setups that don’t use vended credentials. (Apache Doris)

Hadoop / HMS / Glue options are supported as well; pick the one that matches your metastore. (VeloDB Docs)
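
As a minimal sketch of the HMS option, an Iceberg catalog backed by a Hive Metastore might look like the following. The catalog name iceberg_hms and the metastore address are placeholders, not values from a real deployment; if your files live on object storage, add S3 properties as in the other examples.

```sql
-- Hypothetical catalog name and metastore URI; replace with your own.
CREATE CATALOG iceberg_hms PROPERTIES (
  "type" = "iceberg",
  "iceberg.catalog.type" = "hms",
  "hive.metastore.uris" = "thrift://hive-metastore:9083"
);

-- Confirm the catalog is attached, then browse it.
SHOW CATALOGS;
SWITCH iceberg_hms;
SHOW DATABASES;
```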

2) Create a Hudi catalog (via Hive Metastore)

CREATE CATALOG hudi PROPERTIES (
  "type" = "hms",
  "hive.metastore.uris" = "thrift://hive-metastore:9083",
  "s3.endpoint" = "http://minio:9000",
  "s3.access_key" = "minio",
  "s3.secret_key" = "minio123",
  "s3.region" = "us-east-1",
  "use_path_style" = "true"
);
REFRESH CATALOG hudi;

Query both COW and MOR tables and use time travel or incremental read when needed. (Apache Doris)
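
A rough sketch of what time travel and incremental read can look like, following the syntax documented for Apache Doris (verify it on your version). The table name reuses hudi.retail.fact_orders from the join example in this article, and the instant times are made-up placeholders.

```sql
-- Time travel: read the table as of a past commit instant.
-- "20240101091523" is a placeholder instant (yyyyMMddHHmmss).
SELECT order_id, amount
FROM hudi.retail.fact_orders FOR TIME AS OF "20240101091523";

-- Incremental read: only changes committed after beginTime.
-- An 'endTime' key can be added to bound the window.
SELECT order_id, amount
FROM hudi.retail.fact_orders@incr('beginTime' = '20240101000000');
```

Incremental read suits dashboards that poll for new rows; time travel suits reproducing a report as of a known commit.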

3) Cross-catalog joins (ZeroETL)

After attaching catalogs, you can join across them—e.g., Hudi fact to Iceberg dim:

SELECT f.order_id, d.category, f.amount
FROM hudi.retail.fact_orders f
JOIN iceberg.dim.dim_product d
  ON f.product_id = d.product_id
WHERE f.order_ts >= NOW() - INTERVAL 1 DAY;

Cross-catalog queries are a first-class Doris feature. (Apache Doris)
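
Names resolve as catalog.database.table, so you can set a default catalog and database and fully qualify only the tables that live elsewhere. A small sketch, assuming the hudi and iceberg catalogs from this article are attached and contain the tables used in the join above:

```sql
-- List attached catalogs; "internal" is Doris's built-in catalog.
SHOW CATALOGS;

-- Set a default catalog and database for this session.
SWITCH hudi;
USE retail;

-- fact_orders resolves to hudi.retail.fact_orders;
-- the dimension table is fully qualified because it is in another catalog.
SELECT COUNT(*)
FROM fact_orders f
JOIN iceberg.dim.dim_product d
  ON f.product_id = d.product_id;
```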


Real example: Create and write Iceberg from VeloDB

Create an Iceberg catalog and table; then insert and query:

-- Catalog (REST example for demo)
CREATE CATALOG iceberg PROPERTIES (
  "type" = "iceberg",
  "iceberg.catalog.type" = "rest",
  "warehouse" = "s3://warehouse/",
  "uri" = "http://rest:8181",
  "s3.access_key" = "admin",
  "s3.secret_key" = "password",
  "s3.endpoint" = "http://minio:9000"
);

-- Switch and create table with partition transforms
SWITCH iceberg;
CREATE DATABASE nyc;
CREATE TABLE nyc.taxis (
  vendor_id BIGINT,
  trip_id BIGINT,
  trip_distance FLOAT,
  fare_amount DOUBLE,
  store_and_fwd_flag STRING,
  ts DATETIME
)
PARTITION BY LIST (vendor_id, DAY(ts)) ()
PROPERTIES ("compression-codec"="zstd","write-format"="parquet");

INSERT INTO nyc.taxis VALUES
(1,1000371,1.8,15.32,'N','2024-01-01 09:15:23');

-- Query current snapshot
SELECT * FROM nyc.taxis;

These patterns—CREATE CATALOG, SWITCH, partition transforms, INSERT, and snapshot/time-travel queries—are built-in. (Apache Doris)
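
The example above stops at the current snapshot. A time-travel query might look like the sketch below, which uses the syntax documented for Apache Doris. The timestamp and snapshot id are placeholders; take real values from the snapshot listing.

```sql
-- List snapshots (ids and commit times) for the table created above.
SELECT * FROM iceberg_meta(
  "table" = "iceberg.nyc.taxis",
  "query_type" = "snapshots"
);

-- Read the table as of a point in time. Placeholder timestamp:
-- it must not be earlier than the table's first snapshot.
SELECT * FROM iceberg.nyc.taxis FOR TIME AS OF "2024-01-01 10:00:00";

-- Read one specific snapshot. 1234567890123456789 is a made-up id;
-- substitute a snapshot_id returned by the first query.
SELECT * FROM iceberg.nyc.taxis FOR VERSION AS OF 1234567890123456789;
```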


Performance tuning for “interactive” lake queries

1) Turn on Data Cache (hot blocks on NVMe)

  • Configure file_cache_path on each BE with enough local disk for your hot working set.
  • Enable it at session or global level: SET enable_file_cache = true;
  • It pays off most for hot time windows (today / last N days). Monitor hit rates in the query profile and the file_cache_statistics system table. (Apache Doris)
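
A minimal sketch of switching the cache on. The BE settings are shown as comments because they belong in be.conf rather than SQL, and the path and size are placeholders. On a managed VeloDB deployment the service may handle the BE side for you (an assumption to check against your plan).

```sql
-- BE side (be.conf), placeholder path and size (50 GiB):
--   enable_file_cache = true
--   file_cache_path = [{"path": "/mnt/nvme/file_cache", "total_size": 53687091200}]

-- Session level: applies to the current connection only.
SET enable_file_cache = true;

-- Global level: applies to sessions opened afterwards.
SET GLOBAL enable_file_cache = true;

-- Verify the setting.
SHOW VARIABLES LIKE 'enable_file_cache';

-- Cache metrics, if your version exposes this system table.
SELECT * FROM information_schema.file_cache_statistics;
```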

2) Right-size Metadata Cache

  • Controls database/table lists, schemas, partitions, and file lists.
  • Tune eviction/refresh to balance freshness vs. latency (external_cache_* params). (Apache Doris)
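
When a table changes outside VeloDB and you cannot wait for the cache to expire, a manual refresh is the blunt but reliable tool. A short sketch using the catalog and table names from this article:

```sql
-- Invalidate cached metadata for the whole catalog (heaviest option).
REFRESH CATALOG hudi;

-- Narrower refreshes touch less metadata and put less load on the metastore.
REFRESH DATABASE hudi.retail;
REFRESH TABLE hudi.retail.fact_orders;
```

Prefer the narrowest scope that covers the change, since a catalog-wide refresh forces metadata to be re-fetched for every table you query next.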

3) Prune early, prune often

  • Use partition transforms (day(ts), bucket, truncate) to minimize scanned files.
  • Doris understands Iceberg partition transforms and applies delete files at read time, so selective filters on partitioned columns directly cut the number of files scanned. (VeloDB Docs)
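
One way to check that a filter actually prunes is to inspect the plan. This sketch assumes the nyc.taxis table from the earlier example, partitioned by vendor_id and DAY(ts). The exact plan output varies by version, so compare the scan node's file and partition details with and without the WHERE clause.

```sql
-- Half-open range on ts (source column of the DAY(ts) transform)
-- plus an equality filter on the identity partition column vendor_id.
EXPLAIN
SELECT vendor_id, SUM(fare_amount) AS total_fare
FROM iceberg.nyc.taxis
WHERE ts >= '2024-01-01 00:00:00'
  AND ts <  '2024-01-02 00:00:00'
  AND vendor_id = 1
GROUP BY vendor_id;
```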

4) Elastic compute

  • Compute scales independently of storage: attach catalogs once, then add elastic compute nodes (ECN) for peak hours. (Apache Doris: Lakehouse overview and ECN docs)

5) Secure & stable connectivity

  • In BYOC, connect to the VeloDB Cloud control plane through private endpoints so traffic stays off the public Internet. (VeloDB Docs)

When to pick Iceberg vs. Hudi (for query serving)

Scenario (common in BI/ops)                                            | Prefer Iceberg | Prefer Hudi
Large batch analytics, schema evolution, multi-engine sharing          | Yes            | -
Real-time upserts with incremental pull for near-real-time dashboards  | -              | Yes
Time travel over stable snapshots                                      | Yes            | -
Merge-On-Read (row-level freshness with compacting later)              | -              | Yes

Rationale: both work well through VeloDB. Iceberg brings rich snapshots, branching, and system tables, while Hudi shines for incremental ingestion with its MOR/COW table types. VeloDB's query engine reads both natively. (VeloDB Docs)


Best practices (and pitfalls to dodge)

Do this

  • Start with catalog-linked access; don’t copy data unless you must. (Apache Doris)
  • Partition on time + a low-cardinality dimension; add bucket for high-cardinality joins. (VeloDB Docs)
  • Enable Data Cache on NVMe for the last N days; watch cache hit rate. (Apache Doris)
  • Set sane metadata refresh windows to keep schemas/partitions fresh without hammering HMS/REST. (Apache Doris)
  • Use cross-catalog joins to enrich lake facts with JDBC/operational dims without ETL. (Apache Doris)
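
To make the last point concrete, here is a sketch of attaching an operational MySQL database as a JDBC catalog and joining it to a lake fact table. Every connection value, the mysql_ops catalog name, the crm.customers table, and the customer_id and customer_name columns are hypothetical. The driver JAR must also be available to your cluster.

```sql
-- Placeholder connection values; replace with your own.
CREATE CATALOG mysql_ops PROPERTIES (
  "type" = "jdbc",
  "user" = "reader",
  "password" = "<password>",
  "jdbc_url" = "jdbc:mysql://mysql-host:3306",
  "driver_url" = "mysql-connector-j-8.3.0.jar",
  "driver_class" = "com.mysql.cj.jdbc.Driver"
);

-- Enrich lake facts with a live operational dimension, no ETL step.
SELECT f.order_id, c.customer_name, f.amount
FROM hudi.retail.fact_orders f
JOIN mysql_ops.crm.customers c
  ON f.customer_id = c.customer_id
WHERE f.order_ts >= NOW() - INTERVAL 1 DAY;
```

Keep the operational side small and filtered, since each query reads it from the source database at run time.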

Avoid this

  • Tiny Parquet files (death by file-listing). Compact in your lake tool; Doris will still cache metadata, but physics wins. (Apache Doris)
  • Over-wide MOR queries with no filters; add predicates and only read needed columns. (Apache Doris)
  • Using public networking for BYOC; configure a private endpoint. (VeloDB Docs)

Summary & call to action

If your data already lives in Iceberg or Hudi, VeloDB lets you query it at warehouse speed—without duplicating tables. Attach catalogs, flip on data/metadata caches, and start serving BI and real-time analytics straight from your lake.

Next step:

  • Attach your first catalog (REST/HMS/Glue).
  • Enable Data Cache.
  • Run a cross-catalog join and measure p95 latency.

Internal link ideas (for your site)

  • “Designing Iceberg partition strategies for BI and ML”
  • “Hudi incremental ingestion patterns with Flink → VeloDB”
  • “Tuning Doris Data/Metadata Cache for S3”
  • “Unity Catalog + VeloDB: secure patterns with vended credentials”
  • “From parquet sprawl to performance: compaction strategies”

Image prompt

“A clean, modern lakehouse diagram showing VeloDB (Apache Doris) querying Iceberg and Hudi tables on S3 via REST/HMS, with data & metadata caches highlighted — minimalistic, high contrast, 3D isometric style.”


Tags

#VeloDB #ApacheDoris #Iceberg #Hudi #Lakehouse #DataEngineering #RealTimeAnalytics #Catalog #S3


References (official)

  • VeloDB site & docs: product overview; Iceberg and Hudi best practices; catalogs and versions; BYOC private endpoints. (VeloDB)
  • Apache Doris docs: Iceberg & Hudi integration, Multi-Catalog/Catalog Overview, Data/Metadata Cache, Unity Catalog REST integration, and example SQL. (Apache Doris)
