Arrow Flight SQL from Python with VeloDB: How to Get “10×-Style” Read Speedups

Meta description (158 chars):
Supercharge Python reads from VeloDB (Apache Doris) with Arrow Flight SQL + ADBC. Setup, code, tuning (timeouts, gRPC size), and pitfalls for real-time analytics.

Why this matters

Ever waited minutes for pandas.read_sql() to pull a wide result? With VeloDB (built on Apache Doris) you can switch the transport to Arrow Flight SQL and stream columnar batches directly into Python—eliminating row-by-row serialization overhead. In Doris/VeloDB, Arrow Flight SQL is an officially supported high-speed path and, in some scenarios, the docs report “up to 100×” better transfer performance than MySQL/JDBC/ODBC-based fetches. (docs.velodb.io)

What Arrow Flight SQL is (in plain terms)

Arrow Flight SQL is a standard RPC protocol (gRPC) for SQL engines to deliver data in Apache Arrow’s columnar format. Clients use ADBC Flight SQL drivers (DB-API) to run queries and fetch Arrow tables with near-zero copy. Less CPU wasted transforming rows to columns → more bandwidth for your actual analytics. (Apache Arrow)

How this fits VeloDB / Doris architecture

Frontends (FE): plan/coordinate queries, expose Flight SQL endpoint.
Backends (BE): execute queries, store columnar data, and often serve the result stream to clients. That’s why BE network access matters for Flight SQL. (doris.apache.org)

Enabling/connecting (self-hosted / BYOC)

Doris/VeloDB expose Flight SQL via gRPC ports. Server-side you typically set:

fe.conf: arrow_flight_sql_port (e.g., 8070)
be.conf: arrow_flight_sql_port (e.g., 8050)

Then restart; logs confirm startup. (doris.apache.org)

Networking gotcha: Flight SQL clients often pull results from BE nodes. If BEs aren’t reachable, configure public_host/arrow_flight_sql_proxy_port or put a reverse proxy (e.g., Nginx) in front of BEs. (doris.apache.org)

BYOC note (VeloDB Cloud): the VeloDB agent runs in your VPC and uses PrivateLink/private endpoints for control. Ensure Flight SQL ports/paths you use are reachable per the BYOC guides. (docs.velodb.io)

Quick start from Python (ADBC Flight SQL)

1) Install drivers

pip install adbc-driver-manager adbc-driver-flightsql pyarrow pandas polars

(Driver packages are published on PyPI.) (docs.velodb.io)

2) Connect and fetch Arrow directly

import adbc_driver_flightsql.dbapi as flightsql
import adbc_driver_manager
import pyarrow as pa
import pandas as pd
import polars as pl

# VeloDB / Doris FE Flight endpoint
URI = "grpc://fe.example.company:8070"

conn = flightsql.connect(
    URI,
    db_kwargs={
        adbc_driver_manager.DatabaseOptions.USERNAME.value: "user",
        adbc_driver_manager.DatabaseOptions.PASSWORD.value: "pass",
        # Optional: lift default gRPC message size cap for big batches
        # adbc_driver_flightsql.DatabaseOptions.WITH_MAX_MSG_SIZE.value: "2147483647",
    },
)
cur = conn.cursor()
cur.execute("""
  SELECT ts, user_id, metric_a, metric_b
  FROM prod.fact_events
  WHERE ts >= now() - interval 1 day
""")

# Fetch as an Arrow table (columnar) then use Pandas/Polars zero-copy
tbl = cur.fetchallarrow()          # -> pyarrow.Table
df = tbl.to_pandas()               # Pandas
pl_df = pl.from_arrow(tbl)         # Polars

cur.close(); conn.close()

The ADBC cookbook shows recipes for authenticated connections, timeouts, and raising the max gRPC message size when you hit “received message larger than max.” (Apache Arrow)

Performance checklist (what actually moves the needle)

Use Arrow end-to-end
Fetch Arrow from the cursor (fetchallarrow()), then hand it to Pandas/Polars. This avoids row→column conversions on the client side. (Apache Arrow)
Tune Flight SQL driver options
Set connection/statement timeouts; increase WITH_MAX_MSG_SIZE for very wide batches; optionally enable cookie middleware if your gateway requires it. (Apache Arrow)
Plan for BE reachability
Flight SQL often streams from BEs. If they’re private, configure public_host/proxy or expose appropriate egress paths; otherwise the client stalls or times out. (doris.apache.org)
Parallel endpoints (advanced)
Doris 3.0.4+ can return multiple endpoints; clients then pull from several BEs in parallel for some query shapes (set enable_parallel_result_sink=true when applicable). Measure carefully; small results don’t benefit. (doris.apache.org)
Reduce bytes first
Project only needed columns, filter early, and consider async materialized views for hot aggregates—let the warehouse do the heavy lifting, then stream fewer bytes. (doris.apache.org)

Reality check: The official docs say Flight SQL can be “up to 100× faster” than MySQL/JDBC/ODBC in some scenarios. Your mileage depends on schema width, network, CPU, and whether you actually consume Arrow on the client. Treat “10×” as a goal, not a guarantee. (docs.velodb.io)

Comparison table: Flight SQL vs. classic drivers

Aspect	Flight SQL (Arrow + gRPC)	MySQL/JDBC/ODBC path
Data format	Columnar Arrow batches	Row-oriented
Client cost	Low (no row→column reshaping)	Extra CPU to reshape
Transport	gRPC streaming, large frames	Text/binary row fetch
Python integration	ADBC DB-API; Arrow → Pandas/Polars	DB-API; usually rows→DataFrame
Network prereqs	FE + often BE reachability	Typically FE only
When it shines	Wide tables, big scans, data-science export	Small OLTP-style lookups

Sources: protocol & driver docs; Doris/VeloDB Flight SQL guides. (Apache Arrow)

End-to-end example with tuning

Set timeouts and bigger message size

from adbc_driver_flightsql import ConnectionOptions, DatabaseOptions
conn = flightsql.connect(
    URI,
    db_kwargs={
        adbc_driver_manager.DatabaseOptions.USERNAME.value: "user",
        adbc_driver_manager.DatabaseOptions.PASSWORD.value: "pass",
        DatabaseOptions.WITH_MAX_MSG_SIZE.value: "2147483647",  # ~2 GiB
    },
)
conn.adbc_connection.set_options(
    **{
        ConnectionOptions.TIMEOUT_FETCH.value: 60.0,
        ConnectionOptions.TIMEOUT_QUERY.value: 60.0,
        ConnectionOptions.TIMEOUT_UPDATE.value: 60.0,
    }
)

These options are straight from the ADBC Flight SQL recipes. (Apache Arrow)

Common pitfalls (and fixes)

BEs not reachable: Can’t fetch? It’s likely networking. Add a reverse proxy or configure public_host/arrow_flight_sql_proxy_port so clients can reach the BE that holds the result. (doris.apache.org)
Hitting gRPC size limits: Increase WITH_MAX_MSG_SIZE or fetch fewer columns per query. (Apache Arrow)
Expecting speedups but still slow: If your code converts Arrow → rows prematurely, you lose the benefit. Keep data in Arrow as long as possible. (Apache Arrow)
Small result sets: For tiny queries, transport doesn’t dominate; don’t expect miracles—optimize the SQL or add an async MV instead. (doris.apache.org)

Summary & call-to-action

Switching Python reads to Arrow Flight SQL is the fastest way to cut “data copy tax” when exporting analytics data from VeloDB/Doris. Start with ADBC, fetch Arrow tables, tune timeouts and message size, and verify BE reachability. If your workload is wide and scan-heavy, the gains can be dramatic. Try one pipeline today: migrate a single read path (wide table → Pandas/Polars) and compare wall-clock times.

Internal link ideas (for your site)

“VeloDB Flight SQL Setup: Ports, Proxies, and Security in BYOC”
“Async Materialized Views in Doris: Designing Read-Friendly Aggregates”
“Pandas vs. Polars with Arrow: Choosing Your Python Frame”
“Designing Tables for Fast Export: Partitioning & Sort Keys in Doris”

Image prompt (for DALL·E / Midjourney)

“A clean, modern diagram of a VeloDB (Apache Doris) cluster streaming query results via Arrow Flight SQL to Python clients (Pandas/Polars). Show FE/BE nodes, gRPC flows, and Arrow columnar batches. Minimalist, high-contrast, isometric vector style.”

Data/ML Engineer Blog