How to choose (quick map)
- Write portable analytics in Python, let engines execute it → ibis
- Write SQL in Python with great control / ORM when needed → SQLAlchemy
- Do fast analytics locally / on Parquet, CSV, S3 → duckdb, pyarrow.dataset
- Talk to a specific DB → use its native driver (Postgres psycopg/asyncpg, Snowflake connector/Snowpark, etc.)
- Pull huge query results straight into a DataFrame fast → connectorx
SQL toolkit / query builder / ORM
SQLAlchemy (Core + ORM)
Use when: you want Pythonic SQL that’s explicit, composable, and debugs well; or you need an ORM.
Highlights: Core (SQL first), ORM (models), connection pooling, transactions, async support, works with many DBs (via drivers).
Gotchas: don’t build SQL with f-strings—use bound params; keep sessions short-lived.
from sqlalchemy import create_engine, text

eng = create_engine("postgresql+psycopg://user:pwd@host/db", future=True)
with eng.begin() as cx:  # transaction scope: commits on success, rolls back on error
    rows = cx.execute(
        text("SELECT id, ts FROM events WHERE ts >= :t"), {"t": "2025-11-01"}
    ).mappings().all()
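The ORM side pairs with the same engine; a minimal 2.0-style declarative sketch (the Event model is hypothetical, reusing eng from above):

from datetime import datetime
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

class Event(Base):  # hypothetical model for the events table above
    __tablename__ = "events"
    id: Mapped[int] = mapped_column(primary_key=True)
    ts: Mapped[datetime]

with Session(eng) as s, s.begin():  # short-lived session; commits on exit, rolls back on error
    s.add(Event(id=1, ts=datetime(2025, 11, 1)))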
Ibis (ibis-framework)
Use when: you want backend-agnostic analytics (write once, run on DuckDB/BigQuery/Snowflake/Trino/Postgres).
Highlights: vectorized, expression API; pushdown to the engine; easy to test on DuckDB locally.
Gotchas: not every backend has 1:1 feature parity.
import ibis

con = ibis.duckdb.connect()        # local engine; the same expression runs on other backends
t = con.table("payments")          # assumes a "payments" table exists in this database
expr = (
    t.filter(t.ts >= "2025-11-01")
    .group_by("tenant")
    .aggregate(total=t.amount.sum())
)
result = con.execute(expr)         # pandas DataFrame
Embedded / columnar engines
DuckDB
Use when: you need fast OLAP on your laptop/server, query Parquet/CSV directly, or stage data before loading to a warehouse.
Highlights: zero server, SQL over Parquet/S3, Arrow/Pandas/Polars interop, window functions, joins, UDFs.
Gotchas: process-local (not a server); set the threads PRAGMA if you need to cap or raise parallelism.
import duckdb

con = duckdb.connect()  # in-process; no server to run
# s3:// paths use the httpfs extension (auto-loaded in recent versions; otherwise INSTALL/LOAD httpfs)
df = con.execute("""
    SELECT tenant, COUNT(*) AS n
    FROM read_parquet('s3://bucket/logs/*.parquet')
    WHERE event_time >= '2025-11-01'
    GROUP BY 1
""").df()
PyArrow Dataset
Use when: you want scan-filter-project over partitioned files with predicate pushdown and then hand off to Pandas/Polars.
Highlights: works great on S3/GCS via fsspec; zero DB required.
Gotchas: beyond scan/filter/project, compute happens in Python unless you hand off to an engine (DuckDB/Polars/Arrow compute).
import pyarrow.dataset as ds

dataset = ds.dataset("s3://bucket/ds/", format="parquet")
scanner = dataset.scanner(
    filter=ds.field("date") >= "2025-11-01",  # pushed down: partition pruning + row-group stats
    columns=["tenant", "amount"],
)
table = scanner.to_table()  # Arrow Table; convert to pandas via .to_pandas()
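Handing the scan off to an engine, per the Gotcha above; a sketch assuming Polars and DuckDB are installed:

import duckdb
import polars as pl

pl_df = pl.from_arrow(table)       # Arrow Table -> Polars DataFrame, minimal copying
con = duckdb.connect()
con.register("payments", dataset)  # DuckDB can scan the Arrow dataset directly
total = con.execute("SELECT SUM(amount) FROM payments").fetchone()[0]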
Drivers by database
Postgres — psycopg (v3), psycopg2, asyncpg
- Pick: psycopg for modern sync; asyncpg for high-throughput async.
- Tip: use COPY for bulk I/O; set prepare_threshold=0 if you see plan-cache issues.
# psycopg (sync)
import psycopg
with psycopg.connect("postgresql://u:pw@h/db") as cx, cx.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM events WHERE ts >= %s", ("2025-11-01",))
    print(cur.fetchone()[0])
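For the COPY tip, psycopg (v3) exposes a copy API on the cursor; a minimal bulk-load sketch (the rows are illustrative):

# psycopg COPY for bulk loads
import psycopg
with psycopg.connect("postgresql://u:pw@h/db") as cx, cx.cursor() as cur:
    with cur.copy("COPY events (id, ts) FROM STDIN") as copy:
        for row in [(1, "2025-11-01"), (2, "2025-11-02")]:
            copy.write_row(row)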
MySQL/MariaDB — mysqlclient, PyMySQL, aiomysql
- Pick: mysqlclient (C, fastest) if wheels are available; PyMySQL (pure Python) for portability; aiomysql for async.
- Tip: watch timezones/decimals; prefer server-side LOAD DATA for bulk.
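A minimal PyMySQL sketch (host/credentials are placeholders; parameters bind with %s):

import pymysql

cx = pymysql.connect(host="h", user="u", password="pw", database="db")
try:
    with cx.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM events WHERE ts >= %s", ("2025-11-01",))
        print(cur.fetchone()[0])
finally:
    cx.close()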
SQLite — sqlite3, aiosqlite
- Pick: sqlite3 for scripts/tests; aiosqlite when integrating into async apps.
- Tip: use WAL mode for concurrency; mind file locking in containers.
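A minimal sqlite3 sketch with WAL enabled (file name and table are illustrative):

import sqlite3

cx = sqlite3.connect("app.db")
cx.execute("PRAGMA journal_mode=WAL")  # readers don't block the single writer
with cx:                               # transaction: commits on success, rolls back on error
    cx.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, ts TEXT)")
    cx.execute("INSERT INTO events (ts) VALUES (?)", ("2025-11-01",))
cx.close()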
Snowflake — snowflake-connector-python, snowflake-sqlalchemy, snowflake-snowpark-python
- Pick:
  - Connector for straight SQL + fast write_pandas/Arrow paths.
  - SQLAlchemy to integrate with generic code that expects an engine.
  - Snowpark for DataFrame-style pushdown (UDFs, stored procs, scaling in Snowflake).
- Tip: prefer COPY INTO/stages for large loads; use AUTOCOMMIT=False and explicit transactions for batches.
# Connector + write_pandas example
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

cx = snowflake.connector.connect(user="...", password="...", account="...", database="...", schema="...")
df = pd.DataFrame({"id": [1, 2]})
write_pandas(cx, df, "MY_TABLE")  # fast Arrow path; table must exist (or pass auto_create_table=True)
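For bulk loads per the tip, a hedged sketch of the stage + COPY INTO path (file path, stage, and table names are illustrative), reusing cx from above:

# Upload to the table stage, then COPY INTO the table
cur = cx.cursor()
cur.execute("PUT file:///tmp/events.csv @%MY_TABLE")
cur.execute("COPY INTO MY_TABLE FROM @%MY_TABLE FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")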
BigQuery — google-cloud-bigquery, pandas-gbq
- Pick: google-cloud-bigquery is the main SDK; pandas-gbq is a thin bridge.
- Tip: for large pulls use .to_dataframe(create_bqstorage_client=True) for the BigQuery Storage API.
from google.cloud import bigquery
client = bigquery.Client()
job = client.query("SELECT tenant, COUNT(*) n FROM ds.events GROUP BY tenant")
df = job.result().to_dataframe(create_bqstorage_client=True)
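Bind parameters via QueryJobConfig rather than string formatting (the @t parameter and ds.events table are illustrative):

job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("t", "TIMESTAMP", "2025-11-01 00:00:00")]
)
df = client.query(
    "SELECT tenant, COUNT(*) AS n FROM ds.events WHERE ts >= @t GROUP BY tenant",
    job_config=job_config,
).result().to_dataframe()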
Trino/Presto — trino, presto-python-client
- Pick: trino (newer) for Trino; presto-python-client for Presto.
- Tip: configure auth (JWT/OAuth/LDAP) and the HTTP session pool; expect server-side pagination.
import trino
cx = trino.dbapi.connect(host="trino.example", port=443, http_scheme="https",
                         auth=trino.auth.BasicAuthentication("u", "p"))
cur = cx.cursor()
cur.execute("SELECT count(*) FROM hive.logs")
print(cur.fetchone())
SQL Server / ODBC — pyodbc
Tip: use DSN or long connection strings; for fast bulk insert set cursor.fast_executemany = True.
import pyodbc
cx = pyodbc.connect("Driver={ODBC Driver 18 for SQL Server};Server=host;Database=db;Uid=u;Pwd=p;Encrypt=yes")
cur = cx.cursor()
cur.fast_executemany = True  # send parameter arrays in bulk
cur.executemany("INSERT INTO t(id,name) VALUES (?,?)", [(1, "a"), (2, "b")])
cx.commit()
Oracle — oracledb (successor to cx_Oracle)
Tip: “thin” mode = no client install; “thick” mode uses Oracle Client features; use array DML for bulk.
import oracledb
with oracledb.connect(user="u", password="p", dsn="host/service") as cx:
    with cx.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM DUAL")
        print(cur.fetchone()[0])
Speedy DataFrame loads — connectorx
Use when: you need to pull millions of rows fast into Pandas/Polars, using DB-native fast paths (Rust under the hood).
Gotchas: feature coverage varies by backend; mind timezone/decimal mappings.
import connectorx as cx

df = cx.read_sql(
    "postgresql://u:pw@host/db",
    "SELECT id, ts, amount FROM events WHERE ts >= '2025-11-01'",
    return_type="polars",  # or "pandas"
)
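For very large pulls, read_sql can split the query across parallel connections by partitioning on a numeric column (partition settings here are illustrative):

df = cx.read_sql(
    "postgresql://u:pw@host/db",
    "SELECT id, ts, amount FROM events",
    partition_on="id",     # numeric column to split ranges on
    partition_num=8,       # 8 parallel queries
    return_type="polars",
)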
Cross-cutting best practices (production-minded)
- Push work to the database/warehouse. Don’t SELECT * and aggregate in Python; filter/aggregate in SQL and pull the minimal shape.
- Always bind parameters (:name, %s, ?). Avoid string concatenation to prevent SQL injection and plan-cache misses.
- Stream/partition large pulls. Use server-side cursors (yield_per, .fetchmany) or engines with columnar transfer (BigQuery Storage API, Arrow/ConnectorX); see the sketch below.
- Transactions are your friend. Batch writes in a transaction; on failure, roll back cleanly.
- Connection pooling. Reuse connections; keep them short-lived at the call site.
- Types & timezones. Decide on UTC; map DECIMAL → Decimal; beware implicit casts.
- Arrow everywhere (when possible). Arrow/Parquet interop reduces copies between DB ↔ Python ↔ files.
- For Snowflake: prefer Snowpark/SQL for heavy transforms; stages + COPY INTO for bulk; write_pandas for small/medium DataFrame loads.
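A sketch of the streaming pattern with SQLAlchemy server-side cursors (connection string and batch size are illustrative):

from sqlalchemy import create_engine, text

eng = create_engine("postgresql+psycopg://user:pwd@host/db")
with eng.connect() as cx:
    result = cx.execution_options(yield_per=10_000).execute(
        text("SELECT id, ts, amount FROM events WHERE ts >= :t"), {"t": "2025-11-01"}
    )
    for batch in result.partitions():  # ~10k-row chunks; memory stays bounded
        print(len(batch))              # replace with real handling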




