How to choose (quick map)

  • Write portable analytics in Python, let engines execute it → ibis
  • Write SQL in Python with great control / ORM when needed → SQLAlchemy
  • Do fast analytics locally / on Parquet, CSV, S3 → duckdb, pyarrow.dataset
  • Talk to a specific DB → use its native driver (Postgres psycopg/asyncpg, Snowflake connector/Snowpark, etc.)
  • Pull huge query results straight into a DataFrame fast → connectorx

SQL toolkit / query builder / ORM

SQLAlchemy (Core + ORM)

Use when: you want Pythonic SQL that’s explicit, composable, and debugs well; or you need an ORM.
Highlights: Core (SQL first), ORM (models), connection pooling, transactions, async support, works with many DBs (via drivers).
Gotchas: don’t build SQL with f-strings—use bound params; keep sessions short-lived.

from sqlalchemy import create_engine, text

eng = create_engine("postgresql+psycopg://user:pwd@host/db", future=True)
with eng.begin() as cx:
    rows = cx.execute(text("SELECT id, ts FROM events WHERE ts >= :t"), {"t": "2025-11-01"}).mappings().all()
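
If you want the ORM side, here is a minimal sketch in SQLAlchemy 2.0 declarative style (the Event model, its columns, and the "acme" filter are illustrative; it reuses eng from above):

from sqlalchemy import select
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, Session

class Base(DeclarativeBase):
    pass

class Event(Base):
    __tablename__ = "events"
    id: Mapped[int] = mapped_column(primary_key=True)
    tenant: Mapped[str]

with Session(eng) as session:
    events = session.scalars(select(Event).where(Event.tenant == "acme")).all()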

Ibis (ibis-framework)

Use when: you want backend-agnostic analytics (write once, run on DuckDB/BigQuery/Snowflake/Trino/Postgres).
Highlights: vectorized, expression API; pushdown to the engine; easy to test on DuckDB locally.
Gotchas: not every backend has 1:1 feature parity.

import ibis

con = ibis.duckdb.connect("payments.db")   # swap for BigQuery/Snowflake/Trino/Postgres as needed
t = con.table("payments")                  # columns: tenant (string), amount (float64), ts (timestamp)
expr = t.filter(t.ts >= "2025-11-01").group_by("tenant").agg(total=t.amount.sum())
result = con.execute(expr)                 # pandas DataFrame; the work runs in the backend

Embedded / columnar engines

DuckDB

Use when: you need fast OLAP on your laptop/server, query Parquet/CSV directly, or stage data before loading to a warehouse.
Highlights: zero server, SQL over Parquet/S3, Arrow/Pandas/Polars interop, window functions, joins, UDFs.
Gotchas: process-local (not a server); tune the threads setting (PRAGMA threads) if needed; reading from S3 needs the httpfs extension plus credentials.

import duckdb
con = duckdb.connect()
df = con.execute("""
  SELECT tenant, COUNT(*) n
  FROM read_parquet('s3://bucket/logs/*.parquet')
  WHERE event_time >= '2025-11-01'
  GROUP BY 1
""").df()

PyArrow Dataset

Use when: you want scan-filter-project over partitioned files with predicate pushdown and then hand off to Pandas/Polars.
Highlights: works great on S3/GCS via fsspec; zero DB required.
Gotchas: it only scans, filters, and projects; heavier transforms (joins, aggregations) should be handed to an engine (DuckDB/Polars/Arrow compute).

import pyarrow.dataset as ds
dataset = ds.dataset("s3://bucket/ds/", format="parquet")
scanner = dataset.scanner(filter=ds.field("date") >= "2025-11-01", columns=["tenant","amount"])
table = scanner.to_table()          # Arrow Table; convert to pandas via .to_pandas()
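
To hand heavier work to an engine, DuckDB can query the same dataset object directly (a sketch; it relies on DuckDB's replacement scans picking up the local dataset variable, and on the tenant/amount/date columns assumed above):

import duckdb
con = duckdb.connect()
out = con.execute("""
  SELECT tenant, SUM(amount) AS total
  FROM dataset              -- resolved to the pyarrow Dataset in scope
  WHERE "date" >= '2025-11-01'
  GROUP BY 1
""").df()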

Drivers by database

Postgres — psycopg (new), psycopg2, asyncpg

  • Pick: psycopg for modern sync; asyncpg for high-throughput async.
  • Tip: use COPY for bulk I/O; set prepare_threshold=None to disable server-side prepared statements if you hit pooler or plan-cache issues.
# psycopg (sync)
import psycopg
with psycopg.connect("postgresql://u:pw@h/db") as cx, cx.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM events WHERE ts >= %s", ("2025-11-01",))
    print(cur.fetchone()[0])
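
The COPY tip looks roughly like this with psycopg 3's copy API (a sketch; the events(id, ts) target table is illustrative):

# psycopg (COPY bulk load)
import psycopg
rows = [(1, "2025-11-01"), (2, "2025-11-02")]
with psycopg.connect("postgresql://u:pw@h/db") as cx, cx.cursor() as cur:
    with cur.copy("COPY events (id, ts) FROM STDIN") as copy:
        for row in rows:
            copy.write_row(row)   # streams rows over the COPY protocol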

MySQL/MariaDB — mysqlclient, PyMySQL, aiomysql

  • Pick: mysqlclient (C, fastest) if wheels available; PyMySQL (pure Python) for portability; aiomysql for async.
  • Tip: watch timezones/decimals; prefer server-side LOAD DATA for bulk.
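
A minimal PyMySQL sketch (connection parameters and the events table are placeholders):

# PyMySQL (pure Python)
import pymysql
cx = pymysql.connect(host="h", user="u", password="pw", database="db")
with cx.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM events WHERE ts >= %s", ("2025-11-01",))
    print(cur.fetchone()[0])
cx.close()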

SQLite — sqlite3, aiosqlite

  • Pick: sqlite3 for scripts/tests; aiosqlite when integrating into async apps.
  • Tip: use WAL mode for concurrency; mind file locking in containers.
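
A minimal sqlite3 sketch with WAL mode enabled (the app.db file and events table are illustrative):

# sqlite3 (stdlib)
import sqlite3
cx = sqlite3.connect("app.db")
cx.execute("PRAGMA journal_mode=WAL")          # better concurrency for readers + one writer
cx.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, ts TEXT)")
with cx:                                       # transaction: commits on success, rolls back on error
    cx.execute("INSERT INTO events (id, ts) VALUES (?, ?)", (1, "2025-11-01"))
print(cx.execute("SELECT COUNT(*) FROM events WHERE ts >= ?", ("2025-11-01",)).fetchone()[0])
cx.close()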

Snowflake — snowflake-connector-python, snowflake-sqlalchemy, snowflake-snowpark-python

  • Pick:
    • Connector for straight SQL + fast write_pandas/Arrow paths.
    • SQLAlchemy to integrate with generic code that expects an engine.
    • Snowpark for DataFrame-style pushdown (UDFs, stored procs, scaling in Snowflake).
  • Tip: prefer COPY INTO/stages for large loads; use AUTOCOMMIT=False and explicit transactions for batches.
# Connector + write_pandas example
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# set warehouse/database/schema on the connection (or pass them to write_pandas)
cx = snowflake.connector.connect(user="...", password="...", account="...",
                                 warehouse="...", database="...", schema="...")
df = pd.DataFrame({"id": [1, 2]})
write_pandas(cx, df, "MY_TABLE")  # fast Arrow/Parquet path
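
For large loads, the stages + COPY INTO tip looks roughly like this (a sketch; the local CSV path, table stage, and file format are illustrative):

# Stage + COPY INTO (bulk load)
cur = cx.cursor()
cur.execute("PUT file:///tmp/events.csv @%MY_TABLE")   # upload to the table's stage
cur.execute("""
    COPY INTO MY_TABLE
    FROM @%MY_TABLE
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")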

BigQuery — google-cloud-bigquery, pandas-gbq

  • Pick: google-cloud-bigquery is the main SDK; pandas-gbq is a thin bridge.
  • Tip: for large pulls use .to_dataframe(create_bqstorage_client=True) for the BigQuery Storage API.
from google.cloud import bigquery
client = bigquery.Client()
job = client.query("SELECT tenant, COUNT(*) n FROM ds.events GROUP BY tenant")
df = job.result().to_dataframe(create_bqstorage_client=True)
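
Parameterized queries go through QueryJobConfig (a sketch; ds.events and the ts column reuse the names above):

import datetime as dt
from google.cloud import bigquery

client = bigquery.Client()
cfg = bigquery.QueryJobConfig(query_parameters=[
    bigquery.ScalarQueryParameter("t", "TIMESTAMP", dt.datetime(2025, 11, 1, tzinfo=dt.timezone.utc)),
])
job = client.query(
    "SELECT tenant, COUNT(*) AS n FROM ds.events WHERE ts >= @t GROUP BY tenant",
    job_config=cfg,
)
df = job.result().to_dataframe(create_bqstorage_client=True)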

Trino/Presto — trino, presto-python-client

  • Pick: trino (newer) for Trino; presto-python-client for Presto.
  • Tip: configure auth (JWT/OAuth/LDAP) and HTTP session pool; expect server-side pagination.
import trino
cx = trino.dbapi.connect(host="trino.example", port=443, http_scheme="https", auth=trino.auth.BasicAuthentication("u","p"))
cur = cx.cursor()
cur.execute("SELECT count(*) FROM hive.logs")
print(cur.fetchone())
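
For large result sets, fetch in batches instead of fetchall() (a sketch; the tenant/event_time columns on hive.logs are assumptions):

cur = cx.cursor()
cur.execute("SELECT tenant, event_time FROM hive.logs WHERE event_time >= TIMESTAMP '2025-11-01'")
while True:
    batch = cur.fetchmany(10_000)   # Trino pages results server-side; consume them incrementally
    if not batch:
        break
    # process batch here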

SQL Server / ODBC — pyodbc

Tip: use DSN or long connection strings; for fast bulk insert set cursor.fast_executemany = True.

import pyodbc
cx = pyodbc.connect("Driver={ODBC Driver 18 for SQL Server};Server=host;Uid=u;Pwd=p;Encrypt=yes")
cur = cx.cursor()
cur.fast_executemany = True
cur.executemany("INSERT INTO t(id, name) VALUES (?, ?)", [(1, "a"), (2, "b")])
cx.commit()

Oracle — oracledb (successor to cx_Oracle)

Tip: “thin” mode = no client install; “thick” mode uses Oracle Client features; use array DML for bulk.

import oracledb
with oracledb.connect(user="u", password="p", dsn="host/service") as cx:
    with cx.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM DUAL"); print(cur.fetchone()[0])

Speedy DataFrame loads — connectorx

Use when: you need to pull millions of rows fast into Pandas/Polars, using DB-native fast paths (Rust under the hood).
Gotchas: feature coverage varies by backend; mind timezone/decimal mappings.

import connectorx as cx
df = cx.read_sql(
    "postgresql://u:pw@host/db",
    "SELECT id, ts, amount FROM events WHERE ts >= '2025-11-01'",
    return_type="polars"  # or "pandas"
)
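
To parallelize a big pull, connectorx can split the query on a numeric column (a sketch; partitioning on id is an assumption about the schema):

df = cx.read_sql(
    "postgresql://u:pw@host/db",
    "SELECT id, ts, amount FROM events WHERE ts >= '2025-11-01'",
    partition_on="id",     # numeric column used to split the work
    partition_num=8,       # number of partitions fetched in parallel
    return_type="pandas",
)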

Cross-cutting best practices (production-minded)

  • Push work to the database/warehouse. Don’t SELECT * and aggregate in Python—filter/aggregate in SQL, pull the minimal shape.
  • Always bind parameters (:name, %s, ?). Avoid string concatenation to prevent SQL injection and plan cache misses.
  • Stream/partition large pulls. Use server cursors (yield_per, .fetchmany), or engines with columnar transfer (BigQuery Storage API, Arrow/ConnectorX); see the streaming sketch after this list.
  • Transactions are your friend. Batch writes in a transaction; on failure, rollback cleanly.
  • Connection pooling. Reuse connections; keep them short-lived at the call site.
  • Types & timezones. Decide on UTC; map DECIMAL → Decimal; beware implicit casts.
  • Arrow everywhere (when possible). Arrow/Parquet interop reduces copies between DB ↔ Python ↔ files.
  • For Snowflake: prefer Snowpark/SQL for heavy transforms; use stages + COPY INTO for bulk; write_pandas for small/medium DF loads.
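
A minimal streaming sketch for the "stream/partition large pulls" point, reusing the SQLAlchemy engine from above with a server-side cursor (stream_results) and chunked fetches:

from sqlalchemy import create_engine, text

eng = create_engine("postgresql+psycopg://user:pwd@host/db")
with eng.connect() as cx:
    result = cx.execution_options(stream_results=True).execute(
        text("SELECT id, ts FROM events WHERE ts >= :t"), {"t": "2025-11-01"}
    )
    for chunk in result.partitions(10_000):   # lists of Row objects, fetched lazily
        pass                                  # process each chunk incrementally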