What Is Python?
Why Python?
- Readable: code looks like pseudocode, great for teams.
- Productive: huge standard library (“batteries included”).
- Ecosystem: data (Pandas, Polars), ML (scikit-learn, PyTorch), web (FastAPI), automation (pathlib, subprocess).
- Everywhere: scripts, services, ETL, notebooks, serverless, containers.
What is Python (in one paragraph)
Python is a high-level, interpreted programming language focused on readability and developer speed. It supports multiple styles—procedural, object-oriented, and functional—and ships with a rich standard library, so common jobs (files, JSON, HTTP, CLI args) need little extra code. Its package ecosystem makes it a go-to for data engineering, machine learning, and web services.
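For a taste of those batteries, here is a minimal sketch, using only the standard library, that reads a CLI argument, writes JSON, and inspects the file it created (the file name and the --name flag are made up for this demo):
# quick_tour.py
import argparse
import json
from pathlib import Path
parser = argparse.ArgumentParser(description="Write a tiny JSON report")
parser.add_argument("--name", default="report")  # hypothetical flag, just for the demo
args = parser.parse_args()
out = Path(f"{args.name}.json")  # pathlib handles paths portably
out.write_text(json.dumps({"name": args.name, "items": [1, 2, 3]}, indent=2))
print(f"Wrote {out} ({out.stat().st_size} bytes)")
Run it with python quick_tour.py --name demo.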
Install & Set Up (the low-friction way)
# 1) Install Python ≥3.10 from python.org or your OS package manager
# 2) Create an isolated environment for each project
python -m venv .venv
# 3) Activate it (Windows: .venv\Scripts\activate)
source .venv/bin/activate
# 4) Upgrade pip and install a formatter + linter + pytest
python -m pip install --upgrade pip
pip install black ruff pytest
Tip: run black . and ruff check . in CI to keep code clean automatically.
Your First 90 Seconds of Python
# hello.py
from datetime import datetime
name = input("Your name: ") or "friend"
print(f"Hello, {name}! It is {datetime.now():%Y-%m-%d %H:%M}.")
Run with python hello.py. That’s input, formatting, and imports in three lines.
Python Basics You’ll Use Daily
# collections & comprehensions
nums = [1, 2, 3, 4]
squares = [n*n for n in nums if n % 2 == 0] # [4, 16]
by_id = {n: n*n for n in nums} # {1:1, 2:4, 3:9, 4:16}
# functions + type hints
from typing import Iterable, List
def take(n: int, it: Iterable[int]) -> List[int]:
    out = []
    for x in it:
        out.append(x)
        if len(out) == n:
            break
    return out
# classes (dataclass = less boilerplate)
from dataclasses import dataclass
@dataclass
class User:
    id: int
    email: str
How to Use Python… for Real Work
1) Automation & CLI
# rename_photos.py
from pathlib import Path
for i, p in enumerate(sorted(Path("photos").glob("*.jpg")), start=1):
    p.rename(p.with_name(f"img_{i:04d}.jpg"))
print("Done.")
Run it once, commit the change, and move on.
2) Data wrangling (CSV → clean → Parquet)
# etl_min.py
import pandas as pd
df = pd.read_csv("events.csv")
df = (
    df.query("country == 'US'")
      .assign(event_date=lambda d: pd.to_datetime(d["event_time"]).dt.date)
)
df.to_parquet("events_us.parquet", index=False)
print(f"Rows: {len(df):,}")
3) Talk to a database (pure Python, no ORM required)
# db_select.py
from sqlalchemy import create_engine, text
import pandas as pd
engine = create_engine("sqlite:///demo.db") # swap for postgres/snowflake engines
with engine.begin() as conn:
    rows = pd.read_sql(text("select 1 as n union all select 2"), conn)
print(rows.to_dict(orient="records"))
4) Tiny web API (FastAPI)
# api.py
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()
class Item(BaseModel):
    name: str
    price: float

@app.get("/health")
def health():
    return {"ok": True}

@app.post("/items")
def create_item(it: Item):
    return {"msg": f"stored {it.name}", "price": it.price}
Run pip install fastapi uvicorn pydantic, then uvicorn api:app --reload.
5) Data/ML starter (train, save, load)
# train_and_use.py
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import joblib
X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("accuracy:", model.score(Xte, yte))
joblib.dump(model, "model.joblib")
# later…
clf = joblib.load("model.joblib")
print("predict:", clf.predict(Xte[:3]).tolist())
Batteries You Should Know (standard library gold)
- pathlib for files/paths, json for JSON, csv for CSV.
- subprocess for shell commands, argparse for CLIs.
- concurrent.futures for multi-process CPU work, asyncio for async I/O.
- venv for environments, logging for structured logs, functools (lru_cache, cached_property).
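A minimal sketch tying a few of these together (the numbers and output file are arbitrary): memoize a recursive function with lru_cache, fan the work out across processes, and persist the result as JSON.
# batteries.py
import json
import logging
from concurrent.futures import ProcessPoolExecutor
from functools import lru_cache
from pathlib import Path
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # naive Fibonacci; lru_cache memoizes repeated calls within a process
    return n if n < 2 else fib(n - 1) + fib(n - 2)
if __name__ == "__main__":
    # CPU-bound work fans out across processes (each worker keeps its own cache)
    with ProcessPoolExecutor() as pool:
        results = dict(zip(range(25, 30), pool.map(fib, range(25, 30))))
    Path("fib.json").write_text(json.dumps(results, indent=2))
    logging.info("wrote fib.json with %d entries", len(results))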
Best Practices That Save You Hours
- One project = one virtualenv. Pin versions in requirements.txt or pyproject.toml.
- Format + lint + test: black, ruff, pytest.
- Type hints (mypy or Pyright) catch mistakes before runtime.
- Don’t hardcode secrets. Use env vars + pydantic-settings or a vault (see the sketch after this list).
- Small functions, pure where possible. Side effects live at the edges (I/O).
- Logging > print in services; add IDs and context to logs.
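To make the secrets and logging bullets concrete, a stdlib-only sketch (the env var names and defaults are assumptions; pydantic-settings adds validation on top of the same idea):
# config.py
import logging
import os
from dataclasses import dataclass
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s %(message)s")
log = logging.getLogger("etl")
@dataclass(frozen=True)
class Settings:
    db_url: str
    batch_size: int
def load_settings() -> Settings:
    # values come from the environment, never from source control
    return Settings(
        db_url=os.environ.get("DB_URL", "sqlite:///demo.db"),
        batch_size=int(os.environ.get("BATCH_SIZE", "500")),
    )
settings = load_settings()
log.info("starting with batch_size=%d", settings.batch_size)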
Performance & Scale (choose the right tool)
- Vectorize with NumPy/Polars/Pandas—avoid Python loops on large arrays (see the sketch after this list).
- Bigger than RAM? Use Dask or PySpark for distributed DataFrames; DuckDB for fast local SQL.
- I/O first: compress (zstd), columnar formats (Parquet), and push filters down to the database.
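A toy illustration of the first bullet; exact timings depend on your machine, but the vectorized version is typically orders of magnitude faster.
# vectorize.py
import time
import numpy as np
xs = np.random.rand(1_000_000)
t0 = time.perf_counter()
loop_total = 0.0
for x in xs:  # pure-Python loop over a large array
    loop_total += x * x
t1 = time.perf_counter()
vec_total = float(np.sum(xs * xs))  # same reduction, computed in C
t2 = time.perf_counter()
print(f"loop: {t1 - t0:.3f}s  vectorized: {t2 - t1:.3f}s  close: {np.isclose(loop_total, vec_total)}")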
Testing (confidence to refactor)
# test_etl_min.py
import importlib
import pandas as pd

def test_filter_and_parquet(tmp_path, monkeypatch):
    df = pd.DataFrame([
        {"country": "US", "event_time": "2025-01-01"},
        {"country": "CA", "event_time": "2025-01-02"},
    ])
    df.to_csv(tmp_path / "events.csv", index=False)
    monkeypatch.chdir(tmp_path)  # the script reads and writes relative paths here
    import etl_min               # importing runs the script top to bottom
    importlib.reload(etl_min)    # re-run it if another test already imported the module
    out = pd.read_parquet("events_us.parquet")
    assert len(out) == 1
    assert out.iloc[0]["country"] == "US"
Run with pytest -q.
Common Pitfalls (and how to dodge them)
- “Works on my machine.” Use virtualenvs and a lockfile; document the python --version you target.
- Slow Pandas loops. Prefer vectorized ops or .apply on chunks; consider Polars.
- Mixing tabs/spaces or inconsistent style. Fix with black.
- Invisible errors. Always check return codes and wrap risky code with logging and clear exceptions (see the sketch after this list).
- Secrets in code. Never commit .env with real tokens; use a secret manager.
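For the “invisible errors” bullet, a minimal sketch (the step name and command are placeholders) that checks return codes and logs failures with context before re-raising:
# run_step.py
import logging
import subprocess
import sys
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")
def run_step(name: str, cmd: list[str]) -> None:
    try:
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(cmd, check=True, capture_output=True, text=True)
        log.info("step=%s ok", name)
    except FileNotFoundError:
        log.error("step=%s command not found: %s", name, cmd[0])
        raise
    except subprocess.CalledProcessError as exc:
        log.error("step=%s failed rc=%d stderr=%s", name, exc.returncode, exc.stderr.strip())
        raise
if __name__ == "__main__":
    run_step("sanity-check", [sys.executable, "-c", "print('ok')"])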
FAQ (quick answers)
Is Python good for beginners? Yes—readable syntax and tons of tutorials.
Is it fast enough? For I/O and gluing systems, absolutely. For heavy math, rely on NumPy/Polars (C/Rust under the hood) or scale out with Dask/Spark.
Where should I run it? Locally for scripts; containers or serverless for services; notebooks for exploration.
ETL/ELT (batch data movement & transforms)
- Frames/compute: pandas, polars, pyarrow, duckdb, dask, pyspark, numpy (a DuckDB sketch follows this list)
- File & object storage I/O: fsspec, smart_open, s3fs, gcsfs, adlfs, boto3, google-cloud-storage, azure-storage-blob
- Compression/serialization: gzip (stdlib), zstandard, lz4, brotli, orjson, ujson, msgpack, fastavro, avro-python3, protobuf
- Streaming/connectors (when needed): confluent-kafka, kafka-python, pulsar-client
- Ingestion helpers: dlt (Data Load Tool), singer-python (for Singer taps/targets)
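As one concrete route through the first bullet, a minimal DuckDB sketch (it assumes the events.csv layout used earlier) that filters and writes Parquet without going through pandas:
# duck_etl.py
import duckdb
con = duckdb.connect()  # in-memory database
con.execute("""
    COPY (
        SELECT *, CAST(event_time AS DATE) AS event_date
        FROM read_csv_auto('events.csv')
        WHERE country = 'US'
    ) TO 'events_us.parquet' (FORMAT PARQUET)
""")
print(con.execute("SELECT count(*) FROM 'events_us.parquet'").fetchone()[0])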
Data validation / quality gates
- DataFrame-native validation: pandera (see the sketch after this list)
- General schema validation: pydantic (v1/v2), marshmallow, voluptuous, cerberus, schema
- Expectation frameworks / profiling: great_expectations, frictionless, ydata-profiling
- (Spark): pydeequ (wrapper over Deequ; JVM required)
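A minimal pandera sketch used as a quality gate (the column names and allowed values are assumptions matching the toy events data above):
# validate_events.py
import pandas as pd
import pandera as pa
schema = pa.DataFrameSchema({
    "country": pa.Column(str, pa.Check.isin(["US", "CA"])),
    "event_time": pa.Column(str, nullable=False),
})
df = pd.DataFrame([{"country": "US", "event_time": "2025-01-01"}])
validated = schema.validate(df)  # raises a SchemaError if a check fails
print(validated.shape)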
DB / Warehouse I/O + SQL
- SQL toolkit / query builder / ORM: SQLAlchemy, ibis-framework (ibis)
- Embedded/columnar engine: duckdb, pyarrow.dataset
- Postgres: psycopg2 / psycopg (new), asyncpg
- MySQL/MariaDB: mysqlclient, PyMySQL, aiomysql
- SQLite: sqlite3 (stdlib), aiosqlite
- Snowflake: snowflake-connector-python, snowflake-sqlalchemy, snowflake-snowpark-python
- BigQuery: google-cloud-bigquery, pandas-gbq
- Trino/Presto: trino, presto-python-client
- SQL Server/ODBC: pyodbc
- Oracle: oracledb
- Speedy DF loads: connectorx (see the sketch after this list)
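For the last bullet, a minimal connectorx sketch (the connection string, table, and columns are hypothetical):
# cx_load.py
import connectorx as cx
# connectorx reads query results straight into a DataFrame, skipping row-by-row drivers
df = cx.read_sql(
    "postgresql://user:password@localhost:5432/demo",  # hypothetical DSN
    "SELECT id, country FROM events WHERE country = 'US'",
)
print(df.head())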
ML training + experiment tracking + serving (all Python libs)
- Training / models: scikit-learn, xgboost, lightgbm, catboost, pytorch, torchvision, tensorflow/keras
- Feature tooling: featuretools
- Experiment tracking / registry: mlflow, wandb, neptune-client, comet-ml
- Tuning: optuna, ray[tune]
- Batch utils: joblib, cloudpickle, dill
- Serving APIs: fastapi, starlette, pydantic, uvicorn, gunicorn, bentoml, ray[serve], onnxruntime, mlserver, tritonclient
Monitoring & alerting (data, models, services)
- Metrics/tracing/logs: prometheus_client, opentelemetry-sdk (+ opentelemetry-instrumentation-*), statsd, datadog, sentry-sdk, structlog, loguru (a prometheus_client sketch follows this list)
- Data/ML monitoring & drift: evidently, whylogs, alibi-detect
- Notifications: slack_sdk, apprise, twilio
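A minimal prometheus_client sketch (the metric names and the fake workload are made up) that exposes a counter and a latency histogram on /metrics:
# metrics.py
import random
import time
from prometheus_client import Counter, Histogram, start_http_server
ROWS = Counter("etl_rows_processed_total", "Rows processed by the ETL job")
LATENCY = Histogram("etl_batch_seconds", "Time spent per batch")
if __name__ == "__main__":
    start_http_server(8000)  # serves metrics at http://localhost:8000/metrics
    while True:
        with LATENCY.time():  # times the block and records an observation
            time.sleep(random.random())
        ROWS.inc(100)  # pretend we processed 100 rows
Point a Prometheus scrape job at localhost:8000/metrics to collect these.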
Python Patterns
- Creational (5): Factory Method, Abstract Factory, Builder, Prototype, Singleton
- Structural (7): Adapter, Facade, Proxy, Decorator, Composite, Bridge, Flyweight
- Behavioral (11): Strategy, Command, Observer, Iterator, Template Method, State, Chain of Responsibility, Mediator, Memento, Visitor, Interpreter
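In Python many of these shrink to very little code because functions are first-class objects. A minimal Strategy sketch (all names are illustrative): swap the shipping rule without touching the Order class.
# strategy.py
from dataclasses import dataclass
from typing import Callable
def flat_shipping(total: float) -> float:
    return 5.0
def free_over_100(total: float) -> float:
    return 0.0 if total >= 100 else 5.0
@dataclass
class Order:
    total: float
    shipping_strategy: Callable[[float], float] = flat_shipping  # strategies are just callables
    def grand_total(self) -> float:
        return self.total + self.shipping_strategy(self.total)
print(Order(120, free_over_100).grand_total())  # 120.0
print(Order(40).grand_total())                  # 45.0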




