What Is Python?


Why Python?

  • Readable: code looks like pseudocode, great for teams.
  • Productive: huge standard library (“batteries included”).
  • Ecosystem: data (Pandas, Polars), ML (scikit-learn, PyTorch), web (FastAPI), automation (pathlib, subprocess).
  • Everywhere: scripts, services, ETL, notebooks, serverless, containers.

What is Python (in one paragraph)

Python is a high-level, interpreted programming language focused on readability and developer speed. It supports multiple styles—procedural, object-oriented, and functional—and ships with a rich standard library, so common jobs (files, JSON, HTTP, CLI args) need little extra code. Its package ecosystem makes it a go-to for data engineering, machine learning, and web services.
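
For instance, a CLI argument, a file, and a JSON round-trip take only the standard library (a throwaway sketch; the file name is made up):

# stdlib_demo.py
import json
import sys
from pathlib import Path

name = sys.argv[1] if len(sys.argv) > 1 else "world"      # CLI arg with a fallback
Path("greeting.json").write_text(json.dumps({"hello": name}))
print(json.loads(Path("greeting.json").read_text()))      # e.g. {'hello': 'world'}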


Install & Set Up (the low-friction way)

# 1) Install Python ≥3.10 from python.org or your OS package manager
# 2) Create an isolated environment for each project
python -m venv .venv
# 3) Activate it (Windows: .venv\Scripts\activate)
source .venv/bin/activate
# 4) Upgrade pip and install a formatter + linter + pytest
python -m pip install --upgrade pip
pip install black ruff pytest

Tip: run black . and ruff check . in CI to keep code clean automatically.


Your First 90 Seconds of Python

# hello.py
from datetime import datetime

name = input("Your name: ") or "friend"
print(f"Hello, {name}! It is {datetime.now():%Y-%m-%d %H:%M}.")

Run with python hello.py. That’s input, formatting, and imports in three lines.


Python Basics You’ll Use Daily

# collections & comprehensions
nums = [1, 2, 3, 4]
squares = [n*n for n in nums if n % 2 == 0]    # [4, 16]
by_id = {n: n*n for n in nums}                 # {1:1, 2:4, 3:9, 4:16}

# functions + type hints
from typing import Iterable, List
def take(n: int, it: Iterable[int]) -> List[int]:
    out: List[int] = []
    for x in it:
        out.append(x)
        if len(out) == n:
            break
    return out

# classes (dataclass = less boilerplate)
from dataclasses import dataclass
@dataclass
class User:
    id: int
    email: str

How to Use Python… for Real Work

1) Automation & CLI

# rename_photos.py
from pathlib import Path
for i, p in enumerate(sorted(Path("photos").glob("*.jpg")), start=1):
    p.rename(p.with_name(f"img_{i:04d}.jpg"))
print("Done.")

Run it once, commit the change, and move on.
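
The CLI half of the job usually means argparse from the standard library. A minimal sketch (the script name and flags are made up for illustration):

# greet_cli.py
import argparse

parser = argparse.ArgumentParser(description="Greet someone from the command line.")
parser.add_argument("name", help="who to greet")
parser.add_argument("--shout", action="store_true", help="uppercase the greeting")
args = parser.parse_args()

greeting = f"Hello, {args.name}!"
print(greeting.upper() if args.shout else greeting)

Run it as python greet_cli.py Ada --shout.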

2) Data wrangling (CSV → clean → Parquet)

# etl_min.py
import pandas as pd

df = pd.read_csv("events.csv")
df = df.query("country == 'US'").assign(event_date=lambda d: pd.to_datetime(d["event_time"]).dt.date)
df.to_parquet("events_us.parquet", index=False)
print(f"Rows: {len(df):,}")

3) Talk to a database (pure Python, no ORM required)

# db_select.py
from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine("sqlite:///demo.db")  # swap for postgres/snowflake engines
with engine.begin() as conn:
    rows = pd.read_sql(text("select 1 as n union all select 2"), conn)
print(rows.to_dict(orient="records"))

4) Tiny web API (FastAPI)

# api.py
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Item(BaseModel):
    name: str
    price: float

@app.get("/health")
def health():
    return {"ok": True}

@app.post("/items")
def create_item(it: Item):
    return {"msg": f"stored {it.name}", "price": it.price}

Run: pip install fastapi uvicorn pydantic, then uvicorn api:app --reload.
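
Once it is running, you can poke the endpoints from Python too; a quick sketch assuming uvicorn's default host/port and pip install requests:

# call_api.py
import requests

BASE = "http://127.0.0.1:8000"   # uvicorn's default bind address

print(requests.get(f"{BASE}/health").json())                                    # {'ok': True}
print(requests.post(f"{BASE}/items", json={"name": "notebook", "price": 4.99}).json())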

5) Data/ML starter (train, save, load)

# train_and_use.py
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import joblib

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("accuracy:", model.score(Xte, yte))
joblib.dump(model, "model.joblib")

# later…
clf = joblib.load("model.joblib")
print("predict:", clf.predict(Xte[:3]).tolist())

Batteries You Should Know (standard library gold)

  • pathlib for files/paths, json for JSON, csv for CSV.
  • subprocess for shell commands, argparse for CLIs.
  • concurrent.futures for multi-process CPU work, asyncio for async I/O.
  • venv for environments, logging for structured logs, functools (lru_cache, cached_property); a few of these appear together in the sketch below.
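
A quick sketch combining several of the items above (the file name is made up):

# batteries_demo.py
import json
import logging
import subprocess
import sys
from functools import lru_cache
from pathlib import Path

logging.basicConfig(level=logging.INFO)

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # memoization via lru_cache: each n is computed only once
    return n if n < 2 else fib(n - 1) + fib(n - 2)

Path("out.json").write_text(json.dumps({"fib_20": fib(20)}))        # pathlib + json
logging.info("wrote %s", json.loads(Path("out.json").read_text()))

result = subprocess.run([sys.executable, "--version"], capture_output=True, text=True)
logging.info("interpreter: %s", result.stdout.strip())              # subprocess + logging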

Best Practices That Save You Hours

  • One project = one virtualenv. Pin versions in requirements.txt or pyproject.toml.
  • Format + lint + test: black, ruff, pytest.
  • Type hints (mypy or Pyright) catch mistakes before runtime.
  • Don’t hardcode secrets. Use env vars + pydantic-settings or a vault (see the sketch after this list).
  • Small functions, pure where possible. Side effects live at the edges (I/O).
  • Logging > print in services; add IDs and context to logs.
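
A minimal sketch of two of those habits, secrets from the environment and logging with context (the BILLING_API_KEY name and values are made up):

# config_and_logging.py
import logging
import os

logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s %(message)s", level=logging.INFO)
log = logging.getLogger("billing")

api_key = os.environ.get("BILLING_API_KEY")   # hypothetical env var; never hardcode it
if not api_key:
    raise RuntimeError("Set BILLING_API_KEY before running")

log.info("sync started request_id=%s rows=%d", "req-123", 42)   # context in the log, not print()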

Performance & Scale (choose the right tool)

  • Vectorize with NumPy/Polars/Pandas; avoid Python loops on large arrays (sketched after this list).
  • Bigger than RAM? Use Dask or PySpark for distributed DataFrames; DuckDB for fast local SQL.
  • I/O first: compress (zstd), columnar formats (Parquet), and push filters down to the database.
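
The first point in one picture: the same sum of squares as a Python loop and as a NumPy call (timings vary by machine, but the vectorized version is usually orders of magnitude faster):

# vectorize_demo.py
import time
import numpy as np

xs = np.random.rand(1_000_000)

t0 = time.perf_counter()
slow = sum(x * x for x in xs)        # pure-Python loop over a million floats
t1 = time.perf_counter()
fast = float(np.dot(xs, xs))         # same computation, vectorized in C
t2 = time.perf_counter()

print(f"loop:  {t1 - t0:.3f}s  result={slow:.2f}")
print(f"numpy: {t2 - t1:.3f}s  result={fast:.2f}")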

Testing (confidence to refactor)

# test_etl_min.py
import importlib
import pandas as pd

def test_filter_and_parquet(tmp_path, monkeypatch):
    df = pd.DataFrame([
        {"country": "US", "event_time": "2025-01-01"},
        {"country": "CA", "event_time": "2025-01-02"},
    ])
    df.to_csv(tmp_path / "events.csv", index=False)
    monkeypatch.chdir(tmp_path)
    importlib.import_module("etl_min")  # importing the script runs it inside tmp_path
    out = pd.read_parquet("events_us.parquet")
    assert len(out) == 1 and out.iloc[0]["country"] == "US"

Run with pytest -q.


Common Pitfalls (and how to dodge them)

  • “Works on my machine.” Use virtualenvs and a lockfile; document python --version.
  • Slow Pandas loops. Prefer vectorized ops (or process in chunks) over row-wise .apply; consider Polars.
  • Mixing tabs/spaces or inconsistent style—fix with black.
  • Invisible errors. Always check return codes and wrap risky code with logging and clear exceptions (see the sketch after this list).
  • Secrets in code. Never commit .env with real tokens; use a secret manager.
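
One way to make failures visible, using check=True and logging around an external call (a sketch; assumes git is installed):

# safe_call.py
import logging
import subprocess

logging.basicConfig(level=logging.INFO)

try:
    # check=True turns a non-zero exit code into an exception instead of a silent failure
    subprocess.run(["git", "status"], check=True, capture_output=True, text=True)
except subprocess.CalledProcessError as exc:
    logging.error("git failed (exit %s): %s", exc.returncode, exc.stderr.strip())
    raise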

FAQ (quick answers)

Is Python good for beginners? Yes—readable syntax and tons of tutorials.
Is it fast enough? For I/O and gluing systems, absolutely. For heavy math, rely on NumPy/Polars (C/Rust under the hood) or scale out with Dask/Spark.
Where should I run it? Locally for scripts; containers or serverless for services; notebooks for exploration.


ETL/ELT (batch data movement & transforms)

  • Frames/compute: pandas, polars, pyarrow, duckdb, dask, pyspark, numpy (duckdb is sketched after this list)
  • File & object storage I/O: fsspec, smart_open, s3fs, gcsfs, adlfs, boto3, google-cloud-storage, azure-storage-blob
  • Compression/serialization: gzip (stdlib), zstandard, lz4, brotli, orjson, ujson, msgpack, fastavro, avro-python3, protobuf
  • Streaming/connectors (when needed): confluent-kafka, kafka-python, pulsar-client
  • Ingestion helpers: dlt (Data Load Tool), singer-python (for Singer taps/targets)
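
As a taste of the compute side, duckdb can filter a CSV and write Parquet with one SQL statement (a sketch; the file names are stand-ins):

# duckdb_quickstart.py
import duckdb

con = duckdb.connect()   # in-memory database
con.execute("""
    copy (select * from 'events.csv' where country = 'US')
    to 'events_us.parquet' (format parquet)
""")
print(con.execute("select count(*) from 'events_us.parquet'").fetchone())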

Data validation / quality gates

  • DataFrame-native validation: pandera
  • General schema validation: pydantic (v1/v2), marshmallow, voluptuous, cerberus, schema (a pydantic sketch follows this list)
  • Expectation frameworks / profiling: great_expectations, frictionless, ydata-profiling
  • Spark: pydeequ (wrapper over Deequ; JVM required)
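
A minimal quality-gate sketch with pydantic (the Event model and its fields are made up):

# validate_row.py
from pydantic import BaseModel, ValidationError

class Event(BaseModel):
    country: str
    amount: float

try:
    Event(country="US", amount="not-a-number")   # bad input is rejected, not silently passed through
except ValidationError as exc:
    print(exc)   # names the failing field and the reason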

DB / Warehouse I/O + SQL

  • SQL toolkit / query builder / ORM: SQLAlchemy, ibis-framework (ibis)
  • Embedded/columnar engine: duckdb, pyarrow.dataset
  • Postgres: psycopg2 / psycopg (new), asyncpg
  • MySQL/MariaDB: mysqlclient, PyMySQL, aiomysql
  • SQLite: sqlite3 (stdlib), aiosqlite
  • Snowflake: snowflake-connector-python, snowflake-sqlalchemy, snowflake-snowpark-python
  • BigQuery: google-cloud-bigquery, pandas-gbq
  • Trino/Presto: trino, presto-python-client
  • SQL Server/ODBC: pyodbc
  • Oracle: oracledb
  • Speedy DF loads: connectorx

ML training + experiment tracking + serving (all Python libs)

  • Training / models: scikit-learn, xgboost, lightgbm, catboost, pytorch, torchvision, tensorflow/keras
  • Feature tooling: featuretools
  • Experiment tracking / registry: mlflow, wandb, neptune-client, comet-ml
  • Tuning: optuna, ray[tune]
  • Batch utils: joblib, cloudpickle, dill
  • Serving APIs: fastapi, starlette, pydantic, uvicorn, gunicorn, bentoml, ray[serve], onnxruntime, mlserver, tritonclient

Monitoring & alerting (data, models, services)

  • Metrics/tracing/logs: prometheus_client, opentelemetry-sdk (+ opentelemetry-instrumentation-*), statsd, datadog, sentry-sdk, structlog, loguru
  • Data/ML monitoring & drift: evidently, whylogs, alibi-detect
  • Notifications: slack_sdk, apprise, twilio

Python Patterns

  • Creational (5): Factory Method, Abstract Factory, Builder, Prototype, Singleton
  • Structural (7): Adapter, Facade, Proxy, Decorator, Composite, Bridge, Flyweight
  • Behavioral (11): Strategy, Command, Observer, Iterator, Template Method, State, Chain of Responsibility, Mediator, Memento, Visitor, Interpreter (Strategy is sketched below)
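
As one illustration, the Strategy pattern in idiomatic Python is often just a callable you swap in (a sketch with made-up pricing rules):

# strategy.py
from dataclasses import dataclass
from typing import Callable, List

def regular(price: float) -> float:
    return price

def holiday_sale(price: float) -> float:
    return price * 0.8

@dataclass
class Checkout:
    pricing: Callable[[float], float]   # the interchangeable strategy

    def total(self, prices: List[float]) -> float:
        return sum(self.pricing(p) for p in prices)

print(Checkout(pricing=regular).total([10.0, 20.0]))        # 30.0
print(Checkout(pricing=holiday_sale).total([10.0, 20.0]))   # 24.0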