ML training + experiment tracking + serving (all Python libs)

The big picture (how these fit together)

Data → Train (sklearn/xgboost/lightgbm/catboost/torch/tf) 
       ↘ Tune (optuna / ray[tune]) → pick best model
        ↘ Track (mlflow / wandb / neptune / comet-ml) → metrics, params, artifacts
Artifacts → Serialize (joblib / cloudpickle / dill) → “model.pkl”
Serve → API (fastapi/starlette + pydantic + uvicorn/gunicorn)
        or Model servers (bentoml / ray[serve] / mlserver / tritonclient to NVIDIA Triton)
Runtime optimizations → onnxruntime (optional)
Feature engineering → featuretools (optional)

Training / models (what each is best at)

  • scikit-learn – Swiss-army knife for classic ML (tabular, small/medium data).
    Strengths: pipelines, preprocessing, cross-val; wide algo coverage. Use when: CPU tabular problems, quick baselines, clean API.
  • XGBoost – Gradient boosting, very strong on tabular with careful tuning.
    Strengths: accuracy, handling missing values; CPU/GPU. Notes: can overfit; watch n_estimators, max_depth, eta.
  • LightGBM – Faster/lighter boosting (histogram algorithm).
    Strengths: speed on large/tabular, categorical support via integer encoding. Notes: sensitive to num_leaves & min_data_in_leaf.
  • CatBoost – Boosting with native categorical handling (no heavy encoding).
    Strengths: great defaults, less feature engineering; handles text-ish features. Notes: GPU helps; watch training time on huge data.
  • PyTorch (+ torchvision for images) – Flexible, Pythonic deep learning.
    Strengths: custom architectures, research → prod; strong community. Use when: images, custom nets, you want control.
  • TensorFlow/Keras – DL framework with production tooling (TF Serving, TFX).
    Strengths: high-level Keras API, good for production pipelines; XLA. Use when: you want the TF ecosystem.

Quick pick: Tabular → start with LightGBM or XGBoost; Images → PyTorch + torchvision; Need turnkey classic ML → scikit-learn.
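
For reference, a minimal LightGBM baseline sketch using the sklearn-style API; the dataset, parameters, and early-stopping window are illustrative, not tuned recommendations.

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative params; num_leaves and min_data_in_leaf are the usual knobs to tune
clf = lgb.LGBMClassifier(n_estimators=500, num_leaves=31, learning_rate=0.05)
clf.fit(X_tr, y_tr, eval_set=[(X_te, y_te)], callbacks=[lgb.early_stopping(50)])
print("val AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))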


Feature tooling

  • featuretools – Automated feature engineering on relational/transactional tables.
    Use when: you’ve got entity relationships (customers ↔ orders ↔ items) and want fast baseline features. Tip: limit primitives; review generated features for leakage.
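
A minimal Deep Feature Synthesis sketch, assuming the featuretools 1.x API; the dataframes, relationship, and primitives are illustrative.

import pandas as pd
import featuretools as ft

customers = pd.DataFrame({"customer_id": [1, 2],
                          "joined": pd.to_datetime(["2023-01-05", "2023-02-10"])})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 2],
                       "amount": [20.0, 35.0, 15.0],
                       "ordered_at": pd.to_datetime(["2023-03-01", "2023-03-05", "2023-03-02"])})

es = ft.EntitySet(id="shop")
es.add_dataframe(dataframe_name="customers", dataframe=customers,
                 index="customer_id", time_index="joined")
es.add_dataframe(dataframe_name="orders", dataframe=orders,
                 index="order_id", time_index="ordered_at")
es.add_relationship("customers", "customer_id", "orders", "customer_id")

# Keep primitives small to limit feature explosion and make leakage review feasible
fm, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                          agg_primitives=["sum", "mean", "count"], max_depth=2)
print(fm.head())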

Experiment tracking / registry

  • MLflow – OSS tracking + model registry + artifacts. Runs anywhere (local/S3).
    Good for: teams wanting self-host or lightweight tracking.
  • Weights & Biases (wandb) – Hosted tracking, great dashboards/sweeps/artifacts.
    Good for: collaborative experiments, visualizations, minimal setup.
  • Neptune (neptune-client) – Hosted tracking with flexible metadata & dashboards.
  • Comet (comet-ml) – Hosted tracking similar to W&B; solid experiment management.

Quick pick: Need on-prem/OSS → MLflow. Want SaaS UX & sweeps → W&B (or Neptune/Comet if your team already uses them).
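
A minimal W&B logging sketch (MLflow is shown in recipe 1 below); the project name, config, and metric values are placeholders.

import wandb

run = wandb.init(project="tabular-baselines",          # hypothetical project name
                 config={"model": "lightgbm", "learning_rate": 0.05})
for step in range(3):
    wandb.log({"step": step, "val_auc": 0.90 + 0.01 * step})  # dummy metric values
run.finish()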


Tuning (hyperparameters)

  • Optuna – Elegant, fast Bayesian-style (TPE) search plus pruners. Tight integrations (sklearn, xgboost, lightgbm, catboost, PyTorch).
    Use when: single machine or modest scale; you want the simplest API that still performs well.
  • Ray Tune (ray[tune]) – Distributed hyperparam search on a cluster; many schedulers/algos.
    Use when: you need to parallelize across CPUs/GPUs or many nodes.

Quick pick: Start Optuna; move to Ray Tune when parallelism/distribution matters.
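
A minimal Ray Tune sketch, assuming Ray 2.x's Tuner API; the toy objective and search space are illustrative.

from ray import tune

def trainable(config):
    # Replace with real training; returning a dict reports the trial's final metrics
    loss = (config["lr"] - 0.1) ** 2
    return {"loss": loss}

tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(metric="loss", mode="min", num_samples=20),
)
results = tuner.fit()
print(results.get_best_result().config)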


Batch utils (serialization & parallel)

  • joblib – Save/load sklearn pipelines fast; simple parallel loops.
    Use for: dump()/load() models; CPU-bound map with Parallel(n_jobs=...).
  • cloudpickle – Serialize dynamic Python objects/functions (more flexible than pickle).
    Use for: sending callables to workers (Ray/Dask), custom objects.
  • dill – Even more permissive pickling (e.g., lambdas).
    Use carefully: portability can suffer; prefer joblib/cloudpickle first.
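
A minimal joblib sketch covering both uses above; the worker function is a stand-in for real CPU-bound work.

import joblib
from joblib import Parallel, delayed

# Persist / reload a fitted estimator or Pipeline
# joblib.dump(model, "model.pkl")
# model = joblib.load("model.pkl")

def cpu_heavy(x):            # stand-in for feature extraction, scoring, etc.
    return x * x

results = Parallel(n_jobs=-1)(delayed(cpu_heavy)(i) for i in range(8))
print(results)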

Gotcha: Pickles are not stable across Python/library versions. Pin versions or export to ONNX for portability where possible.


Serving APIs & runtimes

  • FastAPI + Pydantic + Uvicorn (Starlette under the hood) – Build typed, fast REST inference services.
    Pattern: fastapi app, pydantic request/response models, uvicorn as ASGI server, gunicorn for multi-worker in prod.
  • BentoML – Package and deploy models with batteries included (runners, adapters, OCI images).
    Great for: standardized packaging and multi-model services.
  • Ray Serve – Scalable model serving on Ray cluster (Pythonic APIs, autoscaling).
    Great for: many models, dynamic routing, distributed workloads.
  • MLServer – Seldon’s multi-framework model server (supports MLflow, scikit-learn, XGBoost, etc.).
    Great for: standard serving protocols; easy Dockerization.
  • Triton Inference Server (use tritonclient) – NVIDIA’s high-perf server for GPU/CPU; ensembles, batching.
    Great for: high-throughput DL, multiple frameworks, GPUs. Client lives in your Python service.
  • ONNX Runtime – High-performance runtime for ONNX models (CPU/GPU, quantization).
    Great for: portable, fast inference once you export to ONNX.

Quick pick: Simple service → FastAPI. Need packaging/best-practices → BentoML. Many models / scale-out → Ray Serve. GPU throughput → Triton (+ tritonclient). Want portability → export to ONNX and run with onnxruntime.
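
A minimal ONNX Runtime inference sketch; it assumes you have already exported "model.onnx" (e.g. with skl2onnx or torch.onnx.export) and that the dummy input shape matches that model.

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
X = np.random.rand(1, 30).astype(np.float32)   # shape must match the exported graph
outputs = sess.run(None, {input_name: X})      # None -> return all outputs
print(outputs[0])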


Two tiny “glue” recipes

1) Train + tune + track (sklearn + Optuna + MLflow) in ~20 lines

import mlflow, optuna
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

def objective(trial: optuna.Trial) -> float:
    params = dict(n_estimators=trial.suggest_int("n_estimators", 100, 600),
                  max_depth=trial.suggest_int("max_depth", 3, 20),
                  min_samples_leaf=trial.suggest_int("min_samples_leaf", 1, 10))
    with mlflow.start_run():
        clf = RandomForestClassifier(**params, n_jobs=-1, random_state=0)
        score = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
        mlflow.log_params(params); mlflow.log_metric("roc_auc", score)
        return 1.0 - score  # Optuna minimizes by default (or use create_study(direction="maximize"))

study = optuna.create_study()
study.optimize(objective, n_trials=20)
print("best:", study.best_params)

2) Serve the model (FastAPI + Pydantic + Uvicorn)

# save: joblib.dump(model, "model.pkl")
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

class Inp(BaseModel):
    mean_radius: float
    mean_texture: float
    mean_smoothness: float
    # ... add required features

app = FastAPI()
model = joblib.load("model.pkl")

@app.post("/predict")
def predict(x: Inp):
    X = [[x.mean_radius, x.mean_texture, x.mean_smoothness]]
    proba = float(model.predict_proba(X)[0][1])
    return {"score": proba}
# run: uvicorn app:app --host 0.0.0.0 --port 8000
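# prod (multi-worker, per the FastAPI pattern above): gunicorn -w 4 -k uvicorn.workers.UvicornWorker app:app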

Pragmatic “pick one” starter stack

  • Tabular: lightgbm or xgboost → Optuna → MLflow → serialize with joblib → FastAPI + Uvicorn (later: export to ONNX + onnxruntime if you need portability).
  • Vision: torch + torchvision → W&B for tracking → serve with BentoML or Ray Serve (GPU) → consider Triton for max throughput.
  • General deep learning (TF): tensorflow/keras → W&B/MLflow → TF Serving or BentoML.

Common gotchas & pro tips

  • Version pinning: pin Python + libs for training and serving; pickle breakage is real.
  • Determinism: set seeds and limit threads for reproducible metrics (OMP_NUM_THREADS, BLAS thread vars); see the snippet after this list.
  • Feature parity: lock the same preprocessing at train and serve (sklearn Pipeline, or export preprocessing into ONNX).
  • Throughput: prefer batch inference (Triton/MLServer) or async endpoints; profile JSON parse & validation.
  • Observability: log inputs (hashed), latencies, and outputs; add drift checks (evidently/whylogs) later.
  • Resource fit: CPU tabular often beats small DL; use GPU when it actually moves the needle.
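
A minimal reproducibility sketch for the determinism tip; thread-limiting env vars must be set before numpy/sklearn are imported.

import os
os.environ["OMP_NUM_THREADS"] = "1"        # set before importing numpy/sklearn
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# If using PyTorch: torch.manual_seed(SEED)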