Creative Codes
AI/ML · June 1, 2026 · 10 min read

From Training to Endpoint: How We Deploy Custom ML Models

Most ML tutorials end at model.fit(). Production starts there. Here's how we take a trained model from notebook to a FastAPI endpoint that handles real traffic.

By Muhammad Hassan

At Creative Codes, we've deployed custom ML models for classification, sentiment analysis, anomaly detection, and predictive scoring. The training step is usually the smallest part of the work. Getting from a trained model to something that runs reliably in production is where most teams struggle. This is how we do it.

What "production" means for an ML model

A production ML model needs to:

  • Accept requests over HTTP (or another transport)
  • Return predictions in a consistent format in predictable time
  • Handle errors gracefully without crashing the service
  • Be monitorable: you need to know if accuracy is drifting or the endpoint is slow
  • Be deployable without a PhD in DevOps

FastAPI is our default serving framework for custom models. It's async, fast, auto-generates OpenAPI docs, and integrates cleanly with Python ML libraries.

Step 1: Data preparation

Before training, data preparation decisions lock in your model's ceiling. The mistakes made here can't be fixed with a better algorithm.

The things we always do:

  • Schema validation early. Know what columns exist, their types, and the range of values. Anything outside that range at inference time will silently degrade predictions.
  • Split before any preprocessing. Fit scalers, encoders, and imputers on training data only, then apply the fitted transformers to the validation and test sets. Fitting on the full dataset before splitting is a common source of data leakage.
  • Document the target variable. If it's a classification target, document every class label and its distribution. Imbalanced classes need handling.
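
To make the last point concrete, here is a minimal sketch of documenting the class distribution and deriving a weight for an imbalanced binary target. The helper names and the sample labels are hypothetical; the negative-to-positive ratio is a common starting value for XGBoost's `scale_pos_weight` parameter, not a tuned setting.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: fraction of rows} so the split can be documented."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def negative_to_positive_ratio(labels, positive_label=1):
    """Ratio of negative to positive rows for a binary target."""
    counts = Counter(labels)
    positives = counts[positive_label]
    if positives == 0:
        raise ValueError("no positive examples in labels")
    negatives = sum(counts.values()) - positives
    return negatives / positives


# Hypothetical labels: 90 negatives, 10 positives
y_example = [0] * 90 + [1] * 10
distribution = class_distribution(y_example)
ratio = negative_to_positive_ratio(y_example)
print(distribution)
print(ratio)
```

The ratio could then be passed as `XGBClassifier(scale_pos_weight=ratio)`. Compute it from the training labels only, and consider `stratify=y` in `train_test_split` so the test split keeps the same class balance.
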
python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

df = pd.read_csv("training_data.csv")

# Split BEFORE fitting anything
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, not fit_transform
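
The schema check from the first bullet can be sketched without extra dependencies. Below, `ColumnSpec`, `SCHEMA`, `validate_row`, and the column names are all hypothetical; in practice the types and ranges would come from the data audit. The same check can be reused at inference time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """Expected type and inclusive value range for one numeric column."""
    dtype: type
    min_value: float
    max_value: float


# Hypothetical schema; real ranges come from auditing the training data
SCHEMA = {
    "age": ColumnSpec(int, 18, 100),
    "monthly_spend": ColumnSpec(float, 0.0, 50_000.0),
}


def validate_row(row, schema=SCHEMA):
    """Return a list of problems; an empty list means the row is valid."""
    problems = []
    for name, spec in schema.items():
        if name not in row:
            problems.append(f"{name}: missing")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, spec.dtype):
            problems.append(
                f"{name}: expected {spec.dtype.__name__}, got {type(value).__name__}"
            )
        elif not spec.min_value <= value <= spec.max_value:
            problems.append(
                f"{name}: {value} outside [{spec.min_value}, {spec.max_value}]"
            )
    for name in row:
        if name not in schema:
            problems.append(f"{name}: unexpected column")
    return problems


valid_problems = validate_row({"age": 34, "monthly_spend": 120.5})
invalid_problems = validate_row({"age": 140, "extra": 1})
print(valid_problems)
print(invalid_problems)
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a row in a single pass.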

Step 2: Training and evaluation

For most tabular classification and regression tasks, we start with gradient boosting (XGBoost or LightGBM) before reaching for neural networks. It's faster to train, faster to serve, and often performs comparably on structured data.

For text: sentence-transformers for embeddings + a lightweight classifier on top. For images: fine-tuned EfficientNet or ResNet with frozen base layers.

The evaluation step people skip: define what "good enough" means before training, not after. If the model needs to achieve 95% precision on the positive class to be useful, write that down. Don't lower the bar afterwards, and don't tune the decision threshold on the test set just to hit the number.

python
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

model = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=6)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
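
One way to hold yourself to a bar agreed before training is to encode it as a check that fails loudly. This sketch computes precision for the positive class by hand so it has no dependencies. The 0.95 bar mirrors the example above, while `REQUIRED_PRECISION`, `meets_bar`, and the sample labels are hypothetical.

```python
REQUIRED_PRECISION = 0.95  # agreed before training; hypothetical value


def precision(y_true, y_pred, positive_label=1):
    """Precision = true positives / all predicted positives."""
    predicted_positive = [t for t, p in zip(y_true, y_pred) if p == positive_label]
    if not predicted_positive:
        return 0.0  # no positive predictions: treat as failing the bar
    true_positive = sum(1 for t in predicted_positive if t == positive_label)
    return true_positive / len(predicted_positive)


def meets_bar(y_true, y_pred, required=REQUIRED_PRECISION):
    """True only if positive-class precision reaches the agreed bar."""
    return precision(y_true, y_pred) >= required


# Hypothetical held-out labels and predictions
y_true_example = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_example = [1, 1, 1, 1, 0, 0, 0, 0]
example_precision = precision(y_true_example, y_pred_example)
example_passes = meets_bar(y_true_example, y_pred_example)
print(example_precision, example_passes)
```

With scikit-learn available, `sklearn.metrics.precision_score` computes the same quantity for binary labels. The point of the sketch is the explicit, pre-agreed constant rather than the metric implementation.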

If the eval metrics don't hit the bar: more data or better features first, hyperparameter tuning second, different model architecture last. Jumping straight to architecture changes is the most expensive way to improve a model.

Step 3: Packaging the model and its transformers

This is the step most tutorials skip, and it causes the most production bugs. The model needs the same scaler that was fit on the training data. If you serialize the model without it, inference usually won't fail outright: the model will accept unscaled inputs and silently return wrong predictions.

We use joblib for sklearn-compatible transformers and the model. We package them together into a single artifact:

python
import joblib

# Save as a bundle, not separately
artifact = {
    "model": model,
    "scaler": scaler,
    "feature_columns": list(X_train.columns),
    "model_version": "1.0.0",
    "trained_at": "2026-06-01",
}

joblib.dump(artifact, "model_artifact.joblib")

Versioning matters. Include the version string in the artifact. When something breaks at inference time, you need to know which model version was running.
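
A version string is most useful when it can be tied to the exact file that was deployed. One option, which goes beyond the bundle shown above, is to record a content hash next to the artifact. This sketch uses only the standard library; the sidecar file name and the `write_manifest` helper are hypothetical, and the demo hashes a stand-in file rather than a real model.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(artifact_path, model_version):
    """Write <artifact>.manifest.json holding the version and content hash."""
    artifact_path = Path(artifact_path)
    manifest = {
        "artifact": artifact_path.name,
        "model_version": model_version,
        "sha256": sha256_of_file(artifact_path),
    }
    manifest_path = artifact_path.with_name(artifact_path.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest


# Demo with a stand-in file; a real run would point at model_artifact.joblib
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "model_artifact.joblib"
    demo_path.write_bytes(b"stand-in for a real artifact")
    demo_manifest = write_manifest(demo_path, "1.0.0")
    manifest_written = (Path(tmp) / "model_artifact.joblib.manifest.json").exists()

print(demo_manifest["model_version"], demo_manifest["sha256"])
```

Comparing the recorded hash with the hash of the file inside a running container is a quick way to confirm which artifact is actually being served.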

Step 4: The FastAPI endpoint

python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import pandas as pd
import numpy as np

app = FastAPI(title="ML Model API", version="1.0.0")

# Load once at startup, not on every request
artifact = joblib.load("model_artifact.joblib")
model = artifact["model"]
scaler = artifact["scaler"]
feature_columns = artifact["feature_columns"]

class PredictRequest(BaseModel):
    features: dict[str, float]

class PredictResponse(BaseModel):
    prediction: int
    probability: float
    model_version: str

@app.post("/predict", response_model=PredictResponse)
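def predict(request: PredictRequest):
    # Plain def: scikit-learn/XGBoost inference is blocking and CPU-bound, so
    # FastAPI runs this handler in its threadpool instead of on the event loop.

    # Reject requests that don't supply every feature the model was trained on
    missing = [c for c in feature_columns if c not in request.features]
    if missing:
        raise HTTPException(status_code=422, detail=f"Missing features: {missing}")

    try:
        # Select and order columns exactly as they were at training time
        df = pd.DataFrame([request.features])[feature_columns]

        scaled = scaler.transform(df)
        prediction = int(model.predict(scaled)[0])
        # Probability of the predicted class
        probability = float(np.max(model.predict_proba(scaled)[0]))

        return PredictResponse(
            prediction=prediction,
            probability=probability,
            model_version=artifact["model_version"],
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Prediction failed: {str(e)}")

@app.get("/health")
async def health():
    return {"status": "ok", "model_version": artifact["model_version"]}

Key decisions in this endpoint:

  • Load the artifact once at startup. Loading on every request puts disk reads and deserialization on the hot path, which can be a 10-100x latency hit.
  • Validate inputs. A request that is missing a feature is rejected with a 422 instead of being filled with a default. A silently substituted value would degrade predictions without raising any error.
  • Keep blocking inference off the event loop. The handler is a plain def, so FastAPI runs it in a worker thread and one slow prediction doesn't stall other requests.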
  • Return the model version. Every prediction response should include which model generated it.
  • Health endpoint. Required for load balancers and uptime monitoring.
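
For reference, a client call against this endpoint could look like the sketch below. It uses only the standard library; the host, port, and feature names are hypothetical, and the request is built but not sent so the example runs without a live service.

```python
import json
import urllib.request


def build_predict_request(base_url, features):
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical host and feature names
req = build_predict_request(
    "http://localhost:8000", {"age": 34.0, "monthly_spend": 120.5}
)
print(req.full_url, req.get_method())
# result = send(req)  # requires the service to be running
```

The response body carries `prediction`, `probability`, and `model_version`, matching the `PredictResponse` model above.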

Step 5: Docker packaging

dockerfile
FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model_artifact.joblib .
COPY main.py .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

One decision to make: include the model artifact in the Docker image (simpler, larger image) or load it at startup from S3/GCS (more complex, enables model updates without rebuilding the image). For most clients, we start with artifact-in-image and move to remote loading when model iteration frequency justifies it.

Step 6: Monitoring in production

Two things you must monitor:

  • Latency. P95 and P99, not just average. A model that averages 50ms but spikes to 2s at P99 will cause production incidents.
  • Prediction distribution. Log the distribution of predicted classes over time. If a classifier that used to split 70/30 starts outputting 95/5, something has changed in the input data.

We use Prometheus + Grafana for metrics, with a middleware that logs prediction distributions to a database for drift detection.
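
A minimal sketch of the distribution check described above: compare the recent share of each predicted class against a baseline and flag large shifts. The distance measure (total variation distance), the 0.15 threshold, and the function names are illustrative assumptions rather than a description of a specific production setup.

```python
from collections import Counter


def class_shares(predictions):
    """Map each predicted class to its share of all predictions."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(baseline, recent):
    """Half the summed absolute difference in class shares (0 = identical, 1 = disjoint)."""
    labels = set(baseline) | set(recent)
    return 0.5 * sum(
        abs(baseline.get(label, 0.0) - recent.get(label, 0.0)) for label in labels
    )


def has_drifted(baseline, recent_predictions, threshold=0.15):
    """threshold is a hypothetical alerting level; tune it per model."""
    distance = total_variation_distance(baseline, class_shares(recent_predictions))
    return distance > threshold


baseline_shares = {0: 0.70, 1: 0.30}   # the 70/30 split from the example above
shifted_window = [0] * 95 + [1] * 5    # a recent window that has moved to 95/5
stable_window = [0] * 70 + [1] * 30

shifted_distance = total_variation_distance(baseline_shares, class_shares(shifted_window))
print(shifted_distance)
print(has_drifted(baseline_shares, shifted_window))
print(has_drifted(baseline_shares, stable_window))
```

A check like this only says that the output mix has moved. Finding out why still means looking at the input features.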

python
import time
from fastapi import Request

@app.middleware("http")
async def add_timing_header(request: Request, call_next):
    start = time.perf_counter()  # monotonic clock, unaffected by system time changes
    response = await call_next(request)
    duration = (time.perf_counter() - start) * 1000
    response.headers["X-Response-Time-Ms"] = str(round(duration, 2))
    return response
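
The timing header exposes per-request latency, but the P95 and P99 figures still have to be computed somewhere. In a Prometheus setup that is normally done from histogram buckets; the sketch below only illustrates the arithmetic, using the nearest-rank method on a list of durations. The function name and the sample values are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 slow ones
durations_ms = [50.0] * 98 + [2000.0] * 2
mean_ms = sum(durations_ms) / len(durations_ms)
p95 = percentile(durations_ms, 95)
p99 = percentile(durations_ms, 99)
print(mean_ms, p95, p99)
```

In this sample the mean and P95 both look healthy while P99 sits at two seconds, which is why the tail percentiles are worth tracking separately.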

What this looks like end-to-end

At Creative Codes, a typical custom ML model deployment follows this path:

  1. Data audit and feature agreement with client (1-2 days)
  2. Training pipeline with tracked experiments (2-3 days)
  3. Evaluation against agreed metrics (1 day)
  4. FastAPI endpoint with health check and versioning (1 day)
  5. Docker build + deployment to client's infrastructure (1 day)
  6. Monitoring setup + handoff documentation (1 day)

Total: 7-9 days for a clean custom classifier or regression model, start to production endpoint.


If you need a custom ML model built and deployed, tell us what you're trying to predict.
