How long does it take to deploy a custom ML model to a production API?

At Creative Codes, a typical model-to-endpoint pipeline takes 1-2 weeks depending on model complexity and data prep. Training a classification model on a clean dataset, wrapping it in a FastAPI endpoint, adding Docker packaging and basic monitoring: 3-5 days. A larger model with custom preprocessing, A/B routing, and auto-scaling: 2-3 weeks. The biggest variable is data quality, not model architecture.

What's the difference between deploying a custom ML model vs using an LLM API?

A custom ML model (scikit-learn, XGBoost, PyTorch classifier) runs inference in milliseconds for fractions of a cent per thousand calls at scale. An LLM API call takes 1-3 seconds and costs $0.01-$0.10+ per call. For structured prediction (classification, regression, anomaly detection), a custom model running on your own infrastructure is almost always faster and cheaper at volume. We use LLMs for unstructured text understanding and custom models for everything else.

Do you use Docker for ML model deployment?

Yes. Every model we deploy ships in a Docker container: the model file, preprocessing code, and FastAPI app in one portable image. This ensures the inference environment matches training, makes deployment to any cloud provider trivial, and lets us version model deployments with Docker tags. We use Docker Compose for local testing and a VPS or AWS Lambda container image for production, depending on traffic requirements.


AI/ML · June 1, 2026 · 10 min read

From Training to Endpoint: How We Deploy Custom ML Models

Most ML tutorials end at model.fit(). The real work starts there. Here's how we take a trained model from notebook to a FastAPI endpoint that handles real traffic.

Muhammad Hassan

Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.


The training step is usually the smallest part of a custom ML project. The real work starts at deployment. Getting from a trained model to something that runs reliably in production is where most teams struggle. This is how we do it.

What "production" means for an ML model

A production ML model needs to:

Accept requests over HTTP (or another transport)
Return predictions in a consistent format in predictable time
Handle errors gracefully without crashing the service
Be monitorable: you need to know if accuracy is drifting or the endpoint is slow
Be deployable without a PhD in DevOps

FastAPI is our default serving framework for custom models. It's async, fast, auto-generates OpenAPI docs, and integrates cleanly with Python ML libraries.

Step 1: Data preparation

Before training, data preparation decisions lock in your model's ceiling. The mistakes made here can't be fixed with a better algorithm.

The things we always do:

Schema validation early. Know what columns exist, their types, and the range of values. Anything outside that range at inference time will silently degrade predictions.
Split before any preprocessing. Fit scalers, encoders, and imputers on training data only, then apply the fitted transformers to the validation and test sets. Fitting on the full dataset before splitting is a common data leakage mistake.
Document the target variable. If it's a classification target, document every class label and its distribution. Imbalanced classes need handling.

python

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

df = pd.read_csv("training_data.csv")

# Split BEFORE fitting anything
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, not fit_transform
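The "schema validation early" point above can be sketched in a few lines of pandas. This is a minimal illustration, not the exact check we ship: the column names and ranges in `EXPECTED_RANGES` and the helper `validate_schema` are hypothetical, and a real project would derive the ranges from the data audit.

```python
import pandas as pd

# Hypothetical schema: column -> (min, max) for numeric features.
# Names and ranges are illustrative, not taken from a real project.
EXPECTED_RANGES = {
    "age": (0, 120),
    "income": (0, 1_000_000),
}

def validate_schema(df: pd.DataFrame, expected_ranges: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the frame passed."""
    problems = []
    for column, (low, high) in expected_ranges.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
            continue
        series = df[column]
        if not pd.api.types.is_numeric_dtype(series):
            problems.append(f"{column}: expected numeric, got {series.dtype}")
            continue
        # NaN compares False on both sides, so missing values are counted separately
        out_of_range = int(((series < low) | (series > high)).sum())
        if out_of_range:
            problems.append(f"{column}: {out_of_range} value(s) outside [{low}, {high}]")
        missing = int(series.isna().sum())
        if missing:
            problems.append(f"{column}: {missing} missing value(s)")
    return problems
```

Running the same function on training data and on a sample of production inputs gives one definition of "in range" for both sides.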

Step 2: Training and evaluation

For most tabular classification and regression tasks, we start with gradient boosting (XGBoost or LightGBM) before reaching for deep neural networks. It's faster to train, faster to serve, and often performs comparably on structured data.

For text: sentence-transformers for embeddings + a lightweight classifier on top. For images: fine-tuned EfficientNet or ResNet with frozen base layers.

The evaluation step people skip: define what "good enough" means before training, not after. If the model needs to achieve 95% precision on the positive class to be useful, write that down. Don't adjust the threshold post-hoc to hit a number.

python

from xgboost import XGBClassifier
from sklearn.metrics import classification_report

model = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=6)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

If the eval metrics don't hit the bar: more data or better features first, hyperparameter tuning second, different model architecture last. Jumping straight to architecture changes is the most expensive way to improve a model.
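One way to make "define good enough before training" enforceable is to write the bar down as data and check it in code. The sketch below assumes a binary classifier with positive label 1. The thresholds in `ACCEPTANCE_BAR` and the helper `meets_bar` are illustrative names and values, not figures from a real engagement.

```python
from sklearn.metrics import precision_score, recall_score

# Agreed BEFORE training; the numbers here are placeholders.
ACCEPTANCE_BAR = {"precision": 0.95, "recall": 0.60}

def meets_bar(y_true, y_pred, bar=ACCEPTANCE_BAR, positive_label=1) -> dict:
    """Compare positive-class metrics with the pre-agreed bar."""
    scores = {
        "precision": precision_score(
            y_true, y_pred, pos_label=positive_label, zero_division=0
        ),
        "recall": recall_score(
            y_true, y_pred, pos_label=positive_label, zero_division=0
        ),
    }
    # Every metric named in the bar must reach its minimum
    passed = all(scores[name] >= minimum for name, minimum in bar.items())
    return {"scores": scores, "passed": passed}
```

Calling `meets_bar(y_test, y_pred)` after training turns the evaluation into a pass/fail gate instead of a number that gets reinterpreted afterwards.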

Step 3: Packaging the model and its transformers

This is the step most tutorials skip, and it causes the most production bugs. The model needs the scaler that was fit on training data. If you serialize the model without the scaler, inference either fails or, worse, runs on unscaled inputs and returns wrong predictions without raising any error.

We use joblib for sklearn-compatible transformers and the model. We package them together into a single artifact:

python

import joblib

# Save as a bundle, not separately
artifact = {
    "model": model,
    "scaler": scaler,
    "feature_columns": list(X_train.columns),
    "model_version": "1.0.0",
    "trained_at": "2026-06-01",
}

joblib.dump(artifact, "model_artifact.joblib")

Versioning matters. Include the version string in the artifact. When something breaks at inference time, you need to know which model version was running.
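A bundle is only useful if the serving side can trust its contents. The sketch below shows one possible load-time check; `REQUIRED_KEYS` and `load_artifact` are hypothetical names, and the key set simply mirrors the dictionary saved above.

```python
import joblib

# Keys the serving code relies on; mirrors the bundle saved above.
REQUIRED_KEYS = {"model", "scaler", "feature_columns", "model_version"}

def load_artifact(path: str) -> dict:
    """Load the bundle and fail fast if any expected piece is missing."""
    artifact = joblib.load(path)
    missing = REQUIRED_KEYS - set(artifact)
    if missing:
        raise ValueError(f"artifact at {path} is missing keys: {sorted(missing)}")
    return artifact
```

Failing at startup is cheaper than discovering a missing scaler on the first request. Note that joblib files are pickle-based, so only load artifacts from sources you control.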

Step 4: The FastAPI endpoint

python

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import pandas as pd
import numpy as np

app = FastAPI(title="ML Model API", version="1.0.0")

# Load once at startup, not on every request
artifact = joblib.load("model_artifact.joblib")
model = artifact["model"]
scaler = artifact["scaler"]
feature_columns = artifact["feature_columns"]

class PredictRequest(BaseModel):
    features: dict[str, float]

class PredictResponse(BaseModel):
    prediction: int
    probability: float
    model_version: str

@app.post("/predict", response_model=PredictResponse)
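def predict(request: PredictRequest):
    # Plain `def`: FastAPI runs it in a worker thread, so blocking inference
    # does not stall the event loop.
    try:
        # Align columns to the training order. Unknown keys are dropped and
        # missing features are filled with 0; this does not reject bad input.
        df = pd.DataFrame([request.features])
        df = df.reindex(columns=feature_columns, fill_value=0)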
        
        scaled = scaler.transform(df)
        prediction = int(model.predict(scaled)[0])
        probability = float(np.max(model.predict_proba(scaled)[0]))
        
        return PredictResponse(
            prediction=prediction,
            probability=probability,
            model_version=artifact["model_version"],
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Prediction failed: {str(e)}")

@app.get("/health")
async def health():
    return {"status": "ok", "model_version": artifact["model_version"]}

Key decisions in this endpoint:

Load the artifact once at startup. Loading on every request is a 10-100x latency hit.
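Align inputs. reindex with fill_value=0 puts features in training order and keeps the endpoint from crashing when a feature is missing. The trade-off is that a missing feature is silently treated as 0, so log or reject incomplete requests when 0 is not a meaningful value for that feature.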
Return the model version. Every prediction response should include which model generated it.
Health endpoint. Required for load balancers and uptime monitoring.
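For cases where filling missing features with 0 is too forgiving, a stricter check can run before the reindex step. This is a sketch under assumptions: `check_features` is a hypothetical helper, and what to do with its result (reject, log, or fill) is a per-project decision.

```python
def check_features(features: dict, expected_columns: list[str]) -> dict:
    """Report which expected features are absent and which keys are unknown."""
    provided = set(features)
    expected = set(expected_columns)
    return {
        "missing": sorted(expected - provided),
        "unexpected": sorted(provided - expected),
    }
```

Inside the endpoint, a non-empty `missing` list could be turned into an `HTTPException` with status 422, so the caller learns about the problem instead of receiving a prediction built on filler values.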

Step 5: Docker packaging

dockerfile

FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model_artifact.joblib .
COPY main.py .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

One decision to make: include the model artifact in the Docker image (simpler, larger image) or load it at startup from S3/GCS (more complex, enables model updates without rebuilding the image). For most clients, we start with artifact-in-image and move to remote loading when model iteration frequency justifies it.
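The two options can share one code path if the artifact location is resolved at startup. The sketch below is an assumption-heavy illustration: the environment variable `MODEL_ARTIFACT_URI`, the `/tmp` download location, and both helper names are hypothetical. The S3 branch uses boto3's `download_file` and expects AWS credentials to be available in the container.

```python
import os

def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs a bucket and a key: {uri}")
    return bucket, key

def resolve_artifact_path(local_default: str = "model_artifact.joblib") -> str:
    """Return a local file path for the artifact, downloading it first if needed."""
    uri = os.environ.get("MODEL_ARTIFACT_URI")  # hypothetical variable name
    if not uri:
        # Artifact-in-image: use the file baked into the container
        return local_default
    import boto3  # imported lazily so the in-image path has no AWS dependency
    bucket, key = parse_s3_uri(uri)
    local_path = os.path.join("/tmp", os.path.basename(key))
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path
```

With this in place, `joblib.load(resolve_artifact_path())` works for both setups, and switching to remote loading becomes a configuration change rather than a code change.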

Step 6: Monitoring in production

Two things you must monitor:

Latency. P95 and P99, not just average. A model that averages 50ms but spikes to 2s at P99 will cause production incidents.
Prediction distribution. Log the prediction class distribution over time. If a classifier that used to produce a 70/30 class split starts outputting 95/5, something has changed in the input data.

We use Prometheus + Grafana for metrics, with a middleware that logs prediction distributions to a database for drift detection.

python

import time
from fastapi import Request

@app.middleware("http")
async def add_timing_header(request: Request, call_next):
    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    response = await call_next(request)
    duration = (time.perf_counter() - start) * 1000  # milliseconds
    response.headers["X-Response-Time-Ms"] = str(round(duration, 2))
    return response
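To see why the average hides tail latency, here is a small offline sketch that summarizes a batch of logged durations. In production these percentiles would normally come from the metrics stack; `latency_summary` is a hypothetical helper for ad hoc analysis of logged values.

```python
import numpy as np

def latency_summary(durations_ms: list[float]) -> dict:
    """Summarize request durations (milliseconds) with mean and tail percentiles."""
    arr = np.asarray(durations_ms, dtype=float)
    return {
        "mean": float(arr.mean()),
        "p95": float(np.percentile(arr, 95)),
        "p99": float(np.percentile(arr, 99)),
    }
```

For 98 requests at 50 ms and 2 at 2000 ms, the mean and P95 both look acceptable while P99 exposes the two-second spike.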

Input distribution drift is the third thing to monitor, and the one most teams skip. Your model was trained on data with a particular distribution of feature values. If the production inputs start looking different — different range, different mean, different proportion of missing values — the model is extrapolating outside its training distribution and accuracy degrades silently.

We use Population Stability Index (PSI) to detect this. PSI compares the distribution of a feature in training data vs. recent production data. A PSI under 0.1 is stable; 0.1-0.25 is worth watching; above 0.25 means meaningful drift and you should investigate whether retraining is needed.
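For readers who want to see what PSI computes, here is a minimal hand-rolled sketch for a single numeric feature. It is an illustration rather than our production code: the quantile binning, the bin count, and the `eps` floor are assumptions, and `population_stability_index` is a hypothetical name. Drift libraries may choose different tests or binning by default, so their numbers can differ from this one.

```python
import numpy as np

def population_stability_index(reference, current, n_bins: int = 10,
                               eps: float = 1e-6) -> float:
    """PSI = sum over bins of (cur - ref) * ln(cur / ref), using bin fractions."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Edges from reference quantiles, so each bin holds roughly equal reference mass
    edges = np.unique(np.quantile(reference, np.linspace(0, 1, n_bins + 1)))
    n_actual_bins = max(len(edges) - 1, 1)
    # Interior edges only: values outside the reference range fall in the end bins
    interior = edges[1:-1]
    ref_counts = np.bincount(np.digitize(reference, interior), minlength=n_actual_bins)
    cur_counts = np.bincount(np.digitize(current, interior), minlength=n_actual_bins)
    # Floor the fractions so empty bins do not produce log(0) or division by zero
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))
```

Applied per feature, the result can be read against the 0.1 and 0.25 thresholds above.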

The simplest way to set this up is Evidently AI, which generates data drift reports with a few lines of code. Evidently has reorganized its import paths across releases, so pin the version you test against:

python

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

def check_input_drift(reference_df, production_df, feature_columns):
    column_mapping = ColumnMapping(numerical_features=feature_columns)
    report = Report(metrics=[DataDriftPreset()])
    report.run(
        reference_data=reference_df,
        current_data=production_df,
        column_mapping=column_mapping
    )
    results = report.as_dict()
    return results["metrics"][0]["result"]["dataset_drift"]

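# Run weekly, alert if drift detected.
# training_sample, last_7d_production_inputs, and send_slack_alert are placeholders
# for your own reference data, logged inputs, and alerting helper.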
if check_input_drift(training_sample, last_7d_production_inputs, feature_columns):
    send_slack_alert(":warning: Input drift detected — model retraining may be needed")

Run this weekly against a sample of production inputs compared to your training data. If drift is detected, the next step is understanding why: did a data source change format, did user behavior shift, did an upstream pipeline start sending different values? The answer determines whether you need retraining, feature engineering changes, or upstream fixes.

The retraining trigger we typically set: if PSI exceeds 0.25 on any feature, or if the production prediction distribution shifts more than 10 percentage points from the training distribution, we schedule a retraining run. The model gets retrained on data that includes the recent production inputs, re-evaluated against the same held-out test set, and deployed only if it matches or exceeds the baseline accuracy. This keeps the model accurate without requiring a developer to manually check performance on a schedule.
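The trigger described above reduces to a small decision function. This sketch assumes PSI values have already been computed per feature and that class shares are given as fractions between 0 and 1. `should_retrain` and its parameter names are hypothetical.

```python
def should_retrain(psi_by_feature: dict, train_class_share: dict,
                   prod_class_share: dict, psi_threshold: float = 0.25,
                   shift_threshold_pp: float = 10.0) -> dict:
    """Apply the two retraining rules: feature PSI and prediction-share shift."""
    # Rule 1: any feature whose PSI exceeds the threshold
    drifted = sorted(
        name for name, psi in psi_by_feature.items() if psi > psi_threshold
    )
    # Rule 2: largest change in any class share, in percentage points
    classes = set(train_class_share) | set(prod_class_share)
    max_shift_pp = max(
        (
            abs(prod_class_share.get(c, 0.0) - train_class_share.get(c, 0.0)) * 100
            for c in classes
        ),
        default=0.0,
    )
    return {
        "drifted_features": drifted,
        "max_shift_pp": max_shift_pp,
        "retrain": bool(drifted) or max_shift_pp > shift_threshold_pp,
    }
```

Returning the reasons alongside the decision makes the alert actionable, since the drifted feature names point to where the investigation should start.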

What this looks like end-to-end

At Creative Codes, a typical custom ML model deployment follows this path:

Data audit and feature agreement with client (1-2 days)
Training pipeline with tracked experiments (2-3 days)
Evaluation against agreed metrics (1 day)
FastAPI endpoint with health check and versioning (1 day)
Docker build + deployment to client's infrastructure (1 day)
Monitoring setup + handoff documentation (1 day)

Total: 7-10 days for a clean custom classifier or regression model, from first data review to a running production endpoint with monitoring in place.

If you need a custom ML model built, deployed, and monitored in production, tell us what you're trying to predict.

For a real production example of multi-modal ML extraction deployed at scale, see the DataVersion Document Intelligence case study: 500+ PDF formats processed, 99.2% field accuracy, full Docker-packaged extraction pipeline running in production.
