Weekend Project: I Built a Full MLOps Pipeline for a Credit Scoring Model (And You Can Too)

2026-04-03 12:30:07

MLOps

A hands-on, beginner-friendly guide to deploying, monitoring, and optimizing a Machine Learning model in production — from API creation to data drift detection, using the real Home Credit Default Risk dataset from Kaggle.


Last Saturday morning, I was scrolling through freelance gig postings on Fiverr when I stumbled upon something that caught my attention. A small fintech startup was looking for someone to "take our trained credit scoring model and make it production-ready — API, Docker, CI/CD, the whole shebang." The budget was decent, the deadline was two weeks, and I thought: "How hard can it be?"

They pointed me to the Home Credit Default Risk dataset from Kaggle — a real-world dataset with 300,000+ loan applications and 122 features. Home Credit is a company that provides loans to people with little or no credit history. The challenge: predict which applicants are likely to default.

Spoiler: the project was more involved than I expected. But by Sunday evening, I had a working end-to-end MLOps pipeline running, and I learned an incredible amount. This article is the tutorial I wish I had before starting.

Whether you're a data science student, a junior ML engineer, or someone curious about what happens after a model is trained, this guide will walk you through every step. I'll explain not just the how but the why behind every decision — because understanding the reasoning is what separates someone who follows a tutorial from someone who can adapt the knowledge to new situations.

What we'll build together:

  • A prediction API using FastAPI that serves a credit scoring model trained on real Kaggle data
  • Automated tests to make sure our API doesn't break
  • A Docker container to package everything for deployment
  • A CI/CD pipeline with GitHub Actions to automate testing and deployment
  • A data drift analysis to monitor model health over time
  • Performance optimizations to speed up inference

Let's dive in.


Table of Contents

  1. The Big Picture: What Is MLOps?
  2. Setting Up the Project
  3. Exploring and Preparing the Home Credit Dataset
  4. Training the Credit Scoring Model
  5. Creating a Prediction API with FastAPI
  6. Writing Automated Tests
  7. Containerizing with Docker
  8. Building a CI/CD Pipeline
  9. Logging Production Data
  10. Data Drift Detection
  11. Performance Optimization
  12. The Final Architecture
  13. Key Takeaways

1. The Big Picture: What Is MLOps?

Before we write a single line of code, let's understand the landscape we're operating in. This context is what makes the difference between blindly following steps and actually understanding what you're building.

The "Last Mile" Problem

Here's a reality check that most online courses don't tell you: training a model is only about 20% of the work in a real ML project. The remaining 80% is everything that happens around it — data pipelines, deployment, monitoring, maintenance.

Think about it: you've built a great model in a Jupyter notebook. It achieves 0.85 AUC. Everyone's happy. But then what? The model just sits in your notebook. How does the loan officer at the bank actually use it to make decisions? How do you know it's still accurate 6 months from now? What happens when it breaks at 3 AM on a Friday?

This is what MLOps (Machine Learning Operations) solves. It's the set of practices that takes a model from "it works on my laptop" to "it runs reliably in production, 24/7, and we know when something goes wrong." Think of it as DevOps (the practices software engineers use to deploy and maintain applications), but specifically adapted for the unique challenges of machine learning — where the code and the data can both change and cause failures.

*The MLOps lifecycle — training is just one piece of a much larger puzzle. (source: ml-ops.org)*

What We'll Cover and Why

| Pillar | What It Means | Why It Matters | Tool We'll Use |
|----|----|----|----|
| Model Serving | Making predictions available via an API | So other systems and people can actually use the model | FastAPI |
| Containerization | Packaging code + dependencies together | So the code runs identically everywhere, not just on your machine | Docker |
| CI/CD | Automating tests and deployment | So broken code never reaches production | GitHub Actions |
| Monitoring | Watching model behavior in production | So you know when the model starts making bad predictions | Evidently AI |
| Optimization | Making inference faster | So users don't wait 10 seconds for a prediction | cProfile, ONNX |
| Version Control | Tracking every change | So you can trace what changed, when, and why | Git + GitHub |

Without containerization, your code works on your machine but breaks on the server because of a different Python version. Without CI/CD, every deployment is a manual, error-prone process where someone eventually forgets to run the tests. Without monitoring, your model silently degrades for months before anyone notices.

Each piece we build solves a real, concrete problem. Let's start.


2. Setting Up the Project

Why project structure matters

Before we write any ML code, we need to organize our workspace. This might seem boring, but good project structure is the foundation of good MLOps. When someone new joins the project (or when future-you comes back in 6 months), they should immediately understand where everything lives.

Think of it like organizing a kitchen: if ingredients, utensils, and recipes are all mixed together in one drawer, cooking is a nightmare. If they're in labeled cabinets, anyone can find what they need.

Here's the structure we'll use:

credit-scoring-mlops/
│
├── app/ # API code lives here
│ ├── __init__.py # Makes 'app' a Python package
│ ├── main.py # FastAPI application
│ ├── model_loader.py # Model loading logic
│ └── schemas.py # Input/output validation rules
│
├── model/ # Trained model artifacts
│ └── credit_model.pkl
│
├── tests/ # All test code
│ ├── __init__.py
│ ├── test_api.py # Tests for the API
│ └── test_model.py # Tests for the model
│
├── notebooks/ # Analysis notebooks
│ └── data_drift_analysis.ipynb
│
├── monitoring/ # Logging and monitoring code
│ └── logger.py
│
├── .github/ # CI/CD pipeline configuration
│ └── workflows/
│ └── ci-cd.yml
│
├── Dockerfile # How to build the Docker container
├── requirements.txt # Python dependencies with versions
├── .gitignore # Files Git should ignore
└── README.md # Documentation for humans

The separation between app/ (API code), tests/ (test code), model/ (artifacts), and monitoring/ (observability code) follows a common convention in ML engineering. Each folder has one responsibility. If there's a bug in the API, you look in app/. If a test is failing, you look in tests/. Simple.

Step by step — create the folders

Creating the directory skeleton, so that every file we create later has a logical home.

mkdir credit-scoring-mlops && cd credit-scoring-mlops
mkdir -p app model tests notebooks monitoring .github/workflows

The -p flag tells mkdir to create parent directories as needed. Without it, mkdir .github/workflows would fail because .github doesn't exist yet.

Initialize Git

Turning this folder into a Git repository. Git tracks every change you make, creating a history you can go back to if something breaks. It's like having infinite "undo" for your entire project.

git init

This creates a hidden .git/ folder that stores all the tracking information. You'll never need to touch it directly.

Create the .gitignore

Telling Git which files to never track. Some files should never be in a repository:

  • Secrets (API keys, passwords) — if they end up in Git, they're in the history forever, even if you delete them later
  • Large data files (CSVs) — Git isn't designed for large binary files; it would make the repo slow to clone
  • Generated files (__pycache__/, .pyc) — these are recreated automatically and just add noise
cat > .gitignore << 'EOF'
# Python bytecode — generated automatically, no need to track
__pycache__/
*.py[cod]

# Virtual environments — each developer creates their own
.venv/
venv/

# Data files — too large for Git (use DVC or Git LFS for these)
*.csv
*.parquet
data/

# Secrets — NEVER commit these. Use environment variables instead.
.env
*.secret

# IDE configuration — specific to each developer's setup
.vscode/
.idea/

# OS-specific files — useless noise
.DS_Store
Thumbs.db

# Log files — generated at runtime
*.log
logs/
EOF

First commit

Creating our first "snapshot" of the project. Every commit is a checkpoint. If something goes wrong later, we can always come back to this clean state. The commit message should describe what this snapshot contains.

git add .gitignore
git commit -m "Initial commit: project structure and .gitignore"

A note on commit messages: Write them as if someone else will read them (they will — future you). "fix stuff" is useless. "feat: add input validation for age field" tells you exactly what changed. A common convention is to prefix with feat: (new feature), fix: (bug fix), docs: (documentation), or refactor: (code cleanup).


3. Exploring and Preparing the Home Credit Dataset

3.1 — About the dataset

Understanding what data we're working with before touching any code. You can't build a good model — or a good API — if you don't deeply understand your data. This step is often rushed but it's the most important.

The Home Credit Default Risk dataset comes from a real Kaggle competition. Home Credit provides loans to people who have little or no traditional credit history — the "unbanked" population. These are people who might be rejected by traditional banks simply because they don't have enough credit history, not because they're actually risky.

The main file, application_train.csv, contains 307,511 rows (one per loan application) and 122 columns (features about the applicant and the loan). The target column TARGET is binary:

  • 0 = the applicant repaid the loan successfully
  • 1 = the applicant had payment difficulties (default)

Download it from Kaggle and place it in a data/ folder:

# Option 1: Use the Kaggle CLI (you need a Kaggle account + API token)
# pip install kaggle
# kaggle competitions download -c home-credit-default-risk

# Option 2: Download manually from
# https://www.kaggle.com/c/home-credit-default-risk/data
# You only need application_train.csv for this tutorial

3.2 — Load and inspect the data

Loading the CSV file and getting a first overview. Before any analysis, you need to know the size, the types of columns, and the general "shape" of your data. This prevents surprises later.

import pandas as pd
import numpy as np

df = pd.read_csv('data/application_train.csv')
print(f"Shape: {df.shape}")
print(f"Columns: {df.shape[1]}")
print(f"Rows: {df.shape[0]:,}")

You should see: 307,511 rows, 122 columns. That's a substantial dataset — much bigger than the toy datasets you typically see in tutorials.

Checking the distribution of our target variable. This tells us how "balanced" the problem is. If 50% of applicants default, the model has an easy time distinguishing the two groups. If only 1% default, it's much harder — the model can just predict "no default" for everyone and be right 99% of the time while being useless.

print(f"\nTarget distribution:")
print(df['TARGET'].value_counts(normalize=True))

Important observation: The dataset is heavily imbalanced — about 92% repaid (0) and only 8% defaulted (1). This is completely typical in credit scoring: most people do repay their loans. Our model will need to be evaluated with metrics that account for this imbalance (like ROC AUC, not just accuracy).
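The trap with imbalanced data can be seen with toy numbers that mirror the Home Credit class balance (a minimal sketch, not part of the pipeline):

```python
import numpy as np

# A toy imbalanced target: 92% repaid (0), 8% defaulted (1)
y = np.array([0] * 92 + [1] * 8)

# A "model" that always predicts the majority class...
always_no_default = np.zeros_like(y)

# ...scores 92% accuracy while never catching a single defaulter
accuracy = (always_no_default == y).mean()
print(accuracy)  # 0.92
```

This is exactly why we'll lean on ROC AUC rather than accuracy when evaluating the model later.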

3.3 — Understanding the key features

Selecting which features to use from the 122 available. We can't (and shouldn't) use all 122 columns. Many are redundant, many have too many missing values, and our API needs to accept inputs — we don't want to ask users for 122 fields. We pick the most predictive and interpretable features based on domain knowledge and research.

Research on this dataset (including the Kaggle competition results) consistently shows these features matter most:

| Feature | What It Means | Why It Predicts Default |
|----|----|----|
| EXT_SOURCE_1/2/3 | Normalized scores from external credit bureaus (0 to 1) | These are the single strongest predictors. They summarize a person's entire credit reputation from other institutions. |
| DAYS_BIRTH | Client's age in days (stored as negative number) | Older applicants have more financial stability and default less. |
| DAYS_EMPLOYED | How long they've been at their current job (negative = employed) | Longer employment indicates stability — they're less likely to lose income suddenly. |
| AMT_INCOME_TOTAL | Total annual income | More income means more capacity to make loan payments. |
| AMT_CREDIT | Credit amount of the loan | Larger loans are riskier — more to repay. |
| AMT_ANNUITY | Monthly payment amount | Higher monthly payments create more financial strain. |
| AMT_GOODS_PRICE | Price of the goods being financed | Context for the loan — is the person buying something that costs $5K or $500K? |
| DAYS_ID_PUBLISH | Days since ID document was issued | A proxy for personal stability. People who change ID documents frequently may be less settled. |
| CODE_GENDER | Gender (M/F) | A demographic factor that has statistical correlations in this dataset. |
| FLAG_OWN_CAR | Owns a car? (Y/N) | Asset ownership indicates financial stability. |
| FLAG_OWN_REALTY | Owns real estate? (Y/N) | Same reasoning — owning a home suggests financial roots. |
| CNT_CHILDREN | Number of children | Family context — more dependents can mean more financial strain. |
| NAME_EDUCATION_TYPE | Education level | Higher education correlates with higher earning potential and financial literacy. |

3.4 — Select features

Extracting only the columns we need from the full 122-column dataframe. Working with 15 carefully chosen features is more manageable than 122, and for our API, each feature will become a field that users need to provide. 15 fields is reasonable; 122 is not.

selected_features = [
 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH',
 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
 'CNT_CHILDREN', 'NAME_EDUCATION_TYPE',
]

X = df[selected_features].copy()
y = df['TARGET'].copy()

We use .copy() to create an independent copy of the data. Without it, modifications to X would also modify the original df, which can cause hard-to-debug issues.
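A tiny toy frame shows why the copy matters — edits to the copy leave the original dataframe untouched (a sketch with made-up values, not the real dataset):

```python
import pandas as pd

df = pd.DataFrame({'AMT_CREDIT': [100_000, 200_000], 'TARGET': [0, 1]})

# .copy() gives an independent object: edits to X never touch df
X = df[['AMT_CREDIT']].copy()
X.loc[0, 'AMT_CREDIT'] = 0

print(df.loc[0, 'AMT_CREDIT'])  # still 100000
```

Without `.copy()`, pandas may hand you a view into `df`, and assignments can either mutate the original or trigger a `SettingWithCopyWarning` depending on the pandas version.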

3.5 — Fix the DAYS_EMPLOYED anomaly

Handling a known data quality issue. The DAYS_EMPLOYED column contains the value 365243 for unemployed people. That's roughly 1,000 years of employment — obviously a placeholder, not a real value. If we leave it, it would massively skew our model because the scaler would treat it as a real data point, compressing all actual employment durations into a tiny range.

# How many rows have this anomalous value?
print(f"Anomalous DAYS_EMPLOYED values: {(df['DAYS_EMPLOYED'] == 365243).sum():,}")

You'll see 55,374 rows — about 18% of the data. That's not a trivial amount. We replace it with NaN (Not a Number), which means "missing value":

X['DAYS_EMPLOYED'] = X['DAYS_EMPLOYED'].replace(365243, np.nan)

3.6 — Convert days to years

Transforming the DAYS columns from "negative days before application" into "positive years." Two reasons. First, readability: -14,585 days is hard for a human to interpret, but 39.9 years is immediately clear. Second, our API will accept age in years — it would be a terrible user experience to ask someone for their "age in negative days since application date."

# DAYS_BIRTH is negative: -14585 means the person was born 14585 days before the application
# Dividing by -365.25 converts to positive years (365.25 accounts for leap years)
X['AGE_YEARS'] = (-X['DAYS_BIRTH'] / 365.25).round(1)

# DAYS_EMPLOYED works the same way
X['YEARS_EMPLOYED'] = (-X['DAYS_EMPLOYED'] / 365.25).round(1)

# DAYS_ID_PUBLISH: years since the ID document was issued
X['YEARS_ID_PUBLISH'] = (-X['DAYS_ID_PUBLISH'] / 365.25).round(1)

Now we can drop the raw DAYS columns since we have the cleaner YEARS versions:

X = X.drop(columns=['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH'])

3.7 — Encode categorical variables

Converting text values into numbers. Machine learning models only work with numbers. They can't process the string "F" or "Higher education" directly. We need to convert these into a numeric representation.

For simple binary features (Yes/No, Male/Female), we use binary encoding — just 0 and 1:

X['CODE_GENDER'] = X['CODE_GENDER'].map({'M': 0, 'F': 1}).fillna(0).astype(int)
X['FLAG_OWN_CAR'] = X['FLAG_OWN_CAR'].map({'N': 0, 'Y': 1}).astype(int)
X['FLAG_OWN_REALTY'] = X['FLAG_OWN_REALTY'].map({'N': 0, 'Y': 1}).astype(int)

What does .fillna(0) do? There are a small number of rows where CODE_GENDER is "XNA" (unknown). The .map() function turns these into NaN (because "XNA" isn't in our mapping dict), and .fillna(0) replaces them with 0. We have to handle every case — production data is messy.
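The `.map()` + `.fillna()` behavior is easy to verify on a three-row toy Series:

```python
import pandas as pd

genders = pd.Series(['M', 'F', 'XNA'])  # 'XNA' = unknown, as in the real data

mapped = genders.map({'M': 0, 'F': 1})   # 'XNA' is not in the dict -> NaN
encoded = mapped.fillna(0).astype(int)   # NaN -> 0, then a safe int cast

print(encoded.tolist())  # [0, 1, 0]
```

Note that without `.fillna(0)`, the `.astype(int)` cast would fail on the NaN.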

For education, we use ordinal encoding. Because there's a natural order from lower to higher education. Ordinal encoding preserves this order (0 < 1 < 2 < 3 < 4), which gives the model useful information. One-hot encoding would treat "Lower secondary" and "Academic degree" as equally different from "Higher education," losing the ordering signal.

education_map = {
 'Lower secondary': 0,
 'Secondary / secondary special': 1,
 'Incomplete higher': 2,
 'Higher education': 3,
 'Academic degree': 4,
}
X['EDUCATION_LEVEL'] = X['NAME_EDUCATION_TYPE'].map(education_map).fillna(1).astype(int)
X = X.drop(columns=['NAME_EDUCATION_TYPE'])

3.8 — Handle missing values

Filling in gaps where data is missing. Most ML models can't handle missing values (NaN). If we feed them a row with a missing EXT_SOURCE_1, they'll either crash or produce garbage. We need to fill these gaps with reasonable substitute values.

Let's first see how much is missing:

print("Missing values before filling:")
missing = X.isnull().sum()
print(missing[missing > 0])

You'll see that EXT_SOURCE_1 has about 56% missing, EXT_SOURCE_3 about 20%, and YEARS_EMPLOYED about 18% (those are the unemployed people we set to NaN earlier).

Filling with the median of each column. The median is robust to outliers. If most people earn $50K but one person earns $50M, the mean income would be pulled way up by that one outlier. The median stays at $50K, which is a much more representative "typical" value. For the same reason, median is the standard choice for imputation in financial data.

X = X.fillna(X.median())

print("\nMissing values after filling:", X.isnull().sum().sum()) # Should be 0

3.9 — Feature engineering

Creating new features by combining existing ones. Raw features tell part of the story, but ratios often tell a much richer story. A $400,000 loan means something very different to someone earning $40,000/year (that's 10 years of income!) versus someone earning $400,000/year (just 1 year). The absolute credit amount is the same, but the relative burden is completely different — and it's the relative burden that actually predicts default.

# Credit-to-income ratio: how many years of income does the loan represent?
# A ratio of 5 means the loan is 5x the annual income — that's a heavy burden
X['CREDIT_INCOME_RATIO'] = X['AMT_CREDIT'] / (X['AMT_INCOME_TOTAL'] + 1)

We add +1 to the denominator to avoid division by zero. It's a tiny number compared to actual incomes (tens of thousands), so it doesn't affect the result meaningfully.

# Annuity-to-income ratio: what fraction of income goes to monthly payments?
# A ratio of 0.3 means 30% of income goes to loan payments — that's stressful
X['ANNUITY_INCOME_RATIO'] = X['AMT_ANNUITY'] / (X['AMT_INCOME_TOTAL'] + 1)

# Credit-to-goods ratio: how much of the goods price is financed?
# A ratio close to 1 means the person is financing nearly 100% of the purchase
# (no down payment), which is a risk signal
X['CREDIT_GOODS_RATIO'] = X['AMT_CREDIT'] / (X['AMT_GOODS_PRICE'] + 1)

3.10 — Check the final feature set

Verifying we have a clean, complete dataset. This is a sanity check before training. If something is wrong here (wrong column count, remaining NaNs, unexpected types), it's much easier to fix now than after the model is trained.

print(f"Final feature set: {X.shape[1]} features, {X.shape[0]:,} rows")
print(f"Features: {list(X.columns)}")
print(f"Any remaining NaN: {X.isnull().any().any()}")

Our final 18 features:

EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3,
AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, AMT_GOODS_PRICE,
CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN_REALTY, CNT_CHILDREN,
AGE_YEARS, YEARS_EMPLOYED, YEARS_ID_PUBLISH, EDUCATION_LEVEL,
CREDIT_INCOME_RATIO, ANNUITY_INCOME_RATIO, CREDIT_GOODS_RATIO

18 features: manageable, interpretable, and each one has a clear business meaning. That's important — when a loan officer asks "why did the model reject this person?", you can point to specific features and explain.

Click to expand: prepare_data.py (complete copy-paste ready code)

# prepare_data.py

import pandas as pd
import numpy as np

def load_and_prepare_data(filepath='data/application_train.csv'):
 df = pd.read_csv(filepath)

 selected_features = [
 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH',
 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
 'CNT_CHILDREN', 'NAME_EDUCATION_TYPE',
 ]

 X = df[selected_features].copy()
 y = df['TARGET'].copy()

 X['DAYS_EMPLOYED'] = X['DAYS_EMPLOYED'].replace(365243, np.nan)

 X['AGE_YEARS'] = (-X['DAYS_BIRTH'] / 365.25).round(1)
 X['YEARS_EMPLOYED'] = (-X['DAYS_EMPLOYED'] / 365.25).round(1)
 X['YEARS_ID_PUBLISH'] = (-X['DAYS_ID_PUBLISH'] / 365.25).round(1)
 X = X.drop(columns=['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH'])

 X['CODE_GENDER'] = X['CODE_GENDER'].map({'M': 0, 'F': 1}).fillna(0).astype(int)
 X['FLAG_OWN_CAR'] = X['FLAG_OWN_CAR'].map({'N': 0, 'Y': 1}).astype(int)
 X['FLAG_OWN_REALTY'] = X['FLAG_OWN_REALTY'].map({'N': 0, 'Y': 1}).astype(int)

 education_map = {
 'Lower secondary': 0, 'Secondary / secondary special': 1,
 'Incomplete higher': 2, 'Higher education': 3, 'Academic degree': 4,
 }
 X['EDUCATION_LEVEL'] = X['NAME_EDUCATION_TYPE'].map(education_map).fillna(1).astype(int)
 X = X.drop(columns=['NAME_EDUCATION_TYPE'])

 X = X.fillna(X.median())

 X['CREDIT_INCOME_RATIO'] = X['AMT_CREDIT'] / (X['AMT_INCOME_TOTAL'] + 1)
 X['ANNUITY_INCOME_RATIO'] = X['AMT_ANNUITY'] / (X['AMT_INCOME_TOTAL'] + 1)
 X['CREDIT_GOODS_RATIO'] = X['AMT_CREDIT'] / (X['AMT_GOODS_PRICE'] + 1)

 return X, y

if __name__ == '__main__':
 X, y = load_and_prepare_data()
 print(f"Features: {X.shape}, Target: {y.shape}")
 print(f"Default rate: {y.mean():.2%}")


4. Training the Credit Scoring Model

4.1 — Train/test split

Splitting our data into two parts — one for training, one for testing. We need to evaluate our model on data it has never seen during training. If we test on the same data we trained on, the model could just memorize the answers (this is called overfitting), and we'd have no idea how well it actually generalizes to new loan applications.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
 X, y, test_size=0.2, random_state=42, stratify=y
)

Let's unpack each argument:

  • test_size=0.2 — 20% for testing, 80% for training. This is a standard split. More training data = better model, but we need enough test data for reliable evaluation.
  • random_state=42 — This fixes the random seed so the split is the same every time you run the code. Reproducibility matters in ML — you need to be able to get the same results.
  • stratify=y — This is crucial for imbalanced datasets. Without it, you might end up with 10% defaults in training but only 5% in testing (random variation). stratify=y ensures both sets have the exact same proportion of defaults (~8%).

print(f"Training set: {X_train.shape[0]:,} rows")
print(f"Test set: {X_test.shape[0]:,} rows")
print(f"Train default rate: {y_train.mean():.2%}")
print(f"Test default rate: {y_test.mean():.2%}")

Both rates should be approximately 8.07% — confirming that stratification worked.
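You can see the effect of stratification on a small synthetic target (a sketch with 1,000 toy rows, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 toy applicants with an 8% default rate
y = np.array([0] * 920 + [1] * 80)
X = np.arange(len(y)).reshape(-1, 1)

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(y_train.mean(), y_test.mean())  # both exactly 0.08
```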

4.2 — Build a scikit-learn Pipeline

Creating a Pipeline that bundles preprocessing and the model into a single object. This is one of the most important design decisions we'll make, and it has huge implications for deployment.

Without a Pipeline, deployment looks like this:

  1. Load the scaler
  2. Transform the data with the scaler
  3. Load the model
  4. Feed the transformed data to the model

With a Pipeline, deployment looks like this:

  1. Load the pipeline
  2. Feed raw data to it

The Pipeline handles everything internally. This means fewer files to manage, fewer things that can go wrong, and — critically — the scaler and model are guaranteed to be in sync. If you accidentally use a scaler from experiment #3 with a model from experiment #7, you'll get garbage predictions but no error message. A Pipeline prevents this.

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
 ('scaler', StandardScaler()),
 ('classifier', GradientBoostingClassifier(
 n_estimators=200,
 max_depth=4,
 learning_rate=0.1,
 subsample=0.8,
 random_state=42,
 ))
])

What does StandardScaler do? It transforms each feature to have a mean of 0 and a standard deviation of 1. Without scaling, a feature like AMT_CREDIT (values in the hundreds of thousands) would dominate a feature like EXT_SOURCE_1 (values between 0 and 1) simply because its numbers are bigger. Scaling puts all features on equal footing.
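A quick sketch with two made-up features on very different scales confirms what the scaler produces:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales, like AMT_CREDIT vs EXT_SOURCE_1
X = np.array([[100_000.0, 0.2],
              [300_000.0, 0.5],
              [500_000.0, 0.8]])

X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # [1. 1.]
```

Inside the Pipeline, the scaler is fit on the training data only, and the same learned mean/std are reused at prediction time.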

Gradient boosting builds many small decision trees sequentially, where each tree tries to correct the mistakes of the previous ones. It's consistently one of the best algorithms for structured/tabular data (like our spreadsheet-style credit data). In the Kaggle competition for this dataset, the top solutions all used gradient boosting variants (LightGBM, XGBoost).

What do the hyperparameters mean?

  • n_estimators=200 — Build 200 trees. More trees generally means better performance, but with diminishing returns and slower training.
  • max_depth=4 — Each tree is at most 4 levels deep. Deeper trees can capture more complex patterns but are more likely to overfit.
  • learning_rate=0.1 — How much each tree contributes. Lower values need more trees but are more robust.
  • subsample=0.8 — Each tree is trained on a random 80% of the data. This randomness reduces overfitting.

4.3 — Train the model

Fitting the pipeline to our training data. This is where the actual learning happens. The scaler calculates the mean and standard deviation of each feature. The classifier builds 200 decision trees, each trying to better predict defaults.

pipeline.fit(X_train, y_train)

That single line does all the work. Depending on your machine, this takes 1-5 minutes on 246,000 training rows with 18 features and 200 trees.

4.4 — Evaluate

Measuring how well the model performs on data it has never seen. The training score tells you how well the model memorized the training data — it's always optimistic. The test score tells you how well it will actually perform in production.

from sklearn.metrics import classification_report, roc_auc_score

y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred))
print(f"ROC AUC Score: {roc_auc_score(y_test, y_proba):.4f}")

Why ROC AUC instead of plain accuracy? Because of the imbalance. A model that always predicts "no default" would have 92% accuracy — sounds great, right? But it would be completely useless because it never identifies actual defaulters. ROC AUC measures how well the model ranks defaulters above non-defaulters, regardless of the threshold. An AUC of 0.5 means random guessing; 1.0 means perfect separation. With our features, you should get approximately 0.74-0.76, which is in line with Kaggle competition results using only application_train.csv.
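The two ends of the AUC scale are easy to demonstrate on a toy target (a sketch, not the real model's scores):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0] * 92 + [1] * 8)

# Constant scores carry no ranking information: AUC collapses to 0.5...
flat_scores = np.zeros(100)
print(roc_auc_score(y_true, flat_scores))  # 0.5

# ...while scores ranking every defaulter above every non-defaulter hit 1.0
perfect_scores = y_true.astype(float)
print(roc_auc_score(y_true, perfect_scores))  # 1.0
```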

4.5 — Save the model and reference data

Writing the trained pipeline and the training data to disk. Two reasons. (1) The model needs to be loaded by the API to serve predictions. (2) We save the training data as "reference data" — we'll compare future production data against it to detect drift.

import joblib
import os

os.makedirs('model', exist_ok=True)
os.makedirs('data', exist_ok=True)

# Save the trained pipeline (scaler + model together)
joblib.dump(pipeline, 'model/credit_model.pkl')

# Save reference data for drift detection
X_train.to_csv('data/reference_data.csv', index=False)
X_test.to_csv('data/test_data.csv', index=False)

# Save the feature column names — the API needs to know the exact order
joblib.dump(list(X_train.columns), 'model/feature_columns.pkl')

print("Model saved to model/credit_model.pkl")
print(f"Reference data: {X_train.shape[0]:,} rows, {X_train.shape[1]} features")

The model expects features in a specific order (the same order it was trained on). If the API passes features in a different order, predictions would be wrong without any error message. By saving the column list, we guarantee the API always sends features in the right order.
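The reordering trick the API will use can be sketched in a few lines (the feature names here are hardcoded for illustration; in the real API they come from `model/feature_columns.pkl`):

```python
import pandas as pd

# Stands in for the list loaded from model/feature_columns.pkl
feature_columns = ['AGE_YEARS', 'AMT_CREDIT']

# A client might send fields in any order
payload = {'AMT_CREDIT': 250_000, 'AGE_YEARS': 39.9}

# Building the frame, then indexing by the saved list, restores training order
row = pd.DataFrame([payload])[feature_columns]

print(list(row.columns))  # ['AGE_YEARS', 'AMT_CREDIT']
```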

git add train_model.py prepare_data.py
git commit -m "feat: train credit scoring model on Home Credit dataset"

5. Creating a Prediction API with FastAPI

Now we get to the core of the MLOps work. We have a trained model in a .pkl file. The goal: make it usable by anyone, anywhere, via a simple HTTP request.

5.1 — What is an API and why do we need one?

An API (Application Programming Interface) is an intermediary that sits between clients (users, other applications, mobile apps) and your model. The client sends a request ("here's a loan applicant's data"), the API passes it to the model, and returns the prediction ("12% probability of default, Low risk").

Without an API, every person who wants a prediction would need to:

  1. Install Python
  2. Install the exact same versions of scikit-learn, pandas, numpy
  3. Download the model file
  4. Write Python code to load it and pass data in the correct format

That's clearly not scalable. With an API, they just send an HTTP request (which any programming language can do) and get back a JSON response. A web developer, a mobile app, or even a spreadsheet macro can call your API.

*A REST API acts as an intermediary between clients and your model. (source: SmartBear)*

There are many Python web frameworks (Flask, Django, Tornado). We choose FastAPI because:

  • It auto-generates interactive documentation (Swagger UI) — great for testing
  • It uses Python type hints for automatic input validation (no manual if/else chains)
  • It's one of the fastest Python frameworks
  • It's become the standard choice in ML engineering

(FastAPI documentation)

5.2 — The critical rule: load your model ONCE

Load the model into memory once when the server starts, then reuse it for every request. This is the single most important performance decision in an ML API.

Here's what happens if you load the model on every request:

# BAD — loads from disk on EVERY request
@app.post("/predict")
async def predict(data):
    model = joblib.load("model/credit_model.pkl")  # Disk I/O every time!
    return model.predict(data)

If your model file is 50MB and you get 100 requests/second:

  • You're reading 50MB from disk 100 times per second (5 GB/s of I/O!)
  • Each request takes extra milliseconds or seconds just for loading
  • Memory usage spikes because 100 copies of the model exist simultaneously
  • The server eventually crashes under load

The fix is simple — load once, reuse forever:

# GOOD — load once at startup, reuse for all requests
model = None

def load_model():
    global model
    model = joblib.load("model/credit_model.pkl")  # Once, at startup

@app.post("/predict")
async def predict(data):
    return model.predict(data)  # Uses the already-loaded model in memory

Let's build the proper version. Create app/model_loader.py:

import joblib
import os

We store the model and feature columns as module-level globals. They start as None and get populated at startup:

_model = None
_feature_columns = None

The loading function:

def load_model():
    """Load model and feature list ONCE at startup."""
    global _model, _feature_columns

    model_path = os.environ.get("MODEL_PATH", "model/credit_model.pkl")
    features_path = os.environ.get("FEATURES_PATH", "model/feature_columns.pkl")

    if not os.path.exists(model_path):
        raise FileNotFoundError(f"Model not found at {model_path}")

    _model = joblib.load(model_path)
    _feature_columns = joblib.load(features_path)
    print(f"Model loaded from {model_path} ({len(_feature_columns)} features)")
    return _model

This checks for an environment variable first, and falls back to a default. On your laptop, the default works fine. Inside a Docker container, the model might be at a different path — you can configure this without changing code by setting the environment variable.
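A minimal sketch of that override behavior (the /opt/models/v2.pkl path is made up for illustration):

```python
import os

# No override set → fall back to the default path used on a laptop
os.environ.pop("MODEL_PATH", None)
print(os.environ.get("MODEL_PATH", "model/credit_model.pkl"))
# model/credit_model.pkl

# In a container you might set it at runtime, e.g.
#   docker run -e MODEL_PATH=/opt/models/v2.pkl ...
os.environ["MODEL_PATH"] = "/opt/models/v2.pkl"
print(os.environ.get("MODEL_PATH", "model/credit_model.pkl"))
# /opt/models/v2.pkl
```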

Retrieval functions:

def get_model():
    if _model is None:
        raise RuntimeError("Model not loaded! Call load_model() first.")
    return _model

def get_feature_columns():
    if _feature_columns is None:
        raise RuntimeError("Features not loaded!")
    return _feature_columns

5.3 — Input validation with Pydantic

Defining strict rules for what the API accepts as input. In a controlled Jupyter notebook, you know exactly what the data looks like. In production, you have zero control. Someone might send:

  • Text where a number is expected ("age": "forty")
  • Negative values where only positive makes sense ("income": -5000)
  • Missing required fields
  • Values that are technically valid but make no business sense ("age": 200)

Pydantic lets us define a schema — a set of rules — for our input data. FastAPI uses this schema to automatically validate every incoming request. If the data doesn't match, the API returns a detailed error message explaining exactly what's wrong. No manual if/else chains needed.

Create app/schemas.py:

from pydantic import BaseModel, Field, field_validator

Now define each field with its constraints. The Field(...) function is where the magic happens:

class LoanApplication(BaseModel):
    # External credit bureau scores
    # These are normalized between 0 and 1 by the credit bureaus
    ext_source_1: float = Field(
        ...,     # ... means "this field is required"
        ge=0.0,  # ge = "greater than or equal to"
        le=1.0,  # le = "less than or equal to"
        description="Normalized score from external data source 1"
    )
    ext_source_2: float = Field(
        ..., ge=0.0, le=1.0,
        description="Normalized score from external data source 2"
    )
    ext_source_3: float = Field(
        ..., ge=0.0, le=1.0,
        description="Normalized score from external data source 3"
    )

What happens if someone sends ext_source_1: 5.0? FastAPI catches it and returns:

{"detail": [{"msg": "Input should be less than or equal to 1"}]}

with HTTP status 422 (Unprocessable Entity). No code needed on your part.
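You can see the same behavior outside FastAPI. A minimal sketch using Pydantic directly, with a one-field stand-in for the full schema (the class name ScoreOnly is invented for this example):

```python
from pydantic import BaseModel, Field, ValidationError

# A stripped-down stand-in for LoanApplication
class ScoreOnly(BaseModel):
    ext_source_1: float = Field(..., ge=0.0, le=1.0)

try:
    ScoreOnly(ext_source_1=5.0)  # out of range, rejected before any model code runs
except ValidationError as exc:
    print(exc.errors()[0]["msg"])
```

FastAPI simply catches this ValidationError for you and turns it into the 422 response.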

Financial fields:

    amt_income_total: float = Field(
        ..., gt=0,  # gt = "greater than" (strictly positive — zero income is not valid)
        description="Total annual income"
    )
    amt_credit: float = Field(..., gt=0, description="Credit amount of the loan")
    amt_annuity: float = Field(..., gt=0, description="Loan annuity (monthly payment)")
    amt_goods_price: float = Field(..., gt=0, description="Price of the goods being financed")

Personal information:

    code_gender: int = Field(..., ge=0, le=1, description="Gender (0=Male, 1=Female)")
    flag_own_car: int = Field(..., ge=0, le=1, description="Owns a car? (0=No, 1=Yes)")
    flag_own_realty: int = Field(..., ge=0, le=1, description="Owns real estate? (0=No, 1=Yes)")
    cnt_children: int = Field(..., ge=0, le=20, description="Number of children")

Derived features:

    age_years: float = Field(
        ..., ge=18, le=80,
        description="Applicant's age in years"
    )
    years_employed: float = Field(..., ge=0, le=50, description="Years of employment")
    years_id_publish: float = Field(..., ge=0, le=60, description="Years since ID was published")
    education_level: int = Field(
        ..., ge=0, le=4,
        description="0=Lower secondary, 1=Secondary, 2=Incomplete higher, 3=Higher, 4=Academic"
    )

We can also add custom business logic validation. This goes beyond simple range checks:

    @field_validator('amt_credit')
    @classmethod
    def credit_must_be_reasonable(cls, v, info):
        """A credit-to-income ratio above 100 is unrealistic."""
        if 'amt_income_total' in info.data:
            ratio = v / (info.data['amt_income_total'] + 1)
            if ratio > 100:
                raise ValueError(
                    f"Credit-to-income ratio ({ratio:.0f}x) seems unrealistic"
                )
        return v

Defining the response schema. By defining what the API returns, FastAPI auto-generates documentation and ensures our responses are always consistent. Consumers of our API know exactly what to expect.

class PredictionResponse(BaseModel):
    prediction: int = Field(description="0=No Default, 1=Default")
    probability_of_default: float = Field(description="Probability from 0.0 to 1.0")
    risk_category: str = Field(description="Low, Medium, or High")

class HealthResponse(BaseModel):
    status: str
    model_loaded: bool

5.4 — Building the FastAPI application

Wiring together the model loader, the validation schemas, and the HTTP endpoints. This is the actual application that will run in production, receiving requests and returning predictions.

Create app/main.py:

from fastapi import FastAPI, HTTPException
from contextlib import asynccontextmanager
import pandas as pd
import numpy as np
import time
import logging
import json
from datetime import datetime, timezone

from app.schemas import LoanApplication, PredictionResponse, HealthResponse
from app.model_loader import load_model, get_model, get_feature_columns

Setting up structured logging. Every prediction the API makes should be recorded. These logs are the foundation of monitoring — they let us detect drift, track performance, and debug issues. We use Python's built-in logging module.

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("credit_scoring_api")

Defining the application lifespan (startup and shutdown). We need the model to be loaded before the first request arrives. The lifespan function runs code at startup (before yield) and shutdown (after yield).

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Load model at startup, clean up at shutdown."""
    logger.info("Starting up — loading model...")
    load_model()
    logger.info("Ready to serve predictions.")
    yield  # The app runs while we're "inside" the yield
    logger.info("Shutting down.")

Creating the FastAPI app instance. This is the central object that routes incoming HTTP requests to the right handler functions.

app = FastAPI(
    title="Home Credit Scoring API",
    description="Predict loan default probability using the Home Credit dataset",
    version="1.0.0",
    lifespan=lifespan,
)

Adding a health check endpoint. Every production API needs one. Load balancers use it to know if the service is alive. Monitoring tools use it to track uptime. The CI/CD pipeline uses it to verify that a deployment succeeded. It's a simple "are you there?" ping.

@app.get("/health", response_model=HealthResponse)
async def health_check():
    try:
        model = get_model()
        return HealthResponse(status="healthy", model_loaded=True)
    except RuntimeError:
        return HealthResponse(status="unhealthy", model_loaded=False)

Building the prediction endpoint — the heart of the API. This is what clients will actually call to get predictions. It receives a loan application, validates it (Pydantic does this automatically), computes the feature ratios, runs the model, logs everything, and returns the result.

@app.post("/predict", response_model=PredictionResponse)
async def predict(application: LoanApplication):
    start_time = time.time()

    try:
        model = get_model()
        feature_columns = get_feature_columns()

Converting the validated input into the exact format the model expects. The model was trained on a DataFrame with specific column names and a specific column order. We need to recreate that exactly. Note that we compute the engineered features here — the API receives raw values and derives the ratios, so users don't have to compute them.

        features = {
            'EXT_SOURCE_1': application.ext_source_1,
            'EXT_SOURCE_2': application.ext_source_2,
            'EXT_SOURCE_3': application.ext_source_3,
            'AMT_INCOME_TOTAL': application.amt_income_total,
            'AMT_CREDIT': application.amt_credit,
            'AMT_ANNUITY': application.amt_annuity,
            'AMT_GOODS_PRICE': application.amt_goods_price,
            'CODE_GENDER': application.code_gender,
            'FLAG_OWN_CAR': application.flag_own_car,
            'FLAG_OWN_REALTY': application.flag_own_realty,
            'CNT_CHILDREN': application.cnt_children,
            'AGE_YEARS': application.age_years,
            'YEARS_EMPLOYED': application.years_employed,
            'YEARS_ID_PUBLISH': application.years_id_publish,
            'EDUCATION_LEVEL': application.education_level,
            # Engineered features — computed server-side
            'CREDIT_INCOME_RATIO': application.amt_credit / (application.amt_income_total + 1),
            'ANNUITY_INCOME_RATIO': application.amt_annuity / (application.amt_income_total + 1),
            'CREDIT_GOODS_RATIO': application.amt_credit / (application.amt_goods_price + 1),
        }

        # Create DataFrame with columns in the exact training order
        input_data = pd.DataFrame([features])[feature_columns]

Getting the prediction and probability from the model. We return both because they serve different purposes. The binary prediction (0/1) is a simple decision. The probability (0.0-1.0) is more useful in practice — it lets the business set their own risk threshold. A conservative bank might reject anyone above 10% probability; an aggressive lender might accept up to 30%.

        prediction = int(model.predict(input_data)[0])
        probability = float(model.predict_proba(input_data)[0][1])

Mapping the probability to a human-readable risk category. Non-technical stakeholders don't want to interpret "probability 0.18." They want to see "Low risk" or "High risk." This translation makes the API output usable by business people, not just data scientists.

        if probability < 0.3:
            risk_category = "Low"
        elif probability < 0.6:
            risk_category = "Medium"
        else:
            risk_category = "High"

Logging every prediction with full context. These logs are the lifeblood of production monitoring. They will be used for:

  • Data drift detection: comparing production inputs over time against training data
  • Performance monitoring: tracking how inference time evolves
  • Debugging: when a prediction seems wrong, logs let us reproduce exactly what happened
  • Auditing: in regulated industries like finance, you need a record of every automated decision

        inference_time_ms = (time.time() - start_time) * 1000

        log_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": "prediction",
            "inputs": application.model_dump(),
            "outputs": {
                "prediction": prediction,
                "probability_of_default": round(probability, 4),
                "risk_category": risk_category,
            },
            "inference_time_ms": round(inference_time_ms, 2),
        }
        logger.info(json.dumps(log_entry))

Why structured JSON instead of plain text? Because it's machine-parseable. Monitoring tools (ELK Stack, Datadog, CloudWatch) can automatically index every field in the JSON, making it searchable and aggregatable. A plain text log like "Prediction: 0 for client age 40" can't be automatically parsed.
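Even without a monitoring platform, a few lines of standard-library Python are enough to aggregate these logs. The two log lines below are illustrative, not real API output:

```python
import json

# Two example log lines, in the same shape the API emits
log_lines = [
    '{"event": "prediction", "inference_time_ms": 4.2, "outputs": {"risk_category": "Low"}}',
    '{"event": "prediction", "inference_time_ms": 6.8, "outputs": {"risk_category": "High"}}',
]

records = [json.loads(line) for line in log_lines]
avg_ms = sum(r["inference_time_ms"] for r in records) / len(records)
high_risk = sum(r["outputs"]["risk_category"] == "High" for r in records)
print(f"avg latency: {avg_ms:.1f} ms, high-risk predictions: {high_risk}")
# avg latency: 5.5 ms, high-risk predictions: 1
```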

Handling errors gracefully. In production, unexpected things happen — corrupted input data, a model that fails on certain edge cases, memory issues. Without error handling, the API would crash with an ugly Python traceback. With it, the client gets a clear error message and the error is logged for debugging.

        return PredictionResponse(
            prediction=prediction,
            probability_of_default=round(probability, 4),
            risk_category=risk_category,
        )
    except Exception as e:
        logger.error(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": "prediction_error",
            "error": str(e),
            "inputs": application.model_dump(),
        }))
        raise HTTPException(status_code=500, detail=f"Prediction failed: {str(e)}")

5.5 — Test it locally

pip install fastapi uvicorn scikit-learn joblib pandas pydantic numpy
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Open http://localhost:8000/docs — FastAPI automatically generates an interactive Swagger UI where you can test every endpoint directly in your browser. No Postman or curl needed (though those work too).

Test with a realistic Home Credit applicant:

curl -X POST "http://localhost:8000/predict" \
 -H "Content-Type: application/json" \
 -d '{
 "ext_source_1": 0.5,
 "ext_source_2": 0.65,
 "ext_source_3": 0.48,
 "amt_income_total": 202500,
 "amt_credit": 406597,
 "amt_annuity": 24700,
 "amt_goods_price": 351000,
 "code_gender": 1,
 "flag_own_car": 0,
 "flag_own_realty": 1,
 "cnt_children": 0,
 "age_years": 39.9,
 "years_employed": 5.3,
 "years_id_publish": 8.5,
 "education_level": 1
 }'
git add app/
git commit -m "feat: implement FastAPI prediction API with validation and logging"

6. Writing Automated Tests

Why tests matter

Writing code that verifies our API works correctly. Imagine you change one line of code that accidentally breaks the input validation. Without tests, this bug goes straight to production and the API starts silently returning wrong predictions — or crashing. With automated tests integrated into CI/CD, the bug gets caught in seconds, before it ever reaches a user.

Tests are your safety net. They give you confidence to make changes — because if you break something, you'll know immediately.

6.1 — Setting up the test client

Creating a test client that can simulate HTTP requests without starting a real server. Starting a real server for tests would be slow and complex. FastAPI's TestClient lets us test everything in-memory, in milliseconds.

# tests/test_api.py

import pytest
from fastapi.testclient import TestClient
from app.main import app

client = TestClient(app)

6.2 — A realistic test payload

Defining a valid loan application that we'll reuse across many tests. This avoids repeating the same 15 fields in every test. It represents a typical Home Credit applicant.

VALID_APPLICANT = {
    "ext_source_1": 0.5,
    "ext_source_2": 0.65,
    "ext_source_3": 0.48,
    "amt_income_total": 202500,
    "amt_credit": 406597,
    "amt_annuity": 24700,
    "amt_goods_price": 351000,
    "code_gender": 1,
    "flag_own_car": 0,
    "flag_own_realty": 1,
    "cnt_children": 0,
    "age_years": 39.9,
    "years_employed": 5.3,
    "years_id_publish": 8.5,
    "education_level": 1,
}

6.3 — Health check tests

Verifying the most basic functionality. If the health check fails, nothing else will work. This is the first thing to test.

def test_health_returns_200():
    """The health endpoint should always return HTTP 200 if the server is running."""
    response = client.get("/health")
    assert response.status_code == 200

def test_health_reports_model_loaded():
    """After startup, the model should be loaded and ready."""
    response = client.get("/health")
    data = response.json()
    assert data["status"] == "healthy"
    assert data["model_loaded"] is True

What's assert? It means "this must be true, or the test fails." If response.status_code is 500 instead of 200, pytest will report this test as failed and show you exactly what the value was.

6.4 — Valid prediction tests

Testing that the API returns correct, well-formatted predictions for valid input. This is the "happy path" — verifying that the core functionality works as expected.

def test_valid_prediction_returns_200():
    """A well-formed request should return HTTP 200 (OK)."""
    response = client.post("/predict", json=VALID_APPLICANT)
    assert response.status_code == 200

def test_response_has_all_fields():
    """The response must include prediction, probability, and risk category."""
    response = client.post("/predict", json=VALID_APPLICANT)
    data = response.json()
    assert "prediction" in data
    assert "probability_of_default" in data
    assert "risk_category" in data

def test_prediction_is_binary():
    """Prediction should be exactly 0 or 1, nothing else."""
    response = client.post("/predict", json=VALID_APPLICANT)
    assert response.json()["prediction"] in [0, 1]

def test_probability_in_valid_range():
    """Default probability must be between 0.0 and 1.0."""
    response = client.post("/predict", json=VALID_APPLICANT)
    prob = response.json()["probability_of_default"]
    assert 0.0 <= prob <= 1.0

def test_risk_category_is_valid():
    """Risk category must be one of the three defined levels."""
    response = client.post("/predict", json=VALID_APPLICANT)
    assert response.json()["risk_category"] in ["Low", "Medium", "High"]

6.5 — Invalid input tests

Testing that the API properly rejects bad data. This is just as important as testing valid inputs. Our API receives data from the outside world — people will send wrong types, out-of-range values, and missing fields. Every one of these should return a clear 422 error, not a crash or a garbage prediction.

def test_ext_source_above_1_rejected():
    """External scores are normalized 0-1. Values above 1 are invalid."""
    bad = {**VALID_APPLICANT, "ext_source_1": 5.0}
    response = client.post("/predict", json=bad)
    assert response.status_code == 422

def test_negative_income_rejected():
    """Income must be strictly positive. Zero or negative is not valid."""
    bad = {**VALID_APPLICANT, "amt_income_total": -1000}
    response = client.post("/predict", json=bad)
    assert response.status_code == 422

def test_missing_required_field_rejected():
    """If a required field is omitted, the request must fail."""
    incomplete = {k: v for k, v in VALID_APPLICANT.items() if k != "ext_source_2"}
    response = client.post("/predict", json=incomplete)
    assert response.status_code == 422

def test_wrong_type_rejected():
    """Sending a string where a number is expected must fail."""
    bad = {**VALID_APPLICANT, "age_years": "forty"}
    response = client.post("/predict", json=bad)
    assert response.status_code == 422

def test_underage_applicant_rejected():
    """Applicants must be at least 18 years old."""
    bad = {**VALID_APPLICANT, "age_years": 15}
    response = client.post("/predict", json=bad)
    assert response.status_code == 422

def test_zero_credit_rejected():
    """A loan of $0 doesn't make sense."""
    bad = {**VALID_APPLICANT, "amt_credit": 0}
    response = client.post("/predict", json=bad)
    assert response.status_code == 422

6.6 — Model-level tests

Testing the model artifact directly, without going through the API. If the model file is corrupted or incompatible, we want to know immediately — not discover it when the API crashes.

# tests/test_model.py

import joblib
import pandas as pd

def test_model_loads_successfully():
    model = joblib.load("model/credit_model.pkl")
    assert model is not None

def test_model_has_required_methods():
    model = joblib.load("model/credit_model.pkl")
    assert hasattr(model, 'predict')
    assert hasattr(model, 'predict_proba')

6.7 — Run the tests

pip install pytest httpx
pytest tests/ -v

The -v flag shows verbose output — one line per test, with PASS/FAIL status. You should see all green.

git add tests/
git commit -m "feat: add unit and integration tests for API and model"

7. Containerizing with Docker

What is Docker and why do we need it?

The problem: Your API works perfectly on your laptop. You deploy it to a server. It crashes because the server has Python 3.9 instead of 3.11, or it's missing a system library, or a dependency conflicts with something already installed.

The solution: Docker packages your application along with its entire environment — the OS, Python, all libraries, everything — into a self-contained unit called a container. Think of it like shipping your laptop inside the package instead of just the code. The container runs identically everywhere: your laptop, your colleague's machine, a cloud server, a Kubernetes cluster.

Docker containers package your app with everything it needs to run identically everywhere. (source: docker.com)

7.1 — The requirements file

Listing all Python dependencies with version constraints. If you just say pip install scikit-learn, pip installs the latest version — which might be different tomorrow. Pinning versions ensures that the same code produces the same results today, next month, and next year. This is called reproducibility, and it's essential for production systems.

# requirements.txt
fastapi>=0.104.0
uvicorn>=0.24.0
scikit-learn>=1.3.0
joblib>=1.3.0
pandas>=2.0.0
pydantic>=2.0.0
numpy>=1.24.0

7.2 — The Dockerfile, instruction by instruction

Writing a recipe that tells Docker how to build our container. The Dockerfile is read top-to-bottom. Each line creates a "layer" in the image. Docker caches layers, so unchanged layers are reused — this makes rebuilds fast.

FROM python:3.11-slim

We start from a base image that already has Python 3.11 installed. The slim variant is about 150MB instead of 900MB for the full version — we don't need compilers in production.

WORKDIR /app

This sets the working directory inside the container. All subsequent commands run from /app, like doing cd /app.

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

We copy the requirements file first and install dependencies. Docker caches each layer, so if requirements.txt hasn't changed since the last build, Docker skips the slow pip install entirely. Since dependencies change rarely but code changes frequently, this ordering saves minutes on every rebuild. --no-cache-dir avoids storing downloaded packages — we don't need them after installation.

COPY app/ ./app/
COPY model/ ./model/

Now we copy our application code and model. This layer comes after dependencies because code changes more frequently. If we copied code first, every code change would invalidate the dependency cache.

EXPOSE 8000

This documents that the container listens on port 8000. It's purely informational — the actual port mapping happens at runtime with docker run -p.

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

The startup command. --host 0.0.0.0 is critical inside containers — without it, uvicorn only listens on localhost, which means nothing outside the container can reach it. 0.0.0.0 means "listen on all network interfaces."

7.3 — Build, run, and verify

# Build the image (give it a name with -t)
docker build -t credit-scoring-api .

# Run the container
# -p 8000:8000 maps your machine's port 8000 to the container's port 8000
docker run -p 8000:8000 credit-scoring-api

# Verify it works
curl http://localhost:8000/health
git add Dockerfile requirements.txt
git commit -m "feat: add Dockerfile for containerized deployment"

8. Building a CI/CD Pipeline

What is CI/CD and why automate?

What CI/CD means:

  • CI (Continuous Integration): Every time you push code, automated tests run to catch bugs before they reach production.
  • CD (Continuous Deployment): If all tests pass, the code is automatically deployed. No manual steps, no "I forgot to run the tests."

Without CI/CD, deployment looks like: manually run tests → manually build Docker → manually push it → manually restart the service. Each "manually" is an opportunity for human error. With CI/CD, you push code and everything happens automatically — and if anything fails, the deployment stops.

GitHub Actions is GitHub's built-in CI/CD system. You define your pipeline in a YAML file, and GitHub runs it on their servers. (GitHub Actions docs)

8.1 — Understanding the pipeline flow

Our pipeline has 3 stages that run in sequence:

PUSH to main
 │
 ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ TEST │ ──▶ │ BUILD │ ──▶ │ DEPLOY │
│ pytest │ │ docker │ │ push to │
│ │ │ build │ │ registry │
└──────────┘ └──────────┘ └──────────┘
 │ │ │
 If FAIL: If FAIL: If FAIL:
 STOP HERE STOP HERE STOP HERE

Why run them in sequence? Because each stage depends on the previous one succeeding. If tests fail, there's no point building a Docker image of broken code. If the Docker build fails, there's no point trying to deploy. This "fail fast" approach saves time and prevents broken code from reaching production.

8.2 — The YAML file, section by section

Create .github/workflows/ci-cd.yml:

Defining when the pipeline runs. We want it to run on every push to main (catches bugs immediately) and on every pull request targeting main (catches bugs before they're even merged).

name: CI/CD Pipeline

on:
 push:
 branches: [main]
 pull_request:
 branches: [main]

Stage 1 — TEST:

Running our pytest suite in a fresh environment. Testing in a fresh environment (not your laptop) catches issues like "it works because I have library X installed that isn't in requirements.txt."

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest httpx
      - name: Generate model
        run: python train_model.py
      - name: Run tests
        run: pytest tests/ -v --tb=short

Stage 2 — BUILD (only runs if tests pass):

Building the Docker image and verifying the container actually starts and responds. Sometimes tests pass but the Docker build fails (missing file, wrong path, incompatible base image). The smoke test (curl --fail) confirms the API is alive inside the container.

  build:
    runs-on: ubuntu-latest
    needs: test  # ← "needs" means: only run if the "test" job succeeded
    steps:
      - uses: actions/checkout@v4
      - name: Generate model
        run: |
          pip install scikit-learn pandas joblib numpy
          python train_model.py
      - name: Build Docker image
        run: docker build -t credit-scoring-api:${{ github.sha }} .
      - name: Smoke test the container
        run: |
          docker run -d -p 8000:8000 --name api credit-scoring-api:${{ github.sha }}
          sleep 10
          curl --fail http://localhost:8000/health || exit 1
          docker stop api

Stage 3 — DEPLOY (only from main branch, only if build passed):

Pushing the Docker image to a registry (Docker Hub) so it can be pulled by production servers. A registry is like a library for Docker images. Once the image is there, any server can pull and run it.

  deploy:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == 'refs/heads/main'  # Only deploy from main, not PRs
    steps:
      - uses: actions/checkout@v4
      - name: Generate model
        run: |
          pip install scikit-learn pandas joblib numpy
          python train_model.py
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_TOKEN }}
      - name: Push to Docker Hub
        run: |
          docker build -t ${{ secrets.DOCKER_USERNAME }}/credit-scoring-api:latest .
          docker push ${{ secrets.DOCKER_USERNAME }}/credit-scoring-api:latest

What are secrets? They're encrypted environment variables stored in GitHub Settings → Secrets. They're never visible in logs, even if the pipeline prints environment variables. Never hardcode credentials in code or YAML files — if your repo is public (or ever becomes public), those credentials are compromised.

git add .github/
git commit -m "feat: add CI/CD pipeline with GitHub Actions"

9. Logging Production Data

Why logging is critical

Making sure every prediction the API makes is recorded with full context. Once a model is in production, you need answers to questions like:

  • "Is the API getting slower over the past week?" → Check inference_time_ms trends
  • "Has the data distribution changed?" → Compare logged inputs against training data
  • "Why did this client get rejected?" → Look up the exact inputs and outputs for that request
  • "How many predictions did we serve today?" → Count log entries

Without logs, you're flying completely blind. The API could be returning wrong predictions for weeks and nobody would know.

9.1 — What data to collect and why

| What We Log | Why We Need It |
|----|----|
| All input features (ext_source scores, income, credit amount…) | To detect drift: compare production distributions against training data |
| Prediction (0 or 1) | To monitor prediction distribution: if suddenly 50% are defaults, something is wrong |
| Probability (0.0 to 1.0) | To monitor score distribution: a slow shift in average probability indicates model degradation |
| Inference time (milliseconds) | To detect performance issues: if it's getting slower, we need to investigate |
| Timestamp | To analyze trends over time and correlate with external events |
| Errors | To debug failures and identify recurring issues |

We already implemented this in our main.py (section 5.4). Each prediction generates a structured JSON log line containing all the fields above. This data forms the foundation for the drift analysis in the next section.

In a real production environment, these JSON logs would be shipped to a centralized platform — Elasticsearch/Kibana for search and visualization, or Datadog/CloudWatch for monitoring and alerting. For this tutorial, local files demonstrate the same principle.


10. Data Drift Detection

What is data drift and why does it matter?

What drift is: Your model was trained on data from a specific moment in time. It learned the statistical patterns of that data. But the real world is not static — economic conditions shift, customer demographics change, lending policies evolve. When the data your model sees in production starts looking significantly different from the data it was trained on, that's data drift.

A model trained on 2022 Home Credit data might perform terribly on 2024 data if:

  • Inflation increased incomes and credit amounts significantly
  • A new younger customer segment started applying
  • Credit bureau scoring algorithms were updated (changing EXT_SOURCE distributions)
  • An economic recession changed default patterns

The model would still return predictions — it wouldn't crash. But those predictions would be based on patterns that no longer exist. This "silent failure" is one of the most dangerous things in production ML, and drift detection is your early warning system.

Data drift occurs when production data diverges from training data. (source: Evidently AI)
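At its core, most drift detection boils down to comparing a feature's distribution "then" versus "now". Here is a minimal standard-library sketch of that idea using the two-sample Kolmogorov-Smirnov statistic, on synthetic data; a real pipeline would typically use scipy.stats.ks_2samp or a library like Evidently instead:

```python
import bisect
import random

def ks_statistic(a, b):
    """Maximum absolute gap between the two samples' empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gaps = []
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gaps.append(abs(cdf_a - cdf_b))
    return max(gaps)

random.seed(42)
reference = [random.gauss(0.50, 0.10) for _ in range(2000)]  # training-time feature
stable    = [random.gauss(0.50, 0.10) for _ in range(2000)]  # production, no drift
shifted   = [random.gauss(0.55, 0.10) for _ in range(2000)]  # production, +0.05 shift

print(f"no drift:   KS = {ks_statistic(reference, stable):.3f}")
print(f"with drift: KS = {ks_statistic(reference, shifted):.3f}")
```

The shifted sample produces a clearly larger statistic than the stable one; a common practice is to alert when the statistic (or its p-value) crosses a per-feature threshold.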

10.1 — Load reference data

Loading the training data we saved earlier as our "reference" — the baseline against which we'll measure drift. To detect drift, you need to compare "then" (training) versus "now" (production). The reference data defines what "normal" looks like.

import pandas as pd
import numpy as np

reference_data = pd.read_csv('data/reference_data.csv')
print(f"Reference: {reference_data.shape}")

10.2 — Simulate production data with realistic drift

Creating fake production data that looks like what we'd see "6 months later" with realistic shifts. In a real deployment, this data would come from your API logs (the inputs you've been recording — see section 9). Here we simulate it to demonstrate the detection process.

We introduce three types of changes:

  1. EXT_SOURCE scores shift slightly (credit bureau algorithms get updated)
  2. Financial amounts increase (inflation, economic growth)
  3. Applicant demographics shift (younger customers join the platform)

np.random.seed(123)
n_prod = 5000

Simulating shifted external scores. Credit bureaus regularly update their scoring models. A small systematic shift of +0.05 in EXT_SOURCE_1 could happen when the bureau recalibrates — and it changes what those scores mean for our model.

prod_ext_1 = reference_data['EXT_SOURCE_1'].dropna().sample(n_prod, replace=True).values + \
 np.random.normal(0.05, 0.02, n_prod)
prod_ext_1 = np.clip(prod_ext_1, 0, 1) # Keep in valid range

prod_ext_2 = reference_data['EXT_SOURCE_2'].dropna().sample(n_prod, replace=True).values + \
 np.random.normal(0.03, 0.01, n_prod)
prod_ext_2 = np.clip(prod_ext_2, 0, 1)

# EXT_SOURCE_3 stays stable — not all features drift simultaneously
prod_ext_3 = reference_data['EXT_SOURCE_3'].dropna().sample(n_prod, replace=True).values
prod_ext_3 = np.clip(prod_ext_3, 0, 1)

Simulating inflation effects on financial features. If average incomes rise 8% due to inflation but loan amounts rise 12% (because property prices increase faster than wages), the debt burden increases — and our model's risk assessments may be off.

prod_income = reference_data['AMT_INCOME_TOTAL'].sample(n_prod, replace=True).values * 1.08
prod_credit = reference_data['AMT_CREDIT'].sample(n_prod, replace=True).values * 1.12
prod_annuity = reference_data['AMT_ANNUITY'].sample(n_prod, replace=True).values * 1.10
prod_goods = reference_data['AMT_GOODS_PRICE'].sample(n_prod, replace=True).values * 1.15

Simulating a younger customer base. If Home Credit launches a marketing campaign targeting younger people, the age distribution shifts. Younger applicants have shorter credit histories and less stable employment — the model might underestimate their risk because it was trained mostly on older applicants.

prod_age = reference_data['AGE_YEARS'].sample(n_prod, replace=True).values - \
 np.random.uniform(0, 5, n_prod)
prod_age = np.clip(prod_age, 20, 70)

Assembling all simulated features into one DataFrame, including recomputing engineered features. The drift analysis needs to compare the exact same features between reference and production.

current_data = pd.DataFrame({
 'EXT_SOURCE_1': prod_ext_1,
 'EXT_SOURCE_2': prod_ext_2,
 'EXT_SOURCE_3': prod_ext_3,
 'AMT_INCOME_TOTAL': prod_income,
 'AMT_CREDIT': prod_credit,
 'AMT_ANNUITY': prod_annuity,
 'AMT_GOODS_PRICE': prod_goods,
 'AGE_YEARS': prod_age,
 'CODE_GENDER': reference_data['CODE_GENDER'].sample(n_prod, replace=True).values,
 'FLAG_OWN_CAR': reference_data['FLAG_OWN_CAR'].sample(n_prod, replace=True).values,
 'FLAG_OWN_REALTY': reference_data['FLAG_OWN_REALTY'].sample(n_prod, replace=True).values,
 'CNT_CHILDREN': reference_data['CNT_CHILDREN'].sample(n_prod, replace=True).values,
 'YEARS_EMPLOYED': reference_data['YEARS_EMPLOYED'].sample(n_prod, replace=True).values,
 'YEARS_ID_PUBLISH': reference_data['YEARS_ID_PUBLISH'].sample(n_prod, replace=True).values,
 'EDUCATION_LEVEL': reference_data['EDUCATION_LEVEL'].sample(n_prod, replace=True).values,
})

current_data['CREDIT_INCOME_RATIO'] = current_data['AMT_CREDIT'] / (current_data['AMT_INCOME_TOTAL'] + 1)
current_data['ANNUITY_INCOME_RATIO'] = current_data['AMT_ANNUITY'] / (current_data['AMT_INCOME_TOTAL'] + 1)
current_data['CREDIT_GOODS_RATIO'] = current_data['AMT_CREDIT'] / (current_data['AMT_GOODS_PRICE'] + 1)

10.3 — Visual comparison

Plotting the distributions of key features side by side. Before running statistical tests, always look at the data. Visualizations reveal patterns that numbers might miss, and they help you sanity-check the statistical results.

import matplotlib.pyplot as plt

key_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'AMT_INCOME_TOTAL',
 'AMT_CREDIT', 'AGE_YEARS', 'CREDIT_INCOME_RATIO']

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
fig.suptitle('Reference (blue) vs Production (orange)', fontsize=14)

for idx, feature in enumerate(key_features):
 ax = axes[idx // 3][idx % 3]
 ax.hist(reference_data[feature].dropna(), bins=40, alpha=0.5,
 label='Reference', color='steelblue', density=True)
 ax.hist(current_data[feature].dropna(), bins=40, alpha=0.5,
 label='Production', color='darkorange', density=True)
 ax.set_title(feature)
 ax.legend(fontsize=8)

plt.tight_layout()
plt.show()

10.4 — Run Evidently drift report

Using Evidently AI to run formal statistical tests for drift on every feature. Visual comparison is good for getting intuition, but you need statistical rigor for automated decisions. Evidently uses appropriate statistical tests (Kolmogorov-Smirnov for numeric features, chi-square for categorical) and reports whether each feature has significantly drifted. (Evidently AI docs)

from evidently.report import Report
from evidently.metrics import DatasetDriftMetric, DataDriftTable

drift_report = Report(metrics=[
 DatasetDriftMetric(), # Overall: is there significant drift?
 DataDriftTable(), # Per-feature: which features drifted?
])

drift_report.run(
 reference_data=reference_data,
 current_data=current_data,
)

# Save as interactive HTML
import os
os.makedirs('monitoring', exist_ok=True)
drift_report.save_html("monitoring/drift_report.html")

10.5 — Extract and display results

Extracting the drift results programmatically. The HTML report is great for humans exploring interactively. But for automated monitoring (e.g., a weekly script that sends an alert if drift is detected), you need to extract results as data.

report_dict = drift_report.as_dict()

ds = report_dict['metrics'][0]['result']
print(f"Overall drift detected: {'YES' if ds['dataset_drift'] else 'NO'}")
print(f"Drifted features: {ds['number_of_drifted_columns']} / {ds['number_of_columns']}")

Showing per-feature drift details. Knowing that "drift was detected" isn't enough. You need to know which features drifted and how much, so you can investigate the root cause.

drift_table = report_dict['metrics'][1]['result']

print(f"\n{'Feature':<25} {'Drifted?':<10} {'Score':<12} {'Test'}")
print("-" * 65)

for col, info in drift_table['drift_by_columns'].items():
 status = "YES" if info['drift_detected'] else "no"
 score = info['drift_score']
 test = info['stattest_name']
 flag = " << ALERT" if info['drift_detected'] else ""
 print(f"{col:<25} {status:<10} {score:<12.6f} {test}{flag}")

10.6 — Interpretation and action plan

Translating statistical results into business actions. Detecting drift is the easy part. The hard part — and the part that actually matters — is deciding what to do about it. A drift detection without an action plan is just an interesting observation.

if ds['dataset_drift']:
 print("""
 DRIFT DETECTED — Action Required

 What this means for the Home Credit model:
 The data arriving in production is statistically different from
 the data the model was trained on. Specific shifts identified:
 - EXT_SOURCE scores shifted (credit bureau scoring updates)
 - Financial amounts increased (inflation / economic growth)
 - Applicant age decreased (new younger customer segment)

 Recommended actions (in priority order):
 1. IMMEDIATE: Evaluate model AUC on recent labeled production data.
 If AUC dropped below 0.70, the model is degraded.
 2. SHORT-TERM: Investigate each drifted feature individually.
 Is this a data pipeline bug or a genuine real-world shift?
 3. MEDIUM-TERM: If performance degraded, retrain the model
 using recent data that includes the new distributions.
 4. LONG-TERM: Set up automated weekly drift monitoring
 with alerts when drift exceeds thresholds.
 """)

Creating a statistical comparison table. This table gives you concrete numbers to share with stakeholders. "AMT_CREDIT increased 12% on average" is more actionable than "drift was detected on AMT_CREDIT."

# Compare means only for numeric features present in both datasets
num_cols = [c for c in reference_data.select_dtypes(include='number').columns
            if c in current_data.columns]

comp = pd.DataFrame({
    'Feature': num_cols,
    'Train Mean': reference_data[num_cols].mean().round(2).values,
    'Prod Mean': current_data[num_cols].mean().round(2).values,
})
comp['Shift %'] = ((comp['Prod Mean'] - comp['Train Mean']) / (comp['Train Mean'] + 0.001) * 100).round(1)
print(comp.to_string(index=False))
Commit the drift-analysis work:

git add notebooks/ monitoring/
git commit -m "feat: add data drift analysis with Evidently AI"

11. Performance Optimization

The Fiverr client specifically wanted a responsive API. If a customer is waiting for a loan decision and the API takes 10 seconds, that's a terrible user experience. In production, you often have latency budgets (e.g., "predictions must complete in under 100ms").

The optimization workflow is always the same: Measure → Identify bottleneck → Optimize → Measure again. Never optimize without measuring first — you might spend hours optimizing something that isn't actually slow.

11.1 — Profile with cProfile

Using Python's built-in profiler to see exactly which functions take the most time. "The model is slow" is not actionable. "73% of time is spent in predict_proba, specifically in the decision_function call" is actionable.

import cProfile
import pstats
import io
import time
from statistics import mean, stdev
import joblib

model = joblib.load("model/credit_model.pkl")
feature_columns = joblib.load("model/feature_columns.pkl")
test_row = X_test.iloc[[0]]  # a single row for testing (X_test comes from the earlier training sections)

We profile 100 predictions rather than just 1 to get statistically meaningful results:

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
 model.predict_proba(test_row)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative')
stats.print_stats(10)
print(stream.getvalue())

11.2 — Establish baseline

Running 1,000 predictions and recording the time for each one. A single measurement can be misleading (maybe the OS was busy for that one call). We need a distribution: the mean tells us typical performance, the standard deviation tells us consistency, and the p95 tells us the worst case for 95% of requests.

n_iterations = 1000
times_sklearn = []

for _ in range(n_iterations):
 start = time.perf_counter()
 model.predict_proba(test_row)
 end = time.perf_counter()
 times_sklearn.append((end - start) * 1000) # Convert to milliseconds

print(f"Baseline (scikit-learn) — {n_iterations} iterations:")
print(f" Mean: {mean(times_sklearn):.3f} ms")
print(f" Std: {stdev(times_sklearn):.3f} ms")
print(f" p95: {np.percentile(times_sklearn, 95):.3f} ms")

11.3 — Optimize with ONNX Runtime

Converting our scikit-learn model to the ONNX format and running it with ONNX Runtime. ONNX (Open Neural Network Exchange) is a standard format for ML models, and ONNX Runtime is a highly optimized inference engine built by Microsoft. It applies optimizations like graph simplification, operator fusion, and hardware-specific acceleration that scikit-learn doesn't do. For gradient boosting models, ONNX Runtime can be significantly faster. (ONNX Runtime docs)

First, convert the model:

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort
import onnx

n_features = len(feature_columns)
initial_type = [('float_input', FloatTensorType([None, n_features]))]

onnx_model = convert_sklearn(model, initial_types=initial_type, target_opset=12)
onnx.save_model(onnx_model, "model/credit_model.onnx")
print(f"ONNX model saved ({n_features} features)")

Now benchmark it:

session = ort.InferenceSession("model/credit_model.onnx")
input_name = session.get_inputs()[0].name
test_np = test_row.values.astype(np.float32) # ONNX needs numpy float32, not a DataFrame

times_onnx = []
for _ in range(n_iterations):
 start = time.perf_counter()
 session.run(None, {input_name: test_np})
 end = time.perf_counter()
 times_onnx.append((end - start) * 1000)

print(f"ONNX Runtime — {n_iterations} iterations:")
print(f" Mean: {mean(times_onnx):.3f} ms")
print(f" p95: {np.percentile(times_onnx, 95):.3f} ms")

11.4 — Compare and verify

Comparing the two approaches and — critically — verifying that the optimized model produces the same predictions. Speed is worthless if accuracy changes. An optimization that makes the model 10x faster but changes predictions by even 0.1% could have real financial consequences in production.

speedup = mean(times_sklearn) / mean(times_onnx)
improvement = (1 - mean(times_onnx) / mean(times_sklearn)) * 100

print(f"sklearn: {mean(times_sklearn):.3f} ms")
print(f"ONNX: {mean(times_onnx):.3f} ms")
print(f"Speedup: {speedup:.2f}x")
print(f"Improvement: {improvement:.1f}%")

# CRITICAL: verify predictions match
sklearn_proba = model.predict_proba(test_row)[0]
onnx_result = session.run(None, {input_name: test_np})
onnx_proba = onnx_result[1]  # second output holds the class probabilities
if isinstance(onnx_proba[0], dict):  # skl2onnx wraps them in a ZipMap by default
    onnx_proba = np.array([list(d.values()) for d in onnx_proba])
print(f"\nsklearn probas: {sklearn_proba}")
print(f"ONNX probas: {onnx_proba[0]}")
assert np.allclose(sklearn_proba, onnx_proba[0], atol=1e-4), "Predictions diverge!"
print("Predictions match — safe to deploy the optimized version.")
Commit the optimization work:

git add optimization/
git commit -m "feat: add performance profiling and ONNX optimization"

12. The Final Architecture

Let's step back and look at the complete system we've built:

 ┌──────────────────────┐
 │ Developer │
 │ (pushes to Git) │
 └──────────┬───────────┘
 │
 ▼
 ┌──────────────────────┐
 │ GitHub + CI/CD │
 │ Test → Build → Push │
 └──────────┬───────────┘
 │
 ▼
 ┌──────────────────────┐
 │ Docker Container │
 │ ┌────────────────┐ │
 │ │ FastAPI API │ │
 │ │ + GBM Model │ │
 │ │ + JSON Logs │ │
 │ └────────────────┘ │
 └──────────┬───────────┘
 │
 ┌──────────┴───────────┐
 ▼ ▼
 ┌─────────────┐ ┌───────────────┐
 │ Predictions │ │ Log Storage │
 │ (score, │ │ (inputs, │
 │ proba, │ │ outputs, │
 │ risk) │ │ latency) │
 └─────────────┘ └───────┬───────┘
 ▼
 ┌───────────────┐
 │ Drift Analysis │
 │ (Evidently AI) │
 │ + Performance │
 │ Profiling │
 └───────────────┘

| Component | Problem It Solves |
|----|----|
| Git + GitHub | "What changed, when, and why?" |
| FastAPI + Pydantic | "How do others use the model safely?" |
| Pytest | "Will this code change break something?" |
| Docker | "It works on my machine" → "It works everywhere" |
| GitHub Actions CI/CD | "Did someone forget to run tests?" |
| JSON Logging | "What's happening in production?" |
| Evidently AI | "Is the model still relevant?" |
| ONNX Runtime | "Can we make it faster?" |


13. Key Takeaways

After spending a weekend on this (and impressing the Fiverr client), here's what stuck:

Data preparation is more than half the work. The Home Credit dataset has 122 columns, anomalous values (365243 in DAYS_EMPLOYED), heavy class imbalance (92%/8%), and lots of missing values. Real data is always messy. Embrace it.

Load models once at startup, never per request. This single design choice can make your API 100x faster. It seems obvious in hindsight, but it's the #1 performance mistake in ML APIs.
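The difference can be sketched without any web framework. Here `load_model` is a hypothetical stand-in for `joblib.load`, simulating the deserialization cost; in FastAPI the same idea lives in a startup event or lifespan hook:

```python
import time
from functools import lru_cache

def load_model(path: str):
    """Stand-in for joblib.load; simulates slow disk deserialization."""
    time.sleep(0.05)
    return {"path": path, "weights": [0.1, 0.2]}

@lru_cache(maxsize=1)
def get_model(path: str = "model/credit_model.pkl"):
    # The first call pays the load cost; every later call returns the cached object
    return load_model(path)

start = time.perf_counter()
get_model()
first_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
get_model()
cached_ms = (time.perf_counter() - start) * 1000
print(f"first call: {first_ms:.1f} ms, cached call: {cached_ms:.4f} ms")
```

Calling the loader inside every request handler means paying that first-call cost on every prediction; caching reduces it to a dictionary lookup.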

Validate everything at the boundary. Production data is unpredictable. Pydantic caught edge cases I never would have thought of — negative incomes, ages of 200, strings where numbers should be.
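As a sketch of boundary validation, assuming Pydantic v2's `Field` constraints (the field names and bounds here are illustrative, not the article's exact schema):

```python
from pydantic import BaseModel, Field

class LoanApplication(BaseModel):
    """Boundary validation for the prediction endpoint (fields illustrative)."""
    AMT_INCOME_TOTAL: float = Field(gt=0)       # rejects negative or zero incomes
    AMT_CREDIT: float = Field(gt=0)
    AGE_YEARS: float = Field(ge=18, le=100)     # rejects an "age of 200"
    CODE_GENDER: str = Field(pattern="^[MF]$")  # rejects arbitrary strings
```

FastAPI runs this validation automatically on every request body, so malformed inputs are rejected with a 422 before they ever reach the model.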

Test invalid inputs, not just valid ones. Half our tests verify that the API properly rejects bad data. This is what prevents silent failures.

Docker eliminates "works on my machine." It's non-negotiable for production ML. Learn it well.

CI/CD is your automated quality gate. It makes it physically impossible to deploy code that doesn't pass tests.

Monitor for drift, or your model will silently degrade. The Home Credit data captures a specific moment. In production, everything shifts — incomes, demographics, bureau scores. Without drift detection, you won't know until someone notices bad predictions manually.

Profile before optimizing. Don't guess where bottlenecks are — measure them. And always verify that optimization doesn't change predictions.

Start simple, iterate. Get a basic API working. Then add Docker. Then CI/CD. Then monitoring. Each layer builds on the previous one. Trying to do everything at once leads to nothing working.


Resources


If you found this useful, clap or share. I'm always happy to discuss MLOps and the messy reality of putting ML models into production.

The complete code is available on GitHub.


CDP vs MDM: Similar Goals, Different Jobs

2026-04-03 09:55:08

In conversations about customer data, one question comes up again and again: if both CDPs and MDMs help create a more complete view of the customer, are they basically doing the same thing?

It is an understandable question. After all, both technologies are often positioned around customer unification, identity resolution, and creating better visibility across systems. On the surface, they can sound very similar.

But while CDPs and MDMs do overlap in some areas, they are not the same thing, and they are not really interchangeable either.

The simplest way I like to think about it is this: MDMs help define who your customer is, while CDPs help you decide what to do with that customer.

That difference may sound small, but it has a big impact on how each platform is designed, where it fits in an enterprise architecture, and what kinds of problems it is best suited to solve.

Activation vs governance

At a high level, both CDPs and MDMs are data management platforms, but they are focused on different outcomes.

A CDP is generally centered around activation. Its job is to bring together customer data from different sources so teams can better understand customer behaviour and interactions, and then use that understanding for things like segmentation, personalization, and orchestration.

An MDM, on the other hand, is centered more around governance. Its purpose is to standardize and manage core business data in a controlled way, with stronger attention to quality, consistency, ownership, and compliance.

So while both systems may contribute to a broader customer picture, they are doing so for different reasons. CDPs are more concerned with making data usable in the moment. MDMs are more concerned with defining and governing trusted data over time.

Identity resolution: yes, but not in the same way

This is often where the lines start to blur.

Many CDPs offer identity resolution and talk about building a “complete customer profile.” MDMs also support matching, consolidation, and deduplication. So it is fair to ask: does this not put them in the same space?

Yes and no.

The overlap is real, but the intent behind it is different.

In a CDP, identity resolution is usually there to support engagement use cases. The idea is to connect behavioural, transactional, and interaction data across channels so the business can respond more intelligently. In that sense, the profile only needs to be accurate enough to support action.

In an MDM, identity resolution is more about establishing and governing the official identity of a customer across systems. The stakes are often higher, especially when legal, financial, or regulatory processes are involved.

That is why I think this distinction works so well:

  • MDM says: this is the official truth
  • CDP says: this is good enough to act on

That does not make one better than the other. It just shows that they are solving different kinds of problems.

The type of data matters too

Another useful way to compare the two is by looking at the kind of data they are typically built to handle.

CDPs tend to focus more on behavioural and interaction data — things like clicks, visits, purchases, channel engagement, and journey activity. This is the data that helps organizations understand how customers are interacting with the brand and how they may want to respond.

MDMs are more focused on identity and core attributes — the foundational data that defines who a customer is in a more formal and governed sense. This is why MDM often becomes more important in use cases involving deduplication, compliance, or enterprise-wide standardization.

This difference becomes even more important when identity decisions have legal or financial implications. In those cases, organizations usually need stricter controls, auditable decisions, stewardship, and more deterministic matching. That is where MDM is naturally stronger.

Customer engagement use cases, by contrast, can often tolerate more flexibility and sometimes benefit from probabilistic matching, because the priority is not perfect certainty — it is speed, responsiveness, and usefulness.

Flexible vs rigid structure

CDPs and MDMs also differ in how strictly they enforce structure.

CDPs usually work with a more flexible schema. That makes sense, because customer interactions change all the time, and the platform needs to adapt to new events, channels, and signals without too much friction.

MDMs tend to be more rigid by design. They rely on stronger rules, more formal definitions, and tighter change management because they are responsible for maintaining consistency in core business data.

Again, this does not make one approach better than the other. It simply reflects the purpose of each system. Flexibility helps a CDP move quickly. Rigidity helps an MDM maintain trust.

So, are CDPs and MDMs interchangeable?

In my view, not really.

They may overlap in capability areas such as unification and identity resolution, but the role they play is different. A CDP is not just a lighter version of MDM, and an MDM is not simply a stricter CDP.

If you treat a CDP like an MDM, you risk pushing governance responsibilities into a platform that is really designed for activation. If you treat an MDM like a CDP, you may end up with a highly governed environment that is not built to respond quickly to behavioural signals and customer engagement needs.

That is why, in many organizations, the better answer is not choosing one over the other, but understanding how they can work together.

Why not both?

For many enterprises, especially those operating across multiple markets, brands, or regulated environments, there is a strong case for using both.

In that setup, the MDM helps govern the core customer record and define the trusted business truth, while the CDP uses that truth alongside behavioural and interaction data to support activation.

That tends to be a much healthier split.

MDM provides trust, control, and standardization. CDP provides context, agility, and actionability.

So while they may sound similar in sales conversations, they are really solving different problems. And once that becomes clear, it becomes much easier to see where each fits.


Parsing as Response Validation: A New Necessity for Scraping?

2026-04-03 09:50:03

Fetch, parse, and store is a web scraping order traditionally effective for most data pipelines. Up until recently, it was the dominating way to collect data, even at scale. With the rise of AI crawlers, however, more sophisticated anti-scraping strategies have become prevalent across the web.

Websites have the right to defend themselves from malicious bots, but legitimate public data collection is affected as well. The traditional web scraping process must be rethought, with parsing becoming a part of response validation that realigns your scraping strategy.

The Shifted Function of Parsing

Parsing is a process of analyzing collected data, interpreting, and organizing it into a more structured, sometimes human-readable format. In short, it's the step that turns raw HTTP responses into something your data pipeline can actually use.

When a scraper fetches a page, you receive a wall of HTML tags, attributes, styling, metadata, and other details you might not actually need. Parsing makes sense of such data by selecting the important information, structuring it, and helping you extract what's actually needed for your use case.

A parser for a price scraper will locate the price, product's name, and availability while ignoring everything else. A parser for a news scraper would find the headline, summary, and body text while discarding ads, navigation, and other irrelevant details.

The traditional parsing approach assumes that the page you fetch is the page a real user sees, and the content is genuine, not meant to disorient your scraping efforts. That assumption no longer holds, or at least not that often.

Some websites intentionally insert fake data or responses to trick web scrapers, regardless of their intentions. A parser's job is no longer only to find useful data, but to help decide what can be trusted.

Adversarial Web

For most of the web's history, pages were static, and anti-bot defenses were basic. The arms race between scrapers and site administrators progressed slowly. We got used to the internet being relatively cooperative, but a few recent developments changed the landscape.

The AI training rush brought a new class of crawlers to the web, ones that operate at a scale far beyond traditional data collection. AI crawlers scrape entire sites repeatedly, driving up bandwidth costs and server load without any benefit to the publisher.

At the same time, the importance of online data grew exponentially. From lead generation and real-time market intelligence to training AI on proprietary datasets, entire business models are built around scraping (e.g., Skyscanner). Countries are talking about online data as a matter of national security and even sovereignty.

While the incentives for novel anti-scraping strategies grew, the defenses themselves became much more accessible. Content delivery networks (CDNs), most notably Cloudflare, started offering sophisticated bot management as mainstream tools at affordable prices or even for free.

As such, we are now seeing bot detection solutions on almost every website and can expect them to be even more prevalent in the future. The result is a web where automated requests are treated as malicious by default, and the strategies deployed reflect such a posture.

Modern Anti-Scraping Strategies

Novel anti-bot strategies don't just block scrapers, but exploit the logic of data collection. The traditional model of fetching, parsing, and storing assumes the response reflects what the user sees. Some of the prevalent strategies popularized more recently rely on attacking such an assumption.


  • Honeypots are hidden elements embedded in a page's structure with the intent to deceive scrapers. They are invisible to human visitors, but exposed to scrapers that try to visit every URL they find. Triggering honeypots risks IP bans or being flagged as a bot.
  • Fingerprinting and behavioral analysis encompass defenses that profile visitors based on their interactions with the site. Header composition, TLS signatures, mouse movement patterns, and many other details are at play here. At any given moment, your requests can be double-checked based on how you interact.
  • Soft blocks involve serving progressively worse content (content degradation) rather than outright blocking access. Responses might become slower, pagination broken, or content incomplete to trick the scraper into wasting its resources, unaware that it's being tricked.
  • Dynamic and deceptive content creates meaningful differences between what a human visitor sees and what raw data a scraper can extract. JavaScript purposefully renders some content only after behavioral signals are evaluated. Other elements might be reordered or obfuscated at the markup level to deceive scrapers.
  • Poisoned data is about returning subtly falsified information rather than blocking scrapers' access. The data pipeline runs without errors but returns incorrect prices, fake contacts, fabricated entries, etc.
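To make the honeypot idea concrete, here is a small stdlib-only sketch of screening links for one common hiding technique (inline CSS) before following them. Real honeypots use many more tricks, so the markers here are illustrative:

```python
from html.parser import HTMLParser  # stdlib only

HIDDEN_MARKERS = ("display:none", "visibility:hidden")  # illustrative honeypot hints

class LinkScreen(HTMLParser):
    """Collect hrefs, skipping links hidden with inline CSS (a honeypot hint)."""
    def __init__(self):
        super().__init__()
        self.safe_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        style = a.get("style", "").replace(" ", "").lower()
        if any(m in style for m in HIDDEN_MARKERS):
            return  # likely a trap: a human visitor never sees this link
        if a.get("href"):
            self.safe_links.append(a["href"])

html = '<a href="/products">Products</a><a href="/trap" style="display: none">x</a>'
p = LinkScreen()
p.feed(html)
print(p.safe_links)  # only the visible link survives
```

A naive crawler that follows every `href` would visit `/trap` and risk an instant IP ban.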

The non-adversarial internet that web scraping was designed for might no longer exist, but that doesn't mean legitimate data collection is impossible. Not every response can be treated as an honest answer, and guarding against that is the role parsing must take.

Moving Parsing Upstream

Parsing must no longer be used only as a process of interpreting data, but as a decision gate. It moves upstream to run before the data is stored or, in some cases, before the automation tool takes an action (follows a link, submits a field, logs an interaction).


  • Fetch → Interact → Parse → Store
  • Fetch → Parse and Validate → Decision logic → Interact → Store

Since fetched responses aren't trustworthy by default, parsing includes a data validation step. Does the data match the structure you expect? Are all the fields present? Does the shape of the content reflect a genuine page? The validation logic checks for any red flags and returns a verdict.

Invalid data can be even worse than no data at all, yet a failed validation still has value: it's a decision point that tells the scraper how to proceed. You can retry the request, rotate the IP, change headers, or fall back to an alternative scraper altogether.
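The decision gate can be sketched in a few lines. The schema, bounds, and verdict labels are hypothetical, tuned per target site in practice:

```python
EXPECTED_FIELDS = {"name", "price", "availability"}  # hypothetical schema for a price scraper

def validate_parsed(record: dict) -> str:
    """Decision gate: classify a parsed record before it enters the pipeline."""
    # Structural check: did we get all the fields a genuine page yields?
    if not EXPECTED_FIELDS.issubset(record):
        return "retry"        # likely a degraded or blocked response; rotate IP and retry
    # Plausibility check: does the content look genuine, not poisoned?
    price = record.get("price")
    if not isinstance(price, (int, float)) or not 0 < price < 100_000:
        return "escalate"     # suspicious values; fall back to an alternative scraper
    return "store"

print(validate_parsed({"name": "Widget", "price": 9.99, "availability": True}))
```

The verdict then drives the decision logic: store the record, retry with a different identity, or escalate to a heavier tool.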

Parsing data downstream risks filling your pipeline with bad data that, at best, wastes resources and takes time to clean and, at worst, gets you blocked. Validating before acting isn't new; it's a basic engineering principle already applied across most online infrastructure.

What's new is that scrapers used to work without upstream parsing, even in large-scale projects, quite successfully. That's increasingly no longer the case. Proxy providers are reacting: all major providers now offer scraping APIs, web unblockers, and other tools alongside quality proxies.

When Your Approach Might Vary

There are situations where using parsing as a response validation tool might add more costs than it's worth. The practical value depends on the context: what you're collecting, where you're collecting it from, and what you're likely up against.


  • Scale and speed requirements. Validation adds overhead that consumes resources. A small, occasional data collection project can absorb it easily, but for large-scale or time-sensitive pipelines, the cost must be weighed against the risk of occasionally collecting bad data.
  • Data sources. Not all responses are equally likely to be full of honeypot traps or other anti-scraping measures. HTML pages' DOM-based responses are where upstream parsing is most important. API responses, for example, can be treated as more trustworthy in some cases.
  • Target's structure and predictability. Validation works best when you know what an expected response from the target website is. Highly dynamic or irregular sites make it more difficult to establish a baseline, and the complexity of response validation increases.

Scaling your data collection efforts on varying sources requires a multi-layered approach. Data extraction and parsing tools should be combined with proxy management solutions and resilience to retry with a different strategy.

Conclusion

The current web requires a more deliberate posture when scraping. Treating parsing as a response validation action is a step in the right direction. Other crucial parts are also important to a fully functioning data pipeline, but in many cases, the solution starts from repositioning parsing to an earlier stage.