Titanic Survival Prediction API (Docker Demo)
01

Problem

A machine learning model that runs only on the developer's machine is not deployable. It depends on a specific Python version, a specific set of installed libraries, and a specific directory structure — none of which another person, server, or cloud environment is guaranteed to have. The traditional answer is documentation and hope: "install these packages, make sure your numpy matches, good luck." Docker eliminates that entirely. The challenge in this project was learning how to package not just a trained model but an entire ML inference stack — training environment, model artifacts, and FastAPI service — into reproducible containers that anyone can pull and run with a single command, anywhere.


02

Solution

This project trained a Random Forest classifier on the Titanic dataset inside a Jupyter Docker container, resolved a numpy/pandas binary incompatibility that crashed the serving layer, and packaged the FastAPI prediction API into a Docker image published to Docker Hub. The result is a fully containerized, publicly pullable ML service: docker pull thetechlearner/titanic-survival-api:v1.0 followed by a single docker run starts a production-ready API that accepts passenger features and returns survival probabilities with structured confidence levels — no local Python environment required.


03

Skills Acquired


04

Deep Dive

There is a well-known gap between "it works on my machine" and "it works in production." Docker exists to close that gap. Before this project, I understood containerization conceptually — after it, I had shipped a containerized ML service to a public registry that anyone in the world can pull and run.

This project built a two-container ML system: one container for training (Jupyter + scikit-learn on the Titanic dataset), and a second, leaner container for serving predictions via FastAPI. The serving container was pushed to Docker Hub as thetechlearner/titanic-survival-api:v1.0 — a publicly verifiable, reviewer-approved deployment artifact.

The gap between "I trained a model" and "anyone can run my model" is containerization. This sprint was where I crossed that line — from local artifact to publicly deployable service.

Why This Project?

This was the Docker and Containerization sprint of my TripleTen AI and Machine Learning Bootcamp. The assignment: train a model from scratch inside a Docker container, build a serving API, containerize it, and publish the image to Docker Hub for others to pull and run. The Titanic survival dataset was the chosen domain — a classic binary classification problem that put the focus on the containerization workflow rather than model complexity.

I treated every step as a real MLOps exercise — not just making things run but making them reproducible. The numpy/pandas incompatibility that crashed the first serving attempt was a real production-class dependency management problem, not a notebook error.


What You'll Learn from This


The Service

The API predicts Titanic passenger survival from 7 features: passenger class, sex, age, siblings/spouses aboard, parents/children aboard, fare, and embarkation port. The trained Random Forest model applies feature engineering at inference time — computing age group, fare level, family size, and travel-alone status — before making a prediction.
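The inference-time feature engineering described above can be sketched roughly like this; the bucket boundaries, labels, and function name are my assumptions for illustration, not the project's actual code:

```python
# Hypothetical sketch of the inference-time feature engineering the
# API applies; thresholds and labels are assumptions, not taken
# from the project's source.

def engineer_features(passenger: dict) -> dict:
    age = passenger["Age"]
    fare = passenger["Fare"]
    family_size = passenger["SibSp"] + passenger["Parch"] + 1
    return {
        **passenger,
        "FamilySize": family_size,
        "IsAlone": family_size == 1,
        "AgeGroup": ("Child" if age < 13 else
                     "Teen" if age < 20 else
                     "Adult" if age < 40 else
                     "Middle" if age < 60 else "Senior"),
        "FareLevel": ("Low" if fare < 10 else
                      "Medium" if fare < 30 else "High"),
    }

profile = engineer_features(
    {"Pclass": 1, "Sex": "female", "Age": 30.0,
     "SibSp": 0, "Parch": 0, "Fare": 50.0, "Embarked": "S"})
print(profile["AgeGroup"], profile["FareLevel"], profile["IsAlone"])
# Adult High True
```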

| Endpoint | Method | Purpose |
| --- | --- | --- |
| / | GET | Health check — confirms the API is running |
| /predict | POST | Single passenger prediction with survival probability |
| /predict-batch | POST | Multiple passenger predictions in one request |
| /model-info | GET | Model algorithm, accuracy, and feature metadata |
| /docs | GET | Auto-generated Swagger UI (FastAPI built-in) |
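A /predict-batch request carries several passengers in one call. A minimal sketch of building such a payload, assuming the batch schema simply repeats the single-passenger request shape in a JSON array:

```python
# Hypothetical /predict-batch payload; the exact batch schema is an
# assumption extrapolated from the single-passenger /predict example.
import json

batch = [
    {"Pclass": 1, "Sex": "female", "Age": 30.0,
     "SibSp": 0, "Parch": 0, "Fare": 50.0, "Embarked": "S"},
    {"Pclass": 3, "Sex": "male", "Age": 22.0,
     "SibSp": 1, "Parch": 0, "Fare": 7.25, "Embarked": "S"},
]

# This string would be POSTed to /predict-batch with
# Content-Type: application/json
payload = json.dumps(batch)
print(len(batch), "passengers in one request")
```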

The prediction response includes not just survived and survival_probability but also a prediction_confidence label (High / Medium / Low) and a full passenger_profile object — showing the feature-engineered representation the model actually used.

# Live test result — Class 1 adult female
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"Pclass": 1, "Sex": "female", "Age": 30.0,
      "SibSp": 0, "Parch": 0, "Fare": 50.0, "Embarked": "S"}'

# Response
{
  "survived": 1,
  "survival_probability": 0.9152319739084446,
  "prediction_confidence": "High",
  "passenger_profile": {
    "class": "Class 1",
    "gender": "female",
    "age_group": "Adult",
    "family_size": 1,
    "traveling_alone": true,
    "fare_level": "High",
    "embarkation_port": "Southampton"
  }
}
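The High/Medium/Low label can be derived from how far the probability sits from the 0.5 decision boundary. A minimal sketch, with thresholds that are assumptions rather than the project's actual values:

```python
# Sketch of mapping a survival probability to the API's
# prediction_confidence label; the thresholds are assumptions,
# not taken from the project code.

def confidence_label(p: float) -> str:
    margin = abs(p - 0.5)   # distance from the decision boundary
    if margin >= 0.3:
        return "High"
    if margin >= 0.15:
        return "Medium"
    return "Low"

print(confidence_label(0.915))  # High (far from the boundary)
print(confidence_label(0.62))   # Low (close to the boundary)
```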

How I Built It

Phase 1

Training Environment — Jupyter in Docker

The model was trained inside a jupyter/base-notebook Docker container mounted to the local project directory. Running docker exec titanic-trainer python train_model.py executed the full training pipeline — data loading, feature engineering (5 new features: FamilySize, IsAlone, Title, AgeGroup, FareGroup), label encoding, Random Forest fitting, and artifact serialization — producing titanic_model.joblib, label_encoders.joblib, and model_metadata.json in a models/ directory.
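The train-and-serialize step can be sketched on synthetic data; the artifact names mirror the article's, but the dataset, feature columns, and metadata fields here are illustrative assumptions:

```python
# Illustrative sketch of the training pipeline's final step on
# synthetic data; only the artifact file names follow the article.
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))           # stand-in for engineered features
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, max_depth=10,
                               random_state=42).fit(X, y)

# Create the output directory first (the missing-dir bug from Phase 1)
models_dir = Path(tempfile.mkdtemp()) / "models"
models_dir.mkdir(parents=True, exist_ok=True)

joblib.dump(model, models_dir / "titanic_model.joblib")
(models_dir / "model_metadata.json").write_text(
    json.dumps({"algorithm": "RandomForestClassifier",
                "n_features": X.shape[1]}))
print(sorted(p.name for p in models_dir.iterdir()))
# ['model_metadata.json', 'titanic_model.joblib']
```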

The first training attempt failed with a FileNotFoundError because the models/ directory did not exist inside the container. Fix: docker exec titanic-trainer mkdir -p models. A minor but realistic production detail — containers start from a clean image with no persistent directories.

Phase 2

Dependency Debugging — numpy/pandas Incompatibility

The first serving container crashed immediately with:

ValueError: numpy.dtype size changed, may indicate binary incompatibility.
Expected 96 from C header, got 88 from PyObject

The root cause: numpy was not pinned in requirements.txt, so pip resolved it to a version compiled against a different ABI than pandas expected. The fix was pinning numpy==1.24.3, a version binary-compatible with pandas==2.0.3. Without an explicit pin, pip can resolve any numpy version, and the compiled C extensions shared between packages can silently break.

# requirements.txt — before (missing numpy pin)
fastapi==0.104.1
uvicorn==0.24.0
pandas==2.0.3
scikit-learn==1.3.0
joblib==1.3.1
python-multipart==0.0.6

# requirements.txt — after (numpy pinned to compatible version)
fastapi==0.104.1
uvicorn==0.24.0
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
joblib==1.3.1
python-multipart==0.0.6
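A quick way to verify the pinned stack is an import smoke test: importing pandas loads its compiled extensions against the installed numpy, so an ABI mismatch surfaces immediately as the ValueError above rather than later in production:

```python
# ABI smoke test: if numpy and pandas were built against
# incompatible ABIs, the pandas import itself raises the
# "numpy.dtype size changed" ValueError.
import numpy as np
import pandas as pd

print("numpy", np.__version__, "| pandas", pd.__version__)

# Exercise numpy-backed arithmetic through pandas
df = pd.DataFrame({"x": np.arange(3)})
print(df["x"].sum())
```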

Phase 3

Publishing to Docker Hub

The serving image was tagged and pushed to Docker Hub. An initial push to vietnguyen/titanic-survival-api failed because the Docker Hub account is thetechlearner — an image's repository name must begin with the registry account that owns it. After retagging with the correct username and creating the repository on Docker Hub, both the v1.0 and latest tags pushed successfully.

# Retag with correct Docker Hub username
docker tag titanic-survival-api thetechlearner/titanic-survival-api:v1.0
docker tag titanic-survival-api thetechlearner/titanic-survival-api:latest

# Push both tags
docker push thetechlearner/titanic-survival-api:v1.0
docker push thetechlearner/titanic-survival-api:latest

# Anyone can now pull and run the API
docker pull thetechlearner/titanic-survival-api:v1.0
docker run -p 8000:8000 thetechlearner/titanic-survival-api:v1.0
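The project's actual Dockerfile is not shown in this write-up; what follows is a hedged sketch of what a serving image like this typically looks like, where the base image, directory layout, and uvicorn entrypoint are all assumptions:

```dockerfile
# Hypothetical Dockerfile for the serving image; base image, paths,
# and entrypoint are assumptions, not the project's actual file.
FROM python:3.11-slim

WORKDIR /app

# Install pinned dependencies first so this layer is cached
# across code-only rebuilds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the API code and the trained model artifacts
COPY app/ ./app/
COPY models/ ./models/

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Copying requirements.txt before the application code is a common layer-caching pattern: dependency installation only reruns when the pins change, not on every code edit.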

Model Performance

The Random Forest (100 estimators, max_depth=10) was trained on 712 samples and evaluated on 179 held-out samples using a stratified 80/20 split.

| Metric | Value | Notes |
| --- | --- | --- |
| Algorithm | RandomForestClassifier | 100 estimators, max_depth=10 |
| Training set | 712 passengers | stratified split, random_state=42 |
| Test set | 179 passengers | 20% held out |
| Test accuracy | 81.56% | accuracy_score on test set |
| Precision (survived) | 0.79 | class 1, 69 support |
| Recall (survived) | 0.71 | caught 71% of actual survivors |
| F1 (survived) | 0.75 | harmonic mean of precision/recall |
| Live test (Class 1, female, 30) | 91.5% survival probability | High confidence |

Top 10 feature importances from the trained model:

| Rank | Feature | Importance | Type |
| --- | --- | --- | --- |
| 1 | Sex | 28.1% | raw input |
| 2 | Fare | 17.1% | raw input |
| 3 | Title | 12.7% | engineered from Name |
| 4 | Age | 12.3% | raw input (median-imputed) |
| 5 | Pclass | 9.2% | raw input |
| 6 | FamilySize | 5.1% | SibSp + Parch + 1 |
| 7 | FareGroup | 4.3% | quartile bucketing of Fare |
| 8 | SibSp | 3.0% | raw input |
| 9 | Embarked | 2.7% | raw input |
| 10 | AgeGroup | 2.3% | age bucket (Child/Teen/Adult/Middle/Senior) |
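A ranking like this comes from the fitted model's feature_importances_ attribute. A small sketch on synthetic data showing how such a table can be produced; the feature names are reused for illustration only and the data does not resemble the real Titanic features:

```python
# Sketch of ranking features by importance from a fitted
# RandomForestClassifier; the data is synthetic and the feature
# names are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 2] > 0).astype(int)          # column 2 drives the label
names = ["Sex", "Fare", "Title", "Age"]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranked = sorted(zip(names, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
for rank, (name, imp) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {imp:.1%}")
```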

Reviewer Approval

The project received full approval from TripleTen reviewer Victor Camargo. Every checkpoint was verified against the official evaluation criteria.

| Checkpoint | Status |
| --- | --- |
| Image reference is usable and well-formed | APPROVED ✓ |
| Container starts successfully | APPROVED ✓ |
| Runtime logs look healthy | APPROVED ✓ |
| GET / endpoint | APPROVED ✓ |
| POST /predict endpoint | APPROVED ✓ |
| POST /predict-batch endpoint | APPROVED ✓ |
| GET /model-info endpoint | APPROVED ✓ |
| Image can be pulled and run locally | APPROVED ✓ |
| API endpoints respond per expected contract | APPROVED ✓ |
| Documentation and usability sufficient | APPROVED ✓ |

Key Takeaways


What I Learned & Why It Matters to Employers

Before this sprint, I understood Docker conceptually — images, containers, layers. After it, I had debugged a real binary dependency incompatibility, navigated Docker Hub authentication, and pushed a publicly pullable ML service to a container registry. The distinction matters in interviews: I can talk about containerization from the inside, not just from the documentation. Any ML Engineer role that deploys models to production uses Docker or Kubernetes — and the ability to go from training notebook to versioned, publicly runnable image is exactly the MLOps skill that bridges data science and engineering.

Conclusion & Reflections

The most durable lesson from this sprint: containerization is not a deployment detail — it is a reproducibility guarantee. Every time someone runs docker pull thetechlearner/titanic-survival-api:v1.0, they get the exact same Python version, the exact same library versions, and the exact same model weights that the reviewer validated. No setup instructions. No "works on my machine."

That guarantee is what makes containers the standard unit of deployment for ML systems in production. It is also what makes a containerized portfolio project more credible than a GitHub notebook: the artifact is verifiable, pullable, and runnable by anyone with Docker installed — including a technical interviewer.

| Project Requirement | Status |
| --- | --- |
| Model trained inside a Docker container | YES — docker exec titanic-trainer python train_model.py ✓ |
| Serving API containerized separately | YES — titanic-survival-api image ✓ |
| Dependency incompatibility resolved | YES — numpy pinned to 1.24.3 ✓ |
| Image published to Docker Hub | YES — thetechlearner/titanic-survival-api:v1.0 ✓ |
| All 4 API endpoints implemented | YES — /, /predict, /predict-batch, /model-info ✓ |
| Full reviewer approval | YES — 10/10 checkpoints approved ✓ |

Want to Pull and Run the API?

The Docker image is publicly available on Docker Hub — pull it and test it yourself.