Problem
Mobile carriers routinely carry customers on legacy plans that no longer match how those customers actually use their phones. For Megaline, this meant a substantial portion of their user base was on outdated pricing that was either overcharging customers who used less than they paid for or leaving revenue on the table for heavy users who needed more. The business had two modern plans — Smart and Ultra — but no systematic way to identify which plan each legacy customer should be moved to. Manual review was impractical at scale, and rule-based approaches require explicit thresholds that ignore the interactions between usage features. What Megaline needed was a classification model that could learn the boundary between Smart and Ultra from historical usage data and apply that boundary to new customers automatically.
Solution
This project trained a Random Forest classifier on Megaline's behavioral data — monthly calls, minutes used, messages sent, and internet data consumed — to recommend the appropriate modern plan for each legacy customer. The workflow followed a full end-to-end ML pipeline: feature inspection, train/validation/test split, model selection across Decision Tree and Random Forest algorithms, hyperparameter tuning, and final evaluation against a held-out test set. The target was a minimum test accuracy of 75%; the final model achieved 81.8%, clearing the threshold by 6.8 percentage points. Sanity checks against a dummy classifier baseline confirmed that the model was learning real signal from the behavioral features rather than exploiting class imbalance. This was the first complete ML project in the training program — the point where individual concepts like decision boundaries and ensemble methods connected into a working system.
Skills Acquired
- Python — the implementation language for the full pipeline: data loading, feature analysis, model training, hyperparameter tuning, and evaluation.
- scikit-learn — the machine learning framework used to train both the Decision Tree and Random Forest classifiers, run accuracy evaluations, and generate the dummy classifier baseline. scikit-learn's consistent estimator API made it straightforward to swap algorithms and compare results without rewriting the evaluation code.
- Pandas — used for data loading, inspection, and feature analysis. Pandas DataFrames were the primary data structure throughout preprocessing, enabling column-wise analysis of the four behavioral features before they were passed to the model.
- Decision Tree — the baseline model in the comparison. A single Decision Tree is interpretable and fast, but prone to overfitting — using it as the starting point established a performance floor that Random Forest needed to meaningfully exceed.
- Random Forest — the final model that achieved 81.8% test accuracy. Random Forest averages the predictions of many decorrelated decision trees, reducing variance and improving generalization over any individual tree — which is why it consistently outperforms a single Decision Tree on tabular classification tasks.
What makes the result meaningful is not just the final accuracy number — it is the reasoning behind each decision along the way.
Deep Dive
Megaline, a mobile carrier, has a problem. Many of their customers are still on legacy plans — plans that no longer match how they actually use their phones. The business wants to move these customers to one of two modern options: Smart or Ultra. But how do you know which plan fits which customer?
You look at the data. Every month, Megaline tracks each customer's calls, minutes, messages, and data usage. That behavioral footprint tells a story about what plan they actually need — and a classification model can learn to read it.
Why This Project?
This was Sprint 8 of my TripleTen AI and Machine Learning Bootcamp — my first full machine learning project. Up to this point, I had learned the theory: what a Decision Tree does, how Random Forest improves on it, what hyperparameters control overfitting. This was where I applied all of it to a real dataset for the first time.
I treated it like a genuine business problem. Megaline's goal isn't to maximize some abstract score — it's to recommend the right plan so customers stay satisfied and don't churn. That framing shaped every decision I made, including which metrics to focus on and which ones to deprioritize.
What I Learned
This was my first time making deliberate, justified decisions about which metric to optimize — and why accuracy + precision made more sense than F1 for this specific business case. That kind of reasoning — metric selection tied to real business impact — is something I now apply to every project.
What You'll Learn from This
- Why tree-based models don't need feature scaling — and the specific reason distance-based algorithms do
- How to choose between accuracy, precision, recall, and F1 based on the actual business question
- What happens when you increase `n_estimators` from 100 to 10,000 — and when that matters
- How to structure a clean train/validation/test workflow so test results are genuinely unbiased
- Why Random Forest consistently outperforms Decision Tree — and what the tradeoffs are
Key Takeaways
- Random Forest (10,000 trees) achieved 81.8% accuracy on the final test set — exceeding the ≥ 75% target by 6.8 percentage points
- Feature scaling was deliberately skipped — tree algorithms split on thresholds, not distances; scaling would add complexity with zero benefit
- Accuracy and precision were the right metrics here — a missed Ultra recommendation is recoverable; a wrong recommendation damages trust
- Increasing trees from 100 → 10,000 improved Random Forest precision by 2.3 percentage points — diminishing returns, but meaningful for this use case
- Random Forest outperformed Decision Tree on every metric across every experiment
The Dataset
Megaline's usage history: 3,214 customers, 5 columns. All four features captured behavioral signals — how often customers called, how long they talked, how many texts they sent, how much data they consumed. The target column — `is_ultra` — indicated which plan the customer was actually on that month.
| Column | Description | Range (approx.) |
|---|---|---|
| calls | Number of calls per month | 0 – 244 |
| minutes | Total call duration (min) | 0 – 1,632 |
| messages | Number of texts | 0 – 224 |
| mb_used | Data used (MB) | 0 – 49,745 |
| is_ultra | Target: Ultra=1, Smart=0 | — |
No missing values. No duplicate rows. No encoding needed — all features were already numeric. This was as clean as a dataset gets, which meant the focus was entirely on modeling decisions.
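Those cleanliness claims are cheap to verify in code. A minimal sketch of the checks, run here on a tiny synthetic sample with the same schema (the real checks ran on the full `users_behavior.csv`; the sample values below are illustrative):

```python
import pandas as pd

# Tiny stand-in with the same schema as users_behavior.csv
sample = pd.DataFrame({
    'calls':    [40, 85, 77],
    'minutes':  [311.90, 516.75, 467.66],
    'messages': [83, 56, 86],
    'mb_used':  [19915.42, 22696.96, 21060.45],
    'is_ultra': [0, 0, 0],
})

print(sample.isna().sum().sum())       # 0 → no missing values
print(sample.duplicated().sum())       # 0 → no duplicate rows
print(sample.dtypes.apply(pd.api.types.is_numeric_dtype).all())  # True → no encoding needed
```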
My Process
Phase 1
Import Libraries & Load Data
Loaded the dataset and imported the tools needed: pandas for data handling, and scikit-learn's classifiers, splitters, and metrics for the full ML workflow.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix
)

df = pd.read_csv('/datasets/users_behavior.csv')
```
Phase 2
Exploratory Data Analysis
Explored the dataset with `df.head()`, `df.info()`, and `df.describe()`. Three things stood out immediately:
- 3,214 records — no missing values in any column
- All features are numeric — no encoding required
- `mb_used` ranges 0–49,745 while `calls` ranges 0–244 — a scale difference worth investigating
```python
print("Shape:", df.shape)  # (3214, 5)
df.info()                  # all non-null, float64 + int64
df.describe()              # reveals mb_used >> other features

print(df['is_ultra'].value_counts(normalize=True))
# 0 (Smart): ~69.4%
# 1 (Ultra): ~30.6%
```
Phase 3
Feature Scaling Consideration
The scale difference between `mb_used` (~0–49,745) and `calls` (~0–244) raised a question: does this require feature scaling?
The answer depends on the algorithm. Scale-sensitive methods are affected by feature scales — KNN and SVMs because they compute distances between data points, logistic regression and neural networks because large-scale features dominate gradient updates and regularization. Tree-based methods are not: Decision Trees and Random Forests make binary splits on individual feature thresholds, so a feature's scale has no effect on where those splits land.
```python
# Feature value ranges — scale difference is significant
print(df[['calls', 'minutes', 'messages', 'mb_used']].describe().loc[['min', 'max']])

# Conclusion: tree algorithms split on thresholds, not distances.
# Feature scaling is NOT required for Decision Tree or Random Forest.
# Skipping StandardScaler — would add complexity with zero benefit here.
```
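The invariance claim can be demonstrated directly. A small sketch on synthetic data (not Megaline's): a tree fit on raw features and a tree fit on standardized features produce identical predictions, because standardization is a monotone per-feature transform that preserves the order of values — and therefore the available split points.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
# Two features on wildly different scales, mimicking calls vs. mb_used
X = np.column_stack([
    rng.integers(0, 244, 500),     # calls-like: 0–244
    rng.uniform(0, 49745, 500),    # mb_used-like: 0–49,745
])
y = ((X[:, 1] > 25000) ^ (rng.random(500) < 0.1)).astype(int)  # noisy target

tree_raw = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X, y)

X_scaled = StandardScaler().fit_transform(X)
tree_scaled = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_scaled, y)

pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)
print((pred_raw == pred_scaled).all())  # True — scaling changed nothing
```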
What I Learned
Before this project, I knew feature scaling as a "step you do before modeling." After this project, I understood why — and more importantly, when not to. Making a justified decision to skip a common preprocessing step is more valuable than following a checklist blindly.
Phase 4
Model Training & Validation
A stratified 60/20/20 split kept the ~30.6% Ultra class ratio consistent across train, validation, and test sets. The model had not seen the test set at any point during this phase.
```python
X = df.drop('is_ultra', axis=1)
y = df['is_ultra']

# Step 1: separate the test set (20%), preserving the ~30.6% Ultra ratio
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: split the remainder into train (60%) and validation (20%)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

# Result: 1,928 train / 643 valid / 643 test
```
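The arithmetic behind the two-step split is easy to miss: `test_size=0.25` of the remaining 80% is 20% of the total, which is what yields 60/20/20. A quick check on an index array the same size as the dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

idx = np.arange(3214)  # one entry per customer

temp, test = train_test_split(idx, test_size=0.2, random_state=42)
train, valid = train_test_split(temp, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2

print(len(train), len(valid), len(test))  # 1928 643 643
```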
I ran two validation attempts. The first established baseline performance; the second tuned hyperparameters to push further. The goal here wasn't to pass the target — it was to understand how much each change actually moved the needle:
Validation Attempt 1 — Baseline:
```python
model_dt_v1 = DecisionTreeClassifier(
    max_depth=6, min_samples_split=20, min_samples_leaf=10,
    max_features='sqrt', random_state=42
)
model_rf_v1 = RandomForestClassifier(
    n_estimators=100, max_depth=10, min_samples_split=5,
    max_features='sqrt', random_state=42
)

# Decision Tree: Accuracy 76.5%, Precision 67.8%
# Random Forest: Accuracy 79.5%, Precision 71.4%
```
Validation Attempt 2 — Tuned: Increased `max_depth` for the Decision Tree; boosted Random Forest to 10,000 trees with more conservative splits:
```python
model_dt_v2 = DecisionTreeClassifier(
    max_depth=10, min_samples_split=10, min_samples_leaf=5,
    max_features='sqrt', random_state=42
)
model_rf_v2 = RandomForestClassifier(
    n_estimators=10000,  # 100x more trees
    max_depth=10, min_samples_split=5, min_samples_leaf=2,
    max_features='sqrt', random_state=42
)

# Decision Tree: Accuracy 76.5% (no change), Precision 64.5% (-3.3)
# Random Forest: Accuracy 79.9% (+0.4), Precision 73.7% (+2.3)
```
| Model | Accuracy | Precision | vs. Target (≥ 75%) |
|---|---|---|---|
| Decision Tree — Baseline | 76.5% | 67.8% | PASSES ✓ |
| Random Forest — 100 trees | 79.5% | 71.4% | PASSES ✓ |
| Decision Tree — Tuned | 76.5% | 64.5% | PASSES ✓ |
| Random Forest — 10,000 trees | 79.9% | 73.7% | PASSES ✓ |
Phase 4 — Key Decision
Why Accuracy and Precision — Not F1
To understand the rationale, consider the question each metric is actually answering:
- Accuracy: "How often does the model recommend the right plan across all my customers?"
- Precision: "As a customer, when the model recommends Ultra — how much can I trust that?"
- Recall: "Of all customers who should be on Ultra, how many did we actually catch?"
- F1: "What's the balanced score between Precision and Recall?"
For a plan recommendation system, missing an Ultra recommendation is not catastrophic — a customer can upgrade later when they need more data, and the business does not lose them over it. Recommending the wrong plan, however, damages trust. That's why Accuracy and Precision are the right metrics here: Recall and F1 matter less when a missed prediction is recoverable.
Final Test Results
The model had not seen the test set at any point — no training, no validation, no tuning. This is what makes the result meaningful: an unbiased estimate of real-world performance. Selected model: Random Forest with 10,000 trees.
```python
y_test_pred = model_rf_v2.predict(X_test)

final_acc  = accuracy_score(y_test, y_test_pred)   # 0.818
final_prec = precision_score(y_test, y_test_pred)  # 0.775
final_rec  = recall_score(y_test, y_test_pred)     # 0.532
final_f1   = f1_score(y_test, y_test_pred)         # 0.631

# Confusion Matrix:
# [[426  29]
#  [ 88 100]]
# True Negatives  (Smart → Smart): 426
# False Positives (Smart → Ultra):  29
# False Negatives (Ultra → Smart):  88
# True Positives  (Ultra → Ultra): 100
```
| Metric | Score | Notes |
|---|---|---|
| Accuracy (Test) | 81.8% | Target ≥ 75% — PASSED ✓ (+6.8 pp) |
| Precision (Test) | 77.5% | When model says Ultra, it's right 77.5% of the time |
| Recall (Test) | 53.2% | Expected — deprioritized for this business case |
| Accuracy (Validation) | 79.9% | Generalization gap: +1.9 pp — model improved on test |
The model actually performed better on the test set than on validation — 81.8% vs. 79.9%. This is rare, and it suggests the train/validation split may have landed slightly tougher examples in the validation set. Either way, the test result is what matters for deployment decisions.
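Every test metric above can be recomputed by hand from the reported confusion matrix — a quick sanity check on what each metric actually measures:

```python
# Counts from the test-set confusion matrix reported above
tn, fp, fn, tp = 426, 29, 88, 100

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # right plan, either class
precision = tp / (tp + fp)                    # trust in an "Ultra" recommendation
recall    = tp / (tp + fn)                    # Ultra customers actually caught
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.818 0.775 0.532 0.631 — matches the reported scores
```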
Main Takeaways
- Random Forest consistently dominated. It outperformed Decision Tree on every metric across every experiment — baseline, tuned, and test.
- More trees helped — up to a point. Going from 100 → 10,000 trees improved precision by 2.3 percentage points and accuracy by 0.4. Real gains, though diminishing — there's a compute cost to weigh against the marginal improvement.
- Feature scaling is algorithm-specific. Skipping StandardScaler was the correct decision for tree-based methods — and it was a decision I made deliberately, not by accident.
- Metric selection is a business decision, not a default. Choosing accuracy + precision over F1 required understanding what "wrong" and "missed" actually cost in this context.
- Clean data doesn't mean easy modeling. With no missing values and no encoding needed, every modeling decision stood on its own — no preprocessing noise to hide behind.
Conclusion & Reflections
This was Sprint 8 — my first ML project. Looking back, what I'm most proud of isn't the 81.8% accuracy. It's the decision-making process that got there: the deliberate choice to skip feature scaling, the justification for accuracy over F1, the structured comparison across two validation attempts before touching the test set.
In a real deployment, a model like this could run inside Megaline's CRM to flag legacy-plan customers with a recommended upgrade. With 77.5% precision, nearly 4 out of 5 customers flagged for Ultra actually belong there — a reliable enough signal to drive outreach campaigns.
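A hypothetical sketch of what that CRM-side usage might look like — the `legacy` table, the synthetic labels, and the 0.8 confidence cutoff are all illustrative assumptions, not part of Megaline's systems or this project:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for legacy-plan customers pulled from the CRM
legacy = pd.DataFrame({
    'calls':    rng.integers(0, 244, 300),
    'minutes':  rng.uniform(0, 1632, 300),
    'messages': rng.integers(0, 224, 300),
    'mb_used':  rng.uniform(0, 49745, 300),
})

# Stand-in for the trained model_rf_v2, fit on synthetic labels so the
# sketch is self-contained
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(legacy, (legacy['mb_used'] > 25000).astype(int))

p_ultra = model.predict_proba(legacy)[:, 1]  # P(Ultra) per customer
flagged = legacy[p_ultra >= 0.8]             # keep only high-confidence flags
print(f"{len(flagged)} of {len(legacy)} legacy customers flagged for Ultra outreach")
```

Thresholding the predicted probability, rather than using raw `predict`, is one way to trade recall for even higher precision in an outreach campaign.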
Growth from Sprint 8 → Sprint 9
Sprint 8 taught me the fundamentals of a clean ML workflow. By Sprint 9, I was handling class imbalance, running GridSearchCV over 108 parameter combinations, and validating data missingness with a formal MAR test before imputing. The habits built here — methodical splitting, careful metric selection, explicit before/after comparisons — carried forward into every project since.
| Project Requirement | Status |
|---|---|
| Accuracy ≥ 75% on test set | ACHIEVED — 81.8% (+6.8 pp margin) ✓ |
| Train / validation / test split used | YES — stratified 60/20/20 ✓ |
| Multiple models evaluated | YES — Decision Tree + Random Forest, 2 rounds ✓ |
| Feature scaling decision documented | YES — justified skip for tree-based methods ✓ |
| Metric selection justified | YES — Accuracy + Precision over F1 ✓ |
Want to Explore the Full Code?
The complete notebook — all phases, both validation attempts, final test results — is on GitHub.