Instacart Customer Behavior EDA
01

Problem

A grocery delivery platform with 206,000 customers and 4.5 million order-product records cannot improve what it does not understand. Instacart's core business question — what do customers buy, when, and how often — sounds simple until the raw data reveals three distinct quality problems that must be resolved before any analysis is trustworthy. Missing values, implicit duplicates, and inconsistent representations of the same logical concept each require a different fix, and applying the wrong fix introduces bias into every downstream conclusion. The real challenge of this project is not generating interesting charts — it is building the discipline to clean data systematically before a single finding is surfaced. Getting that discipline right on a real-world dataset at scale is what separates exploratory work that informs decisions from exploratory work that misleads them.


02

Solution

This project worked through Instacart's publicly released 2017 Kaggle dataset — five interrelated CSV files totaling 4.5 million order-product records — following a full EDA workflow: load, inspect, preprocess, analyze, and visualize. Three preprocessing issues were identified and resolved before analysis began, each addressed with a documented rationale rather than a default fix. The behavioral analysis that followed surfaced concrete, actionable patterns: peak shopping hours cluster around late morning and early afternoon, weekly reorder rhythms concentrate on Sunday and Monday, and the most reordered items (bananas, strawberries, baby spinach) reveal a consistent preference for fresh produce over packaged goods. Understanding the customer's shopping cadence and product preferences gives Instacart's product and merchandising teams a foundation for personalization, inventory planning, and promotional timing. Treating the dataset as a business problem rather than a technical exercise shaped every decision — from which columns to inspect first to which visualizations would be most useful to a non-technical stakeholder.


03

Skills Acquired

What that discipline actually looks like in practice — and what the data revealed once it was trustworthy — is what the rest of this writeup covers.


04

Deep Dive

Bananas. The single most reordered grocery item in the Instacart dataset — 557,763 reorders across 206,209 customers. Not protein powder, not premium coffee, not organic kale. Bananas. And the data makes the pattern impossible to ignore once you start looking.

This project was my first real encounter with messy, real-world data at scale. Five CSV files, 4.5 million order-product records, and a mandate to clean everything before drawing a single conclusion. The kind of work that separates someone who can run a notebook from someone who can actually trust their results.

Sprint 2 Purpose: conduct an Exploratory Data Analysis (EDA), building data-cleaning and analysis skills against an actual dataset from Instacart, a grocery delivery platform.
This was Sprint 2 of my TripleTen AI and Machine Learning Bootcamp — my first project using the full EDA workflow: load, inspect, preprocess, analyze, visualize. The dataset was publicly released by Instacart in 2017 for a Kaggle competition. The goal: understand customer shopping behavior well enough to surface actionable insights for the business.

Why This Project?

Sprint 2 introduced two skills that underpin everything else in data science: cleaning data you didn't collect and asking the right question before touching the analysis. The Instacart dataset is large enough that lazy preprocessing produces subtly wrong results — and small enough that careful work is actually feasible.

The business framing mattered too. Instacart's goal isn't to run notebooks — it's to increase revenue. Every question I asked (when do people shop, what do they reorder, how long between orders) maps to something a product or operations team would genuinely act on: push notification timing, personalized reorder reminders, inventory optimization.

What I Learned

This was my first time managing project workflow with Jira — using board and list views to track each step. I also used groupby(), duplicated(), drop_duplicates(), unique(), and nunique() in anger for the first time. The discipline of loading libraries and datasets only once, structuring notebooks into logical sections, and commenting every analytical decision came from code review feedback in this sprint.

Working with Jira in Sprint 2 — Board and List views

Jira Board & List views used to track each sprint step from data loading through final feedback.


The Datasets

Five CSV files from Instacart's 2017 Kaggle release. Together they describe what customers ordered, when, and in what sequence — at a scale that forces systematic preprocessing.

Dataset            Rows       Columns  Key Fields
instacart_orders   478,967    6        order_id, user_id, order_dow, order_hour_of_day, days_since_prior_order
products           49,694     4        product_id, product_name, aisle_id, department_id
order_products     4,545,007  4        order_id, product_id, add_to_cart_order, reordered
aisles             134        2        aisle_id, aisle
departments        21         2        department_id, department

Project Initialization

The notebook began by loading all required libraries and importing the five datasets in a single initialization cell — a discipline reinforced during Sprint 2 to keep dependencies explicit and reproducible.

# Load libraries
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, HTML  # IPython.core.display is deprecated

# Load each dataset exactly once, driven by a single dict of paths
file_paths = {
    "Orders":         "/datasets/instacart_orders.csv",
    "Products":       "/datasets/products.csv",
    "Aisles":         "/datasets/aisles.csv",
    "Departments":    "/datasets/departments.csv",
    "Order_Products": "/datasets/order_products.csv",
}
dfs = {name: pd.read_csv(path, sep=';') for name, path in file_paths.items()}

df_orders         = dfs["Orders"]
df_products       = dfs["Products"]
df_aisles         = dfs["Aisles"]
df_departments    = dfs["Departments"]
df_order_products = dfs["Order_Products"]

# Quick structural inspection of every table
for name, df in dfs.items():
    print(f"Dataset: {name}")
    df.info(show_counts=True)
    print("\n" * 2)

Phase 1 — Data Preprocessing

Before any analysis, every table needed to pass a two-part check: no duplicate rows, no unexpected missing values. The results were more interesting than a clean dataset would suggest.

Orders Table

15 Duplicate Rows Removed

The orders table contained 15 fully duplicate rows — same order ID, same user, same timestamp. After drop_duplicates(), a double-check with duplicated().any() confirmed the table was clean. The days_since_prior_order column had 28,819 nulls, all correctly corresponding to first-time orders where no prior order exists.
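The duplicate check followed the same shape for every table. A minimal sketch on toy data (the rows and counts here are illustrative, not the real 478,967-row table):

```python
import pandas as pd

# Toy stand-in for the orders table loaded earlier with pd.read_csv(..., sep=';')
df_orders = pd.DataFrame({
    'order_id': [1, 1, 2, 3],
    'user_id': [10, 10, 11, 12],
    'days_since_prior_order': [None, None, 7.0, 30.0],
})

n_dupes = int(df_orders.duplicated().sum())     # fully identical rows only
df_orders = df_orders.drop_duplicates().reset_index(drop=True)
clean = not df_orders.duplicated().any()        # double-check the table is clean

# Nulls here should line up one-to-one with first-time orders
n_first_orders = int(df_orders['days_since_prior_order'].isna().sum())
```

Note that `duplicated()` with no arguments compares entire rows, so it only flags exact copies — partial matches (same order_id, different timestamp) would need a `subset=` argument.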

Products Table

1,258 Missing Product Names → Filled with "Unknown"

All 1,258 missing product names mapped exclusively to aisle_id = 100 and department_id = 21 — both labeled "missing" in their respective lookup tables. This wasn't data entry noise; it was a structural pattern in how Instacart handled unclassified items. The right call was filling with 'Unknown' rather than dropping rows, preserving the order-product relationships downstream.
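The verify-then-fill step can be sketched as follows, on a toy frame that mirrors the products schema (IDs and names here are illustrative):

```python
import pandas as pd

# Toy stand-in for the products table; aisle 100 / department 21 are the
# "missing" categories described above
df_products = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Banana', None, None],
    'aisle_id': [24, 100, 100],
    'department_id': [4, 21, 21],
})

# Confirm every missing name maps to the "missing" aisle/department pair
missing = df_products['product_name'].isna()
structural = (
    (df_products.loc[missing, 'aisle_id'] == 100).all()
    and (df_products.loc[missing, 'department_id'] == 21).all()
)

# Fill rather than drop, preserving order-product joins downstream
df_products['product_name'] = df_products['product_name'].fillna('Unknown')
```

The `structural` check is what justifies the fill: if any missing name had fallen outside aisle 100 / department 21, dropping or investigating row by row would have been the safer call.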

Duplicate product names also appeared — 1,465 rows with names that matched another row after lowercasing. These were kept because they had distinct product_id values, indicating genuinely different SKUs with similar names (e.g., store-brand vs. name-brand variants).
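The case-insensitive duplicate scan works like this sketch (toy names and IDs, assumed schema as above):

```python
import pandas as pd

# Toy products frame: names differ only by case, IDs differ — the pattern above
df_products = pd.DataFrame({
    'product_id': [10, 11, 12],
    'product_name': ['Green Tea', 'green tea', 'Black Tea'],
})

lowered = df_products['product_name'].str.lower()
# keep=False flags every member of a duplicate group, not just the repeats
dupe_names = df_products[lowered.duplicated(keep=False)]

# Distinct product_id values → treat as separate SKUs and keep all rows
distinct_skus = dupe_names['product_id'].nunique() == len(dupe_names)
```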

Order Products Table

836 Missing Cart Position Values — Left as NaN

The add_to_cart_order column had 836 missing values. Investigating the affected orders revealed they all contained 65 or more items — the minimum affected order had 65 items, the maximum had 127. The most plausible explanation: Instacart's system only tracked cart-add position up to a certain threshold. Filling with a sentinel value like 999 would pollute any cart-position analysis; leaving the nulls preserves analytical integrity.

# Verify all missing cart positions come from large orders
orders_with_missing = df_order_products.loc[
    df_order_products['add_to_cart_order'].isna(), 'order_id'
].unique()

order_counts = df_order_products['order_id'].value_counts()
missing_order_counts = order_counts.loc[orders_with_missing]

# All affected orders have > 64 items — confirmed
print((missing_order_counts > 64).all())  # True
print(missing_order_counts.min())          # 65

Phase 2 — Behavioral Analysis

With clean data, four core questions drove the analysis — each with a direct business application for Instacart's product and ops teams.

Question A

When Do People Shop?

Orders concentrated heavily between 9 AM and 4 PM, peaking at 10 AM with 40,578 orders. The pre-dawn hours (2–4 AM) saw fewer than 1,000 orders each. Wednesday and Saturday showed nearly identical hourly distributions — both peaking between 9 AM and 5 PM, with the busiest single hour exceeding 5,000 orders.

Hour    Orders   Hour    Orders
6 AM    4,215    12 PM   38,034
7 AM    13,043   1 PM    39,007
8 AM    25,024   2 PM    39,631
9 AM    35,896   3 PM    39,789
10 AM   40,578   4 PM    38,112
11 AM   40,032   5 PM    31,930
Bar chart: Grocery Orders by Hour of Day

Orders by hour of day — peak volume concentrates between 10 AM and 4 PM.
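The hourly tally behind that chart reduces to a `value_counts` over `order_hour_of_day`. A sketch on toy rows (the real peak of 40,578 at 10 AM comes from the full table):

```python
import pandas as pd

# Toy orders frame with the same order_hour_of_day column as the real table
df_orders = pd.DataFrame({
    'order_id': range(6),
    'order_hour_of_day': [10, 10, 9, 14, 10, 2],
})

hourly = df_orders['order_hour_of_day'].value_counts().sort_index()
peak_hour = hourly.idxmax()   # 10 in this toy data, matching the real 10 AM peak
```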

Overlapping histogram: Orders by Hour on Wednesday vs Saturday

Wednesday vs Saturday hourly distributions — nearly identical shapes, both peaking mid-morning.
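The overlapping histogram is two filtered hour columns drawn with transparency. A minimal sketch — note the `order_dow` mapping (0 = Sunday, so Wednesday = 3, Saturday = 6) is an assumption about the Kaggle release's undocumented convention:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Toy orders frame; dow mapping assumed: 0=Sunday, 3=Wednesday, 6=Saturday
df_orders = pd.DataFrame({
    'order_dow':         [3, 3, 6, 6, 3, 6],
    'order_hour_of_day': [9, 10, 10, 15, 14, 9],
})

wed = df_orders.loc[df_orders['order_dow'] == 3, 'order_hour_of_day']
sat = df_orders.loc[df_orders['order_dow'] == 6, 'order_hour_of_day']

# alpha=0.5 lets both distributions show through where they overlap
fig, ax = plt.subplots()
ax.hist(wed, bins=range(25), alpha=0.5, label='Wednesday')
ax.hist(sat, bins=range(25), alpha=0.5, label='Saturday')
ax.legend()
plt.close(fig)
```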

Question B

What Day Do People Shop Most?

Sunday and Monday drove the highest order volumes — 84,090 and 82,185 orders respectively. Mid-week (Wednesday–Thursday) was consistently the slowest, with Thursday seeing only 59,810 orders. This bimodal weekly pattern — heavy at the start, lighter mid-week, recovering toward the weekend — suggests Instacart users plan grocery orders around their weekly routines rather than shopping spontaneously.

Day        Orders
Sunday     84,090
Monday     82,185
Tuesday    65,833
Wednesday  60,897
Thursday   59,810
Friday     63,488
Saturday   62,649
Bar chart: Grocery Shopping by Day of the Week

Sunday and Monday lead order volume; mid-week (Wednesday–Thursday) is slowest.

Question C

How Long Between Orders?

Two spikes dominated the reorder gap distribution: 7 days (44,577 orders) and 30 days (51,337 orders — the single largest bucket). Weekly shoppers and monthly shoppers are clearly distinct behavioral segments. The days between these spikes showed steady decline, suggesting most users commit to a cadence rather than ordering ad hoc.

The practical implication: push notification timing and email reminders should be calibrated to these two rhythms. A generic "you haven't ordered in a while" message at day 14 misses both peaks.

Bar chart: Number of Orders by Days Since Prior Order

Reorder gap distribution — two clear spikes at 7 days (weekly shoppers) and 30 days (monthly shoppers).
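Finding those two spikes is a frequency count over `days_since_prior_order` followed by picking the largest buckets. A toy sketch (the gap values here are illustrative, chosen to mirror the 7-day and 30-day peaks):

```python
import pandas as pd

# Toy stand-in for the days_since_prior_order column (first orders are
# null in the real data and drop out of value_counts automatically)
gaps = pd.Series([7, 7, 7, 30, 30, 30, 30, 14, 3])

gap_counts = gaps.value_counts().sort_index()
top_two = gap_counts.nlargest(2).index.tolist()   # the two dominant cadences
```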

Question D

Top Reordered Products

Merging the order_products and products tables, then aggregating by product_id on reordered = 1, surfaced the top 20 most-reordered items. The list is almost entirely fresh produce — a signal that Instacart's stickiest customers are buying perishables on a schedule, not stocking up on pantry goods.

Rank  Product                 Product ID  Reorders
1     Banana                  24852       557,763
2     Bag of Organic Bananas  13176       444,450
3     Organic Strawberries    21137       286,239
4     Organic Baby Spinach    21903       262,233
5     Organic Hass Avocado    47209       236,229
6     Organic Avocado         47766       187,743
7     Organic Whole Milk      27845       162,251
8     Large Lemon             47626       150,244
9     Organic Raspberries     27966       147,248
10    Strawberries            16797       139,245
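The merge-and-aggregate step behind this ranking can be sketched on toy versions of the two tables (three products standing in for the full catalog):

```python
import pandas as pd

# Toy stand-ins for order_products and products, same key columns as the real tables
df_order_products = pd.DataFrame({
    'order_id':   [1, 1, 2, 2, 3],
    'product_id': [24852, 21137, 24852, 21903, 24852],
    'reordered':  [1, 1, 1, 0, 1],
})
df_products = pd.DataFrame({
    'product_id':   [24852, 21137, 21903],
    'product_name': ['Banana', 'Organic Strawberries', 'Organic Baby Spinach'],
})

# Filter to reorders, attach names, count per product, keep the top 20
top = (
    df_order_products.query('reordered == 1')
    .merge(df_products, on='product_id')
    .groupby(['product_id', 'product_name'], as_index=False)
    .size()
    .sort_values('size', ascending=False)
    .head(20)
)
```

Filtering on `reordered == 1` before the merge keeps the join small; on the real 4.5M-row table the order of those two steps matters for memory.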

Key Takeaways

- Peak ordering concentrates at 10 AM, with Sunday and Monday the busiest days of the week.
- Reorder gaps cluster at 7 and 30 days — weekly and monthly shoppers are distinct behavioral segments.
- The most reordered items are overwhelmingly fresh produce, led by bananas at 557,763 reorders.
- Each preprocessing fix (duplicate orders, missing product names, structural cart-position nulls) required a documented rationale, not a default.


Conclusion & Reflections

This project confirmed something that's easy to underestimate early in a data career: preprocessing isn't setup — it's analysis. The decisions I made about how to handle missing product names, duplicate orders, and structural nulls in the cart position column directly shaped every conclusion downstream. Getting those decisions wrong wouldn't have produced obvious errors; it would have produced subtly biased results that looked fine.

The behavioral patterns themselves were clear and actionable. But the harder skill — knowing why data is missing before deciding what to do about it — is what made the analysis trustworthy.

Growth into Sprint 3+

Sprint 2 built the preprocessing habits that every subsequent project depended on. By Sprint 6 (Video Game Sales), I was handling missing scores and platform lifecycles with the same systematic approach. By Sprint 9 (Churn Prediction), the same discipline applied to feature engineering and class imbalance. The fundamentals compound.

Project Requirement                                                 Status
Libraries and datasets loaded once at notebook start                DONE ✓
Duplicate rows found and removed in all 5 tables                    DONE — 15 duplicate orders removed ✓
Missing values identified and handled with justification            DONE — 1,258 product names, 836 cart positions ✓
[A] Easy: hour/day/reorder-gap analyses with visualizations         DONE — bar charts for all three ✓
[B] Medium: Wed vs Sat histogram, orders/customer, top 20 products  DONE ✓
[C] Hard: order size distribution, top 20 reordered items           DONE ✓
Project conclusion with business framing                            DONE ✓

Want to See the Full Notebook?

The complete code — preprocessing, all visualizations, and behavioral analysis — is on GitHub.