Instacart Customer Behavior EDA

01

Problem

A grocery delivery platform with 206,000 customers and 4.5 million order-product records cannot improve what it does not understand. Instacart's core business question — what do customers buy, when, and how often — sounds simple until the raw data reveals three distinct quality problems that must be resolved before any analysis is trustworthy. Missing values, implicit duplicates, and inconsistent representations of the same logical concept each require a different fix, and applying the wrong fix introduces bias into every downstream conclusion. The real challenge of this project is not generating interesting charts — it is building the discipline to clean data systematically before a single finding is surfaced. Getting that discipline right on a real-world dataset at scale is what separates exploratory work that informs decisions from exploratory work that misleads them.

02

Solution

This project worked through Instacart's publicly released 2017 Kaggle dataset — five interrelated CSV files totaling 4.5 million order-product records — following a full EDA workflow: load, inspect, preprocess, analyze, and visualize. Three preprocessing issues were identified and resolved before analysis began, each addressed with a documented rationale rather than a default fix. The behavioral analysis that followed surfaced concrete, actionable patterns: peak shopping hours cluster around late morning and early afternoon, weekly reorder rhythms concentrate on Sunday and Monday, and the most reordered items (bananas, strawberries, baby spinach) reveal a consistent preference for fresh produce over packaged goods. Understanding the customer's shopping cadence and product preferences gives Instacart's product and merchandising teams a foundation for personalization, inventory planning, and promotional timing. Treating the dataset as a business problem rather than a technical exercise shaped every decision — from which columns to inspect first to which visualizations would be most useful to a non-technical stakeholder.

03

Skills Acquired

Python — the implementation language for all five phases of the project: data loading, inspection, preprocessing, behavioral analysis, and visualization.
Pandas — the primary data manipulation library. Five separate CSV files were loaded, inspected for quality issues, merged on shared keys, and analyzed as a unified dataset. Pandas was used for missing value detection, deduplication, column-level filtering, groupby aggregations, and value counts that produced every behavioral finding in the project.
Matplotlib — used to produce bar charts and histograms visualizing order frequency by hour of day, day of week, and days-since-prior-order, and to surface the top reordered products. Clean, readable charts that can be presented to a non-technical audience are an explicit deliverable of the EDA workflow.
Jupyter — the development environment that combined executable Python code with markdown narrative, making the full analysis reproducible and self-documenting. Jupyter's cell-by-cell structure mirrors the step-by-step EDA methodology.
EDA (Exploratory Data Analysis) — the structured methodology applied throughout: inspect every column, identify data quality issues, resolve them with documented reasoning, then generate and test questions against the cleaned data. EDA is not a single technique but a discipline — the outputs are only as reliable as the preprocessing that precedes them.

What that discipline actually looks like in practice — and what the data revealed once it was trustworthy — is what the rest of this writeup covers.

04

Deep Dive

Bananas. The single most reordered grocery item in the Instacart dataset — 557,763 reorders across 206,209 customers. Not protein powder, not premium coffee, not organic kale. Bananas. And the data makes the pattern impossible to ignore once you start looking.

This project was my first real encounter with messy, real-world data at scale. Five CSV files, 4.5 million order-product records, and a mandate to clean everything before drawing a single conclusion. The kind of work that separates someone who can run a notebook from someone who can actually trust their results.

Sprint 2 Purpose: The purpose of Sprint 2 is to conduct an Exploratory Data Analysis (EDA) by developing the skills of cleaning and analyzing data with an actual dataset from Instacart, a grocery delivery platform.

This was Sprint 2 of my TripleTen AI and Machine Learning Bootcamp — my first project using the full EDA workflow: load, inspect, preprocess, analyze, visualize. The dataset was publicly released by Instacart in 2017 for a Kaggle competition. The goal: understand customer shopping behavior well enough to surface actionable insights for the business.

Why This Project?

Sprint 2 introduced two skills that underpin everything else in data science: cleaning data you didn't collect and asking the right question before touching the analysis. The Instacart dataset is large enough that lazy preprocessing produces subtly wrong results — and small enough that careful work is actually feasible.

The business framing mattered too. Instacart's goal isn't to run notebooks — it's to increase revenue. Every question I asked (when do people shop, what do they reorder, how long between orders) maps to something a product or operations team would genuinely act on: push notification timing, personalized reorder reminders, inventory optimization.

What I Learned

This was my first time managing project workflow with Jira — using board and list views to track each step. I also used groupby(), duplicated(), drop_duplicates(), unique(), and nunique() in anger for the first time. The discipline of loading libraries and datasets only once, structuring notebooks into logical sections, and commenting every analytical decision came from code review feedback in this sprint.

Working with Jira in Sprint 2 — Board and List views

Jira Board & List views used to track each sprint step from data loading through final feedback.

The Datasets

Five CSV files from Instacart's 2017 Kaggle release. Together they describe what customers ordered, when, and in what sequence — at a scale that forces systematic preprocessing.

Dataset	Rows	Columns	Key Fields
instacart_orders	478,967	6	`order_id`, `user_id`, `order_dow`, `order_hour_of_day`, `days_since_prior_order`
products	49,694	4	`product_id`, `product_name`, `aisle_id`, `department_id`
order_products	4,545,007	4	`order_id`, `product_id`, `add_to_cart_order`, `reordered`
aisles	134	2	`aisle_id`, `aisle`
departments	21	2	`department_id`, `department`

Project Initialization

The notebook began by loading all required libraries and importing the five datasets in a single initialization cell — a discipline reinforced during Sprint 2 to keep dependencies explicit and reproducible.

# Upload Libraries
import pandas as pd
import matplotlib.pyplot as plt
from IPython.core.display import display, HTML

# Upload Datasets
df_orders        = pd.read_csv('/datasets/instacart_orders.csv',  sep=';')
df_products      = pd.read_csv('/datasets/products.csv',          sep=';')
df_aisles        = pd.read_csv('/datasets/aisles.csv',            sep=';')
df_departments   = pd.read_csv('/datasets/departments.csv',       sep=';')
df_order_products = pd.read_csv('/datasets/order_products.csv',   sep=';')

# Define different paths to set up the loop in the next code block
file_paths = {
    "Orders":         "/datasets/instacart_orders.csv",
    "Products":       "/datasets/products.csv",
    "Aisles":         "/datasets/aisles.csv",
    "Departments":    "/datasets/departments.csv",
    "Order_Products": "/datasets/order_products.csv"
}
dfs = {name: pd.read_csv(path, sep=';') for name, path in file_paths.items()}
for name, df in dfs.items():
    print(f"Dataset: {name}")
    df.info(show_counts=True)
    print("\n" * 2)

Phase 1 — Data Preprocessing

Before any analysis, every table needed to pass a two-part check: no duplicate rows, no unexpected missing values. The results were more interesting than a clean dataset would suggest.

Orders Table

15 Duplicate Rows Removed

The orders table contained 15 fully duplicate rows — same order ID, same user, same timestamp. After drop_duplicates(), a double-check with duplicated().any() confirmed the table was clean. The days_since_prior_order column had 28,819 nulls, all correctly corresponding to first-time orders where no prior order exists.

Products Table

1,258 Missing Product Names → Filled with "Unknown"

All 1,258 missing product names mapped exclusively to aisle_id = 100 and department_id = 21 — both labeled "missing" in their respective lookup tables. This wasn't data entry noise; it was a structural pattern in how Instacart handled unclassified items. The right call was filling with 'Unknown' rather than dropping rows, preserving the order-product relationships downstream.

Duplicate product names also appeared — 1,465 rows with names that matched another row after lowercasing. These were kept because they had distinct product_id values, indicating genuinely different SKUs with similar names (e.g., store-brand vs. name-brand variants).

Order Products Table

836 Missing Cart Position Values — Left as NaN

The add_to_cart_order column had 836 missing values. Investigating the affected orders revealed they all contained 65 or more items — the minimum affected order had 65 items, the maximum had 127. The most plausible explanation: Instacart's system only tracked cart-add position up to a certain threshold. Filling with a sentinel value like 999 would pollute any cart-position analysis; leaving the nulls preserves analytical integrity.

# Verify all missing cart positions come from large orders
orders_with_missing = df_order_products.loc[
    df_order_products['add_to_cart_order'].isna(), 'order_id'
].unique()

order_counts = df_order_products['order_id'].value_counts()
missing_order_counts = order_counts.loc[orders_with_missing]

# All affected orders have > 64 items — confirmed
print((missing_order_counts > 64).all())  # True
print(missing_order_counts.min())          # 65

Phase 2 — Behavioral Analysis

With clean data, four core questions drove the analysis — each with a direct business application for Instacart's product and ops teams.

Question A

When Do People Shop?

Orders concentrated heavily between 9 AM and 4 PM, peaking at 10 AM with 40,578 orders. The pre-dawn hours (2–4 AM) saw fewer than 1,000 orders each. Wednesday and Saturday showed nearly identical hourly distributions — both peak between the 9th and 17th hours, with the busiest single hour exceeding 5,000 orders.

Hour	Orders	Hour	Orders
6 AM	4,215	12 PM	38,034
7 AM	13,043	1 PM	39,007
8 AM	25,024	2 PM	39,631
9 AM	35,896	3 PM	39,789
10 AM	40,578	4 PM	38,112
11 AM	40,032	5 PM	31,930

Bar chart: Grocery Orders by Hour of Day

Orders by hour of day — peak volume concentrates between 10 AM and 4 PM.

Overlapping histogram: Orders by Hour on Wednesday vs Saturday

Wednesday vs Saturday hourly distributions — nearly identical shapes, both peaking mid-morning.

Question B

What Day Do People Shop Most?

Sunday and Monday drove the highest order volumes — 84,090 and 82,185 orders respectively. Mid-week (Wednesday–Thursday) was consistently the slowest, with Thursday seeing only 59,810 orders. This bimodal weekly pattern — heavy at the start, lighter mid-week, recovering toward the weekend — suggests Instacart users plan grocery orders around their weekly routines rather than shopping spontaneously.

Day	Orders
Sunday	84,090
Monday	82,185
Tuesday	65,833
Wednesday	60,897
Thursday	59,810
Friday	63,488
Saturday	62,649

Bar chart: Grocery Shopping by Day of the Week

Sunday and Monday lead order volume; mid-week (Wednesday–Thursday) is slowest.

Question C

How Long Between Orders?

Two spikes dominated the reorder gap distribution: 7 days (44,577 orders) and 30 days (51,337 orders — the single largest bucket). Weekly shoppers and monthly shoppers are clearly distinct behavioral segments. The days between these spikes showed steady decline, suggesting most users commit to a cadence rather than ordering ad hoc.

The practical implication: push notification timing and email reminders should be calibrated to these two rhythms. A generic "you haven't ordered in a while" message at day 14 misses both peaks.

Bar chart: Number of Orders by Days Since Prior Order

Reorder gap distribution — two clear spikes at 7 days (weekly shoppers) and 30 days (monthly shoppers).

Question D

Top Reordered Products

Merging the order_products and products tables, then aggregating by product_id on reordered = 1, surfaced the top 20 most-reordered items. The list is almost entirely fresh produce — a signal that Instacart's stickiest customers are buying perishables on a schedule, not stocking up on pantry goods.

Rank	Product	Product ID	Reorders
1	Banana	24852	557,763
2	Bag of Organic Bananas	13176	444,450
3	Organic Strawberries	21137	286,239
4	Organic Baby Spinach	21903	262,233
5	Organic Hass Avocado	47209	236,229
6	Organic Avocado	47766	187,743
7	Organic Whole Milk	27845	162,251
8	Large Lemon	47626	150,244
9	Organic Raspberries	27966	147,248
10	Strawberries	16797	139,245

Key Takeaways

Peak shopping is 9 AM–4 PM. Order volume drops sharply after 5 PM and stays below 10,000 from 9 PM–6 AM. Scheduling promotions or push notifications outside these hours means reaching users at low-intent moments.
Sunday and Monday are the heavy shopping days. A product or growth team optimizing delivery capacity or promotional spend should weight these two days disproportionately.
Two distinct reorder segments: weekly and monthly. The 7-day and 30-day spikes are not noise — they represent separable customer cohorts with different loyalty and retention dynamics.
Fresh produce dominates reorders. All top 10 reordered items are produce. Instacart's most engaged customers are building a weekly fresh-produce habit, not stocking pantries.
Missing data isn't always random. Both the 1,258 missing product names and the 836 missing cart positions had structural causes identifiable from the data itself. Treating them as random errors would have been wrong.

Conclusion & Reflections

This project confirmed something that's easy to underestimate early in a data career: preprocessing isn't setup — it's analysis. The decisions I made about how to handle missing product names, duplicate orders, and structural nulls in the cart position column directly shaped every conclusion downstream. Getting those decisions wrong wouldn't have produced obvious errors; it would have produced subtly biased results that looked fine.

The behavioral patterns themselves were clear and actionable. But the harder skill — knowing why data is missing before deciding what to do about it — is what made the analysis trustworthy.

Growth into Sprint 3+

Sprint 2 built the preprocessing habits that every subsequent project depended on. By Sprint 6 (Video Game Sales), I was handling missing scores and platform lifecycles with the same systematic approach. By Sprint 9 (Churn Prediction), the same discipline applied to feature engineering and class imbalance. The fundamentals compound.

Project Requirement	Status
Libraries and datasets loaded once at notebook start	DONE ✓
Duplicate rows found and removed in all 5 tables	DONE — 15 duplicate orders removed ✓
Missing values identified and handled with justification	DONE — 1,258 product names, 836 cart positions ✓
[A] Easy: hour/day/reorder-gap analyses with visualizations	DONE — bar charts for all three ✓
[B] Medium: Wed vs Sat histogram, orders/customer, top 20 products	DONE ✓
[C] Hard: order size distribution, top 20 reordered items	DONE ✓
Project conclusion with business framing	DONE ✓

What Does a Grocery Cart Say About You?
EDA on 4.5M Instacart Orders

Problem

Solution

Skills Acquired

Deep Dive

Why This Project?

The Datasets

Project Initialization

Phase 1 — Data Preprocessing

Phase 2 — Behavioral Analysis

Key Takeaways

Conclusion & Reflections

Want to See the Full Notebook?

What Does a Grocery Cart Say About You?EDA on 4.5M Instacart Orders

Problem

Solution

Skills Acquired

Deep Dive

Why This Project?

The Datasets

Project Initialization

Phase 1 — Data Preprocessing

Phase 2 — Behavioral Analysis

Key Takeaways

Conclusion & Reflections

Want to See the Full Notebook?

What Does a Grocery Cart Say About You?
EDA on 4.5M Instacart Orders