Problem
The used car market is one of the most opaque pricing environments a consumer will encounter. The same make and model can list for $4,000 or $14,000 depending on year, odometer reading, condition, fuel type, and a dozen other factors — and most buyers have no systematic way to understand what is actually driving that price difference. Static reports and averages miss the interaction effects that matter most: a high-mileage car in excellent condition versus a low-mileage car in fair condition are not comparable on any single axis. The real challenge is not collecting the data — it is building a tool that lets non-technical users explore those interactions themselves, in real time, without needing to run a single line of code. Closing that gap between raw data and user understanding is what this project set out to do.
Solution
This project combined a full EDA workflow with a deployed Streamlit web application — giving anyone with a browser instant access to interactive visualizations of 51,525 used vehicle listings. The analysis covers price distributions, model year trends broken down by condition, price-versus-odometer relationships, and price-versus-model-year patterns across condition categories. Missing value analysis was performed before any visualization to ensure findings were grounded in clean data. The Streamlit app was deployed on Render so no local Python environment is required — the tool is publicly accessible with zero setup. This mirrors how data science work actually reaches its audience: not as a Jupyter notebook, but as a running application that answers questions on demand.
Skills Acquired
- Python — implementation language for the full EDA pipeline and Streamlit application. All data loading, preprocessing, analysis, and visualization logic is written in Python.
- Pandas — used for loading the vehicles dataset, inspecting data types and shape, identifying missing values column by column, and filtering/grouping records for visualization. Missing value analysis was a prerequisite — no chart was produced before the data quality was understood.
- Plotly Express — interactive charting library used for all four core visualizations: a price distribution histogram, a model year by condition histogram, a price vs. odometer scatterplot, and a price vs. model year scatterplot colored by condition. Plotly charts are web-native and render directly inside Streamlit without any additional configuration.
- Streamlit — the web framework used to convert the EDA notebook into a live, browser-accessible application. Streamlit's checkbox and header components provided the UI layer; its deployment model eliminated the need for a separate server or infrastructure configuration.
- Render (Cloud Deployment) — the platform used to deploy the Streamlit app publicly. Configuring a requirements.txt, linking the GitHub repo, and managing the build process for a live deployment was a new workflow distinct from local Jupyter development, and closer to how production data tools actually ship.
- EDA (Exploratory Data Analysis) — the structured methodology applied throughout: inspect, clean, then explore. The sequence of missing value analysis before visualization prevented misleading conclusions and established a baseline of data trustworthiness before any patterns were surfaced.
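The inspect-first step in the Pandas bullet can be sketched as follows. The rows and column names here are invented for illustration, and the real dataset may use different names and types.

```python
import pandas as pd

# Invented sample rows with a few gaps; not the real listings.
listings = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500],
    "model_year": [2011.0, None, 2013.0, 2003.0],
    "condition": ["good", "good", "like new", "fair"],
    "odometer": [145000.0, 88705.0, None, None],
})

# Shape and dtypes first, so type problems surface before any charting.
print(listings.shape)
print(listings.dtypes)

# Column-by-column count and share of missing values.
missing = listings.isna().sum().to_frame("missing")
missing["share"] = listings.isna().mean()
print(missing)
```

A summary like this shows which columns need attention before any chart is drawn, for example an odometer column with a large share of gaps.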
Deep Dive
The first chart you see in the app is a price distribution histogram, and it tells you something immediately: used car prices are right-skewed. The median is well below the mean because a long tail of high-value vehicles pulls the average up. Most listings cluster below $20,000, but the distribution does not simply trail off. There are meaningful spikes at round-number price points like $5,000, $10,000, and $15,000, which suggests that sellers are anchoring to psychologically salient round numbers rather than precise market valuations.
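Both observations can be checked numerically as well as visually. The sketch below uses a handful of invented prices, so the numbers it prints say nothing about the real listings. It only shows the two checks: mean versus median as a skew signal, and the share of prices sitting on exact multiples of $5,000.

```python
import pandas as pd

# Toy prices invented for illustration; not taken from the real listings.
prices = pd.Series(
    [3500, 5000, 5000, 7200, 10000, 10000, 12500, 15000, 48000],
    name="price",
)

mean_price = prices.mean()
median_price = prices.median()

# A mean above the median is a quick signal of right skew.
is_right_skewed = mean_price > median_price

# Share of listings priced at an exact multiple of $5,000.
round_share = (prices % 5000 == 0).mean()

print(f"mean={mean_price:.0f} median={median_price:.0f} right_skewed={is_right_skewed}")
print(f"share at multiples of $5,000: {round_share:.2f}")
```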
The most instructive chart is the model year by condition breakdown. When you separate vehicles by condition (new, like new, excellent, good, fair, salvage), a pattern emerges that flat price averages hide completely: condition is a stronger pricing signal than age for mid-range vehicles. A 2010 vehicle in excellent condition frequently lists above a 2014 vehicle in fair condition. The year tells you one thing; the condition tells you another — and the interaction between them is where the real pricing intelligence lives.
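One way to put numbers on this interaction is a year-by-condition table of median prices. This is a sketch on invented rows with assumed column names (`model_year`, `condition`, `price`). The toy values were chosen to mirror the pattern described above and are not evidence for it.

```python
import pandas as pd

# Invented rows; chosen only to illustrate the year x condition comparison.
listings = pd.DataFrame({
    "model_year": [2010, 2010, 2014, 2014, 2010, 2014],
    "condition": ["excellent", "excellent", "fair", "fair", "fair", "excellent"],
    "price": [9500, 10500, 7000, 8000, 4000, 16000],
})

# Rows are model years, columns are conditions, cells are median prices.
median_price = listings.pivot_table(
    index="model_year", columns="condition", values="price", aggfunc="median"
)
print(median_price)

# Compare an older car in excellent condition with a newer one in fair condition.
older_excellent = median_price.loc[2010, "excellent"]
newer_fair = median_price.loc[2014, "fair"]
print(older_excellent > newer_fair)
```

Reading across a row holds the year fixed and varies condition, and reading down a column does the opposite, which is the comparison a single overall average cannot show.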
The price vs. odometer scatterplot confirms the expected negative correlation (higher mileage, lower price), but the relationship is noisier than intuition suggests. Vehicles with 150,000+ miles span a $1,000–$15,000 price range. Condition accounts for much of that spread: a high-mileage vehicle in excellent condition retains value that a low-mileage vehicle in salvage condition does not. Odometer alone is an incomplete predictor.
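A simple way to test whether condition accounts for some of the noise is to compare the overall price-odometer correlation with the correlation inside each condition group. The rows below are invented and deliberately constructed so the within-group relationship is tighter than the pooled one. Whether the real listings behave this way has to be checked on the real data.

```python
import pandas as pd

# Invented rows; two condition groups with different price levels.
listings = pd.DataFrame({
    "odometer": [30000, 60000, 120000, 180000, 40000, 90000, 150000, 200000],
    "condition": ["excellent"] * 4 + ["fair"] * 4,
    "price": [18000, 15000, 11000, 8000, 9000, 7000, 4000, 2500],
})

# Pearson correlation across all listings, ignoring condition.
overall_corr = listings["price"].corr(listings["odometer"])

# The same correlation computed separately inside each condition group.
per_condition = {
    condition: group["price"].corr(group["odometer"])
    for condition, group in listings.groupby("condition")
}

print(f"overall: {overall_corr:.2f}")
for condition, corr in per_condition.items():
    print(f"{condition}: {corr:.2f}")
```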
What This Sprint Added
Sprint 4 was the first time in the bootcamp that the deliverable was not a notebook but a running application. That distinction matters. Writing a requirements.txt, pushing to GitHub, configuring a Render deployment, and resolving the version conflicts that inevitably surface in a cloud build environment are skills that belong to the engineering side of data science. The analysis is only useful once it can reach the people who need it.
Why This Project?
EDA notebooks are easy to produce and hard to share with anyone who does not have a Python environment. Sprint 4 forced a discipline that every prior sprint skipped: shipping. The used cars dataset is rich enough that the interactive charts reveal genuinely interesting patterns, but the value of those patterns depends entirely on whether another person can access them without installing Python. This project closed that gap by deploying a public Streamlit app on Render. The analysis became useful the moment the URL went live.
| Visualization | Key Finding | Business Implication |
|---|---|---|
| Price Distribution | Right-skewed; most listings under $20K; clustering at round numbers | Sellers use psychological anchoring — not market precision |
| Model Year by Condition | Condition dominates age as a pricing signal for mid-range vehicles | Buyers should filter by condition first, year second |
| Price vs. Odometer | Negative correlation, but wide variance at every mileage band | Odometer alone is an insufficient price predictor |
| Price vs. Model Year | Coloring by condition shows older vehicles in excellent condition often listing above newer vehicles in fair condition | Condition × year interaction is the core pricing variable |
Explore the Live App
The full interactive dashboard is publicly deployed — no account, no setup required.