Problem
The "Golden Age of Television" is a widely repeated cultural claim β but claims made without data are just opinions. If the Golden Age (1999 to present) genuinely produced the best TV ever made, that quality should be measurable: highly-rated shows from this era should also be the most watched, and on IMDb, vote count is the closest available proxy for viewership. Before that hypothesis can be tested, however, the dataset has to be trustworthy β and a real-world dataset of 85,579 records rarely arrives clean. Missing values, duplicates, and implicit data type errors each introduce different categories of bias: one corrupts aggregations, another inflates counts, and the third produces silent incorrect comparisons. The challenge is identifying all three problems, understanding why they exist, and resolving each with an approach that preserves the data's integrity rather than papering over the issues.
Solution
This project preprocessed a dataset of IMDb movie and TV show credits, resolved three distinct data quality issues using appropriate Pandas techniques for each, and then tested whether highly rated Golden Age TV shows also receive the highest vote counts. The three preprocessing steps were treated as a structured exercise: identify the problem type, select the correct fix, and document the reasoning, not just the code. After preprocessing, the dataset was filtered to TV shows from 1999 onward and sorted by average rating to identify the top-rated titles, then cross-referenced against vote count to evaluate the hypothesis. The results confirmed that highly rated Golden Age shows do attract disproportionately many votes, validating the claim with real data rather than cultural consensus. The project also established the foundational Pandas and EDA workflow used in every subsequent analysis in the program.
Skills Acquired
- Python: the implementation language for all preprocessing and analysis steps. The project was completed entirely in Python using standard data science libraries, establishing the baseline workflow applied in all subsequent projects.
- Pandas: the primary data manipulation library. Pandas was used to load the CSV, inspect column types and null counts, identify the three preprocessing issues, apply the appropriate fixes (`dropna`, `drop_duplicates`, and `replace`), filter for TV shows in the Golden Age window, and sort by rating and vote count for the final analysis.
- Jupyter: the development environment that combined code, outputs, and narrative explanation in a single reproducible document. Jupyter's cell structure makes it straightforward to inspect intermediate results at every step of the preprocessing pipeline, which is critical when debugging data quality issues.
- EDA (Exploratory Data Analysis): the methodology applied to understand the dataset before drawing any conclusions. EDA in this project meant examining column distributions, identifying anomalies, and forming testable questions from the data before attempting to answer them.
- Data Preprocessing: the set of techniques applied to resolve the three identified quality issues. Each issue required a different approach: column names were normalized with chained string methods, rows missing `imdb_score` or `imdb_votes` values were dropped, exact duplicate rows were removed, and implicit duplicates in the `type` column were standardized before any aggregation.
What the data actually said, and how the preprocessing made that answer possible, is the story the rest of this writeup tells.
Deep Dive
The "Golden Age" of television is generally considered to have started in 1999 with The Sopranos and continues to this day. The claim is simple: this era produced some of the best TV ever made. If that's true, then the most highly-rated shows from this period should also be the most watched β and on IMDb, the proxy for "most watched" is vote count.
This project tests that assumption using a real dataset of 85,579 actor and director records from movies and TV shows on IMDb. Before any analysis could happen, the data needed significant preprocessing: three distinct issues were identified and resolved, each requiring a different pandas approach.
The Dataset
The dataset, `movies_and_shows.csv`, contains one row per actor/director credit, with metadata about the associated title repeated across every cast member's row. At 85,579 rows, it covers a wide range of movies and shows.
The first inspection revealed that the data had problems before a single line of analysis could be written. Three issues stood out: column names with inconsistent formatting, missing values in the score and vote columns, and a large number of duplicate rows, including implicit duplicates from inconsistent string values in the `type` column.
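That first-pass inspection comes down to a handful of standard calls. A minimal sketch on a synthetic three-row frame that mimics the same issue types (the column names match the real file, but the rows are invented for illustration):

```python
import pandas as pd

# Synthetic stand-in for movies_and_shows.csv, reproducing the three issue types
df = pd.DataFrame({
    ' name':      ['Actor A', 'Actor A', 'Actor B'],   # leading space in header
    'r0le':       ['ACTOR', 'ACTOR', 'DIRECTOR'],      # digit '0' in header
    ' Type':      ['SHOW', 'SHOW', 'tv series'],       # implicit duplicates
    'imdb sc0re': [8.6, 8.6, None],                    # missing value
})

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # exact duplicate rows (1 here)
print(df[' Type'].unique())   # raw category strings before standardization
```

On the real dataset, the same three calls surface the null counts, the 6,994 exact duplicates, and the nine `type` variants discussed below.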
| Column | Type | Issue Found |
|---|---|---|
| ` name` | object | Leading whitespace in header |
| `r0le` | object | Digit '0' instead of letter 'o' |
| `TITLE` | object | Uppercase + 1 missing value |
| ` Type` | object | Leading whitespace + implicit duplicates |
| `imdb sc0re` | float64 | Digit '0' + 4,609 missing values |
| `imdb v0tes` | float64 | Digit '0' + 4,726 missing values |
Data Preprocessing
Stage 2 - Part 1
Column Name Cleaning
The column names had four distinct formatting problems: uppercase characters, leading/trailing whitespace, spaces between words, and the digit '0' used in place of the letter 'o'. The fix was a single chained expression applying pandas string methods; no loop required.
```python
# Before: ' name', 'r0le', 'TITLE', ' Type', 'imdb sc0re', 'imdb v0tes'
df.columns = (
    df.columns
      .str.lower()             # uppercase -> lowercase
      .str.strip()             # drop leading/trailing whitespace
      .str.replace('0', 'o')   # digit '0' -> letter 'o'
      .str.replace(' ', '_')   # inner spaces -> underscores
)
# After: 'name', 'character', 'role', 'title', 'type',
#        'release_year', 'genres', 'imdb_score', 'imdb_votes'
```
The initial solution used a custom function with a loop; the reviewer pointed out this was unnecessarily verbose and that pandas string methods handle all four transformations in a single chain. This feedback introduced an important principle: prefer vectorized pandas operations over Python loops when working on Series or Index objects.
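For contrast, the loop-based version the reviewer flagged might have looked roughly like this (a reconstruction for illustration, not the original notebook code), side by side with the vectorized chain:

```python
import pandas as pd

cols = pd.Index([' name', 'r0le', 'TITLE', ' Type', 'imdb sc0re', 'imdb v0tes'])

# Loop-based version (the style flagged in review): explicit Python iteration
fixed = []
for col in cols:
    fixed.append(col.lower().strip().replace('0', 'o').replace(' ', '_'))

# Vectorized version: one chain of Index string methods, no Python loop
chained = (cols.str.lower()
               .str.strip()
               .str.replace('0', 'o')
               .str.replace(' ', '_'))

assert list(chained) == fixed  # identical result, tighter idiom
```

Both produce the same cleaned names; the chained form is shorter, reads as a single transformation, and stays inside pandas' optimized string machinery.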
Stage 2 - Part 2
Missing Values
The dataset had 4,609 missing IMDb scores and 4,726 missing vote counts (approximately 6% of rows). Since both columns are essential for the analysis, rows with any missing values were dropped entirely. One missing title value was present but not critical.
```python
# Check missing values per column
df.isna().sum()   # imdb_score: 4609, imdb_votes: 4726

# Drop all rows with any NaN (in-place, no new DataFrame)
df.dropna(inplace=True)
```
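Before committing to `dropna`, it is worth quantifying what share of rows is actually affected, since dropping is only defensible when that share is small. A minimal sketch on a synthetic ten-row frame (values invented for illustration; the real figure was roughly 6%):

```python
import pandas as pd

# Synthetic frame: 3 of 10 rows have at least one missing value
df = pd.DataFrame({
    'imdb_score': [8.6, None, 7.2, 6.9, None, 8.0, 5.5, 7.7, 6.1, 9.0],
    'imdb_votes': [100, 200, None, 400, 500, 600, 700, 800, 900, 1000],
})

# Fraction of rows with at least one NaN -- the number that justifies (or not)
# dropping instead of imputing
affected = df.isna().any(axis=1).mean()
print(f'{affected:.0%} of rows have a missing value')  # 30% here; ~6% in the real data

df = df.dropna()   # same strategy as above, without inplace
print(len(df))     # 7 complete rows remain
```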
Stage 2 - Part 3
Duplicate Rows and Implicit Type Duplicates
The dataset contained 6,994 exact duplicate rows: actor credit rows repeated with identical title metadata for every member of a show's cast. These were removed with `drop_duplicates()`.
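The removal itself is one call; counting first makes the cleanup auditable. A sketch on a synthetic frame (the 6,994 figure is from the real dataset; the rows here are invented):

```python
import pandas as pd

# Synthetic credits frame containing one exact duplicate row
df = pd.DataFrame({
    'name':  ['Actor A', 'Actor A', 'Actor B'],
    'title': ['Show X',  'Show X',  'Show X'],
})

print(df.duplicated().sum())               # 1 exact duplicate here (6,994 in the real data)
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))                             # 2 unique rows remain
```

Resetting the index after the drop keeps later positional operations from tripping over gaps left by the removed rows.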
The more subtle problem was implicit duplicates in the `type` column. Nine distinct string values were being used to represent two categories:
```python
print(df['type'].unique())
# ['MOVIE' 'movies' 'SHOW' 'shows' 'the movie' 'tv' 'tv series' 'tv show'
#  'tv shows']
```
Six variants of "show" needed to be collapsed into a single canonical value. A `replace_wrong_show()` function handled the mapping, applying `df['type'].replace()` in place on the original DataFrame rather than creating a new one.
```python
def replace_wrong_show(df, wrong_shows_list, correct_show):
    df['type'] = df['type'].replace(wrong_shows_list, correct_show)

wrong_shows = ['shows', 'SHOW', 'tv shows', 'tv', 'tv series', 'tv show']
replace_wrong_show(df, wrong_shows, 'SHOW')
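After the replacement, a cheap post-condition check confirms the column has collapsed to the intended categories. A self-contained sketch (the second call, standardizing the movie variants, is an assumption extrapolated from the "nine strings, two categories" observation; only the show fix is shown in the original code):

```python
import pandas as pd

df = pd.DataFrame({'type': ['MOVIE', 'movies', 'SHOW', 'tv series', 'tv show']})

def replace_wrong_show(df, wrong_shows_list, correct_show):
    df['type'] = df['type'].replace(wrong_shows_list, correct_show)

replace_wrong_show(df, ['shows', 'SHOW', 'tv shows', 'tv', 'tv series', 'tv show'], 'SHOW')
replace_wrong_show(df, ['movies', 'the movie'], 'MOVIE')  # assumed analogous movie fix

# Post-condition: exactly the two canonical categories survive
assert set(df['type'].unique()) == {'MOVIE', 'SHOW'}
```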
Analysis: Testing the Hypothesis
Stage 3 - Part 1
Filtering to the Golden Age
With the data cleaned, the analysis filtered to shows released in 1999 or later (the start of the Golden Age) and excluded movies. This produced the subset relevant to the hypothesis.
```python
# Keep only shows released 1999 or later; copy so later column assignments
# don't trigger SettingWithCopyWarning on a view
df_golden = df[(df['release_year'] >= 1999) & (df['type'] == 'SHOW')].copy()
```
Stage 3 - Part 2
Score Bucketing and Outlier Removal
IMDb scores are continuous (e.g., 7.8, 8.1, 8.3). To calculate meaningful average vote counts, scores were rounded to integers, creating buckets from 1 to 10. Each bucket then had enough data to compute a reliable mean.
After grouping, scores at the extremes (2, 3, and 10) had too few shows to produce meaningful averages. These were excluded, leaving scores 4 through 9 as the valid analysis range.
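One subtlety worth flagging: `Series.round()` uses banker's rounding (round half to even), so scores ending in .5 do not always round up to the next bucket:

```python
import pandas as pd

# Half-way scores round to the nearest EVEN integer, not always upward
scores = pd.Series([6.5, 7.5, 8.5, 7.8])
print(scores.round().tolist())   # [6.0, 8.0, 8.0, 8.0]
```

For bucketing at this granularity the effect is minor, but it explains why a 6.5-rated show lands in bucket 6 while a 7.5 lands in bucket 8.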
```python
# Round scores to integer buckets (copy first to avoid SettingWithCopyWarning)
df_golden = df_golden.copy()
df_golden['score_bucket'] = df_golden['imdb_score'].round()

# Keep buckets 4-9; buckets 2, 3, and 10 had too few shows to average reliably
df_valid = df_golden[df_golden['score_bucket'].between(4, 9)]

# Average votes per score bucket, highest first
avg_votes = (
    df_valid
    .groupby('score_bucket')['imdb_votes']
    .mean()
    .round()
    .reset_index()
    .rename(columns={'score_bucket': 'Score', 'imdb_votes': 'Average Votes'})
    .sort_values('Average Votes', ascending=False)
)
```
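The aggregation chain can be sanity-checked end to end on a toy frame (synthetic scores and votes, not the real data), which makes the expected shape of the result easy to verify by hand:

```python
import pandas as pd

toy = pd.DataFrame({
    'imdb_score': [7.2, 7.4, 8.1, 8.3, 6.6],
    'imdb_votes': [100, 300, 900, 1100, 50],
})

# Same pipeline as the analysis: bucket, group, average, rank
toy['score_bucket'] = toy['imdb_score'].round()
avg = (toy.groupby('score_bucket')['imdb_votes']
          .mean()
          .sort_values(ascending=False))

# Bucket 8 averages 1000 votes (900, 1100); bucket 7 averages 150 (100, 300, 50)
print(avg.to_dict())
```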
Stage 3 - Part 3
Result: The Hypothesis Holds
The analysis confirmed the assumption. Shows with scores in the 7-9 range consistently had the highest average vote counts. Score 4 was a notable exception (it ranked higher than scores 5 and 6), but the top three scores (7, 8, 9) clustered at the top of the vote count distribution, supporting the original claim.
The data studied represented approximately 94% of the original set after dropping missing values, providing a reliable base for the conclusion.
Key Takeaways
- Real data is messy: cleaning comes before analysis. The dataset had four types of problems before any analysis could run: formatting issues in column names, missing values, duplicate rows, and implicit string duplicates. Each required a different pandas approach.
- Vectorized pandas over Python loops. The initial column cleaning used a custom loop function. The reviewer's feedback (use `.str.lower().str.strip().str.replace()` chaining instead) was the first concrete lesson in pandas idiom: always prefer vectorized methods over row-by-row Python.
- Never recreate DataFrames unnecessarily. A recurring reviewer note was that re-importing pandas, re-reading the file, and reassigning `df` multiple times throughout the notebook violated the single-source principle. Transformations should mutate the original DataFrame, not create new ones for each step.
- Implicit duplicates are as dangerous as explicit ones. Nine different strings representing two categories would have silently broken any `groupby` or filter on the `type` column. Printing unique values before filtering is a cheap, essential sanity check.
What I Learned & Why It Matters to Employers
- This was my first real data project. Sprint 1 introduced the full EDA loop for the first time: inspect → clean → analyze → conclude. Every subsequent project in the program built on this same structure, just with more complex data and more sophisticated techniques.
- Reviewer feedback shaped my coding style permanently. Two rounds of code review (flagging duplicate code, unnecessary DataFrame recreation, and loop-based column cleaning) introduced pandas best practices early. The habits formed here carried forward into all later work.
- Data quality is a prerequisite, not an afterthought. Six percent of the dataset had missing score/vote values. Nine strings were standing in for two categories. 6,994 rows were exact duplicates. None of this was obvious at first glance; it required systematic inspection. That inspection mindset is foundational to any data role.
Conclusion & Reflections
The Golden Age hypothesis holds up under the data. TV shows released since 1999 with scores of 7-9 do receive more IMDb votes on average than lower-rated shows from the same era, confirming that critical reception and audience engagement move together for the best-rated content.
More importantly, this project established the data science workflow that every subsequent analysis has followed: never trust the raw data, inspect before you transform, and let the cleaning steps be as deliberate as the analysis. The reviewer's feedback during two rounds of revision made that lesson stick early.
| Objective | Status |
|---|---|
| Load and inspect the movies_and_shows.csv dataset | COMPLETED ✓ |
| Normalize column names (snake_case, lowercase, no whitespace, 0→o) | COMPLETED ✓ |
| Handle missing values in imdb_score and imdb_votes | COMPLETED ✓ |
| Remove 6,994 exact duplicate rows | COMPLETED ✓ |
| Standardize implicit type duplicates (9 strings → 2 categories) | COMPLETED ✓ |
| Filter to Golden Age shows (1999+) and bucket scores by integer | COMPLETED ✓ |
| Calculate average votes per score bucket and identify outliers | COMPLETED ✓ |
| Confirm or reject the hypothesis | CONFIRMED ✓ |