Problem
The "Golden Age of Television" is a widely repeated cultural claim β but claims made without data are just opinions. If the Golden Age (1999 to present) genuinely produced the best TV ever made, that quality should be measurable: highly-rated shows from this era should also be the most watched, and on IMDb, vote count is the closest available proxy for viewership. Before that hypothesis can be tested, however, the dataset has to be trustworthy β and a real-world dataset of 85,579 records rarely arrives clean. Missing values, duplicates, and implicit data type errors each introduce different categories of bias: one corrupts aggregations, another inflates counts, and the third produces silent incorrect comparisons. The challenge is identifying all three problems, understanding why they exist, and resolving each with an approach that preserves the data's integrity rather than papering over the issues.
Solution
This project preprocessed a dataset of IMDb movie and TV show credits, resolved three distinct data quality issues using appropriate Pandas techniques for each, and then tested whether highly rated Golden Age TV shows also receive the highest vote counts. The three preprocessing steps were treated as a structured exercise: identify the problem type, select the correct fix, and document the reasoning, not just the code. After preprocessing, the dataset was filtered to TV shows from 1999 onward and sorted by average rating to identify the top-rated titles, then cross-referenced against vote count to evaluate the hypothesis. The results confirmed that highly rated Golden Age shows do attract disproportionately many votes, validating the claim with real data rather than cultural consensus. The project also established the foundational Pandas and EDA workflow used in every subsequent analysis in the program.
Skills Acquired
- Python: the implementation language for all preprocessing and analysis steps. The project was completed entirely in Python using standard data science libraries, establishing the baseline workflow applied in all subsequent projects.
- Pandas: the primary data manipulation library. Pandas was used to load the CSV, inspect column types and null counts, identify the three preprocessing issues, apply the appropriate fixes (`dropna`, `drop_duplicates`, and `replace`), filter for TV shows in the Golden Age window, and sort by rating and vote count for the final analysis.
- Jupyter: the development environment that combined code, outputs, and narrative explanation in a single reproducible document. Jupyter's cell structure makes it straightforward to inspect intermediate results at every step of the preprocessing pipeline, which is critical when debugging data quality issues.
- EDA (Exploratory Data Analysis): the methodology applied to understand the dataset before drawing any conclusions. EDA in this project meant examining column distributions, identifying anomalies, and forming testable questions from the data before attempting to answer them.
- Data Preprocessing: the set of techniques applied to resolve the three identified quality issues. Each issue required a different approach: column names were normalized with chained string methods, rows missing `imdb_score` or `imdb_votes` values were dropped, exact duplicate rows were removed, and implicit duplicates in the `type` column were standardized before any aggregation.
What the data actually said, and how the preprocessing made that answer possible, is the story the rest of this writeup tells.
Deep Dive
The "Golden Age" of television is generally considered to have started in 1999 with The Sopranos and continues to this day. The claim is simple: this era produced some of the best TV ever made. If that's true, then the most highly-rated shows from this period should also be the most watched β and on IMDb, the proxy for "most watched" is vote count.
This project tests that assumption using a real dataset of 85,579 actor and director records from movies and TV shows on IMDb. Before any analysis could happen, the data needed significant preprocessing: three distinct issues were identified and resolved, each requiring a different pandas approach.
The Dataset
The dataset, `movies_and_shows.csv`, contains one row per actor/director credit, with metadata about the associated title repeated across every cast member's row. At 85,579 rows, it covers a wide range of movies and shows.
The first inspection revealed that the data had problems before a single line of analysis could be written. Three issues stood out: column names with inconsistent formatting, missing values in the score and vote columns, and a large number of duplicate rows, including implicit duplicates from inconsistent string values in the `type` column.
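That first-pass inspection comes down to a handful of standard calls. A minimal sketch on a synthetic three-row frame that mimics the same issue types (the column names match the real file, but the rows are invented for illustration):

```python
import pandas as pd

# Synthetic stand-in for movies_and_shows.csv, reproducing the three issue types
df = pd.DataFrame({
    ' name':      ['Actor A', 'Actor A', 'Actor B'],   # leading space in header
    'r0le':       ['ACTOR', 'ACTOR', 'DIRECTOR'],      # digit '0' in header
    ' Type':      ['SHOW', 'SHOW', 'tv series'],       # implicit duplicates
    'imdb sc0re': [8.6, 8.6, None],                    # missing value
})

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # exact duplicate rows (1 here)
print(df[' Type'].unique())   # raw category strings before standardization
```

On the real dataset, the same three calls surface the null counts, the 6,994 exact duplicates, and the nine `type` variants discussed below.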
| Column | Type | Issue Found |
|---|---|---|
| ` name` | object | Leading whitespace in header |
| `r0le` | object | Digit '0' instead of letter 'o' |
| `TITLE` | object | Uppercase + 1 missing value |
| ` Type` | object | Leading whitespace + implicit duplicates |
| `imdb sc0re` | float64 | Digit '0' + 4,609 missing values |
| `imdb v0tes` | float64 | Digit '0' + 4,726 missing values |
Data Preprocessing
Stage 2 - Part 1
Column Name Cleaning
The column names had four distinct formatting problems: uppercase characters, leading/trailing whitespace, spaces between words, and the digit '0' used in place of the letter 'o'. The fix was a single chained expression applying pandas string methods; no loop required.
```python
# Before: ' name', 'r0le', 'TITLE', ' Type', 'imdb sc0re', 'imdb v0tes'
df.columns = (
    df.columns
      .str.lower()             # uppercase -> lowercase
      .str.strip()             # drop leading/trailing whitespace
      .str.replace('0', 'o')   # digit '0' -> letter 'o'
      .str.replace(' ', '_')   # inner spaces -> underscores
)
# After: 'name', 'character', 'role', 'title', 'type',
#        'release_year', 'genres', 'imdb_score', 'imdb_votes'
```
The initial solution used a custom function with a loop; the reviewer pointed out this was unnecessarily verbose and that pandas string methods handle all four transformations in a single chain. This feedback introduced an important principle: prefer vectorized pandas operations over Python loops when working on Series or Index objects.
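For contrast, the loop-based version the reviewer flagged might have looked roughly like this (a reconstruction for illustration, not the original notebook code), side by side with the vectorized chain:

```python
import pandas as pd

cols = pd.Index([' name', 'r0le', 'TITLE', ' Type', 'imdb sc0re', 'imdb v0tes'])

# Loop-based version (the style flagged in review): explicit Python iteration
fixed = []
for col in cols:
    fixed.append(col.lower().strip().replace('0', 'o').replace(' ', '_'))

# Vectorized version: one chain of Index string methods, no Python loop
chained = (cols.str.lower()
               .str.strip()
               .str.replace('0', 'o')
               .str.replace(' ', '_'))

assert list(chained) == fixed  # identical result, tighter idiom
```

Both produce the same cleaned names; the chained form is shorter, reads as a single transformation, and stays inside pandas' optimized string machinery.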
Stage 2 - Part 2
Missing Values
The dataset had 4,609 missing IMDb scores and 4,726 missing vote counts (approximately 6% of rows). Since both columns are essential for the analysis, rows with any missing values were dropped entirely. One missing title value was present but not critical.
```python
# Check missing values per column
df.isna().sum()   # imdb_score: 4609, imdb_votes: 4726

# Drop all rows with any NaN (in-place, no new DataFrame)
df.dropna(inplace=True)
```
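Before committing to `dropna`, it is worth quantifying what share of rows is actually affected, since dropping is only defensible when that share is small. A minimal sketch on a synthetic ten-row frame (values invented for illustration; the real figure was roughly 6%):

```python
import pandas as pd

# Synthetic frame: 3 of 10 rows have at least one missing value
df = pd.DataFrame({
    'imdb_score': [8.6, None, 7.2, 6.9, None, 8.0, 5.5, 7.7, 6.1, 9.0],
    'imdb_votes': [100, 200, None, 400, 500, 600, 700, 800, 900, 1000],
})

# Fraction of rows with at least one NaN -- the number that justifies (or not)
# dropping instead of imputing
affected = df.isna().any(axis=1).mean()
print(f'{affected:.0%} of rows have a missing value')  # 30% here; ~6% in the real data

df = df.dropna()   # same strategy as above, without inplace
print(len(df))     # 7 complete rows remain
```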
Stage 2 - Part 3
Duplicate Rows and Implicit Type Duplicates
The dataset contained 6,994 exact duplicate rows: actor credit rows repeated with identical title metadata for every member of a show's cast. These were removed with `drop_duplicates()`.
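The removal itself is one call; counting first makes the cleanup auditable. A sketch on a synthetic frame (the 6,994 figure is from the real dataset; the rows here are invented):

```python
import pandas as pd

# Synthetic credits frame containing one exact duplicate row
df = pd.DataFrame({
    'name':  ['Actor A', 'Actor A', 'Actor B'],
    'title': ['Show X',  'Show X',  'Show X'],
})

print(df.duplicated().sum())               # 1 exact duplicate here (6,994 in the real data)
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))                             # 2 unique rows remain
```

Resetting the index after the drop keeps later positional operations from tripping over gaps left by the removed rows.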
The more subtle problem was implicit duplicates in the `type` column. Nine distinct string values were being used to represent two categories:
```python
print(df['type'].unique())
# ['MOVIE' 'movies' 'SHOW' 'shows' 'the movie' 'tv' 'tv series' 'tv show'
#  'tv shows']
```
Six variants of "show" needed to be collapsed into a single canonical value. A `replace_wrong_show()` function handled the mapping, applying `df['type'].replace()` in place on the original DataFrame rather than creating a new one.
```python
def replace_wrong_show(df, wrong_shows_list, correct_show):
    df['type'] = df['type'].replace(wrong_shows_list, correct_show)

wrong_shows = ['shows', 'SHOW', 'tv shows', 'tv', 'tv series', 'tv show']
replace_wrong_show(df, wrong_shows, 'SHOW')
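After the replacement, a cheap post-condition check confirms the column has collapsed to the intended categories. A self-contained sketch (the second call, standardizing the movie variants, is an assumption extrapolated from the "nine strings, two categories" observation; only the show fix is shown in the original code):

```python
import pandas as pd

df = pd.DataFrame({'type': ['MOVIE', 'movies', 'SHOW', 'tv series', 'tv show']})

def replace_wrong_show(df, wrong_shows_list, correct_show):
    df['type'] = df['type'].replace(wrong_shows_list, correct_show)

replace_wrong_show(df, ['shows', 'SHOW', 'tv shows', 'tv', 'tv series', 'tv show'], 'SHOW')
replace_wrong_show(df, ['movies', 'the movie'], 'MOVIE')  # assumed analogous movie fix

# Post-condition: exactly the two canonical categories survive
assert set(df['type'].unique()) == {'MOVIE', 'SHOW'}
```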
Analysis: Testing the Hypothesis
Stage 3 - Part 1
Filtering to the Golden Age
With the data cleaned, the analysis filtered to shows released in 1999 or later (the start of the Golden Age) and excluded movies. This produced the subset relevant to the hypothesis.
```python
# Keep only shows released 1999 or later; copy so later column assignments
# don't trigger SettingWithCopyWarning on a view
df_golden = df[(df['release_year'] >= 1999) & (df['type'] == 'SHOW')].copy()
```
Stage 3 - Part 2
Score Bucketing and Outlier Removal
IMDb scores are continuous (e.g., 7.8, 8.1, 8.3). To calculate meaningful average vote counts, scores were rounded to integers, creating buckets from 1 to 10. Each bucket then had enough data to compute a reliable mean.
After grouping, scores at the extremes (2, 3, and 10) had too few shows to produce meaningful averages. These were excluded, leaving scores 4 through 9 as the valid analysis range.
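One subtlety worth flagging: `Series.round()` uses banker's rounding (round half to even), so scores ending in .5 do not always round up to the next bucket:

```python
import pandas as pd

# Half-way scores round to the nearest EVEN integer, not always upward
scores = pd.Series([6.5, 7.5, 8.5, 7.8])
print(scores.round().tolist())   # [6.0, 8.0, 8.0, 8.0]
```

For bucketing at this granularity the effect is minor, but it explains why a 6.5-rated show lands in bucket 6 while a 7.5 lands in bucket 8.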
```python
# Round scores to integer buckets (copy first to avoid SettingWithCopyWarning)
df_golden = df_golden.copy()
df_golden['score_bucket'] = df_golden['imdb_score'].round()

# Keep buckets 4-9; buckets 2, 3, and 10 had too few shows to average reliably
df_valid = df_golden[df_golden['score_bucket'].between(4, 9)]

# Average votes per score bucket, highest first
avg_votes = (
    df_valid
    .groupby('score_bucket')['imdb_votes']
    .mean()
    .round()
    .reset_index()
    .rename(columns={'score_bucket': 'Score', 'imdb_votes': 'Average Votes'})
    .sort_values('Average Votes', ascending=False)
)
```
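The aggregation chain can be sanity-checked end to end on a toy frame (synthetic scores and votes, not the real data), which makes the expected shape of the result easy to verify by hand:

```python
import pandas as pd

toy = pd.DataFrame({
    'imdb_score': [7.2, 7.4, 8.1, 8.3, 6.6],
    'imdb_votes': [100, 300, 900, 1100, 50],
})

# Same pipeline as the analysis: bucket, group, average, rank
toy['score_bucket'] = toy['imdb_score'].round()
avg = (toy.groupby('score_bucket')['imdb_votes']
          .mean()
          .sort_values(ascending=False))

# Bucket 8 averages 1000 votes (900, 1100); bucket 7 averages 150 (100, 300, 50)
print(avg.to_dict())
```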
Stage 3 - Part 3
Result: The Hypothesis Holds
The analysis confirmed the assumption. Shows with scores in the 7-9 range consistently had the highest average vote counts. Score 4 was a notable exception (it ranked higher than scores 5 and 6), but the top three scores (7, 8, 9) clustered at the top of the vote count distribution, supporting the original claim.
The data studied represented approximately 94% of the original set after dropping missing values, providing a reliable base for the conclusion.
Key Takeaways
- Real data is messy: cleaning comes before analysis. The dataset had four types of problems before any analysis could run: formatting issues in column names, missing values, duplicate rows, and implicit string duplicates. Each required a different pandas approach.
- Vectorized pandas over Python loops. The initial column cleaning used a custom loop function. The reviewer's feedback (use `.str.lower().str.strip().str.replace()` chaining instead) was the first concrete lesson in pandas idiom: always prefer vectorized methods over row-by-row Python.
- Never recreate DataFrames unnecessarily. A recurring reviewer note was that re-importing pandas, re-reading the file, and reassigning `df` multiple times throughout the notebook violated the single-source principle. Transformations should mutate the original DataFrame, not create new ones for each step.
- Implicit duplicates are as dangerous as explicit ones. Nine different strings representing two categories would have silently broken any `groupby` or filter on the `type` column. Printing unique values before filtering is a cheap, essential sanity check.
What I Learned & Why It Matters to Employers
- This was my first real data project. Sprint 1 introduced the full EDA loop for the first time: inspect → clean → analyze → conclude. Every subsequent project in the program built on this same structure, just with more complex data and more sophisticated techniques.
- Reviewer feedback shaped my coding style permanently. Two rounds of code review (flagging duplicate code, unnecessary DataFrame recreation, and loop-based column cleaning) introduced pandas best practices early. The habits formed here carried forward into all later work.
- Data quality is a prerequisite, not an afterthought. Six percent of the dataset had missing score/vote values. Nine strings were standing in for two categories. 6,994 rows were exact duplicates. None of this was obvious at first glance; it required systematic inspection. That inspection mindset is foundational to any data role.
Conclusion & Reflections
The Golden Age hypothesis holds up under the data. TV shows released since 1999 with scores of 7-9 do receive more IMDb votes on average than lower-rated shows from the same era, confirming that critical reception and audience engagement move together for the best-rated content.
More importantly, this project established the data science workflow that every subsequent analysis has followed: never trust the raw data, inspect before you transform, and let the cleaning steps be as deliberate as the analysis. The reviewer's feedback during two rounds of revision made that lesson stick early.
| Objective | Status |
|---|---|
| Load and inspect the movies_and_shows.csv dataset | COMPLETED ✓ |
| Normalize column names (snake_case, lowercase, no whitespace, 0→o) | COMPLETED ✓ |
| Handle missing values in imdb_score and imdb_votes | COMPLETED ✓ |
| Remove 6,994 exact duplicate rows | COMPLETED ✓ |
| Standardize implicit type duplicates (9 strings → 2 categories) | COMPLETED ✓ |
| Filter to Golden Age shows (1999+) and bucket scores by integer | COMPLETED ✓ |
| Calculate average votes per score bucket and identify outliers | COMPLETED ✓ |
| Confirm or reject the hypothesis | CONFIRMED ✓ |