IMDb Golden Age TV Analysis
01

Problem

The "Golden Age of Television" is a widely repeated cultural claim β€” but claims made without data are just opinions. If the Golden Age (1999 to present) genuinely produced the best TV ever made, that quality should be measurable: highly-rated shows from this era should also be the most watched, and on IMDb, vote count is the closest available proxy for viewership. Before that hypothesis can be tested, however, the dataset has to be trustworthy β€” and a real-world dataset of 85,579 records rarely arrives clean. Missing values, duplicates, and implicit data type errors each introduce different categories of bias: one corrupts aggregations, another inflates counts, and the third produces silent incorrect comparisons. The challenge is identifying all three problems, understanding why they exist, and resolving each with an approach that preserves the data's integrity rather than papering over the issues.

02

Solution

This project preprocessed a dataset of IMDb movie and TV show credits, resolved three distinct data quality issues using an appropriate pandas technique for each, and then tested whether highly-rated Golden Age TV shows also receive the highest vote counts. The three preprocessing steps were treated as a structured exercise: identify the problem type, select the correct fix, and document the reasoning, not just the code. After preprocessing, the dataset was filtered to TV shows from 1999 onward and sorted by average rating to identify the top-rated titles, then cross-referenced against vote count to evaluate the hypothesis. The results confirmed that highly-rated Golden Age shows do attract disproportionate votes, validating the claim with real data rather than cultural consensus. The project also established the foundational pandas and EDA workflow used in every subsequent analysis in the program.

03

Skills Acquired

What the data actually said, and how the preprocessing made that answer possible, is the story the rest of this writeup tells.

04

Deep Dive

The "Golden Age" of television is generally considered to have started in 1999 with The Sopranos and continues to this day. The claim is simple: this era produced some of the best TV ever made. If that's true, then the most highly-rated shows from this period should also be the most watched β€” and on IMDb, the proxy for "most watched" is vote count.

This project tests that assumption using a real dataset of 85,579 actor and director records from movies and TV shows on IMDb. Before any analysis could happen, the data needed significant preprocessing: three distinct issues were identified and resolved, each requiring a different pandas approach.

The assumption: highly-rated TV shows released during the "Golden Age" of television (1999–present) also receive the most votes. Testing this required cleaning messy real-world data first, a realistic preview of what data science actually looks like in practice.

The Dataset

The dataset, movies_and_shows.csv, contains one row per actor/director credit, with metadata about the associated title repeating across every cast member's row. At 85,579 rows, it covers a wide range of movies and shows.

The first inspection revealed problems before a single line of analysis could be written. Three issues stood out: column names with inconsistent formatting, missing values in the score and vote columns, and a large number of duplicate rows, including implicit duplicates from inconsistent string values in the type column.

Column        Type     Issue Found
'   name'     object   Leading whitespace in header
'r0le'        object   Digit '0' instead of letter 'o'
'TITLE'       object   Uppercase + 1 missing value
'  Type'      object   Leading whitespace + implicit duplicates
'imdb sc0re'  float64  Digit '0' + 4,609 missing values
'imdb v0tes'  float64  Digit '0' + 4,726 missing values
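The kind of inspection that surfaces all three issue types takes only a few lines. The frame below is a tiny synthetic stand-in for movies_and_shows.csv (the column names and defects mirror the table above, but the values are illustrative):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame reproducing the three defect types
df = pd.DataFrame({
    '   name': ['Actor A', 'Actor A', 'Actor B'],
    'imdb sc0re': [8.1, 8.1, np.nan],
    '  Type': ['SHOW', 'SHOW', 'tv show'],
})

print(list(df.columns))       # reveals whitespace and '0'-for-'o' headers
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # count of exact duplicate rows
```

On the real dataset, these same three checks are what surfaced the malformed headers, the missing score/vote values, and the 6,994 exact duplicates described below.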

Data Preprocessing

Stage 2 β€” Part 1

Column Name Cleaning

The column names had four distinct formatting problems: uppercase characters, leading/trailing whitespace, spaces between words, and the digit '0' used in place of the letter 'o'. The fix was a single chained expression applying pandas string methods, no loop required.

column name normalization
# Before: '   name', 'r0le', 'TITLE', '  Type', 'imdb sc0re', 'imdb v0tes'
df.columns = (
    df.columns.str.lower()     # uppercase -> lowercase
    .str.strip()               # drop leading/trailing whitespace
    .str.replace('0', 'o')     # digit '0' -> letter 'o'
    .str.replace(' ', '_')     # spaces between words -> underscores
)

# After: 'name', 'character', 'role', 'title', 'type',
#        'release_year', 'genres', 'imdb_score', 'imdb_votes'

The initial solution used a custom function with a loop; the reviewer pointed out this was unnecessarily verbose and that pandas string methods handle all four transformations in a single chain. This feedback introduced an important principle: prefer vectorized pandas operations over Python loops when working on Series or Index objects.
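For contrast, here is a loop-based version of the fix (a reconstruction of the flagged approach, not the original code) next to the vectorized chain; both produce identical results:

```python
import pandas as pd

cols = pd.Index(['   name', 'r0le', 'TITLE', 'imdb sc0re'])

# Loop-based version: works, but verbose and non-idiomatic
fixed_loop = [c.lower().strip().replace('0', 'o').replace(' ', '_') for c in cols]

# Vectorized version: one chain of pandas string methods
fixed_vec = (
    cols.str.lower()
    .str.strip()
    .str.replace('0', 'o')
    .str.replace(' ', '_')
)

print(list(fixed_vec))  # ['name', 'role', 'title', 'imdb_score']
```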

Stage 2 β€” Part 2

Missing Values

The dataset had 4,609 missing IMDb scores and 4,726 missing vote counts, approximately 6% of rows. Since both columns are essential for the analysis, rows with any missing value were dropped entirely; this also removed the single row that was missing a title.

dropping missing values
# Check missing values per column
df.isna().sum()
# imdb_score: 4609, imdb_votes: 4726

# Drop all rows with any NaN, modifying df in place rather than creating a new DataFrame
df.dropna(inplace=True)
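Dropping on every column was acceptable here because all remaining columns mattered for the analysis. When only some columns are critical, dropna's subset parameter is the more surgical option. A small sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'title':      ['A', None, 'C'],
    'imdb_score': [8.0, 7.5, np.nan],
    'imdb_votes': [100.0, 200.0, np.nan],
})

# Blunt: drop rows with ANY missing value (what the project did)
any_dropped = df.dropna()
# Surgical: require only the analysis-critical columns to be present
subset_dropped = df.dropna(subset=['imdb_score', 'imdb_votes'])

print(len(any_dropped), len(subset_dropped))  # 1 2
```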

Stage 2 β€” Part 3

Duplicate Rows and Implicit Type Duplicates

The dataset contained 6,994 exact duplicate rows, where the same credit appeared more than once with identical values in every column, including the repeated title metadata. These were removed with drop_duplicates().

The more subtle problem was implicit duplicates in the type column. Nine distinct string values were being used to represent two categories:

unique type values before standardization
print(df['type'].unique())
# ['MOVIE' 'movies' 'SHOW' 'shows' 'the movie' 'tv' 'tv series' 'tv show' 'tv shows']

Six variants of "show" needed to be collapsed into the single canonical value 'SHOW'. A replace_wrong_show() function handled the mapping, applying df['type'].replace() in place on the original DataFrame rather than creating a new one.

standardizing implicit type duplicates
def replace_wrong_show(df, wrong_shows_list, correct_show):
    # Map every variant in wrong_shows_list onto the canonical label,
    # mutating the passed DataFrame's column in place
    df['type'] = df['type'].replace(wrong_shows_list, correct_show)

wrong_shows = ['shows', 'SHOW', 'tv shows', 'tv', 'tv series', 'tv show']
replace_wrong_show(df, wrong_shows, 'SHOW')
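On a toy type column the mapping behaves as expected, and movie-side variants pass through untouched:

```python
import pandas as pd

toy = pd.DataFrame({'type': ['MOVIE', 'tv show', 'SHOW', 'tv series', 'shows']})

# Same list-to-scalar replace used by replace_wrong_show() above
wrong_shows = ['shows', 'SHOW', 'tv shows', 'tv', 'tv series', 'tv show']
toy['type'] = toy['type'].replace(wrong_shows, 'SHOW')

print(sorted(toy['type'].unique()))  # ['MOVIE', 'SHOW']
```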

Analysis: Testing the Hypothesis

Stage 3 β€” Part 1

Filtering to the Golden Age

With the data cleaned, the analysis filtered to shows released in 1999 or later, the start of the Golden Age, and excluded movies. This produced the subset relevant to the hypothesis.

filtering to Golden Age shows
# Keep only shows released 1999 or later; .copy() avoids chained-assignment
# warnings when a new column is added to this subset later
df_golden = df[(df['release_year'] >= 1999) & (df['type'] == 'SHOW')].copy()

Stage 3 β€” Part 2

Score Bucketing and Outlier Removal

IMDb scores are continuous (e.g., 7.8, 8.1, 8.3). To calculate meaningful average vote counts, scores were rounded to integers, creating buckets of 1–10. Most buckets then had enough data to compute a reliable mean.

After grouping, scores at the extremes (2, 3, and 10) had too few shows to produce meaningful averages. These were excluded, leaving scores 4 through 9 as the valid analysis range.

score bucketing and vote averaging
# Round scores to integer buckets
df_golden['score_bucket'] = df_golden['imdb_score'].round()

# Keep buckets with enough data (4–9), excluding sparse buckets 2, 3, and 10
df_valid = df_golden[df_golden['score_bucket'].between(4, 9)]

avg_votes = (
    df_valid
    .groupby('score_bucket')['imdb_votes']
    .mean()
    .round()
    .reset_index()
    .rename(columns={'score_bucket': 'Score', 'imdb_votes': 'Average Votes'})
    .sort_values('Average Votes', ascending=False)
)
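The same bucket-then-average pipeline on a four-row toy frame makes the semantics concrete (values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'imdb_score': [7.6, 8.2, 8.4, 6.1],
    'imdb_votes': [1000.0, 3000.0, 5000.0, 200.0],
})

# Round into integer buckets, then average votes within each bucket
toy['score_bucket'] = toy['imdb_score'].round()
avg = toy.groupby('score_bucket')['imdb_votes'].mean()

print(avg.to_dict())  # {6.0: 200.0, 8.0: 3000.0}
```

The three scores near 8 collapse into one bucket whose mean vote count can be compared directly against the other buckets, which is exactly what the full chain above does per score level.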

Stage 3 β€” Part 3

Result: The Hypothesis Holds

The analysis confirmed the assumption. Shows with scores in the 7–9 range consistently had the highest average vote counts. Score 4 was a notable exception (it ranked higher than scores 5 and 6), but the top three scores (7, 8, 9) clustered at the top of the vote count distribution, supporting the original claim.

The rows analyzed represented approximately 94% of the original dataset after rows with missing values were dropped, providing a reliable base for the conclusion.

Key Takeaways

What I Learned & Why It Matters to Employers

  • This was my first real data project. Sprint 1 introduced the full EDA loop for the first time: inspect → clean → analyze → conclude. Every subsequent project in the program built on this same structure, just with more complex data and more sophisticated techniques.
  • Reviewer feedback shaped my coding style permanently. Two rounds of code review, flagging duplicate code, unnecessary DataFrame recreation, and loop-based column cleaning, introduced pandas best practices early. The habits formed here carried forward into all later work.
  • Data quality is a prerequisite, not an afterthought. Six percent of the dataset had missing score/vote values. Nine strings were standing in for two categories. 6,994 rows were exact duplicates. None of this was obvious from a first glance; it required systematic inspection. That inspection mindset is foundational to any data role.

Conclusion & Reflections

The Golden Age hypothesis holds up under the data. TV shows released since 1999 with scores of 7–9 do receive more IMDb votes on average than lower-rated shows from the same era, confirming that critical reception and audience engagement move together for the best-rated content.

More importantly, this project established the data science workflow that every subsequent analysis has followed: never trust the raw data, inspect before you transform, and let the cleaning steps be as deliberate as the analysis. The reviewer's feedback during two rounds of revision made that lesson stick early.

Objective Status
Load and inspect the movies_and_shows.csv dataset COMPLETED ✓
Normalize column names (snake_case, lowercase, no whitespace, 0→o) COMPLETED ✓
Handle missing values in imdb_score and imdb_votes COMPLETED ✓
Remove 6,994 exact duplicate rows COMPLETED ✓
Standardize implicit type duplicates (9 strings → 2 categories) COMPLETED ✓
Filter to Golden Age shows (1999+) and bucket scores by integer COMPLETED ✓
Calculate average votes per score bucket and identify outliers COMPLETED ✓
Confirm or reject the hypothesis CONFIRMED ✓