overview

A data science project exploring whether online media presence affects soccer team performance. We analyzed ~60 Premier League matches across five top teams (Arsenal, Liverpool, Manchester United, Manchester City, Chelsea) to determine if the number of news articles published between games could predict goals scored. Using Selenium web scraping for ESPN match data and The Guardian API for article counts, we built linear and polynomial regression models to test this hypothesis.

data pipeline

Tech stack: Python, Selenium, pandas, scikit-learn, matplotlib, NumPy.
ESPN scraping: Selenium WebDriver navigates ESPN's team pages in headless Chrome, extracting fixtures, scores, and statistics from dynamically loaded tables using pd.read_html() on the page source.
Guardian API: RESTful calls to content.guardianapis.com retrieve articles matching team names within date ranges, returning headline, word count, and publication timestamp.
match window creation: Each game generates a time window from match end to next match start, and articles are binned into these windows to count pre-game media coverage.
data cleaning: Dropped rows with corrupted or missing values in key fields (match dates, team names, article counts), and standardized date formats across both sources so that match windows aligned correctly.
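The Guardian step above boils down to flattening the API's JSON into rows. Here is a minimal sketch of that parsing, assuming the standard Guardian `/search` response shape (`response.results`, `webTitle`, `webPublicationDate`, and a `fields.wordcount` string when `show-fields=wordcount` is requested); `parse_guardian_response` is a hypothetical helper, not the project's actual function.

```python
from datetime import datetime

def parse_guardian_response(payload: dict) -> list[dict]:
    """Flatten a Guardian /search JSON payload into flat article records."""
    articles = []
    for item in payload.get("response", {}).get("results", []):
        articles.append({
            "headline": item.get("webTitle"),
            # Guardian timestamps are ISO 8601 with a trailing 'Z'.
            "published": datetime.fromisoformat(
                item["webPublicationDate"].replace("Z", "+00:00")
            ),
            "word_count": int(item.get("fields", {}).get("wordcount", 0)),
        })
    return articles

# A made-up sample payload in the documented response shape.
sample = {"response": {"results": [{
    "webTitle": "Arsenal edge five-goal thriller",
    "webPublicationDate": "2023-10-08T16:30:00Z",
    "fields": {"wordcount": "820"},
}]}}
records = parse_guardian_response(sample)
```

Counting articles per match window then reduces to filtering these records by `published`.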

match window logic

The core challenge was attributing articles to the correct game. We defined a "match window" as the period between one game's end time and the next game's start time.

Game N ends → window opens → articles counted → Game N+1 starts

This approach captures the media narrative between matches: post-match analysis of Game N and preview coverage for Game N+1. The assumption is that this combined coverage could influence team morale, fan expectations, or perceived pressure.

machine learning

We tested whether article count (x) could predict goals scored (y) using two regression approaches:

linear regression: Fit a line through the data using least squares. Test MSE: 2.38, R² = -0.1087. The negative R² means predicting the mean would be more accurate.
polynomial regression (degree 2): Added a quadratic term to capture non-linear relationships. Test MSE: 2.38, R² = -0.1074. No improvement over linear.
train/test split: 80/20 split. Training MSE was 1.5 while test MSE was 2.38, indicating overfitting: the model fits the training data but fails to generalize.
Figures: linear and polynomial regression scatter plots.
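The two regression fits can be reproduced in a few lines of scikit-learn. The data below is a synthetic stand-in (random article counts, roughly Poisson-distributed goals) so the numbers will not match the project's, but the pipeline is the same:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the real dataset (n=59 matches).
X = rng.integers(0, 30, size=(59, 1)).astype(float)  # article counts
y = rng.poisson(1.4, size=59).astype(float)          # goals scored

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Linear fit.
lin = LinearRegression().fit(X_train, y_train)
lin_mse = mean_squared_error(y_test, lin.predict(X_test))
lin_r2 = r2_score(y_test, lin.predict(X_test))

# Quadratic fit: expand x into [x, x^2] and reuse linear regression.
poly = PolynomialFeatures(degree=2, include_bias=False)
Xp_train, Xp_test = poly.fit_transform(X_train), poly.transform(X_test)
quad = LinearRegression().fit(Xp_train, y_train)
quad_mse = mean_squared_error(y_test, quad.predict(Xp_test))
```

An R² below zero here means the fitted line does worse on the test set than simply predicting the mean of `y_test`.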

results

Both models performed worse than a naive baseline (predicting mean goals for every match). The negative R² scores indicate that, at least in our sample, article count alone has no predictive power for team performance.

no correlation found: Media coverage volume does not predict goals scored; the relationship is essentially random noise. Not a headline finding, but as our first time working with real-world data and building a full ML pipeline, it was still a rewarding exercise.
coverage follows success: More likely, successful teams generate more articles after wins, creating a lagging indicator rather than a leading one.
sample size limitation: 59 matches across 5 teams may be insufficient. A full season (380 matches) could reveal subtler patterns.

future work

sentiment analysis: Instead of counting articles, we could analyze their tone. For example, negative press might affect performance differently than positive coverage.
home vs away split: Testing whether media pressure affects home games differently from away games could be interesting, since crowd dynamics are the main difference between venues at the professional level.
opponent strength: Adding features such as opponent league position, head-to-head history, and rest days between matches could capture factors the current model ignores entirely.

ethical considerations

If a strong relationship existed, it would suggest media coverage affects player performance and raise questions about whether the competition was fair before a match even started: a feedback loop in which rich, popular teams stay rich and popular, while poorer, less-covered but talented teams stay where they are.

Additionally, using article counts disadvantages smaller clubs with less media presence, potentially making predictions less accurate for them regardless of actual performance quality.

construction

Built as a DS3000 (Foundations of Data Science) final project at Northeastern. The codebase spans three Jupyter notebooks handling scraping, cleaning, and modeling separately.

ESPN scraping: setup_driver() initializes headless Chrome, scrape_team_fixtures() extracts tables per team, outputs 30+ team-specific DataFrames.
Guardian API: get_guardian_articles() handles paginated requests, get_article_data() parses JSON responses into flat dictionaries.
window matching: Pandas datetime operations create windows, pd.merge() joins article counts to matches on composite keys.
regression: scikit-learn's LinearRegression and PolynomialFeatures with manual train/test splits and MSE/R² evaluation.
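The window-matching join mentioned above can be sketched as a pd.merge on a composite key. Column names and values here are illustrative, assuming each match is uniquely identified by (team, match_date):

```python
import pandas as pd

# Hypothetical cleaned match data (from the ESPN scraping notebook).
matches = pd.DataFrame({
    "team": ["Arsenal", "Arsenal", "Chelsea"],
    "match_date": pd.to_datetime(["2023-10-08", "2023-10-21", "2023-10-21"]),
    "goals": [2, 1, 0],
})

# Hypothetical per-window article counts (from the Guardian notebook).
counts = pd.DataFrame({
    "team": ["Arsenal", "Arsenal", "Chelsea"],
    "match_date": pd.to_datetime(["2023-10-08", "2023-10-21", "2023-10-21"]),
    "article_count": [14, 9, 11],
})

# Left join on the composite key so every match is kept,
# even if no articles were found for its window.
merged = pd.merge(matches, counts, on=["team", "match_date"], how="left")
```

A left join keeps matches with zero coverage in the modeling table rather than silently dropping them.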

This project was built collaboratively with William Hon, Kinsey Bellerose, and Zaid Jilla. The notebooks demonstrate the full data science pipeline from web scraping through statistical modeling and interpretation.