Scraping Pro Football Reference with Python

Written on August 13th, 2019 by Steven Morse

Pro Football Reference is a stat-head’s dream — there is a wealth of football information, it is easily accessible directly on the site through built-in APIs, and it is cleanly formatted which makes data scraping a non-headache-inducing endeavor.

This post outlines how to grab historical fantasy points for individual players using Python. (Here’s an older, shorter post using R.) We’ll be able to do plots like this, which groups similar players based on their point average and variance.

[Figure: per-position scatter plots of average fantasy points vs. standard deviation, 2018 season]

Scraping a single table

The hard way to scrape a table is manually grabbing the raw HTML with requests, then manually parsing the table structure with BeautifulSoup. But if we don’t need much fine control, there’s a better way.
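(For the curious, the manual route looks roughly like this. A minimal sketch, using the Russell Wilson page that appears below; we'll use BeautifulSoup for real later on.)

import requests
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com/players/W/WilsRu00/fantasy/2018/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

table = soup.find('table')  # first <table> on the page
rows = []
for tr in table.find_all('tr'):
    # collect the text of every header/data cell in this row
    rows.append([cell.get_text() for cell in tr.find_all(['th', 'td'])])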

The easy way to scrape a table is using pandas.read_html. Here’s what it looks like to scrape Russell Wilson’s fantasy data from 2018 with 3 lines of code:

import pandas as pd

url = 'https://www.pro-football-reference.com/players/W/WilsRu00/fantasy/2018/'
df = pd.read_html(url)[0]
  Unnamed: 0_level_0  Inside 20  Inside 10  Snap Counts  Unnamed: 4_level_0  ...
  Unnamed: 0_level_1    Passing    Rushing      Passing             Rushing  ...
                  Rk         G#       Date           Tm  Unnamed: 4_level_2  ...
0                1.0        1.0 2018-09-09          SEA                   @  ...
1                2.0        2.0 2018-09-17          SEA                   @  ...

That’s it, folks! read_html returns a list of all the <table>s on the page, in pandas.DataFrame form, and since we peeked at the page we know to just grab the first one, and we’re off and running!
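(If the table you want isn’t the first one, read_html can also filter tables by their text content via its match argument. A small sketch, assuming the text 'Date' appears in the target table:)

df = pd.read_html(url, match='Date')[0]  # keep only tables whose text matches 'Date'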

Obviously we have some seriously ugly column headers to deal with, but we can get to that later.

Scraping multiple pages

If we want to automate scraping multiple pages, though, we need to get a little more detailed. Now we’ll bring in the aforementioned requests, a Python package that lets us make lower-level HTTP requests, and BeautifulSoup, a package for parsing and crawling raw HTML/XML.


Notice that PFR’s naming scheme for players is a little odd. Russell Wilson’s stub is W/WilsRu00 but Antonio Brown’s is B/BrowAn04. How are we going to figure out what all these are and then loop through each page?

Check out the Fantasy Leaders page. This is a list of every active player in the season along with their season fantasy stats, ordered by overall fantasy points. Notice that each player in this table is hyperlinked to their individual page. Wouldn’t it be nice if we could crawl through this table and record each player’s URL stub? We can’t do this with our pandas.read_html trick, because read_html strips each hyperlink down to its display text. (That is, from <a href="url">Player</a> it only grabs Player.)
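(Aside: pandas 1.5 added an extract_links argument to read_html that returns each cell as a (text, link) tuple, so on newer versions you could stay in pandas. A sketch, assuming pandas >= 1.5; the BeautifulSoup route below works on any version, so that’s what we’ll use.)

import pandas as pd

# pandas >= 1.5 only: each body cell comes back as a (text, href) tuple
df = pd.read_html('https://www.pro-football-reference.com/years/2018/fantasy.htm',
                  extract_links='body')[0]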

But we can do it: let’s use requests to grab the raw HTML, then BeautifulSoup to crawl through the Name column of the table, recording things like the hyperlink as we go.

Here’s how to grab the first table as a BeautifulSoup object:

import requests
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com'
year = 2018
r = requests.get(url + '/years/' + str(year) + '/fantasy.htm')
soup = BeautifulSoup(r.content, 'html.parser')
parsed_table = soup.find_all('table')[0]

The r = requests.get(...) line grabs the webpage, and we give the raw HTML r.content to BeautifulSoup and make the object soup. We can then do things like find_all instances of the 'table' tag, and grab the first one with [0].
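(A slightly sturdier alternative to grabbing tables by position is to look one up by its id attribute; PFR tables carry ids. The exact value 'fantasy' here is my assumption, so check the page source:)

# find the table by id instead of by position; the id 'fantasy' is an assumption
parsed_table = soup.find('table', {'id': 'fantasy'})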

Now that we have the raw HTML of the table stored in parsed_table, here’s how to loop through the rows of the table, grabbing the player name entry (which you may notice from trawling the page source is conveniently labeled with the attribute data-stat="player") and extracting the parts of the <a> tag that we want:

# first 2 rows are col headers so skip them with [2:]
for i, row in enumerate(parsed_table.find_all('tr')[2:]):
    dat = row.find('td', attrs={'data-stat': 'player'})
    name = dat.a.get_text()
    stub = dat.a.get('href')
    # NEXT: use `stub` to access the player page
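One wrinkle to be aware of: PFR repeats the header row periodically in long tables, and those rows have no player cell, so dat comes back as None there. The full loop below papers over this with a try/except; a more surgical guard for the loop body would be:

    dat = row.find('td', attrs={'data-stat': 'player'})
    if dat is None or dat.a is None:  # repeated header or spacer row: skip it
        continue
    name = dat.a.get_text()
    stub = dat.a.get('href')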

This stub contains the nugget of information we came for. We’re now ready to combine our code so far into one enormous, glorious loop:

url = 'https://www.pro-football-reference.com'
year = 2018
maxp = 300

# grab fantasy players
r = requests.get(url + '/years/' + str(year) + '/fantasy.htm')
soup = BeautifulSoup(r.content, 'html.parser')
parsed_table = soup.find_all('table')[0]

df = []

# first 2 rows are col headers
for i, row in enumerate(parsed_table.find_all('tr')[2:]):
    if i % 10 == 0: print(i, end=' ')
    if i >= maxp:
        print('\nComplete.')
        break

    try:
        dat = row.find('td', attrs={'data-stat': 'player'})
        name = dat.a.get_text()
        stub = dat.a.get('href')
        stub = stub[:-4] + '/fantasy/' + str(year)
        pos = row.find('td', attrs={'data-stat': 'fantasy_pos'}).get_text()

        # grab this player's stats
        tdf = pd.read_html(url + stub)[0]

        # get rid of MultiIndex, just keep last level
        tdf.columns = tdf.columns.get_level_values(-1)

        # fix the away/home column
        tdf = tdf.rename(columns={'Unnamed: 4_level_2': 'Away'})
        tdf['Away'] = [1 if r == '@' else 0 for r in tdf['Away']]

        # drop all intermediate stats
        tdf = tdf.iloc[:, [1, 2, 3, 4, 5, -3]]

        # drop "Total" row
        tdf = tdf.query('Date != "Total"')

        # add other info
        tdf['Name'] = name
        tdf['Position'] = pos
        tdf['Season'] = year

        df.append(tdf)
    except:
        pass

df = pd.concat(df)
df.head()
    G#        Date   Tm  Away  Opp  FantPt         Name Position  Season
0  1.0  2018-09-10  LAR     1  OAK    20.7  Todd Gurley       RB    2018
1  2.0  2018-09-16  LAR     0  ARI    29.3  Todd Gurley       RB    2018
2  3.0  2018-09-23  LAR     0  LAC    19.6  Todd Gurley       RB    2018
3  4.0  2018-09-27  LAR     0  MIN    21.6  Todd Gurley       RB    2018
4  5.0  2018-10-07  LAR     1  SEA    29.3  Todd Gurley       RB    2018

and since this will probably take a few minutes (depending on your maxp setting and your internet connection), I recommend saving the df to a CSV:

df.to_csv('fantasy2018.csv')
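A practical caveat: the loop above fires a few hundred requests in quick succession, and sites like PFR do rate-limit aggressive scrapers. Adding a short pause between player pages is cheap insurance; the 2-second figure is an arbitrary choice:

import time

# inside the scraping loop, right after the pd.read_html(url + stub) call:
time.sleep(2)  # pause between player pages; duration is arbitrary but polite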

Plotting average vs. variance

So now you can easily grab one or more players’ fantasy point performance trajectories over the season and plot them if you so desire:

(df.query('Name == "Saquon Barkley"')
   .plot('Date', 'FantPt'))

[Figure: Saquon Barkley's game-by-game fantasy points, 2018 season]
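To overlay a few players on one set of axes, a pivot does the trick. A sketch; the two names here are just examples:

(df[df['Name'].isin(['Saquon Barkley', 'Todd Gurley'])]
 .pivot_table(index='Date', columns='Name', values='FantPt')
 .plot(marker='o'))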

To get a feel for more of the data, let’s look at some summary statistics. My first thought is to reduce each player’s fantasy trajectory to a mean and a variance. The ideal player has a high average point total and doesn’t deviate too far from it. A boom-or-bust player has, perhaps, a moderate to high mean but an extremely high variance.

Here’s the (brute force) code:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 4, sharey=True, figsize=(15, 4))
cols = ['k', 'c', 'g', 'b']

for i, pos in enumerate(['QB', 'RB', 'WR', 'TE']):
    t = (df
         .query('Position == @pos')
         .groupby('Name')
         .agg({'FantPt': ['mean', 'std'], 'Position': 'min'}))
    ax[i].scatter(t[('FantPt', 'mean')], t[('FantPt', 'std')],
                  c=cols[i], s=50, alpha=0.5, label=pos)
    ax[i].set(xlabel='Avg Pts', title=pos)

# label some outliers
ax[0].annotate('P. Mahomes', xy=(26, 6.76), xytext=(16, 2),
               arrowprops={'facecolor': 'black', 'width': 0.1, 'shrink': 0.08})
ax[1].annotate('T. Gurley', xy=(22.36, 8.1), xytext=(15, 2),
               arrowprops={'facecolor': 'black', 'width': 0.1, 'shrink': 0.08})
ax[2].annotate('D. Adams', xy=(14.57, 4.2), xytext=(9, 2),
               arrowprops={'facecolor': 'black', 'width': 0.1, 'shrink': 0.1})
ax[3].annotate('T. Kelce', xy=(11.97, 7.5), xytext=(9, 2),
               arrowprops={'facecolor': 'black', 'width': 0.1, 'shrink': 0.1})

ax[0].set(ylim=[1, 13])
plt.tight_layout()
plt.show()

And the plot is the one at the beginning of the blog.

What’s next

It’s a bit silly to pull all this data just to compress it to two summary statistics, but this is a short post. It may be interesting to group similar players based on the dynamics of their entire time series, although this may not be very meaningful or predictive.
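If you wanted to explore that, one rough starting point is to pivot each player’s season into a fixed-length vector and cluster those vectors. A sketch using scikit-learn; zero-filling missed games and choosing five clusters are crude assumptions:

from sklearn.cluster import KMeans

# one row per player, one column per game date; 0 for games not played (crude)
X = (df.pivot_table(index='Name', columns='Date', values='FantPt')
       .fillna(0))
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X)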

All that aside, hopefully this post gives you some basic tools to do simple web scraping in Python.


