Blog 3 -- Web scraping with scrapy
In this post, we'll be making a web scraper - a tool that extracts data from webpages. Suppose we want to find out which movies share the largest number of actors with our favorite movie, say, The Shawshank Redemption. A good place to find this information is IMDB, which has
- movie pages, each with a link to its credits page,
- credits pages containing the cast list, and
- actor pages that list their filmography.
How would we go about finding which movies share the largest number of actors with Shawshank? We would:
1. Start from IMDB's movie page for The Shawshank Redemption: https://www.imdb.com/title/tt0111161/
2. For each actor in its cast list, go to the actor's page and collect all the titles in their filmography.
3. See which other movies appear most frequently among the collected titles.
The `scrapy` spider described below will automate steps 1 and 2. The set-up is a bit different this time - instead of a notebook, we're writing this in a `.py` script and running it in the terminal.
The first step is to open a repository in the location you want - mine is on GitHub at https://github.com/zhijianli9999/pic16b-scrape. Then we initialize the scrapy project by typing the commands below into the terminal.
```bash
conda activate PIC16B
scrapy startproject IMDB_scraper
cd IMDB_scraper
```
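For reference, `scrapy startproject` generates a project skeleton that looks roughly like this (the exact files may vary slightly with your scrapy version):

```text
IMDB_scraper/
├── scrapy.cfg          # deploy configuration
└── IMDB_scraper/       # the project's Python module
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/        # our spider scripts go here
        └── __init__.py
```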
One of the folders created will be called `spiders`. Under this folder, I created the Python script `imdb_spider.py`, in which I program my scraper. In a nutshell, the scraper is a class (I called mine `ImdbSpider`) that inherits from the `scrapy.Spider` class, plus a few functions that define its behavior as it crawls the internet. In this project, these functions follow links to the actor pages and extract the titles in their filmographies.
First, we import the `scrapy` package.
```python
import scrapy

class ImdbSpider(scrapy.Spider):
    name = 'imdb_spider'
    # starting movie: The Shawshank Redemption
    start_urls = ['https://www.imdb.com/title/tt0111161/']
```
In the class we define the class attributes `name` and `start_urls`. These are important: `name` is how we'll refer to the spider when running it from the terminal, and `start_urls` tells the program where on the internet to start (it can be a list of multiple URLs).
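For instance, once the spider is complete, the `name` attribute is what we pass to `scrapy crawl` in the terminal:

```bash
# run from inside the IMDB_scraper project directory
scrapy crawl imdb_spider
```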
```python
def parse(self, response):
    """
    from a movie page, navigate to the Cast & Crew page,
    then call parse_full_credits(self, response) on the credits page
    """
    credit_url = response.url + "fullcredits"
    yield scrapy.Request(credit_url, callback=self.parse_full_credits)
```
The first function in the class is `parse`. It is called once on the starting URL, and all it does is call `parse_full_credits` on the credits page. `response` is a super important parameter and we'll see it again - it holds many attributes (like `url` here) and methods (like the CSS selectors we'll use later). `yield scrapy.Request()` is also a common pattern. Here, we request the credits page with `parse_full_credits` as the callback; we know the URL pattern for credits pages because we're browsing IMDB.com the entire time we're writing these functions.
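As an aside, the string concatenation above works because `start_urls` ends with a slash. Scrapy's `response.urljoin` builds the same URL a little more robustly; a minimal sketch of the alternative:

```python
def parse(self, response):
    # urljoin resolves "fullcredits" relative to response.url,
    # taking care of trailing slashes for us
    yield scrapy.Request(response.urljoin("fullcredits"),
                         callback=self.parse_full_credits)
```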
```python
def parse_full_credits(self, response):
    """
    from the Cast & Crew page, yield a scrapy.Request for the
    page of each actor, with parse_actor_page(self, response)
    as the callback
    """
    actor_list = ["https://imdb.com" + a.attrib["href"]
                  for a in response.css("td.primary_photo a")]
    for actor_url in actor_list:
        yield scrapy.Request(actor_url, callback=self.parse_actor_page)
```
Once we're on the credits page, `parse_full_credits` figures out where the actor links are and calls `parse_actor_page` on each link to an actor page. To figure out where things are on a webpage, you go to the webpage, open the Developer Tools (Cmd+Opt+I on Mac with Chrome), and look for the CSS code of the target element. It takes a bit of sleuthing and reading the scrapy docs, but eventually I figured out what to supply to `response.css`. After getting the URLs of the actor pages, I call `parse_actor_page` on each actor page.
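Scrapy's interactive shell makes this sleuthing much easier: it fetches a page and drops you into a Python session with `response` already populated, so you can try selectors before committing them to the spider. Something like:

```python
# in the terminal: scrapy shell https://www.imdb.com/title/tt0111161/fullcredits
# then, at the prompt, experiment with selectors:
links = response.css("td.primary_photo a")
links[0].attrib["href"]   # a relative link, something like '/name/nm0000209/'
```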
```python
def parse_actor_page(self, response):
    """
    for each movie/show on the actor page, yield a dictionary
    of the form {"actor": actor_name,
                 "movie_or_TV_name": movie_or_TV_name}
    """
    # select the name of the actor
    actor_name = response.css("div.article.name-overview span.itemprop::text").get()
    # all_films includes all films, credited as actor or in other roles
    all_films = response.css("div.filmo-row")
    # keep only the entries credited as actor
    films = [f.css('b a::text').get()
             for f in all_films if f.attrib['id'].split('-')[0] == 'actor']
    for film in films:
        yield {"actor": actor_name, "movie_or_TV_name": film}

# scrapy crawl imdb_spider -o results.csv
```
On each actor page (e.g. Tim Robbins'), `parse_actor_page` is called. I first select the name of the actor, again with a CSS selector. Then I look at the filmography section, only to find that actors' writing credits are listed with a CSS tag similar to their acting credits. Since acting credits are what we want, I use the fact that the `id` attribute of a film's entry starts with `actor` if and only if it's an acting credit. This time, I `yield` a dictionary in a format that facilitates output to a `.csv` file.
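To make that filter concrete, here is a minimal sketch with hypothetical `id` values (the title IDs are made up); splitting on the hyphen and checking the first piece keeps only the acting credits:

```python
# hypothetical id attributes from div.filmo-row elements
row_ids = ["actor-tt0111161", "writer-tt0120737", "producer-tt0120815"]
acting = [i for i in row_ids if i.split('-')[0] == 'actor']
print(acting)  # ['actor-tt0111161']
```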
Finally, I type `scrapy crawl imdb_spider -o results.csv` into my terminal to get a nice CSV file with an entry for every time a movie shared an actor with Shawshank. Let's take a look at this data by tabulating the movies.
```python
import pandas as pd

results = pd.read_csv('https://raw.githubusercontent.com/zhijianli9999/pic16b-scrape/main/IMDB_scraper/results.csv')
ranking = results.groupby("movie_or_TV_name").size().reset_index(name='counts')
ranking = ranking.sort_values(by='counts', ascending=False).reset_index(drop=True)
ranking[0:10]
```
|   | movie_or_TV_name | counts |
|---|---|---|
| 0 | The Shawshank Redemption | 65 |
| 1 | ER | 11 |
| 2 | Law & Order | 10 |
| 3 | CSI: Crime Scene Investigation | 10 |
| 4 | The West Wing | 9 |
| 5 | NYPD Blue | 9 |
| 6 | Cold Case | 9 |
| 7 | The Practice | 9 |
| 8 | The Twilight Zone | 8 |
| 9 | L.A. Law | 8 |
Of course, the movie that shares the most actors with Shawshank is Shawshank itself. Beyond that, we've mostly got long-running TV series, which mechanically share many actors simply because their cast lists are so long.
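As a side note, pandas' `value_counts` produces the same ranking more concisely, if you prefer it to the groupby approach:

```python
# one-liner equivalent of the groupby/sort above
results["movie_or_TV_name"].value_counts().head(10)
```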