Blog 3 -- Web scraping with scrapy

In this post, we’ll be making a web scraper - a tool that extracts data from webpages. Suppose we want to find out which movies share the most actors with our favorite movie, say, The Shawshank Redemption. A good place to find this information is IMDB, which has

  1. movie pages with a link to its credits page,
  2. credits pages containing the cast list, and
  3. actor pages that list their filmography.

How would we go about finding which movies share the largest number of actors with Shawshank? We would:

  1. Start from IMDB’s movie page for The Shawshank Redemption: https://www.imdb.com/title/tt0111161/
  2. For each actor in its cast list, go to the actor’s page and collect all the titles in their filmography.
  3. See which other movies appear most frequently among the collected titles.

The scrapy spider described below will automate steps 1 and 2. The set-up is a bit different this time - instead of a notebook, we’re writing the code in a .py script and running it from the terminal.

The first step is to create a repository wherever you want - mine is on GitHub at https://github.com/zhijianli9999/pic16b-scrape. Then we initialize the scrapy project by running the commands below in the terminal.

conda activate PIC16B
scrapy startproject IMDB_scraper
cd IMDB_scraper
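
For reference, startproject generates a small project skeleton, roughly like the following (the exact files can vary a little across scrapy versions):

IMDB_scraper/
    scrapy.cfg
    IMDB_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py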

One of the folders created is called spiders. Under this folder, I created the Python script imdb_spider.py, in which I program my scraper. In a nutshell, the scraper is a class (I called mine ImdbSpider) that inherits from scrapy.Spider, plus a few methods that define its behavior as it crawls the web. In this project, these methods follow links to the actor pages and extract the titles in their filmographies.

First, we import the scrapy package and start defining our spider class.

import scrapy
class ImdbSpider(scrapy.Spider):
    name = 'imdb_spider'
    # starting show: The Shawshank Redemption
    start_urls = ['https://www.imdb.com/title/tt0111161/']

In the class we define the class attributes name and start_urls. These are important: name is how we refer to the spider when running it from the terminal, and start_urls tells scrapy where on the internet to start crawling (it can be a list of multiple URLs).
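
For example, if we wanted to compare against several movies at once, start_urls could hold more than one movie page - a hypothetical sketch (the second URL is just an illustrative extra starting point):

    start_urls = [
        'https://www.imdb.com/title/tt0111161/',  # The Shawshank Redemption
        'https://www.imdb.com/title/tt0068646/',  # hypothetical second starting movie
    ]

For this post, though, Shawshank is the only starting point.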

    def parse(self, response):
        """
        from a movie page, navigate to the Cast & Crew page
        then call parse_full_credits(self,response) on the credits
        """
        credit_url = response.url + "fullcredits"
        yield scrapy.Request(credit_url, callback=self.parse_full_credits)

The first method in the class is parse. It is called once, on the starting movie page, and all it does is call parse_full_credits on the credits page. The response parameter is important and we’ll see it again - it holds many attributes (like url here) and methods (like the CSS selectors we’ll use later). yield scrapy.Request() is also a common pattern. Here, we build the URL for the credits page by appending "fullcredits" to the movie page’s URL - a pattern you can verify just by browsing IMDB - and register parse_full_credits as the callback.
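
As an aside, the concatenation above works because the start URL ends in a slash. If you’d rather not rely on that, scrapy responses also provide urljoin, which resolves relative URLs; a sketch of an equivalent parse:

    def parse(self, response):
        # urljoin handles the trailing slash for us
        credit_url = response.urljoin("fullcredits")
        yield scrapy.Request(credit_url, callback=self.parse_full_credits)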

    def parse_full_credits(self, response):
        """
        from Cast & Crew page, yield a scrapy.Request for the
        page of each actor with parse_actor_page(self, response)
        """
        actor_list = ["https://imdb.com" + a.attrib["href"]
                      for a in response.css("td.primary_photo a")]
        for actor_url in actor_list:
            yield scrapy.Request(actor_url, callback=self.parse_actor_page)

Once we’re on the credits page, parse_full_credits figures out where the actor links are and calls parse_actor_page on each of them. To figure out where things are on a webpage, open the page, go to the Developer Tools (Cmd+Opt+I on Mac & Chrome), and look for the CSS selector that matches the target element. It takes a bit of sleuthing and reading the scrapy docs, but eventually I figured out what to supply to response.css. Once I have the URLs of the actor pages, I yield a request for each one with parse_actor_page as the callback.
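
If you want to experiment with selectors before committing them to the spider, scrapy also ships with an interactive shell. A sketch (the selector assumes IMDB’s page layout at the time of writing):

scrapy shell "https://www.imdb.com/title/tt0111161/fullcredits"
# inside the shell, `response` is already loaded, so you can try things like:
# response.css("td.primary_photo a")                    # all actor links
# response.css("td.primary_photo a").attrib["href"]     # href of the first link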

    def parse_actor_page(self, response):
        """
        for each movie/show on the actor page, return a dictionary
        of the form {"actor" : actor_name, "movie_or_TV_name" :
        movie_or_TV_name}.
        """
        # select name of actor
        n = response.css("div.article.name-overview span.itemprop::text").get()
        # all_films includes all films credited as actor or other roles
        all_films = response.css("div.filmo-row")
        # filter only those credited as actor
        films = [f.css('b a::text').get()
                 for f in all_films if f.attrib['id'].split('-')[0] == 'actor']
        for film in films:
            yield {"actor": n, "movie_or_TV_name": film}

# scrapy crawl imdb_spider -o results.csv

On each actor page (e.g. Tim Robbins), parse_actor_page is called. I first select the name of the actor, again with a CSS selector. Then I look at the filmography section, only to find out that actors’ writing credits are listed with a similar CSS structure to their acting credits. Since acting credits are what we want, I use the fact that the id attribute of a film’s entry starts with actor if and only if it’s an acting credit. This time, I yield a dictionary in a format that facilitates output to a .csv file.
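
Before running the full crawl, it can be worth doing a small test run; one way (an optional sketch) is to cap the number of pages with scrapy’s CLOSESPIDER_PAGECOUNT setting, passed on the command line:

scrapy crawl imdb_spider -s CLOSESPIDER_PAGECOUNT=20 -o test.csv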

Finally, I type scrapy crawl imdb_spider -o results.csv into my terminal to get a nice CSV file with an entry for every time a movie shared an actor with Shawshank. Let’s take a look at this data by tabulating the movies.

import pandas as pd
import numpy as np
results = pd.read_csv('https://raw.githubusercontent.com/zhijianli9999/pic16b-scrape/main/IMDB_scraper/results.csv')
ranking = results.groupby("movie_or_TV_name").size().reset_index(name='counts')
ranking = ranking.sort_values(by='counts',ascending=False).reset_index(drop=True)
ranking[0:10]
                 movie_or_TV_name  counts
0        The Shawshank Redemption      65
1                              ER      11
2                     Law & Order      10
3  CSI: Crime Scene Investigation      10
4                   The West Wing       9
5                       NYPD Blue       9
6                       Cold Case       9
7                    The Practice       9
8               The Twilight Zone       8
9                        L.A. Law       8
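
As an aside, pandas’ value_counts gives essentially the same ranking in one line, using the same results DataFrame:

results["movie_or_TV_name"].value_counts().head(10)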

Of course, the title that shares the most actors with Shawshank is Shawshank itself. Beyond that, the top of the list is mostly long-running TV series, which accumulate huge cast lists simply by running for many seasons.

Written on February 4, 2022