Scraping Government Search Registers (with BeautifulSoup)
Sadly no wacky data analysis projects to report this month (it’s been busy), so instead I’m going to go through a little web scraping project I did recently for my workplace. (In Python this time, largely for the purposes of my own learning and development).
The mission: From this search register, generate a list of schools plus the schools’ website domain name.
Step 1: Scrape the school website domain name for a single school
(e.g. this school)
First, grab the html of the entire page:
import requests
from bs4 import BeautifulSoup

url = "https://get-information-schools.service.gov.uk/Establishments/Establishment/Details/137377?searchQueryString=tok%3D8TjW1JHQ"
response = requests.get(url)
html = BeautifulSoup(response.content, 'html.parser')
Now, we need a way to pull out the school website from the html without extracting all the other elements on the page.
An inspection of the html on the page reveals that the web domain is located in a division with an id = “details-summary”. This division contains a list of terms, tagged with the <dt> tag. Each list term is followed by a value, tagged with the <dd> tag.
For instance, the school’s website has a <dt> element that says ‘Website:’ and a <dd> element that has the school website url in it, as an ‘href’ attribute.
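In simplified form, the relevant markup looks something like this (an illustrative, made-up fragment rather than the page’s exact html, which has many more <dt>/<dd> pairs):

<div id="details-summary">
  <dl>
    <dt>Website:</dt>
    <dd><a href="http://www.example-school.sch.uk">http://www.example-school.sch.uk</a></dd>
    <!-- other <dt>/<dd> pairs omitted -->
  </dl>
</div>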
Now, I could extract every element with a <dd> tag, but then I’d end up with much more than I want, and I’d still need a reliable way to identify the element corresponding to the school website.
However, we can identify the correct element by using the corresponding <dt> element (since we know that the school website’s <dd> is always preceded by a <dt> element whose content says ‘Website:’).
This means that we can grab the school website with the following method:
- Find the html located in the <div> element with an id = “details-summary”
- Find all the <dt> elements in this div
- If the contents of a <dt> element says “Website:”, find the subsequent <dd> element, using BeautifulSoup’s find_next_sibling() function
- Get the href for this element.
html_summary = html.find(id = "details-summary")
cols = html_summary.find_all("dt")
for col in cols:
    if col.get_text() == "Website:":
        tag = col.find_next_sibling("dd")
        website = tag.find("a").get('href')
Finally, since we know we’re going to want to do this for all of the schools in the search register, let’s convert these steps into some functions to use later:
def get_school_details(url):
    response = requests.get(url)
    html = BeautifulSoup(response.content, 'html.parser')
    html_summary = html.find(id = "details-summary")
    return html_summary

def find_website(html):
    if not html:
        return None
    # default to None in case no "Website:" entry is found
    website = None
    cols = html.find_all("dt")
    for col in cols:
        if col.get_text() == "Website:":
            tag = col.find_next_sibling("dd")
            try:
                website = tag.find("a").get('href')
            except AttributeError:
                website = None
    return website
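As a quick sanity check, we can chain the two functions together on the example school url from above:

details = get_school_details(url)
website = find_website(details)
print(website)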
Step 2: Scrape the first main search page to get a list of links to each individual school page
So, we’ve scraped the school website domain off one page. But we need to do that for all ~3700 schools, meaning that we need the urls of the government website pages for each of these schools.
Now, the main search page is actually split into 75 pages, and there are only 50 schools on the first page of the search. For now, let’s just focus on getting the links to the pages for the first 50 schools.
We start by inspecting the html of the main search page, to see where the school page links are stored.
This time, our job is a bit easier, as it turns out that the links to the school pages are stored in <a> elements with a class = “bold-small”. The school name is located in the content of the element, and the school page link in the ‘href’.
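Something along these lines (again an illustrative, made-up fragment; the school name is invented, and the href follows the pattern of the Details url from Step 1):

<a class="bold-small" href="/Establishments/Establishment/Details/137377">Example High School</a>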
This means that we can directly grab the href and contents for each of the elements in a for loop. I’m choosing to store each school’s name and page link in its own dictionary, and then append each dictionary to a list.
# url of the first page of search results
index_page = "https://get-information-schools.service.gov.uk/Establishments/Search?tok=8TjW1JHQ&startIndex=0&Count=3700"

resp = requests.get(index_page)
html = BeautifulSoup(resp.content, 'html.parser')
school_links = html.find_all('a', class_="bold-small")

output = []
for school in school_links:
    details = {}
    name = school.contents[0]
    page_link = school.get('href')
    details["name"] = name
    details["link"] = page_link
    output.append(details)
Finally, again, since we’re going to have to eventually run this on 75 pages worth of search, let’s turn the steps into a function:
def get_school_pages(index_page):
    resp = requests.get(index_page)
    html = BeautifulSoup(resp.content, 'html.parser')
    school_links = html.find_all('a', class_="bold-small")
    output = []
    for school in school_links:
        details = {}
        name = school.contents[0]
        page_link = school.get('href')
        details["name"] = name
        # ROOT (the site's base url) is defined in the next step
        details["link"] = f"{ROOT}{page_link}"
        output.append(details)
    return output
Step 3: Get a list of URLs for all 75 search pages
We now have:
- a function to get the links to individual school pages from a main search page
- some functions to get the school website from the individual school page
But we don’t just want to do this for one search page; we want to be able to run our functions on all 75 search pages. This means that instead of just one url corresponding to the first search page, we need the urls of all 75 search pages.
Examining the urls of the search pages gives us clues as to how we could do this:
- page 1: https://get-information-schools.service.gov.uk/Establishments/Search?tok=8TjW1JHQ&startIndex=0&Count=3700
- page 2: https://get-information-schools.service.gov.uk/Establishments/Search?tok=8TjW1JHQ&startIndex=50&Count=3700
- page 3: https://get-information-schools.service.gov.uk/Establishments/Search?tok=8TjW1JHQ&startIndex=100&Count=3700
- page 4: https://get-information-schools.service.gov.uk/Establishments/Search?tok=8TjW1JHQ&startIndex=150&Count=3700
The only difference between each of these page urls is the startIndex, which increments by 50.
So, no scraping required here - from a template url in the form ‘https://get-information-schools.service.gov.uk/Establishments/Search?tok=8TjW1JHQ&startIndex={{something}}&Count=3700’ we can generate 75 urls with a startIndex beginning at 0 and incrementing by 50 until we reach the last page.
ROOT = "https://get-information-schools.service.gov.uk"
TEMPLATE = f"{ROOT}/Establishments/Search?tok=8TjW1JHQ&startIndex={{startIndex}}&Count=3702"

# get urls for all 75 search pages
def get_index_pages():
    return [TEMPLATE.format(startIndex=startIndex) for startIndex in range(0, 3750, 50)]

pages = get_index_pages()
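A quick check confirms this gives us the 75 urls we expect:

print(len(pages))   # 75
print(pages[0])     # first page, startIndex=0
print(pages[-1])    # last page, startIndex=3700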
Step 4: Scrape all 75 search pages to get a list of links to all of the government school pages
We now have a list of search page urls. Next, we need to scrape each page to get the links to all ~3700 school pages. Thankfully, we’ve already written a function to help us do this, which we just need to run on each search page in a for loop.
import time

schools = []
for index_page in pages:
    print(f"trying {index_page}")
    schools = schools + get_school_pages(index_page)
    time.sleep(1)
(The time.sleep() call here tells Python to wait 1 second between requests so as not to overload the web server.)
With this for loop, we now have a list containing the links to all ~3700 school pages. This leaves us with one final step…
Step 5: Putting it all together
Our last step is to scrape the school website domain off each individual school page. Again, we’ve already written functions to help us do this, so it’s simply a case of running the functions on all of the school pages.
for school in schools:
    print(f"trying school: {school['name']}")
    url = school["link"]
    details = get_school_details(url)
    website = find_website(details)
    school["website"] = website
    time.sleep(1)
And with that, we’re done! We now have a list containing, for each school:
- the school’s name
- a link to the government page describing the school
- the website domain of the school
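Each entry in the schools list now looks something like this (the values below are illustrative rather than real output):

{
    "name": "Example High School",
    "link": "https://get-information-schools.service.gov.uk/Establishments/Establishment/Details/137377",
    "website": "http://www.example-school.sch.uk"
}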
Full code for the web scraping below:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

ROOT = "https://get-information-schools.service.gov.uk"
TEMPLATE = f"{ROOT}/Establishments/Search?tok=8TjW1JHQ&startIndex={{startIndex}}&Count=3702"

# get urls for all 75 search pages
def get_index_pages():
    return [TEMPLATE.format(startIndex=startIndex) for startIndex in range(0, 3750, 50)]

pages = get_index_pages()

# retrieve url of each school page from search pages
def get_school_pages(index_page):
    resp = requests.get(index_page)
    html = BeautifulSoup(resp.content, 'html.parser')
    school_links = html.find_all('a', class_="bold-small")
    output = []
    for school in school_links:
        details = {}
        name = school.contents[0]
        page_link = school.get('href')
        details["name"] = name
        details["link"] = f"{ROOT}{page_link}"
        output.append(details)
    return output

schools = []
for index_page in pages:
    print(f"trying {index_page}")
    schools = schools + get_school_pages(index_page)
    time.sleep(1)

# get school's website domain
def get_school_details(url):
    response = requests.get(url)
    html = BeautifulSoup(response.content, 'html.parser')
    html_summary = html.find(id = "details-summary")
    return html_summary

def find_website(html):
    if not html:
        return None
    # default to None in case no "Website:" entry is found
    website = None
    cols = html.find_all("dt")
    for col in cols:
        if col.get_text() == "Website:":
            tag = col.find_next_sibling("dd")
            try:
                website = tag.find("a").get('href')
            except AttributeError:
                website = None
    return website

for school in schools:
    print(f"trying school: {school['name']}")
    url = school["link"]
    details = get_school_details(url)
    website = find_website(details)
    school["website"] = website
    time.sleep(1)

# save as csv
schools_df = pd.DataFrame(schools)
schools_df.to_csv("school-page-links.csv")