Scraping Course Schedules with Beautiful Soup

BY Yichi

Tuesday January 15, 2019 # Category: code # Tags: code, python, scraper

Class registration is no easy task. Getting into all the courses you like often takes a lot of research and planning. To complicate matters, some sections do not list an instructor. Today (well, actually a month ago) I am going to deal with this problem.

The other day, I heard two friends of mine discussing course registration. One of the courses they had to register for, unfortunately, did not list instructors for its sections. They tried to figure out who teaches which section by exclusion: if Prof. A teaches section x at 1 pm on Friday, he cannot possibly teach section y at that same time. This would at least give us more information to base our decisions on, but it was too labor-intensive for my taste, so I decided to automate it.
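Here is a minimal sketch of that exclusion idea. The section names, time slots, and candidate list are made up for illustration, and it assumes each time slot is a simple string that either matches exactly or does not overlap at all.

```python
from collections import defaultdict

# Hypothetical data: (section, time slot, instructor or None if unlisted).
sections = [
    ("x", "Fri 13:00", "Prof. A"),
    ("y", "Fri 13:00", None),
    ("z", "Mon 10:00", "Prof. B"),
]
# Hypothetical set of people who might teach the unlisted sections.
candidates = {"Prof. A", "Prof. B"}


def possible_instructors(sections, candidates):
    """Map each unlisted section to the candidates who are free at its time."""
    busy = defaultdict(set)  # time slot -> instructors known to teach then
    for _, slot, instructor in sections:
        if instructor is not None:
            busy[slot].add(instructor)
    # Anyone already teaching in the same slot is ruled out.
    return {
        section: candidates - busy[slot]
        for section, slot, instructor in sections
        if instructor is None
    }


result = possible_instructors(sections, candidates)
print(result)
```

Prof. A is already busy on Friday at 1 pm, so only Prof. B remains as a candidate for section y.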

I wrote a scraper to pull data from my university’s course catalog. The scraper itself is quite simple. Here is a short list of what it does:

1. Open the course catalog.
2. Find the links to all program pages (e.g. Computer Science, Electrical Engineering, etc.).
3. Loop through the program pages, visiting one at a time.
4. Loop through all sections on that page.
5. Grab the values for each section using Beautiful Soup and print them to stdout in CSV format.

Since all the list items containing links to program pages have a data-type="department" attribute, step 2 is easily done. I spent some time figuring out how to select the sections and, for each section, its id, type, time, days, and instructor, which involved some CSS selector magic (not difficult at all).
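To make the selectors concrete, here is a small sketch that runs them against a made-up HTML snippet. The markup and the example.edu URLs are assumptions for illustration; only the two selectors and the class names come from the script at the end of this post.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real catalog pages may be structured differently.
catalog_html = """
<ul>
  <li data-type="school"><a href="https://example.edu/engr/">Engineering</a></li>
  <li data-type="department"><a href="https://example.edu/csci/">Computer Science</a></li>
  <li data-type="department"><a href="https://example.edu/ee/">Electrical Engineering</a></li>
</ul>
"""
program_html = """
<table>
  <tr data-section-id="12345">
    <td class="section">12345</td>
    <td class="type">Lecture</td>
    <td class="time">1:00-1:50pm</td>
    <td class="days">Friday</td>
    <td class="instructor"></td>
  </tr>
</table>
"""

# Step 2: keep only links inside list items marked as departments.
catalog = BeautifulSoup(catalog_html, "html.parser")
links = [a["href"] for a in catalog.select('li[data-type="department"] a')]

# Steps 4 and 5: one table row per section, one cell per field.
program = BeautifulSoup(program_html, "html.parser")
rows = []
for row in program.select("tr[data-section-id]"):
    rows.append({
        name: row.find(class_=name).get_text(strip=True)
        for name in ("section", "type", "time", "days", "instructor")
    })

print(links)
print(rows)
```

The attribute selector skips the "school" item, and the section with no instructor simply yields an empty string for that field.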

One thing to note is that I used Selenium to navigate the web pages. I thought this was necessary because items in the sections table are folded by default, and to expand an item one has to click on its title. It turned out that all the data I wanted was available regardless of whether the items were folded, but since efficiency was not a priority in this project, I left it that way. If you want to replicate this project, you can easily replace Selenium with the requests library, which would run a bit faster.
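A rough sketch of what the requests-based version might look like follows. I have not run it against the real catalog, so it assumes the pages arrive as plain HTML without needing JavaScript; the helper names fetch_html and program_links are made up for this example.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page with requests instead of driving a browser."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


def program_links(html, base_url):
    """Return absolute URLs of all program pages linked from the catalog."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base_url, a["href"])
        for a in soup.select('li[data-type="department"] a')
    ]


# Usage (performs network requests, so it is left commented out):
# catalog_url = "https://classes.usc.edu/term-20191/"
# for link in program_links(fetch_html(catalog_url), catalog_url):
#     page = fetch_html(link)
```

The parsing stays exactly the same; only the way the HTML is obtained changes, so the rest of the script can be reused as is.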

python -u scraper.py > path-to-file

To write the output to a file, use I/O redirection as shown above. The -u flag tells Python not to buffer stdout, so anything printed is written out immediately instead of sitting in a buffer. Once you save the list as a CSV file, there is a wide selection of software to process the data.
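As one example of such processing, Python's own csv module can read the file back. The sketch below groups section ids by instructor; the sample rows are invented, and the column order (section, type, time, days, instructor) is the one the script prints. The skipinitialspace option lets the reader cope with a space after each comma.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the column order the script prints.
sample = '''"12345", "Lecture", "1:00-1:50pm", "Friday", "Jane Doe"
"12346", "Discussion", "2:00-2:50pm", "Monday", ""
"12347", "Lecture", "9:00-9:50am", "Tuesday", "Jane Doe"
'''


def sections_by_instructor(lines):
    """Group section ids by instructor name; blank names become '(unlisted)'."""
    reader = csv.reader(lines, skipinitialspace=True)
    grouped = defaultdict(list)
    for row in reader:
        if len(row) != 5:
            continue  # skip blank or malformed lines
        section, _kind, _time, _days, instructor = row
        grouped[instructor or "(unlisted)"].append(section)
    return dict(grouped)


grouped = sections_by_instructor(io.StringIO(sample))
print(grouped)

# With a real file, pass an open file handle instead:
# with open("path-to-file", newline="") as f:
#     grouped = sections_by_instructor(f)
```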

It turned out that having all this information was not as useful as expected. Apparently many professors do not have a busy course schedule, so there were not many conflicts to be found that would let us infer who teaches which section. Nevertheless, now that we have the full list of sections, there’s more fun stuff to do. How about an auto scheduler that builds a course schedule for you given simple inputs? Well, that’s for another day…

The Python script can also be found on my GitHub repo.

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import csv
import sys
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from selenium import webdriver

CATALOG_URL = 'https://classes.usc.edu/term-20191/#'
FIELDS = ('section', 'type', 'time', 'days', 'instructor')

# Run Chrome without a visible window.
opts = webdriver.ChromeOptions()
opts.add_argument('--headless')
opts.add_argument('--no-sandbox')
browser = webdriver.Chrome(options=opts)
# Let the csv module handle quoting and escaping of every field.
writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL, lineterminator='\n')
try:
    browser.get(CATALOG_URL)
    # browser.find_element_by_css_selector('a[data-sort-by="data-title"]').click()
    soup = BeautifulSoup(browser.page_source, "html.parser")
    # Links to program pages sit in list items marked as departments.
    programs = soup.select('li[data-type="department"] a')
    for program in programs:
        time.sleep(30)  # pause between page loads
        # urljoin leaves absolute links alone and resolves relative ones.
        browser.get(urljoin(CATALOG_URL, program['href']))
        soup = BeautifulSoup(browser.page_source, "html.parser")
        # Each section is a table row carrying a data-section-id attribute.
        sessions = soup.select('tr[data-section-id]')
        for session in sessions:
            row = []
            for field in FIELDS:
                cell = session.find(class_=field)
                # A missing cell becomes an empty field instead of raising.
                row.append(cell.get_text(strip=True) if cell else '')
            writer.writerow(row)
finally:
    # Always shut the browser down, even if a page fails to load.
    browser.quit()

　Text published under CC BY-SA 4.0
