This blog post explains how to extract data from Stanford’s Computer Science program website, specifically focusing on retrieving faculty information. We use Python libraries such as httpx for making HTTP requests and BeautifulSoup for parsing HTML content. The extracted data is saved to a JSON file for easy access and further analysis.
If you’d like, you can watch the video tutorial that accompanies this blog post.
The first step involves importing the necessary libraries:
import httpx
from bs4 import BeautifulSoup
import json
httpx: A powerful HTTP client for making requests to web servers. We use it to fetch the HTML content of the Stanford Computer Science faculty page.
BeautifulSoup: This library is used to parse HTML and extract specific elements from the webpage.
json: This standard Python library is used to save the extracted data in JSON format.
Next, we define the URL of the webpage we want to scrape:
url = "https://www.cs.stanford.edu/people-cs/faculty-name"
We then make an HTTP GET request to this URL using httpx.get:
response = httpx.get(url, timeout=httpx.Timeout(30.0))
The timeout parameter ensures that the request does not hang indefinitely. It specifies the maximum time allowed for the entire request, in this case 30 seconds.
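If you need finer-grained control, httpx also lets you set separate connect, read, write, and pool timeouts, and you can fail fast on HTTP error codes before parsing. Here is a small sketch of that variant (the 5-second connect timeout is an illustrative choice, not part of the original script):

# Overall budget of 30 seconds, but give up after 5 seconds if no connection is made
timeout = httpx.Timeout(30.0, connect=5.0)
response = httpx.get(url, timeout=timeout)

# Raise an HTTPStatusError for 4xx/5xx responses instead of parsing an error page
response.raise_for_status()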
Once we have the HTML content from the response, we use BeautifulSoup to parse it:
soup = BeautifulSoup(response.text, 'html.parser')
Here, we pass the raw HTML content (response.text) to BeautifulSoup and specify that we want to use the built-in HTML parser. This allows us to navigate and search the HTML structure easily.
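Before writing the full extraction loop, it can help to sanity-check the parsed document. A quick sketch (the exact output depends on the live page):

# Confirm the page parsed and see how many list items there are to iterate over
print(soup.title.get_text(strip=True) if soup.title else "no <title> found")
print(len(soup.find_all('li')), "list items found")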
We then start the data extraction process:
professors = []
for li in soup.find_all('li'):
    # The professor's name lives in an <h3>; skip list items without one
    name_tag = li.find('h3')
    if name_tag:
        name = name_tag.get_text(strip=True)
    else:
        continue
    # The short title sits in a <div> with a specific Views class
    title_tag = li.find('div', class_='views-field views-field-su-person-short-title')
    if title_tag:
        title = title_tag.get_text(strip=True)
    else:
        title = "N/A"
    # The first link in the item points at the profile page (a relative URL)
    link_tag = li.find('a', href=True)
    if link_tag:
        profile_link = "https://www.cs.stanford.edu" + link_tag['href']
    else:
        profile_link = "N/A"
    professors.append({
        'name': name,
        'title': title,
        'profile_link': profile_link
    })
We initialize an empty list called professors to store the extracted information. The for loop iterates over each <li> (list item) element found on the page. For each <li> element:
- We look for the <h3> tag, which contains the professor’s name. If found, we extract the text; otherwise we skip the item.
- We search for a <div> with the class views-field views-field-su-person-short-title to retrieve the professor’s title. If it’s not found, we assign “N/A”.
- We look for an <a> tag with an href attribute, which contains the link to the professor’s profile. If present, we prepend the base URL to create the full link.
Finally, we append a dictionary containing the professor’s name, title, and profile link to the professors list.
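One caveat worth noting: prepending the base URL with string concatenation assumes every href on the page is a relative path. If a profile link were already absolute, urllib.parse.urljoin from the standard library would handle both cases. A small sketch of that variant (not part of the original script):

from urllib.parse import urljoin

BASE_URL = "https://www.cs.stanford.edu"

# urljoin resolves relative paths against the base and leaves absolute URLs untouched
profile_link = urljoin(BASE_URL, link_tag['href']) if link_tag else "N/A"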
After extracting the data, we save it to a JSON file:
filename = 'professors.json'
with open(filename, 'w', encoding='utf-8') as file:
    json.dump(professors, file, ensure_ascii=False, indent=4)
We specify the filename professors.json and open it in write mode. The json.dump function writes the list of professor dictionaries to the file. We set ensure_ascii=False to properly handle any non-ASCII characters, and indent=4 to make the JSON file more readable.
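To verify the result, you can read the file back with json.load, for example (a quick check, assuming the script above ran successfully):

with open(filename, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Report how many records were written and show the first one, if any
print(len(data), "professors loaded")
if data:
    print(data[0])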
Finally, the script prints a confirmation message indicating that the data has been successfully saved:
print(f"Data has been saved to {filename}")
In this blog post, we’ve walked through a Python script that scrapes data from Stanford’s Computer Science faculty page, processes it, and saves it in a structured format. This script demonstrates the basics of web scraping and data extraction, which can be applied to various other projects that require automated data collection from web pages.
You can find the complete code here: https://github.com/itishosseinian/Extracting-Data-from-Stanford-s-Computer-Science-Program.git