This blog post explains how to extract data from Stanford’s Computer Science program website, specifically focusing on retrieving faculty information. We use Python libraries such as httpx for making HTTP requests and BeautifulSoup for parsing HTML content. The extracted data is saved to a JSON file for easy access and further analysis.
If you’d like, you can watch the video tutorial that accompanies this blog post.
The first step involves importing the necessary libraries:
import httpx
from bs4 import BeautifulSoup
import json
httpx: A powerful HTTP client for making requests to web servers. We use it to fetch the HTML content of the Stanford Computer Science faculty page.
BeautifulSoup: This library is used to parse HTML and extract specific elements from the webpage.
json: This standard Python library is used to save the extracted data in JSON format.
Next, we define the URL of the webpage we want to scrape:
url = "https://www.cs.stanford.edu/people-cs/faculty-name"
We then make an HTTP GET request to this URL using httpx.get:
response = httpx.get(url, timeout=httpx.Timeout(30.0))
The timeout parameter ensures that the request does not hang indefinitely. It specifies the maximum time allowed for the entire request, in this case 30 seconds.
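If you need finer-grained control, httpx also lets you set separate connect, read, write, and pool timeouts, and you can fail fast on HTTP error codes before parsing. Here is a small sketch of that variant (the 5-second connect timeout is an illustrative choice, not part of the original script):

# Overall budget of 30 seconds, but give up after 5 seconds if no connection is made
timeout = httpx.Timeout(30.0, connect=5.0)
response = httpx.get(url, timeout=timeout)

# Raise an HTTPStatusError for 4xx/5xx responses instead of parsing an error page
response.raise_for_status()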
Once we have the HTML content from the response, we use BeautifulSoup to parse it:
soup = BeautifulSoup(response.text, 'html.parser')
Here, we pass the raw HTML content (response.text) to BeautifulSoup and specify that we want to use the built-in HTML parser. This allows us to navigate and search the HTML structure easily.
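Before writing the full extraction loop, it can help to sanity-check the parsed document. A quick sketch (the exact output depends on the live page):

# Confirm the page parsed and see how many list items there are to iterate over
print(soup.title.get_text(strip=True) if soup.title else "no <title> found")
print(len(soup.find_all('li')), "list items found")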
We then start the data extraction process:
professors = []
for li in soup.find_all('li'):
    # The professor's name lives in an <h3>; skip list items without one
    name_tag = li.find('h3')
    if name_tag:
        name = name_tag.get_text(strip=True)
    else:
        continue
    # The short title sits in a <div> with a specific Views class
    title_tag = li.find('div', class_='views-field views-field-su-person-short-title')
    if title_tag:
        title = title_tag.get_text(strip=True)
    else:
        title = "N/A"
    # The first link in the item points at the profile page (a relative URL)
    link_tag = li.find('a', href=True)
    if link_tag:
        profile_link = "https://www.cs.stanford.edu" + link_tag['href']
    else:
        profile_link = "N/A"
    professors.append({
        'name': name,
        'title': title,
        'profile_link': profile_link
    })
We initialize an empty list called professors to store the extracted information. The for loop iterates over each <li> (list item) element found on the page. For each <li> element:
- We look for the <h3> tag, which contains the professor’s name. If found, we extract the text; otherwise we skip the item.
- We search for a <div> with the class views-field views-field-su-person-short-title to retrieve the professor’s title. If it’s not found, we assign “N/A”.
- We look for an <a> tag with an href attribute, which contains the link to the professor’s profile. If present, we prepend the base URL to create the full link.
Finally, we append a dictionary containing the professor’s name, title, and profile link to the professors list.
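One caveat worth noting: prepending the base URL with string concatenation assumes every href on the page is a relative path. If a profile link were already absolute, urllib.parse.urljoin from the standard library would handle both cases. A small sketch of that variant (not part of the original script):

from urllib.parse import urljoin

BASE_URL = "https://www.cs.stanford.edu"

# urljoin resolves relative paths against the base and leaves absolute URLs untouched
profile_link = urljoin(BASE_URL, link_tag['href']) if link_tag else "N/A"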
After extracting the data, we save it to a JSON file:
filename = 'professors.json'
with open(filename, 'w', encoding='utf-8') as file:
    json.dump(professors, file, ensure_ascii=False, indent=4)
We specify the filename professors.json and open it in write mode. The json.dump function writes the list of professor dictionaries to the file. We set ensure_ascii=False to properly handle any non-ASCII characters, and indent=4 to make the JSON file more readable.
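To verify the result, you can read the file back with json.load, for example (a quick check, assuming the script above ran successfully):

with open(filename, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Report how many records were written and show the first one, if any
print(len(data), "professors loaded")
if data:
    print(data[0])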
Finally, the script prints a confirmation message indicating that the data has been successfully saved:
print(f"Data has been saved to {filename}")
In this blog post, we’ve walked through a Python script that scrapes data from Stanford’s Computer Science faculty page, processes it, and saves it in a structured format. This script demonstrates the basics of web scraping and data extraction, which can be applied to various other projects that require automated data collection from web pages.
You can find the complete code here: https://github.com/itishosseinian/Extracting-Data-from-Stanford-s-Computer-Science-Program.git