
Extracting Data from Stanford's Computer Science Program

Introduction

This blog post explains how to extract data from Stanford’s Computer Science program website, specifically focusing on retrieving faculty information. We use Python libraries such as httpx for making HTTP requests and BeautifulSoup for parsing HTML content. The extracted data is saved in a JSON file for easy access and further analysis.

If you’d like, you can watch the video tutorial along with this blog.

Imports

The first step involves importing the necessary libraries:

				
import httpx
from bs4 import BeautifulSoup
import json

httpx: This is an HTTP client library for making requests to web servers. We use it to fetch the HTML content of the Stanford Computer Science faculty page.

BeautifulSoup: This library is used to parse HTML and extract specific elements from the webpage.

json: This standard Python library is used to save the extracted data in JSON format.

URL and HTTPX Request

Next, we define the URL of the webpage we want to scrape:

				
url = "https://www.cs.stanford.edu/people-cs/faculty-name"

We then make an HTTP GET request to this URL using httpx.get:

				
response = httpx.get(url, timeout=httpx.Timeout(30.0))

The timeout parameter ensures that the request does not hang indefinitely. Passing a single value to httpx.Timeout sets the connect, read, write, and pool timeouts to 30 seconds each, so it limits each phase of the request rather than its total duration.

Parsing HTML with BeautifulSoup

Once we have the HTML content from the response, we use BeautifulSoup to parse it:

				
soup = BeautifulSoup(response.text, 'html.parser')

Here, we pass the raw HTML content (response.text) to BeautifulSoup and specify that we want to use the built-in HTML parser. This allows us to navigate and search the HTML structure easily.
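To get a feel for what navigating and searching looks like, here is a small sketch that runs on an inline snippet instead of the live page. The markup and the names in it are made up for illustration and are not copied from the Stanford site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to demonstrate the search methods.
html = "<ul><li><h3> Ada Example </h3></li><li><h3>Bob Sample</h3></li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_heading = soup.find("h3")     # first matching tag, or None if absent
all_items = soup.find_all("li")     # list of every matching tag
# get_text(strip=True) removes the surrounding whitespace from the text.
names = [h3.get_text(strip=True) for h3 in soup.find_all("h3")]

print(len(all_items), names)
```

Because find returns None when nothing matches, the script always checks the result before calling get_text on it.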

Extracting Data

We then start the data extraction process:

				
professors = []

# Walk over every <li> element on the page.
for li in soup.find_all('li'):
    # The name lives in an <h3>; skip list items that do not have one.
    name_tag = li.find('h3')
    if name_tag:
        name = name_tag.get_text(strip=True)
    else:
        continue

    # The short title is optional, so fall back to "N/A".
    title_tag = li.find('div', class_='views-field views-field-su-person-short-title')
    if title_tag:
        title = title_tag.get_text(strip=True)
    else:
        title = "N/A"

    # Take the first link in the item and prepend the site's base URL.
    link_tag = li.find('a', href=True)
    if link_tag:
        profile_link = "https://www.cs.stanford.edu" + link_tag['href']
    else:
        profile_link = "N/A"

    professors.append({
        'name': name,
        'title': title,
        'profile_link': profile_link
    })

We initialize an empty list called professors to store the extracted information. The for loop iterates over each <li> (list item) element found on the page. For each <li> element:

  • We attempt to find an <h3> tag, which contains the professor’s name. If found, we extract the text; if not, we skip that list item, since it is not a faculty entry.
  • Next, we look for a <div> with the class views-field views-field-su-person-short-title to retrieve the professor’s title. If it’s not found, we assign “N/A”.
  • We also search for the first <a> tag with an href attribute, which contains the link to the professor’s profile. If present, we prepend the base URL to create the full link; otherwise we assign “N/A”.

Finally, we append a dictionary containing the professor’s name, title, and profile link to the professors list.
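The steps above can also be wrapped in a function, which makes them easy to try on a small sample before running against the live page. This is a sketch under two assumptions: it swaps string concatenation for urljoin from the standard library, so an href that is already absolute is left unchanged, and it matches on the single class views-field-su-person-short-title, so the lookup still works if the order of the classes changes. The function name and the sample markup are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.cs.stanford.edu"


def extract_professors(html: str, base_url: str = BASE_URL) -> list[dict]:
    """Hypothetical helper: return one dict per <li> that contains an <h3>."""
    soup = BeautifulSoup(html, "html.parser")
    professors = []
    for li in soup.find_all("li"):
        name_tag = li.find("h3")
        if name_tag is None:
            continue  # not a faculty entry
        # A single class name matches tags that carry it among other classes.
        title_tag = li.find("div", class_="views-field-su-person-short-title")
        link_tag = li.find("a", href=True)
        professors.append({
            "name": name_tag.get_text(strip=True),
            "title": title_tag.get_text(strip=True) if title_tag else "N/A",
            # urljoin resolves relative paths and keeps absolute URLs as-is.
            "profile_link": urljoin(base_url, link_tag["href"]) if link_tag else "N/A",
        })
    return professors


# Made-up markup that imitates the structure described in the post.
SAMPLE_HTML = """
<ul>
  <li>
    <h3><a href="/people/ada-example">Ada Example</a></h3>
    <div class="views-field views-field-su-person-short-title">Professor</div>
  </li>
  <li><a href="/about">About</a></li>
</ul>
"""

print(extract_professors(SAMPLE_HTML))
```

The second list item has no <h3>, so only one entry is returned for the sample.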

Saving Data to JSON

After extracting the data, we save it to a JSON file:

				
filename = 'professors.json'

with open(filename, 'w', encoding='utf-8') as file:
    json.dump(professors, file, ensure_ascii=False, indent=4)

We specify the filename professors.json and open it in write mode with UTF-8 encoding. The json.dump function is used to write the list of professor dictionaries to the file. We set ensure_ascii=False so that non-ASCII characters are written as-is instead of as \uXXXX escape sequences, and indent=4 to make the JSON file more readable.
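As a quick illustration of what ensure_ascii changes, and of how the file can be read back to confirm it is valid JSON, here is a short sketch. The record is made up, and the file is written to a temporary directory so nothing is left behind.

```python
import json
import tempfile
from pathlib import Path

# Made-up record with a non-ASCII character in the name.
professors = [{"name": "José Example", "title": "Professor", "profile_link": "N/A"}]

escaped = json.dumps("José")                       # default: ASCII-only output
readable = json.dumps("José", ensure_ascii=False)  # keeps the character as-is
print(escaped, readable)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "professors.json"
    with open(path, "w", encoding="utf-8") as file:
        json.dump(professors, file, ensure_ascii=False, indent=4)
    # Read the file back to confirm that the round trip preserves the data.
    with open(path, encoding="utf-8") as file:
        loaded = json.load(file)

print(loaded == professors)
```

Both forms load back to the same string; the setting only affects how the file looks when opened in an editor.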

Finally, the script prints a confirmation message indicating that the data has been successfully saved:

				
print(f"Data has been saved to {filename}")

Conclusion

In this blog post, we’ve walked through a Python script that scrapes data from Stanford’s Computer Science faculty page, processes it, and saves it in a structured format. This script demonstrates the basics of web scraping and data extraction, which can be applied to various other projects that require automated data collection from web pages.

You can find the complete code here: https://github.com/itishosseinian/Extracting-Data-from-Stanford-s-Computer-Science-Program.git
