
Scraping Facebook Profiles with Python & Playwright 📊

Introduction 🎓

Web scraping is an incredibly useful technique for extracting data from websites, and in this article, we’re diving into how to scrape Facebook profile data. Using Python, we’ll leverage two powerful libraries: Playwright and BeautifulSoup.

  • Playwright: A cutting-edge browser automation library that makes it easy to navigate web pages and interact with dynamic content. It’s perfect for scraping pages like Facebook, which use JavaScript to load elements.

  • BeautifulSoup: One of Python’s most popular libraries for parsing HTML. It allows us to extract key pieces of information from the web page structure with ease.

In this tutorial, we’ll scrape several Facebook profiles to collect details such as cover images, likes, followers, and profile photos. By the end, you’ll have a script that can automate this process across multiple profiles and output the data in a structured format like JSON.

Why Scrape Facebook Profiles? 🤔

Facebook profiles often contain valuable information for marketing, research, or analysis. With Playwright and BeautifulSoup, you can easily extract these details, whether you need to analyze social media trends, track competitor activity, or simply automate data collection from Facebook profiles.

⚠️ Note: While scraping can be a powerful tool, it’s important to respect the website’s terms of service and avoid violating any legal restrictions. Always check the terms before scraping.

Let’s jump in and explore how to automate Facebook scraping in Python! 🚀

If you’d like, you can watch the video tutorial alongside this blog post.

🛠️ Step 1: Importing Libraries

				
from playwright.sync_api import sync_playwright
import time
from bs4 import BeautifulSoup
import json, re

  • 💡 Explanation:

    First things first, we need to import the necessary Python libraries:

    • Playwright: A powerful library to automate web browsers like Chromium. This is great for scraping because many modern sites (like Facebook) are dynamic and load elements using JavaScript. Playwright helps us load these pages just like a real browser! 🌐
    • time: We’ll use this for adding small pauses in our script to ensure pages load fully before we start scraping data.
    • BeautifulSoup: This is our HTML parser. Once the page is loaded, BeautifulSoup will help us easily navigate the HTML structure and find the specific data we need. 🥣
    • json: We’ll use this to store the scraped data in JSON format, which is ideal for structured storage and later use. 💾
    • re: This module lets us use regular expressions (patterns) to find specific text within the HTML. It’s useful for finding numbers like “followers” or “likes” that follow certain word patterns (see the short standalone demo right after this list). 🔍
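
To see how those trailing-word patterns behave, here’s a quick standalone demo of the kind of matching we’ll rely on later. The sample strings are made up purely for illustration:

import re

# Hypothetical link texts, just for illustration
samples = ["5.1M followers", "2,431 likes", "Photos"]

followers_pattern = re.compile(r" followers$")  # text must end with " followers"
for text in samples:
    if followers_pattern.search(text):
        print(f"matched: {text}")  # -> matched: 5.1M followers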

🌐 Step 2: Defining the URLs to Scrape

				
urls = [
    "https://www.facebook.com/adidas/",
    "https://www.facebook.com/Cristiano",
    "https://www.facebook.com/nasaearth"
]

  • 💡 Explanation:

    Here, we’re defining a list of URLs that we want to scrape. In this case, we are scraping the Facebook profiles for Adidas, Cristiano Ronaldo, and NASA Earth. 🚀 This list can be expanded or modified to scrape other profiles as well. We’re storing them in a Python list so that we can loop through them later in our script.

🚀 Step 3: Launching the Playwright Browser

We’re launching a headless browser using Playwright, which means it runs in the background without opening a visible window. Let’s break this down:

  • sync_playwright(): This starts Playwright and makes sure that once the scraping is done, everything is cleaned up properly.
  • p.chromium.launch(headless=True): This launches a Chromium browser in headless mode (invisible). If you set headless=False, the browser would pop up visibly while scraping! 👀
  • context = browser.new_context(): This creates a new browser context, like opening a new private tab in your browser. Each context is isolated, so cookies, cache, and other session details don’t carry over between different tabs. 🔒
  • all_data = []: Before the browser starts, we also create an empty list that will collect one dictionary of results per profile; it gets filled in Step 8 and appears at the top of the complete code below.
				
all_data = []  # will hold one dictionary of results per profile

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
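
If you want to watch the browser work while debugging, or present a more realistic browser fingerprint, Playwright’s launch and context options can help. Here’s a minimal sketch; the viewport size and user-agent string are just example values, not requirements:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # headless=False opens a visible window, handy while debugging selectors
    browser = p.chromium.launch(headless=False)
    context = browser.new_context(
        viewport={"width": 1280, "height": 800},  # example viewport size
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # example UA string
    )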

				
			

🔄 Step 4: Looping Through Each URL

Now we’re looping through each URL in our list to scrape data from each profile:

  • page = context.new_page(): Opens a new page (or tab) in our browser.
  • print(f"Going to page: {url}"): This prints a message so we know which URL is being scraped.
  • page.goto(url): Navigates to the given Facebook profile page.
  • time.sleep(3): We wait for 3 seconds to allow the page to fully load. Without this, some elements might not load in time, causing our scraping to fail! 🕒
				
for url in urls:
    page = context.new_page()
    print(f"Going to page: {url}")
    page.goto(url)
    time.sleep(3)
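
A fixed time.sleep(3) works, but it’s fragile on slow connections and wastes time on fast ones. Playwright can instead wait for the page, or for a specific element, to be ready. A sketch of that idea, assuming the cover-photo attribute used later in this tutorial is still present in Facebook’s markup:

# Wait until network activity quiets down instead of sleeping a fixed time
page.goto(url, wait_until="networkidle")

# Or wait for a specific element we know we'll need (the selector is an
# assumption based on the attribute used in Step 6)
page.wait_for_selector("img[data-imgperflogname='profileCoverPhoto']", timeout=10000)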

				
			

🍜 Step 5: Parsing the Page with BeautifulSoup

Once the page is loaded, we pass its HTML content into BeautifulSoup for parsing. The html.parser is a built-in parser that works well with BeautifulSoup. This allows us to navigate and search through the HTML tree to find the data we need. 🕸️

				
soup = BeautifulSoup(page.content(), 'html.parser')
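
html.parser ships with Python, so there’s nothing extra to install. If you have the lxml package available, you can swap it in as the parser for a small speed boost; this is purely optional:

# Requires: pip install lxml
soup = BeautifulSoup(page.content(), 'lxml')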

				
			

📸 Step 6: Scraping the Profile Data

				
cover_image = soup.find('img', {'data-imgperflogname': 'profileCoverPhoto'})['src']
logo_image = soup.select_one('g image')['xlink:href']

Here, we’re extracting two key images from the Facebook profile:

  • cover_image: This finds the profile’s cover image by looking for the HTML tag <img> that has the attribute data-imgperflogname="profileCoverPhoto". The src attribute contains the image URL. 📷
  • logo_image: This grabs the logo image (usually the profile picture) using an SVG selector. The xlink:href attribute gives us the link to the image (a more defensive version of both lookups is sketched right after this list). 🎨
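
Both lines above index straight into the result with ['src'] and ['xlink:href'], which raises a TypeError if Facebook changes its markup and the tag isn’t found. A more defensive variant, offered as a sketch rather than a guarantee about Facebook’s HTML:

cover_tag = soup.find('img', {'data-imgperflogname': 'profileCoverPhoto'})
cover_image = cover_tag['src'] if cover_tag else "none"

logo_tag = soup.select_one('g image')
logo_image = logo_tag.get('xlink:href', "none") if logo_tag else "none"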

Next, we extract the numbers of likes, followers, and following:

				
num_like_tag = soup.find('a', string=re.compile(r' likes$'))
num_like = num_like_tag.get_text() if num_like_tag else "none"

num_follower_tag = soup.find('a', string=re.compile(r' followers$'))
num_follower = num_follower_tag.get_text() if num_follower_tag else "none"

num_following_tag = soup.find('a', string=re.compile(r' following$'))
num_following = num_following_tag.get_text() if num_following_tag else "none"

  • num_like: Finds the number of likes by searching for a link (<a>) whose text ends with the word “likes”. If the link exists, we extract its text; otherwise we fall back to “none”.
  • num_follower: Similarly, we find the number of followers by searching for a link whose text ends with “followers”.
  • num_following: This extracts the number of people the profile is following by searching for a link ending with “following”. (If you need these display strings as plain numbers, a small helper is sketched right after this list.)
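
The values returned here are display strings like “5.1M followers” rather than numbers. If you later need them as integers, a small helper along these lines can convert them; the K/M abbreviation rules are an assumption about how Facebook formats its counts, and the helper reuses the re module imported in Step 1:

def parse_count(text):
    """Convert strings like '5.1M followers' or '2,431 likes' to an int."""
    match = re.search(r'([\d.,]+)\s*([KM]?)', text)
    if not match:
        return None
    number = float(match.group(1).replace(',', ''))
    multiplier = {'K': 1_000, 'M': 1_000_000}.get(match.group(2), 1)
    return int(number * multiplier)

print(parse_count("5.1M followers"))  # 5100000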

🖼️ Step 7: Scraping Additional Profile Photos

This part collects additional images posted on the profile:

  • photo: Selects the second <div> with the class name 'x1yztbdb' (that’s what the [1] index does); at the time of writing, this container held the post photos.
  • photos: We extract the URLs of the photos from the <img> tags inside that container (a more defensive version is sketched after the code below). 📸
				
photo = soup.find_all('div', class_='x1yztbdb')[1]
photos = [img['src'] for img in photo.select('img')]
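
Class names like 'x1yztbdb' are auto-generated by Facebook and change over time, and find_all(...)[1] raises an IndexError if fewer than two matching containers exist. A slightly more defensive version, sketched under those assumptions:

containers = soup.find_all('div', class_='x1yztbdb')
photos = []
if len(containers) > 1:
    # The second matching container held the post photos at the time of writing
    photos = [img['src'] for img in containers[1].select('img') if img.get('src')]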

				
			

💾 Step 8: Storing the Data

Now, we gather all the scraped information into a dictionary called data:

  • We store the URL, images, numbers of likes, followers, and following, the profile description (detail, taken from the page’s <meta name="description"> tag), and the post photos in this dictionary.
  • all_data.append(data): We then append this dictionary to the all_data list (created back in Step 3), which will store the data for all the profiles we scrape.
				
detail = soup.find('meta', {'name': 'description'})['content'] if soup.find('meta', {'name': 'description'}) else "none"

data = {
    'url': url,
    'cover_image': cover_image,
    'logo_image': logo_image,
    'num_like': num_like,
    'num_follower': num_follower,
    'num_following': num_following,
    'detail': detail,
    'post_photos': photos
}
all_data.append(data)

🗂️ Step 9: Saving the Data to a JSON File

After scraping all the profiles, we save the data into a JSON file:

  • json.dump(all_data, f, ensure_ascii=False, indent=4): This writes the all_data list to a JSON file called out_all_urls.json. We ensure that non-ASCII characters (like emoji or special symbols) are properly encoded, and the indent=4 option makes the file human-readable. 🗂️
				
with open('out_all_urls.json', 'w', encoding='utf-8') as f:
    json.dump(all_data, f, ensure_ascii=False, indent=4)
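
Once the file exists, you can load it back in a later script or notebook for analysis. A quick usage example, reading the same file we just wrote:

import json

with open('out_all_urls.json', 'r', encoding='utf-8') as f:
    profiles = json.load(f)

for profile in profiles:
    print(profile['url'], profile['num_follower'])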

				
			

✅ Step 10: Wrapping Up

				
					print("Scraping completed for all URLs.")
browser.close()

				
			

We print a message indicating that the scraping process is complete and then close the browser:

  • print(): Outputs a success message so you know the process is done.
  • browser.close(): This closes the Chromium browser and frees up system resources. It’s always a good practice to close your browser after automation is complete. 🖥️

Complete Code

Here’s the full code for you to copy and use:

				
from playwright.sync_api import sync_playwright
import time
from bs4 import BeautifulSoup
import json, re

urls = [
    "https://www.facebook.com/adidas/",
    "https://www.facebook.com/Cristiano",
    "https://www.facebook.com/nasaearth"
]

all_data = []

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()

    for url in urls:
        page = context.new_page()
        print(f"Going to page: {url}")
        page.goto(url)
        time.sleep(3)  

        soup = BeautifulSoup(page.content(), 'html.parser')

        cover_image = soup.find('img', {'data-imgperflogname': 'profileCoverPhoto'})['src'] 
        logo_image = soup.select_one('g image')['xlink:href'] 

        num_like_tag = soup.find('a', string=re.compile(r' likes$'))
        num_like = num_like_tag.get_text() if num_like_tag else "none"

        num_follower_tag = soup.find('a', string=re.compile(r' followers$'))
        num_follower = num_follower_tag.get_text() if num_follower_tag else "none"

        num_following_tag  = soup.find('a', string=re.compile(r' following$'))
        num_following = num_following_tag.get_text() if num_following_tag else "none"

        photo = soup.find_all('div', class_='x1yztbdb')[1]
        photos = [img['src'] for img in photo.select('img')]
        
        detail = soup.find('meta', {'name': 'description'})['content'] if soup.find('meta', {'name': 'description'}) else "none"

        data = {
            'url': url,
            'cover_image': cover_image,
            'logo_image': logo_image,
            'num_like': num_like,
            'num_follower': num_follower,
            'num_following': num_following,
            'detail': detail,
            'post_photos': photos
        }

        all_data.append(data)

        print(f"Finished scraping {url}")
        page.close()

    browser.close()

with open('out_all_urls.json', 'w', encoding='utf-8') as f:
    json.dump(all_data, f, ensure_ascii=False, indent=4)

print("Scraping completed for all URLs.")

				
			

✨ Want to dive deeper into web scraping? Check out my free web scraping course here and unlock the secrets of extracting data from the web like a pro! 🌐
