Web scraping is an incredibly useful technique for extracting data from websites, and in this article, we’re diving into how to scrape Facebook profile data. Using Python, we’ll leverage two powerful libraries: Playwright and BeautifulSoup.
Playwright: A cutting-edge browser automation library that makes it easy to navigate web pages and interact with dynamic content. It’s perfect for scraping pages like Facebook, which use JavaScript to load elements.
BeautifulSoup: One of Python’s most popular libraries for parsing HTML. It allows us to extract key pieces of information from the web page structure with ease.
In this tutorial, we’ll scrape several Facebook profiles to collect details such as cover images, likes, followers, and profile photos. By the end, you’ll have a script that can automate this process across multiple profiles and output the data in a structured format like JSON.
Facebook profiles often contain valuable information for marketing, research, or analysis. With Playwright and BeautifulSoup, you can easily extract these details, whether you need to analyze social media trends, track competitor activity, or simply automate data collection from Facebook profiles.
⚠️ Note: While scraping can be a powerful tool, it’s important to respect the website’s terms of service and avoid violating any legal restrictions. Always check the terms before scraping.
Let’s jump in and explore how to automate Facebook scraping in Python! 🚀
If you’d like, you can watch the video tutorial along with this blog.
First things first, we need to import the necessary Python libraries:

from playwright.sync_api import sync_playwright
import time
from bs4 import BeautifulSoup
import json, re
urls = [
    "https://www.facebook.com/adidas/",
    "https://www.facebook.com/Cristiano",
    "https://www.facebook.com/nasaearth"
]
💡 Explanation:
Here, we’re defining a list of URLs that we want to scrape. In this case, we are scraping the Facebook profiles for Adidas, Cristiano Ronaldo, and NASA Earth. 🚀 This list can be expanded or modified to scrape other profiles as well. We’re storing them in a Python list so that we can loop through them later in our script.
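As a side note, each profile's short name can be recovered from its URL with the standard library, which is handy for log messages or per-profile filenames. This is a small optional sketch and is not used in the scraper itself:

```python
from urllib.parse import urlparse

urls = [
    "https://www.facebook.com/adidas/",
    "https://www.facebook.com/Cristiano",
    "https://www.facebook.com/nasaearth",
]

# The profile name is the URL path with the surrounding slashes removed.
names = [urlparse(u).path.strip('/') for u in urls]
print(names)  # ['adidas', 'Cristiano', 'nasaearth']
```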
We're launching a headless browser using Playwright, which means it runs in the background without opening a visible window. If we set headless=False instead, the browser would pop up visibly while scraping! 👀

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
Now we’re looping through each URL in our list to scrape data from each profile:
for url in urls:
    page = context.new_page()
    print(f"Going to page: {url}")
    page.goto(url)
    time.sleep(3)
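A fixed time.sleep(3) keeps the example simple, but it wastes time on fast pages and can be too short on slow ones. In practice, Playwright's own page.wait_for_selector(...) is usually the better tool. As a library-agnostic illustration of the idea, here is a stdlib-only polling helper (a hypothetical sketch, not part of the script):

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    A generic stand-in for a fixed sleep: it returns as soon as the page
    (or anything else) is ready, instead of always waiting the full delay.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")

# Toy usage: the condition becomes true on the third poll.
calls = {'n': 0}
def ready():
    calls['n'] += 1
    return calls['n'] >= 3

print(wait_until(ready, timeout=5, interval=0.01))  # True
```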
Once the page is loaded, we pass its HTML content into BeautifulSoup for parsing. The html.parser option selects Python's built-in parser, which works well with BeautifulSoup. This allows us to navigate and search through the HTML tree to find the data we need. 🕸️

soup = BeautifulSoup(page.content(), 'html.parser')
cover_image = soup.find('img', {'data-imgperflogname': 'profileCoverPhoto'})['src']
logo_image = soup.select_one('g image')['xlink:href']
Here, we're extracting two key images from the Facebook profile:

Cover image: we find the <img> tag that has the attribute data-imgperflogname="profileCoverPhoto". Its src attribute contains the image URL. 📷
Logo image: we select the <image> element inside an SVG <g> tag; its xlink:href attribute gives us the link to the image. 🎨

Next, we extract the numbers of likes, followers, and following:
num_like_tag = soup.find('a', string=re.compile(r' likes$'))
num_like = num_like_tag.get_text() if num_like_tag else "none"
num_follower_tag = soup.find('a', string=re.compile(r' followers$'))
num_follower = num_follower_tag.get_text() if num_follower_tag else "none"
num_following_tag = soup.find('a', string=re.compile(r' following$'))
num_following = num_following_tag.get_text() if num_following_tag else "none"
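Note that these lookups return display strings such as "3.9M likes". If you need plain integers for analysis, a stdlib-only converter along these lines can help (a hypothetical helper, not part of the original script; Facebook's exact formatting may vary):

```python
import re

def parse_count(text):
    """Convert a Facebook-style count like '3.9M likes' or '12K followers'
    into an integer. Returns None if no number is found."""
    match = re.search(r'([\d.,]+)\s*([KMB]?)', text, re.IGNORECASE)
    if not match:
        return None
    number = float(match.group(1).replace(',', ''))
    multiplier = {'': 1, 'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}
    return int(number * multiplier[match.group(2).upper()])

print(parse_count("3.9M likes"))     # 3900000
print(parse_count("12K followers"))  # 12000
print(parse_count("1,234 likes"))    # 1234
```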
💡 Explanation:
We search for a link tag (<a>) whose text ends with the word "likes" (and likewise "followers" and "following"). If the tag exists, we extract its text; otherwise we return "none".

This part collects additional images posted on the profile:
We take the second <div> with the class 'x1yztbdb' (one of Facebook's generated container classes, which currently holds the photo section), then grab the src of every <img> tag inside that container. 📸

photo = soup.find_all('div', class_='x1yztbdb')[1]
photos = [img['src'] for img in photo.select('img')]
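One caveat: lookups like soup.find(...)['src'] and select_one(...)['xlink:href'] raise an exception whenever the tag isn't found, which happens as soon as Facebook tweaks its markup or serves a login wall. A tiny defensive helper can keep the scraper running (a hypothetical addition, not part of the original script; the toy usage below uses a plain dict standing in for a BeautifulSoup tag):

```python
def safe_attr(tag, attr, default="none"):
    """Return tag[attr] if the tag exists and has the attribute, else default.

    Works with BeautifulSoup tags (and anything else with a .get method),
    so a missing element degrades to "none" instead of crashing the loop.
    """
    if tag is None:
        return default
    return tag.get(attr, default)

# Toy usage with a plain dict standing in for a BeautifulSoup tag:
print(safe_attr({'src': 'cover.jpg'}, 'src'))  # cover.jpg
print(safe_attr(None, 'src'))                  # none
```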
Now, we gather all the scraped information into a dictionary called data and append it to the all_data list, which will store the data for all the profiles we scrape. We also pull the profile's meta-description tag into detail here. (Remember to initialize all_data = [] before the loop, as in the full script below.)

detail = soup.find('meta', {'name': 'description'})['content'] if soup.find('meta', {'name': 'description'}) else "none"

data = {
    'url': url,
    'cover_image': cover_image,
    'logo_image': logo_image,
    'num_like': num_like,
    'num_follower': num_follower,
    'num_following': num_following,
    'detail': detail,
    'post_photos': photos
}
all_data.append(data)
After scraping all the profiles, we save the all_data list to a JSON file called out_all_urls.json. Passing ensure_ascii=False keeps non-ASCII characters (like emoji or special symbols) intact rather than escaping them, and the indent=4 option makes the file human-readable. 🗂️ Finally, we print a message indicating that the scraping process is complete and close the browser:

with open('out_all_urls.json', 'w', encoding='utf-8') as f:
    json.dump(all_data, f, ensure_ascii=False, indent=4)

print("Scraping completed for all URLs.")
browser.close()
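To see what ensure_ascii=False actually does, compare it with the default behavior on a toy record containing an emoji (made-up data, not real scraped output):

```python
import json

record = {'detail': 'Impossible is Nothing 🚀'}

# ensure_ascii=False writes the emoji as-is...
print(json.dumps(record, ensure_ascii=False))

# ...while the default escapes it into a \uXXXX surrogate pair.
print(json.dumps(record))
```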
Here’s the full code for you to copy and use:
from playwright.sync_api import sync_playwright
import time
from bs4 import BeautifulSoup
import json, re

urls = [
    "https://www.facebook.com/adidas/",
    "https://www.facebook.com/Cristiano",
    "https://www.facebook.com/nasaearth"
]

all_data = []

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()

    for url in urls:
        page = context.new_page()
        print(f"Going to page: {url}")
        page.goto(url)
        time.sleep(3)

        soup = BeautifulSoup(page.content(), 'html.parser')

        cover_image = soup.find('img', {'data-imgperflogname': 'profileCoverPhoto'})['src']
        logo_image = soup.select_one('g image')['xlink:href']

        num_like_tag = soup.find('a', string=re.compile(r' likes$'))
        num_like = num_like_tag.get_text() if num_like_tag else "none"
        num_follower_tag = soup.find('a', string=re.compile(r' followers$'))
        num_follower = num_follower_tag.get_text() if num_follower_tag else "none"
        num_following_tag = soup.find('a', string=re.compile(r' following$'))
        num_following = num_following_tag.get_text() if num_following_tag else "none"

        photo = soup.find_all('div', class_='x1yztbdb')[1]
        photos = [img['src'] for img in photo.select('img')]

        detail = soup.find('meta', {'name': 'description'})['content'] if soup.find('meta', {'name': 'description'}) else "none"

        data = {
            'url': url,
            'cover_image': cover_image,
            'logo_image': logo_image,
            'num_like': num_like,
            'num_follower': num_follower,
            'num_following': num_following,
            'detail': detail,
            'post_photos': photos
        }
        all_data.append(data)
        print(f"Finished scraping {url}")
        page.close()

    browser.close()

with open('out_all_urls.json', 'w', encoding='utf-8') as f:
    json.dump(all_data, f, ensure_ascii=False, indent=4)

print("Scraping completed for all URLs.")
✨ Want to dive deeper into web scraping? Check out my free web scraping course and unlock the secrets of extracting data from the web like a pro! 🌐