How to scrape an Iranian e-commerce website, Digikala 🧑‍💻

Introduction 🎓

Welcome to this exciting Python tutorial! Today, we’re diving into how to scrape Incredible Offers from the Digikala API, one of Iran’s top online retailers. Whether you’re just starting out or looking to enhance your skills, this step-by-step guide will walk you through each part of the process. 📦💻

If you’d like, you can watch the accompanying video tutorial alongside this blog post.

Importing Libraries and Setting Up the URL 📥

Let’s kick things off by importing the necessary libraries and setting up our base URL. This URL will serve as the foundation for our API requests:

import httpx
import json

base_url = "https://api.digikala.com/v1/incredible-offers/products/?page="

  • 📚 httpx: Handles our HTTP requests.
  • 📚 json: Manages JSON data.
  • 🔗 base_url: The base URL with a placeholder for the page number.
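
Note that httpx is a third-party package (json ships with Python), so install it first if you haven’t already:

pip install httpx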

Fetching Data from the First Page

Before diving into pagination, let’s fetch data from the first page to understand the API’s response structure and find out how many pages of data are available:

response = httpx.get(base_url + "1", timeout=httpx.Timeout(30.0))
data = json.loads(response.text)

total_pages = data['data']['pager']['total_pages']

  • 🌐 httpx.get: Requests data from the API.
  • 🔍 json.loads: Converts JSON response to a Python dictionary.
  • 📊 total_pages: Gets the total number of pages from the response.
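
The snippet above assumes the request succeeds and the pager block is present. If you want to fail fast on HTTP errors, httpx also provides raise_for_status() and a .json() shortcut; here’s a slightly more defensive sketch of the same step:

response = httpx.get(base_url + "1", timeout=httpx.Timeout(30.0))
response.raise_for_status()  # raise immediately on 4xx/5xx responses
data = response.json()       # httpx decodes the JSON body for us

# Fall back to a single page if the pager block is ever missing
total_pages = data.get('data', {}).get('pager', {}).get('total_pages', 1)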

Implementing Pagination

With the total number of pages in hand, we can now loop through each page to collect all available data:

parsed_data = []

for page in range(1, total_pages + 1):  # +1 to include the last page
    url = base_url + str(page)
    print(f"Working on {url}")
    response = httpx.get(url, timeout=httpx.Timeout(30.0))
    data = json.loads(response.text)
    products = data['data']['products']

  • 🔄 Loop: Iterates through each page from 1 to total_pages.
  • 🔗 url: Constructs the URL for the current page.
  • 🖨️ print: Shows the current page URL for tracking.
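
If you plan to run this against all pages regularly, consider being a bit gentler with the API. The sketch below (an optional refinement, not part of the original script) adds a short pause between requests and skips pages whose request fails instead of crashing the whole run:

import time

for page in range(1, total_pages + 1):
    url = base_url + str(page)
    print(f"Working on {url}")
    try:
        response = httpx.get(url, timeout=httpx.Timeout(30.0))
        response.raise_for_status()
    except httpx.HTTPError as exc:
        print(f"Skipping {url}: {exc}")
        continue
    data = response.json()
    products = data['data']['products']
    # ... extract product info here (next section) ...
    time.sleep(0.5)  # small delay between pages; tune to taste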

Extracting Product Information

For each page, we extract the relevant product details and append them to our list. Note the use of pro.get('default_variant') rather than pro['default_variant'], so products without a default variant don’t raise a KeyError:

    for pro in products:
        variant = pro.get('default_variant')  # may be missing or empty
        parsed_data.append({
            'id': pro['id'],
            'title': pro['title_fa'],
            'product_url': pro['url'],
            'images': pro['images'],
            'colors': pro['colors'],
            'org_price': f"{variant['price']['rrp_price']} toman" if variant else 'None',
            'incredible_price': f"{variant['price']['selling_price']} toman" if variant else 'None'
        })

  • 🛒 parsed_data: List for storing product information.
  • 📋 Details: Extracts and formats product ID, title, URL, images, colors, original price, and discounted price.
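
If you’d rather keep the loop body tidy, the extraction logic can be pulled into a small helper. parse_product below is a hypothetical refactor (not in the original script) that assumes the same field names and guards against a missing default_variant with dict.get():

def parse_product(pro: dict) -> dict:
    """Flatten one raw product record into the fields we keep."""
    variant = pro.get('default_variant')  # may be absent or empty
    price = variant['price'] if variant else {}
    return {
        'id': pro['id'],
        'title': pro['title_fa'],
        'product_url': pro['url'],
        'images': pro['images'],
        'colors': pro['colors'],
        'org_price': f"{price['rrp_price']} toman" if variant else 'None',
        'incredible_price': f"{price['selling_price']} toman" if variant else 'None',
    }

# Usage inside the page loop:
# parsed_data.extend(parse_product(pro) for pro in products)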

Saving the Data to a JSON File

Finally, we’ll save the collected data to a JSON file, making it easy to access and analyze later:

filename = 'digikala.json'

with open(filename, 'w', encoding='utf-8') as file:
    json.dump(parsed_data, file, ensure_ascii=False, indent=4)

print("Data has been saved to " + filename)

  • 💾 json.dump: Writes the data to a JSON file; ensure_ascii=False keeps the Persian titles readable instead of escaping them.
  • 📂 filename: Specifies the name of the output file.
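
As a quick sanity check (optional), you can load the file back and count how many products were saved:

with open(filename, encoding='utf-8') as file:
    saved = json.load(file)

print(f"Loaded {len(saved)} products back from {filename}")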

Conclusion

Congratulations! 🎉 You’ve just learned how to scrape data from the Digikala API using Python. Here’s a quick recap:

  • 📚 Imported Libraries
  • 🌐 Fetched Initial Data
  • 🔄 Handled Pagination
  • 🛒 Extracted Product Information
  • 💾 Saved Data

Complete Code

Here’s the full code for you to copy and use:

import httpx
import json

# Base URL for API requests
base_url = "https://api.digikala.com/v1/incredible-offers/products/?page="

# Fetch data from the first page to determine the number of pages
response = httpx.get(base_url + "1", timeout=httpx.Timeout(30.0))
data = json.loads(response.text)
total_pages = data['data']['pager']['total_pages']

# List to store parsed data
parsed_data = []

# Loop through all pages and collect data
for page in range(1, total_pages + 1):  # +1 to include the last page
    url = base_url + str(page)
    print(f"Working on {url}")
    response = httpx.get(url, timeout=httpx.Timeout(30.0))
    data = json.loads(response.text)
    products = data['data']['products']

    for pro in products:
        variant = pro.get('default_variant')  # may be missing or empty
        parsed_data.append({
            'id': pro['id'],
            'title': pro['title_fa'],
            'product_url': pro['url'],
            'images': pro['images'],
            'colors': pro['colors'],
            'org_price': f"{variant['price']['rrp_price']} toman" if variant else 'None',
            'incredible_price': f"{variant['price']['selling_price']} toman" if variant else 'None'
        })

# Save data to a JSON file
filename = 'digikala.json'

with open(filename, 'w', encoding='utf-8') as file:
    json.dump(parsed_data, file, ensure_ascii=False, indent=4)

print("Data has been saved to " + filename)
