Scraping Comic Book Episodes: Automating Image Downloads from Webtoons

As a web scraping freelancer, one of the most common tasks you'll encounter is extracting data from websites and automating repetitive processes. In this blog post, we'll walk through a Python script that scrapes comic book episodes from Webtoons and downloads the associated images.

Setting up the Environment

Before we dive into the script, let's make sure we have the necessary libraries installed. We'll be using pandas, httpx, and BeautifulSoup (bs4), plus asyncio and os from the standard library. You can install the third-party packages using pip:

pip install pandas httpx beautifulsoup4

The Script

The script starts by importing the required libraries and setting up a semaphore to limit the number of concurrent requests made to the website. Let's take a look at the initial part of the script:

import os
import pandas as pd
import httpx
import asyncio
from bs4 import BeautifulSoup as bs


# Allow at most 100 image requests to be in flight at the same time
semaphore = asyncio.Semaphore(100)


# ...

Next, we define an asynchronous function check_and_create_folder that checks if a folder exists and creates it if not. This function will be used later to create the necessary directory structure for saving the downloaded images:

async def check_and_create_folder(folder_path):
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)
        print(f"Folder '{folder_path}' created successfully.")
    else:
        print(f"Folder '{folder_path}' already exists.")

Moving on, we define another asynchronous function, downloadImage, which downloads an image from its URL and saves it to disk. We use the httpx client to make the request asynchronously, sending a referer header and a type=q90 query string along with it:

async def downloadImage(url, filename, client):
    querystring = {"type": "q90"}
    headers = {"referer": "https://www.webtoons.com/"}

    # Limit how many requests run concurrently
    async with semaphore:
        response = await client.get(url, headers=headers, params=querystring)

    if response.status_code == 200:
        with open(f"{filename}.jpg", "wb") as file:
            file.write(response.content)
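On its own, downloadImage can be tried out with a single client. The image URL and output name below are placeholders for illustration, not values taken from the script:

async def demo():
    async with httpx.AsyncClient() as client:
        # Placeholder URL: substitute any direct Webtoons image link
        await downloadImage("https://example.com/placeholder-panel.jpg", "demo", client)

asyncio.run(demo())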

The main functionality of the script is wrapped in the main function. Here's an overview of what it does:

  1. Reads a CSV file named Episodes.csv that contains the URLs of the comic book episodes.

  2. Iterates over the URLs and skips any that don't belong to the Webtoons website.

  3. Extracts the book name and episode number from each URL, and skips the episode if its images folder already exists.

  4. Fetches the episode page and creates the necessary folder structure for saving the images.

  5. Parses the HTML of the episode page and finds the image URLs using BeautifulSoup.

  6. Downloads the images concurrently with httpx.AsyncClient and asyncio.gather.

async def main():
    urls = pd.read_csv("Episodes.csv").to_dict('records')
    for row in urls:
        url = row['urls']
        if "webtoons.com" not in url:
            continue

        # The series name sits at index 5 of the URL split on '/', and the
        # episode number is whatever follows the last '_no=' query parameter
        name = url.split('/')[5]
        episode = url.split('_no=')[-1]
        name_path = f"book/{name}"
        episode_path = f"book/{name}/{episode}"
        images_path = f"{episode_path}/images"

        # Skip episodes that have already been downloaded
        if os.path.exists(images_path):
            continue

        async with httpx.AsyncClient() as client:
            response = await client.get(url)
            if response.status_code != 200:
                continue

            await check_and_create_folder(name_path)
            await check_and_create_folder(episode_path)
            await check_and_create_folder(images_path)

            # Each episode image is an <img class="_images"> tag whose real
            # source lives in the data-url attribute
            image_tags = bs(response.text, "html.parser").find_all('img', class_='_images')

            tasks = []
            for i, image in enumerate(image_tags):
                image_url = image.get('data-url')
                tasks.append(downloadImage(image_url, f"{images_path}/{i}", client))

            # Download all of the episode's images concurrently
            await asyncio.gather(*tasks)


asyncio.run(main())
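For reference, the script expects Episodes.csv to have a single urls column, one episode link per row. The rows below are placeholders that only illustrate the shape of the file, not real episode links:

urls
https://www.webtoons.com/en/fantasy/some-series/episode-1/viewer?title_no=111&episode_no=1
https://www.webtoons.com/en/fantasy/some-series/episode-2/viewer?title_no=111&episode_no=2

Each successfully processed URL produces a folder such as book/some-series/1/images containing 0.jpg, 1.jpg, and so on, one file per image found on the episode page.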

Conclusion

In this blog post, we explored a Python script that automates scraping comic book episodes from Webtoons and downloading the associated images concurrently with asyncio and httpx.

Feel free to modify and enhance the script to suit your specific needs. Happy web scraping!