Extremely Fast Python Web Scraping

Haris Bin Nasir

A web scraper is a tool that extracts structured data from a website. With BeautifulSoup, requests, and other Python modules, you can build a working web scraper, but these libraries are not particularly fast. In this post, I’ll show you how to build a much faster web scraping tool in Python.

Don’t use BeautifulSoup4

BeautifulSoup4 is pleasant and easy to use, but it is slow. It remains slow even if you use an external parser like lxml for HTML parsing or cchardet to detect the encoding.

Use selectolax instead of BeautifulSoup4 for HTML parsing

selectolax is a Python binding to the Modest and Lexbor HTML engines.

To install selectolax with pip:

pip install selectolax

The usage of selectolax is similar to BeautifulSoup4:

from selectolax.parser import HTMLParser

html = """
<body>
    <h1>Welcome to selectolax tutorial</h1>
    <div id="text">
        <p class='p3'>Lorem ipsum</p>
        <p class='p3'>Lorem ipsum 2</p>
    </div>
    <div>
        <p id='stext'>Lorem ipsum dolor sit amet, ea quo modus meliore platonem.</p>
    </div>
</body>
"""
# Parse the document once, then query it
parser = HTMLParser(html)

# Select all elements with class 'p3'
parser.css('p.p3')

# Select the first match
parser.css_first('p.p3')

# Iterate over all nodes on the current level
for node in parser.css('div'):
    for cnode in node.iter():
        print(cnode.tag, cnode.html)
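
Nodes expose text and attributes directly, which is usually what you are after when scraping. A short sketch continuing from the parser above:

# Extract text from the first matching node (deep=True includes children)
node = parser.css_first('p#stext')
if node is not None:
    print(node.text(deep=True))

# Attributes come back as a plain dict
for p in parser.css('p.p3'):
    print(p.attributes)  # {'class': 'p3'}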

For more information, see the selectolax walkthrough tutorial.
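
If you want to check the speed difference on your own pages, you can time the two parsers side by side. A minimal sketch, assuming beautifulsoup4, lxml, and selectolax are installed, and a local sample.html file (hypothetical) to parse:

import timeit

setup = """
from bs4 import BeautifulSoup
from selectolax.parser import HTMLParser
html = open('sample.html', encoding='utf-8').read()
"""

# Parse the same document 100 times with each library
bs4_time = timeit.timeit("BeautifulSoup(html, 'lxml')", setup=setup, number=100)
slax_time = timeit.timeit("HTMLParser(html)", setup=setup, number=100)
print(f"BeautifulSoup4 + lxml: {bs4_time:.3f}s")
print(f"selectolax: {slax_time:.3f}s")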

Use httpx instead of requests

Python requests is a human-friendly HTTP client. It is simple to use, but it is not fast, and it only supports synchronous requests.

httpx is a full-featured HTTP client for Python 3 that supports HTTP/1.1 and HTTP/2 and ships with both sync and async APIs. It uses a standard synchronous API by default, but you can switch to the async client when you need it. To install httpx with pip:

pip install httpx
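
Since httpx mirrors the requests API, the synchronous client is a near drop-in replacement:

import httpx

# Synchronous usage, just like requests
response = httpx.get('https://httpbin.org/get')
print(response.status_code)
print(response.json())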

For scraping many pages, however, the async client is where the speed comes from. It exposes the same interface:

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://httpbin.org/get')
        print(response.status_code)
        print(response.json())

asyncio.run(main())
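
The payoff of the async client is concurrency: many requests can be in flight at once instead of one after another. A minimal sketch, using hypothetical example URLs:

import asyncio
import httpx

async def fetch(client, url):
    # One GET request; error handling is omitted for brevity
    response = await client.get(url)
    return response.text

async def main():
    # Hypothetical example URLs
    urls = [f'https://httpbin.org/get?page={i}' for i in range(10)]
    async with httpx.AsyncClient() as client:
        # Schedule all requests concurrently and wait for them all
        pages = await asyncio.gather(*(fetch(client, url) for url in urls))
    print(f'Fetched {len(pages)} pages')

asyncio.run(main())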

For examples and usage, see the httpx home page.

Use aiofiles for file IO

aiofiles is a Python package for asyncio-based file I/O. It provides a high-level API for file manipulation. To install aiofiles with pip:

pip install aiofiles

Basic usage:

import asyncio
import aiofiles

async def main():
    # Write, then read back, without blocking the event loop
    async with aiofiles.open('test.txt', 'w') as f:
        await f.write('Hello world!')

    async with aiofiles.open('test.txt', 'r') as f:
        print(await f.read())

asyncio.run(main())

For more information, see the aiofiles repository.
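
Putting the three pieces together gives the shape of a fast scraper: httpx fetches pages concurrently, selectolax parses them, and aiofiles writes the results without blocking the event loop. A minimal sketch, with hypothetical URLs and a hypothetical h1 selector you would adapt to your target site:

import asyncio
import aiofiles
import httpx
from selectolax.parser import HTMLParser

async def scrape(client, url):
    # Fetch a page and pull out its first <h1> (hypothetical target data)
    response = await client.get(url)
    node = HTMLParser(response.text).css_first('h1')
    return node.text() if node is not None else ''

async def main():
    # Hypothetical example URLs
    urls = ['https://httpbin.org/html' for _ in range(10)]
    async with httpx.AsyncClient() as client:
        headings = await asyncio.gather(*(scrape(client, url) for url in urls))
    # Write all results in one non-blocking pass
    async with aiofiles.open('headings.txt', 'w') as f:
        await f.write('\n'.join(headings))

asyncio.run(main())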

Conclusion

Basic web scraping in Python is simple, but it can be slow. When you search for something like “fast web scraping in Python,” multiprocessing looks like the easiest option, but it only gets you so far. Hopefully, the solutions above will help you build a faster and more efficient Python web scraper.

Happy Coding…!!!
