A web scraper is a tool that extracts structured data from a website. Using BeautifulSoup, requests, and other Python modules, you can build a working web scraper, but these tools are often not fast enough. In this post, I’ll show you how to build a super fast web scraping tool in Python.
Don’t use BeautifulSoup4
BeautifulSoup4 is friendly and easy to use, but it is slow. It remains slow even if you use an external extractor such as lxml for HTML parsing or cchardet to detect the encoding.
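For reference, a typical BeautifulSoup4 setup looks like the sketch below. It uses the stdlib `html.parser` backend so it runs anywhere; you would swap in `"lxml"` for the faster external extractor mentioned above, but either way the parsing itself stays slower than a native engine.

```python
from bs4 import BeautifulSoup

html = "<div><p class='p3'>Lorem ipsum</p><p class='p3'>Lorem ipsum 2</p></div>"

# "html.parser" is the pure-Python stdlib backend; pass "lxml" instead
# to use the external C extractor if it is installed.
soup = BeautifulSoup(html, "html.parser")
texts = [p.get_text() for p in soup.select("p.p3")]
print(texts)  # ['Lorem ipsum', 'Lorem ipsum 2']
```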
Use selectolax instead of BeautifulSoup4 for HTML parsing
selectolax is a Python binding to the Modest and Lexbor engines. To install selectolax with pip:
pip install selectolax
The usage of selectolax is similar to that of BeautifulSoup4.
from selectolax.parser import HTMLParser

html = """
<body>
    <h1>Welcome to selectolax tutorial</h1>
    <div id="text">
        <p class='p3'>Lorem ipsum</p>
        <p class='p3'>Lorem ipsum 2</p>
    </div>
    <div>
        <p id='stext'>Lorem ipsum dolor sit amet, ea quo modus meliore platonem.</p>
    </div>
</body>
"""

parser = HTMLParser(html)

# Select all elements with class 'p3'
parser.css('p.p3')

# Select the first match
parser.css_first('p.p3')

# Iterate over all nodes on the current level
for node in parser.css('div'):
    for cnode in node.iter():
        print(cnode.tag, cnode.html)
For more information, please visit the selectolax walkthrough tutorial.
Use httpx instead of requests
Python requests is a human-friendly HTTP client. It is simple to use, but it is not fast, and it only supports synchronous requests.
httpx is a full-featured HTTP client for Python 3 that supports both HTTP/1.1 and HTTP/2 and includes sync and async APIs. It uses a conventional synchronous API by default, but you can opt into an async client when necessary. To install httpx with pip:
pip install httpx
httpx offers the same API as requests:
import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        response = await client.get('https://httpbin.org/get')
        print(response.status_code)
        print(response.json())

asyncio.run(main())
For examples and usage, please visit the httpx home page.
Use aiofiles for file IO
aiofiles is a Python package for asyncio-based file I/O. It provides a high-level API for file manipulation. To install aiofiles with pip:
pip install aiofiles
Basic usage:
import asyncio
import aiofiles

async def main():
    async with aiofiles.open('test.txt', 'w') as f:
        await f.write('Hello world!')
    async with aiofiles.open('test.txt', 'r') as f:
        print(await f.read())

asyncio.run(main())
For more information, please visit the aiofiles repository.
Conclusion
Basic web scraping in Python is simple, but it can be slow. When you search for something like “fast web scraping in Python,” multiprocessing looks like the easiest option, but it can only take you so far. Hopefully, the solutions above will help you build a faster and more efficient Python web scraper.
Happy Coding…!!!