Scrapy Proxy Setup

Configure Scrapy with proxy middleware for efficient web scraping. Rotate proxies to avoid IP blocks and maintain anonymity. Learn how!

An HTTP proxy server acts as an intermediary between your web scraping script and the target website. Instead of your Scrapy spider connecting directly to the target, it connects to the proxy server, which then forwards the request to the target. This allows you to mask your IP address, bypass geographical restrictions, and avoid getting blocked by websites employing anti-scraping measures. This article provides a practical guide to setting up and rotating proxies using Scrapy middleware.

Setting Up Proxies in Scrapy with Middleware

Scrapy's middleware system provides a flexible way to handle requests and responses. We can leverage this system to implement proxy support. The process involves creating a custom middleware that intercepts requests and assigns a proxy server to them.
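
Before building the middleware, note that Scrapy already honors a per-request proxy out of the box: setting request.meta['proxy'] routes that single request through the given proxy. A minimal sketch (the spider name, proxy URL, and the httpbin.org echo endpoint are placeholders):

import scrapy

class SingleProxySpider(scrapy.Spider):
    name = "single_proxy_demo"  # placeholder spider name

    def start_requests(self):
        # Route this one request through a single, hard-coded proxy
        yield scrapy.Request(
            "https://httpbin.org/ip",
            meta={"proxy": "http://user:pass@proxy.example.com:8080"},
        )

    def parse(self, response):
        # httpbin.org/ip echoes the caller's IP, so this should log the
        # proxy's address rather than your own
        self.logger.info("Exit IP: %s", response.text)

A custom middleware simply automates this assignment for every request.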

Creating a Custom Proxy Middleware

First, create a new Python file (e.g., proxy_middleware.py) in your Scrapy project. This file will contain the code for your custom proxy middleware.

import random

class ProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # Read the proxy list from the PROXIES setting
        return cls(crawler.settings.getlist('PROXIES'))

    def process_request(self, request, spider):
        # Assign a randomly chosen proxy to every outgoing request
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
        spider.logger.debug(f"Using proxy: {proxy}")

    def process_response(self, request, response, spider):
        # Optional: handle blocking status codes by retrying with a different
        # proxy (unbounded here; cap the retry count in production)
        if response.status in [403, 429]:
            spider.logger.warning(f"Proxy {request.meta['proxy']} blocked, retrying with another proxy.")
            return self._retry_request(request, spider)
        return response

    def _retry_request(self, request, spider):
        # Re-issue the request with a fresh proxy; dont_filter=True stops the
        # duplicate filter from discarding the repeated URL
        new_request = request.replace(dont_filter=True)
        new_request.meta['proxy'] = random.choice(self.proxies)
        return new_request

Explanation:

  • __init__(self, proxies): The constructor takes a list of proxies as input.
  • from_crawler(cls, crawler): This class method is used by Scrapy to create an instance of the middleware. It retrieves the list of proxies from the Scrapy settings.
  • process_request(self, request, spider): This method is called before Scrapy sends a request. It randomly selects a proxy from the list and assigns it to the request's meta['proxy'] attribute. This tells Scrapy to use the specified proxy for this request.
  • process_response(self, request, response, spider): This method allows you to handle the response received from the server. Here, it checks for status codes like 403 (Forbidden) or 429 (Too Many Requests), which often indicate that the proxy is blocked. If a blocking code is found, it retries the request with a different proxy.
  • _retry_request(self, request, spider): This method re-issues the request with a different proxy. It uses request.replace(dont_filter=True) so that Scrapy's duplicate filter does not silently drop the retried URL, then assigns a fresh proxy to the copy. In production you would also cap the number of retries (for example, by counting attempts in request.meta) to avoid endless loops.

Configuring Scrapy Settings

Next, you need to configure your Scrapy settings to enable the middleware and provide a list of proxies. Open your settings.py file and add the following:

# settings.py

# Enable the ProxyMiddleware
DOWNLOADER_MIDDLEWARES = {
    'your_project_name.proxy_middleware.ProxyMiddleware': 350,  # Adjust priority as needed
    # Leave the built-in HttpProxyMiddleware enabled (default priority 750): it
    # applies the user:password credentials in meta['proxy'] as a Proxy-Authorization header.
}

# List of proxies
PROXIES = [
    'http://user1:pass1@proxy1.example.com:8080',
    'http://user2:pass2@proxy2.example.com:8080',
    'http://user3:pass3@proxy3.example.com:8080',
    'https://user4:pass4@proxy4.example.com:8080',
]

# Retry many times since proxies often fail
RETRY_TIMES = 10

# Retry on most error codes since proxies fail a lot
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 408]

# Disable default user agent middleware and use a custom one
DOWNLOADER_MIDDLEWARES.update({
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
})

# Ignore robots.txt rules (set to True if you want to respect them)
ROBOTSTXT_OBEY = False

Explanation:

  • DOWNLOADER_MIDDLEWARES: This dictionary enables and configures downloader middlewares. The key is the path to your middleware class, and the value is the middleware's priority; lower numbers run earlier during request processing. The built-in HttpProxyMiddleware stays enabled at its default priority (750): it runs after the custom middleware, parses any user:password credentials out of meta['proxy'], and sets the Proxy-Authorization header, which authenticated proxies require.
  • PROXIES: This list contains the proxy servers you want to use. The format is protocol://user:password@host:port. Both HTTP and HTTPS proxies can be used. If no username and password are required, the format is simply protocol://host:port.
  • RETRY_TIMES and RETRY_HTTP_CODES: These settings configure Scrapy's retry middleware. Since proxies can be unreliable, it's good practice to increase the number of retries and include common HTTP error codes that might indicate a proxy issue.
  • DOWNLOADER_MIDDLEWARES.update(...): This section disables the default User Agent middleware and enables scrapy_user_agents to rotate User Agents. This helps prevent your scraper from being easily identified. You'll need to install scrapy_user_agents using pip install scrapy-user-agents.

Running the Spider

Now you can run your Scrapy spider as usual. The middleware will automatically assign a proxy to each request.

scrapy crawl your_spider_name
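
To confirm that rotation is actually happening, you can point a throwaway spider at an IP-echo endpoint and watch the exit address change between requests. A minimal sketch (the spider name is a placeholder, and httpbin.org/ip is just one convenient echo service):

import scrapy

class ProxyCheckSpider(scrapy.Spider):
    name = "proxy_check"  # placeholder name

    def start_requests(self):
        # Issue several identical requests; the middleware assigns each a proxy
        for _ in range(5):
            yield scrapy.Request("https://httpbin.org/ip", dont_filter=True)

    def parse(self, response):
        # Each response should report a different exit IP as proxies rotate
        self.logger.info("Exit IP: %s", response.text)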

Proxy Rotation Strategies

Rotating proxies is crucial for preventing your scraper from being blocked. Here are some common strategies:

  • Random Selection: As implemented in the example above, randomly selecting a proxy from the list for each request. This is the simplest approach but may not be the most effective.
  • Sequential Rotation: Cycling through the list of proxies in a sequential manner. This can be useful if you want to ensure that each proxy gets used an equal number of times.
  • Intelligent Rotation: Implementing logic to track the performance of each proxy and prioritize proxies that are working well. This can involve monitoring response times, error rates, and other metrics (a minimal sketch follows this list).
  • Using a Proxy API: Utilizing a proxy service API that automatically handles proxy rotation and management. These services often provide features like geo-targeting and IP address reputation management.
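
To make the intelligent-rotation idea concrete, here is a minimal sketch of failure-aware selection: the middleware counts blocking responses per proxy and always chooses among the proxies with the fewest recorded failures. The class name and the 403/429 heuristic are illustrative assumptions, not a production-ready scorer:

import random
from collections import defaultdict

class SmartProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies
        self.failures = defaultdict(int)  # proxy URL -> count of blocked responses

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('PROXIES'))

    def process_request(self, request, spider):
        # Pick randomly among the proxies with the lowest failure count so far
        best = min(self.failures[p] for p in self.proxies)
        candidates = [p for p in self.proxies if self.failures[p] == best]
        request.meta['proxy'] = random.choice(candidates)

    def process_response(self, request, response, spider):
        # Treat 403/429 as evidence the proxy is blocked and penalize it
        if response.status in [403, 429]:
            self.failures[request.meta['proxy']] += 1
        return response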

Sequential Proxy Rotation

Here's an example of implementing sequential proxy rotation in your middleware:

import itertools

class SequentialProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = itertools.cycle(proxies)  # cycle() loops through the list forever

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('PROXIES'))

    def process_request(self, request, spider):
        proxy = next(self.proxies)  # Get the next proxy in the rotation
        request.meta['proxy'] = proxy
        spider.logger.debug(f"Using proxy: {proxy}")

    def process_response(self, request, response, spider):
        if response.status in [403, 429]:
            spider.logger.warning(f"Proxy {request.meta['proxy']} blocked, rotating to the next proxy.")
            return self._retry_request(request, spider)
        return response

    def _retry_request(self, request, spider):
        # Re-issue the request with the next proxy; dont_filter=True stops the
        # duplicate filter from discarding the repeated URL
        new_request = request.replace(dont_filter=True)
        new_request.meta['proxy'] = next(self.proxies)
        return new_request

Key Change:

  • itertools.cycle(proxies): This creates an iterator that loops indefinitely through the list of proxies. The next() function is used to get the next proxy in the sequence.

Remember to update your DOWNLOADER_MIDDLEWARES setting to point to the SequentialProxyMiddleware.
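
For example, assuming the class lives in the same proxy_middleware.py module as before:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'your_project_name.proxy_middleware.SequentialProxyMiddleware': 350,
}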

Proxy API Integration

Integrating with a proxy API typically involves making requests to the API to retrieve a proxy and handling the API's authentication and error responses. The specifics will depend on the API you choose. Many proxy providers offer Python SDKs to simplify this process.
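
The details vary by provider, but the overall shape is usually the same: fetch a proxy list from the API once at startup, then rotate through it. In the following sketch, the PROXY_API_URL and PROXY_API_KEY settings, the Bearer authentication scheme, and the JSON response shape are all assumptions you would adapt to your provider's documentation:

import itertools

import requests  # called once at startup, not per request

class ApiProxyMiddleware:
    def __init__(self, api_url, api_key):
        # Fetch the current proxy list from the (hypothetical) provider API
        response = requests.get(
            api_url,
            headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
            timeout=10,
        )
        response.raise_for_status()
        self.proxies = itertools.cycle(response.json()["proxies"])  # assumed response shape

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.get('PROXY_API_URL'), settings.get('PROXY_API_KEY'))

    def process_request(self, request, spider):
        request.meta['proxy'] = next(self.proxies)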

Proxy Types

Here's a comparison of different proxy types:

| Feature        | HTTP Proxy                                | HTTPS Proxy                               | SOCKS Proxy                                 |
|----------------|-------------------------------------------|-------------------------------------------|---------------------------------------------|
| Protocol       | HTTP                                      | HTTPS                                     | SOCKS (various versions)                    |
| Encryption     | None between client and proxy             | Encrypted between client and proxy        | Depends on SOCKS version                    |
| Use cases      | Web browsing, scraping HTTP sites         | Web browsing, scraping HTTPS sites        | General-purpose, supports various protocols |
| Anonymity      | Can be less anonymous                     | Can be more anonymous                     | Can be highly anonymous                     |
| Configuration  | Typically configured in web browsers      | Typically configured in web browsers      | Requires SOCKS client or library support    |
| Example URL    | http://host:port                          | https://host:port                         | socks5://host:port or socks4://host:port    |
| Authentication | Basic authentication (username/password)  | Basic authentication (username/password)  | Username/password authentication supported  |

Note that Scrapy's default downloader handles HTTP and HTTPS proxies via meta['proxy'] but does not support SOCKS proxies out of the box; a common workaround is to route traffic through a local HTTP-to-SOCKS bridge such as Privoxy.

Common Issues and Troubleshooting

  • Proxies Not Working: Verify that the proxy server is online and accessible (a quick standalone check script is shown after this list). Check the proxy's authentication credentials (username and password), and ensure that the proxy format in settings.py is correct.
  • Blocked Proxies: Implement proxy rotation and consider using a proxy service with a large pool of IP addresses. Monitor response codes (403, 429) and automatically retry requests with different proxies.
  • Slow Performance: Choose proxies that are geographically close to the target server. Test different proxy providers to find one with reliable performance.
  • HTTPS Errors: Ensure your proxy supports HTTPS connections. Some HTTP proxies only support HTTP traffic.
  • DNS Leaks: Use a SOCKS proxy or configure your system to use the proxy's DNS server to prevent DNS leaks.
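
The standalone check mentioned in the first bullet can be a short script run outside of Scrapy; this sketch tests each proxy against an IP-echo endpoint (the requests library and httpbin.org are just convenient choices):

import requests

PROXIES = [
    'http://user1:pass1@proxy1.example.com:8080',  # same placeholders as settings.py
    'http://user2:pass2@proxy2.example.com:8080',
]

def check_proxy(proxy, timeout=10):
    """Return True if the proxy answers a simple GET within the timeout."""
    try:
        response = requests.get(
            "https://httpbin.org/ip",
            proxies={"http": proxy, "https": proxy},
            timeout=timeout,
        )
        return response.ok
    except requests.RequestException:
        return False

for proxy in PROXIES:
    print(proxy, "OK" if check_proxy(proxy) else "FAILED")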

Conclusion

Setting up and rotating proxies in Scrapy is essential for building robust and reliable web scrapers. By using custom middleware, implementing effective rotation strategies, and understanding the different proxy types, you can significantly reduce the risk of getting blocked and improve the performance of your scraping projects.

Test your proxies regularly and monitor their performance so your scraper keeps functioning effectively, and consider a proxy management service for more advanced features and easier management.
