Using Scrapy and Selenium with GProxy.net Proxies for Price Parsing

Price parsing at scale requires a hybrid approach that balances raw speed with the ability to bypass sophisticated anti-bot mechanisms. By integrating Scrapy’s asynchronous crawling framework with Selenium’s browser automation and GProxy.net’s residential proxy network, developers can reliably extract real-time pricing data from even the most protected e-commerce environments. This combination ensures that you can handle JavaScript-heavy rendering while maintaining a low detection profile through rotating residential IPs.

The Architecture of a High-Performance Price Parser

Building a price parser is no longer a simple matter of sending a GET request and parsing the HTML. Modern e-commerce platforms like Amazon, Zalando, and Walmart utilize dynamic price rendering, where the actual cost of an item is injected into the DOM via JavaScript after the initial page load. This necessitates a two-tiered architectural approach.

Scrapy serves as the backbone of the operation. Its asynchronous nature allows it to handle thousands of requests simultaneously, making it ideal for crawling category pages and discovering product URLs. However, when Scrapy encounters a page protected by PerimeterX or Akamai, or a page that requires heavy JS execution to display "Buy Box" prices, it hands the task over to Selenium. Selenium operates a real browser instance (Chrome or Firefox), executing scripts and handling cookies just as a human user would.
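The hand-off described above can be sketched as a small routing function. This is a minimal sketch rather than a definitive rule set: the name `choose_fetcher`, the set of status codes treated as blocks, and the `price_marker` string are all assumptions made for illustration.

```python
# Hypothetical routing helper: decides whether a page fetched by Scrapy is
# usable as-is or should be re-fetched in a real browser via Selenium.

BLOCK_STATUSES = {403, 429}  # assumption: statuses treated as anti-bot blocks


def choose_fetcher(status: int, html: str, price_marker: str) -> str:
    """Return 'scrapy' if the static HTML is usable, otherwise 'selenium'."""
    if status in BLOCK_STATUSES:
        return "selenium"  # blocked: retry with a full browser
    if price_marker not in html:
        return "selenium"  # price is probably injected by JavaScript
    return "scrapy"


if __name__ == "__main__":
    static_page = '<span class="price-tag">19.99</span>'
    js_page = '<div id="root"></div><script src="app.js"></script>'
    for status, page in [(200, static_page), (200, js_page), (403, static_page)]:
        print(status, choose_fetcher(status, page, "price-tag"))
```

In a real crawl the marker check would more likely be a CSS selector lookup on the Scrapy response; a plain substring test keeps the sketch dependency-free.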

The missing link in this architecture is the network identity. E-commerce sites monitor IP reputation and request patterns. If they detect a high volume of requests from a datacenter IP range, they will immediately serve CAPTCHAs or incorrect pricing data (ghosting). GProxy.net provides the necessary residential infrastructure to mask these automated requests. By using GProxy’s rotating residential proxies, each request appears to originate from a unique home internet connection, making it nearly impossible for target servers to distinguish the scraper from a legitimate shopper.

Integrating GProxy.net Proxies with Scrapy

To use GProxy.net with Scrapy, you must implement a custom middleware or use the built-in HttpProxyMiddleware. Since price parsing involves high-frequency requests, the best practice is to use GProxy's backconnect nodes, which handle rotation automatically on the server side.

Configuring Scrapy Settings

Scrapy's HttpProxyMiddleware is enabled by default, so settings.py mainly needs to hold your GProxy credentials. The middleware applies whatever proxy it finds in request.meta['proxy'] (or in the http_proxy/https_proxy environment variables). Using residential proxies routes your requests through real user devices, significantly reducing the chance of being blocked during large-scale crawls.


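# settings.py
# HttpProxyMiddleware is enabled by default at priority 750; listing it here
# only makes the dependency explicit.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}

# GProxy residential proxy configuration
# Format: http://username:password@host:port
# PROXY_URL is a custom setting, not a built-in Scrapy one: the spider must
# read it and pass it along as request.meta['proxy'].
PROXY_URL = "http://user-12345:password@residential.gproxy.net:8000"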

Implementing the Proxy Logic in Spiders

You can pass the proxy through the meta key of your scrapy.Request. This is particularly useful when you want to switch between different GProxy zones (e.g., switching from US proxies to UK proxies for regional price comparison).


import scrapy

class PriceSpider(scrapy.Spider):
    name = 'gproxy_spider'

    def start_requests(self):
        # PROXY_URL is the custom setting defined in settings.py
        proxy = self.settings.get('PROXY_URL')
        urls = ['https://example-ecommerce.com/product-1']
        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                # HttpProxyMiddleware routes the request through this proxy
                meta={'proxy': proxy},
            )

    def parse(self, response):
        # .get() returns None when the selector matches nothing
        price = response.css('.price-tag::text').get()
        yield {'price': price.strip() if price else None, 'url': response.url}

Handling Dynamic Content with Selenium and GProxy

When a website uses "Shadow DOM" or complex React/Vue state management to display prices, Scrapy’s Selector will return empty results. This is where Selenium becomes necessary. To maintain anonymity, Selenium must also be configured to use GProxy.net nodes.

selenium-wire is a common choice for proxy integration because it allows easy header manipulation and supports authenticated proxies, which standard Selenium drivers struggle with unless you load a browser extension manually.


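from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from seleniumwire import webdriver

def get_dynamic_price(url, timeout=15):
    proxy_options = {
        'proxy': {
            'http': 'http://user-12345:password@residential.gproxy.net:8000',
            'https': 'https://user-12345:password@residential.gproxy.net:8000',
            'no_proxy': 'localhost,127.0.0.1'
        }
    }

    chrome_options = Options()
    chrome_options.add_argument('--headless')  # Run without a UI for performance

    driver = webdriver.Chrome(options=chrome_options, seleniumwire_options=proxy_options)

    try:
        driver.get(url)
        # Wait up to `timeout` seconds for the price element to be rendered via JS.
        # find_element_by_css_selector was removed in Selenium 4; use By locators.
        price_element = WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, '.dynamic-price'))
        )
        return price_element.text
    finally:
        # Always release the browser, even if the wait times out
        driver.quit()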

When using Selenium, resource management is critical. A single Chrome instance can consume 200MB to 500MB of RAM. If you are parsing 10,000 prices, you should not launch 10,000 instances. Instead, use a pool of workers or integrate Selenium into a Scrapy middleware (like scrapy-selenium) to reuse driver instances across multiple requests.
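One way to reuse drivers is a small fixed-size pool. The sketch below is an assumption about how such a pool might look, not part of Selenium or scrapy-selenium: `DriverPool` and its `factory` argument are hypothetical names, and `factory` stands for any zero-argument function that returns a configured driver.

```python
import queue
from contextlib import contextmanager


class DriverPool:
    """Fixed-size pool that hands out reusable driver objects."""

    def __init__(self, factory, size=4):
        # `factory` is any zero-argument callable returning a driver, e.g. a
        # function that builds a proxy-configured selenium-wire Chrome.
        self._drivers = queue.Queue()
        for _ in range(size):
            self._drivers.put(factory())

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks until a driver is free (raises queue.Empty after `timeout`).
        driver = self._drivers.get(timeout=timeout)
        try:
            yield driver
        finally:
            self._drivers.put(driver)  # hand it back for the next URL

    def close(self):
        # Quit every driver that is currently idle in the pool.
        while not self._drivers.empty():
            self._drivers.get_nowait().quit()
```

A worker would then wrap each page load in `with pool.borrow() as driver:` and call `pool.close()` once the crawl finishes, so 10,000 URLs share a handful of browsers instead of spawning one each.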

Comparing Scrapy and Selenium for Price Extraction

Choosing the right tool depends on the target site's complexity. The following table compares the two methods when used in conjunction with GProxy.net residential proxies.

Feature            | Scrapy + GProxy                 | Selenium + GProxy
-------------------|---------------------------------|------------------------------
Speed              | High (Asynchronous)             | Low (Browser Overhead)
Resource Usage     | Low (Memory efficient)          | High (CPU/RAM intensive)
JS Rendering       | No (Requires Splash/Playwright) | Native Support
Anti-Bot Bypassing | Medium (Relies on Headers/IP)   | High (Mimics Real User)
Best Use Case      | Catalog/Category Scraping       | Checkout/Dynamic Price Logic

Advanced Proxy Strategies: Geo-Targeting and Sticky Sessions

Price parsing often requires seeing what a user in a specific location sees. For example, Amazon displays different shipping costs and localized prices based on the visitor's IP address. GProxy.net allows for granular geo-targeting, which is essential for accurate competitive intelligence.

Implementing Sticky Sessions

In some scenarios, you need to keep the same IP address across multiple requests, for instance when adding an item to a cart to see the final price including taxes. GProxy supports "sticky sessions" by appending a session ID to your username. For the duration of that session (e.g., 10-30 minutes), all your requests go through the same residential exit node.

Example of a sticky session string: user-12345-session-uniqueid789:password@residential.gproxy.net:8000. By changing the uniqueid789 part, you can trigger a new IP whenever your logic dictates.
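Building that string by hand is error-prone, so a small helper can generate it. This is a minimal sketch that assumes the `user-session-id` username layout shown above; the function name `sticky_proxy_url` and the default host and port are taken from this article's examples, not from GProxy documentation.

```python
import uuid
from urllib.parse import quote


def sticky_proxy_url(user, password, session_id=None,
                     host="residential.gproxy.net", port=8000):
    """Build a proxy URL whose username carries a sticky-session ID."""
    # A fresh random ID yields a new session, and therefore a new exit IP.
    session_id = session_id or uuid.uuid4().hex[:12]
    # Percent-encode the password so characters like '@' or ':' stay valid.
    return f"http://{user}-session-{session_id}:{quote(password, safe='')}@{host}:{port}"


if __name__ == "__main__":
    # Reuse one URL for every step of a cart flow, then build a new one.
    print(sticky_proxy_url("user-12345", "password", "uniqueid789"))
```

Keep the returned URL for the whole multi-step flow and call the helper again only when you want a different exit node.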

Regional Price Comparisons

If you are monitoring prices for a global brand, you must verify pricing across different markets. With GProxy, you can specify the country code in the proxy credentials. This is vital for verifying Minimum Advertised Price (MAP) compliance across different jurisdictions without being redirected to a global landing page.

  • US-based pricing: Use country-us in your GProxy configuration.
  • EU-based pricing: Use country-de or country-fr to see Euro-denominated prices.
  • Asia-Pacific: Target country-jp or country-sg for regional marketplaces like Rakuten or Shopee.
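A regional comparison then amounts to requesting the same product once per country. The sketch below assumes the country code is appended to the username as `-country-xx`; that placement is a guess based on the `country-us` style above, so check your GProxy dashboard for the exact format. `regional_request_kwargs` is a hypothetical helper name.

```python
PROXY_HOST = "residential.gproxy.net:8000"  # host and port from the examples above


def country_proxy_url(user, password, country):
    """Build a proxy URL targeting one country (username format is assumed)."""
    return f"http://{user}-country-{country.lower()}:{password}@{PROXY_HOST}"


def regional_request_kwargs(product_url, user, password, countries):
    """Yield keyword arguments for one scrapy.Request per country."""
    for code in countries:
        yield {
            "url": product_url,
            # The same URL is fetched several times, so Scrapy's duplicate
            # filter has to be bypassed.
            "dont_filter": True,
            "meta": {
                "proxy": country_proxy_url(user, password, code),
                "country": code,  # carried through to the parse callback
            },
        }
```

Inside a spider this could be used as `yield scrapy.Request(callback=self.parse, **kwargs)` for each generated `kwargs`, with `response.meta['country']` tagging every scraped price.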

Handling Anti-Bot Systems and Fingerprinting

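Proxies are the first line of defense, but sophisticated sites also inspect browser fingerprints. When using Selenium, override the navigator.webdriver property so that it returns undefined. Automated browsers set this flag to true, and sites check it to see whether the browser is being controlled by automation software. It is only one of several signals, so treat the override as necessary rather than sufficient.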
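With a Chromium-based driver, one way to apply this override is the DevTools command `Page.addScriptToEvaluateOnNewDocument`, which Selenium 4 exposes through `execute_cdp_cmd`. The helper below is a sketch; `hide_webdriver_flag` is a hypothetical name, and the approach does not apply to Firefox.

```python
# JavaScript injected before any of the page's own scripts run.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', "
    "{ get: () => undefined });"
)


def hide_webdriver_flag(driver):
    """Make navigator.webdriver return undefined on every new document.

    `driver` is expected to be a Chromium-based Selenium driver; call this
    once, before the first driver.get().
    """
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )
```

Registering the script through CDP matters because running the same JavaScript with `execute_script` after the page loads would be too late: the site's detection code may already have read the flag.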

Additionally, pay attention to the User-Agent. If you use GProxy's residential IPs but send Scrapy's default User-Agent (Scrapy/2.x (+https://scrapy.org)), you are likely to be blocked almost immediately. Use a library such as fake-useragent to generate realistic, modern strings that match the browser version you are simulating in Selenium.

  1. Rotate User-Agents: Ensure the User-Agent matches the proxy's perceived device type (Mobile vs. Desktop).
  2. Manage Cookies: Clear cookies between sessions unless you are using sticky sessions for a multi-step flow.
  3. Randomize Delays: Use time.sleep(random.uniform(1, 5)) in Selenium to mimic human reading patterns.
  4. Monitor Response Codes: If you see a 403 or 429, immediately rotate your GProxy session ID.
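Three of these habits fit in a few lines of plain Python. The sketch below is illustrative only: the User-Agent strings are a tiny hand-picked sample that would need regular refreshing, and the helper names and the choice of 403 and 429 as rotation triggers are assumptions.

```python
import random
import time

# Assumption: a small sample pool keyed by device type; keep it current.
USER_AGENTS = {
    "desktop": [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    ],
    "mobile": [
        "Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 "
        "Mobile/15E148 Safari/604.1",
    ],
}

ROTATE_ON = {403, 429}  # Forbidden, Too Many Requests


def pick_user_agent(device_type):
    """Pick a User-Agent that matches the proxy's device type."""
    return random.choice(USER_AGENTS[device_type])


def human_pause(low=1.0, high=5.0, sleep=time.sleep):
    """Sleep for a random interval and return the delay that was used."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


def should_rotate(status):
    """True when the response status suggests the current IP is flagged."""
    return status in ROTATE_ON
```

Passing `sleep` as a parameter is a small design choice that lets tests replace the real delay with a no-op.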

Key Takeaways

Successfully parsing prices at scale is a balancing act between efficiency and stealth. By combining Scrapy, Selenium, and GProxy.net, you create a robust pipeline capable of bypassing the most common hurdles in web scraping.

  • Hybrid Approach: Use Scrapy for high-volume URL discovery and Selenium only for pages that require JavaScript execution to reveal prices.
  • Residential Advantage: Always use residential proxies from GProxy.net for the final data extraction phase; datacenter IPs are too easily flagged by e-commerce CDN filters.
  • Session Management: Utilize sticky sessions when you need to maintain a consistent identity through a multi-step price discovery process (e.g., adding to cart, entering a zip code).

Practical Tip 1: Always monitor your "Success Rate" per proxy zone. If you notice a drop in successful extractions in a specific region, rotate your GProxy session IDs or switch from Datacenter to Residential nodes immediately.
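A per-zone success rate can be tracked with a small counter class. This is a sketch under stated assumptions: `ZoneStats`, the 80% threshold, and the minimum sample size are illustrative choices, not GProxy recommendations.

```python
from collections import defaultdict


class ZoneStats:
    """Track extraction success per proxy zone (e.g. 'us', 'de')."""

    def __init__(self, min_samples=20, threshold=0.8):
        self.min_samples = min_samples  # ignore zones with too little data
        self.threshold = threshold      # assumed acceptable success rate
        self._counts = defaultdict(lambda: [0, 0])  # zone -> [successes, total]

    def record(self, zone, ok):
        """Record one extraction attempt for `zone`."""
        counts = self._counts[zone]
        counts[0] += int(bool(ok))
        counts[1] += 1

    def success_rate(self, zone):
        """Return the success ratio for `zone`, or None if it has no data."""
        successes, total = self._counts.get(zone, (0, 0))
        return successes / total if total else None

    def needs_rotation(self, zone):
        """True once enough samples exist and the rate is below threshold."""
        successes, total = self._counts.get(zone, (0, 0))
        if total == 0 or total < self.min_samples:
            return False
        return successes / total < self.threshold
```

Call `record(zone, ok)` from the item pipeline or parse callback, and check `needs_rotation(zone)` before scheduling the next batch for that region.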

Practical Tip 2: Implement a "Retry Middleware" in Scrapy that specifically looks for 429 (Too Many Requests) and 403 (Forbidden) errors. Configure it to automatically retry the request using a new GProxy residential IP, ensuring your crawl doesn't stall due to temporary blocks.
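Such a middleware can be quite small, because a Scrapy downloader middleware only needs a `process_response` method, and returning a Request from it reschedules the download. The sketch below is an assumption-laden illustration: the class name, the retry limit, the `block_retries` meta key, and the proxy template (which reuses the sticky-session username format described earlier) are all hypothetical.

```python
import uuid


class RotateOnBlockMiddleware:
    """Downloader middleware: retry blocked responses through a new session."""

    BLOCK_STATUSES = {403, 429}
    MAX_RETRIES = 3
    # Assumption: sticky-session username format from this article's examples.
    PROXY_TEMPLATE = (
        "http://user-12345-session-{sid}:password@residential.gproxy.net:8000"
    )

    def process_response(self, request, response, spider):
        if response.status not in self.BLOCK_STATUSES:
            return response  # not blocked: pass the response through

        retries = request.meta.get("block_retries", 0)
        if retries >= self.MAX_RETRIES:
            return response  # give up and let the spider see the error

        meta = dict(request.meta)
        meta["block_retries"] = retries + 1
        # A new session ID asks the gateway for a different exit IP.
        meta["proxy"] = self.PROXY_TEMPLATE.format(sid=uuid.uuid4().hex[:12])
        # dont_filter=True keeps the duplicate filter from dropping the retry.
        return request.replace(meta=meta, dont_filter=True)
```

To activate it, add the class to DOWNLOADER_MIDDLEWARES under whatever module path your project uses. Note that Scrapy's built-in RetryMiddleware already retries 429 by default, so the main addition here is the handling of 403 and the change of IP between attempts.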
