In Ruby, Net::HTTP and Mechanize facilitate proxy usage by allowing direct configuration of proxy host, port, and authentication credentials during object initialization or connection establishment. These libraries enable Ruby applications to route network requests through intermediary proxy servers, supporting use cases such as IP rotation, geo-targeting, and circumventing rate limits.
Net::HTTP Proxy Configuration
Net::HTTP is Ruby's standard library for making HTTP requests. It provides direct control over connection parameters, including proxy settings.
Direct Proxy Parameters
To configure a proxy for a Net::HTTP request, specify the proxy host, port, username, and password when creating the Net::HTTP object or when starting the connection.
require 'net/http'
require 'uri'
# Proxy details
proxy_host = 'your_proxy_host.com'
proxy_port = 8080
proxy_user = 'proxy_username'
proxy_pass = 'proxy_password'
# Target URL
uri = URI('http://example.com/data')
# Method 1: Specify proxy parameters in Net::HTTP.new
http = Net::HTTP.new(uri.host, uri.port, proxy_host, proxy_port, proxy_user, proxy_pass)
http.use_ssl = (uri.scheme == 'https') # Required for HTTPS
request = Net::HTTP::Get.new(uri.request_uri)
begin
  response = http.request(request)
  puts "Method 1 Response Status: #{response.code}"
  # puts response.body
  response.value # Raises for non-2xx responses; http.request alone does not
rescue Net::HTTPClientException => e # 4xx responses (Ruby 2.6+)
  puts "HTTP Error: #{e.message}"
rescue Net::ReadTimeout, Net::OpenTimeout => e
  puts "Timeout Error: #{e.message}"
rescue StandardError => e
  puts "An error occurred: #{e.message}"
ensure
  http.finish if http.started? # Ensure connection is closed
end
# Method 2: Specify proxy parameters in Net::HTTP.start block
uri_https = URI('https://api.ipify.org?format=json') # A simple endpoint to check IP
begin
  # The begin/rescue wraps the whole call: Net::HTTP.start connects before it
  # yields, so connection errors are raised outside the block body.
  Net::HTTP.start(uri_https.host, uri_https.port,
                  proxy_host, proxy_port, proxy_user, proxy_pass,
                  use_ssl: uri_https.scheme == 'https') do |http_with_proxy|
    request_https = Net::HTTP::Get.new(uri_https.request_uri)
    response_https = http_with_proxy.request(request_https)
    puts "Method 2 Response Status: #{response_https.code}"
    puts "External IP via proxy: #{response_https.body}" # Should show proxy's IP
    response_https.value # Raises for non-2xx responses
  end
rescue Net::HTTPClientException => e # 4xx responses (Ruby 2.6+)
  puts "HTTP Error: #{e.message}"
rescue Net::ReadTimeout, Net::OpenTimeout => e
  puts "Timeout Error: #{e.message}"
rescue StandardError => e
  puts "An error occurred: #{e.message}"
end
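Proxy details often arrive as a single URL rather than four separate values. The sketch below splits such a URL into the positional arguments that Net::HTTP.new expects. The helper name proxy_args_from is hypothetical, and the sketch assumes the credentials contain no percent-encoded characters. No request is sent; the accessors only report what was configured.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of Net::HTTP): turns a proxy URL into the
# [host, port, user, password] list accepted by Net::HTTP.new and Net::HTTP.start.
# Assumes the credentials contain no percent-encoded characters.
def proxy_args_from(proxy_url)
  proxy = URI(proxy_url)
  [proxy.host, proxy.port, proxy.user, proxy.password]
end

target = URI('http://example.com/data')
http = Net::HTTP.new(target.host, target.port,
                     *proxy_args_from('http://user:pass@proxy.example.com:8080'))

# Inspect the configuration without opening a connection.
puts http.proxy?        # true when a proxy is configured
puts http.proxy_address
puts http.proxy_port
```

Keeping the parsing in one place means the same list can be splatted into Net::HTTP.start as well.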
Environment Variables
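Net::HTTP can pick up proxy settings from the environment when no explicit proxy arguments are passed. This approach is suitable for system-wide or application-wide proxy configuration without modifying code.
- http_proxy: The proxy URL (e.g., http://user:pass@proxy.example.com:8080). Net::HTTP's documentation describes this as the variable it consults, and it applies to both HTTP and HTTPS targets. Credentials embedded in the URL may not be honored on every platform or Ruby version, so pass them explicitly if authentication fails.
- https_proxy: Used by URI#find_proxy for https URIs and by libraries built on it, such as open-uri (e.g., https://user:pass@proxy.example.com:8080). Do not rely on Net::HTTP reading it.
- no_proxy: A comma-separated list of hostnames that should bypass the proxy.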
require 'net/http'
require 'uri'
# Set environment variables (example, typically done outside the script)
# ENV['http_proxy'] = 'http://proxy_username:proxy_password@your_proxy_host.com:8080'
# ENV['https_proxy'] = 'http://proxy_username:proxy_password@your_proxy_host.com:8080' # Note: https_proxy can also be an http proxy
# ENV['no_proxy'] = 'localhost,127.0.0.1'
uri = URI('http://example.com/data')
http = Net::HTTP.new(uri.host, uri.port) # No explicit proxy parameters
request = Net::HTTP::Get.new(uri.request_uri)
# If http_proxy is set, Net::HTTP will use it.
begin
  response = http.request(request)
  puts "Env Var Response Status: #{response.code}"
rescue StandardError => e
  puts "An error occurred: #{e.message}"
end
# Clear environment variables after use if set programmatically
# ENV.delete('http_proxy')
# ENV.delete('https_proxy')
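Because environment-based configuration is implicit, it helps to confirm what a Net::HTTP object actually picked up before sending traffic. The following sketch uses the proxy?, proxy_from_env?, proxy_address, and proxy_port accessors; the helper name proxy_summary and the addresses (taken from a documentation-only IP range) are placeholders. Nothing is sent over the network.

```ruby
require 'net/http'

# Hypothetical helper: reports which proxy, if any, a Net::HTTP object will use.
def proxy_summary(http)
  {
    proxied:  http.proxy?,
    from_env: http.proxy_from_env?,
    address:  http.proxy_address,
    port:     http.proxy_port
  }
end

# Placeholder values; normally these are exported in the shell.
ENV['http_proxy'] = 'http://203.0.113.5:8080'
ENV['no_proxy']   = '203.0.113.20'

p proxy_summary(Net::HTTP.new('203.0.113.10', 80))      # picked up from http_proxy
p proxy_summary(Net::HTTP.new('203.0.113.20', 80))      # listed in no_proxy, so direct
p proxy_summary(Net::HTTP.new('203.0.113.10', 80, nil)) # nil proxy address ignores ENV

ENV.delete('http_proxy')
ENV.delete('no_proxy')
```

Passing nil as the proxy address is a way to force a direct connection for one object even when the environment defines a proxy.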
Mechanize Proxy Configuration
Mechanize is a Ruby gem that simplifies web scraping by emulating a web browser. It builds upon Net::HTTP and offers a higher-level API for handling proxies.
Direct Proxy Parameters
Mechanize sets proxy details through the set_proxy method, which can be called inside the block passed to Mechanize.new or later on an existing agent.
require 'mechanize'
# Proxy details
proxy_host = 'your_proxy_host.com'
proxy_port = 8080
proxy_user = 'proxy_username'
proxy_pass = 'proxy_password'
# Method 1: Initialize Mechanize with proxy parameters
agent = Mechanize.new do |a|
  a.set_proxy(proxy_host, proxy_port, proxy_user, proxy_pass)
  a.user_agent_alias = 'Mac Safari' # Recommended for web scraping
end
begin
  page = agent.get('http://example.com/')
  puts "Method 1 Mechanize Page Title: #{page.title}"
rescue Mechanize::ResponseCodeError => e
  puts "Mechanize HTTP Error: #{e.response_code} - #{e.page.uri}"
rescue Mechanize::Error => e
  puts "Mechanize Error: #{e.message}"
rescue StandardError => e
  puts "An error occurred: #{e.message}"
end
# Method 2: Call set_proxy on an existing agent (simpler for one-off use)
# Mechanize.new takes no proxy options, so the proxy is set after construction.
agent_direct = Mechanize.new
agent_direct.set_proxy(proxy_host, proxy_port, proxy_user, proxy_pass)
agent_direct.user_agent_alias = 'Linux Firefox'
begin
  page_direct = agent_direct.get('https://api.ipify.org?format=json')
  puts "Method 2 Mechanize External IP via proxy: #{page_direct.body}"
rescue Mechanize::ResponseCodeError => e
  puts "Mechanize HTTP Error: #{e.response_code} - #{e.page.uri}"
rescue Mechanize::Error => e
  puts "Mechanize Error: #{e.message}"
rescue StandardError => e
  puts "An error occurred: #{e.message}"
end
Proxy Rotation with Mechanize
For scenarios requiring frequent IP changes, such as large-scale data collection, proxy rotation is essential. Mechanize's set_proxy method can be called multiple times to change the proxy during an agent's lifecycle.
require 'mechanize'
proxies = [
  { host: 'proxy1.example.com', port: 8080, user: 'user1', pass: 'pass1' },
  { host: 'proxy2.example.com', port: 8080, user: 'user2', pass: 'pass2' },
  # ... more proxies
]
agent = Mechanize.new
agent.user_agent_alias = 'Windows Chrome'
proxies.each_with_index do |p, i|
  puts "Using proxy #{i + 1}: #{p[:host]}"
  agent.set_proxy(p[:host], p[:port], p[:user], p[:pass])
  begin
    page = agent.get('https://api.ipify.org?format=json')
    puts "Current IP: #{page.body.strip}"
    sleep(2) # Pause to avoid overwhelming the target
  rescue Mechanize::ResponseCodeError => e
    puts "Error with proxy #{p[:host]}: #{e.response_code}"
  rescue Mechanize::Error, StandardError => e
    puts "Connection error with proxy #{p[:host]}: #{e.message}"
  end
end
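The loop above visits each proxy once. For longer jobs, a small pool object can hand out proxies in round-robin order and skip those that keep failing. The sketch below is one possible shape for that; the ProxyPool class, its method names, and the max_failures threshold are illustrative assumptions, not part of Mechanize.

```ruby
# Hypothetical round-robin pool; the class and method names are illustrative.
class ProxyPool
  def initialize(proxies, max_failures: 2)
    @proxies = proxies
    @max_failures = max_failures
    @failures = Hash.new(0) # failure count per proxy host
    @cursor = 0
  end

  # Returns the next proxy that has not been retired, or nil when none are left.
  def next_proxy
    @proxies.size.times do
      candidate = @proxies[@cursor % @proxies.size]
      @cursor += 1
      return candidate if @failures[candidate[:host]] < @max_failures
    end
    nil
  end

  # Records a failed request so that repeatedly failing proxies are skipped.
  def report_failure(proxy)
    @failures[proxy[:host]] += 1
  end
end

pool = ProxyPool.new([
  { host: 'proxy1.example.com', port: 8080, user: 'user1', pass: 'pass1' },
  { host: 'proxy2.example.com', port: 8080, user: 'user2', pass: 'pass2' }
])

proxy = pool.next_proxy
# With a Mechanize agent:
# agent.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
puts "Next proxy: #{proxy[:host]}"
```

Calling report_failure from the rescue branches of the request loop keeps the retirement decision separate from the scraping code.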
Comparison: Net::HTTP vs. Mechanize for Proxies
| Feature | Net::HTTP | Mechanize |
|---|---|---|
| Level of Abstraction | Low-level HTTP client. Direct control over connection parameters. | High-level web scraping library. Emulates browser. |
| Proxy Configuration | Constructor arguments or Net::HTTP.start. | set_proxy method. |
| Ease of Use | More verbose for complex tasks. | Simpler for navigating websites, form submission. |
| Automatic Features | None beyond basic HTTP. | Cookie handling, redirects, user-agent management. Does not execute JavaScript. |
| Error Handling | Net::HTTP exceptions (e.g., Net::OpenTimeout). | Mechanize::Error, Mechanize::ResponseCodeError. |
| Best for | Simple API calls, specific HTTP/S requests. | Web scraping, form submission, multi-page navigation. |
| Dependency | Standard library. | Gem (mechanize). |
Proxy Types and Limitations
Net::HTTP (and by extension Mechanize) primarily supports HTTP and HTTPS proxy types. These proxies forward HTTP/HTTPS requests.
- HTTP Proxies: Used for unencrypted HTTP traffic.
- HTTPS Proxies (CONNECT): Used for encrypted HTTPS traffic. Net::HTTP establishes a CONNECT tunnel through the proxy to the target host.
- SOCKS Proxies: Net::HTTP does not natively support SOCKS proxies (SOCKS4, SOCKS5). To use SOCKS proxies in Ruby, an external gem such as socksify is required. It integrates with Net::HTTP by replacing the TCP socket that Net::HTTP opens with one that connects through the SOCKS server.
# Example of SOCKS proxy usage with the socksify gem
# gem install socksify
# Sketch only: Net::HTTP.SOCKSProxy and the TCPSocket.socks_* setters follow the
# socksify 1.7.x API. Check the README of the installed version before relying on them.
require 'net/http'
require 'socksify'
require 'socksify/http'
require 'uri'
socks_proxy_host = 'your_socks_proxy.com'
socks_proxy_port = 1080
socks_proxy_user = 'socks_user'
socks_proxy_pass = 'socks_pass'
uri = URI('https://api.ipify.org?format=json')
# SOCKS5 credentials are process-wide settings in socksify
TCPSocket.socks_username = socks_proxy_user
TCPSocket.socks_password = socks_proxy_pass
# Build a Net::HTTP subclass whose connections go through the SOCKS server
socks_http_class = Net::HTTP.SOCKSProxy(socks_proxy_host, socks_proxy_port)
http = socks_http_class.new(uri.host, uri.port)
http.use_ssl = true # Essential for HTTPS
http.verify_mode = OpenSSL::SSL::VERIFY_PEER # Recommended for security
request = Net::HTTP::Get.new(uri.request_uri)
begin
  response = http.request(request)
  puts "SOCKS Proxy Response Status: #{response.code}"
  puts "External IP via SOCKS proxy: #{response.body}"
rescue StandardError => e
  puts "SOCKS Proxy Error: #{e.message}"
end
Error Handling and Timeouts
When using proxies, network issues, proxy misconfigurations, or proxy server failures are common. Robust error handling is crucial.
Common Errors
- Net::OpenTimeout: The connection to the proxy or target server timed out before being established.
- Net::ReadTimeout: The server (proxy or target) did not send data within the specified timeout period.
- Errno::ECONNREFUSED: The proxy server actively refused the connection.
- Net::HTTPBadResponse: The proxy server or target returned an invalid HTTP response.
- Mechanize::ResponseCodeError: Mechanize-specific error when the target server returns a non-2xx HTTP status code (e.g., 403 Forbidden, 404 Not Found, 500 Internal Server Error).
Timeout Configuration
Setting appropriate timeouts prevents scripts from hanging indefinitely.
require 'net/http'
require 'uri'
uri = URI('http://slow-api.example.com')
proxy_host = 'your_proxy_host.com'
proxy_port = 8080
http = Net::HTTP.new(uri.host, uri.port, proxy_host, proxy_port)
http.open_timeout = 5 # Timeout for establishing the connection (seconds)
http.read_timeout = 10 # Timeout for reading data from the connection (seconds)
http.use_ssl = (uri.scheme == 'https')
request = Net::HTTP::Get.new(uri.request_uri)
begin
  response = http.request(request)
  puts "Response Status: #{response.code}"
rescue Net::OpenTimeout
  puts "Connection to proxy or target timed out."
rescue Net::ReadTimeout
  puts "Reading data from proxy or target timed out."
rescue Errno::ECONNREFUSED
  puts "Connection refused by proxy or target server."
rescue StandardError => e
  puts "An unexpected error occurred: #{e.message}"
end
# Mechanize also supports timeouts
# agent = Mechanize.new
# agent.set_proxy(proxy_host, proxy_port)
# agent.open_timeout = 5
# agent.read_timeout = 10
Best Practices
- User-Agent Strings: When scraping, always set a realistic User-Agent header. Many websites block requests without one or with generic ones. Mechanize's user_agent_alias simplifies this.
- Referer Headers: For some sites, a valid Referer header is necessary to mimic legitimate browser behavior.
- Cookie Management: Mechanize handles cookies automatically, which is crucial for maintaining sessions through a proxy. With Net::HTTP, manual cookie management is required.
- Error Handling and Retries: Implement retry logic with exponential backoff for transient network errors or proxy issues.
- Proxy Health Checks: Before using a proxy, consider a quick check against a known endpoint (e.g., https://api.ipify.org) to verify its functionality and IP address.
- Resource Management: Ensure Net::HTTP connections are properly closed, especially when not using the start block, to prevent resource leaks. Mechanize manages connections internally.
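
The retry advice above can be sketched as a small wrapper. This is a minimal illustration under stated assumptions: the helper name with_retries, its keyword arguments, and the default list of errors treated as transient are all hypothetical choices, and the pause doubles after each failed attempt (base_delay, then 2 * base_delay, and so on).

```ruby
require 'net/http'

# Hypothetical helper: runs the block and retries the listed errors with
# exponentially growing pauses. The sleeper argument exists so the pause
# can be replaced in tests.
def with_retries(max_attempts: 3, base_delay: 1.0,
                 retry_on: [Net::OpenTimeout, Net::ReadTimeout,
                            Errno::ECONNREFUSED, Errno::ECONNRESET],
                 sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *retry_on
    raise if attempt >= max_attempts # Give up and re-raise the last error
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Demonstration with a simulated failure instead of a real request.
result = with_retries(base_delay: 0.1) do |attempt|
  raise Net::ReadTimeout if attempt < 2
  "succeeded on attempt #{attempt}"
end
puts result

# With a real proxied request:
# response = with_retries(max_attempts: 4) { http.request(request) }
```

Only errors that are plausibly transient are retried; anything else propagates immediately so that configuration mistakes surface on the first attempt.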