
Web Scraping at Scale: When and Why to Use Public Proxies


Proxy rotation is a crutch, not a solution. Most scraping operations fail because they treat IP addresses as the only signal anti-bot systems measure. The reality is that modern bot managers — Akamai Bot Manager, Cloudflare Turnstile, Datadome — fingerprint far more than your source IP. A rotating pool of free public proxies buys you almost nothing against those systems, and often makes things worse.

The Illusion of IP Rotation

When you rotate IPs on every request, you announce yourself as a scraper. Human browsing patterns show sticky sessions — a single IP for minutes or hours, consistent browser fingerprints, and predictable request intervals. Tools like requests with a Session object and a rotating proxy list break all of those signals. Bot managers can correlate requests arriving from different IPs whenever the TLS parameters, HTTP/2 settings, and timing remain identical, and Datadome's JavaScript challenge checks for headless-browser artifacts that survive proxy changes. Rotating IPs without rotating the full client fingerprint is like changing your license plate but driving the same car — the toll cameras still flag you.

For low-rate, low-volume scraping against sites that use only basic IP-based rate limiting (e.g., a 10-request-per-minute throttle with no JavaScript challenges), a single residential IP often suffices. I have run scrapers for years against government data portals and public APIs using one static IP and a polite time.sleep(2). No proxy needed. The rule is simple: if the site does not serve a challenge page or a CAPTCHA after 50 requests, you do not need rotation.
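That single-IP pattern fits in a few lines. The sketch below assumes the third-party requests library and a caller-supplied URL list; the delay and jitter values are illustrative, not tuned to any particular site:

```python
import random
import time

import requests  # third-party: pip install requests


def polite_fetch(urls, delay=2.0, jitter=1.0):
    """Fetch URLs from a single IP with a randomized pause between
    requests; enough for sites that only rate-limit per IP."""
    session = requests.Session()  # one connection pool, one consistent fingerprint
    pages = []
    for u in urls:
        resp = session.get(u, timeout=10)
        resp.raise_for_status()
        pages.append(resp.text)
        time.sleep(delay + random.uniform(0, jitter))  # polite, human-ish pacing
    return pages
```

Reusing one Session also keeps cookies and connection behavior consistent, which matters as much as the pacing itself.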

Beyond the IP Address: Fingerprinting

Anti-bot systems now collect dozens of signals per request. The User-Agent string is trivial to spoof, but the Accept-Language, Sec-CH-UA, Connection, and Accept-Encoding headers — and even their ordering — are harder to fake convincingly. More critically, TLS fingerprinting, popularized as the JA3 hash, identifies the client library by its cipher suite order and TLS extension list. Python’s requests library (via urllib3) produces a JA3 hash that is distinct from Chrome 124’s. Cloudflare and Datadome both inspect TLS fingerprints. Rotating IPs while keeping the same TLS stack makes every request look like the same automated client, just hopping between exit nodes. Free proxies compound this because they often run outdated OpenSSL versions or use bot-like TLS configurations that are already blacklisted.
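To make the JA3 idea concrete: the hash is an MD5 over a comma-separated string of fields pulled from the TLS ClientHello. The field values below are made up for illustration and do not correspond to any real browser:

```python
import hashlib

# JA3 input string: TLSVersion,Ciphers,Extensions,EllipticCurves,ECPointFormats
# (hypothetical values, not a real client's fingerprint)
ja3_string = "771,4865-4866-4867,0-10-11-35-65281,29-23-24,0"
ja3_hash = hashlib.md5(ja3_string.encode()).hexdigest()

# Any two clients emitting the same ClientHello get the same 32-char hash,
# no matter which exit IP the packets arrive from.
print(ja3_hash)
```

This is why rotating proxies alone cannot hide an automated client: the hash is computed from the TLS handshake, which the proxy passes through unchanged.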

HTTP/2 fingerprinting goes further. The SETTINGS frame, window update values, and stream concurrency parameters form a unique “HTTP/2 fingerprint” that Akamai’s Bot Manager tracks across sessions. A rotating proxy pool that does not also rotate the HTTP/2 implementation is trivial to cluster. The only way to evade these checks is to use a real browser engine (Puppeteer, Playwright) or a carefully crafted TLS/HTTP stack that mimics a specific browser version — and even then, you need to persist the same fingerprint across requests from a given session.
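Akamai's published format for this fingerprint concatenates the SETTINGS id:value pairs, the connection WINDOW_UPDATE delta, any PRIORITY frames, and the pseudo-header order. A sketch with hypothetical Chrome-like numbers:

```python
# Akamai-style HTTP/2 fingerprint: settings|window_update|priority|header_order
# (values below are hypothetical Chrome-like numbers, for illustration only)
settings = {1: 65536, 3: 1000, 4: 6291456}  # SETTINGS frame id:value pairs
window_update = 15663105                    # connection WINDOW_UPDATE increment
pseudo_header_order = "m,a,s,p"             # :method, :authority, :scheme, :path

fingerprint = "|".join([
    ";".join(f"{k}:{v}" for k, v in settings.items()),
    str(window_update),
    "0",                                    # no PRIORITY frames sent
    pseudo_header_order,
])
print(fingerprint)  # 1:65536;3:1000;4:6291456|15663105|0|m,a,s,p
```

A pool of rotating proxies that all emit this exact string clusters into one client immediately, which is the point the paragraph above makes.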

The Economics of Free Public Proxy Pools

Free public proxy lists have a 60–80 percent failure rate in my testing. Most proxies are either dead on arrival, throttled by the host, or already flagged by major bot managers. The average lifespan of a free SOCKS5 proxy scraped from a public directory is under 15 minutes. Maintaining a rotating pool of 500 proxies means you burn through thousands of IPs per hour, and 80% of your requests either time out or return a 403. The bandwidth is unreliable, latency spikes are common, and many free proxies inject ads or modify response bodies. Paid residential proxy networks (e.g., Bright Data, Oxylabs) offer 95%+ success rates and sticky session options, but at a cost of $10–$20 per GB. For scale, the math favors residential proxies only when you need to bypass IP-based blocks on high-value targets. For everything else, a single clean IP with proper request pacing outperforms a chaotic free pool.
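A back-of-envelope calculation makes the trade-off concrete. Every number below is an assumption drawn from the ranges above (workload size and response size are invented):

```python
# Hypothetical workload: 1M successful page fetches at ~50 KB per response
needed = 1_000_000
free_success, paid_success = 0.20, 0.95  # rough success rates from above
price_per_gb = 15                        # mid-range residential pricing, $/GB

free_attempts = needed / free_success    # 5x the traffic, all of it retries
paid_attempts = needed / paid_success

paid_gb = paid_attempts * 50 / 1_000_000  # KB -> GB (decimal)
paid_cost = paid_gb * price_per_gb        # roughly $790 for the whole job
print(int(free_attempts), round(paid_cost))
```

The free pool's dollar cost is zero, but five million attempts, most of them hanging until a timeout, is exactly the engineering-time cost the paragraph above warns about.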

When Rotation Actually Works

Proxy rotation is effective against one specific threat: IP-based rate limits that reset per IP. If a site uses a simple X-Forwarded-For check or a token bucket per IP, rotating after each request bypasses the limit. This is common on smaller e-commerce sites and legacy APIs that never updated their bot detection. In those cases, even a free proxy pool works — but only if you implement retry logic that discards failed proxies and cycles through fresh ones quickly.

Here is a minimal Python example using requests and a retry-with-rotation loop. It assumes a list of proxy URLs in proxy_list and a target url:

import requests
from itertools import cycle

proxy_pool = cycle(proxy_list)  # endlessly cycle through the candidate proxies
max_retries = 5

for attempt in range(max_retries):
    proxy = next(proxy_pool)
    try:
        resp = requests.get(
            url,
            proxies={"http": proxy, "https": proxy},
            timeout=10,  # fail fast; free proxies often hang
            headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."}
        )
        if resp.status_code == 200:
            break  # success; resp holds the page
    except (requests.ConnectionError, requests.Timeout):
        continue  # dead or throttled proxy; try the next one
else:
    # for/else: reached only if the loop never hit break
    raise RuntimeError("All proxies failed")

This pattern works only when the site’s detection is purely IP-based. Add a time.sleep(random.uniform(1,3)) between requests to mimic human timing. For sites running Turnstile or Datadome, this code will fail every time — the challenge page will return a 403 or a CAPTCHA regardless of the proxy. In those cases, you need a headless browser with a real fingerprint, not a rotating IP list.

Sticky sessions — keeping the same IP for a set of related requests — are often more effective than per-request rotation. Many e-commerce sites expect a single IP for a browsing session (e.g., adding items to a cart, checking out). Rotating mid-session triggers fraud flags. Use a pool of proxies but assign one IP per session, not per request. Free proxies rarely support sticky sessions because the same IP is reused by multiple users; you will see session data cross-contamination. Paid residential proxies offer sticky session durations (5–30 minutes) that align with natural browsing behavior.
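One-IP-per-session can be sketched by binding a proxy to a requests.Session rather than to individual calls. The proxy URLs and shop endpoints below are placeholders, not real services:

```python
import itertools

import requests  # third-party: pip install requests

# Placeholder endpoints; a real provider hands out sticky-session gateway URLs
proxy_list = [
    "http://user:pass@gw1.example:8000",
    "http://user:pass@gw2.example:8000",
]
pool = itertools.cycle(proxy_list)


def new_session():
    """One proxy IP for a whole simulated visitor: every request the
    session makes (browse, cart, checkout) exits through the same address."""
    s = requests.Session()
    proxy = next(pool)
    s.proxies = {"http": proxy, "https": proxy}  # same exit IP for all requests
    return s


# Rotate between sessions, never within one:
# visitor = new_session()
# visitor.get("https://shop.example/cart")      # same IP...
# visitor.get("https://shop.example/checkout")  # ...for the whole flow
```

Rotating between sessions instead of between requests keeps the IP, cookies, and fingerprint aligned for the duration of each simulated visit.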

Choose rotation only when you understand the target’s detection stack. Test with a single IP first. Add rotation only if you hit a rate limit. And never rely on free proxies for production — their failure rate will cost you more in engineering time and lost data than a cheap residential plan.