How to Handle CAPTCHAs When Scraping (Step-by-Step Guide)
A concise overview of handling CAPTCHAs by prioritizing prevention—using techniques like residential proxies, session management, and fingerprint rotation—and relying on solutions like MrScraper as a fallback for reliable, hands-off solving in production scraping.
Your scraper is humming along, pulling data page after page, then it stops cold. Instead of product listings or search results, you are staring at a grid of blurry traffic lights or a checkbox labeled "I am not a robot." CAPTCHA. The universal "go away, bot" message.
Here’s the thing: CAPTCHAs are not random. They are triggered when a site’s bot detection system gets suspicious about your traffic. The requests might be too fast. The IP might not change. The browser fingerprint might look odd. There might be no cookies.
If you only treat the symptom (the CAPTCHA), it will keep coming back. If you fix the cause (traffic that looks like a bot in the first place), you will rarely see one at all.
The complete answer is to handle CAPTCHAs at two levels: prevention first, solving second. Prevention means making your traffic look human enough that CAPTCHAs are not triggered. Solving means having a reliable fallback for the challenges that still appear. Get both right, and CAPTCHAs stop being a pipeline blocker and become an occasional speed bump.
Let’s walk through the full stack, step by step.
What Are CAPTCHAs and Why Do Scrapers Trigger Them?
CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart." The irony isn't lost on anyone — it's a test designed to be automated, yet supposedly only passable by humans.
Modern CAPTCHAs have evolved well beyond the squiggly-letter image of 2005. Today's systems include:
- reCAPTCHA v2 — the classic "I'm not a robot" checkbox (and the image grid challenge it triggers)
- reCAPTCHA v3 — invisible, no user interaction, assigns a risk score in the background
- hCaptcha — increasingly common alternative to reCAPTCHA, used by Cloudflare and many others
- Cloudflare Turnstile — newer, privacy-friendly CAPTCHA that relies heavily on browser fingerprinting
- FunCaptcha / Arkose Labs — interactive puzzles (rotating images, slider challenges) common on high-value targets
- Text/image CAPTCHAs — older systems still found on legacy sites
As Google's reCAPTCHA documentation explains, modern CAPTCHA systems like v3 don't just evaluate a single interaction — they analyze your entire browsing session's behavioral signals before deciding whether to challenge you. By the time you see a CAPTCHA, the system has usually already made up its mind about your traffic. The visible challenge is the consequence, not the evaluation.
That means the best CAPTCHA strategy starts before you write a single line of scraping code.
How CAPTCHA Triggering Works
Understanding what triggers CAPTCHAs makes them far easier to prevent. These are the most common causes:
Datacenter IP addresses — AWS, GCP, and DigitalOcean IP ranges are known bot infrastructure. Sites serving CAPTCHAs often trigger them automatically for any traffic from these ranges — no behavioral analysis required.
Identical browser fingerprints — The same canvas hash, WebGL renderer, and navigator properties appearing across hundreds of requests is statistically impossible for real users. Fingerprint consistency is a top-tier bot signal.
Request velocity — Making 50 requests per second, all from the same session, with perfect timing — that's not human. Even 10 requests per second with a fixed 100ms delay is too mechanical.
Missing session history — Real users don't arrive at deep pages cold. They have cookies from the homepage visit, referrer headers from Google, and session state accumulated over time. A scraper that jumps directly to /products/item-123 with zero session history looks like a bot because it is one.
No mouse or scroll events — Passive behavioral tracking on many sites logs mouse movement, scroll depth, and keyboard interaction. A session with zero behavioral events before a form submission or pagination click is a strong bot signal.
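The velocity signal is easy to quantify. This sketch (plain Python, no network calls; the function names are ours) contrasts a fixed delay schedule with a randomized one:

```python
import random
import statistics

def fixed_delays(n, interval=0.1):
    """A bot-like schedule: every gap between requests is exactly the same."""
    return [interval] * n

def jittered_delays(n, low=2.0, high=5.0):
    """A human-like schedule: gaps drawn from a random range."""
    return [random.uniform(low, high) for _ in range(n)]

# Zero variance across inter-request gaps is a strong automation signal;
# real users produce noisy, irregular timing.
print(statistics.pstdev(fixed_delays(50)))     # 0.0, perfectly mechanical
print(statistics.pstdev(jittered_delays(50)))  # noticeably above zero
```

Detection systems do not need to run this exact computation, but timing regularity is among the cheapest signals for them to check.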
Step-by-Step Guide: Handling CAPTCHAs When Scraping
Step 1: Prevent CAPTCHAs Before They Appear
The most effective CAPTCHA handling strategy is never seeing one. That sounds obvious, but most developers jump straight to solving CAPTCHAs without addressing why they're being triggered.
Use residential proxies. This is the single highest-leverage change. Residential IPs have real ISP assignments and clean reputations — they don't automatically trigger CAPTCHA challenges the way datacenter IPs do. Tools like MrScraper's Scraping Browser include residential proxy rotation built in, so you never make requests from a flagged datacenter range.
Warm up your session. Visit the homepage before navigating to deep pages. Let cookies accumulate naturally. Set a realistic Referer header:
```python
from playwright.async_api import async_playwright
import asyncio
import random

async def scrape_with_warm_session(target_url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            locale="en-US",
            timezone_id="America/New_York",
        )
        page = await context.new_page()

        # Step 1: Warm up — visit the homepage first
        await page.goto("https://example.com", wait_until="domcontentloaded")
        await asyncio.sleep(random.uniform(2.0, 4.0))  # Simulate reading time

        # Step 2: Scroll a bit — passive behavioral signal
        await page.evaluate("window.scrollBy(0, window.innerHeight * 0.6)")
        await asyncio.sleep(random.uniform(1.0, 2.5))

        # Step 3: Navigate to your target — now you have cookies and session history
        await page.goto(target_url, wait_until="domcontentloaded")
        await page.wait_for_selector(".product-card")
        data = await page.eval_on_selector_all(
            ".product-card",
            "els => els.map(el => el.textContent.trim())"
        )
        await browser.close()
        return data

asyncio.run(scrape_with_warm_session("https://example.com/products"))
```
That scroll and pause before navigating to your target page costs two seconds. It can save you hours of CAPTCHA debugging.
Slow down your request rate. Fixed delays are detectable — no human waits exactly 1.500 seconds between every request. Randomize:
```python
import asyncio
import random

async def polite_request_loop(urls, page):
    results = []
    for i, url in enumerate(urls):
        await page.goto(url)
        await page.wait_for_selector(".content")
        results.append(await page.content())

        # Variable delay — not a fixed interval
        delay = random.uniform(2.0, 5.0)

        # Longer break every 15 pages — mimics human reading patterns
        if (i + 1) % 15 == 0:
            delay = random.uniform(20.0, 40.0)

        await asyncio.sleep(delay)
    return results
```
Rotate browser fingerprints. Use a different User-Agent, viewport, and timezone per session. Patch navigator.webdriver before the page loads:
```python
async def create_stealthy_context(browser):
    context = await browser.new_context(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        viewport={"width": 1440, "height": 900},
        locale="en-US",
        timezone_id="America/Los_Angeles",
    )
    page = await context.new_page()

    # Remove the most obvious headless signal before any page JS runs
    await page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
        Object.defineProperty(navigator, 'plugins', {
            get: () => [
                { name: 'Chrome PDF Plugin' },
                { name: 'Chrome PDF Viewer' },
            ]
        });
    """)
    return page
```
Step 2: Detect When a CAPTCHA Has Appeared
Even the best prevention fails sometimes. You need to detect CAPTCHAs reliably so your pipeline can respond — not silently collect empty or incorrect data.
```python
async def check_for_captcha(page):
    # Check for common CAPTCHA indicators in the page content
    captcha_signals = [
        "iframe[src*='recaptcha']",
        "iframe[src*='hcaptcha']",
        ".g-recaptcha",
        "#captcha",
        "[data-sitekey]",
        "iframe[title*='challenge']",
    ]
    for selector in captcha_signals:
        element = await page.query_selector(selector)
        if element:
            return True

    # Also check page title and URL for challenge pages
    title = await page.title()
    url = page.url
    if any(signal in title.lower() for signal in ["captcha", "challenge", "verify", "robot"]):
        return True
    if "challenge" in url or "captcha" in url:
        return True
    return False

async def scrape_with_captcha_detection(url, page):
    await page.goto(url, wait_until="domcontentloaded")

    if await check_for_captcha(page):
        print(f"CAPTCHA detected on {url} — triggering fallback handler")
        return None  # Hand off to your solving strategy

    # No CAPTCHA — proceed with normal extraction
    await page.wait_for_selector(".product-card")
    return await page.eval_on_selector_all(
        ".product-card",
        "els => els.map(el => el.textContent.trim())"
    )
```
Silent failure is the real danger here. A scraper that returns empty data when it hits a CAPTCHA, without logging or flagging it, is poisoning your dataset without you knowing.
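One lightweight way to make detection loud is a shared incident log that every worker writes to. The class below is an illustrative sketch (the `CaptchaIncidentLog` name and shape are ours, not from any library):

```python
import time
from collections import Counter

class CaptchaIncidentLog:
    """Records every CAPTCHA hit so blocked URLs never fail silently."""

    def __init__(self):
        self.incidents = []

    def record(self, url, captcha_type="unknown"):
        # Keep enough context to re-queue the URL and spot patterns later
        self.incidents.append({
            "url": url,
            "type": captcha_type,
            "at": time.time(),
        })

    def summary(self):
        # Which CAPTCHA types are hitting you, and how often
        return Counter(i["type"] for i in self.incidents)

log = CaptchaIncidentLog()
log.record("https://example.com/products", "recaptcha_v2")
log.record("https://example.com/search", "recaptcha_v2")
print(log.summary())  # Counter({'recaptcha_v2': 2})
```

Reviewing the summary after each run tells you whether prevention is degrading before your data quality does.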
Step 3: Use a CAPTCHA Solving Service as a Fallback
When prevention fails and you've detected a CAPTCHA, you need a solving strategy. The two main approaches are third-party solving services and AI-based solvers.
Third-party solving services (2captcha, Anti-Captcha) — These services route your CAPTCHA to human workers or AI solvers and return a token you inject back into the page. Integration is straightforward:
```python
import requests
import asyncio

TWOCAPTCHA_API_KEY = "YOUR_2CAPTCHA_API_KEY"

async def solve_recaptcha_v2(page, site_key, page_url):
    # Submit the CAPTCHA to 2captcha
    submit_response = requests.post(
        "https://2captcha.com/in.php",
        data={
            "key": TWOCAPTCHA_API_KEY,
            "method": "userrecaptcha",
            "googlekey": site_key,
            "pageurl": page_url,
            "json": 1,
        }
    )
    captcha_id = submit_response.json().get("request")
    if not captcha_id:
        print("Failed to submit CAPTCHA")
        return False
    print(f"CAPTCHA submitted. ID: {captcha_id}. Waiting for solution...")

    # Poll for the solution — typically takes 15–30 seconds
    for attempt in range(20):
        await asyncio.sleep(5)
        result_response = requests.get(
            f"https://2captcha.com/res.php?key={TWOCAPTCHA_API_KEY}&action=get&id={captcha_id}&json=1"
        )
        result = result_response.json()
        if result.get("status") == 1:
            token = result["request"]
            print(f"Solution received after {(attempt + 1) * 5} seconds")

            # Inject the token into the page's CAPTCHA response field
            await page.evaluate(f"""
                document.getElementById('g-recaptcha-response').innerHTML = '{token}';
                // Trigger the CAPTCHA callback if it exists
                if (typeof ___grecaptcha_cfg !== 'undefined') {{
                    Object.entries(___grecaptcha_cfg.clients).forEach(([key, client]) => {{
                        const callback = client['']['']['callback'];
                        if (callback) callback('{token}');
                    }});
                }}
            """)
            return True
        elif result.get("request") != "CAPCHA_NOT_READY":
            print(f"Solver error: {result.get('request')}")
            return False

    print("Timed out waiting for CAPTCHA solution")
    return False
```
The typical solve time is 15–30 seconds for reCAPTCHA v2. For hCaptcha, the workflow is nearly identical — just change method to "hcaptcha". Costs run roughly $0.50–$2.00 per 1,000 solved CAPTCHAs.
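The submit payloads for the two methods differ in only two fields, which a small builder function can make explicit. This is a hedged sketch based on 2captcha's documented `in.php` parameters (`googlekey` for reCAPTCHA, `sitekey` for hCaptcha); the helper name is ours:

```python
def build_2captcha_payload(api_key, captcha_type, site_key, page_url):
    """Build the in.php submit payload for 2captcha.

    reCAPTCHA v2 uses method=userrecaptcha with a 'googlekey' field;
    hCaptcha uses method=hcaptcha with a 'sitekey' field.
    """
    payload = {"key": api_key, "pageurl": page_url, "json": 1}
    if captcha_type == "recaptcha_v2":
        payload.update({"method": "userrecaptcha", "googlekey": site_key})
    elif captcha_type == "hcaptcha":
        payload.update({"method": "hcaptcha", "sitekey": site_key})
    else:
        raise ValueError(f"Unsupported CAPTCHA type: {captcha_type}")
    return payload
```

Centralizing the payload construction keeps the rest of your solve loop identical across both CAPTCHA types.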
Recovering the site key — You need it for the API call above. Find it in the page source or DevTools:
```python
import re

async def get_recaptcha_site_key(page):
    # Method 1: From the data attribute
    element = await page.query_selector(".g-recaptcha, [data-sitekey]")
    if element:
        return await element.get_attribute("data-sitekey")

    # Method 2: From page source (reCAPTCHA v3)
    content = await page.content()
    match = re.search(r'sitekey["\s:=]+["\']([0-9A-Za-z_-]{40})["\']', content)
    return match.group(1) if match else None
```
Step 4: Use MrScraper for Transparent CAPTCHA Handling
Here's the most practical solution for production pipelines: use MrScraper's Scraping Browser, which handles CAPTCHAs transparently as part of the session — before your extraction logic ever runs. You never see a CAPTCHA. It never interrupts your pipeline. The solving happens in the browser layer, invisibly.
```python
from playwright.async_api import async_playwright
import asyncio

async def scrape_captcha_protected_site(url):
    async with async_playwright() as p:
        # MrScraper handles CAPTCHAs, proxies, and fingerprinting automatically
        browser = await p.chromium.connect_over_cdp(
            "wss://browser.mrscraper.com?token=YOUR_API_TOKEN"
        )
        page = await browser.new_page()

        # Navigate normally — no CAPTCHA detection or solving code needed
        await page.goto(url, wait_until="domcontentloaded")
        await page.wait_for_selector(".product-card", timeout=20000)

        products = await page.eval_on_selector_all(
            ".product-card",
            """els => els.map(el => ({
                name: el.querySelector("h2")?.textContent.trim(),
                price: el.querySelector(".price")?.textContent.trim(),
            }))"""
        )
        await browser.close()
        return products

results = asyncio.run(scrape_captcha_protected_site("https://cloudflare-protected-example.com"))
print(results)
```
No CAPTCHA detection code. No solving service integration. No token injection. The same Playwright code you'd use on an unprotected site — because from your pipeline's perspective, the CAPTCHA doesn't exist.
Or use MrScraper's AI extraction SDK if you want structured data without writing selectors at all:
```python
import asyncio
from mrscraper import MrScraperClient

async def extract_past_captcha():
    client = MrScraperClient(token="YOUR_MRSCRAPER_API_TOKEN")
    result = await client.create_scraper(
        url="https://protected-site.com/listings",
        message="Extract all listing titles, prices, and locations",
        agent="listing",
        proxy_country="US",
    )
    print("Extraction ID:", result["data"]["data"]["id"])

asyncio.run(extract_past_captcha())
```
CAPTCHAs, proxies, fingerprints — all handled. You just get the data.
Step 5: Build Retry Logic Around CAPTCHA Failures
For any pipeline that isn't using transparent CAPTCHA handling, you need retry logic. A failed CAPTCHA solve or an unexpected challenge page shouldn't kill your entire run:
```python
import asyncio

async def scrape_with_captcha_retry(url, page, max_retries=3):
    for attempt in range(max_retries):
        await page.goto(url, wait_until="domcontentloaded")

        if await check_for_captcha(page):
            site_key = await get_recaptcha_site_key(page)
            if site_key:
                solved = await solve_recaptcha_v2(page, site_key, url)
                if solved:
                    # Wait for the page to respond to the solved CAPTCHA
                    # (Playwright's Python API uses wait_for_load_state here,
                    # not Puppeteer's waitForNavigation)
                    await page.wait_for_load_state("domcontentloaded")

                    # Now re-check — occasionally sites serve another challenge
                    if not await check_for_captcha(page):
                        break  # We're through — proceed with extraction

            # If solve failed or no site key — wait and retry with a fresh session
            print(f"CAPTCHA solve attempt {attempt + 1} failed, retrying...")
            await asyncio.sleep(10 * (attempt + 1))  # Progressive backoff
        else:
            break  # No CAPTCHA — continue normally

    # Check one final time before extracting
    if await check_for_captcha(page):
        print(f"Could not bypass CAPTCHA on {url} after {max_retries} attempts")
        return None

    await page.wait_for_selector(".product-card")
    return await page.eval_on_selector_all(
        ".product-card",
        "els => els.map(el => el.textContent.trim())"
    )
```
The progressive backoff — 10 * (attempt + 1) seconds between retries — gives the site's rate limiting system time to reset, and makes your retry pattern less mechanically detectable than immediate re-attempts.
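You can push the pattern further by adding random jitter to each backoff step. A sketch of that schedule with plus or minus 20% noise (the jitter fraction is our choice, not part of the retry code in this guide):

```python
import random

def backoff_schedule(max_retries=3, base=10.0, jitter=0.2):
    """Progressive backoff: 10s, 20s, 30s..., each randomized by +/- 20%."""
    delays = []
    for attempt in range(max_retries):
        nominal = base * (attempt + 1)
        noise = random.uniform(-jitter, jitter) * nominal
        delays.append(nominal + noise)
    return delays

print(backoff_schedule())  # e.g. [9.1, 21.8, 28.4]
```

The jitter keeps even your failure-recovery timing from forming a detectable pattern.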
Common Challenges and Limitations
reCAPTCHA v3 is invisible — and tricky. There's no visible checkbox or image grid. v3 runs silently in the background and returns a score (0.0 to 1.0) to the site — low scores trigger secondary challenges or outright blocks. You can't solve what you can't see. Prevention is the only real answer here: residential proxies, good fingerprints, and human-like behavior to keep your score above the site's threshold.
Cloudflare Turnstile is increasingly common. Turnstile uses browser fingerprinting and behavioral analysis rather than image challenges. Solving services sometimes support it, but coverage is inconsistent. MrScraper's infrastructure handles Turnstile transparently — it's one of the hardest CAPTCHAs to bypass with a DIY approach.
Token reuse is a myth. reCAPTCHA and hCaptcha tokens expire within 2 minutes and are bound to a specific session. You can't solve one token and reuse it across multiple requests — each page interaction that needs a verified token requires a fresh solve.
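Given that expiry, any code that holds a solved token should track its age rather than assume it is still valid. A minimal sketch with a conservative 110-second TTL (our safety margin under the ~2-minute limit; the class name is ours):

```python
import time

TOKEN_TTL_SECONDS = 110  # conservative margin under the ~2-minute expiry

class SolvedToken:
    """Wraps a CAPTCHA token with its solve time so stale tokens are never injected."""

    def __init__(self, token, solved_at=None):
        self.token = token
        self.solved_at = solved_at if solved_at is not None else time.time()

    def is_fresh(self, now=None):
        # Accepting an explicit clock makes the check easy to test
        now = now if now is not None else time.time()
        return (now - self.solved_at) < TOKEN_TTL_SECONDS

t = SolvedToken("example-token", solved_at=1000.0)
print(t.is_fresh(now=1050.0))  # True: 50 seconds old
print(t.is_fresh(now=1200.0))  # False: 200 seconds old, must re-solve
```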
Solve times vary. Human-based CAPTCHA solving typically takes 15–30 seconds per challenge. During peak demand, it can reach 60 seconds or more. Build your pipeline around this latency — don't set extraction timeouts shorter than your worst-case solve time.
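One way to enforce that budget in an async pipeline is to wrap the solve call in `asyncio.wait_for` with a timeout above your worst-case solve time. A sketch (the 90-second budget and the `fake_solver` stand-in are our assumptions):

```python
import asyncio

SOLVE_BUDGET_SECONDS = 90  # above the ~60s worst-case human solve time

async def solve_with_budget(solver_coro, budget=SOLVE_BUDGET_SECONDS):
    """Run a solve coroutine, giving up cleanly past the worst-case latency."""
    try:
        return await asyncio.wait_for(solver_coro, timeout=budget)
    except asyncio.TimeoutError:
        return None  # treat as a failed solve and fall back to retry logic

async def fake_solver():
    # Stand-in for a real solving-service call (hypothetical)
    await asyncio.sleep(0.01)
    return "token"

print(asyncio.run(solve_with_budget(fake_solver())))  # token
```

A timeout that returns `None` feeds naturally into the retry logic from Step 5 instead of hanging a worker indefinitely.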
Some CAPTCHA types resist automated solving. FunCaptcha / Arkose Labs interactive puzzles, audio CAPTCHAs with heavy distortion, and some proprietary systems have low automation solve rates. For these targets, prevention (not triggering the CAPTCHA at all) is even more important.
Conclusion
CAPTCHAs feel like a dead end, but they're really a diagnostic. If you're seeing them constantly, your traffic is triggering suspicion — fix the underlying signals and the CAPTCHAs become rare. If you see them occasionally despite good prevention, a solving service gives you a reliable fallback.
The cleanest production architecture goes: residential proxies + session warming + fingerprint rotation to prevent CAPTCHAs, with MrScraper's Scraping Browser handling anything that slips through — transparently, without a single line of solving code in your pipeline.
Start with prevention. It's cheaper, faster, and more reliable than solving. Add a solving fallback for edge cases. And if you want to skip all of it — connect to MrScraper and let the infrastructure handle CAPTCHAs for you.
What We Learned
- CAPTCHAs are a consequence, not the root problem — they appear because your traffic already looks suspicious; fixing the underlying signals (IP reputation, fingerprinting, request velocity) is always more effective than solving
- reCAPTCHA v3 and Cloudflare Turnstile are invisible — they evaluate your entire session without a visible challenge, making prevention through residential proxies and realistic browser behavior the only reliable defense
- Session warming dramatically reduces CAPTCHA triggers — visiting the homepage first, accumulating cookies, scrolling before navigating, and setting a realistic Referer header collectively make your traffic look like a real user
- Third-party solving services (2captcha, Anti-Captcha) work for reCAPTCHA v2 and hCaptcha — expect 15–30 second solve times and costs around $0.50–$2.00 per 1,000 solves; build your timeouts accordingly
- CAPTCHA tokens expire in ~2 minutes and are session-bound — there's no token reuse; each page that requires verification needs a fresh solve
- MrScraper's Scraping Browser handles CAPTCHAs transparently at the infrastructure level — no detection code, no solving integration, no token injection; your Playwright pipeline sees a clean page regardless of what's behind it
FAQ
- Why do I keep getting CAPTCHAs even after switching proxies? Switching proxies changes your IP — but not your browser fingerprint, your request velocity, or your session behavior. If your canvas fingerprint, WebGL renderer, and navigator properties are identical across every session, detection systems correlate your requests even across different IPs. Address the full fingerprint stack, not just the IP.
- Can I solve reCAPTCHA v3 with a solving service? Not directly — v3 has no visible challenge to submit. Solving services that claim v3 support typically generate a token using a pre-solved browser session, but the reliability is lower than v2. The more practical approach for v3 is prevention: residential proxies, clean fingerprints, and human-like behavioral patterns to keep your risk score high enough.
- How do I find the reCAPTCHA site key for the 2captcha API? Open Chrome DevTools → Elements tab → search for `data-sitekey`. It's an attribute on the `.g-recaptcha` div. Alternatively, search the page source for `sitekey`. For reCAPTCHA v3 loaded via script, look for the key in the `grecaptcha.execute()` call in the page's JavaScript.
- Does MrScraper handle Cloudflare Turnstile? Yes — MrScraper's Scraping Browser handles Cloudflare Turnstile as part of its anti-bot infrastructure. Turnstile is one of the harder CAPTCHAs to bypass with DIY tooling because it relies heavily on browser fingerprinting rather than just image recognition. Infrastructure-level handling is the most reliable approach.
- What's the difference between hCaptcha and reCAPTCHA for solving purposes? Functionally similar from a scraper's perspective — both use image challenges and return a token you inject into the page. Most solving services support both. The API calls differ slightly (`method=userrecaptcha` vs `method=hcaptcha`), but the token injection and solve flow are nearly identical. hCaptcha is slightly faster to solve on average because its challenges tend to be simpler image grids.