How to Scrape Product Reviews at Scale Without Getting Rate-Limited

Product reviews are some of the most valuable data on the internet. Sentiment analysis, competitor benchmarking, pricing intelligence, review monitoring — the use cases are real and the data is publicly visible. The problem: the sites hosting that data really don't want you collecting it at scale. Amazon, Walmart, Best Buy, Google Shopping — they all run serious anti-bot infrastructure that makes bulk review scraping one of the harder data collection challenges you'll face.

The rate-limiting and blocking you're hitting isn't random. It's a deliberate response to traffic patterns that look automated. The solution isn't to scrape faster or smarter — it's to scrape indistinguishably from a real user: the right IP type, the right request timing, the right browser identity, and a pipeline that detects and recovers from rate limits automatically.

Here's the complete approach: use residential proxies to pass IP reputation checks, rotate browser fingerprints to defeat fingerprint correlation, pace requests with human-like variability, build rate-limit detection into every request loop, and for the highest-protection targets, use a managed scraping browser that handles all of this at the infrastructure level. Let's build that pipeline step by step.

Why Product Review Sites Rate-Limit Scrapers

Before building the solution, understand exactly what's triggering the rate limits. Review platforms enforce multiple independent layers:

IP velocity detection — Too many requests from the same IP in a short window triggers a soft block (CAPTCHA) or hard block (403). Amazon's limits are notoriously aggressive: research on Amazon bot detection suggests that even 5–10 requests per minute from the same IP can trigger rate limiting on product pages.

Session pattern analysis — Real users don't navigate directly to 500 review pages in sequence. They come from search, browse around, maybe add something to cart. A session that hits only review endpoints with zero variation looks automated regardless of IP type.

Browser fingerprint consistency — If the same canvas hash, WebGL renderer, and navigator properties appear across thousands of requests, it's not 1,000 different users — it's one bot with a proxy list. Modern platforms fingerprint at the session level, not just the IP level.

Review-endpoint rate limits — Many platforms implement endpoint-specific rate limits stricter than their general page limits. Review pagination endpoints (/reviews?page=N) and review API calls often have lower thresholds than product landing pages because they're higher-value data targets.

CAPTCHA gating — Many review sites now gate review access behind invisible behavioral challenges (reCAPTCHA v3, Cloudflare Turnstile) that score your session before serving content. A low score means a CAPTCHA challenge or an outright block.
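
To make the IP velocity layer concrete, here is a minimal client-side sketch that keeps one proxy session under a fixed number of requests per rolling window. The class name SlidingWindowThrottle and the budget of 5 requests per 60 seconds are assumptions for illustration (the budget comes from the low end of the figure quoted above, not from any published limit).

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowThrottle:
    """Caps how many requests one proxy session sends within a rolling time window."""

    def __init__(
        self,
        max_requests: int = 5,          # assumed budget, tune per target
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._sent: Deque[float] = deque()  # timestamps of recent requests

    def seconds_until_allowed(self) -> float:
        """Return 0.0 if a request may go out now, otherwise how long to wait."""
        now = self.clock()
        # Drop timestamps that have aged out of the window
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        # The oldest request in the window decides when a slot frees up
        return self.window_seconds - (now - self._sent[0])

    def record(self) -> None:
        """Call right after a request is actually sent."""
        self._sent.append(self.clock())
```

Call seconds_until_allowed() before each request, sleep for that long if it is non-zero, and call record() after sending. Keep one throttle per proxy session, since the limit being modeled is per IP.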

Step-by-Step Guide: Scraping Product Reviews at Scale

Step 1: Map Your Target's Review Structure

Before writing any scraping code, spend 20 minutes in Chrome DevTools understanding how your target serves review data. This determines whether you can scrape efficiently or need a full browser for every request.

Open DevTools → Network tab → filter by Fetch/XHR → navigate through a product's review pages → look for API calls.

Many major platforms (Amazon, Best Buy, Walmart) have a review API that the frontend calls in the background. If you find it, you can call it directly with requests — no browser rendering needed, much faster, lower cost.

import requests
from typing import Optional

def check_for_review_api(product_url: str) -> None:
    """
    Print a manual inspection checklist for a product URL. Makes no network calls.
    Run this, then open the URL with your browser's DevTools Network tab alongside it.
    """
    print(f"Inspect this URL in DevTools → Network → Fetch/XHR: {product_url}")
    print("Navigate through review pages and look for calls like:")
    print("  /api/reviews?productId=...")
    print("  /reviews.json?asin=...")
    print("  /product/reviews/[product-id]")
    print("\nIf you find one, you can call it directly, with no browser needed.")

# Once you've identified the API endpoint from DevTools:
def call_review_api_directly(product_id: str, page: int = 1, proxies: Optional[dict] = None) -> dict:
    """
    Calling the review API directly is usually much faster than scraping rendered HTML.
    Only works if you've confirmed the endpoint is accessible without auth.
    The URL, parameters, and headers below are placeholders; copy the real ones from DevTools.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Language": "en-US,en;q=0.9",
        "X-Requested-With": "XMLHttpRequest",  # Many APIs require this header
        "Referer": f"https://example-shop.com/product/{product_id}",
    }

    response = requests.get(
        f"https://example-shop.com/api/reviews/{product_id}",
        params={"page": page, "pageSize": 50, "sortBy": "recent"},
        headers=headers,
        proxies=proxies,
        timeout=15,
    )

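    if response.status_code != 200:
        return {}
    try:
        return response.json()
    except ValueError:
        # A 200 with a non-JSON body is usually a challenge page, not review data
        return {}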

If a clean API endpoint exists, your pipeline just got dramatically simpler and cheaper. No browser rendering, no CAPTCHA, no JavaScript execution. If there's no API and content only loads via browser — proceed to Step 2.
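
Once an endpoint is confirmed, paging through it is a small loop. The sketch below is one way to do it, assuming the payload keeps its review list under a "reviews" key and that an empty page means you are past the end. Both are assumptions to verify against the real response, and iter_api_reviews is a hypothetical helper name.

```python
from typing import Callable, Iterator


def iter_api_reviews(
    fetch_page: Callable[[int], dict],
    max_pages: int = 20,
    items_key: str = "reviews",  # assumed field name; check the real payload
) -> Iterator[dict]:
    """Yield reviews page by page until a page comes back empty or the cap is hit."""
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        items = payload.get(items_key, [])
        if not items:
            # Empty dict (failed request) or empty list (past the last page)
            break
        yield from items
```

With the function above, usage would look like iter_api_reviews(lambda p: call_review_api_directly("123", page=p)). Passing the fetcher in as a callable keeps the paging logic testable without network access.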

Step 2: Set Up Residential Proxy Rotation

Requests to review platforms from datacenter IPs tend to get blocked almost immediately. Use residential proxies from the start, not as a fix when things break.

import requests
import random
import time
from typing import Optional

class ReviewScraperProxyManager:
    """Manages residential proxy rotation for review scraping."""

    def __init__(
        self,
        proxy_host: str,
        proxy_port: int,
        username: str,
        password: str,
        country: str = "US",
        max_requests_per_session: int = 8,
    ):
        self.host = proxy_host
        self.port = proxy_port
        self.username = username
        self.password = password
        self.country = country
        self.max_per_session = max_requests_per_session

        self.session_count = 0
        self.request_count = 0
        self.current_session = self._new_session_id()

    def _new_session_id(self) -> str:
        self.session_count += 1
        return f"rev-{self.session_count}-{random.randint(1000, 9999)}"

    def get_proxy(self) -> dict:
        """Get current proxy config, rotating if session is exhausted."""
        if self.request_count >= self.max_per_session:
            self.current_session = self._new_session_id()
            self.request_count = 0

        self.request_count += 1
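        # This username syntax is provider-specific; check your provider's docs for session and country flags
        user = f"{self.username}-session-{self.current_session}-country-{self.country}"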
        url = f"http://{user}:{self.password}@{self.host}:{self.port}"
        return {"http": url, "https": url}

    def force_rotate(self) -> None:
        """Immediately rotate on block signals — don't wait for quota."""
        self.current_session = self._new_session_id()
        self.request_count = 0

proxy_manager = ReviewScraperProxyManager(
    proxy_host="residential-proxy.provider.com",
    proxy_port=8080,
    username="your_username",
    password="your_password",
    country="US",
    max_requests_per_session=8,  # Conservative for review endpoints
)

Eight requests per session before rotating is conservative for review endpoints — which are often more aggressively rate-limited than product landing pages. Start conservative and loosen if your block rate is low.
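
"Loosen if your block rate is low" needs a number to act on. Here is a minimal sketch of tracking block rate over recent requests and nudging the session quota accordingly. BlockRateTracker, suggest_session_quota, and the 2% / 10% thresholds are illustrative assumptions, not measured values.

```python
from collections import deque


class BlockRateTracker:
    """Tracks the share of blocked responses over the last `window` requests."""

    def __init__(self, window: int = 200):
        self._outcomes = deque(maxlen=window)  # True means the request was blocked

    def record(self, blocked: bool) -> None:
        self._outcomes.append(blocked)

    @property
    def block_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)


def suggest_session_quota(
    current: int,
    block_rate: float,
    low: float = 0.02,    # below this, try one more request per session
    high: float = 0.10,   # above this, back off by one
    floor: int = 3,
    ceiling: int = 25,
) -> int:
    """Nudge max_requests_per_session up or down by one based on observed block rate."""
    if block_rate > high:
        return max(floor, current - 1)
    if block_rate < low:
        return min(ceiling, current + 1)
    return current
```

Record True whenever a request ends in a 403, 429, or failed content check, then periodically set proxy_manager.max_per_session to the suggested value. Moving one step at a time keeps a single bad batch from swinging the quota.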

Step 3: Build the Rate-Limit Aware Request Function

Every request in your pipeline needs to handle rate-limit signals — and respond correctly:

import requests
import time
import random
import logging
from typing import Optional

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)

REVIEW_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # requests only decodes Brotli ("br") when the brotli package is installed,
    # so advertise it only if you have that dependency
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

def fetch_review_page(
    url: str,
    proxy_manager: ReviewScraperProxyManager,
    max_retries: int = 4,
) -> Optional[str]:
    """
    Fetch a review page with automatic proxy rotation and rate-limit handling.
    Returns HTML content or None if all retries fail.
    """
    for attempt in range(max_retries):
        try:
            response = requests.get(
                url,
                headers=REVIEW_HEADERS,
                proxies=proxy_manager.get_proxy(),
                timeout=20,
                allow_redirects=True,
            )

            # Success
            if response.status_code == 200:
                # Quick sanity check — make sure we got real content, not a block page
                if is_valid_review_page(response.text):
                    return response.text
                else:
                    logger.warning(f"Got 200 but content looks like a block page — rotating proxy")
                    proxy_manager.force_rotate()
                    wait = random.uniform(5.0, 12.0)

            # Rate limited: back off and rotate
            elif response.status_code == 429:
                proxy_manager.force_rotate()
                # Retry-After can be a number of seconds or an HTTP date; fall back to 10s for dates
                try:
                    retry_after = int(response.headers.get("Retry-After", 10))
                except ValueError:
                    retry_after = 10
                wait = max(retry_after, random.uniform(10.0, 20.0))
                logger.warning(f"429 rate limit. Waiting {wait:.0f}s (attempt {attempt + 1}/{max_retries})")

            # Blocked or CAPTCHA
            elif response.status_code in (403, 503):
                proxy_manager.force_rotate()
                wait = (2 ** attempt) * random.uniform(3.0, 6.0)
                logger.warning(f"{response.status_code} block signal. Waiting {wait:.0f}s")

            # Not found — don't retry
            elif response.status_code == 404:
                logger.info(f"404 on {url} — product may not exist")
                return None

            else:
                wait = random.uniform(3.0, 7.0)
                logger.warning(f"Unexpected {response.status_code} on {url}")

            time.sleep(wait)

        except requests.exceptions.ProxyError:
            proxy_manager.force_rotate()
            logger.warning(f"Proxy error on attempt {attempt + 1}")
            time.sleep(random.uniform(2.0, 5.0))

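        except requests.exceptions.Timeout:
            logger.warning(f"Timeout on attempt {attempt + 1}")
            time.sleep(random.uniform(3.0, 8.0))

        except requests.exceptions.RequestException as exc:
            # Connection resets, SSL errors, and similar: rotate and retry instead of crashing
            proxy_manager.force_rotate()
            logger.warning(f"Request error on attempt {attempt + 1}: {exc}")
            time.sleep(random.uniform(3.0, 8.0))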

    logger.error(f"All {max_retries} attempts failed for: {url}")
    return None

def is_valid_review_page(html: str) -> bool:
    """
    Verify that the response is real review content, not a CAPTCHA or block page.
    Customize these signals for your specific target site.
    """
    block_signals = [
        "captcha",
        "not a robot",    # a bare "robot" would also match <meta name="robots"> on normal pages
        "unusual traffic",
        "verify you are human",
        "access denied",
        "cf-challenge",   # Cloudflare challenge
        "ray id",         # Cloudflare block page
    ]
    html_lower = html.lower()
    return not any(signal in html_lower for signal in block_signals)

The is_valid_review_page() function is easy to overlook but critical. Many anti-bot systems return HTTP 200 with a challenge page instead of a proper error code — your scraper happily accepts the "success" and stores garbage. Always validate content, not just status codes.
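
Block phrases alone can misfire in both directions, so it can help to also check for something that only a real review page contains. The sketch below is one possible complement to is_valid_review_page(): it looks for positive markers first, then block phrases. PageStatus, classify_page, and the marker strings are hypothetical and need to be adapted to your target's markup.

```python
from enum import Enum


class PageStatus(Enum):
    OK = "ok"            # review markup is present
    BLOCKED = "blocked"  # looks like a challenge or block page
    EMPTY = "empty"      # neither: possibly JS-rendered or a layout change


# Phrases typical of challenge pages; tune per target
BLOCK_PHRASES = ("verify you are human", "unusual traffic", "access denied")
# Markup that only appears when reviews rendered (assumed attribute/class names)
REVIEW_MARKERS = ('data-hook="review"', 'class="review-item"')


def classify_page(html: str) -> PageStatus:
    lowered = html.lower()
    # Positive evidence wins, so a review that mentions "captcha" is not misread as a block
    if any(marker in lowered for marker in REVIEW_MARKERS):
        return PageStatus.OK
    if any(phrase in lowered for phrase in BLOCK_PHRASES):
        return PageStatus.BLOCKED
    return PageStatus.EMPTY
```

Separating EMPTY from BLOCKED is a design choice: a blocked page calls for proxy rotation, while an empty one usually means the reviews load via JavaScript and retrying with a new IP will not help.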

Step 4: Implement Human-Like Pacing

This is where most review scrapers fail. They add a time.sleep(2) and think they're done. Fixed delays are one of the strongest bot signals — real users are deeply inconsistent.

import asyncio
import random
from dataclasses import dataclass

@dataclass
class PacingConfig:
    base_min: float = 3.0       # Minimum delay between requests (seconds)
    base_max: float = 7.0       # Maximum delay between requests
    long_pause_prob: float = 0.12  # 12% chance of a much longer pause
    long_pause_min: float = 15.0   # Minimum long pause
    long_pause_max: float = 35.0   # Maximum long pause
    burst_break_every: int = 20    # Take an extended break every N requests
    burst_break_min: float = 45.0  # Minimum extended break
    burst_break_max: float = 90.0  # Maximum extended break

async def human_paced_delay(request_num: int, config: PacingConfig) -> None:
    """Generate a human-like delay pattern between review page requests."""

    # Extended break every N requests — real users don't browse indefinitely
    if request_num > 0 and request_num % config.burst_break_every == 0:
        pause = random.uniform(config.burst_break_min, config.burst_break_max)
        logger.info(f"Extended break after {request_num} requests ({pause:.0f}s)")
        await asyncio.sleep(pause)
        return

    # Occasional longer pause — reading a review, switching tabs, etc.
    if random.random() < config.long_pause_prob:
        pause = random.uniform(config.long_pause_min, config.long_pause_max)
        await asyncio.sleep(pause)
        return

    # Standard variable delay
    pause = random.uniform(config.base_min, config.base_max)
    await asyncio.sleep(pause)

async def scrape_product_reviews(
    product_ids: list[str],
    base_url: str,
    proxy_manager: ReviewScraperProxyManager,
    pacing: PacingConfig = PacingConfig(),
) -> list[dict]:
    """
    Scrape reviews for a list of products with human-like pacing.
    """
    all_reviews = []
    total_requests = 0

    for product_id in product_ids:
        product_reviews = []
        page = 1
        max_pages = 20  # Cap to avoid runaway loops on products with thousands of reviews

        while page <= max_pages:
            review_url = f"{base_url}/product/{product_id}/reviews?page={page}"
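            # fetch_review_page is blocking (requests + time.sleep), so run it in a worker thread
            html = await asyncio.to_thread(fetch_review_page, review_url, proxy_manager)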

            if html is None:
                logger.warning(f"Failed to fetch {product_id} page {page} — skipping")
                break

            # Parse reviews from HTML here (site-specific logic)
            page_reviews = parse_reviews_from_html(html, product_id, page)
            product_reviews.extend(page_reviews)

            logger.info(f"Product {product_id}: page {page} → {len(page_reviews)} reviews")

            # If we got fewer reviews than expected, we've hit the last page
            if len(page_reviews) < 10:  # Adjust based on your target's page size
                break

            page += 1
            total_requests += 1
            await human_paced_delay(total_requests, pacing)

        all_reviews.extend(product_reviews)
        logger.info(f"Completed {product_id}: {len(product_reviews)} total reviews")

        # Inter-product pause — simulate time between searching for new products
        await asyncio.sleep(random.uniform(5.0, 12.0))

    return all_reviews

import re
from bs4 import BeautifulSoup

def parse_reviews_from_html(html: str, product_id: str, page: int) -> list[dict]:
    """
    Site-specific review parsing: customize selectors for your target.
    """
    soup = BeautifulSoup(html, "html.parser")
    reviews = []

    for review_el in soup.select(".review-item, .customer-review, [data-hook='review']"):
        review = {
            "product_id": product_id,
            "page": page,
            "rating": None,
            "title": None,
            "body": None,
            "date": None,
            "verified": False,
        }

        # Rating
        rating_el = review_el.select_one("[class*='rating'], [aria-label*='stars'], .star-rating")
        if rating_el:
            # Prefer the aria-label ("4.5 out of 5 stars"); fall back to the visible text
            rating_text = rating_el.get("aria-label") or rating_el.get_text(" ", strip=True)
            # Require an explicit "out of 5" or "/5" so "4.5 stars" is not misread as 4
            rating_match = re.search(r"(\d+(?:\.\d+)?)\s*(?:out of|/)\s*5", rating_text)
            review["rating"] = float(rating_match.group(1)) if rating_match else None

        # Title
        title_el = review_el.select_one(".review-title, [data-hook='review-title'], h3")
        review["title"] = title_el.get_text(strip=True) if title_el else None

        # Body
        body_el = review_el.select_one(".review-text, [data-hook='review-body'], .review-content")
        review["body"] = body_el.get_text(strip=True) if body_el else None

        # Date
        date_el = review_el.select_one(".review-date, [data-hook='review-date'], time")
        review["date"] = date_el.get_text(strip=True) if date_el else None

        # Verified purchase
        verified_el = review_el.select_one("[data-hook='avp-badge'], .verified-purchase")
        review["verified"] = verified_el is not None

        if review["body"]:  # Only include reviews with actual text
            reviews.append(review)

    return reviews
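
Before committing to these pacing numbers, it is worth working out what throughput they imply. The sketch below computes the mean delay per request for the scheme above, using the same defaults as PacingConfig. It covers only the deliberate delays, so fetch time and the inter-product pause come on top; expected_delay_seconds is a hypothetical helper name.

```python
def expected_delay_seconds(
    base_min: float = 3.0,
    base_max: float = 7.0,
    long_pause_prob: float = 0.12,
    long_pause_min: float = 15.0,
    long_pause_max: float = 35.0,
    burst_break_every: int = 20,
    burst_break_min: float = 45.0,
    burst_break_max: float = 90.0,
) -> float:
    """Mean delay per request; a uniform draw averages to the midpoint of its range."""
    base_mean = (base_min + base_max) / 2
    long_mean = (long_pause_min + long_pause_max) / 2
    burst_mean = (burst_break_min + burst_break_max) / 2

    # Every Nth request takes the extended break instead of a regular delay
    burst_share = 1 / burst_break_every
    # All other requests: long pause with probability p, base delay otherwise
    regular_mean = long_pause_prob * long_mean + (1 - long_pause_prob) * base_mean
    return burst_share * burst_mean + (1 - burst_share) * regular_mean


mean_delay = expected_delay_seconds()
print(f"{mean_delay:.1f}s per request, about {3600 / mean_delay:.0f} requests/hour per worker")
```

With the defaults this works out to roughly 10.4 seconds per request, or about 346 requests per hour for a single worker, which is the figure to plan capacity around rather than the 3 to 7 second base delay.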

Step 5: Handle JavaScript-Rendered Review Content

Many modern review sections load dynamically — the initial HTML has a skeleton, and reviews populate via an XHR call triggered by JavaScript. If your HTML scraper keeps returning empty review lists, this is why.

For JavaScript-rendered review content, you need a real browser. Here's how to scrape reviews that only appear after JavaScript runs:

from playwright.async_api import async_playwright
import asyncio
import random

async def scrape_js_rendered_reviews(product_url: str, proxy_config: dict) -> list[dict]:
    """
    For review sections that only appear after JavaScript renders them.
    Uses Playwright with residential proxy to bypass detection.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=proxy_config,  # Playwright format: {"server": "http://host:port", "username": "...", "password": "..."}
        )

        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            viewport={"width": 1440, "height": 900},
            locale="en-US",
            timezone_id="America/New_York",
        )

        page = await context.new_page()

        # Remove webdriver signal before anti-bot scripts execute
        await page.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', { get: () => undefined });"
        )

        # Visit the product page first — warm up the session
        await page.goto(product_url, wait_until="domcontentloaded")
        await asyncio.sleep(random.uniform(2.0, 4.0))

        # Scroll to simulate reading the product description
        await page.evaluate("window.scrollBy(0, window.innerHeight * 0.5)")
        await asyncio.sleep(random.uniform(1.0, 2.5))

        # Wait for reviews section to load
        try:
            await page.wait_for_selector(".review-item, [data-hook='review']", timeout=10000)
        except Exception:
            await browser.close()
            return []

        # Extract rendered reviews
        reviews = await page.eval_on_selector_all(
            ".review-item, [data-hook='review']",
            """els => els.map(el => ({
                rating: el.querySelector("[aria-label*='stars']")?.getAttribute("aria-label") || null,
                title: el.querySelector(".review-title, [data-hook='review-title']")?.textContent.trim() || null,
                body: el.querySelector(".review-text, [data-hook='review-body']")?.textContent.trim() || null,
                date: el.querySelector(".review-date, [data-hook='review-date']")?.textContent.trim() || null,
                verified: !!el.querySelector("[data-hook='avp-badge'], .verified-purchase"),
            }))"""
        )

        await browser.close()
        return [r for r in reviews if r.get("body")]  # Filter empty reviews

Step 6: Use MrScraper for Protected Review Sites at Scale

For the hardest targets — Amazon, Walmart, major review platforms with serious anti-bot infrastructure — the DIY approach above works but requires constant maintenance as detection systems update. This is where MrScraper's managed infrastructure takes over.

Connect your existing Playwright scraper to MrScraper's Scraping Browser and let the infrastructure handle residential proxies, fingerprinting, and CAPTCHA solving automatically:

from playwright.async_api import async_playwright
import asyncio
import random

async def scrape_protected_reviews_at_scale(product_urls: list[str]) -> list[dict]:
    """
    For protected review sites — MrScraper handles proxies, fingerprinting, CAPTCHAs.
    Your extraction logic is unchanged from the local Playwright version.
    """
    all_reviews = []

    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(
            "wss://browser.mrscraper.com?token=YOUR_API_TOKEN"
        )

        for product_url in product_urls:
            page = await browser.new_page()

            await page.goto(product_url, wait_until="domcontentloaded")
            await asyncio.sleep(2)  # Brief wait for dynamic content

            try:
                await page.wait_for_selector("[data-hook='review'], .review-item", timeout=12000)

                reviews = await page.eval_on_selector_all(
                    "[data-hook='review'], .review-item",
                    """els => els.map(el => ({
                        rating: el.querySelector("[aria-label*='stars']")?.getAttribute("aria-label"),
                        title: el.querySelector("[data-hook='review-title']")?.textContent.trim(),
                        body: el.querySelector("[data-hook='review-body']")?.textContent.trim(),
                        date: el.querySelector("[data-hook='review-date']")?.textContent.trim(),
                        verified: !!el.querySelector("[data-hook='avp-badge']"),
                    }))"""
                )

                all_reviews.extend(r for r in reviews if r.get("body"))
                print(f"Extracted {len(reviews)} reviews from {product_url}")

            except Exception as e:
                print(f"Failed on {product_url}: {e}")

            finally:
                await page.close()

            await asyncio.sleep(random.uniform(3.0, 6.0))

        await browser.close()

    return all_reviews

asyncio.run(scrape_protected_reviews_at_scale([
    "https://protected-shop.com/product/123/reviews",
    "https://protected-shop.com/product/456/reviews",
]))

Or use MrScraper's AI extraction SDK to skip the selector work entirely — describe what you want in plain English:

import asyncio
from mrscraper import MrScraperClient

async def ai_extract_reviews():
    client = MrScraperClient(token="YOUR_MRSCRAPER_API_TOKEN")

    result = await client.create_scraper(
        url="https://protected-shop.com/product/123/reviews",
        message="Extract all customer reviews with their star rating, title, review text, date, and whether it's a verified purchase",
        agent="listing",    # Reviews are repeated items — use listing agent
        proxy_country="US",
    )

    print("Review extraction job:", result["data"]["data"]["id"])

asyncio.run(ai_extract_reviews())

No selectors. No proxy management. No rate-limit handling code. You describe the data you want, and MrScraper handles the infrastructure — including JavaScript rendering, proxy rotation, and CAPTCHA solving — and returns structured review data.

Common Challenges and Limitations

Review content behind login walls. Some platforms only show full review content to authenticated users. Authenticated scraping requires session management (storing cookies after login) and significantly complicates the pipeline. Always check whether reviews are accessible without login before building an authenticated flow.

Review pagination with AJAX loading. Some sites load reviews via "Load More" buttons rather than traditional pagination URLs. You can't construct /reviews?page=2 — you have to click the button. Handle this in Playwright with page.click(".load-more-reviews") followed by wait_for_selector() for the new content.
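
A "Load More" loop might look like the sketch below. It takes a Playwright async Page and uses locator(), count(), and click(); it compares review counts before and after each click instead of calling wait_for_selector(), because that call resolves as soon as any matching element is visible, which is already true after the first batch. The selectors and the settle_seconds pause are assumptions for a hypothetical target.

```python
import asyncio


async def click_load_more_until_done(
    page,                                     # a Playwright async Page
    button_selector: str = ".load-more-reviews",
    item_selector: str = ".review-item",
    max_clicks: int = 30,
    settle_seconds: float = 1.5,              # assumed time for the XHR to land
) -> int:
    """Click "Load More" until the button disappears or stops adding reviews.

    Returns the number of review elements present at the end.
    """
    items = page.locator(item_selector)
    button = page.locator(button_selector)
    count = await items.count()

    for _ in range(max_clicks):
        if await button.count() == 0:
            break  # no button left: everything is loaded
        await button.first.click()
        await asyncio.sleep(settle_seconds)
        new_count = await items.count()
        if new_count <= count:
            break  # the click added nothing; stop rather than loop forever
        count = new_count
    return count
```

Run it after the initial review section has rendered and before extraction. The max_clicks cap plays the same role as max_pages in the paginated loop.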

Review count inconsistency. Platforms often display a total review count that doesn't match the actual number of pages available to scrape. Amazon, for example, has limited public review access to 10 pages (about 100 reviews) regardless of how many thousands a product has. Caps like this change over time, so confirm your target's current limit before planning data volume expectations.
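
The cap turns into simple arithmetic when planning request volume. A minimal sketch, where pages_to_request is a hypothetical helper and the cap value is whatever you have confirmed for your target:

```python
import math
from typing import Optional


def pages_to_request(total_reviews: int, page_size: int, page_cap: Optional[int] = None) -> int:
    """Pages worth requesting: enough to cover all reviews, but never past the platform's cap."""
    pages = math.ceil(total_reviews / page_size)
    return pages if page_cap is None else min(pages, page_cap)
```

For a product showing 4,312 reviews at 10 per page with a 10-page cap, this returns 10, so the usable volume is about 100 reviews, not 4,312.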

Identical fingerprints across sessions. Many scraping pipelines rotate IPs but forget to rotate browser fingerprints. If the same canvas hash appears across thousands of sessions on different IPs, the fingerprint correlation still identifies you. Rotate User-Agent, viewport, and screen resolution alongside each proxy rotation.
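
One way to rotate these attributes is to pick a whole profile per session, so the user agent, viewport, and screen size stay plausible together. The profiles below are illustrative assumptions, and pick_context_options is a hypothetical helper. Note that this only varies the surface attributes; canvas and WebGL output depend on the rendering stack and are not changed by it.

```python
import copy
import random
from typing import Optional

# Illustrative profiles only; keep real ones current and internally consistent
PROFILES = [
    {
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "viewport": {"width": 1440, "height": 900},
        "screen": {"width": 1920, "height": 1080},
    },
    {
        "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "viewport": {"width": 1280, "height": 800},
        "screen": {"width": 1440, "height": 900},
    },
    {
        "user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "viewport": {"width": 1366, "height": 768},
        "screen": {"width": 1366, "height": 768},
    },
]


def pick_context_options(rng: Optional[random.Random] = None) -> dict:
    """Pick one whole profile so its fields stay mutually consistent."""
    rng = rng or random.Random()
    # Deep copy so callers can tweak the result without mutating PROFILES
    return copy.deepcopy(rng.choice(PROFILES))
```

The returned keys match Playwright's new_context() options, so a new session could be opened with browser.new_context(**pick_context_options(), locale="en-US") each time the proxy rotates.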

Respecting review data licensing. Review text, star ratings, and user-generated content may be subject to copyright or platform Terms of Service restrictions on commercial use. Scraping for personal research differs from scraping for a commercial product. Review your target's ToS and consult legal counsel for any commercial use case involving user-generated content.

Conclusion

Scraping product reviews at scale is a multi-layer problem — IP reputation, browser fingerprinting, request pacing, rate-limit detection, and content validation all need to work together. Missing any one of them produces a pipeline that works in testing and breaks in production.

The approach that holds up: residential proxies with aggressive session rotation, variable request pacing with long-pause probability built in, content validation on every response, and a Playwright layer for JavaScript-rendered review sections. For the highest-protection targets like Amazon or major retail platforms, connecting to MrScraper's Scraping Browser handles the fingerprinting, proxy rotation, and CAPTCHA layers at the infrastructure level — so your code focuses on extracting reviews, not on staying undetected.

Start with the API interception check: it's the fastest path to review data when it works. Fall back to the full browser pipeline when it doesn't. And when the target is protected enough to require managed infrastructure, a one-line change is all it takes.

What We Learned

Check for a review API first — many platforms (Amazon, Walmart, Best Buy) load reviews via background XHR calls that you can call directly with requests, skipping browser rendering entirely and dramatically simplifying your pipeline
Content validation is as important as status code checking — anti-bot systems frequently return HTTP 200 with a challenge page; is_valid_review_page() should be part of every response handler, not an optional check
Fixed delays are one of the strongest bot signals — use random.uniform() for base delays, a 12% probability long-pause, and an extended break every 20 requests to produce timing patterns that match real browsing behavior
Rotate proxies and fingerprints together — IP rotation without fingerprint rotation allows correlation via consistent canvas hashes, WebGL renderer strings, and navigator properties across different IPs
Platform-specific review caps are real constraints — Amazon limits public review access to 10 pages regardless of total count; know your target's caps before planning volume expectations, not after
MrScraper's connect_over_cdp() is a single-line upgrade for protected targets — your selectors, wait conditions, and extraction logic stay unchanged; the infrastructure underneath upgrades to handle residential proxies, fingerprinting, and CAPTCHA solving automatically

FAQ

Why do I get reviews for the first few pages but nothing after page 3? Most likely a session-level rate limit, not an IP ban. The platform is tracking your session, and after 2–3 pages in rapid succession, the session gets soft-blocked. Add longer inter-page delays (10–20 seconds), start a fresh browser session (new proxy + new cookies) for each product, and consider requesting pages in a randomized order (for example 1, 3, 5) rather than strictly sequentially, since sequential pagination is a clear automated signal.
Can I scrape Amazon reviews without a browser? Sometimes — for certain products, Amazon's internal review API (which the review page itself calls) is accessible directly via requests with proper headers and cookies. But Amazon's API changes frequently, requires valid session cookies, and varies by product category. The browser-based approach via Playwright is more reliable long-term, even if it's slower.
How do I handle "Verified Purchase" filtering in my scraper? Most review pages expose verified status as a data attribute or specific class on the review element. In your parser, look for elements like [data-hook='avp-badge'] (Amazon) or .verified-purchase and set a boolean field. If you want to filter to verified-only reviews during scraping, look for a URL parameter that does it, such as reviewerType=avp_only_reviews on Amazon review URLs (filterByStar there selects a star rating, not verified status), or the equivalent on your target site.
What's the best way to store scraped reviews at scale? For large-scale review datasets, write to a database directly from your scraping loop rather than accumulating everything in memory. PostgreSQL handles structured review data well. Add a scraped_at timestamp and source_url field to every record so you can identify staleness and data provenance. For very high volume, consider a queue (Redis, SQS) between your scraper and database writer to decouple collection from storage.
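
As a rough illustration of writing from the scraping loop with scraped_at and source_url on every record, here is a sketch that uses Python's built-in sqlite3 as a stand-in for PostgreSQL. The table layout, review_key, and save_reviews are assumptions for illustration; in PostgreSQL the equivalent of INSERT OR IGNORE is ON CONFLICT DO NOTHING.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS reviews (
    review_key  TEXT PRIMARY KEY,
    product_id  TEXT NOT NULL,
    rating      REAL,
    title       TEXT,
    body        TEXT NOT NULL,
    review_date TEXT,
    verified    INTEGER NOT NULL,
    source_url  TEXT NOT NULL,
    scraped_at  TEXT NOT NULL
)
"""


def review_key(review: dict) -> str:
    """Stable key so re-scraping the same review doesn't create duplicates."""
    raw = "|".join(str(review.get(f)) for f in ("product_id", "date", "title", "body"))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def save_reviews(conn: sqlite3.Connection, reviews: list[dict], source_url: str) -> int:
    """Insert reviews, skipping ones already stored. Returns the number of new rows."""
    scraped_at = datetime.now(timezone.utc).isoformat()
    rows = [
        (review_key(r), r["product_id"], r.get("rating"), r.get("title"), r["body"],
         r.get("date"), int(bool(r.get("verified"))), source_url, scraped_at)
        for r in reviews
    ]
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO reviews VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
        )
    return conn.total_changes - before
```

Calling save_reviews() once per fetched page keeps memory flat and makes reruns idempotent, since a review that was already stored is skipped rather than duplicated.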
Does MrScraper's AI extraction work for reviews with complex nested structures? Yes — the AI extraction reads the page semantically rather than targeting specific HTML elements. For a prompt like "extract all customer reviews with star rating, title, review text, date, and verified status," the AI identifies these fields from context regardless of how they're nested in the DOM. This is particularly useful for platforms that change their review HTML structure frequently, since the AI adapts automatically while CSS selector-based scrapers break.