How to Scrape Login-Protected Pages With a Scraping Browser (Step-by-Step Guide)

A concise overview of scraping login-protected websites using Playwright session persistence and automated re-authentication, while highlighting how MrScraper simplifies anti-bot–protected authentication workflows.

You've identified the data you need. It's right there — behind a login form. Could be a B2B platform, a job board that requires an account, a member directory, a dashboard, or an analytics tool. The data is technically accessible to you as a registered user. The problem is making your scraper authenticate and maintain that authenticated state across dozens or hundreds of pages.

Login-protected scraping is one of the more complex challenges in web automation — not because authentication is conceptually hard, but because modern login flows include multi-step forms, CSRF tokens, JavaScript-rendered form submissions, 2FA, and session management that breaks in subtle ways when you're using a headless browser.

The complete approach: perform the login flow in a real browser session (using Playwright), save the resulting browser state (cookies and localStorage), and reuse that saved state for all subsequent requests without logging in again. Note that Playwright's storage_state does not capture sessionStorage, so a site that keeps its token there needs an extra save-and-restore step. For protected platforms where the login itself is guarded by CAPTCHA or bot detection, a cloud-based scraping browser can handle that layer for you. Let's build this step by step.

Why Login-Protected Scraping Is Different

Regular scraping: send a request, get HTML, parse it. Done.

Authenticated scraping adds three new problems:

The authentication problem — You need to perform a login flow before you can access protected content. This might be simple (POST username + password to a form endpoint) or complex (OAuth redirect, JavaScript-rendered form with CSRF tokens, 2FA prompt, social login).

The session maintenance problem — After login, the server gives you a session cookie (or a JWT in localStorage, or a bearer token in sessionStorage). Every subsequent request must include this credential. If it expires or gets invalidated, you get silently redirected back to the login page — often with a 200 status code, meaning your scraper happily collects login page HTML instead of content.

The detection problem — Many platforms add additional bot detection specifically to their login flows. If the login itself fails because of CAPTCHA or fingerprint checks, no amount of session management helps.

Understanding which of these three problems you're dealing with determines which solution to apply.

Step-by-Step Guide: Scraping Login-Protected Pages

Step 1: Understand the Login Mechanism

Before writing any code, spend 10 minutes in Chrome DevTools inspecting how the site handles authentication.

Open DevTools → Network tab → clear the log → submit the login form → watch what happens.

# Quick diagnostic: try a simple POST login first
# If it works, you don't need a browser at all
import requests
from bs4 import BeautifulSoup

def test_simple_post_login(
    login_url: str, credentials: dict
) -> tuple[bool, requests.Session]:
    """
    Test whether the site accepts a direct POST login.
    Many sites still use a simple form POST, so no browser is needed.
    Returns (success, session) so the session can be reused for scraping.
    """
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Referer": login_url,
    })

    # GET the login page first to capture cookies and CSRF tokens
    login_page = session.get(login_url, timeout=15)

    # Copy the credentials so the caller's dict is not modified
    payload = dict(credentials)

    # Look for a hidden CSRF input and include it in the POST body
    soup = BeautifulSoup(login_page.text, "html.parser")
    csrf = soup.find("input", {"name": lambda n: n and "csrf" in n.lower()})
    if csrf:
        payload[csrf["name"]] = csrf.get("value", "")
        print(f"CSRF token found: {csrf['name']}")

    # Attempt POST login
    response = session.post(
        login_url,
        data=payload,
        allow_redirects=True,
        timeout=15,
    )

    # Heuristic success signal: we were redirected away from the login page.
    # Confirm with a content check before trusting it on a real target.
    is_logged_in = (
        "login" not in response.url.lower()
        and response.status_code in (200, 302)
    )
    print(f"Login attempt: {response.status_code} → {response.url}")
    return is_logged_in, session

success, session = test_simple_post_login(
    "https://example.com/login",
    {"email": "your@email.com", "password": "yourpassword"}
)
print("Simple POST login worked:", success)

If this works, you can scrape authenticated pages with just requests.Session() — no browser needed, much faster. If it fails (CSRF validation, JavaScript required, bot detection), proceed to the browser-based approach.

What to look for in DevTools:

  • Does the login submit to a simple form POST endpoint?
  • Is there a JavaScript-rendered form (React/Vue login component)?
  • Does the login redirect through an OAuth flow?
  • Is there a CAPTCHA on the login form?
  • Does the login require a one-time code (2FA)?

Step 2: Perform the Login With Playwright

For JavaScript-rendered logins, OAuth flows, or any site where simple POST doesn't work, Playwright is the right tool — it runs a real browser that handles all of these naturally.

from playwright.async_api import async_playwright
import asyncio
import os

async def login_and_save_state(
    login_url: str,
    username: str,
    password: str,
    state_file: str = "auth_state.json",
) -> bool:
    """
    Log into a site using a real browser and save the authenticated state.
    State file stores cookies + localStorage for reuse (sessionStorage is not included).
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            viewport={"width": 1440, "height": 900},
            locale="en-US",
            timezone_id="America/New_York",
        )
        page = await context.new_page()

        # Remove webdriver signal before any page JS runs
        await page.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', { get: () => undefined });"
        )

        print(f"Navigating to login page: {login_url}")
        await page.goto(login_url, wait_until="domcontentloaded")

        # Wait for the login form to be ready
        await page.wait_for_selector(
            "input[type='email'], input[type='text'], input[name='email'], input[name='username']",
            timeout=10000,
        )

        # Fill credentials with small delays — mimics human typing
        email_field = await page.query_selector(
            "input[type='email'], input[name='email'], input[name='username']"
        )
        await email_field.click()
        await page.keyboard.type(username, delay=80)   # 80ms between keystrokes

        await asyncio.sleep(0.8)  # Brief pause between fields

        password_field = await page.query_selector("input[type='password']")
        await password_field.click()
        await page.keyboard.type(password, delay=60)

        await asyncio.sleep(0.5)

        # Click the submit button
        await page.click(
            "button[type='submit'], input[type='submit'], button:has-text('Sign in'), button:has-text('Log in')"
        )

        # Wait for successful login: URL changes away from the login page
        logged_in = False
        try:
            await page.wait_for_url(
                lambda url: "login" not in url.lower() and "signin" not in url.lower(),
                timeout=15000,
            )
            logged_in = True
            print(f"Login successful, now at: {page.url}")
        except Exception:
            # Still on the login page: report any visible error message
            error_el = await page.query_selector(".error, .alert-danger, [role='alert']")
            error_text = await error_el.text_content() if error_el else "no error message found"
            print(f"Login failed or timed out: {error_text}")

        # Only persist state after a confirmed login, so a failed attempt
        # never overwrites a good state file with an unauthenticated one.
        # storage_state captures cookies and localStorage (not sessionStorage).
        if logged_in:
            await context.storage_state(path=state_file)
            print(f"Browser state saved to: {state_file}")

        await browser.close()
        return logged_in

asyncio.run(login_and_save_state(
    login_url="https://example.com/login",
    username=os.getenv("SITE_USERNAME"),
    password=os.getenv("SITE_PASSWORD"),
))

The keyboard.type(text, delay=80) pattern types each character with an 80 ms gap and fires real key events for every keystroke. This matters on login forms that analyze keystroke timing, because an instantaneous form fill produces no per-key events at all. A fixed delay is still perfectly uniform, so on stricter platforms consider randomizing the gap per character.

The storage_state(path=state_file) call is the key output of this step. It saves the browser's authentication state (cookies, plus localStorage for each origin) to a JSON file. sessionStorage is not part of that file, so a site that keeps its token there needs it saved and restored separately. The state file is what you'll load for every subsequent scraping session.
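
Because storage_state covers cookies and localStorage only, a site that keeps its token in sessionStorage needs one more step. A workaround, adapted from the pattern in Playwright's authentication guide, is to dump sessionStorage yourself after login and replay it through an init script. This is a minimal sketch: the helper name build_session_storage_init_script, the session_storage.json filename, and the example.com hostname are placeholders.

```python
import json


def build_session_storage_init_script(hostname: str, entries: dict) -> str:
    """
    Build a JavaScript snippet that restores sessionStorage entries,
    but only when the page's hostname matches the site they came from.
    """
    # json.dumps yields valid JavaScript literals and takes care of escaping
    return (
        "(() => {\n"
        f"  if (window.location.hostname !== {json.dumps(hostname)}) return;\n"
        f"  const entries = {json.dumps(entries)};\n"
        "  for (const [key, value] of Object.entries(entries)) {\n"
        "    window.sessionStorage.setItem(key, value);\n"
        "  }\n"
        "})();"
    )


# After a successful login, capture sessionStorage next to the state file:
#     raw = await page.evaluate("() => JSON.stringify(sessionStorage)")
#     Path("session_storage.json").write_text(raw)
#
# Before scraping, replay it into the new context:
#     entries = json.loads(Path("session_storage.json").read_text())
#     await context.add_init_script(
#         build_session_storage_init_script("example.com", entries)
#     )
```

The init script runs before page scripts on every navigation in that context, which is why the hostname guard is there: it keeps the token from being written into third-party frames.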

Step 3: Reuse the Saved State for Scraping

With a saved auth_state.json, every subsequent scraping session loads this state instead of logging in again:

from playwright.async_api import async_playwright
import asyncio
from pathlib import Path
from typing import Optional

async def scrape_authenticated_pages(
    urls: list[str],
    state_file: str = "auth_state.json",
) -> Optional[list[dict]]:
    """
    Scrape login-protected pages using saved authentication state.
    Returns None if the session has expired, so the caller can log in again and retry.
    """
    if not Path(state_file).exists():
        raise FileNotFoundError(
            f"No saved auth state found at {state_file}. "
            "Run login_and_save_state() first."
        )

    results = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        # Load saved authentication state — no login needed
        context = await browser.new_context(
            storage_state=state_file,
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            viewport={"width": 1440, "height": 900},
        )

        for url in urls:
            page = await context.new_page()

            try:
                await page.goto(url, wait_until="domcontentloaded")

                # Check if we got redirected to login — session may have expired
                if "login" in page.url.lower() or "signin" in page.url.lower():
                    print(f"Session expired on {url} — need to re-authenticate")
                    await page.close()
                    await browser.close()
                    return None  # Signal to caller: re-login needed

                # Wait for authenticated content
                await page.wait_for_load_state("networkidle")

                # Extract data — customize selectors for your target
                data = await page.eval_on_selector_all(
                    ".data-row, .content-item, tr[data-id]",
                    """els => els.map(el => ({
                        text: el.textContent.trim(),
                        href: el.querySelector("a")?.href || null,
                    }))"""
                )

                results.append({
                    "url": url,
                    "items": data,
                    "count": len(data),
                })
                print(f"Scraped {len(data)} items from {url}")

            except Exception as e:
                print(f"Error on {url}: {e}")
            finally:
                await page.close()

            # Delay between pages — important for authenticated sessions
            # Rapid automated navigation on protected platforms gets flagged
            await asyncio.sleep(3.0)

        await browser.close()

    return results

# Load state and scrape
results = asyncio.run(scrape_authenticated_pages([
    "https://example.com/dashboard/data",
    "https://example.com/dashboard/reports",
    "https://example.com/dashboard/analytics",
]))
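
Since the function above signals an expired session by returning None, the caller has to decide what happens next. One way to handle that is a small wrapper that logs in again and retries. The sketch below takes the scrape and login steps as injected coroutines so it stays independent of any particular site; the name scrape_with_relogin and the max_logins parameter are illustrative.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def scrape_with_relogin(
    scrape: Callable[[], Awaitable[Optional[list]]],
    login: Callable[[], Awaitable[bool]],
    max_logins: int = 2,
) -> list:
    """
    Run scrape(); if it returns None (session expired), call login()
    and try again, up to max_logins re-login attempts.
    """
    for attempt in range(max_logins + 1):
        results = await scrape()
        if results is not None:
            return results
        if attempt == max_logins:
            break
        if not await login():
            raise RuntimeError("Re-login failed; check credentials or selectors")
    raise RuntimeError("Session still invalid after re-login attempts")


# Example wiring with the functions from Steps 2 and 3:
#     results = asyncio.run(scrape_with_relogin(
#         scrape=lambda: scrape_authenticated_pages(urls),
#         login=lambda: login_and_save_state(login_url, username, password),
#     ))
```

Capping the number of re-logins matters: repeated failed logins from automation are exactly the pattern that gets accounts locked.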

Step 4: Build Automatic Session Refresh

Sessions expire. Tokens rotate. Accounts get logged out. A production pipeline needs to detect session expiry and re-authenticate without manual intervention:

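import asyncio
import os
from pathlib import Path
from datetime import datetime, timedelta

from playwright.async_api import async_playwright

# login_and_save_state() is the function defined in Step 2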

class AuthenticatedScraper:
    """
    Manages authenticated scraping with automatic session refresh.
    """

    def __init__(
        self,
        login_url: str,
        username: str,
        password: str,
        state_file: str = "auth_state.json",
        session_max_age_hours: int = 12,
    ):
        self.login_url = login_url
        self.username = username
        self.password = password
        self.state_file = state_file
        self.session_max_age = timedelta(hours=session_max_age_hours)
        self.session_created_at = None

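    def _is_state_fresh(self) -> bool:
        """Check whether saved state is within its max age."""
        state_path = Path(self.state_file)
        if not state_path.exists():
            return False
        # After a process restart there is no in-memory timestamp,
        # so fall back to the state file's modification time.
        created_at = self.session_created_at or datetime.fromtimestamp(
            state_path.stat().st_mtime
        )
        return datetime.now() - created_at < self.session_max_age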

    async def ensure_authenticated(self) -> None:
        """Log in if there is no fresh saved state."""
        if not self._is_state_fresh():
            print("Authenticating...")
            logged_in = await login_and_save_state(
                self.login_url,
                self.username,
                self.password,
                self.state_file,
            )
            if not logged_in:
                raise RuntimeError("Login failed; check credentials or login selectors")
            self.session_created_at = datetime.now()
        else:
            print("Using existing session state.")

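    def _is_login_page(self, url: str) -> bool:
        # Keep these signals narrow: broad substrings such as "auth" or "session"
        # also match ordinary content URLs like /authors or /sessions.
        return any(signal in url.lower() for signal in ["login", "signin", "sign-in"])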

    async def scrape_page(
        self,
        url: str,
        extract_selector: str,
        browser_context,
    ) -> list:
        """Scrape one page, detecting session expiry."""
        page = await browser_context.new_page()
        try:
            await page.goto(url, wait_until="domcontentloaded")

            # Session expiry detection
            if self._is_login_page(page.url):
                print(f"Redirected to login page — invalidating session state")
                Path(self.state_file).unlink(missing_ok=True)
                self.session_created_at = None
                return None  # Caller should re-authenticate and retry

            await page.wait_for_load_state("networkidle")
            items = await page.eval_on_selector_all(
                extract_selector,
                "els => els.map(el => el.textContent.trim())",
            )
            return items

        finally:
            await page.close()

    async def run(self, urls: list[str], extract_selector: str) -> list[dict]:
        """Full pipeline: authenticate → scrape → handle expiry."""
        await self.ensure_authenticated()
        all_results = []

        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            context = await browser.new_context(
                storage_state=self.state_file,
                user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            )

            for url in urls:
                items = await self.scrape_page(url, extract_selector, context)

                if items is None:
                    # Session expired mid-run — re-authenticate and retry
                    await browser.close()
                    await self.ensure_authenticated()
                    browser = await p.chromium.launch(headless=True)
                    context = await browser.new_context(storage_state=self.state_file)
                    items = await self.scrape_page(url, extract_selector, context) or []

                all_results.append({"url": url, "items": items})
                await asyncio.sleep(3.0)

            await browser.close()

        return all_results

# Usage
scraper = AuthenticatedScraper(
    login_url="https://example.com/login",
    username=os.getenv("SITE_USERNAME"),
    password=os.getenv("SITE_PASSWORD"),
    session_max_age_hours=8,
)

results = asyncio.run(scraper.run(
    urls=["https://example.com/members/1", "https://example.com/members/2"],
    extract_selector=".member-data",
))
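
The max-age setting in AuthenticatedScraper is a guess about how long the platform keeps a session alive. The saved state file offers a second signal: each cookie in Playwright's storage state carries an expires timestamp in Unix seconds, with -1 for session cookies. A minimal sketch of reading it follows; the function names are illustrative, and the domain_suffix filter is there because the earliest-expiring cookie is often an unrelated analytics cookie rather than the session cookie.

```python
import json
import time
from pathlib import Path
from typing import Optional


def earliest_cookie_expiry(state: dict, domain_suffix: str = "") -> Optional[float]:
    """
    Return the earliest expiry (Unix seconds) among persistent cookies in a
    Playwright storage state dict, or None if there are none.
    Session cookies (expires == -1) are skipped.
    """
    expiries = [
        cookie["expires"]
        for cookie in state.get("cookies", [])
        if cookie.get("expires", -1) > 0
        and cookie.get("domain", "").endswith(domain_suffix)
    ]
    return min(expiries) if expiries else None


def state_expires_within(
    state_file: str, seconds: float, domain_suffix: str = ""
) -> bool:
    """True if a persistent cookie in the state file expires within `seconds`."""
    state = json.loads(Path(state_file).read_text())
    expiry = earliest_cookie_expiry(state, domain_suffix)
    return expiry is not None and expiry - time.time() < seconds
```

Treat this as an upper bound only. The server can invalidate a session well before the cookie's own expiry, so the login-redirect check during scraping is still needed.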

Step 5: Handle Two-Factor Authentication (2FA)

Some platforms require 2FA after the password step. For TOTP-based 2FA (Google Authenticator, Authy), you can generate the code programmatically using the pyotp library with your TOTP secret:

import asyncio
import os

import pyotp
from playwright.async_api import async_playwright

async def login_with_2fa(
    login_url: str,
    username: str,
    password: str,
    totp_secret: str,         # Your TOTP secret key from the platform's 2FA setup
    state_file: str = "auth_state.json",
) -> bool:
    """
    Handle login flows that require TOTP-based two-factor authentication.
    Requires the TOTP secret (shown during 2FA setup as a QR code or text key).
    """
    totp = pyotp.TOTP(totp_secret)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        await page.goto(login_url, wait_until="domcontentloaded")

        # Step 1: Fill email/username
        await page.fill("input[type='email'], input[name='email']", username)
        await page.click("button[type='submit']")
        await asyncio.sleep(1.0)

        # Step 2: Fill password (some flows split email + password across pages)
        await page.fill("input[type='password']", password)
        await page.click("button[type='submit']")
        await asyncio.sleep(2.0)

        # Step 3: 2FA prompt appears (if the account requires it)
        otp_selector = "input[name='otp'], input[name='code'], input[placeholder*='code']"
        try:
            await page.wait_for_selector(otp_selector, timeout=8000)
        except Exception:
            print("No 2FA prompt detected; may have logged in on the password step")
        else:
            # Generate the code right before submitting; it is valid for about 30 seconds
            await page.fill(otp_selector, totp.now())
            await page.click("button[type='submit']")

        # Confirm we actually left the login page before saving state
        try:
            await page.wait_for_url(
                lambda url: "login" not in url.lower(),
                timeout=10000,
            )
            logged_in = True
        except Exception:
            logged_in = False
            print("Still on the login page; not saving state")

        if logged_in:
            await context.storage_state(path=state_file)
            print("2FA login successful")

        await browser.close()
        return logged_in

# Note: pyotp requires your TOTP secret key, not the 6-digit code
# The secret is shown during 2FA setup as a text string like "JBSWY3DPEHPK3PXP"
asyncio.run(login_with_2fa(
    login_url="https://example.com/login",
    username=os.getenv("SITE_USERNAME"),
    password=os.getenv("SITE_PASSWORD"),
    totp_secret=os.getenv("TOTP_SECRET"),
))

For SMS-based 2FA or email code 2FA, automation requires an external service (SMS receiving API, email inbox access) to intercept the code — significantly more complex and often not worth pursuing for a scraping use case.

Step 6: Use MrScraper's fetch_html for Post-Login Pages

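Once you have a valid authenticated session in auth_state.json, you can combine its cookies with MrScraper's fetch_html for pages that need the full scraping infrastructure (residential proxies, anti-bot bypass) alongside authenticated access. How cookies are supplied to fetch_html depends on the SDK version, so confirm cookie injection support in the MrScraper docs first.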

For authenticated scraping on platforms without heavy anti-bot protection on the content pages (only on login), Playwright with storage_state is typically sufficient. For platforms that also protect content pages with Cloudflare or DataDome, pairing saved session cookies with MrScraper gives you both:

import asyncio
import os
import json
from playwright.async_api import async_playwright
from mrscraper import MrScraper
from mrscraper.exceptions import APIError

async def scrape_authenticated_with_mrscraper(
    url: str,
    state_file: str = "auth_state.json",
) -> str:
    """
    Fetch an authenticated, bot-protected page using saved session cookies
    passed through MrScraper's fetch_html for residential proxy coverage.
    """
    # Load saved auth state
    with open(state_file) as f:
        auth_state = json.load(f)

    # Extract session cookies from the saved state
    cookies = {
        cookie["name"]: cookie["value"]
        for cookie in auth_state.get("cookies", [])
    }

    client = MrScraper(token=os.getenv("MRSCRAPER_API_TOKEN"))

    try:
        # NOTE: as written, this call does not send the cookies extracted above.
        # How to inject them (extra headers, session config) depends on your SDK
        # version, so check the MrScraper docs before relying on this for
        # authenticated pages.
        result = await client.fetch_html(
            url,
            geo_code="US",
            timeout=120,
        )
        return result["data"]

    except APIError as e:
        print(f"API error {e.status_code}: {e}")
        return ""

asyncio.run(scrape_authenticated_with_mrscraper(
    "https://protected-platform.com/dashboard/data"
))
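
Whichever injection mechanism your SDK version supports, you will usually want only the cookies that belong to the target host, not every cookie in the state file. The sketch below builds a standard Cookie header value from a Playwright storage state dict. It is plain HTTP logic and does not use any MrScraper API; the function name cookie_header_for is illustrative, and the matching is simplified (it ignores cookie path and secure flags).

```python
from urllib.parse import urlparse


def cookie_header_for(url: str, state: dict) -> str:
    """
    Build a Cookie header value from the cookies in a Playwright storage
    state dict whose domain matches the host of `url`.
    """
    host = urlparse(url).hostname or ""
    pairs = []
    for cookie in state.get("cookies", []):
        # A leading dot means "this domain and its subdomains"
        domain = cookie.get("domain", "").lstrip(".")
        if domain and (host == domain or host.endswith("." + domain)):
            pairs.append(f"{cookie['name']}={cookie['value']}")
    return "; ".join(pairs)


# Usage with the state file from Step 2:
#     import json
#     with open("auth_state.json") as f:
#         header = cookie_header_for("https://example.com/dashboard", json.load(f))
```

Filtering by domain also avoids leaking third-party cookies (analytics, ads) that the login flow happened to pick up.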

For the most complex authenticated scraping scenarios — where login is guarded by bot detection, content pages are also protected, and you need residential proxies throughout — connecting Playwright directly to MrScraper's Scraping Browser provides a fully managed browser session:

from playwright.async_api import async_playwright
import asyncio
import os

async def full_authenticated_pipeline(
    login_url: str,
    username: str,
    password: str,
    content_urls: list[str],
):
    """
    Full authenticated pipeline via MrScraper's cloud scraping browser.
    Handles bot-protected login AND bot-protected content pages.
    """
    async with async_playwright() as p:
        # Connect to MrScraper's cloud browser
        # Residential proxies, fingerprinting, CAPTCHA solving all handled
        browser = await p.chromium.connect_over_cdp(
            f"wss://browser.mrscraper.com?token={os.getenv('MRSCRAPER_API_TOKEN')}"
        )

        context = await browser.new_context()
        page = await context.new_page()

        # Perform login through the cloud browser
        await page.goto(login_url, wait_until="domcontentloaded")
        await page.fill("input[type='email']", username)
        await page.fill("input[type='password']", password)
        await page.click("button[type='submit']")
        await page.wait_for_url(
            lambda url: "login" not in url.lower(),
            timeout=15000,
        )
        print(f"Logged in via scraping browser")

        # Scrape content pages — session cookies persist in the same context
        results = []
        for url in content_urls:
            await page.goto(url, wait_until="domcontentloaded")
            await page.wait_for_load_state("networkidle")

            data = await page.eval_on_selector_all(
                ".data-row",
                "els => els.map(el => el.textContent.trim())"
            )
            results.append({"url": url, "data": data})
            await asyncio.sleep(3.0)

        await browser.close()
        return results

asyncio.run(full_authenticated_pipeline(
    login_url="https://bot-protected-platform.com/login",
    username=os.getenv("SITE_USERNAME"),
    password=os.getenv("SITE_PASSWORD"),
    content_urls=[
        "https://bot-protected-platform.com/data/page1",
        "https://bot-protected-platform.com/data/page2",
    ],
))

Common Challenges and Limitations

Silent session expiry is the most common failure mode. Many platforms redirect expired sessions to the login page with HTTP 200 — not a 401 or 403. Your scraper happily collects login page HTML and stores it as data. Always validate that authenticated content (not a login form) is present in every response. The _is_login_page(url) check in the AuthenticatedScraper class catches URL-based redirects; also add a content check for platforms that keep the URL the same.
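
A content check for the same-URL case can be as simple as looking for a password input, or for the absence of something that only logged-in users see. This is a minimal sketch: the function name looks_like_login_page is illustrative, and required_marker is a hypothetical per-site string (for example a logout link) that you would choose after inspecting the target.

```python
import re

# Matches an <input> tag whose type attribute is "password"
PASSWORD_INPUT = re.compile(r"<input[^>]*type=[\"']?password", re.IGNORECASE)


def looks_like_login_page(html: str, required_marker: str = "") -> bool:
    """
    Return True if the HTML looks like a login form rather than content.

    required_marker: text that only appears when logged in. If given and
    missing from the HTML, the page is treated as unauthenticated.
    """
    if PASSWORD_INPUT.search(html):
        return True
    if required_marker and required_marker not in html:
        return True
    return False


# Inside scrape_page, after navigation:
#     html = await page.content()
#     if looks_like_login_page(html, required_marker='href="/logout"'):
#         return None  # treat as an expired session
```

The positive marker is the stronger of the two signals, since some sites show an interstitial or an empty shell instead of a visible login form.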

keyboard.type() with delay beats fill() for login forms. Playwright's fill() sets the input value in one step and fires a single input event, with no per-key events, so it fails the keystroke-timing analysis that some platforms run. keyboard.type(text, delay=80) sends real key events for each character and is more reliable on forms with keystroke monitoring.

TOTP codes are time-sensitive. TOTP codes are valid for 30-second windows with some tolerance. Generate the code as close to the submission moment as possible — don't generate it before filling the password field. pyotp.TOTP(secret).now() gives you the current window's code.
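
One way to avoid submitting a code that is about to roll over is to check how much of the current window is left before generating it. The sketch below assumes the standard TOTP setup of 30-second windows aligned to the Unix epoch; the function names and the 5-second threshold are illustrative.

```python
import time
from typing import Optional


def seconds_left_in_totp_window(
    now: Optional[float] = None, interval: int = 30
) -> float:
    """Seconds until the current TOTP window ends (windows start at Unix time 0)."""
    now = time.time() if now is None else now
    return interval - (now % interval)


def safe_to_submit(min_seconds: float = 5.0, now: Optional[float] = None) -> bool:
    """True if enough of the window remains to fill and submit the form."""
    return seconds_left_in_totp_window(now) >= min_seconds


# In the 2FA step, before calling totp.now():
#     if not safe_to_submit():
#         await asyncio.sleep(seconds_left_in_totp_window())
#     code = totp.now()
```

Waiting for the next window costs a few seconds at most and removes a source of intermittent, hard-to-reproduce login failures.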

Never store credentials in source code. All credentials in the examples above use environment variables (os.getenv()). Use a secrets manager or .env file with python-dotenv in production — never hardcode usernames, passwords, or TOTP secrets in code.
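
A related pitfall: os.getenv() returns None when a variable is unset, and that None only surfaces later as a confusing error in the middle of the login flow. A small helper that fails early makes misconfiguration obvious. The name require_env is illustrative.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Usage at startup, before launching a browser:
#     username = require_env("SITE_USERNAME")
#     password = require_env("SITE_PASSWORD")
```

The error message names the variable but never echoes a value, so it is safe to let it reach logs.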

Rate limiting on authenticated sessions is stricter than on anonymous traffic. Platforms know exactly who is making requests from an authenticated session. Rapid navigation through many pages in sequence raises flags faster than anonymous scraping. Keep delays at 3–5 seconds between page loads for authenticated pipelines.
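
A fixed sleep produces perfectly regular request spacing, which is itself a recognizable pattern. A randomized pause within the 3 to 5 second range is a small improvement; the helper names below are illustrative.

```python
import asyncio
import random


def next_delay(min_seconds: float = 3.0, max_seconds: float = 5.0) -> float:
    """Pick a random delay in [min_seconds, max_seconds]."""
    return random.uniform(min_seconds, max_seconds)


async def polite_pause(min_seconds: float = 3.0, max_seconds: float = 5.0) -> float:
    """Sleep for a randomized interval and return the delay that was used."""
    delay = next_delay(min_seconds, max_seconds)
    await asyncio.sleep(delay)
    return delay


# Drop-in replacement for the fixed sleep between pages:
#     await polite_pause()
```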

Respect Terms of Service. Before building an authenticated scraping pipeline, review the platform's Terms of Service. Some explicitly prohibit automated access to authenticated content. Operating within ToS limits (personal research, own account data, permitted use cases) is the responsible approach.

Conclusion

Login-protected scraping comes down to two problems solved in sequence: authenticate once using a real browser that handles whatever login complexity the site throws at you, then save and reuse the resulting browser state so every subsequent session starts already logged in.

Playwright's storage_state feature is the practical mechanism — save after login, load before scraping, detect expiry and re-authenticate when needed. The AuthenticatedScraper class wraps this into a production-ready pattern with automatic session refresh.

For platforms where the login itself is protected by bot detection, or where content pages are also behind anti-bot systems, MrScraper's cloud scraping browser provides a managed environment that handles residential proxies, fingerprinting, and CAPTCHA solving through the entire authentication flow — not just for data collection pages.

What We Learned

  • Always test simple POST login first — many sites still accept direct form POST with CSRF token capture, which is faster and cheaper than a full browser session
  • Playwright's storage_state saves the browser's authentication state (cookies + localStorage, but not sessionStorage) to disk. Load it on subsequent runs instead of logging in again every time
  • keyboard.type(text, delay=80) is more reliable than fill() on login forms with keystroke-timing analysis — the per-character delay mimics real typing and passes timing-based bot detection
  • Silent login-page redirects with HTTP 200 are the most common authenticated scraping failure mode — always validate that post-login URLs don't contain "login" or "signin" signals and check for authenticated content markers
  • TOTP-based 2FA is fully automatable with pyotp using your account's TOTP secret — generate pyotp.TOTP(secret).now() immediately before submitting the 2FA form
  • Connecting Playwright to MrScraper's Scraping Browser with connect_over_cdp() handles bot-protected login flows. When the login form itself is behind Cloudflare or fingerprint detection, a cloud scraping browser manages the anti-bot layer through the entire authenticated session

FAQ

  • Can I share one authenticated session across multiple concurrent scrapers? Often not safely. Some platforms limit an account to one active session or flag the same session arriving from several clients at once. Running multiple concurrent scrapers with the same auth_state.json may cause session conflicts, trigger suspicious-activity alerts on the platform, or result in one scraper logging the others out. For concurrent authenticated scraping, create separate accounts and separate session states per worker.
  • How long do saved storage_state sessions last? It depends entirely on the platform's session configuration — typically 24 hours to 2 weeks for "remember me" sessions, and 30 minutes to a few hours for standard sessions. The AuthenticatedScraper class above uses a conservative 8–12 hour max age; adjust it based on what you observe for your specific target.
  • What if the login requires solving a visual CAPTCHA? For CAPTCHA-protected login forms, two options: (1) connect Playwright to MrScraper's cloud scraping browser via connect_over_cdp() and perform the login there, which handles CAPTCHA solving transparently; or (2) integrate a CAPTCHA solving service (2captcha, Anti-Captcha) directly into the Playwright login flow and inject the token before form submission.
  • Is storing credentials in environment variables secure enough for production? For development and small-scale pipelines, environment variables are adequate. For production systems, use a secrets manager — AWS Secrets Manager, HashiCorp Vault, or GitHub Actions secrets for CI/CD. Never commit credentials to source control even in .env files; add .env to .gitignore from the start.
  • Can I use this approach for OAuth-protected platforms (Google login, GitHub login)? OAuth "Sign in with Google/GitHub" flows can be handled with Playwright, but they require the full browser session to navigate the OAuth redirect chain. The key is waiting for the final redirect back to the target platform after OAuth completes, then saving storage_state at that point. The login flow is more complex to navigate with selectors, but the session saving and reuse pattern is identical.
