How to Scrape Infinite Scrolling Pages with Puppeteer
A practical guide to scraping infinite scroll websites using Puppeteer by simulating user-like scrolling behavior, waiting for dynamic content to load, and extracting data once the page stabilizes. It also explains how infinite scroll works under the hood, common pitfalls like bot detection and timing issues, and more efficient alternatives such as intercepting underlying API requests.
You've finally found the data you need. You open the page, and… it loads 20 items. You scroll down, and 20 more appear. Scroll again — 20 more. It keeps going. Welcome to infinite scrolling, the design pattern that's great for users and absolutely maddening for scrapers.
Here's the short answer: to scrape infinite scrolling pages with Puppeteer, you programmatically scroll the page in a loop, wait for new content to load after each scroll, then extract the data — repeating until no new content appears. That's the core loop. Everything else is just making it reliable.
Let's break down exactly how to do that.
Why Infinite Scrolling Is Tricky (But Not Impossible)
Infinite scrolling pages don't serve all their content at once. Instead, they listen for scroll events and fire off API requests to load more data as you reach the bottom. Twitter, Instagram, LinkedIn job listings — they all do this.
The thing is, traditional HTTP scrapers don't scroll. They just grab the initial HTML and call it a day. That's why you need a browser automation tool like Puppeteer, which actually runs a real Chromium browser and can simulate user behavior — including scrolling.
As the official Puppeteer docs note, it's built for tasks like crawling SPAs (Single-Page Applications) and generating pre-rendered content. Infinite scroll pages are a textbook use case.
Setting Up Your Puppeteer Project
Before we scroll anything, let's get the basics in place.
mkdir infinite-scroll-scraper
cd infinite-scroll-scraper
npm init -y
npm install puppeteer
Now create a file called scraper.js. Here's your starting skeleton:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example-infinite-scroll-site.com', {
waitUntil: 'networkidle2',
});
// scraping logic will go here
await browser.close();
})();
Quick notes on this setup: headless: true means no browser window pops up — perfect for servers. The waitUntil: 'networkidle2' option tells Puppeteer to wait until the network is mostly quiet before proceeding. This gives the initial content time to render properly.
The Core Scroll Loop
Here's where it gets interesting. The strategy is simple: scroll to the bottom, wait for new content, check if anything new loaded, repeat. If nothing new shows up, we're done.
async function autoScroll(page) {
await page.evaluate(async () => {
await new Promise((resolve) => {
let totalHeight = 0;
const distance = 800; // pixels to scroll each step
const delay = 1000; // ms to wait between scrolls
const timer = setInterval(() => {
const scrollHeight = document.body.scrollHeight;
window.scrollBy(0, distance);
totalHeight += distance;
if (totalHeight >= scrollHeight) {
clearInterval(timer);
resolve();
}
}, delay);
});
});
}
This runs inside the browser context using page.evaluate() — that's the magic part. You're literally executing JavaScript inside the Chromium page, just as the site's own scripts do.
Here's what each piece does: window.scrollBy(0, distance) scrolls down by 800px each tick. document.body.scrollHeight is the total scrollable height of the page. Once we've scrolled at least that far, we stop. The setInterval with a 1000ms delay gives the page time to fetch and render new content after each scroll.
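A more robust stop condition compares the page height before and after each scroll step: if the height stops growing, no new content arrived. Pulling that check into a plain function makes it easy to unit-test outside the browser. This is a sketch; the function name and the default cap are our own choices, not part of Puppeteer:

```javascript
// Decide whether the scroll loop should stop. Pure logic: compare the
// page height measured before and after the latest scroll step, and
// enforce a hard cap on iterations as a safety net.
function shouldStopScrolling(prevHeight, currHeight, scrolls, maxScrolls = 50) {
  if (scrolls >= maxScrolls) return true; // safety cap against endless feeds
  return currHeight === prevHeight;       // height unchanged: no new content
}
```

Inside page.evaluate(), you would record document.body.scrollHeight before scrolling, scroll, wait for the delay, measure again, and feed both values to this check.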
Extracting Data After Scrolling
Once the scroll loop finishes, the page has fully loaded all its content (in theory — more on edge cases in a moment). Now you can extract everything at once:
const items = await page.evaluate(() => {
const cards = document.querySelectorAll('.product-card'); // your actual selector here
return Array.from(cards).map(card => ({
title: card.querySelector('.title')?.innerText,
price: card.querySelector('.price')?.innerText,
link: card.querySelector('a')?.href,
}));
});
console.log(`Scraped ${items.length} items`);
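The raw values from innerText usually need light cleanup: stray whitespace, currency symbols, thousands separators. A small normalizer helps; this is a sketch whose field names match the example above, with the price format an assumption about what the site returns:

```javascript
// Normalize one raw scraped record: trim whitespace and parse a price
// string like "$1,299.00" into a plain number.
function normalizeItem(raw) {
  return {
    title: raw.title?.trim() ?? null,
    price: raw.price ? Number(raw.price.replace(/[^0-9.]/g, '')) : null,
    link: raw.link ?? null,
  };
}
```

Run your extracted array through items.map(normalizeItem) before saving it anywhere.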
Putting it all together, your full scraper looks like this:
const puppeteer = require('puppeteer');
async function autoScroll(page) {
await page.evaluate(async () => {
await new Promise((resolve) => {
let totalHeight = 0;
const distance = 800;
const delay = 1000;
const timer = setInterval(() => {
const scrollHeight = document.body.scrollHeight;
window.scrollBy(0, distance);
totalHeight += distance;
if (totalHeight >= scrollHeight) {
clearInterval(timer);
resolve();
}
}, delay);
});
});
}
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example-infinite-scroll-site.com', {
waitUntil: 'networkidle2',
});
await autoScroll(page);
const items = await page.evaluate(() => {
return Array.from(document.querySelectorAll('.product-card')).map(card => ({
title: card.querySelector('.title')?.innerText,
price: card.querySelector('.price')?.innerText,
}));
});
console.log(items);
await browser.close();
})();
Clean, simple, and it works surprisingly well for many sites.
Common Pitfalls (and How to Dodge Them)
The scroll finishes too fast. If your delay is too short, the page hasn't had time to fetch and render new items before you scroll again. Slow it down. 1000–1500ms is usually a safe bet, but some slow APIs need 2000ms or more.
Content never stops loading. Some sites have truly massive datasets. You'll want to add a max scroll limit or a max item count check to your loop — otherwise your scraper runs forever.
if (totalHeight >= scrollHeight || totalHeight > 50000) {
clearInterval(timer);
resolve();
}
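The same idea works as an item-count cap rather than a pixel cap. A small helper combining both limits, as a sketch (the function name and default thresholds are arbitrary choices, not from any library):

```javascript
// Return true once either safety cap is reached: total pixels scrolled,
// or number of items already rendered on the page.
function hitScrollCap(totalHeight, itemCount, maxHeight = 50000, maxItems = 500) {
  return totalHeight > maxHeight || itemCount >= maxItems;
}
```

Inside the scroll loop you would pass document.querySelectorAll('.product-card').length as itemCount, using whatever selector matches your target site.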
New content loads lazily. Some sites use lazy image loading or skeleton screens; per MDN Web Docs, the IntersectionObserver API is commonly used for this, meaning items only fully render once they enter the viewport. If your extracted data comes back with empty fields, add an extra delay of a second or two after scrolling, before extracting.
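Note that page.waitForTimeout was removed in recent Puppeteer releases, so a plain Promise-based sleep is the portable way to add that delay. A minimal sketch:

```javascript
// Minimal sleep helper: resolves after the given number of milliseconds.
// Works in any Puppeteer version, unlike the removed page.waitForTimeout.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// usage in the flow from this guide:
// await autoScroll(page);
// await sleep(2000); // give lazy-loaded images time to render
```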
Bot detection kicks in. Sites like LinkedIn are aggressive about detecting automated behavior, and headless browsers leave detectable fingerprints (telltale navigator properties, missing plugins, and similar signals). Consider using puppeteer-extra with its stealth plugin, and always add realistic delays.
Advanced Tips: Intercepting the API Instead
Here's something that'll change how you think about this problem. A lot of infinite scroll pages are actually just calling a REST or GraphQL API behind the scenes. Instead of scrolling and scraping HTML, you can intercept those network requests directly.
page.on('response', async (response) => {
const url = response.url();
if (url.includes('/api/items') && response.status() === 200) {
const json = await response.json();
console.log('Got batch:', json.items);
}
});
This is significantly faster and more reliable than DOM scraping. Open Chrome DevTools on your target site, go to the Network tab, scroll down, and watch what requests fire. You'll often find a clean JSON API you can hit directly — no scrolling needed at all.
When to use this approach: When you need speed, when the HTML structure is messy, or when you're scraping large volumes of data. The trade-off is that these APIs can change without warning and may require auth tokens.
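If you do go the interception route, batches often overlap, since infinite-scroll endpoints may re-send items near page boundaries. It's worth de-duplicating as you collect. A sketch, assuming each item carries a unique id field (adjust to whatever the real API actually returns):

```javascript
// Accumulate items from intercepted API batches, de-duplicating by id.
const seen = new Map();

function collectBatch(items) {
  for (const item of items) {
    if (!seen.has(item.id)) seen.set(item.id, item);
  }
  return seen.size; // total unique items collected so far
}
```

Call collectBatch(json.items) inside the response handler, then read the final results out of seen.values() once scrolling is done.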
A Note on Rate Limiting and Ethics
Scraping infinite scroll pages can hammer a server pretty hard — you're essentially making it serve content continuously. Always add delays between requests, respect robots.txt, and don't scrape at a rate that degrades the site's performance for real users. A recent Radware Bot Traffic analysis showed that many sites block scrapers that make requests faster than a human could realistically browse.
If you're doing this at scale, it might be worth looking at a managed scraping API that handles browser fingerprinting, CAPTCHAs, and rate limiting for you automatically.
What We Learned
- Infinite scroll works by firing API requests on scroll events — Puppeteer lets you simulate that scrolling inside a real browser, triggering the same content loads a human would trigger.
- The core pattern is a scroll loop: scroll down → wait → check if new content loaded → repeat until the page stops growing.
- Tune your scroll delay carefully — too fast and content doesn't load; 1000–1500ms is a solid starting point for most sites.
- Always add a scroll limit to prevent infinite loops on sites with massive or truly endless datasets.
- Intercepting the underlying API is often better than DOM scraping — check the Network tab in DevTools before writing a single line of scroll logic.
- Bot detection is real — use stealth plugins, realistic delays, and rotate user agents if you're hitting sites that are protective about their data.
The scroll loop isn't magic — it's just simulating what a human does. Once you understand that, the whole thing clicks into place. Now go scrape something!