How to Scrape Infinite Scrolling Pages with Puppeteer
A practical guide to scraping infinite scroll websites using Puppeteer by simulating user-like scrolling behavior, waiting for dynamic content to load, and extracting data once the page stabilizes. It also explains how infinite scroll works under the hood, common pitfalls like bot detection and timing issues, and more efficient alternatives such as intercepting underlying API requests.
You've finally found the data you need. You open the page, and… it loads 20 items. You scroll down, and 20 more appear. Scroll again — 20 more. It keeps going. Welcome to infinite scrolling, the design pattern that's great for users and absolutely maddening for scrapers.
Here's the short answer: to scrape infinite scrolling pages with Puppeteer, you programmatically scroll the page in a loop, wait for new content to load after each scroll, then extract the data — repeating until no new content appears. That's the core loop. Everything else is just making it reliable.
Let's break down exactly how to do that.
Why Infinite Scrolling Is Tricky (But Not Impossible)
Infinite scrolling pages don't serve all their content at once. Instead, they listen for scroll events and fire off API requests to load more data as you reach the bottom. Twitter, Instagram, LinkedIn job listings — they all do this.
The thing is, traditional HTTP scrapers don't scroll. They just grab the initial HTML and call it a day. That's why you need a browser automation tool like Puppeteer, which actually runs a real Chromium browser and can simulate user behavior — including scrolling.
As the official Puppeteer docs note, it's built for tasks like crawling SPAs (Single-Page Applications) and generating pre-rendered content. Infinite scroll pages are a textbook use case.
Setting Up Your Puppeteer Project
Before we scroll anything, let's get the basics in place.
mkdir infinite-scroll-scraper
cd infinite-scroll-scraper
npm init -y
npm install puppeteer
Now create a file called scraper.js. Here's your starting skeleton:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example-infinite-scroll-site.com', {
waitUntil: 'networkidle2',
});
// scraping logic will go here
await browser.close();
})();
Quick notes on this setup: headless: true means no browser window pops up — perfect for servers. The waitUntil: 'networkidle2' option tells Puppeteer to wait until the network is mostly quiet before proceeding. This gives the initial content time to render properly.
The Core Scroll Loop
Here's where it gets interesting. The strategy is simple: scroll to the bottom, wait for new content, check if anything new loaded, repeat. If nothing new shows up, we're done.
async function autoScroll(page) {
await page.evaluate(async () => {
await new Promise((resolve) => {
let totalHeight = 0;
const distance = 800; // pixels to scroll each step
const delay = 1000; // ms to wait between scrolls
const timer = setInterval(() => {
const scrollHeight = document.body.scrollHeight;
window.scrollBy(0, distance);
totalHeight += distance;
if (totalHeight >= scrollHeight) {
clearInterval(timer);
resolve();
}
}, delay);
});
});
}
This runs inside the browser context using page.evaluate() — that's the magic part. You're literally executing JavaScript inside the Chromium page, just as the site's own scripts do.
Here's what each piece does: window.scrollBy(0, distance) scrolls down by 800px each tick. document.body.scrollHeight is the total scrollable height of the page. Once we've scrolled at least that far, we stop. The setInterval with a 1000ms delay gives the page time to fetch and render new content after each scroll.
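A more robust stop condition compares the page height before and after each scroll step: if the height stops growing, no new content arrived. Pulling that check into a plain function makes it easy to unit-test outside the browser. This is a sketch; the function name and the default cap are our own choices, not part of Puppeteer:

```javascript
// Decide whether the scroll loop should stop. Pure logic: compare the
// page height measured before and after the latest scroll step, and
// enforce a hard cap on iterations as a safety net.
function shouldStopScrolling(prevHeight, currHeight, scrolls, maxScrolls = 50) {
  if (scrolls >= maxScrolls) return true; // safety cap against endless feeds
  return currHeight === prevHeight;       // height unchanged: no new content
}
```

Inside page.evaluate(), you would record document.body.scrollHeight before scrolling, scroll, wait for the delay, measure again, and feed both values to this check.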
Extracting Data After Scrolling
Once the scroll loop finishes, the page has fully loaded all its content (in theory — more on edge cases in a moment). Now you can extract everything at once:
const items = await page.evaluate(() => {
const cards = document.querySelectorAll('.product-card'); // your actual selector here
return Array.from(cards).map(card => ({
title: card.querySelector('.title')?.innerText,
price: card.querySelector('.price')?.innerText,
link: card.querySelector('a')?.href,
}));
});
console.log(`Scraped ${items.length} items`);
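The raw values from innerText usually need light cleanup: stray whitespace, currency symbols, thousands separators. A small normalizer helps; this is a sketch whose field names match the example above, with the price format an assumption about what the site returns:

```javascript
// Normalize one raw scraped record: trim whitespace and parse a price
// string like "$1,299.00" into a plain number.
function normalizeItem(raw) {
  return {
    title: raw.title?.trim() ?? null,
    price: raw.price ? Number(raw.price.replace(/[^0-9.]/g, '')) : null,
    link: raw.link ?? null,
  };
}
```

Run your extracted array through items.map(normalizeItem) before saving it anywhere.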
Putting it all together, your full scraper looks like this:
const puppeteer = require('puppeteer');
async function autoScroll(page) {
await page.evaluate(async () => {
await new Promise((resolve) => {
let totalHeight = 0;
const distance = 800;
const delay = 1000;
const timer = setInterval(() => {
const scrollHeight = document.body.scrollHeight;
window.scrollBy(0, distance);
totalHeight += distance;
if (totalHeight >= scrollHeight) {
clearInterval(timer);
resolve();
}
}, delay);
});
});
}
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example-infinite-scroll-site.com', {
waitUntil: 'networkidle2',
});
await autoScroll(page);
const items = await page.evaluate(() => {
return Array.from(document.querySelectorAll('.product-card')).map(card => ({
title: card.querySelector('.title')?.innerText,
price: card.querySelector('.price')?.innerText,
}));
});
console.log(items);
await browser.close();
})();
Clean, simple, and it works surprisingly well for many sites.
Common Pitfalls (and How to Dodge Them)
The scroll finishes too fast. If your delay is too short, the page hasn't had time to fetch and render new items before you scroll again. Slow it down. 1000–1500ms is usually a safe bet, but some slow APIs need 2000ms or more.
Content never stops loading. Some sites have truly massive datasets. You'll want to add a max scroll limit or a max item count check to your loop — otherwise your scraper runs forever.
if (totalHeight >= scrollHeight || totalHeight > 50000) {
clearInterval(timer);
resolve();
}
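The same idea works as an item-count cap rather than a pixel cap. A small helper combining both limits, as a sketch (the function name and default thresholds are arbitrary choices, not from any library):

```javascript
// Return true once either safety cap is reached: total pixels scrolled,
// or number of items already rendered on the page.
function hitScrollCap(totalHeight, itemCount, maxHeight = 50000, maxItems = 500) {
  return totalHeight > maxHeight || itemCount >= maxItems;
}
```

Inside the scroll loop you would pass document.querySelectorAll('.product-card').length as itemCount, using whatever selector matches your target site.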
New content loads lazily. Some sites use lazy image loading or skeleton screens; per MDN Web Docs, the IntersectionObserver API is commonly used for this, meaning items only fully render once they enter the viewport. If your extracted data comes back with empty fields, add an extra delay of a second or two after scrolling, before extracting.
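Note that page.waitForTimeout was removed in recent Puppeteer releases, so a plain Promise-based sleep is the portable way to add that delay. A minimal sketch:

```javascript
// Minimal sleep helper: resolves after the given number of milliseconds.
// Works in any Puppeteer version, unlike the removed page.waitForTimeout.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// usage in the flow from this guide:
// await autoScroll(page);
// await sleep(2000); // give lazy-loaded images time to render
```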
Bot detection kicks in. Sites like LinkedIn are aggressive about detecting automated behavior, and headless browsers leave detectable fingerprints (telltale navigator properties, missing plugins, and similar signals). Consider using puppeteer-extra with its stealth plugin, and always add realistic delays.
Advanced Tips: Intercepting the API Instead
Here's something that'll change how you think about this problem. A lot of infinite scroll pages are actually just calling a REST or GraphQL API behind the scenes. Instead of scrolling and scraping HTML, you can intercept those network requests directly.
page.on('response', async (response) => {
const url = response.url();
if (url.includes('/api/items') && response.status() === 200) {
const json = await response.json();
console.log('Got batch:', json.items);
}
});
This is significantly faster and more reliable than DOM scraping. Open Chrome DevTools on your target site, go to the Network tab, scroll down, and watch what requests fire. You'll often find a clean JSON API you can hit directly — no scrolling needed at all.
When to use this approach: When you need speed, when the HTML structure is messy, or when you're scraping large volumes of data. The trade-off is that these APIs can change without warning and may require auth tokens.
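If you do go the interception route, batches often overlap, since infinite-scroll endpoints may re-send items near page boundaries. It's worth de-duplicating as you collect. A sketch, assuming each item carries a unique id field (adjust to whatever the real API actually returns):

```javascript
// Accumulate items from intercepted API batches, de-duplicating by id.
const seen = new Map();

function collectBatch(items) {
  for (const item of items) {
    if (!seen.has(item.id)) seen.set(item.id, item);
  }
  return seen.size; // total unique items collected so far
}
```

Call collectBatch(json.items) inside the response handler, then read the final results out of seen.values() once scrolling is done.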
A Note on Rate Limiting and Ethics
Scraping infinite scroll pages can hammer a server pretty hard — you're essentially making it serve content continuously. Always add delays between requests, respect robots.txt, and don't scrape at a rate that degrades the site's performance for real users. A recent Radware Bot Traffic analysis showed that many sites block scrapers that make requests faster than a human could realistically browse.
If you're doing this at scale, it might be worth looking at a managed scraping API that handles browser fingerprinting, CAPTCHAs, and rate limiting for you automatically.
What We Learned
- Infinite scroll works by firing API requests on scroll events — Puppeteer lets you simulate that scrolling inside a real browser, triggering the same content loads a human would trigger.
- The core pattern is a scroll loop: scroll down → wait → check if new content loaded → repeat until the page stops growing.
- Tune your scroll delay carefully — too fast and content doesn't load; 1000–1500ms is a solid starting point for most sites.
- Always add a scroll limit to prevent infinite loops on sites with massive or truly endless datasets.
- Intercepting the underlying API is often better than DOM scraping — check the Network tab in DevTools before writing a single line of scroll logic.
- Bot detection is real — use stealth plugins, realistic delays, and rotate user agents if you're hitting sites that are protective about their data.
The scroll loop isn't magic — it's just simulating what a human does. Once you understand that, the whole thing clicks into place. Now go scrape something!