Build a Reliable Web Scraper with Node.js and Cheerio
Learn how to build a reliable web scraper with Node.js and Cheerio. Step-by-step tutorial with code examples, pagination, pitfalls, and best practices.
Imagine you need product prices from 50 different e-commerce pages. Copy-pasting them manually would take hours. There has to be a better way — and there is. Building a web scraper with Node.js and Cheerio is one of the fastest, most practical ways to collect HTML data from static websites using JavaScript you already know.
TL;DR: Use axios to fetch the raw HTML of a page, then pass it into cheerio — a server-side jQuery-like library — to parse and extract the data you need. It's lightweight, fast, and takes about 20 lines of code to get started.
Let's build one, step by step.
Why Node.js and Cheerio for Web Scraping?
Before we write a single line, let's talk about why this stack makes sense. Node.js is non-blocking and event-driven, which makes it naturally good at fetching many web pages concurrently. Cheerio, as described in its official docs at cheerio.js.org, implements a subset of core jQuery on the server — meaning if you've ever written $('.title').text() in a browser, you already know how to use it.
Here's the catch though: Cheerio only works on static websites — pages where the HTML is fully present in the server's response. If a site loads its content via JavaScript after the page loads (think React or Vue apps), you'll need Puppeteer or Playwright instead. We'll come back to that.
For static pages? Cheerio is perfect. Fast, lightweight, zero browser overhead.
Setting Up Your Project
Let's get the environment ready. Create a new folder and initialize your project:
mkdir cheerio-scraper && cd cheerio-scraper
npm init -y
npm install axios cheerio
Two packages, that's it. axios handles the HTTP requests; cheerio handles the HTML parsing.
Building Your First Node.js Web Scraper with Cheerio
Step 1: Fetch the HTML
const axios = require('axios');
const cheerio = require('cheerio');
async function scrape(url) {
const { data } = await axios.get(url);
return data; // This is the raw HTML string
}
axios.get() makes an HTTP GET request and returns a response object. We destructure data from it — that's the raw HTML. Simple.
Step 2: Parse with Cheerio
async function scrape(url) {
const { data } = await axios.get(url);
const $ = cheerio.load(data); // This is the magic part
const title = $('h1').text();
console.log('Page title:', title);
}
scrape('https://books.toscrape.com');
cheerio.load(data) ingests the HTML and returns a jQuery-like selector function — we conventionally call it $. From here, you use familiar CSS selectors to target elements. $('h1').text() grabs the text content of the first h1 tag. Satisfying, right?
Step 3: Extract Structured Data
Let's scrape a list of books from books.toscrape.com — a site built specifically for scraping practice:
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapeBooks(url) {
const { data } = await axios.get(url);
const $ = cheerio.load(data);
const books = [];
$('article.product_pod').each((i, el) => {
const title = $(el).find('h3 a').attr('title');
const price = $(el).find('.price_color').text().trim();
const rating = $(el).find('p.star-rating').attr('class')?.split(' ')[1];
books.push({ title, price, rating });
});
return books;
}
scrapeBooks('https://books.toscrape.com').then(console.log);
Let's unpack what's happening: .each() iterates over every book card on the page. $(el).find() scopes the search inside each card — this is how you avoid accidentally grabbing data from the wrong element. .attr('title') fetches an HTML attribute, .text() gets the inner text, and .trim() removes whitespace. That's your entire data pipeline.
Handling Pagination
Most real-world scraping tasks involve multiple pages. Here's how to follow "next page" links:
async function scrapeAllPages(baseUrl) {
let url = baseUrl;
const allBooks = [];
while (url) {
const { data } = await axios.get(url);
const $ = cheerio.load(data);
$('article.product_pod').each((i, el) => {
allBooks.push({
title: $(el).find('h3 a').attr('title'),
price: $(el).find('.price_color').text().trim(),
});
});
const nextPage = $('li.next a').attr('href');
url = nextPage ? new URL(nextPage, url).href : null;
}
return allBooks;
}
The while (url) loop keeps running until there's no next page link. new URL(nextPage, url).href resolves relative URLs correctly — a small detail that trips up a lot of people. MDN Web Docs has a solid reference on the URL constructor if you want to understand it better.
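To see what that resolution actually does, here's the transformation `new URL()` performs on a relative href (the URLs are illustrative):

```javascript
// Resolve a relative "next page" href against the page it appeared on.
const currentPage = 'https://books.toscrape.com/catalogue/page-1.html';
const nextHref = 'page-2.html'; // a relative href, as found in the <a> tag

const nextUrl = new URL(nextHref, currentPage).href;
// The relative path replaces the last segment of the base URL's path.
```

Here `nextUrl` resolves to `https://books.toscrape.com/catalogue/page-2.html` — concatenating the strings yourself would have produced a broken URL.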
Common Pitfalls and How to Avoid Them
1. Your selectors break without warning. Websites change their HTML structure. What worked last week might not work today. Write your selectors defensively using optional chaining (?.) and always add null checks before calling .text() or .attr().
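One way to sketch that defensive style is a small wrapper around each extraction (`safeExtract` is a hypothetical helper name, not a Cheerio API):

```javascript
// Hypothetical helper: run an extraction, fall back when it throws or returns nothing.
function safeExtract(getter, fallback = null) {
  try {
    const value = getter();
    return value === undefined || value === '' ? fallback : value;
  } catch {
    return fallback;
  }
}

// Usage inside the .each() loop from earlier:
// const rating = safeExtract(
//   () => $(el).find('p.star-rating').attr('class').split(' ')[1],
//   'unknown'
// );
```

If the site drops the `star-rating` class tomorrow, you get `'unknown'` in your data instead of a crashed scraper.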
2. Getting blocked by anti-scraping techniques. Many sites detect scrapers by looking at the User-Agent header. Axios sends axios/1.x.x by default — a dead giveaway. Fix it:
const { data } = await axios.get(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
});
3. Hammering servers too fast. Making 100 requests per second is a good way to get your IP banned — and it's also just bad manners. Add delays between requests:
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));
await sleep(1000); // wait 1 second between requests
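Plugged into a multi-page job, the helper looks something like this — `fetchPage` here is a stand-in for whatever request function you're using, not a real API:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Scrape pages one at a time, pausing between requests.
async function politeScrape(urls, fetchPage, delayMs = 1000) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchPage(url));
    await sleep(delayMs); // be kind to the server
  }
  return results;
}
```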
4. Encoding issues with special characters. If you're seeing garbled text, the page probably isn't UTF-8 — Axios decodes text responses as UTF-8 by default. Request the body as a raw buffer (responseType: 'arraybuffer') and decode it yourself with iconv-lite.
Static vs Dynamic Websites: Know the Difference
This is a critical distinction for anyone doing web scraping with JavaScript. Cheerio only handles static HTML — content that's in the server's initial response. Dynamic sites use JavaScript frameworks to render content client-side, meaning the HTML Axios fetches is mostly empty.
How to tell the difference: Right-click → View Page Source in your browser. If your target data is visible in the raw source, Cheerio works. If it's not there, you'll need a headless browser like Puppeteer. If you want a deeper dive, check out our guide on How to Scrape Infinite Scrolling Pages with Puppeteer.
Advanced Tips
Rate Limiting and Concurrency
Instead of raw loops, use a queue with concurrency control. The p-limit package lets you run, say, 5 requests at a time rather than all at once:
const pLimit = require('p-limit');
const limit = pLimit(5); // max 5 concurrent requests
const results = await Promise.all(
urls.map(url => limit(() => scrapeBooks(url)))
);
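One caveat: recent p-limit versions are ESM-only, so `require()` may complain. You can pin an older version — or write the limiter yourself, since the core idea fits in a few lines. A minimal sketch:

```javascript
// Minimal concurrency limiter: at most `max` tasks run at once.
function createLimiter(max) {
  let active = 0;
  const queue = [];
  const runNext = () => {
    if (active >= max || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    task().then(resolve, reject).finally(() => {
      active--;
      runNext(); // a slot freed up — start the next queued task
    });
  };
  // Same call shape as p-limit: limit(() => somePromise)
  return (task) =>
    new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      runNext();
    });
}
```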
Rotating User Agents and Proxies
For larger-scale scraping, static headers won't cut it. Consider rotating user agents and using residential proxies. Services like ScrapingBee handle this infrastructure for you automatically if you'd rather not manage it yourself.
Saving Data
Once you've got your data, write it to a JSON file:
const fs = require('fs');
fs.writeFileSync('books.json', JSON.stringify(allBooks, null, 2));
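JSON isn't your only option. If you need something spreadsheet-friendly, a tiny converter covers the basics — this is a simplified sketch; a library like csv-stringify handles edge cases (newlines in fields, streaming) more thoroughly:

```javascript
// Convert an array of flat objects into CSV text.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const headers = Object.keys(rows[0]);
  // Quote every field; escape embedded quotes by doubling them.
  const escape = (value) => `"${String(value ?? '').replace(/"/g, '""')}"`;
  const lines = [
    headers.map(escape).join(','),
    ...rows.map((row) => headers.map((h) => escape(row[h])).join(',')),
  ];
  return lines.join('\n');
}

// fs.writeFileSync('books.csv', toCsv(allBooks));
```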
Frequently Asked Questions
Q: Is Cheerio good for scraping JavaScript-rendered pages?
No. Cheerio only parses static HTML. For JavaScript-rendered content, use Puppeteer or Playwright, which run a real browser engine.
Q: Is web scraping with Node.js legal?
It depends on the site's Terms of Service and what you do with the data. Always check robots.txt and respect rate limits. Public data scraped for personal, non-commercial use is generally in a grey zone, but always consult legal guidance for commercial applications.
Q: What's the difference between Axios and the native fetch in Node.js?
Both work. Axios has a cleaner API, better error handling, and broader compatibility. Node.js 18+ includes native fetch, so for new projects you can skip Axios if you prefer fewer dependencies.
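For comparison, here's the earlier Axios fetch step rewritten with native fetch (Node 18+). The injectable `fetchImpl` parameter is just a testing convenience I've added, not part of the fetch API:

```javascript
// Fetch raw HTML with native fetch instead of Axios.
async function fetchHtml(url, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    headers: {
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    },
  });
  // Unlike Axios, fetch does NOT throw on 4xx/5xx — check res.ok yourself.
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  return res.text(); // the raw HTML string, like Axios's response.data
}
```

That manual `res.ok` check is the main behavioral difference to remember when porting Axios code to fetch.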
Q: Can Cheerio handle malformed HTML?
Yes — Cheerio uses htmlparser2 under the hood, which is quite forgiving with broken or messy HTML. It'll do its best to parse whatever you throw at it.
Q: How do I scrape sites that require login?
You'll need to manage session cookies. Axios supports this via cookie jars with the axios-cookiejar-support package, or you can pass cookies manually in request headers.
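Passing cookies manually is the simplest route: log in through your browser, copy the session cookies from DevTools, and send them with each request. The cookie names and values below are hypothetical placeholders:

```javascript
// Hypothetical session cookies copied from the browser after logging in.
const cookies = { sessionid: 'abc123', csrftoken: 'xyz789' };

// Serialize into the Cookie header format: "name=value; name2=value2"
const cookieHeader = Object.entries(cookies)
  .map(([name, value]) => `${name}=${value}`)
  .join('; ');

// const { data } = await axios.get(url, { headers: { Cookie: cookieHeader } });
```

Keep in mind that sessions expire — for long-running scrapers, a cookie jar that updates from Set-Cookie responses is more robust.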
Conclusion
Building a web scraper with Node.js and Cheerio is genuinely one of the most practical skills you can pick up as a JavaScript developer. The combination is fast, familiar, and works brilliantly for static HTML parsing. You now know how to fetch pages, extract structured data, handle pagination, and sidestep the most common pitfalls.
The natural next step is to try it on a real project — even something small, like scraping job listings or product prices from a site you actually use. Once it works, you'll understand why people get hooked on this stuff.
What We Learned
- Cheerio + Axios is the go-to stack for scraping static HTML pages with Node.js — lightweight, fast, and easy to learn if you know jQuery selectors.
- Cheerio does NOT work on dynamic (JavaScript-rendered) sites — use Puppeteer or Playwright when the content isn't in the raw HTML source.
- Always set a custom User-Agent header — the default Axios header flags you as a bot immediately on many sites.
- Add delays between requests — both to avoid bans and to be a respectful scraper; sleep() wrappers and p-limit are your friends.
- Resolve relative URLs with new URL() — it handles all edge cases cleanly when following pagination links.
- Use optional chaining and null checks — scrapers break silently when sites change their HTML structure; defensive selectors save you painful debugging.