
Cheerio Web Scraping With Node.js - 2025 Guide

Key Considerations

  • Cheerio provides a fast, jQuery-like way to parse HTML, making web scraping from Node.js straightforward.
  • It's best suited for static web pages, allowing quick data extraction without the overhead of a browser.
  • Pair Cheerio with an HTTP client like Axios (and reliable proxies) to fetch pages and save the results as JSON.
  • Scrape responsibly: respect sites' rules, and use rate limiting and proxies to avoid blocks.

Cheerio is a Node.js library for parsing HTML and XML, essentially a server-side jQuery with a similar, familiar API. It’s commonly used to scrape web pages without a browser.

It doesn’t render a page or load external resources like a browser; it simply parses the HTML and exposes the content as a lightweight DOM structure. In practice, Cheerio web scraping lets you pull data out of page HTML using familiar selectors.

When to use Cheerio

Use Cheerio when you need a lightweight scraper to retrieve structured content from static web pages. If the data you need is present in the page’s HTML (without requiring user interaction or heavy scripting), Cheerio is likely the perfect tool.

For example, scraping product listings or blog articles can be done quickly with Cheerio. It's ideal for web data extraction when you can fetch the raw HTML directly.

However, Cheerio is not a browser. It won't execute any JavaScript code or run page events. That means if a site relies on client-side JavaScript to display content (like a single-page application that loads data dynamically), Cheerio alone might not get the info you need.

In those cases, you may need a headless browser or other tools to handle the dynamic parts (more on that later).

Pros & cons

Pros: Cheerio is fast and efficient (no browser overhead, just HTML parsing), and it's easy to learn if you know jQuery. You can quickly build a web scraper to parse many static pages, and it integrates easily with Node tools (for example, saving results to a JSON file).

Cons: Cheerio cannot execute scripts or interact with pages. If a website loads data via JavaScript or requires login clicks, Cheerio by itself won’t capture that content. Your scraping logic may also break if the site’s HTML structure changes, since Cheerio relies on specific selectors. For highly dynamic sites or those behind logins, a headless browser tool or an official API might be more suitable.

Setting up your scraping environment

Before you start web scraping, ensure you have Node.js and npm installed. For this 2025 guide, Node.js 18.x or higher is recommended (newer Cheerio versions require Node 18+).

Download Node.js (which includes npm) from the official website, using either the installer for your system or a version manager (e.g., nvm) to get the latest LTS release. Once installed, verify by running node -v and npm -v in your terminal.

Installing Cheerio and Axios

Next, create a project folder for your scraper and run npm init -y inside it to initialize. Then install the needed libraries:

    npm install cheerio axios

This adds Cheerio (for HTML parsing) and Axios (for making HTTP requests) to your project. Axios is a promise-based HTTP client that makes it easy to fetch web pages in Node.

Note: If you're planning a lot of scraping, you might also configure proxies or use a proxy package. We’ll discuss proxies next, including how MarsProxies can help.

Creating your project structure

Now create an entry file for your scraper, for example index.js. Open it in your editor and import the required modules:

    const axios = require('axios');
    const cheerio = require('cheerio');
    const fs = require('fs');

Here we import Axios, Cheerio, and Node’s file system module (fs), which lets us save data to a JSON file.

Web scraping with Cheerio: step-by-step

Let’s build a simple web scraping project step by step. For the demo, we'll use a public sandbox site like Books to Scrape where scraping is allowed.

Step 1: using proxies

Websites often track scrapers by IP address and may block or throttle requests if they see too many coming from one IP. Using proxies can help distribute your requests across different IPs, making your scraping more stealthy. Proxies are especially useful if you're scraping a large number of web pages or a site with strict anti-scraping measures.

In Node.js, you can configure Axios to use a proxy. For instance, with MarsProxies, you would get a proxy host, port, and login credentials, then do:

    const proxyConfig = {
      host: 'PROXY_HOST',
      port: 1234,
      auth: { username: 'USER', password: 'PASS' }
    };
    // Note: await must run inside an async function (or an ES module)
    const response = await axios.get('http://books.toscrape.com', { proxy: proxyConfig });

This routes the request through the proxy. If you're using rotating residential proxies (like those from MarsProxies), each request can go through a different IP, helping you avoid bans. For a small number of pages, you might scrape fine without proxies, but for large-scale scraping, they're essential.
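If you manage your own list of proxy endpoints rather than a rotating gateway, you can rotate them yourself. Here is a minimal round-robin sketch (the hosts and credentials below are placeholders, not real endpoints):

```javascript
// Rotating through a list of proxies, one per request (placeholder endpoints).
const proxies = [
  { host: 'proxy1.example.com', port: 1234, auth: { username: 'USER', password: 'PASS' } },
  { host: 'proxy2.example.com', port: 1234, auth: { username: 'USER', password: 'PASS' } },
];

let nextIndex = 0;
function pickProxy() {
  // Round-robin: each call returns the next proxy in the list.
  const proxy = proxies[nextIndex % proxies.length];
  nextIndex += 1;
  return proxy;
}

// Usage with Axios (each request goes out through the next proxy in turn):
// const response = await axios.get(url, { proxy: pickProxy() });
```

Round-robin keeps the load spread evenly; a rotating gateway service achieves the same effect with a single host, so you only need this when juggling proxies manually.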

Step 2: Sending HTTP requests

The first step in scraping is to fetch the page’s HTML. Using Axios, this is straightforward:

    const url = 'http://books.toscrape.com';
    const response = await axios.get(url);
    const html = response.data;

This sends a GET request to the target URL and stores the page’s HTML in the html variable. (For APIs that return JSON, Axios will give you the parsed data directly.)
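Since this guide assumes Node 18+, you could also use the built-in global fetch instead of Axios. Unlike Axios, fetch doesn't throw on HTTP error statuses or parse the body for you, so a zero-dependency sketch looks like this:

```javascript
// The same fetch step using Node 18+'s built-in fetch (no Axios required).
async function getHtml(url) {
  const response = await fetch(url);
  if (!response.ok) {
    // fetch does not throw on HTTP errors (e.g. 404), so check the status manually
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text(); // the raw HTML string
}

// Usage: const html = await getHtml('http://books.toscrape.com');
```

Axios remains the more convenient choice once you add proxies and custom headers, but built-in fetch keeps small scripts dependency-free.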

Step 3: loading HTML into Cheerio

Next, load the HTML string into Cheerio:

    const $ = cheerio.load(html);

The $ object now represents the page’s DOM in Cheerio. This is like having jQuery for a web page, but it's on the server side. You can use $ to begin selecting elements and navigate the content.

Step 4: selecting elements and extracting data

With Cheerio’s jQuery-like syntax, you can find elements by CSS selectors and retrieve text or attributes. For example, if each book is listed in an <article class="product_pod"> element (with the title in an <h3> link and the price in a <p class="price_color">), you could write:

    const books = [];
    $('.product_pod').each((i, element) => {
      const title = $(element).find('h3 a').attr('title');
      const price = $(element).find('.price_color').text();
      books.push({ title, price });
    });

This loop goes through each book entry, finds the title attribute of the link and the price text, and adds them to the books array as plain objects with title and price.

After extracting the data, you can save it or use it. For instance, to save the books array as JSON:

    fs.writeFileSync('books.json', JSON.stringify(books, null, 2));

This writes the results to books.json, pretty-printed with two-space indentation.

Common Issues and Limitations

Handling JavaScript-rendered content

Cheerio only sees the HTML sent by the server. If a site populates content via JavaScript after load, extracting data with Cheerio will miss it.

In such cases, you may need to fetch data from an API (if the site exposes a JSON endpoint), or use a headless browser (like Puppeteer or Selenium) to retrieve the fully rendered page, then pass that HTML to Cheerio to parse.

Avoiding IP blocks and rate limits

If you perform web scraping too aggressively, websites might block your IP. To avoid this:

  • Throttle your requests. Add delays between requests or limit how many requests you send per second so you don't overwhelm the server.
  • Use proxies and rotation. Services like MarsProxies supply rotating IP addresses, so the site sees requests coming from different sources instead of one.
  • Randomize your pattern. Vary the timing of requests and use different User-Agent strings to mimic normal browsing behavior.
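The first and third points can be sketched with a small delay helper and a pool of User-Agent strings (the timing values and User-Agent list below are illustrative):

```javascript
// Promise-based sleep: pause between requests without blocking the event loop.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// A random delay in a range, so requests don't arrive at a fixed rhythm.
function randomDelay(minMs, maxMs) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs));
}

// A few desktop User-Agent strings to rotate through (trimmed examples).
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
];
function pickUserAgent() {
  return userAgents[Math.floor(Math.random() * userAgents.length)];
}

// Usage in a scraping loop (axios assumed):
// for (const url of urls) {
//   const res = await axios.get(url, { headers: { 'User-Agent': pickUserAgent() } });
//   await sleep(randomDelay(1000, 3000)); // wait 1-3 s between requests
// }
```

Processing URLs sequentially with a randomized pause, rather than firing them all at once with Promise.all, is usually enough to stay under a site's radar.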

Dealing with changing HTML structure

Websites can change their layout at any time, which might break your scraper. Try to use flexible selectors and avoid brittle assumptions.

For example, relying on particular selectors (like certain nth-child positions) can be risky. If your scraper stops finding data, inspect the updated HTML and adjust your selectors or logic accordingly. Maintaining a scraper often means adapting to such changes.

Cheerio vs other scraping tools

Cheerio vs Puppeteer

Cheerio and Puppeteer serve different needs. Cheerio is a fast HTML parser, great for static content; Puppeteer controls a real headless Chrome browser and can scrape dynamic content.

Cheerio is much faster and lighter (it doesn’t run a browser), but it cannot execute JavaScript or simulate user interactions. Puppeteer can do everything a real browser can, like loading dynamic data, but it's slower and more resource-intensive.

Use Cheerio for static web scraping tasks where speed and simplicity are key, and use Puppeteer when you need to run scripts or interact with the page.

Cheerio vs Selenium

Selenium is another browser automation tool, similar in capability to Puppeteer (often used for testing web applications). The comparison is similar: Cheerio is lightweight, whereas Selenium drives an actual browser to handle dynamic pages.

If you only need to parse static HTML, Cheerio is much simpler. If the target content requires running a real browser (for example, to log in or execute JS), Selenium can do that at the cost of speed and complexity. Selenium can accomplish what Cheerio can't, but with more overhead.

When to combine Cheerio with other tools

Sometimes the best approach is a combination of both types of tools. For example, you might use Puppeteer or Selenium to log into a site or load a page fully, then grab that page’s HTML and feed it into Cheerio for parsing and data extraction.

This hybrid approach lets you overcome dynamic content and still benefit from Cheerio's simple element selection and parsing. Use the browser automation to deal with interactions and JS, and let Cheerio quickly sift through the resulting HTML for the data you need.
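A minimal sketch of that hybrid flow, assuming puppeteer and cheerio are installed (they're required inside the function so the rest of your script doesn't depend on them):

```javascript
// Render a page with Puppeteer, then parse the resulting HTML with Cheerio.
async function scrapeRendered(url) {
  const puppeteer = require('puppeteer'); // required here so this helper stays optional
  const cheerio = require('cheerio');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' }); // wait for JS-driven requests to settle
    const html = await page.content(); // the fully rendered HTML
    const $ = cheerio.load(html);
    return $('h1').first().text(); // parse with Cheerio as usual
  } finally {
    await browser.close();
  }
}

// Usage: scrapeRendered('https://example.com').then(console.log);
```

The browser does the expensive part once per page; everything after page.content() is the same fast Cheerio workflow covered earlier.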

Best practices for ethical web scraping

Rate limiting and throttling

Always control the pace of your scraping. Making too many requests too fast can overwhelm a server and flag you as a bot. Incorporate short delays between requests or limit the number of requests per second. This courtesy helps you avoid detection and reduces strain on the target website.

Respecting site rules

Before scraping any site, check its robots.txt file and terms of service. Some websites explicitly forbid scraping. Only scrape publicly available data and never attempt to access private information or accounts you’re not authorized to use.

Avoid any actions that could harm the website (like overloading it with requests). Using proxies responsibly is important. In short, be respectful: gather data in line with the site’s policies. By scraping politely, you reduce your chances of being blocked.

What is the difference between Cheerio and jQuery?

Cheerio is a server-side library for HTML parsing, while jQuery runs in the browser to manipulate live DOM elements. Cheerio mimics jQuery’s syntax but doesn’t handle events or dynamic page updates.

Which is better, Cheerio or Puppeteer?

Cheerio is better for static web pages where speed and simplicity matter. Puppeteer is better for dynamic content since it runs a full headless browser.

Can Cheerio scrape dynamic content?

Not directly. Cheerio only sees the static HTML and doesn’t execute JavaScript, so it can't scrape content loaded dynamically.

Can you get caught for web scraping?

Yes, especially if you send too many requests too quickly or ignore a site's rules. Using proxies like MarsProxies and adding delays helps reduce detection.


Mat Wilson
Author