How to Scrape Shopify Stores: Step-by-Step

Scraping Shopify stores is the process of collecting publicly available e-commerce information from Shopify-powered storefronts and turning it into a usable dataset. Shopify is a common target for e-commerce data collection because many stores follow similar URL patterns, publish product pages in predictable places, and often expose catalog information in machine-readable form.

In this guide, you’ll learn the main methods: using public endpoints, browser automation, XML sitemaps, and proxies for larger jobs.

Why Shopify stores are easier to scrape

Shopify stores are not all identical, but they usually share enough structure to make data extraction easier than with many custom e-commerce platforms. The platform standardizes product pages, collections, variants, and theme logic, so you can often test one store and apply a similar workflow to another.

Predictable URL structure

Most Shopify stores use clean, readable URLs. Product pages commonly appear under /products/, while category pages often appear under /collections/. Product handles are usually part of the URL, which makes them useful identifiers when deduplicating product listings.

Shopify stores also typically publish an XML sitemap of product URLs, usually at https://<domain-name>/sitemap_products_1.xml, which you can use to gather product page URLs for scraping.

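For example, a product page URL may look like this:

https://<domain-name>/products/<product-handle>

A category URL may look like this:

https://<domain-name>/collections/<collection-handle>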

These patterns help you discover pages, organize extracted records, and connect products to each collection. If your goal is to scrape Shopify stores across multiple domains, URL structure is usually the first thing to inspect.
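
As a rough illustration of how these patterns can be used, the sketch below classifies a storefront URL and pulls out its handle. The function name `parse_shopify_url` and the example.com URLs are made up for this example, and it assumes that some stores also link products under a collection path (/collections/<collection>/products/<handle>).

```python
from urllib.parse import urlparse


def parse_shopify_url(url):
    """Return (kind, handle) for a storefront URL.

    kind is "product", "collection", or "other"; handle is the path
    segment that follows /products/ or /collections/, or None.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Check /products/ first so nested paths such as
    # /collections/<collection>/products/<handle> count as products.
    if "products" in segments:
        i = segments.index("products")
        if i + 1 < len(segments):
            return "product", segments[i + 1]
    if "collections" in segments:
        i = segments.index("collections")
        if i + 1 < len(segments):
            return "collection", segments[i + 1]
    return "other", None


print(parse_shopify_url("https://example.com/products/blue-shirt?variant=1"))
print(parse_shopify_url("https://example.com/collections/sale"))
```

Because the query string is ignored, the same product reached through different variant links maps to one handle, which is what makes the handle a convenient deduplication key.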

Public JSON endpoints

Many Shopify stores expose product data through public endpoints. The most common one returns catalog records in a machine-readable format. This can include titles, handles, descriptions, variants, prices, images, vendors, tags, and fields related to availability.

This is useful because machine-readable responses are easier to parse than HTML. Instead of selecting elements from a page, you can request a structured response and map fields directly into your database. That is why a basic Shopify scraper often starts with endpoint data before moving to rendered pages.

Because these endpoints are part of the standard storefront, Shopify is often one of the simpler e-commerce platforms to extract data from. Access is not guaranteed, though: a store can restrict the endpoint or change how it behaves.

When HTML scraping is still needed

Public endpoints are convenient, but they do not always return every field you need. Some stores use custom themes, app blocks, JavaScript-rendered widgets, or hidden promotional sections. Reviews, badges, shipping messages, size guides, or sale banners may appear only in the rendered HTML.

HTML scraping is also useful when public responses are disabled, incomplete, or lack page-level context. In those cases, you may need to load product pages, parse the document, and extract product details from visible sections or embedded scripts.

What data you can extract from Shopify stores

The exact fields depend on the store, theme, and method you use, but Shopify stores often make the following e-commerce data available:

  • Product titles
  • Product URLs and handles
  • Current prices and compare-at prices
  • Variants and options, such as size, color, or material
  • Product images
  • Descriptions and specifications
  • Availability, inventory signals, and stock messages
  • Vendor or brand
  • Tags
  • Product type
  • Collection pages

For competitive analysis, the most useful catalog fields usually include titles, prices, variants, availability, and product URLs. For catalog monitoring, you may also want product images, tags, and inventory-related changes. For search or marketplace projects, structured product data helps you standardize records across multiple Shopify stores.

Method 1: Use Shopify JSON endpoints

The fastest method is often to check whether a public endpoint is available. If it works, you can collect catalog records without rendering every page in a browser.

What products.json is

products.json is a public endpoint that many Shopify storefronts expose at the root of the domain. It behaves like a simple public API for catalog reads, although access is not guaranteed. A typical request looks like this:

https://<domain-name>/products.json

The response is JSON and usually contains a products array. Each product object may include an ID, title, handle, description, vendor, product type, tags, variants, images, and timestamps. Variant objects often include prices, option values, SKU fields, and availability indicators.
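
To make the mapping concrete, here is a minimal sketch that flattens one product object into one row per variant. The key names (`id`, `handle`, `vendor`, `variants`, `sku`, `price`, `available`) are assumptions based on the fields described above, and the sample record is hand-written rather than taken from a real store, so check both against a saved response before relying on them.

```python
def flatten_product(product, base_url):
    """Return one flat row per variant for a single product object."""
    handle = product.get("handle")
    rows = []
    for variant in product.get("variants", []):
        rows.append({
            "product_id": product.get("id"),
            "handle": handle,
            "title": product.get("title"),
            "vendor": product.get("vendor"),
            "url": f"{base_url.rstrip('/')}/products/{handle}",
            "variant_id": variant.get("id"),
            "sku": variant.get("sku") or None,  # treat empty SKUs as missing
            "price": variant.get("price"),
            "available": variant.get("available"),
        })
    return rows


# Hand-written sample shaped like one entry of the "products" array.
sample = {
    "id": 1, "title": "Blue Shirt", "handle": "blue-shirt", "vendor": "Acme",
    "variants": [
        {"id": 11, "sku": "BS-S", "price": "19.99", "available": True},
        {"id": 12, "sku": "", "price": "19.99", "available": False},
    ],
}
rows = flatten_product(sample, "https://example.com/")
for row in rows:
    print(row)
```

Using `.get()` throughout means a store that omits a field produces a `None` instead of an exception.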

This endpoint is not the same as the authenticated Shopify Admin API. The Admin API is for store owners and approved apps and requires permission, while the products endpoint is a storefront-facing resource that may be publicly accessible.

For public scraping, always treat this API-like route as a convenience endpoint, not a guaranteed data source.

Pagination and limits

Large catalogs are split into pages. You can usually request a specific page and limit, such as:

https://<domain-name>/products.json?limit=250&page=2

Then increment the page number until the response returns no more products. Some stores may allow fewer items per page, and others may behave differently depending on their setup. Your scraper should not assume that every catalog will return the same number of items.

Pagination is important because a large store can have thousands of products and variants. Store the product ID or handle, track the page number, and stop only when you confirm that the current response is empty or contains no new records.
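
One way to sketch this stopping rule is shown below. It assumes the endpoint accepts `page` and `limit` query parameters and returns a `products` array whose items carry an `id`. The names `fetch_page` and `collect_products` and the User-Agent string are placeholders invented for this example.

```python
import json
import urllib.request


def fetch_page(base_url, page, limit=250):
    """Fetch one page of products from a store's products endpoint."""
    url = f"{base_url.rstrip('/')}/products.json?limit={limit}&page={page}"
    request = urllib.request.Request(
        url, headers={"User-Agent": "catalog-research-bot/0.1"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response).get("products", [])


def collect_products(fetch, max_pages=100):
    """Call fetch(page) for page 1, 2, ... until a page adds nothing new.

    max_pages is a safety cap so a misbehaving store cannot loop forever.
    """
    seen = {}
    for page in range(1, max_pages + 1):
        new = [p for p in fetch(page) if p["id"] not in seen]
        if not new:  # empty page, or only records we already have
            break
        for product in new:
            seen[product["id"]] = product
    return list(seen.values())


# Usage against a real store (not executed here):
# products = collect_products(lambda page: fetch_page("https://<domain-name>", page))
```

Passing the fetch function in as an argument keeps the pagination logic testable without network access, and lets you swap in a proxied or rate-limited fetcher later.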

Pros and cons

The main advantage of endpoint scraping is speed. You can avoid rendering pages, reduce bandwidth, and get structured fields directly. It is ideal when you need catalog data at scale for catalog tracking, price monitoring, or product research.

The downside is coverage. An API response may omit theme-specific content, app-generated information, certain availability messages, or merchandising blocks. Also, not all Shopify stores expose the endpoint. When endpoint data is missing or incomplete, combine it with sitemap discovery or HTML extraction.

Method 2: Scrape Shopify stores with browser automation

Browser automation means loading pages in a real or headless browser and extracting the data after the page has rendered. Tools like Playwright, Puppeteer, or Selenium can click, scroll, wait for scripts to run, and read dynamic content.

When browser automation is needed

Use browser automation when a store depends heavily on JavaScript or when important catalog fields appear only after the page loads. This can happen with variant pickers, embedded recommendation widgets, dynamic pricing, region-specific availability, or custom product bundles.

It is also useful when you need to test what a real visitor sees. For example, the endpoint response might show all variants, but the page may display stock status, delivery estimates, or discount badges only after a script runs.

Pros and cons versus endpoint scraping

Browser automation gives better visual coverage and handles more complex storefronts. It can extract rendered text, interact with variant selectors, and capture page states that simple requests miss.

However, it is slower and more expensive to run. Each browser session uses more CPU, memory, and bandwidth. It is also more likely to trigger anti-bot systems when repeated at high volume. For that reason, a practical Shopify scraper often uses endpoint data first and switches to browser automation only for missing fields or difficult pages.
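
The endpoint-first approach with a browser fallback might look roughly like the sketch below. It assumes Playwright for Python is installed. The helper names, the required-field list, and any CSS selector you pass in are hypothetical and depend on the store's theme.

```python
def missing_fields(record, required):
    """Return the required field names that are absent or empty in record."""
    return [name for name in required if record.get(name) in (None, "", [])]


def render_text(url, selector, timeout_ms=15000):
    """Load url in headless Chromium and return the text inside selector."""
    # Imported lazily so the endpoint-only path has no browser dependency.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            # Wait until the element exists, since scripts may add it late.
            page.wait_for_selector(selector, timeout=timeout_ms)
            return page.inner_text(selector)
        finally:
            browser.close()


record = {"title": "Blue Shirt", "price": "19.99", "available": False}
gaps = missing_fields(record, ["title", "price", "available", "delivery_estimate"])
print(gaps)  # only these fields would justify a browser session
# if gaps:
#     text = render_text("https://<domain-name>/products/<product-handle>", "<selector>")
```

Note that `available: False` is a real value, not a gap, so the check only treats `None`, empty strings, and empty lists as missing.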

Method 3: Use XML sitemaps to discover product URLs first

Sitemaps are excellent for discovery. Many Shopify stores publish a sitemap.xml file that points to sitemap files for products, pages, blogs, and categories. This helps search engines find public URLs, but it also helps you build a list of pages before extraction.

Start with:

https://<domain-name>/sitemap.xml

From there, look for product sitemap files and category sitemap files. Product sitemaps help you collect product URLs directly, while a category sitemap can show category-style pages that organize the catalog.

This method is especially helpful for large stores. Instead of crawling every internal link, you can parse the sitemap, extract product URLs, deduplicate them by handle, and then request each page or matching endpoint record.

You can also combine sitemap discovery with the products endpoint: use the endpoint for core fields, then use URLs from the sitemap to fill gaps or confirm that no product pages were missed.
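
A minimal sketch of the sitemap step, using only the Python standard library, is shown below. It assumes the files follow the standard sitemaps.org schema. The sample XML is hand-written for illustration, and the function names are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace of the standard sitemap schema. Filtering on it skips
# <image:loc> entries, which live in a different namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_text):
    """Return every <loc> URL from a sitemap index or a URL set."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]


def product_urls(xml_text):
    """Return only the URLs that point at product pages."""
    return [url for url in sitemap_locs(xml_text) if "/products/" in url]


sample_index = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap_products_1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap_collections_1.xml</loc></sitemap>
</sitemapindex>"""

child_sitemaps = sitemap_locs(sample_index)
product_sitemaps = [u for u in child_sitemaps if "sitemap_products" in u]
print(product_sitemaps)
```

In a real run you would download each product sitemap found in the index and pass its text to `product_urls`.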

Method 4: Scale Shopify scraping with proxies

A small test may work over your local connection, but larger jobs can get blocked. If you scrape Shopify stores repeatedly, across many pages, or across many domains, proxy infrastructure becomes important.

Why blocking happens at scale

Blocking usually happens because a site sees too many requests from the same IP address or detects behavior that does not look like normal browsing. Common triggers include high request frequency, repeated access to paginated URLs, missing headers, unstable sessions, and concurrent requests to many product listings.

Shopify merchants can also use apps, firewalls, or custom rules to limit automated traffic. That means two Shopify stores can react differently to the same scraper.

When rotating proxies are necessary

Rotating proxies are useful when you are collecting many pages, monitoring prices regularly, checking stock changes, or running multi-store scraping jobs. Proxy rotation spreads requests across multiple IP addresses, reducing the risk that any single IP becomes overused.

You do not need proxies for every small test. But if you need to scrape Shopify stores on a schedule, collect large catalogs, or monitor inventory changes across multiple markets, rotating IPs can make the workflow more stable.
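
A very simple round-robin rotation could be sketched like this with the standard library. The `ProxyRotator` class and the proxy addresses are placeholders invented for the example. Commercial proxy services often handle rotation on their side behind a single gateway address, in which case you would not need this at all.

```python
import itertools
import urllib.request


class ProxyRotator:
    """Hand out proxy URLs in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

    def open(self, url, timeout=30):
        """Open url through the next proxy in the rotation."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        opener = urllib.request.build_opener(handler)
        return opener.open(url, timeout=timeout)


# Placeholder addresses; replace with the ones from your provider.
rotator = ProxyRotator(["http://proxy-1.example:8000", "http://proxy-2.example:8000"])
print([rotator.next_proxy() for _ in range(3)])
```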

Residential vs datacenter proxies

Datacenter proxies are usually faster and cheaper. They work well for lightweight tasks, testing, and stores with minimal blocking. Residential proxies use IPs associated with real internet service providers, so they are often better for stricter targets and recurring monitoring.

The tradeoff is cost and speed. Residential proxies are usually more expensive, while datacenter proxies can be more efficient for simple, high-speed collection. Choose based on the store’s defenses, your request volume, and how reliable the job needs to be.

Step-by-step workflow for scraping a Shopify store

A good workflow keeps discovery, extraction, cleanup, and scaling separate.

Step 1: Identify the store structure

Open the target site and check whether it follows standard Shopify patterns. Look at product URLs, collections, pagination, filters, and variant selectors. Confirm whether product pages use /products/ and whether category pages use /collections/.

Also inspect the page source for embedded data. Some themes include structured data inside script tags, which can reveal product IDs, variants, prices, and availability.
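
As one example of reading embedded data, the sketch below pulls JSON-LD blocks out of a page and keeps those typed as a product. It assumes the theme emits schema.org `Product` markup in `application/ld+json` script tags, which is common but not universal. The sample HTML is hand-written.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of application/ld+json script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the page


def extract_products(html_text):
    """Return JSON-LD objects whose @type is Product."""
    parser = JsonLdParser()
    parser.feed(html_text)
    return [b for b in parser.blocks
            if isinstance(b, dict) and b.get("@type") == "Product"]


sample_html = """<html><head>
<script type="application/ld+json">{"@type": "Product", "name": "Blue Shirt"}</script>
<script type="application/ld+json">{not valid json}</script>
<script>var unrelated = 1;</script>
</head></html>"""
found = extract_products(sample_html)
print(found)
```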

Step 2: Choose the best extraction method

Choose the lightest method that returns the fields you need. If the products endpoint works and contains your required fields, use it first. If it misses theme-specific content, add HTML parsing. If the page renders data after scripts run, use browser automation.

The goal is not to use the most advanced method. The goal is to extract the catalog with the least complexity while still getting accurate results.

Step 3: Check public JSON endpoints

Request the products endpoint and test pagination. Look at the response structure and identify the fields you can map. Save a sample response file and define your schema before scaling.

Typical fields include title, handle, vendor, product type, tags, variants, images, and timestamps. If the response contains enough structured product data, you can build most of your dataset from it.

Step 4: Gather product URLs from sitemaps or collections

Next, collect product URLs. Start from sitemap.xml for broad discovery, then parse the product sitemap files, along with the collection sitemaps if you need category pages. If the sitemap is incomplete, crawl category pages and extract links to product pages.

This step helps you compare sources. If a product exists in the sitemap but not in the endpoint, add it to a review queue. If a product appears multiple times through different category pages, deduplicate it by handle or canonical URL.
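
The comparison itself can be sketched with plain set operations. The function name and the bucket names below are made up for this example.

```python
def compare_sources(sitemap_handles, endpoint_handles):
    """Split handles into buckets based on which source reported them."""
    sitemap_set = set(sitemap_handles)
    endpoint_set = set(endpoint_handles)
    return {
        # In the sitemap but missing from the endpoint: check by hand.
        "review_queue": sorted(sitemap_set - endpoint_set),
        "endpoint_only": sorted(endpoint_set - sitemap_set),
        "matched": sorted(sitemap_set & endpoint_set),
    }


report = compare_sources(
    ["blue-shirt", "red-hat", "blue-shirt"],  # duplicates collapse in the set
    ["blue-shirt", "green-scarf"],
)
print(report)
```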

Step 5: Extract the fields you need

Now extract product data into a consistent schema. Keep one record per product and one nested or separate record per variant. Include price, compare-at price, SKU when available, options, product URL, image URL, vendor, tags, and availability.

For inventory, be careful with interpretation. Public storefronts often show availability signals rather than exact warehouse quantities. Treat “available,” “sold out,” and “low stock” as indicators unless the store exposes exact numbers.

Step 6: Handle pagination, duplicates, and blocking

Pagination errors are common. Track which pages you requested, how many products each page returned, and when results stop. Add safeguards for repeated pages and empty responses.

Duplicates can appear when the same product belongs to multiple category paths. Normalize URLs, remove tracking parameters, and deduplicate by handle or product ID.
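
A small sketch of that normalization follows. The list of tracking parameters is an assumption and will need extending for your own data. `normalize_url` and `dedupe` are illustrative names.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def normalize_url(url):
    """Lowercase the host, drop tracking parameters, fragments, and trailing slashes."""
    parts = urlparse(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunparse(
        (parts.scheme.lower(), parts.netloc.lower(), path, "", urlencode(query), "")
    )


def dedupe(records, key="handle"):
    """Keep the first record seen for each key value."""
    unique = {}
    for record in records:
        unique.setdefault(record[key], record)
    return list(unique.values())


print(normalize_url("https://Example.com/products/blue-shirt/?utm_source=x&variant=42#reviews"))
```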

For blocking, slow down requests before adding complexity. Use retries with backoff, realistic headers, session handling, and proxy rotation when needed.
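
Retries with backoff can be sketched as below. The default retry count and delays are arbitrary starting points, not recommendations from any particular store. In real code you would usually catch only network or HTTP errors rather than every exception.

```python
import random
import time


def retry_with_backoff(func, retries=4, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(), retrying on failure with exponential backoff plus jitter.

    The delay doubles after each failed attempt, capped at max_delay.
    The last error is re-raised once the retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


# Usage (not executed here), with fetch_page standing in for any
# hypothetical function that performs one request:
# products = retry_with_backoff(lambda: fetch_page("https://<domain-name>", 1))
```

The `sleep` parameter exists so the waiting can be replaced in tests or routed through a shared rate limiter.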

Step 7: Export and clean the dataset

Export the final dataset to CSV, JSON, or a database. JSON is useful when you want to preserve variants and nested images. CSV is easier for spreadsheets but can flatten complex product structures.
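
The difference between the two formats is easy to see in a short sketch. It assumes each cleaned record has `handle`, `title`, a list of image URL strings under `images`, and a list of variant dicts under `variants`. That layout is chosen for this example, not dictated by Shopify.

```python
import csv
import io
import json

CSV_FIELDS = ["handle", "title", "sku", "price", "image_urls"]


def to_json(products):
    """Serialize products as-is, keeping nested variants and images."""
    return json.dumps(products, indent=2, ensure_ascii=False)


def to_csv(products):
    """Flatten to one row per variant; image URLs are joined into one cell."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=CSV_FIELDS)
    writer.writeheader()
    for product in products:
        images = " | ".join(product.get("images", []))
        # Products without variants still get one row with empty variant cells.
        for variant in product.get("variants") or [{}]:
            writer.writerow({
                "handle": product.get("handle", ""),
                "title": product.get("title", ""),
                "sku": variant.get("sku", ""),
                "price": variant.get("price", ""),
                "image_urls": images,
            })
    return out.getvalue()
```

The CSV output repeats product-level values on every variant row, which is the flattening cost mentioned above. The JSON output round-trips the nested structure unchanged.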

Clean HTML from descriptions, normalize prices, standardize availability values, and validate URLs. Also record the scrape date, source domain, and extraction method so you can compare changes over time.
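
A few of these cleanup steps are sketched below with the standard library. The availability labels and their mapping are assumptions to adapt per store. The price parser assumes a dot as the decimal separator, so prices formatted with a decimal comma would need different handling.

```python
import html
import re
from decimal import Decimal, InvalidOperation


def strip_html(text):
    """Remove tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text or "")
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def normalize_price(value):
    """Parse a price like "$1,299.00" into a Decimal, or None if unparseable."""
    if value is None:
        return None
    cleaned = re.sub(r"[^\d.]", "", str(value))
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


AVAILABILITY = {
    "available": "in_stock",
    "in stock": "in_stock",
    "low stock": "low_stock",
    "sold out": "out_of_stock",
}


def normalize_availability(value):
    """Map booleans and common storefront labels onto a fixed vocabulary."""
    if isinstance(value, bool):
        return "in_stock" if value else "out_of_stock"
    return AVAILABILITY.get(str(value).strip().lower(), "unknown")


print(strip_html("<p>Soft &amp; warm</p><br>100% cotton"))
print(normalize_price("$1,299.00"))
print(normalize_availability("Sold Out"))
```

A simple regex is enough for tidying descriptions into plain text, but it is not a general HTML sanitizer.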

Common problems when scraping Shopify stores

Even though Shopify stores are relatively predictable, you will still run into edge cases. Plan for them before running a large job.

Missing or inconsistent fields

Endpoint data does not always expose everything you need. One store may include vendor and tags, while another may leave fields blank. Some products may have complete variant data, while others may have limited options.

Create defaults for missing fields and keep raw responses for debugging. This helps you fix mapping errors without scraping the same pages again.

Pagination and duplicate issues

Large catalogs can produce pagination mistakes, especially if products are added or removed while your scraper is running. Duplicate records also happen when the same product appears in multiple category pages.

Use stable identifiers whenever possible. Product IDs, handles, and canonical URLs are better than page position or visible title alone.

Blocking and throttling

CAPTCHA prompts, request limits, and unstable sessions can interrupt extraction. Reduce concurrency, add delays, and avoid hammering the same endpoint. For recurring jobs, monitor error rates and rotate IPs before blocks become widespread.

If a store uses strong anti-bot protection, an API-only approach may fail, and browser automation alone may not solve the issue. You may need better session management, higher-quality proxies, or a smaller request volume.

Legal and ethical considerations

Scraping legality depends on what you collect, how you collect it, where you operate, and how you use the data. Public product pages are different from private customer accounts, checkout pages, or admin areas. Avoid anything behind login, payment, or access controls unless you have permission.

Always review robots.txt and the site’s terms. Robots.txt is not a law by itself, but it communicates crawling preferences and should be respected. Also, keep the request volume reasonable. Even public data collection can create risk if it disrupts a site or copies protected content at scale.
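
Checking a URL against robots.txt can be done with Python's standard `urllib.robotparser` module, as in the sketch below. The robots.txt content and the user agent name are made up for the example. In a real job you would download the store's own /robots.txt first.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent):
    """Return a function that reports whether user_agent may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Hand-written rules for illustration only.
sample_robots = """User-agent: *
Disallow: /cart
Disallow: /checkout
"""

allowed = build_robots_checker(sample_robots, "catalog-research-bot")
print(allowed("https://example.com/products/blue-shirt"))
print(allowed("https://example.com/checkout"))
```

Running the check before each request, or once per URL when building the crawl queue, keeps disallowed paths out of the job entirely.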

Common risks to avoid include collecting personal data, bypassing security controls, ignoring cease-and-desist requests, and republishing copyrighted descriptions or images without rights. When in doubt, get legal advice before launching a large Shopify scraping project.

Best practices for scraping Shopify stores

Start with public endpoints before rendering pages. Endpoint data is faster, cleaner, and easier to validate than browser output.

Use sitemaps for discovery. They help you find product pages without aggressive crawling.

Rotate proxies when scraping at scale. This is especially important for recurring monitoring, multi-store jobs, and high-volume product listings.

Respect rate limits. Lower concurrency is usually better than constant failures and retries.

Normalize variants and duplicates. Store products, variants, URLs, and availability in a consistent format.

Monitor for structure changes over time. Shopify themes, apps, and page templates can change without warning, so build logging and validation into your workflow.

Conclusion

There are several ways to collect Shopify data, and the best one depends on your target store, data needs, and scale.

Start with the products endpoint when it is available because it returns clean fields and can quickly provide core product data. Use sitemaps to discover URLs, HTML parsing for missing fields, and browser automation for JavaScript-heavy pages.

When you need to scrape Shopify stores regularly or across many domains, blocking risk becomes part of the project. Good request pacing, clean sessions, and reliable proxies help keep large-scale extraction stable.

With the right workflow, you can collect accurate Shopify data while keeping your scraper efficient, organized, and easier to maintain.

FAQ

What is products.json in Shopify?

products.json is a public storefront endpoint that often returns Shopify product information in JSON format, including titles, handles, variants, prices, and images.

Does every Shopify store have a products.json endpoint?

No. Many Shopify stores expose it, but some restrict it, customize behavior, or return incomplete data.

Do you need proxies to scrape Shopify stores?

Not always. Small tests may work without proxies, but proxies help when scraping many pages, many stores, or recurring monitoring jobs.

Can Shopify stores block scrapers?

Yes. Stores can use rate limits, firewalls, CAPTCHAs, apps, or custom rules to block automated traffic.

Is it legal to scrape Shopify stores?

Scraping public data may be legal in many cases, but it depends on the data, method, location, and use. Respect robots.txt, terms, and privacy rules.
