Build An Amazon Products Scraper Updated

How To Architect Our Amazon Product Scraper

How we design our Amazon product scraper is going to heavily depend on:

  • The use case for scraping this data.
  • What data we want to extract from Amazon.
  • How often we want to extract data.
  • How much data we want to extract.
  • Your technical sophistication.

How you answer these questions will change what type of scraping architecture we build.

For this Amazon scraper example we will assume the following:

  • Objective: The objective for this scraping system is to monitor product rankings for our target keywords and monitor the individual products every day.
  • Required Data: We want to store the product rankings for each keyword and the essential product data (price, reviews, etc.)
  • Scale: This will be a relatively small-scale scraping process (a handful of keywords), so there is no need to design a more sophisticated infrastructure.
  • Data Storage: To keep things simple for the example we will store the data in a CSV file, but provide examples of how to store it in MySQL & Postgres DBs.

To do this, we will design a Scrapy spider that combines both a product discovery crawler and a product data scraper.

As the spider runs, it will crawl Amazon’s product search pages, extract the product URLs and then send them to the product scraper via a callback. The scraped data is then saved to a CSV file via Scrapy Feed Exports.

The advantage of this scraping architecture is that it is pretty simple to build and completely self-contained.
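As a rough illustration of the CSV storage step, here is a minimal sketch of a Feed Exports configuration. It assumes the setting lives in the project's settings.py, and the data/ output folder is only an example path, not something this guide prescribes.

```python
# settings.py (sketch; the output path below is an assumption)
# FEEDS maps an output URI to its export options. Scrapy replaces
# %(name)s with the spider name and %(time)s with a timestamp taken
# when the feed is created, so each run writes to a new file.
FEEDS = {
    "data/%(name)s_%(time)s.csv": {
        "format": "csv",
    },
}
```

For a one-off run, the same result can usually be had from the command line by passing an output file with the -o option of scrapy crawl.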


How To Build an Amazon Product Crawler

The first part of scraping Amazon is designing a web crawler that will build a list of product URLs for our product scraper to scrape.

Step 1: Understand Amazon Search Pages

With Amazon.com, the easiest way to do this is to build a Scrapy crawler that uses the Amazon search pages, which return up to 20 products per page.

For example, here is how we would get search results for iPads.


'https://www.amazon.com/s?k=iPads&page=1'

This URL contains a number of parameters that we will explain:

  • k stands for the search keyword. In our case, k=iPads. Note: If you want to search for a keyword that contains spaces or special characters, remember that you need to URL-encode this value.
  • page stands for the page number. In our case, we’ve requested page=1.

Using these parameters we can query the Amazon search endpoint to start building a list of URLs to scrape.
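A minimal sketch of how these search URLs could be built in Python is shown here. The helper name build_search_url is hypothetical, and the encoding is handled by urlencode from the standard library rather than by anything Amazon-specific.

```python
from urllib.parse import urlencode


def build_search_url(keyword: str, page: int = 1) -> str:
    """Return an Amazon search URL for a keyword and page number."""
    # urlencode percent-encodes special characters and turns spaces into "+"
    query = urlencode({"k": keyword, "page": page})
    return f"https://www.amazon.com/s?{query}"


print(build_search_url("iPads"))
# https://www.amazon.com/s?k=iPads&page=1
print(build_search_url("ipad pro 11 inch", page=2))
# https://www.amazon.com/s?k=ipad+pro+11+inch&page=2
```

Looping over a list of keywords and page numbers with a helper like this gives the crawler its starting URLs.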

To extract product URLs (or ASIN codes) from this page, we need to loop through every product on the page, extract the relative URL to the product, and then either create an absolute product URL or extract the ASIN.
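The last step, turning a relative URL into an absolute URL or an ASIN, can be sketched with the standard library alone. This sketch assumes the product link contains a /dp/<ASIN> segment with a 10-character ASIN. The helper names and the sample URL are made up for illustration, and links in other formats would need their own pattern.

```python
import re
from urllib.parse import urljoin

# Assumption: product links look like ".../dp/<ASIN>/...",
# where the ASIN is 10 uppercase letters or digits.
ASIN_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})")


def to_absolute_url(relative_url: str) -> str:
    """Join a relative product URL onto the Amazon.com domain."""
    return urljoin("https://www.amazon.com", relative_url)


def extract_asin(url: str):
    """Return the ASIN found in a product URL, or None if there is none."""
    match = ASIN_PATTERN.search(url)
    return match.group(1) if match else None


# Made-up relative URL, standing in for one taken from a search result
relative_url = "/Example-Product-Name/dp/B0EXAMPLE1/ref=sr_1_1"
print(to_absolute_url(relative_url))
# https://www.amazon.com/Example-Product-Name/dp/B0EXAMPLE1/ref=sr_1_1
print(extract_asin(relative_url))
# B0EXAMPLE1
```

Storing the ASIN rather than the full URL keeps the product list compact, since search result URLs often carry extra tracking segments after the ASIN.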

 

 
