How to scrape Amazon product data without getting blocked: a working Camoufox guide
Scrape Amazon product data with Camoufox + residential proxies. Working code, named failure modes, and when this approach is the wrong choice.
Amazon doesn't use Cloudflare. They don't use DataDome, Akamai, PerimeterX, or Kasada. They run their own in-house anti-bot stack, and it's tuned specifically against the patterns scrapers produce on product pages. This is why generic "bypass Cloudflare" advice doesn't transfer — you're fighting a different system.
Here's what you'll get from this post: the exact tier we use in production (Camoufox + residential proxy), a working Python code example you can run today, why [TODO: insert production stat from logs]% of attempts succeed where naive requests fails within seconds, and the cost and ToS trade-offs you need to know before deploying this to scale.
A note before we start: Amazon's terms of service prohibit automated access to product pages. This post is technical reference, not legal advice. Talk to a lawyer before running this in production at any meaningful volume.
Why Amazon's anti-bot system blocks most scraping attempts
Amazon's detection runs in three layers, and naive scrapers trip all three before they finish loading the first page.
Layer 1: TLS fingerprinting. When your HTTP client opens a connection to www.amazon.com, the TLS ClientHello message is inspected before any HTTP headers are read. Python requests (via urllib3) produces a JA4 fingerprint that does not match any real browser. Amazon's edge sees t13d[TODO: insert production stat from logs] and routes the request to their bot path, which returns either a 503, a CAPTCHA interstitial, or a stripped HTML response with no product data. Your User-Agent header is irrelevant; the decision happens before headers are parsed.
Layer 2: Request timing and ordering. Amazon's product pages load a specific sequence of resources: HTML, then a sequence of XHRs to /api/marketplaces, /gp/, and image CDNs. A scraper that fetches only the HTML produces a timing pattern (one request, no follow-ups, no asset loads) that scores as bot within the first 2-3 requests from the same session.
Layer 3: Session fingerprint isolation. Amazon ties detection to a session bundle: TLS fingerprint + cookie state + IP geolocation + Accept-Language + timing variance. Rotating IPs alone does not work because the rest of the bundle stays constant. We've seen sessions get blocked after [TODO: insert production stat from logs] requests on the same residential IP, and within [TODO: insert production stat from logs] requests when the IP rotates but the browser fingerprint doesn't.
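To make the bundle idea concrete, here is a toy sketch (not Amazon's actual scoring, which is not public): model the session as a bundle of signals and hash everything except the IP. Rotating the IP leaves that hash, and therefore the linkable identity, unchanged.

```python
import hashlib
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SessionBundle:
    """Illustrative model of the signals Amazon is believed to correlate."""
    tls_ja4: str
    accept_language: str
    ip: str
    timing_profile: str

    def fingerprint(self) -> str:
        # Everything EXCEPT the IP still identifies the session.
        stable = f"{self.tls_ja4}|{self.accept_language}|{self.timing_profile}"
        return hashlib.sha256(stable.encode()).hexdigest()[:16]

bundle = SessionBundle("t13d_example", "en-US", "203.0.113.7", "uniform-50ms")
rotated = replace(bundle, ip="198.51.100.9")  # new IP, same everything else

# The non-IP fingerprint is unchanged, so detection can still link the sessions.
assert bundle.fingerprint() == rotated.fingerprint()
```

This is why the fix in Layer 3 failures is always "rotate the whole bundle," not just the exit IP.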
The naive baseline: a Python script using requests with a Chrome User-Agent gets 403 or a CAPTCHA page on roughly [TODO: insert production stat from logs]% of product page hits within the first minute. Headless Chrome via Puppeteer with default settings does slightly better on the first request and worse on the fifth, because Amazon catches navigator.webdriver and the patched-stealth fingerprints that ship with puppeteer-extra-plugin-stealth.
This is why the rest of the post is about Camoufox specifically. Puppeteer-stealth, Playwright-stealth, and Selenium with undetected-chromedriver all fail on Amazon's TLS layer or get caught by behavioral scoring within a short session. We've tested them.
The Camoufox + residential proxy approach
Camoufox is a patched Firefox build that handles fingerprint resistance at the C++ level rather than via JavaScript injection. The practical difference: when a detection script asks the browser for navigator.webdriver, screen.width, or canvas rendering output, Camoufox returns values that don't reveal automation. Puppeteer-stealth patches these properties via JS, which leaves a trail in the prototype chain. Camoufox doesn't.
For Amazon specifically, two Camoufox properties matter:
- TLS fingerprint matches a real Firefox build. The JA4 hash is identical to what a real user running Firefox 133 on macOS would produce. Layer 1 of Amazon's detection passes.
- Real rendering pipeline with realistic timing. Camoufox loads page assets, runs the JS event loop, and produces XHR timing that matches a human-driven browser. Layer 2 passes.
Layer 3 (session fingerprint) is where residential proxies come in. Datacenter IPs (AWS, Azure, GCP, OVH) are flagged in Amazon's IP reputation database. We've seen brand-new datacenter IPs get blocked within [TODO: insert production stat from logs] requests. Residential IPs from providers like Bright Data, Oxylabs, or SOAX route through real consumer ISPs (Comcast, Spectrum, Vodafone) and clear Amazon's IP scoring on most requests.
Production data from our last quarter: Camoufox + residential proxy hits Amazon product pages successfully [TODO: insert production stat from logs]% of the time, with an average page load of [TODO: insert production stat from logs] seconds. "Successfully" means: the page loads, the JSON-LD Product schema is present in the HTML, and we don't hit a 403, 429, or CAPTCHA interstitial.
Two caveats on these numbers, and we say this clearly:
- Amazon rotates detection rules. The numbers above are from the past quarter. By the time you read this, they may be 5-10 points lower or higher. Build monitoring (we cover this at the end).
- The other [TODO: insert production stat from logs]% of requests fail. You need retry logic and proxy rotation between attempts, not a single-shot scraper.
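A minimal retry skeleton for that failure budget, with hypothetical `fetch` and `rotate_proxy` callables standing in for the Camoufox scrape and your proxy provider's rotation hook:

```python
import random
import time

def scrape_with_retries(fetch, rotate_proxy, max_attempts=4, base_delay=5.0):
    """Retry wrapper: rotate the proxy session and back off between attempts."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise                  # out of attempts: surface the error
            rotate_proxy()             # fresh proxy session before retrying
            # Exponential backoff with jitter so retries don't land in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Demo with stand-in callables (tiny delay so the example runs quickly).
calls = []
def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("CAPTCHA interstitial")
    return {"title": "ok"}

result = scrape_with_retries(flaky_fetch, rotate_proxy=lambda: None, base_delay=0.01)
```

In production, `base_delay` should be on the order of your normal inter-request sleep, not milliseconds.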
Working code: scraping a single Amazon product
This is a minimal Python example using Camoufox. It loads a product page, extracts the JSON-LD schema, and handles the common error cases. Run it locally with `pip install "camoufox[geoip]"` (the quotes stop your shell from globbing the brackets), fetch the browser binary with `python -m camoufox fetch`, and supply a residential proxy from any major provider.
```python
import json
import re
import time

from camoufox.sync_api import Camoufox
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

PROXY = {
    "server": "http://gate.smartproxy.com:7000",
    "username": "your_proxy_user",
    "password": "your_proxy_password",
}

PRODUCT_URL = "https://www.amazon.com/dp/B0CHX1W1XY"


def extract_jsonld_product(html: str) -> dict | None:
    # JSON-LD is more stable than HTML element selectors.
    # Amazon redesigns price/title CSS classes; schema.org markup
    # is tied to SEO and changes far less often.
    matches = re.findall(
        r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        html,
        flags=re.DOTALL,
    )
    for raw in matches:
        try:
            data = json.loads(raw.strip())
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data
    return None


def scrape_product(url: str) -> dict:
    with Camoufox(
        headless=True,
        proxy=PROXY,
        humanize=True,  # adds realistic timing jitter
        geoip=True,     # match locale to proxy IP geolocation
    ) as browser:
        page = browser.new_page()
        try:
            response = page.goto(url, wait_until="domcontentloaded", timeout=30000)
        except PlaywrightTimeoutError:
            # Camoufox drives Playwright under the hood; its timeout is NOT
            # the builtin TimeoutError, so catch the Playwright class.
            raise RuntimeError("Page load timeout — proxy slow or Amazon stalling")
        status = response.status if response else 0
        if status == 403:
            raise RuntimeError("403 Forbidden — likely TLS or IP reputation block")
        if status == 429:
            raise RuntimeError("429 Rate limited — slow down or rotate proxy")
        if status >= 500:
            raise RuntimeError(f"{status} from Amazon — retry with backoff")
        # Wait for product schema to render. Amazon ships JSON-LD in
        # initial HTML, but some PDPs hydrate it post-load.
        page.wait_for_selector('script[type="application/ld+json"]', timeout=10000)
        html = page.content()
        # Detect CAPTCHA interstitial — Amazon serves 200 OK with a
        # captcha page, so status code alone is not enough.
        if "api-services-support@amazon.com" in html or "captcha" in html.lower():
            raise RuntimeError("CAPTCHA interstitial — slow down requests")
        product = extract_jsonld_product(html)
        if not product:
            raise RuntimeError("No JSON-LD Product schema — page may be partial")
        return product


if __name__ == "__main__":
    data = scrape_product(PRODUCT_URL)
    # Note: on some PDPs "offers" is a list rather than a dict; this
    # minimal example assumes the single-dict case.
    print(json.dumps({
        "title": data.get("name"),
        "price": data.get("offers", {}).get("price"),
        "currency": data.get("offers", {}).get("priceCurrency"),
        "rating": data.get("aggregateRating", {}).get("ratingValue"),
        "reviews": data.get("aggregateRating", {}).get("reviewCount"),
    }, indent=2))
    time.sleep(15)  # space out requests; do not hammer
```

A typical successful output:
```json
{
  "title": "[TODO: insert production stat from logs]",
  "price": "[TODO: insert production stat from logs]",
  "currency": "USD",
  "rating": "[TODO: insert production stat from logs]",
  "reviews": "[TODO: insert production stat from logs]"
}
```

Three things in this code that are not optional:
- `humanize=True` adds timing jitter to mouse and keyboard events Camoufox emits during page load. Without it, page interaction timing is suspiciously uniform.
- `wait_until="domcontentloaded"` rather than `networkidle`. Amazon's product pages keep tracking pixels firing for 30+ seconds; waiting for network idle will time out.
- Sleep between requests. [TODO: insert production stat from logs] seconds is the minimum we've found stable. Tighter intervals trigger the rate-limit layer.
Common errors and how to fix them
These are ranked by frequency in our production logs.
Error 1: HTTP 403 on the very first request. Cause: your proxy IP is a datacenter IP, or Camoufox isn't actually being used (you ran the script without the Camoufox context manager). Fix: verify your proxy is residential by hitting https://ipinfo.io/json through it and checking the org field — it should name a consumer ISP, not "Amazon AWS" or "Hetzner". Verify Camoufox is launching by setting headless=False once and watching the browser open.
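The ipinfo check can be automated. A small helper, with an illustrative (not exhaustive) list of hosting-provider markers to match against the `org` field:

```python
# Hypothetical helper: classify the `org` field returned by https://ipinfo.io/json.
DATACENTER_MARKERS = (
    "amazon", "aws", "google", "microsoft", "azure",
    "hetzner", "ovh", "digitalocean", "linode", "vultr",
)

def looks_residential(org: str) -> bool:
    """True if the proxy exit org does NOT look like a hosting provider."""
    org_lower = org.lower()
    return not any(marker in org_lower for marker in DATACENTER_MARKERS)

assert not looks_residential("AS16509 Amazon.com, Inc.")   # datacenter
assert looks_residential("AS7922 Comcast Cable Communications, LLC")
```

Run this against every new proxy endpoint before pointing it at Amazon; it is far cheaper than burning the IP on a 403.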
Error 2: CAPTCHA page (HTTP 200 with api-services-support@amazon.com in the body). Cause: detection triggered despite Camoufox — usually because requests are too fast or the proxy IP has been used by other scrapers recently. Fix: increase the sleep between requests to [TODO: insert production stat from logs] seconds, add randomization (time.sleep(15 + random.uniform(0, 10))), and rotate the proxy session ID every [TODO: insert production stat from logs] requests.
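Both fixes in sketch form. The `user-session-<id>` username scheme and the rotation interval are illustrative, so check your provider's docs for the exact format:

```python
import random
import time

def polite_delay(base: float = 15.0, jitter: float = 10.0) -> float:
    """Sleep base + uniform jitter so request spacing isn't a fixed interval."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

def session_username(user: str, request_index: int, rotate_every: int = 20) -> str:
    """Embed a session ID in the proxy username (the `user-session-<id>`
    convention many providers use; rotate_every=20 is a placeholder)."""
    return f"{user}-session-{request_index // rotate_every}"

# Requests 0-19 share one session, 20-39 the next, and so on.
assert session_username("scraper", 0) == "scraper-session-0"
assert session_username("scraper", 25) == "scraper-session-1"
```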
Error 3: JSON-LD schema missing. Cause: Amazon served a partial page, often during a page redesign rollout or for a non-standard product type (Kindle books, AmazonFresh items). Fix: fall back to parsing the price from #corePriceDisplay_desktop_feature_div and the title from #productTitle. Don't fail the whole scrape — log the URL and degrade gracefully.
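A sketch of that fallback. The `a-offscreen` price span is an assumption about current Amazon markup, and a production version should use an HTML parser rather than regex:

```python
import re

def fallback_extract(html: str) -> dict:
    """Selector-based fallback when JSON-LD is missing; uses the IDs named above."""
    title = re.search(r'id="productTitle"[^>]*>\s*(.*?)\s*<', html, re.DOTALL)
    price = re.search(
        r'id="corePriceDisplay_desktop_feature_div".*?class="a-offscreen">([^<]+)<',
        html, re.DOTALL,
    )
    return {
        "title": title.group(1) if title else None,
        "price": price.group(1) if price else None,
    }

# Minimal sample markup mimicking the structure, for illustration only.
sample = (
    '<span id="productTitle"> Echo Dot </span>'
    '<div id="corePriceDisplay_desktop_feature_div">'
    '<span class="a-offscreen">$49.99</span></div>'
)
```

Log every URL that falls through to this path; a rising fallback rate is an early signal of a page redesign.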
Error 4: Proxy auth failure (407). Cause: incorrect proxy credentials, expired subscription, or wrong endpoint format. Fix: test the proxy outside Camoufox first with curl -x http://user:pass@proxy:port https://api.ipify.org. If that fails, the proxy itself is broken; fix that before debugging Camoufox.
Error 5: Session gets blocked after [TODO: insert production stat from logs] requests. Cause: Amazon tied your requests to a session fingerprint bundle and the entire bundle is now flagged. Fix: rotate the proxy session ID (most providers expose a session parameter in the username, like user-session-abc123), restart the Camoufox browser instance to get fresh storage, and clear cookies between batches. Don't try to "recover" a flagged session — it's gone.
Debugging checklist when nothing works:
- Take a screenshot inside Camoufox (`page.screenshot(path="debug.png")`) and look at what Amazon actually rendered.
- Print `response.headers` to see if Amazon returned `x-amz-rid` or specific bot-flag headers.
- Verify your proxy IP geolocation matches the Amazon domain (`amazon.com` expects US IPs; `amazon.de` expects German IPs).
- Try the same URL in a real browser from your laptop. If that also gets a CAPTCHA, your scraping IPs aren't the problem — Amazon's flag is broader.
Where this approach breaks down
Cost reality: a Camoufox + residential proxy request costs roughly $[TODO: insert production stat from logs] per page in raw infrastructure (proxy bandwidth + compute for a Firefox instance). At [TODO: insert production stat from logs]% success, your effective cost per successful page is higher, because failures still consume bandwidth. Compare this to Amazon's official Product Advertising API, which is free for approved Amazon Associates and returns structured data without any of this complexity.
The honest decision rule: if you can use the official API, use it. The Product Advertising API has rate limits, requires approval, and doesn't cover every product field — but for price, title, image, and ASIN-based lookups, it's the right call. Scraping is justified when:
- You need fields the API doesn't return (full review text, Q&A sections, "frequently bought together" data).
- You're not eligible for the API (no qualifying Associates account, or your use case violates their terms).
- You need historical price tracking at intervals the API rate-limit doesn't support.
Scale ceiling: at 10,000 product pages per day with [TODO: insert production stat from logs]% success, expect [TODO: insert production stat from logs]+ retries daily. Plan your queue depth and proxy budget accordingly. Above 100,000 pages/day, the per-request cost becomes prohibitive and direct data partnerships (Keepa, Rainforest API, or licensed Amazon datasets) are usually cheaper than running this yourself.
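The arithmetic behind those numbers is simple enough to keep in a helper (the rates and costs below are illustrative placeholders, not our production figures):

```python
def effective_cost(raw_cost_per_request: float, success_rate: float) -> float:
    """Cost per SUCCESSFUL page: failed requests still burn bandwidth and compute."""
    return raw_cost_per_request / success_rate

def expected_retries(pages_per_day: int, success_rate: float) -> float:
    """Extra requests needed per day, assuming independent failures."""
    return pages_per_day * (1 / success_rate - 1)

# Illustrative: $0.01/request at an 80% success rate.
assert abs(effective_cost(0.01, 0.8) - 0.0125) < 1e-9
assert abs(expected_retries(10_000, 0.8) - 2_500) < 1e-9
```

Plug in your own logged success rate before sizing the proxy budget; the gap between raw and effective cost widens fast as the rate drops.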
Detection rotation: Amazon updates anti-bot rules on roughly a [TODO: insert production stat from logs]-day cycle. Numbers in this post are accurate as of [TODO: insert date of last production validation]; by next quarter they may be different. Build monitoring before you build scale.
Jurisdictional and ToS risk: Amazon's terms of service explicitly prohibit automated data access without prior written agreement. Some jurisdictions (US under the CFAA, EU under the Digital Services Act) treat scraping public product data differently than scraping authenticated content. Get legal review before production deployment, especially if you're a company rather than an individual researcher.
Monitoring and maintenance
A scraper that works today and silently fails next month is worse than no scraper at all, because you'll make business decisions on stale data. The minimum monitoring to put in place:
Success-rate tracking. Log every request: timestamp, URL, status code, time taken, proxy IP, success/failure, error reason. Aggregate hourly. Alert if your rolling [TODO: insert production stat from logs]-hour success rate drops below [TODO: insert production stat from logs]%. Don't alert on individual failures — they're normal.
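A minimal in-process version of that tracker; the window size, alert threshold, and minimum sample count are illustrative defaults:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class SuccessMonitor:
    """Rolling success-rate tracker with an aggregate-only alert rule."""
    window: int = 500
    threshold: float = 0.7
    results: deque = field(default_factory=deque)

    def record(self, ok: bool) -> None:
        self.results.append(ok)
        if len(self.results) > self.window:
            self.results.popleft()

    def rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 1.0

    def should_alert(self) -> bool:
        # Alert on the aggregate only, never on individual failures.
        return len(self.results) >= 50 and self.rate() < self.threshold

mon = SuccessMonitor()
for _ in range(40):
    mon.record(True)
for _ in range(60):
    mon.record(False)   # sustained failure run drags the rolling rate down
```

In production this state would live in your metrics store rather than process memory, but the alert rule is the same.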
Exponential backoff with circuit breaker. When success rate drops below threshold, stop sending traffic for [TODO: insert production stat from logs] minutes. Hammering Amazon during a detection update gets your IP pool burned faster than backing off and waiting for the rules to settle.
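A sketch of the breaker; the failure count and cooldown are illustrative, not tuned values:

```python
import time

class CircuitBreaker:
    """Stop sending traffic for `cooldown` seconds once failures pile up."""

    def __init__(self, max_failures: int = 5, cooldown: float = 600.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None   # half-open: let traffic resume and re-test
            self.failures = 0
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0       # any success resets the streak
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

cb = CircuitBreaker(max_failures=3, cooldown=600)
for _ in range(3):
    cb.record(False)    # three consecutive failures trip the breaker
```

While the breaker is open, queue work instead of dropping it; the backlog drains once the rules settle.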
Canary job. Pick [TODO: insert production stat from logs] stable, high-traffic ASINs (popular books, Echo devices — products that will exist for years). Scrape them on a fixed schedule, twice daily. When the canary fails on a product that was working yesterday, you know detection changed before your production scraper degrades. Diff the HTML response between yesterday and today to see what shifted.
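The diff step is a few lines with the standard library:

```python
import difflib

def diff_canary(yesterday_html: str, today_html: str, context: int = 2) -> list[str]:
    """Unified diff of two canary responses, to see what Amazon changed."""
    return list(difflib.unified_diff(
        yesterday_html.splitlines(),
        today_html.splitlines(),
        fromfile="yesterday", tofile="today", n=context,
    ))

# Toy inputs: the kind of shift a canary catches overnight.
old = "<html><title>Echo Dot</title><body>price block</body></html>"
new = "<html><title>Robot Check</title><body>captcha form</body></html>"
changes = diff_canary(old, new)
```

An empty diff on a failing canary is itself a signal: the page is identical but your parser broke, which points at your code rather than Amazon.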
Proxy provider A/B test. Don't lock into one residential proxy provider. Run [TODO: insert production stat from logs]% of traffic through provider A, [TODO: insert production stat from logs]% through provider B, and compare success rates monthly. Bright Data, Oxylabs, SOAX, and Smartproxy all have different IP pool quality and Amazon performs differently against each one over time.
Fallback decision rule. Define an escape hatch in code: if the rolling success rate for a product category drops below [TODO: insert production stat from logs]% for more than [TODO: insert production stat from logs] hours, automatically switch that category to the Product Advertising API (with reduced field coverage) and alert the team. Don't let your scraper die slowly — degrade to a working fallback, then fix the scraper offline.
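The decision rule itself can be a pure function so it is trivially testable; the rate floor and time window here are placeholders for your own thresholds:

```python
def choose_source(rolling_rate: float, hours_below: float,
                  rate_floor: float = 0.5, max_hours: float = 6.0) -> str:
    """Escape hatch: switch to the Product Advertising API once the scraper
    has been degraded for too long (reduced field coverage, but alive)."""
    if rolling_rate < rate_floor and hours_below > max_hours:
        return "product-advertising-api"
    return "camoufox-scraper"

assert choose_source(0.9, 0.0) == "camoufox-scraper"
assert choose_source(0.3, 8.0) == "product-advertising-api"
```

Keeping this as a pure function means the alerting job, the scraper, and your tests all share one definition of "degraded."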
If you don't want to maintain this stack yourself, DreamScrape's router handles Amazon traffic at the Camoufox tier with rotating residential proxies and automatic retry. Per-domain success rates are published at /intel/amazon.com, and the broader detection landscape is covered in Anti-bot detection in 2026: what actually blocks scrapers.
If you're scraping fewer than [TODO: insert production stat from logs] Amazon products per day and you can use the official Product Advertising API, use the API. If you need fields the API doesn't return, use the Camoufox approach above and budget for the cost and the maintenance.