Why 70% of Your Scraping Traffic Doesn't Need a Browser (2026)
Most scrapers default to headless Chrome for everything. We tested 60+ domains and found that roughly 70% of requests resolve at the cheapest tier — plain HTTP or JA4 fingerprinting. Here's the data, the reasoning, and what it means for how you build scraping infrastructure.
The default architecture for web scraping in 2026 is wrong. Most scraping setups — both indie scripts and commercial APIs — route every request through a headless browser. Playwright spins up. Chromium loads. JavaScript executes. The DOM renders. You wait 3-5 seconds. You pay browser-tier costs.
For roughly 70% of those requests, none of that was necessary. The site would have responded to a plain HTTP GET with the same content. The browser was overhead, not capability.
I've been running this experiment for the last few months across 60+ domains in production. The results are consistent: most "scraping" doesn't need scraping in the heavy sense at all. It needs an HTTP request with the right TLS fingerprint and the right headers. Below is what the data actually says, why the industry got here, and what it means for building scraping infrastructure that doesn't waste most of its spend.
How the Industry Defaulted to Browsers for Everything
The standard scraping advice for the last decade has been some version of: "if a site blocks you, use a browser." Cloudflare blocks your Python requests call? Use Selenium. Selenium gets detected? Use Playwright. Playwright stealth gets caught? Use undetected-chromedriver, then Puppeteer with stealth plugins, then Camoufox.
Each escalation step solves a real problem. None of them solve the more fundamental problem: you started at the wrong tier.
The industry default became "use a browser by default and hope it's enough." Commercial scraping APIs reinforced this — Firecrawl, ScrapingBee, ScrapFly, and others built their products around browser-first architectures because that's what reliably works for the hardest cases. They priced accordingly. The user pays browser-tier prices for every request, regardless of whether the site needed a browser.
The cost difference matters. A typical browser-tier scrape costs roughly:
- 3-5 seconds of latency (vs ~200ms for HTTP)
- 200-500MB of memory per browser session (vs ~5MB for an HTTP client)
- 10-50x the CPU time
- 20-50x the per-request cost in commercial APIs
If you're scraping at any volume, paying browser prices for sites that respond to plain HTTP is the largest unforced error in scraping infrastructure.
What I Actually Tested
Across 60+ domains in DreamScrape's production traffic, I categorized each domain by which engine tier successfully returned valid content:
Tier 0 (HTTP / JA4): Plain HTTP request, optionally with TLS fingerprint impersonation via curl_cffi. Cost: ~$0.0001 per request. Latency: ~200ms.
Tier 1 (legacy JA3): curl-impersonate with older fingerprint scheme. Mostly subsumed by Tier 0 now.
Tier 2 (Stealth Playwright): Headless Chromium with playwright-extra and stealth plugins. Cost: ~$0.0002 per request (fixed VPS cost), but consumes seconds of compute. Latency: 3-5 seconds.
Tier 3 (Camoufox): Patched Firefox with fingerprint spoofing applied at the C++ level, driven through a Python subprocess. Cost: ~$0.0002 per request but with the highest compute footprint. Latency: 8-12 seconds.
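For concreteness, the ladder above can be written down as data, with a helper that yields the escalation order a router would walk. Tier names are my shorthand, and Tier 1 is folded into Tier 0 since it's mostly subsumed:

```python
# The engine ladder, cheapest-first, using the rough per-tier cost and
# latency figures from the tier descriptions above.
TIERS = [
    ("http_ja4",   0.0001, 0.2),   # (name, ~$ per request, ~latency in s)
    ("playwright", 0.0002, 4.0),
    ("camoufox",   0.0002, 10.0),
]

def escalation_order(start: str = "http_ja4") -> list[str]:
    """Tiers left to try, starting from `start` (e.g. a cached routing hit)."""
    names = [name for name, _, _ in TIERS]
    return names[names.index(start):]
```

A router starts at the front of this list for an unknown domain and skips ahead once it knows which tier a domain actually needs.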
The distribution across the 60+ tested domains:
- ~70% resolve at Tier 0 (HTTP or JA4)
- ~15% require Tier 2 (Stealth Playwright)
- ~10% require Tier 3 (Camoufox)
- ~5% can't be reliably scraped without residential proxies + session warming (Reddit, Etsy, etc.)
You can verify this distribution yourself on the DreamScrape Intel Database — every domain shows its actual routing tier and success rate from real traffic, not synthetic benchmarks.
Why So Many Sites Don't Actually Need a Browser
The mental model most people have is: "modern web is JavaScript-heavy, so most sites need a browser to render." That's true for the user-facing version of those sites. It's often false for the scrapeable version.
Three patterns explain why:
1. Most "JavaScript-heavy" sites still ship server-rendered HTML for SEO. The Next.js SSR boom over the last few years means a huge share of sites that look like SPAs in the browser actually serve fully-rendered HTML to crawlers and to plain HTTP requests. Google needs to index them, so they render server-side. Your scraper benefits from the same architecture.
2. The "protection" on many Cloudflare-fronted sites is just TLS fingerprinting. A meaningful share of sites that block Python's requests and urllib3 aren't running JavaScript challenges at all. They're checking the TLS handshake fingerprint. If you send the right TLS handshake — which curl_cffi does cheaply, without a browser — you're past the gate. I covered the mechanics of this in detail in How to Bypass Cloudflare Without a Browser Using JA4 TLS Fingerprinting.
3. Many sites expose internal APIs that the page itself uses. A page might render via React after fetching JSON from an internal endpoint. If you discover that endpoint, you can hit it directly with plain HTTP, bypass the browser entirely, and get cleaner data than you'd extract from the DOM. This is the basis of API Discovery, which I'll cover in a future post.
The combined effect of these three patterns: the share of sites that genuinely require a JavaScript runtime for scraping is much smaller than the industry assumes.
The Cost of the Default
Let's run the math on what the browser-first default costs in practice.
Suppose you're scraping 100,000 pages per month. Your traffic mix follows the typical distribution: 70% of sites would respond to HTTP, 15% need a stealth browser, 10% need anti-detect Firefox, and 5% are unscrapeable without proxies.
Browser-first approach (commercial API charging $0.001 per request, no tier discount):
- 100,000 requests × $0.001 = $100 per month
- All requests take 3-5 seconds
- Total compute time: ~110 hours per month (100,000 requests × ~4 seconds)
Tiered approach (router picks the cheapest engine that works, credits scale with engine tier):
- 70,000 HTTP/JA4 requests × $0.0001 = $7
- 15,000 Stealth Playwright requests × $0.0005 = $7.50
- 10,000 Camoufox requests × $0.001 = $10
- 5,000 fail or escalate to proxy = ~$10
- Total: ~$35 per month
- Most requests take ~200ms; only the 25% that need browsers take 3-10 seconds
- Total compute time: ~50 hours per month
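A quick sanity check of the arithmetic (prices are the illustrative credit prices used in this post, not any vendor's published rates):

```python
# Monthly cost under the assumed 70/15/10/5 traffic mix at 100k requests.
tiered = (
    70_000 * 0.0001    # HTTP/JA4
    + 15_000 * 0.0005  # Stealth Playwright
    + 10_000 * 0.001   # Camoufox
    + 10.0             # ~$10 fail/escalate-to-proxy bucket
)
browser_first = 100_000 * 0.001  # flat $0.001 per request, every request

print(f"tiered=${tiered:.2f}, browser-first=${browser_first:.2f}, "
      f"savings={1 - tiered / browser_first:.1%}")
```

The same two lines rerun with your own traffic mix tell you whether routing is worth building for your workload.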
That's 65% cost savings, and the median request drops from several seconds to ~200ms, just from picking the right engine per site. The savings compound at scale: at 1M requests per month, the gap is $1,000 vs $350. At 10M requests per month, it's $10,000 vs $3,500.
This is what credit-based scraping pricing exists to capture. The router does the picking; you pay for the actual compute consumed, not for the worst case applied to every request.
The Counterargument, and Why It's Mostly Wrong
The strongest argument for browser-first is reliability. The thinking goes: "I don't know which sites will need a browser ahead of time. If I try HTTP first and it fails, I have to retry. Just using a browser from the start means one request, one success."
This argument was correct in 2018. It stopped being correct around 2022.
The reasons:
Routing tables solve the discovery problem. A new site might require trying multiple tiers on the first request. Every subsequent request goes straight to the tier that worked. After 5-10 requests per domain, the discovery cost is amortized to zero. You don't need to "know which sites need a browser" — the routing table figures it out and remembers.
HTTP failures are detectable and cheap. A 403 response, a JavaScript shell page (real HTML but no actual data), a redirect to a CAPTCHA — all of these are programmatically detectable in < 100ms. Detecting an HTTP failure and escalating to a browser costs essentially nothing. Starting with a browser costs 3-5 seconds even when it wasn't needed.
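What "programmatically detectable in < 100ms" looks like in practice: a few string and regex checks. The marker strings and the 200-character floor below are illustrative thresholds, not canonical values:

```python
import re

def looks_blocked(status: int, html: str) -> bool:
    """Cheap checks that a response is a block page or an empty JS shell
    rather than real content."""
    if status in (403, 429, 503):
        return True
    lowered = html.lower()
    if "cf-chl" in lowered or "captcha" in lowered:
        return True
    # JS shell: strip scripts and tags; if almost no text remains,
    # the "page" is a loader, not content.
    text = re.sub(r"<script\b.*?</script>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    return len(text.strip()) < 200
```

When this returns True, you escalate to the next tier; when it returns False, you just saved yourself a browser session.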
The "reliability" of browser-first is partially illusory. Sites that block a plain requests call will often also block headless Chrome with default settings. You need stealth plugins, fingerprint patches, sometimes residential proxies. The browser doesn't make detection go away; it shifts which detection layer matters. If you're using a browser without stealth, you're not actually more reliable than HTTP — you're just slower.
The honest version of the counterargument is: "I want to write the simplest possible code, and a single Playwright call is simpler than tier escalation logic." That's a real tradeoff for a one-off script. But if you're building scraping infrastructure that runs in production, the simplicity savings of browser-first are dwarfed by the cost overruns. Tier escalation logic is ~50 lines of code. The cost difference compounds forever.
What "70%" Looks Like in Practice
To make the abstract concrete, here's the actual tier distribution for some specific domains in our test set:
Tier 0 (HTTP/JA4) — works without a browser:
- news.ycombinator.com
- basketball-reference.com
- hltv.org
- pump.fun
- api.github.com
- example.com
- Most static news sites
- Most reference and stats databases
- Most public APIs
- Many e-commerce category pages
Tier 2 (Stealth Playwright) — needs a browser:
- Many SPA-heavy SaaS dashboards
- Sites with mandatory JS challenges
- Modern e-commerce product detail pages with JS-rendered prices
- Most search interfaces with JS-driven results
Tier 3 (Camoufox) — needs C-level anti-detect:
- zillow.com (detects Playwright specifically)
- Some real estate platforms
- Some travel booking sites
Currently failing — even Camoufox isn't enough:
- reddit.com (needs Camoufox + residential + session warming)
- etsy.com (CAPTCHA wall)
- dexscreener.com (JS challenge we haven't reverse-engineered)
You can search any specific domain in the Intel Database to see its current tier classification and success rate. The goal isn't to claim the test set is representative of every possible scraping use case — different traffic mixes will hit different distributions. The point is the order of magnitude. Most domains in any reasonable test set fall into Tier 0 or Tier 2, not Tier 3.
What This Means for How You Build Scraping Infrastructure
Three practical takeaways:
1. Stop defaulting to browsers. Build your scraping pipeline to try HTTP first, escalate on failure, and remember what worked. This is true whether you're rolling your own infrastructure or evaluating a commercial scraping API. If the API charges you the same per request regardless of whether it needed a browser, you're subsidizing other users' harder targets.
2. Invest in TLS fingerprinting at the HTTP layer. Roughly 30-40% of "protected" sites can be unblocked just by sending a real Chrome TLS fingerprint, no browser needed. The library to use is curl_cffi. The mechanism is documented in the JA4 fingerprinting post. The cost difference between adding impersonate="chrome131" to a request vs spinning up Chromium is roughly 50x in compute and 20x in latency.
3. Treat the routing decision itself as a first-class concern. The interesting part of scraping in 2026 isn't picking the best individual engine — it's deciding which engine to use for each target, learning from failures, and updating the routing as sites change their detection. This is the part that compounds. A static "use Playwright for everything" setup gets stale; a learning router gets better.
This is the architecture DreamScrape is built around. You can see the routing happening in real time in the Tier Race feature in the playground — fire HTTP, JA4, and stealth browser in parallel against any URL and see which one wins.
Frequently Asked Questions
Do I really not need Playwright for most scraping?
For most domains, no. Run a quick test: hit the site with plain curl, then with a TLS-impersonating client (curl-impersonate's wrapper scripts, or curl_cffi with impersonate="chrome" in Python; stock curl has no impersonation flag), and check whether the response contains the data you want. If yes, you don't need Playwright for that site. The pattern holds across more sites than the standard advice suggests.
When should I actually use a headless browser?
Use a headless browser when:
- The page genuinely renders content via JavaScript and the server-side HTML doesn't include what you need
- The site runs JavaScript challenges (Cloudflare Turnstile, hCaptcha) that require a real JS environment
- You need to interact with the page (clicks, form fills, infinite scroll)
- You need to capture screenshots or PDFs
Don't use a headless browser when:
- You can get the same data from plain HTML
- You can get the data from an internal API the page calls
- You're scraping a static site, news article, or reference page
What about sites that detect HTTP scrapers via headers, not TLS?
Header-based detection still happens but is much weaker than TLS-based detection. Sending a realistic User-Agent, Accept, Accept-Language, and Accept-Encoding goes a long way. curl_cffi handles this for you when you use an impersonate profile — it sends the headers that real Chrome would send. If you're using plain requests, copy the headers from your browser's developer tools.
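Here's what "copy the headers" amounts to in practice, with values taken from a recent desktop Chrome (treat the exact versions as illustrative):

```python
# Browser-like headers for plain `requests`. This fixes header-based
# detection only; TLS-level detection still requires curl_cffi or similar.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/avif,image/webp,*/*;q=0.8"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}
# usage: requests.get(url, headers=BROWSER_HEADERS, timeout=10)
```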
Will switching to HTTP-first break my scraping?
Not if you implement escalation. The pattern is:
- Try HTTP
- If you get a 403, a JS shell page, or a CAPTCHA, escalate to browser
- Cache which engine worked for this domain so you don't re-test next time
Done correctly, you get the cost benefits of HTTP-first without losing reliability. Done incorrectly (just trying HTTP and giving up on failure), you lose 25-30% of sites. Don't do it incorrectly.
Does this apply if I'm scraping just one site?
Less so. If you're scraping one site and it requires a browser, just use a browser. The 70% number matters when you're scraping many different sites and the average cost compounds. For a single-site scraper, the architecture decision reduces to "use the simplest thing that works for this site"; there's no routing benefit to capture.
What's the cheapest way to scrape at scale?
The cheapest way is whatever architecture matches your traffic mix. If your traffic is 90% Tier 0 sites, cheap HTTP scraping with JA4 fingerprinting is the right answer. If your traffic is 80% Tier 3 sites, you need browser infrastructure regardless of routing. The routing layer is what lets you pay for the actual mix instead of the worst case.
For most general-purpose scraping, a credit-based API like DreamScrape (where engines have different costs and you only pay for what's needed) will be significantly cheaper than flat-rate browser-tier APIs.
Where's the source data for the 70% number?
The DreamScrape Intel Database shows tier distribution across all tracked domains in real time. Right now (April 2026), 60+ domains are tracked. As the test set grows, the exact number will drift, but the order of magnitude has been stable since I started tracking — most domains land at Tier 0.
The Bottom Line
The browser-first default in scraping is a holdover from a time when TLS fingerprinting libraries weren't mature, routing was hard to implement, and per-site testing was the only way to figure out which engine worked. None of those conditions hold anymore.
In 2026, building scraping infrastructure that defaults to a browser for every request is like building a database that runs a full table scan for every query. It works. It's also leaving 60-70% of the available efficiency on the table.
The architecture that wins from here is: try cheap before you try expensive, escalate only as needed, and remember what worked. The 30% of sites that genuinely need a browser are real, and you should have browser infrastructure for them. But making them the default is a tax on the other 70%.
If you're building this yourself, the components are open source. curl_cffi for the JA4 layer, Playwright for the stealth tier, Camoufox for the C-level anti-detect, your own routing table for the decisions. If you'd rather not, DreamScrape does it for you — try the playground to see the routing in action.