
Python vs Node.js for web scraping in 2026: honest benchmark

Python vs Node scraping: benchmarks, library comparisons, and failure modes. Python wins on ecosystem; Node on throughput. 2026 data.

Curtis Vaughan · 12 min read

We ran 1000 concurrent requests through Python (asyncio + aiohttp, curl_cffi) and Node.js (undici, Playwright Cluster) on identical hardware. Node hit 611 req/s. Python hit 487 req/s. The 25% gap is real, reproducible, and almost never the thing that decides your scraping job.

This post walks through the actual numbers, the failure modes that show up at hour eight rather than minute one, and a decision tree that ignores tribal loyalty. The benchmark scope is HTTP concurrency, DOM parsing, and data extraction. Headless browser overhead is covered separately because including it inside an HTTP benchmark is the most common way these comparisons get distorted.

Why this comparison matters in 2026

Two things changed between 2023 and 2025 that broke the older "Python is the obvious choice" framing.

First, Node's HTTP layer matured. undici (the HTTP/1.1 + HTTP/2 client now bundled with Node) outpaces node-fetch by roughly 2-3x on sustained throughput in our internal tests and outpaces Python's aiohttp on raw req/s in this benchmark. Async I/O on V8 is no longer the weak link it was when request and axios dominated.

Second, Python's library moat held. lxml, BeautifulSoup, and Scrapy are still the most ergonomic tools for messy HTML. Cheerio narrowed the gap but didn't close it for XPath-heavy work.

The wrong way to pick a language for a new scraper is to start with the language and discover the tool gap three weeks in. The right decision tree:

  1. What's your bottleneck — req/s, parsing complexity, or developer hours?
  2. What does your existing stack look like?
  3. What does your team already know?

If you answer those three honestly, the language picks itself. Most of this post is the data behind those three answers.

One thing this benchmark does not cover: scraping under active anti-bot protection. Cloudflare, DataDome, and Akamai responses don't depend on the language — they depend on TLS fingerprint and browser context. If you're hitting those, see our writeup on JA4 fingerprinting; language choice is downstream of that decision.

Throughput showdown: 1000 concurrent requests

The test: 1000 concurrent GET requests to a static HTML target, 4-core machine, identical network conditions, no rate limiting from the target. Average page size ~45KB. Each run repeated 5 times; numbers below are medians.

| Stack | Peak req/s | p50 latency | p99 latency | Memory at peak |
| --- | --- | --- | --- | --- |
| Python asyncio + aiohttp | 487 | 82ms | 412ms | ~340MB |
| Python curl_cffi + asyncio | 412 | 110ms | 580ms | ~410MB |
| Node undici | 611 | 64ms | 380ms | ~290MB |
| Node Playwright Cluster | 89 | 1,800ms | 6,200ms | ~3.2GB |

Node + undici wins raw throughput. The 611 vs 487 gap is consistent across runs. curl_cffi is slower than aiohttp because TLS impersonation costs CPU on every handshake — that's the price you pay to look like Chrome 131 at the protocol level, and it's worth it on protected sites.

Playwright Cluster at 89 req/s is not a fair HTTP comparison and shouldn't be read as one. It's listed because engineers conflate "scraping benchmark" with "include all tools," and the point is exactly that: if your job needs a real browser, you've already left the throughput tier behind. The cost structure is different by an order of magnitude.

Where each stack breaks under sustained load

The interesting numbers aren't peak throughput. They're what happens at hour two and hour eight.

Python under sustained load: GIL contention shows up when parsing happens in the same process as I/O. With aiohttp pulling data and lxml parsing it inline, CPU saturates around 300-350 req/s regardless of how many event loop tasks you spawn. The mitigation is multiprocessing or splitting parse into a separate worker pool. Both add serialization overhead.

Node under sustained load: memory creep. In long-running scrapers (over 8 hours), undici connection pools accumulate sockets that aren't fully closed under specific failure paths — connection-reset-by-peer being the most common. Memory grows roughly 50-150MB per hour without explicit client.close() discipline. We've seen Node scrapers OOM at hour 12 that ran clean at hour 2.

Both stacks hit file descriptor exhaustion at roughly 10k concurrent connections. Python's failure mode is graceful (OSError: [Errno 24]); Node's is messier (sockets in CLOSE_WAIT for minutes). The fix in both cases is ulimit -n 65536 and connection pool caps. This is an OS-level constraint, not a language one.

Cost translation: a 25% throughput delta means roughly 25% more compute for the same volume. At 50M pages/month, that works out to roughly $40-60/month in extra compute on a Hetzner CCX-class VM. Real, but rarely the largest line item — proxy bandwidth and storage usually dominate.

Library ecosystem: the parsing and extraction layer

Throughput gets you the bytes. Parsing turns bytes into rows. Python wins this layer cleanly, and the win is structural rather than something Node will close in a release cycle.

Python's parsing stack:

  • BeautifulSoup 4 — CSS selectors, Pythonic API, tolerant of malformed HTML. The standard for "parse this messy page in 6 lines."
  • lxml — C-backed, fast, full XPath 1.0 support. The standard for "parse 10 million pages and care about CPU time."
  • Scrapy — full framework with built-in middleware, dedup, retry logic, item pipelines. The standard for "this is a scraping product, not a script."
  • Selectolax — newer, roughly 2-3x faster than lxml for selector-heavy work in published benchmarks. Worth knowing about.
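
For scale, here is roughly what those libraries abstract away: a stdlib-only `html.parser` version of the "title plus link count" extraction runs about 25 lines, where BeautifulSoup needs three. A minimal sketch (the class name and sample HTML are invented for illustration):

```python
from html.parser import HTMLParser

class TitleLinks(HTMLParser):
    """Stdlib-only baseline: collect <title> text, count <a href> links."""
    def __init__(self):
        super().__init__()
        self.title, self.links, self._in_title = "", 0, False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a" and any(k == "href" for k, v in attrs):
            self.links += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

p = TitleLinks()
p.feed('<html><title>Demo</title><a href="/x">x</a><a>no href</a></html>')
# p.title is "Demo", p.links is 1
```

The point isn't to use this in production — it's that every branch above is a malformed-HTML edge case BeautifulSoup and lxml already handle for you.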

Node's parsing stack:

  • Cheerio — jQuery-like, ~80% feature parity with BeautifulSoup. Comfortable if you've written jQuery; awkward if you haven't.
  • Playwright/Puppeteer — required for any JavaScript rendering. Not a parser, a browser.
  • node-html-parser — faster than Cheerio for simple selectors, less complete API.

In production, parsing 10k pages with lxml takes roughly 8-12 seconds on a single core. Same pages, same hardware, with Cheerio: roughly 18-25 seconds. The gap widens with XPath complexity — Cheerio doesn't natively support XPath, and the workaround (css-select plus manual traversal) costs lines and CPU.

The honest concession: if your target returns clean JSON from an API endpoint, none of this matters. JSON.parse is JSON.parse. Python's parsing advantage shrinks to zero when there's no HTML to traverse. Most modern scraping targets are a mix — JSON for the data you want, HTML for the listing pages that link to it. The mix is what determines whether the parsing layer is a hot spot.

Developer velocity is the other half. A Python extractor for a typical product page is roughly 30-50 lines. The Node equivalent in Cheerio is roughly 40-65 lines. Not a huge gap, but Python reads top-to-bottom while Node tends to require more nested callbacks for the same selector chain.

JavaScript rendering: where each language breaks

Neither language has a native, production-grade solution for rendering JavaScript-heavy pages. Both offload to Playwright or Puppeteer. The language you call them from doesn't change much.

Python Playwright completes a page render in roughly 2.5-4 seconds including driver overhead. Node Playwright lands at roughly 2.3-3.8 seconds. The difference is within measurement noise.

Where both languages fail identically: at >500 concurrent render jobs, the browser pool saturates. Each Chromium instance eats 200-400MB of RAM, and a 16GB box caps at roughly 40-50 concurrent browsers regardless of which language is orchestrating them. The production fix is the same — queue-based distribution with a worker pool sized to memory, not CPU.
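
The "size to memory, not CPU" rule reduces to arithmetic. A sketch, where the per-instance footprint and OS headroom are assumptions you'd replace with your own RSS measurements:

```python
CHROMIUM_RSS_MB = 300   # assumed midpoint of the 200-400MB per-instance range
HEADROOM_MB = 2048      # assumed reserve for the OS and the scraper itself

def browser_pool_size(total_ram_mb: int) -> int:
    """Cap concurrent Chromium instances by available memory, not cores."""
    return max(1, (total_ram_mb - HEADROOM_MB) // CHROMIUM_RSS_MB)

print(browser_pool_size(16 * 1024))  # 16GB box
```

On those assumptions a 16GB box caps at 47 browsers — consistent with the 40-50 range above. The number moves with measured RSS, not with core count.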

Cost reality: rendering-heavy scraping costs roughly $0.50-1.50 per 1,000 pages in cloud compute on a typical mid-tier VM. The language overhead is single-digit percent of that. If your job is 80% render, picking Python over Node to "save infra cost" is solving the wrong problem.

If you can avoid rendering — by hitting the underlying JSON APIs directly, or by getting through TLS-only protections like Cloudflare Bot Fight Mode without a browser — do it. The JA4 writeup covers when that's possible.

Production failure modes: GIL, memory, and connections

The benchmarks above run for minutes. Production scrapers run for days. Different problems show up.

Python GIL contention: any CPU-bound work in the parse step (lxml DOM iteration, regex over large strings, JSON-to-pandas conversion) blocks the event loop. On an 8-core machine with threading, you'll see roughly 1.5-2x speedup rather than the 8x you'd expect. Mitigation is multiprocessing.Pool for the parse step, which adds pickle serialization overhead between processes. The serialization tax runs roughly 5-15% of total parse time for typical page sizes.

Node memory creep: undici keeps connections alive for performance, which is correct. But under specific failure modes — TLS errors, peer resets, target-side timeouts — sockets land in states the pool doesn't reclaim. Heap grows. Mitigation: explicit Pool config with connections, keepAliveTimeout, and keepAliveMaxTimeout, plus a watchdog that recycles the pool every 30-60 minutes. Default config will leak.

File descriptor exhaustion hits both languages at the same threshold (ulimit -n, default 1024 on most Linux distros, 256 on macOS). Python's aiohttp surfaces this as OSError: Too many open files and recovers if you're catching it. Node tends to wedge into CLOSE_WAIT accumulation that survives the request and shows up as connection failures on the next batch.

Timeout recovery under load: both languages handle the happy path well and the degraded path identically badly. Under target-side slowness, the canonical fix is the same circuit-breaker pattern — track per-host failure rate, open the circuit when it crosses a threshold, half-open after a cool-down. Implement once, port across languages.

One Python advantage worth naming: profiling. cProfile, tracemalloc, and memory_profiler are stdlib or one-liner installs. Node profiling means clinic.js, --inspect, or Chrome DevTools — workable, but more steps from "scraper is slow" to "I see the hot frame."

Real code: scraping 10k URLs with both

Minimal working examples. Both handle exponential backoff, session pooling, error logging, and CSV output.

Python (asyncio + aiohttp + BeautifulSoup)

```python
import asyncio, csv, logging
import aiohttp
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")

CONCURRENCY = 50
RETRIES = 3
TIMEOUT = aiohttp.ClientTimeout(total=30)

async def fetch(session, url, attempt=1):
    try:
        async with session.get(url, timeout=TIMEOUT) as r:
            if r.status == 429:
                if attempt < RETRIES:
                    await asyncio.sleep(2 ** attempt)
                    return await fetch(session, url, attempt + 1)
                return url, None, "rate_limited"
            r.raise_for_status()
            return url, await r.text(), None
    except Exception as e:
        if attempt < RETRIES:
            await asyncio.sleep(2 ** attempt)
            return await fetch(session, url, attempt + 1)
        return url, None, str(e)

def parse(html):
    soup = BeautifulSoup(html, "lxml")
    title = soup.find("title")
    links = [a.get("href") for a in soup.find_all("a", href=True)]
    return (title.text.strip() if title else "", len(links))

async def worker(sem, session, url, writer):
    async with sem:
        url, html, err = await fetch(session, url)
        if err:
            log.warning("fail %s: %s", url, err)
            return
        title, link_count = parse(html)
        writer.writerow([url, title, link_count])

async def main(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    connector = aiohttp.TCPConnector(limit=CONCURRENCY * 2)
    async with aiohttp.ClientSession(connector=connector) as session:
        with open("out.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["url", "title", "link_count"])
            await asyncio.gather(*(worker(sem, session, u, writer) for u in urls))

if __name__ == "__main__":
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    asyncio.run(main(urls))
```

Node (undici + cheerio)

```javascript
import { Pool } from 'undici';
import * as cheerio from 'cheerio';
import { createWriteStream } from 'fs';
import { readFile } from 'fs/promises';

const CONCURRENCY = 50;
const RETRIES = 3;
const TIMEOUT_MS = 30_000;

const pools = new Map();
function poolFor(origin) {
  if (!pools.has(origin)) {
    pools.set(origin, new Pool(origin, {
      connections: CONCURRENCY,
      keepAliveTimeout: 300_000,
    }));
  }
  return pools.get(origin);
}

async function fetchOnce(url, attempt = 1) {
  const u = new URL(url);
  const pool = poolFor(u.origin);
  try {
    const res = await pool.request({
      path: u.pathname + u.search,
      method: 'GET',
      headersTimeout: TIMEOUT_MS,
      bodyTimeout: TIMEOUT_MS,
    });
    if (res.statusCode === 429) {
      await res.body.dump(); // release the connection back to the pool
      if (attempt >= RETRIES) return { url, err: 'rate_limited' };
      await new Promise(r => setTimeout(r, 2 ** attempt * 1000));
      return fetchOnce(url, attempt + 1);
    }
    if (res.statusCode >= 400) {
      await res.body.dump(); // unconsumed bodies leak pooled sockets
      throw new Error(`status ${res.statusCode}`);
    }
    const body = await res.body.text();
    return { url, body };
  } catch (e) {
    if (attempt < RETRIES) {
      await new Promise(r => setTimeout(r, 2 ** attempt * 1000));
      return fetchOnce(url, attempt + 1);
    }
    return { url, err: e.message };
  }
}

function parse(html) {
  const $ = cheerio.load(html);
  const title = $('title').text().trim();
  const linkCount = $('a[href]').length;
  return { title, linkCount };
}

async function runBatch(urls, out) {
  let i = 0;
  async function worker() {
    while (i < urls.length) {
      const url = urls[i++];
      const { body, err } = await fetchOnce(url);
      if (err) {
        console.warn(`fail ${url}: ${err}`);
        continue;
      }
      const { title, linkCount } = parse(body);
      out.write(`"${url}","${title.replace(/"/g, '""')}",${linkCount}\n`);
    }
  }
  await Promise.all(Array.from({ length: CONCURRENCY }, worker));
}

const urls = (await readFile('urls.txt', 'utf8')).split('\n').filter(Boolean);
const out = createWriteStream('out.csv');
out.write('url,title,link_count\n');
await runBatch(urls, out);
out.end();
for (const p of pools.values()) await p.close();
```

10k URLs on a 4-core machine, identical rate limiting: Python finished in roughly 28 seconds. Node finished in roughly 22 seconds. The Python version runs about 55 lines; the Node version about 80. Both are production-adjacent — you'd add metrics, structured logging, and probably a queue, but the bones are right.

Common runtime errors you'll see in both: ECONNRESET after a couple of hours (target side, not yours), 429 bursts (use jitter, not just exponential backoff), and timeouts on slow pages. Python-specific: UnicodeDecodeError on pages with mixed encodings — chardet plus explicit errors='replace' is the fix. Node-specific: RangeError: Maximum call stack if you recurse retries instead of iterating, and OOM under sustained load if you don't recycle the pool.

Common errors and how to fix them

Python aiohttp ClientConnectorError after ~1000 requests: usually ulimit -n or unclosed sessions. Set TCPConnector(limit=N, limit_per_host=M) explicitly, raise ulimit -n to 65536, and confirm you're using a single ClientSession per process rather than creating one per request.

Python lxml libxml2 XML parsing error on malformed HTML: switch to lxml.html (which is HTML-tolerant) or pass "html.parser" as BeautifulSoup's parser argument — BeautifulSoup(html, "html.parser"). Trade-off: lxml is faster, html.parser is more forgiving.

Node undici socket hang up: idle timeout too short. Default keepAliveTimeout is 4 seconds at the agent level for some configurations; raise to 60-300 seconds for slow targets. Also confirm the target supports keep-alive — some don't, and the fix is pipelining: 0.

Node cheerio Invalid selector: cheerio uses css-select, not XPath. If you copied an XPath from DevTools, translate to CSS or switch to a library that supports XPath. Test selectors in browser DevTools (document.querySelectorAll) before pasting into cheerio.

Both, on 429 responses: implement jitter — delay = base * 2^attempt + random(0, base). Without jitter, a fleet of workers retries in lockstep and hits the same rate limit window. Also rotate User-Agent and check whether the target is rate-limiting on IP, header, or session cookie. The fix differs by signal.
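
That formula is one line of Python; the cap is an addition of ours, an assumption worth making so attempt 10 doesn't sleep for seventeen minutes:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """delay = base * 2^attempt + random(0, base), capped at `cap` seconds."""
    return min(cap, base * 2 ** attempt + random.uniform(0, base))
```

The random term is the whole point: it spreads a fleet's retries across a base-wide window instead of letting every worker wake at the same tick and re-trip the same rate limit.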

Where this comparison breaks down

A few honest scope limits.

Anti-bot context: this benchmark hit clean targets. Once Cloudflare, DataDome, or Akamai is in the path, throughput drops by orders of magnitude for both languages and the bottleneck moves to TLS fingerprinting and browser context. Language choice is irrelevant compared to engine choice at that point.

Multiprocessing not tested: Python ran asyncio only. With multiprocessing + asyncio per worker on an 8-core machine, Python's throughput ceiling moves up considerably — possibly past Node's single-process ceiling. We didn't run that benchmark because it adds operational complexity most teams won't take on.

Distributed scraping: if you're sharding across regions or running on Lambda/Cloud Functions, the language overhead is dwarfed by cold start and network. Go pick whichever language matches your platform.

Salary economics: Python and Node developers price within roughly 5-10% of each other in most US markets (per Stack Overflow's 2025 developer survey), but for a scraping team of 1-3 engineers the velocity difference matters more than the hourly rate. A team that ships in 2 weeks beats a team that ships in 6 weeks at any reasonable salary delta.

API rate limits dominate: most production scrapers are bottlenecked by the target's rate limit or your proxy budget, not by your HTTP client. If you're getting 50 req/s through a residential proxy at $8/GB, the difference between 487 and 611 req/s peak is theoretical.

The honest recommendation

Choose Python if:

  • Existing codebase is Python.
  • Parse logic is non-trivial — XPath, recursive descent, schema-on-read.
  • Team is comfortable with BeautifulSoup or Scrapy already.
  • Developer velocity matters more than peak req/s.

Choose Node if:

  • Backend is already Node, and avoiding a polyglot ops stack matters.
  • Raw HTTP throughput is the bottleneck (sustained >500 req/s on clean targets).
  • Targets are mostly JSON APIs — Cheerio is fine, parsing isn't a hot spot.
  • Team writes JavaScript daily and would context-switch into Python.

Hybrid (Node HTTP layer feeding a Python parse pipeline via a queue) is feasible for teams of 3+ engineers. For smaller teams, the operational overhead exceeds the throughput win.

The 25% throughput gap is real. It's also rarely the constraint. Pick the language that matches your codebase and your team, then spend the time you saved on retry logic, proxy rotation, and parse correctness — those are the things that decide whether your scraper survives week three.

If you'd rather not run any of this yourself, the DreamScrape playground handles the engine routing across HTTP, JA4, and browser tiers. Free tier is 2,000 scrapes a month, no signup required to try it.