Tutorial

Scraping Reddit in 2026 without an API key or rate limits

Scrape Reddit subreddit feeds, comments, and profiles using public JSON endpoints without API keys. HTTP tier, 400ms latency, 1 credit per request.

Curtis Vaughan · 12 min read

Reddit still serves a valid JSON representation of nearly every public page if you append /.json to the URL. In April 2026, this endpoint returns 200 OK for subreddit feeds, post permalinks, and user profiles without any OAuth token, API key, or developer application. Average latency from our routing logs: [TODO: insert production stat from logs]ms. Cost through DreamScrape's HTTP tier: 1 credit per request.

This guide shows you how to pull subreddit feeds, comment trees, and user profiles using the /.json method, where it breaks, and why you should not waste a browser scrape on reddit.com in 2026.

Why Reddit's Public JSON Endpoints Still Work in 2026

The /.json suffix is not a loophole. Reddit has published these endpoints since 2008 and uses them internally to hydrate their own SEO-crawlable pages. Googlebot hits them. Bingbot hits them. So does every archive service that has ever indexed a Reddit thread. Blocking this endpoint would break Reddit's own SEO, which is why it remains open long after the 2023 API price hike drove most developers away.

The distinction that matters: this endpoint serves data that is already public HTML. If you can load the page in an incognito browser without logging in, the /.json version returns the same data as JSON. If the page requires an account — private subreddits, NSFW-after-opt-in, blocked-user content, inbox, saved posts — the JSON endpoint returns 403 or a login wall. Reddit is not being generous with authenticated data; they're just not hiding what was already public.

DreamScrape's HTTP router auto-detects reddit.com URLs and rewrites them to the /.json suffix before dispatching. You send https://www.reddit.com/r/python/, we fetch https://www.reddit.com/r/python/.json. The response is parsed JSON, not HTML. Average latency is [TODO: insert production stat from logs]ms across our last [TODO: insert production stat from logs] reddit.com scrapes. Cost is 1 credit — the same as fetching any static HTML page. No browser engine, no JA4 impersonation, no residential proxy. See /intel/reddit.com for the current routing decision and success rate.

Setting Up DreamScrape HTTP Tier for Reddit Subreddit Feeds

Here's the minimum viable scrape of /r/python returning the latest 25 posts:

code
import httpx
 
API_KEY = "your_dreamscrape_key"
 
response = httpx.post(
    "https://dreamscrape.app/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://www.reddit.com/r/python/",
        "engine": "http"
    },
    timeout=30
)
 
data = response.json()
posts = data["content"]["data"]["children"]
 
for post in posts:
    p = post["data"]
    print({
        "title": p["title"],
        "score": p["score"],
        "num_comments": p["num_comments"],
        "author": p["author"],
        "created_utc": p["created_utc"],
        "permalink": p["permalink"],
    })

You don't append /.json yourself. The router detects the reddit.com host and does it for you. If you want to bypass the auto-rewrite (some debugging cases), pass "rewrite": false and hit the .json URL directly.

The response shape mirrors Reddit's internal listing format. data.children is an array of post wrappers. Each post's actual fields live under .data. Relevant fields: title, selftext, score, num_comments, author, created_utc (Unix seconds), permalink, url (the link the post points to, which may be external), subreddit, and is_self (true for text posts).

For pagination, Reddit returns an after cursor at data.after. Pass it as a query parameter on the next request:

code
next_url = f"https://www.reddit.com/r/python/?after={data['content']['data']['after']}&limit=100"

limit=100 is the maximum Reddit honors per request. Without it you get 25. Paginating past 1,000 posts deep is not possible through this endpoint — Reddit caps listing depth, regardless of authentication. For historical archives beyond that, you need Pushshift-style infrastructure or the official API's submission search.
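
The cursor-following loop generalizes to any Reddit listing. Here's a sketch, where `fetch_listing` is a hypothetical stand-in for whatever function issues the DreamScrape request and returns the parsed response in the shape shown above:

```python
def paginate_listing(fetch_listing, subreddit, max_pages=10):
    """Yield post dicts page by page, following Reddit's `after` cursor.

    `fetch_listing(subreddit, after)` is assumed to return the parsed
    DreamScrape response (the same shape as `data` in the example above).
    """
    after = None
    for _ in range(max_pages):
        page = fetch_listing(subreddit, after)
        listing = page["content"]["data"]
        for child in listing["children"]:
            yield child["data"]
        after = listing.get("after")
        if not after:  # Reddit signals the end of a listing with null
            break
```

At limit=100, that's at most max_pages × 100 posts per run, and the 1,000-item depth cap bounds it regardless of how high you set max_pages.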

Error handling for the two failures you will hit most:

code
import time

if response.status_code == 429:
    # Reddit soft rate limit: back off exponentially
    time.sleep(2 ** retry_count)  # retry_count maintained by your retry loop
elif response.status_code == 403:
    # IP reputation flag: retry through a proxy session
    retry_with_proxy()  # your wrapper that re-sends with "useProxy": true

At [TODO: insert production stat from logs] requests per minute sustained, you will hit 429s. At bursts above that, 403s. Both are recoverable — see the scaling section.

Scraping Reddit Comments and User Profiles with /.json

The /.json pattern extends to two other endpoints worth knowing.

Comment trees. Append /.json to any post permalink:

code
https://www.reddit.com/r/python/comments/abc123/post_slug/.json

The response is a two-element array. Element 0 is the post itself (same shape as a listing entry). Element 1 is the comment tree — data.children is an array where each entry is either a t1 (comment) or a more object (placeholder for comments Reddit didn't expand).

Comment fields: body, score, author, created_utc, parent_id, id, replies (recursively shaped like the top-level tree).

code
def walk_comments(children, depth=0):
    for item in children:
        if item["kind"] == "more":
            # Reddit truncated here; fetching the remaining IDs requires /api/morechildren
            continue
        c = item["data"]
        yield {"depth": depth, "author": c["author"], "body": c["body"], "score": c["score"]}
        if c.get("replies"):
            yield from walk_comments(c["replies"]["data"]["children"], depth + 1)

User profiles. Three endpoints:

  • /user/{username}.json — overview of the user's posts and comments (mixed)
  • /user/{username}/comments.json — comment history only
  • /user/{username}/submitted.json — posts only

Same pagination with after. Same 1,000-entry depth cap.

Throttling recommendation. Reddit publishes no rate limit for anonymous /.json access, but empirically [TODO: insert production stat from logs] requests per second per IP triggers 429s within [TODO: insert production stat from logs] seconds. A 2–3 second delay between requests per IP keeps you well under the soft limit. DreamScrape's router handles IP rotation automatically when you enable "useProxy": true, which shifts the per-IP budget across a pool.

Depth limits are the one thing you can't fix cheaply. Reddit returns only the top ~500 comments per post, and deep nested replies get collapsed into more objects. Expanding each more costs one additional request to /api/morechildren per batch — which isn't publicly documented and rate-limits harder than /.json. If your analysis needs full comment trees on large threads, budget for this.
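
If you do budget for expansion, the first step is collecting the comment IDs buried in each more node so they can be submitted to /api/morechildren in batches. A sketch, with the batch size of 100 as an assumption you should verify against the endpoint's current behavior:

```python
def collect_more_ids(children, batch_size=100):
    """Gather the comment IDs from every `more` placeholder in a
    comment tree, grouped into batches for /api/morechildren."""
    ids = []
    stack = list(children)
    while stack:
        item = stack.pop()
        if item["kind"] == "more":
            ids.extend(item["data"]["children"])
        else:
            replies = item["data"].get("replies")
            if replies:  # empty string when a comment has no replies
                stack.extend(replies["data"]["children"])
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
```

Each batch then costs one additional request, which is how you estimate the real price of a "full tree" scrape before committing to it.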

Common Failures and How to Fix Them

Listed in order of how often we see each on reddit.com traffic:

Error | Cause | HTTP Tier Workaround | Cost Impact
429 Too Many Requests | Soft per-IP rate limit (~1 req/sec) | Exponential backoff; enable useProxy: true | +10 credits per request with proxy
403 Forbidden | IP reputation flag or datacenter IP | Rotate to residential proxy session | +10 credits per request
Empty JSON / score null | Post deleted or removed by mods | Check removed_by_category and author == "[deleted]" before ingesting | None
more_comments placeholder | Reddit truncated comment tree | Accept shallow tree, or fetch /api/morechildren per batch | +1 credit per batch
404 Not Found | Private subreddit, deleted user, banned content | No bypass without OAuth; skip and log | None

Error #1: HTTP 429. This is by far the most common. Reddit's soft limit is roughly 1 request per second per anonymous IP, with burst tolerance of [TODO: insert production stat from logs] before they start rejecting. Exponential backoff from 1s → 2s → 4s recovers almost every time. Three retries is the sweet spot; beyond that the IP is flagged and you should rotate.

Error #2: HTTP 403. Reddit maintains a reputation score per source IP. Datacenter ranges (AWS, GCP, DigitalOcean) accumulate flags faster than residential. If you're seeing 403s from a fresh IP within [TODO: insert production stat from logs] requests, you're on a pre-flagged range. Residential proxy via DreamScrape (useProxy: true) works here; we've measured [TODO: insert production stat from logs]% success on reddit.com with residential vs. [TODO: insert production stat from logs]% on datacenter.

Error #3: Empty or null JSON. The endpoint returns a valid 200 with data, but score is null, body is "[deleted]", or removed_by_category is populated. The post exists but has been removed by the author, a moderator, or Reddit admin. Check these fields before writing to your database, or you'll pollute analytics with removed content.
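
A guard run before ingest keeps removed content out of your tables. The field names come from the response shape described above; treat the exact set of removed_by_category values as an assumption to verify against live data:

```python
def is_live_post(p):
    """Return True if a post dict (from data.children[].data) is still
    live, i.e. not deleted by the author or removed by mods/admins."""
    if p.get("removed_by_category"):      # e.g. "moderator", "deleted"
        return False
    if p.get("author") == "[deleted]":    # author deleted their account/post
        return False
    if p.get("selftext") == "[removed]":  # body stripped on removal
        return False
    return True
```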

Error #4: more_comments placeholders. Inherent to the endpoint. The response truncates at Reddit's depth/breadth limits and inserts a {"kind": "more", "data": {"children": ["abc", "def", ...]}} object. Expanding requires /api/morechildren, which is a POST endpoint with different rate-limiting. For most analytics, accepting the shallow tree and treating truncated nodes as "not captured" is fine.

Error #5: HTTP 404 on valid-looking URLs. Private subreddit, suspended user, quarantined content, or a shadow-banned post. Without OAuth credentials tied to an approved account, there is no workaround. Log and move on.

Where the HTTP /.json Method Breaks and Requires API Keys

The honest scope boundary. If your use case is in this list, /.json will not help you:

Private subreddits and quarantined content. The endpoint returns 403 regardless of how you fingerprint. OAuth2 with an account that has access is the only path.

Authenticated user activity. Saved posts, upvote history, inbox, subscribed subreddits, friends list — none of this is on the public endpoint. By design. Requires OAuth2 with the matching account.

Real-time or sub-second data. Reddit's /.json responses are cached at the CDN edge for 1–5 seconds. If you're tracking breaking news or live-post velocity at sub-second resolution, you'll see stale scores. The official API is less stale but still not real-time; real-time Reddit data requires WebSocket access or Pushshift-equivalent infrastructure.

Historical scrapes at multi-million scale. The 1,000-item listing cap means you cannot paginate past ~1,000 posts deep in any listing. Submission search via the official API (or Pushshift archives, where available) is the only way to pull years of historical posts for a subreddit.

Subreddit-specific scraping prohibitions. Some subreddits (r/science, r/dataisbeautiful, certain research communities) explicitly forbid scraping in their rules. The technical endpoint works; the social contract does not. Respect the rules or expect to be blocked at the subreddit level if mods notice.

Cost comparison. Reddit's free API tier allows [TODO: insert production stat from logs] requests per minute authenticated and supports OAuth content. DreamScrape HTTP tier is 1 credit per request with no auth gate. For public-data scale scraping above a few hundred requests per hour, HTTP tier wins on throughput. For OAuth content or official-use compliance, the API wins. Use whichever matches your scope.

HTTP vs. Browser Tier: Why HTTP Wins for Reddit

Metric | HTTP (/.json) | Browser (Playwright) | Official Reddit API
Avg latency | [TODO: insert production stat from logs]ms | [TODO: insert production stat from logs]s | [TODO: insert production stat from logs]ms
Cost per 1K requests | 1K credits | [TODO: insert production stat from logs]K credits | Free (rate-limited)
IP block risk | Low | Low | None (authenticated)
JS rendering needed? | No | N/A | No
Auth support | None | None | OAuth2
Best for | Public subreddit, post, user data at scale | Nothing Reddit-specific | OAuth content, official compliance

Reddit's /.json is pre-rendered JSON. There is no client-side JavaScript rendering the data, no lazy-loaded XHR after page load, no React hydration. A browser scrape returns the same payload as an HTTP scrape, 5–10x slower and at 10x the credit cost. The only reason to use browser tier on reddit.com would be if Reddit introduced JS-based anti-bot countermeasures specifically on /.json. As of April 2026, they have not. If you're auto-routing and reddit.com is landing on browser tier, something is misconfigured.

Scaling to 1000+ Posts Per Minute: Concurrency and Pagination

A production pattern for high-throughput Reddit scraping:

code
import asyncio
import httpx

API_KEY = "your_dreamscrape_key"

SEMAPHORE = asyncio.Semaphore(8)  # 8 concurrent requests

async def scrape_page(client, subreddit, after=None, session_id=None):
    async with SEMAPHORE:
        url = f"https://www.reddit.com/r/{subreddit}/.json?limit=100"
        if after:
            url += f"&after={after}"
        payload = {"url": url, "engine": "http", "useProxy": True}
        if session_id:
            payload["sessionId"] = session_id  # pin or rotate the proxy session
        resp = await client.post(
            "https://dreamscrape.app/scrape",
            json=payload,
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        return resp.json()

async def scrape_subreddit(subreddit, max_pages=10):
    async with httpx.AsyncClient(timeout=60) as client:
        after = None
        for _ in range(max_pages):
            data = await scrape_page(client, subreddit, after)
            yield data
            after = data["content"]["data"].get("after")
            if not after:
                break
            await asyncio.sleep(1.5)  # per-subreddit delay

Concurrency. 8–10 concurrent requests across different subreddits is the ceiling before 429s dominate. Within a single subreddit, serialize — pagination depends on the previous after cursor anyway.

Backoff. On 429, sleep 2 ** retry_count seconds, max 3 retries. On 403, don't retry on the same session; rotate to a new proxy session by sending a different sessionId on the DreamScrape request. Sample:

code
import uuid

if resp.status_code == 403:
    # A fresh session ID gets a new proxy exit; the flagged one is burned
    await scrape_page(client, subreddit, after, session_id=uuid.uuid4().hex)

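Both recovery paths can be combined into one wrapper. In this sketch, `send` is a hypothetical callable that issues the DreamScrape request for a given session ID and returns an object with a `status_code`; in production it would wrap the httpx call shown above:

```python
import time
import uuid

def request_with_recovery(send, max_retries=3, base_delay=1.0):
    """Retry on 429 with exponential backoff; on 403, rotate to a
    fresh proxy session ID instead of retrying the flagged one."""
    session_id = uuid.uuid4().hex
    for attempt in range(max_retries + 1):
        resp = send(session_id)
        if resp.status_code == 429:
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        elif resp.status_code == 403:
            session_id = uuid.uuid4().hex  # flagged: new session next attempt
        else:
            return resp
    return resp  # caller decides what to do with the final failure
```
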
Queue management. For production, store (subreddit, cursor, status) in a Postgres table and process it as a work queue. If a worker crashes mid-pagination, the next worker resumes from the last committed cursor. Don't hold pagination state in memory — at [TODO: insert production stat from logs] subreddits in flight you will lose data to a crashed process.

Deduplication. Reddit occasionally returns the same post ID across pagination boundaries. A Redis set keyed on post ID, or a UNIQUE(subreddit, post_id) constraint in Postgres with ON CONFLICT DO UPDATE, prevents duplicates. Update the score and num_comments on conflict so your rows stay fresh.
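
On the application side, a seen-set keyed on post ID catches duplicates before they reach the database. The in-memory set below is a stand-in for a Redis SADD in a multi-worker deployment:

```python
def dedupe_posts(posts, seen=None):
    """Yield each post once, keyed on its Reddit ID. Pass the same
    `seen` set across calls to dedupe across pagination boundaries."""
    if seen is None:
        seen = set()
    for p in posts:
        if p["id"] in seen:
            continue  # already ingested on an earlier page
        seen.add(p["id"])
        yield p
```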

Throughput ceiling. At 8 concurrent requests with 1.5s subreddit delay and 100 posts per page, expect ~[TODO: insert production stat from logs] posts per minute sustained. Cost at 1 credit per request plus 10 credits per residential proxy use: [TODO: insert production stat from logs] credits per 1,000 posts.

Storing and Structuring Reddit Data for Analytics

A minimum schema that holds up at scale:

code
CREATE TABLE reddit_posts (
    id TEXT PRIMARY KEY,              -- Reddit post ID (e.g., "abc123")
    subreddit TEXT NOT NULL,
    title TEXT NOT NULL,
    author TEXT,
    score INT,
    num_comments INT,
    created_at TIMESTAMPTZ NOT NULL,  -- ISO-8601 UTC
    url TEXT,                         -- the link the post points to
    permalink TEXT,                   -- Reddit-hosted URL
    is_self BOOLEAN,
    selftext TEXT,
    scraped_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(subreddit, id)
);
 
CREATE TABLE reddit_comments (
    id TEXT PRIMARY KEY,
    post_id TEXT REFERENCES reddit_posts(id),
    parent_id TEXT,
    author TEXT,
    body TEXT,
    score INT,
    created_at TIMESTAMPTZ NOT NULL,
    scraped_at TIMESTAMPTZ DEFAULT NOW()
);
 
CREATE INDEX ON reddit_posts (subreddit, created_at DESC);
CREATE INDEX ON reddit_comments (post_id);

Convert timestamps on ingest. Reddit gives you created_utc as Unix seconds. Store ISO-8601 with explicit UTC: datetime.fromtimestamp(created_utc, tz=timezone.utc). Mixed timezones downstream are a data quality disaster.
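
The conversion is one line with the standard library:

```python
from datetime import datetime, timezone

def to_utc_iso(created_utc):
    """Convert Reddit's created_utc (Unix seconds) to an ISO-8601
    UTC string suitable for a TIMESTAMPTZ column."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()
```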

Upsert, don't insert. Scores and comment counts change. Re-running a scrape should update, not duplicate:

code
INSERT INTO reddit_posts (...) VALUES (...)
ON CONFLICT (subreddit, id) DO UPDATE SET
    score = EXCLUDED.score,
    num_comments = EXCLUDED.num_comments,
    scraped_at = NOW();

Downstream queries. Top posts by score in the last 24 hours, rising subreddits by post velocity, comment-to-score ratio as an engagement proxy — all trivial on this schema with indexes on (subreddit, created_at).

Legal Risk: Terms of Service, Privacy, and Republication

Reddit's Terms of Service prohibit automated access except through the official API. The /.json endpoint is nonetheless published, documented in Reddit's own help pages, and used by Googlebot and other indexers daily. This creates a legal ambiguity that courts have not conclusively resolved as of 2026. hiQ v. LinkedIn established that scraping public data is not a CFAA violation, but that ruling does not immunize you from a ToS breach-of-contract claim. If you are scraping Reddit at scale for commercial use, talk to a lawyer who has read the current version of the ToS.

Three specific risk areas:

User privacy. Storing individual users' comment history with their username attached can trigger GDPR right-to-erasure requests if any user is an EU resident, and CCPA obligations for California residents. Aggregate and anonymize wherever possible — "r/python posts averaged [TODO: insert production stat from logs] comments in Q1 2026" is low-risk; a searchable database of /u/johndoe's comments is high-risk.

Subreddit rules. Communities like r/science and r/dataisbeautiful explicitly prohibit scraping in their rules. Respect subreddit-level opt-outs even though they have no legal force; ignoring them is the fastest way to get Reddit to crack down on /.json entirely, which hurts everyone.

Content republication. Scraping for analysis is different from republishing user-authored posts verbatim. The latter implicates copyright (users retain ownership of their submissions under Reddit's User Agreement). Aggregate, summarize, cite — don't mirror.

Safer alternative. For research and academic use, the official Reddit API's free tier is more legally defensible and still functional for most workloads. DreamScrape provides the technical capability; compliance is your responsibility. We do not indemnify.


If you're scraping public Reddit data at any meaningful scale, start at HTTP tier with useProxy: true for anything above [TODO: insert production stat from logs] requests per hour. Check /intel/reddit.com for current routing decisions before you build custom retry logic — we already handle the 429/403 patterns described above. For the broader question of when browser-tier scraping is and isn't necessary, see why 70% of scraping doesn't need a browser.