What AI Bots Actually See | VisibilityTrace

An operator can spend a week tuning a page in a browser and ship it confident the content is in place — and a crawler will fetch the same URL and receive an empty document. This page is the explainer that pairs with the AI crawler readiness checklist. It describes the gap between the browser view and the AI-crawler view, the technical reasons the gap exists, and how to audit a page for the gap before launch. The evidence is drawn from operational practice on a fleet of static affiliate content sites; specific sites are not named.

<!doctype html>
<html lang="en">
  <head>…</head>
  <body>
    <div id="root"></div>
    <script src="app.js"></script>
  </body>
</html>

no H1 · no body text · no JSON-LD

Same URL · same server · two views. A browser executes JavaScript and turns the static response into the rendered page. An AI crawler stops at the static response.

Browser DOM vs static HTML response — what a human sees vs what GPTBot, ClaudeBot, PerplexityBot receive

The browser view versus the bot view

A modern web page in a browser is not a document. It is a small HTML shell, a bundle of JavaScript, and a sequence of fetches the JavaScript performs after the page loads. The visible content — the H1, the paragraphs, the navigation — appears in the DOM after the JavaScript has executed and the framework has hydrated. To a human, the page looks like a document. To a crawler that does not execute JavaScript, the page is the small HTML shell.

Most AI crawlers do not execute JavaScript at scale. Some — including the most prominent — execute a constrained subset of JavaScript on a fraction of URLs, then fall back to the static body for everything else. The honest planning assumption is: your page is the bytes that come back in the first HTTP response, before any script runs. If the H1 is not in those bytes, it is not in the page for crawler purposes.

The empty-body SPA failure mode

The dominant failure mode for sites built on client-rendered SPA frameworks is the empty body. The HTML response is a near-empty document — a head with meta tags, a body containing one or two empty <div> elements, and a script tag pointing at a JavaScript bundle. The framework runs in the browser, fetches data, builds the DOM, and the user sees a complete page.

A crawler that does not execute JavaScript fetches the same URL and receives the empty document. From the crawler's perspective the page has no headline, no body copy, and no internal links. The framework's runtime document.title update never runs. The page is invisible.

This failure mode is operationally hard to detect because operator browser testing does not surface it: the operator's browser does execute JavaScript. The way to detect it is the static-body probe described in the checklist: fetch the page with curl using a crawler user agent, then look for the H1 in the raw response.
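The static-body probe can be sketched in a few lines of Python, as an alternative to running curl by hand. This is a minimal sketch that assumes only the standard library. The helper names `find_h1` and `fetch_static` are hypothetical, and the bare `GPTBot` user-agent value is a simplified placeholder, not the full string the real crawler sends.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class H1Finder(HTMLParser):
    """Collects the text of every <h1> found in raw, unrendered HTML."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.h1_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.h1_texts.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.h1_texts[-1] += data


def find_h1(html):
    """Return the non-empty H1 strings present in the static HTML."""
    finder = H1Finder()
    finder.feed(html)
    finder.close()
    return [text.strip() for text in finder.h1_texts if text.strip()]


def fetch_static(url, user_agent="GPTBot"):
    """Fetch the first HTTP response body; no script is ever executed."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (hypothetical URL, replace with the page under test):
#   find_h1(fetch_static("https://example.com/"))
```

An empty list from `find_h1` on a page that shows a headline in the browser is the empty-body finding described above.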

Why head-only fixes are not enough

Document-head libraries — Helmet and its peers — let a framework update the page title, meta description, and JSON-LD at runtime. Operators sometimes use these libraries to patch SEO metadata onto a fundamentally empty page. From the browser side, the patched head looks correct. From the crawler side, the patch never happens: the head in the static response is whatever the framework shipped at build time.

The head matters — meta description, canonical, social tags, JSON-LD — but it is the body where a crawler reads the content the page is about. A page with a perfectly patched head and an empty body fails as content. The fix is to render both the head and the body into the static response — either at build time (static-site generation) or per request (server-side rendering).
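The difference between a patched head and a rendered body can be checked separately on the static response. The sketch below is illustrative only: the `audit` helper is hypothetical and the 50-word threshold is an arbitrary assumption, not a figure from the checklist.

```python
from html.parser import HTMLParser


class HeadBodyAudit(HTMLParser):
    """Separates head metadata from visible body text in static HTML."""

    SKIP = {"script", "style", "template"}  # content a reader never sees

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0
        self.title = ""
        self.has_description = False
        self.body_words = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.has_description = bool(attrs.get("content"))
        elif tag == "body":
            self._in_body = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body and not self._skip_depth:
            self.body_words += len(data.split())


def audit(html, min_body_words=50):
    """Report head and body health; min_body_words is an assumed threshold."""
    parser = HeadBodyAudit()
    parser.feed(html)
    parser.close()
    return {
        "head_ok": bool(parser.title.strip()) and parser.has_description,
        "body_ok": parser.body_words >= min_body_words,
        "body_words": parser.body_words,
    }
```

A result of `head_ok` true with `body_ok` false is the head-only pattern: metadata present, content absent.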

JSON-LD visibility

JSON-LD is the structured-data format AI crawlers prefer for entity, article, and FAQ extraction. For a crawler to use a JSON-LD block, the block must arrive in the static HTML response. Runtime-injected JSON-LD has the same problem as a runtime-patched title: the crawler does not run the script that injects it.

A useful practical test: fetch the page with curl, search for application/ld+json blocks in the response, and confirm that each block contains the entities the page is about — the Article with its headline and author, the FAQPage with its question list, the BreadcrumbList with its full item array. A response that contains only a generic WebSite or Organization block from the framework's default head is shipping a page without per-route structured data.
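The same test can be scripted. The following is a rough sketch, assuming the HTML has already been fetched as a string. The function names and the `GENERIC_TYPES` set are hypothetical and chosen to mirror the paragraph above. Malformed blocks are skipped silently here, which a real audit would probably want to report.

```python
import json
from html.parser import HTMLParser

# Types treated as framework defaults in this sketch (an assumption).
GENERIC_TYPES = {"WebSite", "Organization"}


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every application/ld+json script block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data


def jsonld_types(html):
    """Return every @type declared in the static response's JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    types = []
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: unusable, skipped in this sketch
        # A block may be one object, a list of objects, or an @graph wrapper.
        if isinstance(doc, dict):
            nodes = doc.get("@graph", [doc])
        elif isinstance(doc, list):
            nodes = doc
        else:
            continue
        for node in nodes:
            if not isinstance(node, dict):
                continue
            declared = node.get("@type")
            if isinstance(declared, list):
                types.extend(declared)
            elif declared:
                types.append(declared)
    return types


def has_route_specific_data(html):
    """True if at least one declared type is not a generic default."""
    return any(t not in GENERIC_TYPES for t in jsonld_types(html))
```

This only confirms which types are declared. Checking that an Article carries its headline and author, or that a BreadcrumbList has its full item array, still needs a per-type inspection of each node.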

What different crawlers actually fetch

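Most of the major AI crawlers identify themselves by user-agent string and behave slightly differently; Google-Extended is the exception, a robots.txt control token rather than a request user agent. The operationally relevant distinctions are: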

GPTBot — OpenAI's training-data crawler. Fetches the static HTML response. Respects robots.txt directives at the user-agent level. Does not render JavaScript at scale.
OAI-SearchBot — OpenAI's crawler for ChatGPT's web-search feature. Separate from GPTBot. Fetches static HTML. Used to populate the live results that appear in ChatGPT browsing responses.
ClaudeBot — Anthropic's crawler. Static HTML fetch. Respects per-user-agent robots.txt rules.
PerplexityBot — Perplexity's crawler. Fetches static HTML. Perplexity also performs live web fetches at query time using a different user-agent string; the two paths produce different views.
Google-Extended — Google's opt-out token for AI uses of Googlebot's existing crawl. Allowing or disallowing Google-Extended controls whether Google's AI features may use content Googlebot already fetched. The separate Googlebot crawl does perform JavaScript rendering on a delayed second pass; AI uses ride on top of that result.
CCBot — Common Crawl, the open dataset that many AI training pipelines pull from. Static HTML fetch.

Across this set, the dominant signal is the static body. A page that is legible to curl with no scripts is legible to all of these crawlers. A page that requires JavaScript execution to expose its content is legible to none of them at scale, with the partial exception of Google's delayed render.
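Because several of these crawlers are controlled per user agent in robots.txt, it can help to confirm which tokens a given robots.txt actually admits before probing the body. The sketch below uses Python's `urllib.robotparser`. Its matching rules approximate, and may not exactly reproduce, how each crawler interprets the file. The `access_by_token` helper and the sample rules in the usage note are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# robots.txt tokens for the crawlers listed above.
AI_TOKENS = [
    "GPTBot",
    "OAI-SearchBot",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "CCBot",
]


def access_by_token(robots_txt, url):
    """Map each token to whether the given robots.txt text allows the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in AI_TOKENS}
```

For example, given a robots.txt that disallows `/` for `GPTBot` and allows everything under `User-agent: *`, `access_by_token(text, "https://example.com/guide")` reports `False` for `GPTBot` and `True` for the other tokens. A page blocked here is invisible to that crawler regardless of how its body is rendered.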

Chunking and semantic structure

When an AI system pulls a page into a response, it does not lift the page whole. The page is chunked (split into smaller passages) and the chunks are indexed for retrieval. Chunking pipelines commonly use HTML structure as a signal: headings introduce new sections, lists are treated as atomic units, and paragraphs are the basic text-bearing blocks.

Pages that flatten this structure — a single long <div> of text without headings, or a wall of paragraphs without a clear section hierarchy — produce chunks that lose the link to their context. The retrieved chunk reads as a disconnected statement rather than as part of an article on a specific topic.

The practical implication is editorial, not technical: every section that states a claim should be introduced by a heading the chunk can carry with it, and every claim should sit close enough to its source citation that a chunked passage of three or four paragraphs still contains both.
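To make the effect of headings concrete, here is a toy heading-aware chunker. It is an illustration of the general idea only, not the algorithm any particular AI system uses. The class and function names are hypothetical, and nested blocks (a paragraph inside a list item, for instance) are not handled.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCKS = {"p", "li"}


class HeadingChunker(HTMLParser):
    """Groups paragraph and list-item text under the nearest preceding heading."""

    def __init__(self):
        super().__init__()
        self._current = None           # tag of the block being collected
        self._buffer = []
        self.sections = [["", []]]     # [heading, passages]; first is untitled

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag in BLOCKS:
            self._current = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        text = " ".join("".join(self._buffer).split())
        if tag in HEADINGS:
            self.sections.append([text, []])   # a heading opens a new chunk
        elif text:
            self.sections[-1][1].append(text)
        self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)


def chunk_by_heading(html):
    """Return (heading, text) pairs, one per section that contains text."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return [(h, " ".join(parts)) for h, parts in parser.sections if parts]
```

Run on a page with headings, each chunk carries a label such as "Pricing". Run on the same text inside a flat `<div>`, every passage lands in one chunk with an empty heading, which is the loss of context described above.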

How to audit a page

The minimum audit for a single page is the sequence in the checklist: fetch the page with a crawler user agent, search for the H1, search for JSON-LD blocks, confirm the HTTP status, repeat across the six crawler user-agent strings. The output is a table of seven rows (six bots plus the browser view) by four columns (H1 present, body text present, JSON-LD present, status 200). Anything other than a uniform set of pass cells is a finding worth investigating.
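The bot rows of that table could be produced with something like the sketch below. Several things here are assumptions: the user-agent values are bare tokens standing in for the longer strings real crawlers send, `Googlebot` stands in for Google-Extended (which has no request user agent of its own), the 50-word threshold is arbitrary, and the regex checks are deliberately crude. The browser row needs a rendering browser and is out of scope for this sketch.

```python
import re
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Simplified placeholder tokens, not the exact strings the crawlers send.
USER_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Googlebot", "CCBot"]


def check_body(status, html, min_words=50):
    """Compute the four audit columns from one static response."""
    body = re.search(r"<body\b[^>]*>(.*)</body>", html, re.I | re.S)
    inner = body.group(1) if body else ""
    # Drop script and style content, then strip the remaining tags.
    inner = re.sub(r"<(script|style)\b.*?</\1>", " ", inner, flags=re.I | re.S)
    words = re.sub(r"<[^>]+>", " ", inner).split()
    return {
        "h1": bool(re.search(r"<h1\b", html, re.I)),
        "body_text": len(words) >= min_words,
        "json_ld": "application/ld+json" in html.lower(),
        "status_200": status == 200,
    }


def fetch(url, user_agent):
    """Return (status, body) for one request; HTTP errors become a status."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        return err.code, ""


def audit_page(url, fetcher=fetch):
    """Build one table row per user agent for a single URL."""
    return {ua: check_body(*fetcher(url, ua)) for ua in USER_AGENTS}
```

Passing a different `fetcher` makes the table testable offline and leaves room for rate limiting or caching when the audit runs at scale.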

For a multi-page site, the same audit runs against the sitemap and produces a per-URL report. Failures cluster — usually a class of page (account-gated, SPA-rendered, paginated) fails together. The cluster is the unit of fix.
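A sketch of the sitemap step follows, assuming a standard sitemaps.org `urlset` document and a precomputed pass/fail result per URL. Grouping by first path segment is an assumed stand-in for "class of page"; a real site may need a different grouping key, such as the template name. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlparse

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Extract every <loc> URL from a sitemap urlset document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def cluster_failures(results):
    """Group failing URLs by first path segment.

    results maps each URL to True (audit passed) or False (audit failed).
    """
    clusters = defaultdict(list)
    for url, passed in results.items():
        if not passed:
            segment = urlparse(url).path.strip("/").split("/")[0] or "(root)"
            clusters[segment].append(url)
    return dict(clusters)
```

If every URL under one segment fails together, that segment points at the shared template or route handler to fix, rather than at the individual pages.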