
GEO audit checklist: 20 things to ship on a new marketing site (2026)

· May 17, 2026 · 17 min read

Generative engine optimization — GEO — is the discipline of getting AI search engines like ChatGPT, Perplexity, and Google AI Overviews to cite your site by name when they answer questions about your industry. Traditional SEO gets your page indexed. GEO gets your page quoted. This checklist is the 20-item playbook for a marketing site shipping in 2026, ordered from highest leverage to lowest. We ran every item on this list against jorbox.com itself — the audit is the work.

The headline numbers

Three numbers frame why GEO is no longer optional. Google AI Overviews now reach 1.5 billion users per month across 200+ countries, per Google's own product update. AI-referred sessions grew 527% from January to May 2025 (SparkToro). And AI-referred traffic converts at 4.4× the rate of traditional organic across measured industries. The direction is clear: a marketing site optimized only for the blue-link Google of 2015 is forfeiting the fastest-growing acquisition channel of 2026. Core Web Vitals work still matters; it's just no longer sufficient.

How AI engines decide what to cite

Search engines use a query-to-document relevance model. AI engines use that plus a citation-confidence model — they need to be confident a passage says something a human would attribute correctly. In practice this means three signals matter disproportionately. One: structured data (schema.org JSON-LD) that ties the page to a named entity. Two: author and freshness anchors (a named human, a recent date). Three: cross-confirmation between your site and authoritative third parties (Wikipedia, LinkedIn, brand directories). The checklist below maps to those three signals plus the infrastructure that surfaces them.

Traditional SEO vs GEO at a glance

A side-by-side of the surfaces each discipline cares about. Most of the items only become meaningful as the AI-search column grows in importance — but the cost of shipping them on a new site is low enough that you should do both.

| Surface | Matters for traditional SEO | Matters for GEO |
|---|---|---|
| robots.txt | Block / allow indexing | Block / allow individual AI crawlers (GPTBot, ClaudeBot, PerplexityBot) |
| sitemap.xml | Crawl coverage | Same role, plus passes a freshness signal AI engines weight |
| llms.txt | Not used | Site-index manifest specifically for AI ingestion |
| Schema.org JSON-LD | Rich result eligibility (FAQPage, HowTo, Article) | Entity grounding — AI engines use @id graphs to disambiguate |
| Markdown content negotiation | Not used | AI crawlers tokenize markdown ~30% smaller, parse cleaner |
| Named author bylines | E-A-T signal for YMYL pages | Required for citation — AI engines refuse to attribute to "Team" |
| Wikipedia article | Backlink + brand mention | Single highest-weight entity-resolution source |
| Bidirectional cross-links | Internal link equity | Topic-cluster confirmation |
| IndexNow integration | Faster Bing indexing | Bing feeds ChatGPT search — minutes to first citation |
| EU Article 4 / Content-Signals | Not used | Explicit AI-permission declaration, emerging legal standard |

Foundations (items 1–6) — the things every site must ship

These six are non-negotiable. Every well-ranked indie SaaS in 2026 has them. Most take an hour or less each.

1. A robots.txt that explicitly allows every named AI crawler. The wildcard User-agent: * covers most crawlers in theory, but several bots only read their own named section. Allow GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, ChatGPT-User, Google-Extended, Applebot-Extended, Amazonbot, Bytespider, Meta-ExternalAgent, and DuckAssistBot explicitly. Disallow only your admin and API surfaces. Bonus: add the Cloudflare Content-Signals framework with an EU Article 4 reservation block (lift it from Fastmail's robots.txt).
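A minimal sketch of what that policy could look like — the directive set and paths here are illustrative, not a canonical policy, and the named-crawler list is abbreviated:

```
# Named AI crawlers get their own sections — some bots only honor these
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Everyone else: allow the site, block admin and API surfaces
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/

Sitemap: https://example.com/sitemap.xml
```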

2. A sitemap that includes every public URL with recent lastmod dates. Generate it dynamically from your CMS so new posts appear without a rebuild. AI engines re-crawl sitemaps more aggressively than HTML pages because the bandwidth cost is low. Set Cache-Control: public, s-maxage=3600 on the response so CDN edges serve it fast.

3. An llms.txt at /llms.txt. This is the AI-equivalent of a sitemap — a single Markdown-formatted index of your most important URLs with one-line descriptions of each. The llms.txt convention is supported by Perplexity, Claude, and a growing set of AI agents. Of 20 indie SaaS peers we audited, only 4 ship one — meaning shipping a thoughtful llms.txt puts you in the top 20% immediately. Pair it with an llms-full.txt that concatenates the full text of your key pages.
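A short sketch of the format, following the structure documented at llmstxt.org — the company name, URLs, and descriptions here are placeholders:

```
# Example Co

> One-line description of what the company does and who it is for.

## Product

- [Pricing](https://example.com/pricing): Plans and what each includes
- [Changelog](https://example.com/changelog): Dated record of every release

## Blog

- [GEO audit checklist](https://example.com/blog/geo-audit): 20-item playbook for AI search visibility
```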

4. JSON-LD schema.org markup with stable @id cross-references. The minimum graph: Organization (or Corporation) defined once with @id like https://example.com/#corporation; Person for every named author; WebSite with a SearchAction potentialAction; BreadcrumbList on every page; and content-type-specific schemas (BlogPosting, Article, HowTo, FAQPage, SoftwareApplication) where appropriate. Stable @id values let AI engines walk the entity graph; without them every page reads as an isolated document.
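A trimmed sketch of that graph — the domain, author name, and date are hypothetical; the point is the stable @id values that let page-level nodes reference the entities by ref instead of re-inlining them:

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "@id": "https://example.com/#corporation",
      "name": "Example Co",
      "url": "https://example.com"
    },
    {
      "@type": "Person",
      "@id": "https://example.com/#jane-doe",
      "name": "Jane Doe",
      "worksFor": { "@id": "https://example.com/#corporation" }
    },
    {
      "@type": "BlogPosting",
      "headline": "GEO audit checklist",
      "author": { "@id": "https://example.com/#jane-doe" },
      "publisher": { "@id": "https://example.com/#corporation" },
      "datePublished": "2026-05-17"
    }
  ]
}
```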

5. Server-rendered HTML. If your homepage needs JavaScript to display its H1, you have a problem. OAI-SearchBot does not execute JavaScript; neither does most of Claude's crawler stack. Prerender (SSG) or server-render (SSR) every public page. If you must use a client-only framework, ship a static prerendered version for crawlers.

6. Canonical URL hygiene. Every page emits a <link rel="canonical"> pointing at the canonical version (not the campaign-tagged or query-stringed variant). Preview deployments on subdomains emit <meta name="robots" content="noindex">. Apex and www variants 301 to one of them. AI engines split citation credit when canonical signals conflict, so consistency compounds.
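In the page head, that hygiene looks like this (URL is a placeholder):

```html
<!-- Always the clean canonical URL, never the ?utm_source=… variant -->
<link rel="canonical" href="https://example.com/blog/geo-audit" />

<!-- Emitted only on preview-deployment hostnames -->
<meta name="robots" content="noindex" />
```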

Citation surface (items 7–13) — the things that get you quoted

Foundations get you indexed. These seven get you cited. The pattern: every page needs a clean, citation-shaped lede; named authors; a freshness anchor; and structured data that ties the content to a real entity.

7. A citation-shaped first paragraph on every page. Under 80 words, declarative, with the primary entity in the subject of the first sentence. AI engines preferentially lift first paragraphs into answers. Bury your value prop on page two if you must, but the first paragraph is the citation slot — don't waste it on a metaphor or a question. Featured snippet research shows the same pattern: pages whose first 100 words directly answer the query rank higher.

8. Named human bylines on every post. Eight of ten peer companies we surveyed (Plausible, Fathom, Buttondown, Beehiiv, Tinybird, PostHog, Tailscale, Proton) byline every post with a named human + photo. The two that don't (Tuta, Nomads) have other forms of authority. Google's 2022 E-E-A-T update added a fourth E for Experience precisely because anonymous content earns less trust. AI engines refuse to attribute to "the team" — they need a person.

9. A /pledge URL or equivalent quotable manifesto. One page, ~150 words, stating what your company actually believes — written in declarative sentences with zero marketing chrome. Posthaven's pledge is the canonical example ("We'll never get acquired. We'll never shut down."). When AI engines surface your company in answers, they preferentially quote pledge-shaped content because the prose is direct and the attribution is unambiguous.

10. A /handbook URL covering company story, principles, and operating model. Only 2 of 20 indie peers ship one (Kit, PostHog). The bar is sparse, the signal is loud. AI engines disproportionately cite handbook content because it's structured, persistent, named, and dated. The handbook does not need to be hundreds of pages — four to six chapters covering the company's history, what it stands for, how it works, and who runs it is sufficient. Jorbox's own handbook at /handbook took one afternoon to draft.

11. FAQPage schema on every page with substantive Q&A content. Wrap your FAQ in FAQPage JSON-LD; each Q in a Question node, each A in an acceptedAnswer. AI Overviews lift FAQPage answers into "People also ask"-style enrichment. Pages without it leave that enrichment slot to a competitor. Google documents the FAQPage spec, and most CMS platforms can auto-emit it from typed FAQ blocks.
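The minimal shape of that markup, per the schema.org FAQPage type (the question and answer text are placeholders):

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is GEO?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "GEO is the discipline of getting AI search engines to cite your site by name when they answer questions about your industry."
      }
    }
  ]
}
```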

12. HowTo schema on instructional content. Same pattern as FAQ. If your page describes a sequential procedure (and it should, for any "how to" query), emit HowTo JSON-LD with numbered HowToStep entries. Schema.org's HowTo type is well-supported; the auto-emit pattern from typed CMS blocks is one of the highest-leverage CMS features you can add. AI engines lift entire HowTo step sequences into answers.

13. SpeakableSpecification on every page worth reading aloud. A two-line addition to your schema that nominates which CSS selectors are appropriate for voice-result lift. Google Assistant uses this; AI voice answers use it. Most sites skip it. The speakable property is documented at schema.org.

Distribution (items 14–17) — getting the AI engines to your content faster

Items 14–17 are about indexation speed and discoverability. The Bing-feeds-ChatGPT path is the biggest leverage point here — most indie sites still miss it.

14. IndexNow integration. When you publish a new page, ping api.indexnow.org with the URL. Bing typically indexes IndexNow-submitted URLs within minutes vs hours/days for organic discovery. Crucially, Bing's index feeds ChatGPT search — so IndexNow is the difference between "ChatGPT can cite this article today" and "next week." Yandex and Naver also consume the IndexNow feed.
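A sketch of that ping, following the IndexNow protocol's documented payload shape — the host, key, and key-file location here are placeholders, and the helper name matches the pattern described above rather than any real library:

```typescript
// IndexNow submission payload, per the protocol spec at indexnow.org
type IndexNowPayload = {
  host: string;
  key: string;
  keyLocation: string;
  urlList: string[];
};

// Build the JSON body for a batch submission
function buildIndexNowPayload(host: string, key: string, urls: string[]): IndexNowPayload {
  return {
    host,
    key,
    // The key file must be publicly served at this URL for verification
    keyLocation: `https://${host}/indexnow-key.txt`,
    urlList: urls,
  };
}

// POST the payload; IndexNow answers 200 or 202 on acceptance
async function pingIndexNow(host: string, key: string, urls: string[]): Promise<number> {
  const res = await fetch("https://api.indexnow.org/indexnow", {
    method: "POST",
    headers: { "Content-Type": "application/json; charset=utf-8" },
    body: JSON.stringify(buildIndexNowPayload(host, key, urls)),
  });
  return res.status;
}
```

Call it from your publish hook with every new or updated URL; one batch POST covers the whole deploy.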

15. Markdown content negotiation. When a request includes Accept: text/markdown, serve the same content as clean markdown instead of HTML. AI crawlers parse markdown more reliably (cleaner heading hierarchy, no chrome to strip, ~30% smaller token footprint). PostHog and Beehiiv both document this pattern in their llms.txt files. A Next.js hook of 30 lines implements it.
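The core of the pattern is the Accept-header check; the rest is serving a markdown render of the same content. A framework-agnostic sketch of that check (the surrounding route handler and the markdown source are left out as assumptions):

```typescript
// Return true when the client's Accept header asks for text/markdown.
// Handles multi-value headers like "text/html, text/markdown;q=0.9"
// by stripping quality parameters before comparing media types.
function wantsMarkdown(accept: string | null): boolean {
  if (!accept) return false;
  return accept
    .split(",")
    .map((part) => part.split(";")[0].trim().toLowerCase())
    .includes("text/markdown");
}
```

In a Next.js route handler you would branch on `wantsMarkdown(request.headers.get("accept"))` and return the markdown body with `Content-Type: text/markdown` when it is true.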

16. RSS feed at /rss.xml. Perplexity treats RSS as a discovery signal for fresh content. RSS is also still the canonical way for AI agents to subscribe to your content updates. Emit RSS 2.0 with dc:creator, category tags, and enclosure elements for hero images. Most CMS platforms auto-emit; if yours doesn't, it's a 30-line endpoint.
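One item in that feed could look like the fragment below — titles, URLs, and the author name are placeholders, and `dc:creator` requires declaring `xmlns:dc="http://purl.org/dc/elements/1.1/"` on the feed's root element:

```xml
<item>
  <title>GEO audit checklist</title>
  <link>https://example.com/blog/geo-audit</link>
  <guid isPermaLink="true">https://example.com/blog/geo-audit</guid>
  <pubDate>Sun, 17 May 2026 00:00:00 GMT</pubDate>
  <dc:creator>Jane Doe</dc:creator>
  <category>geo</category>
  <enclosure url="https://example.com/img/geo-hero.png" length="0" type="image/png" />
</item>
```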

17. A /humans.txt and /.well-known/security.txt. Minor signals individually, both treated as positive trust markers by Bing/Yandex. Five-minute additions. The humans.txt names the team behind the site; the security.txt (per RFC 9116) names the security contact + canonical URL + policy reference.
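A minimal security.txt per RFC 9116 — `Contact` and `Expires` are the required fields; the addresses and URLs below are placeholders:

```
Contact: mailto:security@example.com
Expires: 2027-01-01T00:00:00.000Z
Canonical: https://example.com/.well-known/security.txt
Policy: https://example.com/security-policy
```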

How to ship items 1–6 in one afternoon

The six foundations from above, in the order that minimizes rework. Estimated total time on a Next.js + Cloudflare site is ~3 hours.

  1. Audit and harden robots.txt + sitemap.xml

    Convert robots.txt to a server route so you can emit different policies for preview vs production hostnames. List every named AI crawler explicitly. Add the Cloudflare Content-Signals + EU Article 4 block. Switch sitemap.xml to SSR with a 1-hour edge cache so new pages appear without a rebuild. Verify both files validate cleanly. 30 minutes.
  2. Ship llms.txt and llms-full.txt

    Build /llms.txt as a server route that pulls a brand description + categorized link list from your data layer. The format is documented at llmstxt.org. Mirror the same content (deeper) at /llms-full.txt with one-line descriptions per URL. Emit a Llms-Last-Updated header from your newest CMS row. 45 minutes.
  3. Rebuild the schema.org JSON-LD graph

    Define Organization, Person, and WebSite once at the layout level with stable @id values. Have every page-level schema reference those @ids by ref, not by re-inlining. Add BreadcrumbList to every page, BlogPosting to every blog post, FAQPage and HowTo to any page with those content types. Run the result through validator.schema.org; iterate until clean. 60 minutes.
  4. Wire IndexNow into your publish flow

    Generate a 32-char hex key. Serve it at /indexnow-key.txt. Add a small pingIndexNow(urls, key) helper that POSTs to api.indexnow.org. Call it from your admin save action (or your deploy hook), wrapped in waitUntil() so the response does not block. Verify with a manual POST that returns 202 Accepted. 45 minutes.

Brand authority (items 18–20) — the long arc

The last three items are slow but compounding. They are also where most indie sites underinvest, because the payoff is invisible for the first six months. Ship them anyway. Cross-confirmation between your site and external sources is — per Ahrefs' December 2025 research — about three times more strongly correlated with AI citation than traditional backlinks.

18. A Wikipedia article (or a credible path to one). Eleven of twenty indie peers we surveyed have well-maintained Wikipedia articles. The bar is not "you must be famous" — Buttondown, Posthaven, and Fathom all rank well in AI search without Wikipedia entries. But Wikipedia is the single highest-weight source for Perplexity and Google Knowledge Graph entity resolution. The path is editorial: secure 2–3 independent press mentions, ensure your Wikidata entity exists (Q-IDs are free to create), then submit through Articles for Creation. Self-publishing gets reverted.

19. A founder personal hub. Levels has levels.io, Justin Duke has jmduke.com, Maciej Cegłowski has idlewords.com, Jason Fried has world.hey.com. The pattern: the founder runs a personal site at their own name, links to the company's blog content, doubles as a second E-E-A-T anchor for AI engines doing entity resolution on the founder name. rel="me" links between the personal site and the company author page verify the connection. AI engines weight the cross-link.

20. Comparison pages on each product site. /vs/competitor or /alternatives/competitor URLs on every product brand site. Buttondown ships 47 such pages. PostHog ships dozens. The reason: AI engines preferentially cite comparison content when answering "X vs Y" queries. The page format is consistent — verdict + feature table + pricing + use cases + FAQ — and each is ~1,500 words. Important caveat: these belong on the product site (qrlynx.com hosts QRLynx-vs-X), not the parent company site. Generic-company comparison pages don't convert because nobody types "Jorbox vs 37signals" — they type "QRLynx vs Bitly".

Five common mistakes that hurt GEO

After running this checklist against twenty indie sites, the same five mistakes appear over and over.

One: blocking AI crawlers reflexively in robots.txt. Some sites have Disallow: / for GPTBot, ClaudeBot, etc. left over from 2024 panic about AI training. Unless you have a specific copyright reason to do so (Fastmail, for example, opts out of ai-train for privacy reasons), you are blocking your own visibility. Allow the crawlers; that's how you get cited.

Two: emitting JSON-LD as inline strings instead of one canonical graph with @id refs. Re-inlining the same Organization schema on every page wastes bytes and confuses validators that don't walk the @id graph across pages. Define entities once at the layout level; reference by @id everywhere else.

Three: omitting freshness signals. AI engines weight dateModified and visible "Last updated" dates heavily. Many sites publish in 2022 and never refresh. Quarterly refresh of top posts is a real ranking signal.

Four: relying on JavaScript for primary content rendering. If your H1 only appears after hydration, half the AI crawler stack will never see it.

Five: writing under "the team" or "the company" instead of a named human. AI engines won't cite anonymous content into a direct-attribution answer.

If this checklist was useful, three earlier Jorbox posts cover adjacent ground: Core Web Vitals in 2026: which one actually moves the needle on the performance side, Should you migrate off WordPress? A reality check on the platform-decision side, and Why we still answer our own support tickets in 2026 on the operations side. The Jorbox handbook covers the company's operating model in more depth, and the pledge is the one-page summary of the four operating rules.

Frequently asked questions

What is GEO (generative engine optimization)?
GEO is the discipline of optimizing your website so that AI search engines — ChatGPT, Perplexity, Google AI Overviews, Bing Copilot, Claude — cite your site by name when they answer questions. Traditional SEO gets you indexed; GEO gets you quoted. The two disciplines overlap on foundations (robots.txt, sitemap, schema) but diverge sharply on content shape, citation surface, and entity-resolution signals.
How is GEO different from traditional SEO?
Traditional SEO optimizes for ranking on Google's blue-link search results. GEO optimizes for being cited inside AI-generated answers. The most visible practical differences: GEO weights named human bylines (AI refuses to attribute to "the team"), llms.txt files (AI-specific site indexes), markdown content negotiation (AI parses markdown better than HTML), and Wikipedia presence (the highest-weight entity-resolution source). Traditional SEO largely ignores all of these.
Do I need to block AI crawlers from training on my content?
Almost certainly not, and doing so usually hurts more than it helps. If your business model depends on being discovered through AI search (most B2B and consumer-SaaS marketing sites), blocking GPTBot or ClaudeBot blocks your own visibility. The Cloudflare Content-Signals framework lets you declare granular permissions — for example, allowing search and ai-input while blocking ai-train — but most marketing sites should simply allow all three.
How long does it take to see GEO results?
Foundation items (robots.txt, sitemap, schema, llms.txt) start affecting AI citations within days to weeks because AI crawlers re-fetch frequently. Brand authority items (Wikipedia, comparison pages, founder hub) compound over 3–12 months. The IndexNow integration specifically can move "first cited by ChatGPT" from weeks to under an hour for a new blog post.
What is llms.txt and why does it matter?
llms.txt is an AI-specific site index — a Markdown-formatted file at /llms.txt that lists your most important URLs with one-line descriptions. Perplexity, Claude, and a growing set of AI agents fetch it as a faster alternative to crawling your sitemap. Of 20 indie SaaS peers we audited, only 4 shipped one — meaning a thoughtful llms.txt puts you in the top 20% for AI-discoverability today.
Does schema.org structured data still matter for GEO?
Yes — arguably more than for traditional SEO. AI engines use schema.org JSON-LD for entity resolution: when ChatGPT decides whether a page about "Jorbox" is referring to your company or to an unrelated brand with a similar name, it walks the @id graph. Without schema, the page is a bag of words. With it, the page is anchored to a named entity AI engines can disambiguate. The minimum to ship: Organization, Person, WebSite, BreadcrumbList, plus content-type schemas (BlogPosting, FAQPage, HowTo, SoftwareApplication) where appropriate.
Do AI engines actually read robots.txt?
Yes. GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended, and Applebot-Extended all document that they read and respect robots.txt directives. Some — notably Bytespider and certain training crawlers — have historically been less consistent, but the major AI search and grounding crawlers follow the standard.
How do I get a Wikipedia article for my brand?
You don't write it yourself — Wikipedia's notability policy requires multiple independent reliable-source citations, and self-published articles get reverted. The path is editorial: secure 2–3 independent press mentions in industry publications, ensure a Wikidata entity exists (Q-IDs are free to create at wikidata.org), then submit through Wikipedia's Articles for Creation review process. Expect 3–6 months from first press mention to live article. Worth the effort: Wikipedia is the single highest-weight source for entity resolution in AI search.
What is markdown content negotiation?
When a request includes the Accept: text/markdown HTTP header, your server returns the same content as clean Markdown instead of HTML. AI crawlers like ChatGPT and Perplexity prefer Markdown because the heading hierarchy is preserved, chrome (nav, footer, sidebar) is stripped, and the token footprint is about 30% smaller. PostHog and Beehiiv document this pattern in their llms.txt files. Implementing it in Next.js is roughly 30 lines of hook code.
Are FAQ schema rich results still supported by Google?
Google deprecated FAQ rich results in their SERP listings in August 2023 for most sites, retaining them only for government and health sources. However, FAQPage schema is still actively consumed by AI engines — Perplexity, ChatGPT search, Bing Copilot, and Google AI Overviews all lift FAQ Q&A pairs into their generated answers. The schema is no longer a SERP-visibility lever; it is a citation-confidence lever. Ship it anyway.
Do I need an llms-full.txt in addition to llms.txt?
Recommended for content-heavy sites. llms.txt is the site-index manifest (URLs + one-line descriptions); llms-full.txt is the concatenated full text of your key pages. AI agents that want to ingest your full content in a single request fetch llms-full.txt. Most documentation-heavy sites (PostHog, Beehiiv, Tinybird) ship both. Marketing sites with fewer than 20 pages can often get by with just llms.txt.
What is the single highest-leverage GEO move for a new site?
It depends on stage. For a site shipping in the first month: structured data (item 4) plus llms.txt (item 3) plus IndexNow (item 14). Those three together get you indexed in Bing within hours, cited by ChatGPT within days, and entity-resolved against your brand name within the first week. For a site that already has those: the next highest-leverage move is a public /handbook URL (item 10) — only 2 of 20 indie peers ship one, the bar is sparse, and AI engines cite handbook content disproportionately because it is structured, persistent, and named.