LLMs.txt – What You Need to Know: The Largest Audit to Date from Adobe AEM

Published: June 2026 · longato.ch

Companion piece: this article updates and extends my earlier write-up, llms.txt: my recommendation, August 2025.


The six findings you can quote

“Create llms.txt because it is cheap and Google is now looking at it, not because it will get you cited in ChatGPT today.”

“Across 22,494 recorded requests to /llms.txt over a 30-day window, agents that are verifiably large language models accounted for 258 hits, which is 1.1% of all traffic to the file.”

“The single biggest change since my August 2025 audit is Googlebot. It is now the largest named crawler hitting /llms.txt, with 1,219 recorded requests.”

“92.2% of all /llms.txt traffic came from agents that are neither mainstream search engines nor verifiable LLMs. The file’s main audience today is SEO tooling, monitoring services, and AI-readiness auditors inspecting the file, not models consuming it.”

“OpenAI’s user-facing and search agents, OAI-SearchBot and ChatGPT-User, generated 209 hits across roughly 69 hosts. Add the 33 hits from GPTBot, its training crawler, and that is the totality of OpenAI’s verifiable interest in /llms.txt in this dataset.”

“In a direct referrer analysis I found zero requests anywhere in the logs, search bots included, that carried /llms.txt as their referrer. Whatever crawlers do after reading the file, they do not arrive at other URLs from it in any way the logs can see.”

What changed since August 2025

My August 2025 analysis examined the same question on the same kind of footprint. The qualitative shift over the intervening period is best shown side by side.

August 2025 against June 2026

| Dimension | August 2025 (prior analysis) | June 2026 (this audit) | Direction of change |
| --- | --- | --- | --- |
| Googlebot hitting /llms.txt | Not a meaningful presence | 1,219 hits, the largest named crawler at the file | Major increase |
| Verifiable LLM hits to /llms.txt | Negligible | 258 hits, 1.1% of all traffic | Still negligible as a share |
| OpenAI-specific interest | Minimal | 209 hits from OAI-SearchBot and ChatGPT-User, about 69 hosts | Slightly up, still tiny |
| Dominant traffic source | Already non-LLM | Other / unverified tooling at 92.2% | The bucket has grown and professionalised |
| Self-labelled audit and readiness bots | Emerging | 60.1% of all traffic | New, large category |
| Referrals originating from llms.txt | None observed | Still none observed | Unchanged |
| Crawler entry point | Homepage-led | Homepage-led | Unchanged |

Sources: my prior published analysis from August 2025 for the “before” column, and Datasets C and D plus the referrer analysis for the “after” column.

The most material change is Googlebot’s arrival at /llms.txt in volume. This is consistent with a wider observation in the SEO community. Martina Raissle has noted publicly on LinkedIn that Google has begun including llms.txt in its Lighthouse checks, which is itself a signal that the file is at least on Google’s radar.

I want to be careful about what this does and does not prove. Googlebot fetching a URL is not proof that the content is used for ranking, AI Overviews, or AI Mode. A fetch is a fetch. But it is a clear change from a year ago, and combined with the Lighthouse inclusion, it is the first concrete sign from a major provider that llms.txt is being looked at rather than ignored. I weight this as worth acting on cheaply, not as proven to work, and my recommendation below reflects that.


My recommendation

This is my professional judgement, grounded in the data above.

Recommendation summary

| # | Recommendation | Supporting evidence | Confidence |
| --- | --- | --- | --- |
| 1 | Create the llms.txt file | Googlebot is now the largest named crawler at the file, 1,219 hits; Google has added it to Lighthouse checks | Moderate |
| 2 | Treat it as low-effort insurance, not a growth lever | Generating the file is cheap; the return is asymmetric if providers do begin to use it | High, on the cost logic |
| 3 | Do not expect it to move LLM brand visibility or citations today | Verifiable LLMs account for 1.1% of hits; no referrer trail exists | High |
| 4 | Keep investing in homepage strength and internal linking | Crawlers enter via the homepage and follow links | High |
| 5 | Watch Google AI Mode and AI Overviews specifically | Google’s fetching plus Lighthouse inclusion is the only mover in a year; impact there is plausible but unproven | Low, speculative |

In plain terms: create the file, because Google is now hitting it, and that alone changes the calculus from a year ago. The effort is minimal, so the return on investment is favourable if the providers do in fact consume it; you are buying a cheap option on an uncertain upside. Will it move LLM brand visibility or citations? Probably not, not yet. The traditional consumer LLMs such as ChatGPT are not meaningfully using the file on this evidence, and the honest answer is that the consumption simply is not there at the scale that would move citations. Will it affect Google’s AI Mode? Maybe. Google is the one provider showing changed behaviour. I would not bet the strategy on it, but I would not ignore it either.


What is llms.txt?

llms.txt is a proposed Markdown file placed at the root of a domain, for example https://example.com/llms.txt. The llmstxt.org proposal frames it as a curated, machine-readable map: a short summary of the site plus a hand-picked list of the most important pages, often with companion .md versions of those pages, so that a large language model can find and ingest the high-value content without crawling the entire site or fighting through navigation, scripts, and boilerplate. The analogy its proponents draw is to robots.txt and sitemap.xml: a small, conventional file at a predictable path that machines can rely on. The crucial difference is that robots.txt and sitemap.xml are honoured by documented, identifiable crawlers, whereas llms.txt only delivers value if the LLM providers choose to read it. Whether they do is precisely the question this audit set out to answer with logs rather than opinion.
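To make the format concrete, here is a minimal sketch of reading such a file. It assumes the layout described in the llmstxt.org proposal: an H1 title, a blockquote summary, and H2 sections holding Markdown link lists. The sample content, the example.com URLs, and the parse_llms_txt helper are illustrative assumptions, not taken from any audited site.

```python
import re

# Hypothetical sample file; the site name and URLs are made up for illustration.
SAMPLE = """\
# Example Site

> A short summary of what the site offers.

## Docs

- [Getting started](https://example.com/docs/start.md): installation and setup
- [API reference](https://example.com/docs/api.md)
"""

# A list item of the form "- [title](url)" with an optional ": note" suffix.
LINK = re.compile(r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<note>.*))?$")


def parse_llms_txt(text):
    """Return the title, summary, and per-section links of an llms.txt body."""
    result = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and result["title"] is None:
            result["title"] = line[2:].strip()
        elif line.startswith("> ") and result["summary"] is None:
            result["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            result["sections"][current] = []
        elif current is not None:
            match = LINK.match(line)
            if match:
                result["sections"][current].append(
                    {"title": match["title"], "url": match["url"], "note": match["note"] or ""}
                )
    return result


parsed = parse_llms_txt(SAMPLE)
```

The sketch ignores anything it does not recognise, so a stub file with only a title and a couple of links still parses.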


Why I ran this LLMs.txt audit

Two pressures converged.

The first was a recurring question from customers. I was being asked, on a roughly weekly cadence, whether llms.txt was actually being used, and whether it was worth the effort of generating and maintaining. That is a fair question, and it deserves a data-backed answer rather than a shrug.

The second was the state of the GEO and AEO conversation. The generative-engine-optimisation and answer-engine-optimisation community has been circulating a lot of confident, contradictory, and frequently unsourced claims about llms.txt: that the major models definitely read it, that it definitely boosts citations, or conversely that it is completely ignored. Both extremes tend to be asserted without server logs to back them. The only responsible move was to look at what bots actually do at the file, at scale.

This is, to my knowledge, the largest single llms.txt server-log and crawl audit conducted to date by number of distinct domains and by volume of bot traffic examined. The domains analysed are real customer sites hosted on Adobe Experience Manager, and they include some of the world’s largest websites, which is what makes the bot behaviour observed here representative rather than anecdotal.

“Most public claims about llms.txt are made without real analysis. This audit is my attempt to replace assertion with measurement, at the largest domain scale I am aware of.”


Methodology, scope, and caveats

Here is the setup in full so that the findings can be challenged or replicated.

Working with a server log file analysis tool, plus a large-scale crawl of /llms.txt paths, I assembled four datasets:

| Dataset | Purpose | Rows | Key fields |
| --- | --- | --- | --- |
| A, domain scope log | Which hosts received bot traffic, and how many distinct bots and agents each saw | 6,122 hosts | origin_host, hits, distinct_bots, distinct_agents, first_seen, last_seen |
| B, llms.txt existence crawl | Whether /llms.txt actually resolves on each host, and what it returns | 5,553 crawl rows (4,819 distinct URLs, 4,685 distinct hosts) | Address, Status Code, Content Type, Word Count, Size (Bytes), Crawl Timestamp |
| C, llms.txt hits by host and agent | Every recorded request to /llms.txt, split by host and full user-agent string | 6,749 rows | Host, request_user_agent, hits |
| D, llms.txt hits by agent type | The same hit volume, pre-classified by agent family | 237 rows | User Agent Type, User Agent Name, Full User Agent, Hits |

The hit data in Datasets C and D covers a 30-day window. The crawl in Dataset B carries crawl timestamps dated 29 May 2026.

The four questions I set out to answer were:

  1. How many domains have a live llms.txt file?
  2. When an LLM reads llms.txt, does it then crawl the .md files it lists?
  3. How are LLMs actually finding the pages they crawl?
  4. Are there any referrals coming from llms.txt?

A few caveats, stated openly:

User-agent strings are self-declared. Any bot can claim to be anything. I classify “verifiable LLM” conservatively, counting only agents that match the documented user agents of known model providers such as OpenAI, Anthropic, Perplexity, and You.com. Hits in the “Other / unverified” bucket may include real AI activity behind generic strings, but I will not count what I cannot verify.
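A conservative classification of this kind could be sketched as follows. The tokens are the agent names reported in this article; matching them by case-insensitive substring is an assumption of the sketch, not a description of the tool used for the audit, and the example user-agent strings are hypothetical.

```python
# Substring tokens taken from the agent names reported in this article.
# Case-insensitive substring matching is an assumption of this sketch.
LLM_TOKENS = (
    "oai-searchbot", "chatgpt-user", "gptbot", "claudebot",
    "perplexitybot", "perplexity-user", "youbot",
)
SEARCH_TOKENS = ("googlebot", "baiduspider", "bingbot")


def classify_agent(user_agent):
    """Place a user-agent string in one of three coarse buckets."""
    ua = user_agent.lower()
    if any(token in ua for token in LLM_TOKENS):
        return "LLM / AI (verifiable)"
    if any(token in ua for token in SEARCH_TOKENS):
        return "Search engine"
    # Anything unrecognised stays unverified rather than being guessed at.
    return "Other / unverified"


# Hypothetical user-agent strings, for illustration only.
examples = {
    "Mozilla/5.0 (compatible; GPTBot/1.0)": classify_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"),
    "Mozilla/5.0 (compatible; Googlebot/2.1)": classify_agent("Mozilla/5.0 (compatible; Googlebot/2.1)"),
    "ExampleReadinessAuditor/0.1": classify_agent("ExampleReadinessAuditor/0.1"),
}
```

Because the default branch is "Other / unverified", a sketch like this errs towards under-counting LLM activity, which matches the conservative stance described above.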

Datasets C and D contain no per-event timestamp column. The 30-day window is the query window the data was extracted under; it is not re-derivable from inside the files.

Dataset A’s first_seen and last_seen values span a short capture interval, about five minutes on 28 May 2026, which tells me these are sampling markers from one extract rather than the full 30-day span. I therefore use Dataset A only for structural facts such as host counts and bot diversity per host, and never to infer time-based volume.

The tables below are summary tables. I am not releasing the raw logs. The figures are reproducible in principle by anyone running the same crawl and the same log query.


How many domains actually have an llms.txt file?

This is where precision matters most, because “has an llms.txt” is not a single thing. A request to /llms.txt can return a real Markdown file, a redirect, a 404, a soft-200 HTML page, or an empty 200. I broke Dataset B down by HTTP status.

HTTP status of /llms.txt: 5,553 crawl rows covering 4,685 distinct hosts

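| Status code | Meaning | Crawl rows | Share of rows |
| --- | --- | --- | --- |
| 404 | Not found (no file) | 4,270 | 76.9% |
| 301 | Permanent redirect | 606 | 10.9% |
| 200 | OK (file served) | 175 | 3.2% |
| 403 | Forbidden | 174 | 3.1% |
| 302 | Temporary redirect | 149 | 2.7% |
| 0 | No response or connection failure | 90 | 1.6% |
| 401 | Unauthorised | 47 | 0.8% |
| 406 | Not acceptable | 28 | 0.5% |
| (blank) | No status captured | 12 | 0.2% |
| 410 | Gone | 1 | under 0.1% |
| 307 | Temporary redirect | 1 | under 0.1% |
| Total | | 5,553 | 100% |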

Source: Dataset B, Status Code column. The row count includes 734 duplicate URLs, which I deduplicated before counting hosts.

A 200 response is necessary but not sufficient to call something a real llms.txt. Many 200s are HTML catch-all pages or empty bodies. So I tightened the definition in two further steps.

How many of the 200 responses are genuinely an llms.txt file?

| Definition (progressively stricter) | Distinct hosts | Share of 4,685 probed | Share of 6,122 scope-file hosts |
| --- | --- | --- | --- |
| Any HTTP 200 at /llms.txt | 137 | 2.92% | 2.24% |
| 200 and Content-Type: text/plain | 111 | 2.37% | 1.81% |
| 200 and word count above zero | 20 | 0.43% | 0.33% |

Source: Dataset B, Status Code plus Content Type plus Word Count columns.

Depending on how strictly you define “has a working llms.txt”, the answer ranges from 137 hosts for any 200, down to 111 hosts for files served as plain text, and as low as 20 hosts for plain-text files with actual measurable content. The 23 responses that returned a 200 with an HTML content type are almost certainly not real llms.txt files at all.
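The three definitions can be expressed as a small tiering function. This is a sketch under stated assumptions: it treats the tiers as nested (content only counts on a plain-text 200), and the function name and tier labels are hypothetical, not fields from Dataset B.

```python
def llms_txt_tier(status_code, content_type, word_count):
    """Return the strictest definition a crawl row satisfies.

    Tiers nest: "none" < "any_200" < "plain_text" < "plain_text_with_content".
    """
    if status_code != 200:
        return "none"
    # Drop parameters such as "; charset=utf-8" before comparing the media type.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type != "text/plain":
        return "any_200"  # for example a soft-200 HTML catch-all page
    if word_count > 0:
        return "plain_text_with_content"
    return "plain_text"  # served correctly, but empty


# Hypothetical crawl rows: (status code, content type, word count).
rows = [
    (404, "text/html", 0),
    (200, "text/html; charset=utf-8", 120),
    (200, "text/plain", 0),
    (200, "text/plain; charset=utf-8", 42),
]
tiers = [llms_txt_tier(*row) for row in rows]
```

Counting distinct hosts per tier, rather than rows, is what turns a classification like this into host-level figures.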

“Of 4,685 domains probed, only 137 returned a 200 at /llms.txt. Tighten the definition to plain text with real content and the number collapses to 20. Adoption is not just low; much of the apparent adoption is hollow.”

Data-quality notes for the existence crawl

| Issue | Detail | How I handled it |
| --- | --- | --- |
| Duplicate URLs | 5,553 rows but 4,819 distinct addresses, so 734 duplicate rows | Deduplicated to distinct hosts before counting |
| Soft-200 HTML | 23 of 175 200-responses were text/html, not a text file | Excluded from the strict definitions |
| Empty 200s | 155 of 175 200-responses had a word count of zero | Reported separately and flagged as likely empty or placeholder |
| Word-count range on real files | The 20 non-empty files ran from 2 to 69 words | Reported; even the “real” files are extremely short |

A word count between 2 and 69 on the files that do have content tells me most of these are minimal stubs, a title and a couple of links, rather than the rich, curated index the llmstxt.org proposal envisions. Adoption is shallow on both axes: few sites have the file, and few of those have populated it meaningfully.


Do LLMs crawl the .md files, and are there any referrals from llms.txt?

These two questions share one answer, and it comes from a direct analysis of the referrer field in the logs.

I did not find a single request anywhere in the server logs whose referrer was a /llms.txt URL. This held across all bot types, search engines included, not only LLM agents.

There are two possible explanations, and the logs alone cannot distinguish between them. Either the bots do not crawl immediately: they may read llms.txt, archive or queue what they find, and crawl later in a separate session that carries no referrer linking back to the file. Or the referrer is simply not preserved: bots may crawl the listed .md files but not populate the referrer header with the llms.txt URL.

Either way, the practical consequence is the same. There is no observable evidence in the logs that llms.txt is functioning as a crawl-routing hub. If llms.txt were doing the job its proposal describes, feeding models a list of URLs that they then fetch, I would expect to see at least some referrer trail. I see none.
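A referrer check of this kind might look like the sketch below. The record layout (a dict with path and referrer keys) and the sample records are hypothetical; the audited logs' real schema is not published.

```python
from urllib.parse import urlsplit


def referred_by_llms_txt(referrer):
    """True when the referrer URL's path is exactly /llms.txt."""
    if not referrer or referrer == "-":
        return False  # empty or placeholder referrer
    return urlsplit(referrer).path.lower() == "/llms.txt"


# Hypothetical log records, for illustration only.
records = [
    {"path": "/docs/start.md", "referrer": "https://example.com/"},
    {"path": "/llms.txt", "referrer": "-"},
    {"path": "/docs/api.md", "referrer": ""},
]
hits = [r for r in records if referred_by_llms_txt(r["referrer"])]
```

Note that a filter like this can only see requests that carry a referrer at all, which is why an empty result cannot separate the two explanations above.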


How are LLMs actually finding pages to crawl?

From the same referrer analysis: when bot requests did carry a referrer, it was, in the overwhelming majority of cases, the homepage of the domain.

The behavioural picture is that crawlers, including AI crawlers, predominantly enter a site at the homepage and discover the rest of the site by following links from there, exactly as classical web crawlers always have. They are not, on this evidence, entering via llms.txt and fanning out from its curated list. The homepage and its internal linking remain the primary discovery surface. This is a strong argument that the fundamentals of crawlability and internal linking still matter far more than a curated llms.txt for getting your content seen.
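One way to quantify "the referrer was the homepage" is sketched here. It assumes each request is available as a (host, referrer) pair and counts a referrer as the homepage only when it names the same host with a root path and no query string; both the pair layout and that definition are assumptions of this sketch.

```python
from urllib.parse import urlsplit


def is_own_homepage(host, referrer):
    """True when the referrer is the root URL of the requested host."""
    parts = urlsplit(referrer)
    return parts.hostname == host.lower() and parts.path in ("", "/") and not parts.query


def homepage_share(requests):
    """Share of referrer-carrying requests whose referrer is the host's homepage."""
    present = [(h, r) for h, r in requests if r and r != "-"]
    if not present:
        return 0.0
    return sum(is_own_homepage(h, r) for h, r in present) / len(present)


# Hypothetical (host, referrer) pairs, for illustration only.
sample = [
    ("example.com", "https://example.com/"),
    ("example.com", "https://example.com"),
    ("example.com", "-"),
    ("example.com", ""),
    ("example.com", "https://example.com/products/"),
    ("example.com", "https://other.example/"),
]
share = homepage_share(sample)
```

Requests without a referrer are left out of the denominator, mirroring the "when bot requests did carry a referrer" framing.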

“On the referrer evidence, AI crawlers behave like classical crawlers. They enter at the homepage and follow links. llms.txt is not the front door.”


Who is actually hitting llms.txt? The 22,494-hit breakdown

This is the heart of the audit. Dataset D pre-classifies every recorded hit by agent family, and Dataset C lets me verify that classification against the raw user-agent strings. The two reconcile to the same total, 22,494 against 22,493, a one-hit difference from how the two extracts were generated.

/llms.txt hits by agent type, 30-day window

| User-agent type | Hits | Share |
| --- | --- | --- |
| Other / unverified | 20,746 | 92.2% |
| Search engine | 1,434 | 6.4% |
| LLM / AI (verifiable) | 258 | 1.1% |
| SEO / crawlers (declared) | 36 | 0.2% |
| Dataset / training | 13 | 0.1% |
| Social / preview | 7 | under 0.1% |
| Total | 22,494 | 100% |

Source: Dataset D, User Agent Type by Hits.
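The shares in the table above follow directly from the hit counts, and the arithmetic can be rechecked with a few lines. The hit counts are copied from the table; the display rule (one decimal place, with values that round to 0.0 shown as "under 0.1%") is an assumption about how the figures were formatted.

```python
# Hit counts copied from the agent-type table above.
HITS = {
    "Other / unverified": 20_746,
    "Search engine": 1_434,
    "LLM / AI (verifiable)": 258,
    "SEO / crawlers (declared)": 36,
    "Dataset / training": 13,
    "Social / preview": 7,
}


def format_share(hits, total):
    """Percentage to one decimal; values that round to 0.0 print as 'under 0.1%'."""
    text = f"{100 * hits / total:.1f}"
    return "under 0.1%" if text == "0.0" else f"{text}%"


total = sum(HITS.values())
shares = {name: format_share(hits, total) for name, hits in HITS.items()}
```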

Hits by named agent (the agents that are identifiable)

| Named agent | Operator family | Hits |
| --- | --- | --- |
| Googlebot | Search engine | 1,219 |
| OAI-SearchBot | OpenAI | 153 |
| BaiduSpider | Search engine | 127 |
| ChatGPT-User | OpenAI | 56 |
| Amazonbot | E-commerce / AI | 38 |
| Bingbot | Search engine | 36 |
| GPTBot | OpenAI (training) | 33 |
| AhrefsBot | SEO tool | 28 |
| Applebot | Search / AI | 13 |
| Bytespider | ByteDance | 12 |
| ClaudeBot | Anthropic | 10 |
| SemrushBot | SEO tool | 6 |
| Facebook External Hit | Social preview | 5 |
| PerplexityBot | Perplexity | 4 |
| Meta ExternalAgent | Meta | 2 |
| Perplexity-User | Perplexity | 1 |
| YouBot | You.com | 1 |
| CCBot | Common Crawl | 1 |

Source: Dataset D, User Agent Name by Hits, excluding the “Unknown” aggregate of 20,746.

The verifiable LLM/AI agents in full

| LLM/AI agent | Hits |
| --- | --- |
| OAI-SearchBot (OpenAI) | 153 |
| ChatGPT-User (OpenAI) | 56 |
| GPTBot (OpenAI training) | 33 |
| ClaudeBot (Anthropic) | 10 |
| PerplexityBot (Perplexity) | 4 |
| Perplexity-User (Perplexity) | 1 |
| YouBot (You.com) | 1 |
| Total verifiable LLM/AI | 258 |

Source: Dataset D, User Agent Type = LLM / AI.

“Strip out the search engines and the unverifiable bots, and the entire verifiable-LLM interest in llms.txt, across a 30-day window on thousands of domains, amounts to 258 requests. Anthropic, Perplexity, and You.com combined: 16.”

What is the 92% actually made of?

The unverified bulk deserves scrutiny rather than a dismissive label. Using Dataset C’s raw user-agent strings, I found that it is dominated by a long tail of self-described tooling: site-statistics bots, monitoring bots, SEO site-audit crawlers, and a striking number of agents whose own user-agent strings advertise that they exist to audit or check llms.txt and AI-readiness.

Composition of /llms.txt traffic by operator family (raw-string classification)

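| Operator family | Hits | Share | Distinct hosts touched |
| --- | --- | --- | --- |
| Other / unverified (tooling, monitors, auditors) | 20,772 | 92.3% | 3,134 |
| Google | 1,227 | 5.5% | 319 |
| OpenAI | 242 | 1.1% | 69 |
| Baidu | 127 | 0.6% | 36 |
| Amazon | 38 | 0.2% | 12 |
| Microsoft / Bing | 35 | 0.2% | 20 |
| Apple | 13 | 0.1% | 13 |
| ByteDance | 12 | 0.1% | 5 |
| Anthropic | 12 | 0.1% | 11 |
| Meta | 8 | under 0.1% | 4 |
| Perplexity | 5 | under 0.1% | 5 |
| You.com | 1 | under 0.1% | 1 |
| Common Crawl | 1 | under 0.1% | 1 |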

Source: Dataset C, full user-agent strings classified by operator. Minor differences from the agent-type table reflect the raw-string method counting AdsBot-Google and similar agents under their parent family.

Two concentration facts stand out. The top ten user-agent strings alone accounted for 17,569 of 22,493 hits, which is 78.1% of all traffic to the file. And agents whose user-agent string self-labels with terms such as audit, monitor, readiness, llms.txt, crawler, GEO, or research represented 105 distinct agents and 13,508 hits, which is 60.1% of all traffic.
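Both concentration measures can be sketched in a few lines. The keyword list is the one named in the paragraph above; plain substring matching is an assumption of this sketch and is crude (a short term such as "geo" will also match unrelated words), and the sample agents and counts are hypothetical.

```python
from collections import Counter

# Terms listed in the text; substring matching is an assumption of this sketch.
SELF_LABEL_TERMS = ("audit", "monitor", "readiness", "llms.txt", "crawler", "geo", "research")


def is_self_labelled(user_agent):
    """True when the user-agent string contains any of the self-label terms."""
    ua = user_agent.lower()
    return any(term in ua for term in SELF_LABEL_TERMS)


def top_n_share(hits_by_agent, n):
    """Share of all hits produced by the n busiest user-agent strings."""
    total = sum(hits_by_agent.values())
    top = sum(hits for _, hits in Counter(hits_by_agent).most_common(n))
    return top / total if total else 0.0


# Hypothetical agents and hit counts, for illustration only.
sample = {
    "ExampleSiteAudit/1.0": 60,
    "ExampleMonitor/2.3": 25,
    "Mozilla/5.0 (compatible; GenericFetcher)": 10,
    "ExampleBrowser/9": 5,
}
top_two = top_n_share(sample, 2)
self_labelled_hits = sum(h for ua, h in sample.items() if is_self_labelled(ua))
```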

“60% of all traffic to llms.txt came from agents that openly describe themselves as auditors, monitors, or readiness-checkers. The file’s biggest use case right now is being inspected to see whether it exists, a self-referential market rather than consumption by models.”

This is the most under-reported reality of llms.txt in mid-2026. Raw hit counts on the file are rising, and it is tempting to read that as LLMs adopting it. The composition says otherwise. A large share of the traffic is the GEO ecosystem checking itself: tools verifying that a customer has the file, monitors polling for changes, readiness-scanners selling the idea that the file matters. That activity is real, but it is not evidence that any model is using the file to answer questions.


Host-level reality check

Beyond raw hits, I cross-referenced which hosts have a real file against which hosts received any /llms.txt traffic.

Hosts: file presence against received traffic

| Measure | Count |
| --- | --- |
| Hosts returning 200 at /llms.txt (www-normalised) | 130 |
| Hosts that received at least one /llms.txt request (www-normalised) | 2,649 |
| Hosts that both have a file and received a hit | 80 |
| Hosts that have a file but recorded no hit | 50 |
| Distinct hosts receiving any /llms.txt hit (raw) | 3,236 |

Source: Datasets B and C, joined on www-normalised host.

Two things stand out. First, the vast majority of /llms.txt requests land on hosts that do not even have the file: bots and tools are probing for it speculatively and hitting 404s. Second, of the hosts that do have a real file, more than a third, 50 of 130, saw no recorded hit at all in the window. Presence and attention are only loosely coupled.
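A join of this shape can be sketched with set operations. The normalisation rule here (lower-casing and dropping one leading "www." label) is an assumption about what "www-normalised" means, and the host names are hypothetical.

```python
def normalise_host(host):
    """Lower-case a hostname and drop one leading 'www.' label."""
    host = host.strip().lower().rstrip(".")
    return host[4:] if host.startswith("www.") else host


def presence_vs_traffic(hosts_with_file, hosts_with_hits):
    """Cross-reference hosts that serve the file against hosts that were hit."""
    files = {normalise_host(h) for h in hosts_with_file}
    hits = {normalise_host(h) for h in hosts_with_hits}
    return {
        "file_and_hit": len(files & hits),
        "file_no_hit": len(files - hits),
        "hit_no_file": len(hits - files),
    }


# Hypothetical host lists, for illustration only.
summary = presence_vs_traffic(
    ["www.example.com", "docs.example.org", "Example.net"],
    ["example.com", "www.example.net", "shop.example.io"],
)
```

Normalising before the join matters because the crawl and the logs may record the same site with and without the www prefix.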



Limitations and an invitation to challenge

Here is where this audit stops short.

User agents are self-declared, so the 92.2% Other bucket could hide real AI activity behind generic strings. I have deliberately under-counted LLM activity rather than over-claim it. The hit datasets carry no per-event timestamps, so the 30-day window is the extraction window rather than a field I can re-derive. Fetched does not mean used: nothing in server logs can prove that any provider used llms.txt content in a model output, because logs show requests, not downstream use. This is a snapshot, a single 30-day window compared qualitatively to a prior one, not a continuous time series. And referrer behaviour is provider-dependent, so the absence of a referrer trail is strong evidence of no observable routing rather than absolute proof that no provider ever crawls from the file.

If you can replicate, extend, or contradict any of this with your own logs, I want to hear about it. I will investigate and publish a visible correction if anything here proves wrong.


Frequently asked questions

How many websites actually have an llms.txt file? In this audit, of 4,685 domains probed, 137 returned a working 200 response at /llms.txt, which is about 2.9%. If you require the file to be served as plain text the number is 111, and if you require it to contain real content it drops to 20.

What percentage of websites have llms.txt? On this AEM-hosted sample, between 0.4% and 2.9% depending on how strictly you define a working file. The headline figure of 2.9% counts any 200 response; the strict figure of 0.4% counts only plain-text files with measurable content.

Do large language models actually read llms.txt? Rarely, on this evidence. Verifiable LLM agents accounted for 258 of 22,494 requests to the file, which is 1.1% of all traffic, over a 30-day window across thousands of domains.

Does ChatGPT use llms.txt? OpenAI’s search and user agents, OAI-SearchBot and ChatGPT-User, made 209 requests across roughly 69 hosts. That is real but tiny, and there is no evidence in the logs that the file drives any onward crawling.

Does Google use llms.txt? Googlebot is now the single largest named crawler hitting the file, with 1,219 requests. Google has also begun including llms.txt in Lighthouse checks. A fetch is not proof of use in ranking or AI features, but it is a clear change from a year ago.

Does Gemini or Google AI Mode use llms.txt? I cannot confirm this from the data. What I can confirm is that Googlebot is fetching the file. Whether that content feeds AI Mode or AI Overviews is plausible but unproven on these logs.

Does Claude use llms.txt? Anthropic’s ClaudeBot made 10 requests to the file across the entire dataset. That is negligible.

Does Perplexity use llms.txt? Perplexity’s agents made 5 requests in total, PerplexityBot and Perplexity-User combined. That is negligible.

Is llms.txt worth creating in 2026? My view is yes, but as cheap insurance rather than a growth lever. It costs little to create, Google is now hitting it, and the upside is asymmetric if providers begin to consume it. Do not expect it to move LLM citations today.

Will llms.txt improve my rankings? There is no evidence in this data that it does. Crawlers enter via the homepage and follow internal links. Classical crawlability and internal linking remain far more important.

Will llms.txt get my brand cited in AI answers? Probably not at present. The models that drive consumer AI answers are barely touching the file, and there is no observable crawl activity downstream of it.

Do LLMs crawl the .md files listed in llms.txt? There is no evidence that they do so directly from the file. I found zero requests whose referrer was an llms.txt URL, so either crawlers do not crawl immediately after reading it, or they do not preserve the referrer.

How do LLMs and AI crawlers find pages to crawl? Predominantly via the homepage. When requests carried a referrer it was almost always the domain homepage, indicating crawlers enter there and follow internal links, exactly as classical crawlers do.

Should llms.txt be plain text or HTML? Plain text. In this audit, 23 of the 175 200-responses were served as HTML, and those are almost certainly catch-all pages rather than real llms.txt files. A real file should return text/plain.

Why do so many llms.txt requests return a 404? Because most sites do not have the file. In this crawl, 76.9% of probed URLs returned a 404. Many bots and tools probe for /llms.txt speculatively and simply hit a missing file.

What bots hit llms.txt the most? The largest single sources are unverified tooling and monitoring bots, followed by Googlebot as the largest named crawler. The top ten user-agent strings alone made up 78.1% of all traffic to the file.

Are most llms.txt hits really from AI models? No. 92.2% of traffic came from agents that are neither mainstream search engines nor verifiable LLMs, largely SEO tools, monitors, and AI-readiness auditors. Only 1.1% came from verifiable LLMs.

What is an llms.txt auditor bot? It is a crawler, often from a GEO or SEO tool, whose purpose is to check whether a site has an llms.txt file and report on it. In this dataset, agents that self-label as auditors, monitors, or readiness-checkers accounted for 60.1% of all traffic to the file.

Does having an llms.txt file guarantee bots will read it? No. Of the 130 hosts with a real file, 50 recorded no hit at all in the window. Presence and attention are only loosely coupled.

How big should an llms.txt file be? The proposal envisions a curated index, but in practice the files that had content in this audit were very short, between 2 and 69 words, suggesting most are minimal stubs. Aim for a genuinely useful, curated list of your most important pages rather than a token file.

Is llms.txt the same as robots.txt or sitemap.xml? It is similar in concept, a small conventional file at a predictable path, but different in standing. robots.txt and sitemap.xml are honoured by documented crawlers, whereas llms.txt only delivers value if model providers choose to read it, and on this evidence most do not yet.

Did anything change with llms.txt between 2025 and 2026? The biggest change is Google. Googlebot went from a non-presence to the largest named crawler at the file, and Google added it to Lighthouse. Everything else stayed roughly the same: verifiable LLM usage remained negligible, and no referrer trail from the file appeared.

Is this the largest llms.txt study? To my knowledge, yes, by number of distinct domains and by volume of bot traffic examined. The data comes from real customer domains hosted on Adobe Experience Manager, including some of the world’s largest websites.

Where does the data in this analysis come from? From server-log and crawl data across customer domains hosted on Adobe Experience Manager, analysed with a server log file analysis tool over a 30-day window, with a companion crawl of /llms.txt paths dated 29 May 2026.

How was the data anonymised? No customer, brand, or third-party vendor names appear anywhere in this article. Every identifier has been removed and replaced with a neutral category label, and only aggregate summary figures are published.

Can I reproduce these findings myself? Yes, in principle. Crawl /llms.txt across your domain set and record status, content type, and word count; query 30 days of server logs for requests to /llms.txt grouped by host and user-agent string; classify user agents conservatively; and separately query the referrer field for any request whose referrer is /llms.txt.

What is the single most important takeaway? That raw hit counts on llms.txt are misleading. Most of the traffic is the GEO ecosystem checking itself, not models consuming the file. Create the file because it is cheap and Google is now looking at it, but keep your real investment in homepage strength and internal linking.

A note on the data and on disclosure. The findings in this article come from server-log and crawl data across customer domains hosted on Adobe Experience Manager (AEM). I analysed this data directly using a server log file analysis tool. I work in this field, and all views expressed here are my own and do not represent those of my employer. No customer, brand, or third-party vendor names appear anywhere in this article. Every identifier has been removed and replaced with a neutral category label.


Written by Flavio Longato and published June 2026 on longato.ch. All views my own and not those of my employer. Companion analysis: llms.txt, my recommendation, August 2025. Spotted an error? Get in touch via longato.ch and I will publish a visible correction.
