---
title: "LLMs.txt &#8211; What You Need to Know: The Largest Audit to Date from Adobe AEM"
date: "2026-06-01"
author: "Flavio Longato"
categories: ["Generative Engine Optimization Course", "GEO"]
url: "https://www.longato.ch/llmstxt-2026-june/"
---

**Published:** June 2026 · longato.ch

**Companion piece:** this article updates and extends my earlier write-up, [*llms.txt: my recommendation, August 2025*](https://www.longato.ch/llms-recommendation-2025-august/).

---

The five findings you can quote
-------------------------------

> “Create `llms.txt` because it is cheap and Google is now looking at it, not because it will get you cited in ChatGPT today.”

> “Across 22,494 recorded requests to `/llms.txt` over a 30-day window, agents that are verifiably large language models accounted for 258 hits, which is 1.1% of all traffic to the file.”

> “The single biggest change since my August 2025 audit is Googlebot. It is now the largest named crawler hitting `/llms.txt`, with 1,219 recorded requests.”

> “92.2% of all `/llms.txt` traffic came from agents that are neither mainstream search engines nor verifiable LLMs. The file’s main audience today is SEO tooling, monitoring services, and AI-readiness auditors inspecting the file, not models consuming it.”

> “OpenAI’s user-facing and search agents, OAI-SearchBot and ChatGPT-User, generated 209 hits across roughly 69 hosts. That is the totality of OpenAI’s interest in `/llms.txt` in this dataset.”

> “In a direct referrer analysis I found zero requests anywhere in the logs, search bots included, that carried `/llms.txt` as their referrer. Whatever crawlers do after reading the file, they do not arrive at other URLs from it in any way the logs can see.”

What changed since August 2025
------------------------------

My [August 2025 analysis](https://www.longato.ch/llms-recommendation-2025-august/) examined the same question on the same kind of footprint. The qualitative shift over the intervening period is best shown side by side.

**August 2025 against June 2026**

| Dimension | August 2025 (prior analysis) | June 2026 (this audit) | Direction of change |
| --- | --- | --- | --- |
| Googlebot hitting `/llms.txt` | Not a meaningful presence | 1,219 hits, the largest named crawler at the file | Major increase |
| Verifiable LLM hits to `/llms.txt` | Negligible | 258 hits, 1.1% of all traffic | Still negligible as a share |
| OpenAI-specific interest | Minimal | 209 hits from OAI-SearchBot and ChatGPT-User, about 69 hosts | Slightly up, still tiny |
| Dominant traffic source | Already non-LLM | Other / unverified tooling at 92.2% | The bucket has grown and professionalised |
| Self-labelled audit and readiness bots | Emerging | 60.1% of all traffic | New, large category |
| Referrals originating from `llms.txt` | None observed | Still none observed | Unchanged |
| Crawler entry point | Homepage-led | Homepage-led | Unchanged |

*Sources: my prior published analysis from August 2025 for the “before” column, and Datasets C and D plus the referrer analysis for the “after” column.*

The most material change is Googlebot’s arrival at `/llms.txt` in volume. This is consistent with a wider observation in the SEO community. Martina Raissle has [noted publicly on LinkedIn](https://www.linkedin.com/posts/martina-raissle_seo-technicalseo-searchengineoptimization-share-7465338294109663232-th6Q/) that Google has begun including `llms.txt` in its Lighthouse checks, which is itself a signal that the file is at least on Google’s radar.

I want to be careful about what this does and does not prove. Googlebot fetching a URL is not proof that the content is used for ranking, AI Overviews, or AI Mode. A fetch is a fetch. But it is a clear change from a year ago, and combined with the Lighthouse inclusion, it is the first concrete sign from a major provider that `llms.txt` is being looked at rather than ignored. I weight this as worth acting on cheaply, not as proven to work, and my recommendation below reflects that.

---

My recommendation
-----------------

This is my professional judgement, grounded in the data above.

**Recommendation summary**

| # | Recommendation | Supporting evidence | Confidence |
| --- | --- | --- | --- |
| 1 | Create the `llms.txt` file | Googlebot is now the largest named crawler at the file, 1,219 hits; Google has added it to Lighthouse checks | Moderate |
| 2 | Treat it as low-effort insurance, not a growth lever | Generating the file is cheap; the return is asymmetric if providers do begin to use it | High, on the cost logic |
| 3 | Do not expect it to move LLM brand visibility or citations today | Verifiable LLMs account for 1.1% of hits; no referrer trail exists | High |
| 4 | Keep investing in homepage strength and internal linking | Crawlers enter via the homepage and follow links | High |
| 5 | Watch Google AI Mode and AI Overviews specifically | Google’s fetching plus Lighthouse inclusion is the only mover in a year; impact there is plausible but unproven | Low, speculative |

In plain terms: create the file, because Google is now hitting it, and that alone changes the calculus from a year ago. The effort is minimal, so the return on investment is favourable if the providers do in fact consume it; you are buying a cheap option on an uncertain upside. Will it move LLM brand visibility or citations? Probably not, not yet. The traditional consumer LLMs such as ChatGPT are not meaningfully using the file on this evidence, and the honest answer is that the consumption simply is not there at the scale that would move citations. Will it affect Google’s AI Mode? Maybe. Google is the one provider showing changed behaviour. I would not bet the strategy on it, but I would not ignore it either.

---

What is llms.txt?
-----------------

`llms.txt` is a proposed Markdown file placed at the root of a domain, for example `https://example.com/llms.txt`. The [llmstxt.org](https://llmstxt.org/) proposal frames it as a curated, machine-readable map: a short summary of the site plus a hand-picked list of the most important pages, often with companion `.md` versions of those pages, so that a large language model can find and ingest the high-value content without crawling the entire site or fighting through navigation, scripts, and boilerplate. The analogy its proponents draw is to `robots.txt` and `sitemap.xml`: a small, conventional file at a predictable path that machines can rely on. The crucial difference is that `robots.txt` and `sitemap.xml` are honoured by documented, identifiable crawlers, whereas `llms.txt` only delivers value if the LLM providers choose to read it. Whether they do is precisely the question this audit set out to answer with logs rather than opinion.

---

Why I ran this LLMs.txt audit
-----------------------------

Two pressures converged.

The first was a recurring question from customers. I was being asked, on a roughly weekly cadence, whether `llms.txt` was actually being used, and whether it was worth the effort of generating and maintaining. That is a fair question, and it deserves a data-backed answer rather than a shrug.

The second was the state of the GEO and AEO conversation. The generative-engine-optimisation and answer-engine-optimisation community has been circulating a lot of confident, contradictory, and frequently unsourced claims about `llms.txt`: that the major models definitely read it, that it definitely boosts citations, or conversely that it is completely ignored. Both extremes tend to be asserted without server logs to back them. The only responsible move was to look at what bots actually do at the file, at scale.

This is, to my knowledge, the largest single `llms.txt` server-log and crawl audit conducted to date by number of distinct domains and by volume of bot traffic examined. The domains analysed are real customer sites hosted on Adobe Experience Manager, and they include some of the world’s largest websites, which is what makes the bot behaviour observed here representative rather than anecdotal.

> “Most public claims about `llms.txt` are made without real analysis. This audit is my attempt to replace assertion with measurement, at the largest domain scale I am aware of.”

---

Methodology, scope, and caveats
-------------------------------

Here is the setup in full so that the findings can be challenged or replicated.

Working with a server log file analysis tool, plus a large-scale crawl of `/llms.txt` paths, I assembled four datasets:

| Dataset | Purpose | Rows | Key fields |
| --- | --- | --- | --- |
| A, domain scope log | Which hosts received bot traffic, and how many distinct bots and agents each saw | 6,122 hosts | `origin_host`, `hits`, `distinct_bots`, `distinct_agents`, `first_seen`, `last_seen` |
| B, llms.txt existence crawl | Whether `/llms.txt` actually resolves on each host, and what it returns | 5,553 crawl rows (4,819 distinct URLs, 4,685 distinct hosts) | `Address`, `Status Code`, `Content Type`, `Word Count`, `Size (Bytes)`, `Crawl Timestamp` |
| C, llms.txt hits by host and agent | Every recorded request to `/llms.txt`, split by host and full user-agent string | 6,749 rows | `Host`, `request_user_agent`, `hits` |
| D, llms.txt hits by agent type | The same hit volume, pre-classified by agent family | 237 rows | `User Agent Type`, `User Agent Name`, `Full User Agent`, `Hits` |

The hit data in Datasets C and D covers a 30-day window. The crawl in Dataset B carries crawl timestamps dated 29 May 2026.

The four questions I set out to answer were:

1. How many domains have a live `llms.txt` file?
2. When an LLM reads `llms.txt`, does it then crawl the `.md` files it lists?
3. How are LLMs actually finding the pages they crawl?
4. Are there any referrals coming from `llms.txt`?
 
A few caveats, stated openly:

User-agent strings are self-declared. Any bot can claim to be anything. I classify “verifiable LLM” conservatively, counting only agents that match the documented user agents of known model providers such as OpenAI, Anthropic, Perplexity, and You.com. Hits in the “Other / unverified” bucket may include real AI activity behind generic strings, but I will not count what I cannot verify.
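The conservative rule described above can be sketched as a simple token match. This is a minimal illustration, not the classifier the log tool actually uses: the function name `classify_agent`, the bucket labels as return values, and the grouping of tokens are assumptions, with the tokens themselves taken from the agent names reported in this audit.

```python
# Agent-name tokens as they appear in this audit's tables.
# How they are grouped here is an illustrative assumption.
VERIFIABLE_LLM_TOKENS = (
    "OAI-SearchBot", "ChatGPT-User", "GPTBot",
    "ClaudeBot", "PerplexityBot", "Perplexity-User", "YouBot",
)
SEARCH_ENGINE_TOKENS = ("Googlebot", "bingbot", "Baiduspider")


def classify_agent(user_agent: str) -> str:
    """Return a coarse agent type; anything unmatched stays unverified."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in VERIFIABLE_LLM_TOKENS):
        return "LLM / AI (verifiable)"
    if any(token.lower() in ua for token in SEARCH_ENGINE_TOKENS):
        return "Search engine"
    # Deliberately conservative: no guessing about generic strings.
    return "Other / unverified"
```

Because the match relies only on the self-declared string, it inherits the spoofing caveat stated above; it under-counts rather than over-counts by design.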

Datasets C and D contain no per-event timestamp column. The 30-day window is the query window the data was extracted under; it is not re-derivable from inside the files.

Dataset A’s `first_seen` and `last_seen` values span a short capture interval, about five minutes on 28 May 2026, which tells me these are sampling markers from one extract rather than the full 30-day span. I therefore use Dataset A only for structural facts such as host counts and bot diversity per host, and never to infer time-based volume.

The tables below are summary tables. I am not releasing the raw logs. The figures are reproducible in principle by anyone running the same crawl and the same log query.

---

How many domains actually have an llms.txt file?
------------------------------------------------

This is where precision matters most, because “has an `llms.txt`” is not a single thing. A request to `/llms.txt` can return a real Markdown file, a redirect, a 404, a soft-200 HTML page, or an empty 200. I broke Dataset B down by HTTP status.

**HTTP status of `/llms.txt` across 4,685 distinct hosts**

| Status code | Meaning | Crawl rows | Share of rows |
| --- | --- | --- | --- |
| 404 | Not found (no file) | 4,270 | 76.9% |
| 301 | Permanent redirect | 606 | 10.9% |
| 200 | OK (file served) | 175 | 3.2% |
| 403 | Forbidden | 174 | 3.1% |
| 302 | Temporary redirect | 149 | 2.7% |
| 0 | No response or connection failure | 90 | 1.6% |
| 401 | Unauthorised | 47 | 0.8% |
| 406 | Not acceptable | 28 | 0.5% |
| (blank) | No status captured | 12 | 0.2% |
| 410 | Gone | 1 | under 0.1% |
| 307 | Temporary redirect | 1 | under 0.1% |
| **Total** | | **5,553** | **100%** |

*Source: Dataset B, `Status Code` column. The row count includes 734 duplicate URLs, which I deduplicated before counting hosts.*

A 200 response is necessary but not sufficient to call something a real `llms.txt`. Many 200s are HTML catch-all pages or empty bodies. So I tightened the definition in two further steps.

**How many of the 200 responses are genuinely an llms.txt file?**

| Definition (progressively stricter) | Distinct hosts | Share of 4,685 probed | Share of 6,122 scope-file hosts |
| --- | --- | --- | --- |
| Any HTTP 200 at `/llms.txt` | 137 | 2.92% | 2.24% |
| 200 and `Content-Type: text/plain` | 111 | 2.37% | 1.81% |
| 200 and word count above zero | 20 | 0.43% | 0.33% |

*Source: Dataset B, `Status Code` plus `Content Type` plus `Word Count` columns.*

Depending on how strictly you define “has a working `llms.txt`”, the answer ranges from 137 hosts for any 200, down to 111 hosts for files served as plain text, and as low as 20 hosts for plain-text files with actual measurable content. The 23 responses that returned a 200 with an HTML content type are almost certainly not real `llms.txt` files at all.
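The three definitions can be expressed as nested tiers over a crawl row. The sketch below is illustrative only: the `CrawlRow` type and `strictness_tier` function are hypothetical names, the fields mirror Dataset B's `Status Code`, `Content Type`, and `Word Count` columns, and it assumes the strictest tier requires plain text as well as a non-zero word count.

```python
from dataclasses import dataclass


@dataclass
class CrawlRow:
    """One crawl result for a host's /llms.txt (hypothetical shape)."""
    host: str
    status_code: int
    content_type: str
    word_count: int


def strictness_tier(row: CrawlRow) -> int:
    """0 = no 200; 1 = any 200; 2 = 200 as text/plain; 3 = text/plain with words."""
    if row.status_code != 200:
        return 0
    # Content-Type may carry parameters such as "; charset=utf-8".
    if not row.content_type.lower().startswith("text/plain"):
        return 1
    return 3 if row.word_count > 0 else 2
```

Counting distinct hosts at tier 1 or above, tier 2 or above, and tier 3 would give the three rows of the table, under the nesting assumption stated here.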

> “Of 4,685 domains probed, only 137 returned a 200 at `/llms.txt`. Tighten the definition to plain text with real content and the number collapses to 20. Adoption is not just low; much of the apparent adoption is hollow.”

**Data-quality notes for the existence crawl**

| Issue | Detail | How I handled it |
| --- | --- | --- |
| Duplicate URLs | 5,553 rows but 4,819 distinct addresses, so 734 duplicate rows | Deduplicated to distinct hosts before counting |
| Soft-200 HTML | 23 of 175 200-responses were `text/html`, not a text file | Excluded from the strict definitions |
| Empty 200s | 155 of 175 200-responses had a word count of zero | Reported separately and flagged as likely empty or placeholder |
| Word-count range on real files | The 20 non-empty files ran from 2 to 69 words | Reported; even the “real” files are extremely short |

A word count between 2 and 69 on the files that do have content tells me most of these are minimal stubs, a title and a couple of links, rather than the rich, curated index the llmstxt.org proposal envisions. Adoption is shallow on both axes: few sites have the file, and few of those have populated it meaningfully.

---

Do LLMs crawl the .md files, and are there any referrals from llms.txt?
-----------------------------------------------------------------------

These two questions share one answer, and it comes from a direct analysis of the referrer field in the logs.

I did not find a single request anywhere in the server logs whose referrer was a `/llms.txt` URL. This held across all bot types, search engines included, not only LLM agents.

There are two possible explanations, and the logs alone cannot distinguish between them. Either the bots do not crawl immediately: they may read `llms.txt`, archive or queue what they find, and crawl later in a separate session that carries no referrer linking back to the file. Or the referrer is simply not preserved: bots may crawl the listed `.md` files but not populate the referrer header with the `llms.txt` URL.

Either way, the practical consequence is the same. There is no observable evidence in the logs that `llms.txt` is functioning as a crawl-routing hub. If `llms.txt` were doing the job its proposal describes, feeding models a list of URLs that they then fetch, I would expect to see at least some referrer trail. I see none.
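For anyone replicating the referrer check on their own logs, it could look roughly like the sketch below. The record shape (a dict with a `referrer` key) and both function names are assumptions for illustration; real log schemas will differ, and this is not the query used in this audit.

```python
from urllib.parse import urlsplit


def is_llms_txt_referrer(referrer: str) -> bool:
    """True when the referrer URL's path is exactly /llms.txt."""
    # Many log formats write "-" when no referrer was sent.
    if not referrer or referrer == "-":
        return False
    return urlsplit(referrer).path.lower() == "/llms.txt"


def count_llms_txt_referrals(requests: list[dict]) -> int:
    """Count log records whose 'referrer' field points at an llms.txt file."""
    return sum(
        1 for record in requests
        if is_llms_txt_referrer(record.get("referrer", ""))
    )
```

A result of zero from a check like this is consistent with both explanations above: it shows only that no routing is visible in the referrer field.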

---

How are LLMs actually finding pages to crawl?
---------------------------------------------

From the same referrer analysis: when bot requests did carry a referrer, it was, in the overwhelming majority of cases, the homepage of the domain.

The behavioural picture is that crawlers, including AI crawlers, predominantly enter a site at the homepage and discover the rest of the site by following links from there, exactly as classical web crawlers always have. They are not, on this evidence, entering via `llms.txt` and fanning out from its curated list. The homepage and its internal linking remain the primary discovery surface. This is a strong argument that the fundamentals of crawlability and internal linking still matter far more than a curated `llms.txt` for getting your content seen.

> “On the referrer evidence, AI crawlers behave like classical crawlers. They enter at the homepage and follow links. `llms.txt` is not the front door.”

---

Who is actually hitting llms.txt? The 22,494-hit breakdown
----------------------------------------------------------

This is the heart of the audit. Dataset D pre-classifies every recorded hit by agent family, and Dataset C lets me verify that classification against the raw user-agent strings. The two totals agree to within a single hit, 22,494 against 22,493; the one-hit difference comes from how the two extracts were generated.

**`/llms.txt` hits by agent type, 30-day window**

| User-agent type | Hits | Share |
| --- | --- | --- |
| Other / unverified | 20,746 | 92.2% |
| Search engine | 1,434 | 6.4% |
| LLM / AI (verifiable) | 258 | 1.1% |
| SEO / crawlers (declared) | 36 | 0.2% |
| Dataset / training | 13 | 0.1% |
| Social / preview | 7 | under 0.1% |
| **Total** | **22,494** | **100%** |

*Source: Dataset D, `User Agent Type` by `Hits`.*

**Hits by named agent (the agents that are identifiable)**

| Named agent | Operator family | Hits |
| --- | --- | --- |
| Googlebot | Search engine | 1,219 |
| OAI-SearchBot | OpenAI | 153 |
| BaiduSpider | Search engine | 127 |
| ChatGPT-User | OpenAI | 56 |
| Amazonbot | E-commerce / AI | 38 |
| Bingbot | Search engine | 36 |
| GPTBot | OpenAI (training) | 33 |
| AhrefsBot | SEO tool | 28 |
| Applebot | Search / AI | 13 |
| Bytespider | ByteDance | 12 |
| ClaudeBot | Anthropic | 10 |
| SemrushBot | SEO tool | 6 |
| Facebook External Hit | Social preview | 5 |
| PerplexityBot | Perplexity | 4 |
| Meta ExternalAgent | Meta | 2 |
| Perplexity-User | Perplexity | 1 |
| YouBot | You.com | 1 |
| CCBot | Common Crawl | 1 |

*Source: Dataset D, `User Agent Name` by `Hits`, excluding the “Unknown” aggregate of 20,746.*

**The verifiable LLM/AI agents in full**

| LLM/AI agent | Hits |
| --- | --- |
| OAI-SearchBot (OpenAI) | 153 |
| ChatGPT-User (OpenAI) | 56 |
| GPTBot (OpenAI training) | 33 |
| ClaudeBot (Anthropic) | 10 |
| PerplexityBot (Perplexity) | 4 |
| Perplexity-User (Perplexity) | 1 |
| YouBot (You.com) | 1 |
| **Total verifiable LLM/AI** | **258** |

*Source: Dataset D, `User Agent Type = LLM / AI`.*

> “Strip out the search engines and the unverifiable bots, and the entire verifiable-LLM interest in `llms.txt`, across a 30-day window on thousands of domains, amounts to 258 requests. Anthropic, Perplexity, and You.com combined: 16.”

### What is the 92% actually made of?

The unverified bulk deserves scrutiny rather than a dismissive label. Using Dataset C’s raw user-agent strings, I found that it is dominated by a long tail of self-described tooling: site-statistics bots, monitoring bots, SEO site-audit crawlers, and a striking number of agents whose own user-agent strings advertise that they exist to audit or check `llms.txt` and AI-readiness.

**Composition of `/llms.txt` traffic by operator family (raw-string classification)**

| Operator family | Hits | Share | Distinct hosts touched |
| --- | --- | --- | --- |
| Other / unverified (tooling, monitors, auditors) | 20,772 | 92.3% | 3,134 |
| Google | 1,227 | 5.5% | 319 |
| OpenAI | 242 | 1.1% | 69 |
| Baidu | 127 | 0.6% | 36 |
| Amazon | 38 | 0.2% | 12 |
| Microsoft / Bing | 35 | 0.2% | 20 |
| Apple | 13 | 0.1% | 13 |
| ByteDance | 12 | 0.1% | 5 |
| Anthropic | 12 | 0.1% | 11 |
| Meta | 8 | under 0.1% | 4 |
| Perplexity | 5 | under 0.1% | 5 |
| You.com | 1 | under 0.1% | 1 |
| Common Crawl | 1 | under 0.1% | 1 |

*Source: Dataset C, full user-agent strings classified by operator. Minor differences from the agent-type table reflect the raw-string method counting AdsBot-Google and similar agents under their parent family.*

Two concentration facts stand out. The top ten user-agent strings alone accounted for 17,569 of 22,493 hits, which is 78.1% of all traffic to the file. And agents whose user-agent string self-labels with terms such as audit, monitor, readiness, llms.txt, crawler, GEO, or research represented 105 distinct agents and 13,508 hits, which is 60.1% of all traffic.
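A self-label check of this kind can be approximated with a keyword pattern over Dataset C-style rows. The sketch is an assumption about method, not the exact rule behind the 60.1% figure: the pattern `SELF_LABEL` and the function `self_labelled_share` are hypothetical, and matching “GEO” only as a whole word is my choice to avoid catching unrelated strings.

```python
import re

# Terms listed in the paragraph above; the matching rules are an assumption.
SELF_LABEL = re.compile(
    r"audit|monitor|readiness|llms\.txt|crawler|research|\bgeo\b",
    re.IGNORECASE,
)


def self_labelled_share(rows: list[tuple[str, int]]) -> float:
    """rows are (user_agent, hits) pairs; returns the hit share of self-labelled agents."""
    total = sum(hits for _, hits in rows)
    labelled = sum(hits for ua, hits in rows if SELF_LABEL.search(ua))
    return labelled / total if total else 0.0
```

On the totals reported here, the arithmetic is 13,508 / 22,493, which rounds to 60.1%.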

> “60% of all traffic to `llms.txt` came from agents that openly describe themselves as auditors, monitors, or readiness-checkers. The file’s biggest use case right now is being inspected to see whether it exists, a self-referential market rather than consumption by models.”

This is the most under-reported reality of `llms.txt` in mid-2026. Raw hit counts on the file are rising, and it is tempting to read that as LLMs adopting it. The composition says otherwise. A large share of the traffic is the GEO ecosystem checking itself: tools verifying that a customer has the file, monitors polling for changes, readiness-scanners selling the idea that the file matters. That activity is real, but it is not evidence that any model is using the file to answer questions.

---

Host-level reality check
------------------------

Beyond raw hits, I cross-referenced which hosts have a real file against which hosts received any `/llms.txt` traffic.

**Hosts: file presence against received traffic**

| Measure | Count |
| --- | --- |
| Hosts returning 200 at `/llms.txt` (www-normalised) | 130 |
| Hosts that received at least one `/llms.txt` request (www-normalised) | 2,649 |
| Hosts that both have a file and received a hit | 80 |
| Hosts that have a file but recorded no hit | 50 |
| Distinct hosts receiving any `/llms.txt` hit (raw) | 3,236 |

*Source: Datasets B and C, joined on www-normalised host.*

Two things stand out. First, the vast majority of `/llms.txt` requests land on hosts that do not even have the file: bots and tools are probing for it speculatively and hitting 404s. Second, of the hosts that do have a real file, more than a third, 50 of 130, saw no recorded hit at all in the window. Presence and attention are only loosely coupled.
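The join behind these counts amounts to set arithmetic on normalised hostnames. The sketch below shows one way to do it; `normalise_host` and `presence_vs_traffic` are hypothetical helpers, and stripping a single leading `www.` is my assumption about what “www-normalised” means here.

```python
def normalise_host(host: str) -> str:
    """Lower-case a hostname and drop one leading 'www.' label."""
    host = host.strip().lower()
    return host[4:] if host.startswith("www.") else host


def presence_vs_traffic(hosts_with_file, hosts_with_hits):
    """Return overlap counts between file presence and recorded hits."""
    have_file = {normalise_host(h) for h in hosts_with_file}
    got_hits = {normalise_host(h) for h in hosts_with_hits}
    return {
        "file_and_hit": len(have_file & got_hits),
        "file_no_hit": len(have_file - got_hits),
        "hit_no_file": len(got_hits - have_file),
    }
```

In the audit's figures, `file_and_hit` and `file_no_hit` correspond to the 80 and 50 hosts in the table, which together make up the 130 hosts returning a 200.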

---


Limitations and an invitation to challenge
------------------------------------------

Here is where this audit stops short.

User agents are self-declared, so the 92.2% Other bucket could hide real AI activity behind generic strings. I have deliberately under-counted LLM activity rather than over-claim it. The hit datasets carry no per-event timestamps, so the 30-day window is the extraction window rather than a field I can re-derive. Fetched does not mean used: nothing in server logs can prove that any provider used `llms.txt` content in a model output, because logs show requests, not downstream use. This is a snapshot, a single 30-day window compared qualitatively to a prior one, not a continuous time series. And referrer behaviour is provider-dependent, so the absence of a referrer trail is strong evidence of no observable routing rather than absolute proof that no provider ever crawls from the file.

If you can replicate, extend, or contradict any of this with your own logs, I want to hear about it. I will investigate and publish a visible correction if anything here proves wrong.

---

Frequently asked questions
--------------------------

**How many websites actually have an llms.txt file?** In this audit, of 4,685 domains probed, 137 returned a working 200 response at `/llms.txt`, which is about 2.9%. If you require the file to be served as plain text the number is 111, and if you require it to contain real content it drops to 20.

**What percentage of websites have llms.txt?** On this AEM-hosted sample, between 0.4% and 2.9% depending on how strictly you define a working file. The headline figure of 2.9% counts any 200 response; the strict figure of 0.4% counts only plain-text files with measurable content.
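The range follows directly from the counts quoted above. A short arithmetic check in Python (the variable and function names are mine, chosen for illustration):

```python
PROBED_DOMAINS = 4_685

# Counts quoted in the FAQ answers above, from loosest to strictest definition.
definitions = {
    "any 200 response": 137,
    "served as plain text": 111,
    "plain text with real content": 20,
}


def share(count: int, total: int) -> float:
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


rates = {label: share(count, PROBED_DOMAINS) for label, count in definitions.items()}
for label, rate in rates.items():
    print(f"{label}: {rate}%")
# any 200 response: 2.9%
# served as plain text: 2.4%
# plain text with real content: 0.4%
```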

**Do large language models actually read llms.txt?** Rarely, on this evidence. Verifiable LLM agents accounted for 258 of 22,494 requests to the file, which is 1.1% of all traffic, over a 30-day window across thousands of domains.

**Does ChatGPT use llms.txt?** OpenAI’s search and user agents, OAI-SearchBot and ChatGPT-User, made 209 requests across roughly 69 hosts. That is real but tiny, and there is no evidence in the logs that the file drives any onward crawling.

**Does Google use llms.txt?** Googlebot is now the single largest named crawler hitting the file, with 1,219 requests. Google has also begun including `llms.txt` in Lighthouse checks. A fetch is not proof of use in ranking or AI features, but it is a clear change from a year ago.

**Does Gemini or Google AI Mode use llms.txt?** I cannot confirm this from the data. What I can confirm is that Googlebot is fetching the file. Whether that content feeds AI Mode or AI Overviews is plausible but unproven on these logs.

**Does Claude use llms.txt?** Anthropic’s ClaudeBot made 10 requests to the file across the entire dataset. That is negligible.

**Does Perplexity use llms.txt?** Perplexity’s agents made 5 requests in total, PerplexityBot and Perplexity-User combined. That is negligible.

**Is llms.txt worth creating in 2026?** My view is yes, but as cheap insurance rather than a growth lever. It costs little to create, Google is now hitting it, and the upside is asymmetric if providers begin to consume it. Do not expect it to move LLM citations today.

**Will llms.txt improve my rankings?** There is no evidence in this data that it does. Crawlers enter via the homepage and follow internal links. Classical crawlability and internal linking remain far more important.

**Will llms.txt get my brand cited in AI answers?** Probably not at present. The models that drive consumer AI answers are barely touching the file, and there is no observable crawl activity downstream of it.

**Do LLMs crawl the .md files listed in llms.txt?** There is no evidence that they do so directly from the file. I found zero requests whose referrer was an `llms.txt` URL, so either crawlers do not crawl immediately after reading it, or they do not preserve the referrer.

**How do LLMs and AI crawlers find pages to crawl?** Predominantly via the homepage. When requests carried a referrer it was almost always the domain homepage, indicating crawlers enter there and follow internal links, exactly as classical crawlers do.

**Should llms.txt be plain text or HTML?** Plain text. In this audit, 23 of the 175 responses with a 200 status were served as HTML, and those are almost certainly catch-all pages rather than real `llms.txt` files. A real file should return `text/plain`.
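A minimal check for this, assuming you already have the status code and the `Content-Type` header from your own crawl. The function name is hypothetical, and the rule (status 200 plus a `text/plain` media type, ignoring parameters such as `charset`) is one reasonable reading of "real file", not a formal standard.

```python
from typing import Optional


def looks_like_real_llms_txt(status: int, content_type: Optional[str]) -> bool:
    """True when a response is a 200 served with the text/plain media type."""
    if status != 200 or not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/plain"


print(looks_like_real_llms_txt(200, "text/plain; charset=utf-8"))  # True
print(looks_like_real_llms_txt(200, "text/html; charset=UTF-8"))   # False
print(looks_like_real_llms_txt(404, "text/plain"))                 # False
```

A 200 served as `text/html` fails the check, which is how a catch-all page is separated from a real file.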

**Why do so many llms.txt requests return a 404?** Because most sites do not have the file. In this crawl, 76.9% of probed URLs returned a 404. Many bots and tools probe for `/llms.txt` speculatively and simply hit a missing file.

**What bots hit llms.txt the most?** The largest single sources are unverified tooling and monitoring bots, followed by Googlebot as the largest named crawler. The top ten user-agent strings alone made up 78.1% of all traffic to the file.

**Are most llms.txt hits really from AI models?** No. 92.2% of traffic came from agents that are neither mainstream search engines nor verifiable LLMs, largely SEO tools, monitors, and AI-readiness auditors. Only 1.1% came from verifiable LLMs.

**What is an llms.txt auditor bot?** It is a crawler, often from a GEO or SEO tool, whose purpose is to check whether a site has an `llms.txt` file and report on it. In this dataset, agents that self-label as auditors, monitors, or readiness-checkers accounted for 60.1% of all traffic to the file.

**Does having an llms.txt file guarantee bots will read it?** No. Of the 130 hosts with a real file, 50 recorded no hit at all in the window. Presence and attention are only loosely coupled.

**How big should an llms.txt file be?** The proposal envisions a curated index, but in practice the files that had content in this audit were very short, between 2 and 69 words, suggesting most are minimal stubs. Aim for a genuinely useful, curated list of your most important pages rather than a token file.

**Is llms.txt the same as robots.txt or sitemap.xml?** It is similar in concept, a small conventional file at a predictable path, but different in standing. `robots.txt` and `sitemap.xml` are honoured by documented crawlers, whereas `llms.txt` only delivers value if model providers choose to read it, and on this evidence most do not yet.

**Did anything change with llms.txt between 2025 and 2026?** The biggest change is Google. Googlebot went from a non-presence to the largest named crawler at the file, and Google added it to Lighthouse. Everything else stayed roughly the same: verifiable LLM usage remained negligible, and no referrer trail from the file appeared.

**Is this the largest llms.txt study?** To my knowledge, yes, by number of distinct domains and by volume of bot traffic examined. The data comes from real customer domains hosted on Adobe Experience Manager, including some of the world’s largest websites.

**Where does the data in this analysis come from?** From server-log and crawl data across customer domains hosted on Adobe Experience Manager, analysed with a server log file analysis tool over a 30-day window, with a companion crawl of `/llms.txt` paths dated 29 May 2026.

**How was the data anonymised?** No customer, brand, or third-party vendor names appear anywhere in this article. Every identifier has been removed and replaced with a neutral category label, and only aggregate summary figures are published.

**Can I reproduce these findings myself?** Yes, in principle. Crawl `/llms.txt` across your domain set and record status, content type, and word count; query 30 days of server logs for requests to `/llms.txt` grouped by host and user-agent string; classify user agents conservatively; and separately query the referrer field for any request whose referrer is `/llms.txt`.
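The log-side steps can be sketched as follows. Treat this as a starting point under assumptions rather than the audit's actual pipeline: the record layout (`host`, `path`, `user_agent`, `referrer`), the function names, and the token lists are all hypothetical. The token lists contain only agent names mentioned in this article and are not the full classification used for the figures above. Matching is on the self-declared user-agent string, so it does not verify that a request really came from the named provider.

```python
from collections import Counter
from urllib.parse import urlsplit

# Illustrative token lists only (assumption): agent names mentioned in the
# article, not the complete classification behind the published figures.
SEARCH_TOKENS = ("googlebot",)
LLM_TOKENS = ("oai-searchbot", "chatgpt-user", "claudebot",
              "perplexitybot", "perplexity-user")


def classify_user_agent(user_agent: str) -> str:
    """Conservative classification: anything not explicitly matched is 'other'."""
    ua = user_agent.lower()
    if any(token in ua for token in LLM_TOKENS):
        return "llm"
    if any(token in ua for token in SEARCH_TOKENS):
        return "search"
    return "other"


def summarise(records):
    """Count /llms.txt hits by class and by (host, agent), plus referrer matches."""
    hits_by_class = Counter()
    hits_by_host_and_agent = Counter()
    referred_from_llms_txt = 0
    for rec in records:
        if urlsplit(rec["path"]).path == "/llms.txt":
            hits_by_class[classify_user_agent(rec["user_agent"])] += 1
            hits_by_host_and_agent[(rec["host"], rec["user_agent"])] += 1
        # Separate question: did any request arrive *from* an llms.txt URL?
        if urlsplit(rec.get("referrer") or "").path == "/llms.txt":
            referred_from_llms_txt += 1
    return hits_by_class, hits_by_host_and_agent, referred_from_llms_txt


# Made-up log records, for illustration only.
sample_records = [
    {"host": "a.test", "path": "/llms.txt", "user_agent": "Googlebot/2.1", "referrer": ""},
    {"host": "a.test", "path": "/llms.txt", "user_agent": "ExampleAuditBot/1.0", "referrer": ""},
    {"host": "b.test", "path": "/llms.txt", "user_agent": "OAI-SearchBot/1.0", "referrer": ""},
    {"host": "b.test", "path": "/products", "user_agent": "Googlebot/2.1",
     "referrer": "https://b.test/"},
]
by_class, by_host_agent, referred = summarise(sample_records)
print(dict(by_class), referred)
```

Defaulting every unmatched agent to `other` under-counts LLM activity rather than over-claiming it, the same trade-off described in the limitations section.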

**What is the single most important takeaway?** That raw hit counts on `llms.txt` are misleading. Most of the traffic is the GEO ecosystem checking itself, not models consuming the file. Create the file because it is cheap and Google is now looking at it, but keep your real investment in homepage strength and internal linking.

> **A note on the data and on disclosure.** The findings in this article come from server-log and crawl data across customer domains hosted on Adobe Experience Manager (AEM). I analysed this data directly using a server log file analysis tool. I work in this field, and all views expressed here are my own and do not represent those of my employer. No customer, brand, or third-party vendor names appear anywhere in this article. Every identifier has been removed and replaced with a neutral category label.

---

*Written by Flavio Longato and published June 2026 on [longato.ch](https://www.longato.ch/). All views my own and not those of my employer. Companion analysis: [llms.txt, my recommendation, August 2025](https://www.longato.ch/llms-recommendation-2025-august/). Spotted an error? Get in touch via [longato.ch](https://www.longato.ch/) and I will publish a visible correction.*