Tag: llms.txt

  • Do LLMs Crawl Markdown (.md) Files? Data Analysis

    As large language models (LLMs) such as ChatGPT, Claude and other generative AI systems reshape how people discover and consume information, a recurring question for digital marketers and content strategists is whether these models work directly with Markdown files (.md).

    Markdown is widely used by developers and documentation teams as a lightweight, human‑readable authoring format. But does it play a role in how LLMs crawl and consume web content?

    Recently, I carried out a targeted log‑file analysis to better understand how (or if) .md files are surfaced to LLM crawlers.

    Summary of Findings:

    • LLMs ignore Markdown files — log analysis showed no evidence that GPTBot, ClaudeBot, or similar crawlers request or prioritise .md content, even when listed in llms.txt.
    • HTML remains the standard — structured, cached HTML is consistently the most reliable and supported format for both search engines and LLMs.
    • No ROI for .md delivery — maintaining Markdown alongside HTML adds overhead without proven gains in visibility, brand mentions, or indexing.
    • Use Markdown internally only — it remains valuable for documentation and workflows, but should not be treated as a delivery layer for AI optimisation.
    • Optimise for what works today — focus efforts on clean, semantic HTML, caching strategies, and accessibility rather than speculative standards like llms.txt.

    Why are we discussing markdown (.md) in the first place?

    Because the llms.txt documentation itself recommends providing Markdown versions of pages:

    …We furthermore propose that pages on websites that have information that might be useful for LLMs to read provide a clean markdown version of those pages at the same URL as the original page, but with .md appended. (URLs without file names should append index.html.md instead.)…
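
    As a rough illustration of the URL rule quoted above, the mapping from a page URL to its Markdown twin could look like the sketch below. The function name markdown_url is hypothetical, and treating "URLs without file names" as "paths that end in a slash" is my own reading of the proposal, not something it spells out.

```python
from urllib.parse import urlsplit, urlunsplit


def markdown_url(page_url: str) -> str:
    """Return the URL where the llms.txt proposal expects the Markdown version."""
    parts = urlsplit(page_url)
    path = parts.path or "/"
    if path.endswith("/"):
        # Assumption: a path ending in "/" has no file name,
        # so the proposal's "index.html.md" rule applies.
        path += "index.html.md"
    else:
        # Otherwise keep the original path and append ".md".
        path += ".md"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(markdown_url("https://example.com/docs/setup.html"))
print(markdown_url("https://example.com/docs/"))
```

    Note that every page therefore gets a second URL, which is exactly the "additional asset" problem discussed later in this article.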

    Do LLMs Use .md Files?

    • LLMs are not prioritizing .md files: log analysis showed no requests from GPTBot, ClaudeBot, or other AI crawlers, even when .md files were listed in llms.txt.
    • Authoritative domains are no exception: two sites with DA > 90 and millions of daily visits had zero LLM traffic over a 24-hour window.
    • HTML remains the standard: well-structured, cached HTML is still the most reliable format for both search engines and LLMs.
    • No clear marketing ROI for .md: maintaining .md alongside HTML adds overhead, with no proven visibility or brand-mention benefits.

    Recommendation: Focus optimization on HTML outputs; treat .md as an internal content format, not as a delivery layer for LLMs.


    Scope of the Analysis

    To keep the study practical but still representative, the parameters were defined as follows:

    • Duration: 24 hours of raw CDN log data
    • Sample: Two high‑traffic websites with millions of daily visits
    • Domain authority: Both sites carried DA > 90
    • Configuration: Each site exposed llms.txt files listing selected .md resources for potential crawling
    • New content: At least one fresh page on each site was published within the analysis window

    One limitation is the relatively short duration. A full week’s logs would have meant processing well over 150 GB of data, which was not technically feasible for this initial phase.


    Analysis

    • No identifiable LLM crawler traffic: Across both sites, no access requests were logged from recognised LLM bots such as GPTBot, ClaudeBot, or PerplexityBot.
    • No .md file retrieval: Although .md files had been explicitly referenced within the llms.txt directive, there was no evidence of these being fetched by any bot claiming to represent an LLM provider.
    • What about other bots? I found assorted scraping bots hitting the files, but no requests from any major traditional search engine.
    • What are you trying to accomplish with an .md file? The biggest question is why you would use .md files in the first place, and what your end goal would be.
    • Increased context for LLMs: Some GEO specialists suggest adding extra textual information to the .md files that would otherwise not be relevant to the end user. In my opinion this is not wise, as it can easily be considered a form of manipulation.
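
    For readers who want to repeat this kind of check on their own logs, here is a minimal sketch. It assumes the CDN exports one JSON object per line with "path" and "user_agent" fields; those field names, the function name, and the sample lines are all made up for illustration and are not the data from this study.

```python
import json
from collections import Counter

# User-agent substrings for the LLM crawlers named in this article.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def summarise_md_requests(log_lines):
    """Count .md requests per AI crawler token; everything else is 'other'."""
    counts = Counter()
    for line in log_lines:
        record = json.loads(line)
        # Ignore the query string when checking the file extension.
        path = record["path"].split("?", 1)[0]
        if not path.endswith(".md"):
            continue
        agent = record.get("user_agent", "").lower()
        token = next((t for t in AI_CRAWLER_TOKENS if t.lower() in agent), "other")
        counts[token] += 1
    return counts


# Made-up sample lines, only to show the expected input shape.
sample = [
    '{"path": "/guide/index.html.md", "user_agent": "Mozilla/5.0 (compatible; SomeScraper/1.0)"}',
    '{"path": "/guide/", "user_agent": "Mozilla/5.0 (compatible; GPTBot/1.0)"}',
    '{"path": "/pricing.html.md?ref=x", "user_agent": "curl/8.0"}',
]
print(summarise_md_requests(sample))
```

    Matching on user-agent substrings only tells you what a client claims to be, so a stricter audit would also verify the source IP ranges published by each provider.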

    Why would you even consider using an .md file?

    The question really comes down to your goal: why do you want to use .md files in the first place, and what problem are you trying to fix?

    1. Improve crawling rates:

    Imagine you have a website with millions of new UGC (user-generated content) assets being published each day. Search engines often struggle to keep up with indexing at that volume. Using .md files could look like a solution, but I don’t believe it would be a wise way to proceed. Why?

    • You are generating an additional asset for LLMs to crawl, on top of the original URL.
    • .md files do not have an HTML head, where you could declare a relationship to the main document, such as rel="canonical".
      • This would therefore need to be done in the HTTP response headers (a Link header with rel="canonical").
    • There are other formats, such as plain .html, that could be served just to the bot, i.e.

    Process:

    1. A user agent that includes “bot” makes a request for a given page.
    2. The system delivers a pre-rendered page:
      What is a pre-rendered page? It is a page whose JS has already been fully rendered in a browser, with that rendered version cached. Therefore ZERO JS is needed to display the content on the page. The main limitation is that certain JS interactions will not work, but the content is visible.
    3. This file has exactly the same URL, so it is not an additional asset for the bot to crawl.
    4. Cache the .html version of the page on a CDN to improve speed.
    Steps | User | Bot
    user-agent detection | Gets the standard version of the page with JS | Gets a clean, optimized, cached version with no JS but all client-side JS already rendered

    The cached file would then be on a fast CDN and syndicated.
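
    The process above can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the production setup: PrerenderCache is an in-memory stand-in for the CDN cache, and fake_render and fake_standard are hypothetical placeholders for a headless-browser renderer and the normal JS application shell.

```python
def choose_variant(user_agent: str) -> str:
    """Step 1: user agents that include "bot" get the pre-rendered version."""
    return "prerendered" if "bot" in user_agent.lower() else "standard"


class PrerenderCache:
    """In-memory stand-in for the CDN cache of pre-rendered HTML."""

    def __init__(self, render):
        self._render = render  # callable: url -> fully rendered HTML string
        self._pages = {}

    def get(self, url: str) -> str:
        # Render once, then serve the cached copy on later requests.
        if url not in self._pages:
            self._pages[url] = self._render(url)
        return self._pages[url]


def handle_request(url, user_agent, cache, standard_page):
    """Same URL for everyone; only the delivered variant differs."""
    if choose_variant(user_agent) == "prerendered":
        return cache.get(url)
    return standard_page(url)


render_calls = []


def fake_render(url):
    # Placeholder for a headless-browser render of the page.
    render_calls.append(url)
    return f"<html><body><h1>Content of {url}</h1></body></html>"


def fake_standard(url):
    # Placeholder for the normal client-side rendered shell.
    return "<html><body><div id='app'></div><script src='/app.js'></script></body></html>"


cache = PrerenderCache(fake_render)
bot_html = handle_request("/product/1", "Mozilla/5.0 (compatible; Googlebot/2.1)", cache, fake_standard)
handle_request("/product/1", "Mozilla/5.0 (compatible; bingbot/2.0)", cache, fake_standard)
user_html = handle_request("/product/1", "Mozilla/5.0 (Windows NT 10.0) Chrome/126.0", cache, fake_standard)
print(len(render_calls), "<script" in bot_html, "<script" in user_html)
```

    Matching on the substring "bot" is deliberately crude, as in the process description; a real deployment would more likely use an explicit allow-list of crawler user agents.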

    I’ve done this exact implementation for this type of website, and we were able to increase the indexing rate from 30% to 93% in an 18-month window. Why? Because it is estimated that, for every one JS-rendered page, a crawler could crawl 100 non-JS pages with the same computing resources. Therefore, it is logical to have a version like this.

    But is this not cloaking? No, we deliver EXACTLY the same textual version that the browser would see once JS is rendered.

    Definition: Cloaking is when you intentionally try to manipulate search engines by hiding content from, or adding content to, a version that is served only to them.
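
    One way to keep yourself honest here is to compare the visible text of the pre-rendered version with the text of the browser-rendered DOM. The sketch below is a rough parity check using only the Python standard library; the class and function names and the sample HTML strings are hypothetical, and the whitespace handling is an approximation rather than a true rendering comparison.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Normalise whitespace so layout differences do not count as content differences.
    return " ".join(" ".join(parser.chunks).split())


def same_text(bot_html: str, browser_html: str) -> bool:
    """True when both versions expose the same visible text."""
    return visible_text(bot_html) == visible_text(browser_html)


prerendered = "<html><body><h1>Title</h1><p>Hello  world</p></body></html>"
browser_dom = "<html><body><h1>Title</h1>\n<p>Hello world</p><script>var x = 1;</script></body></html>"
cloaked = "<html><body><h1>Title</h1><p>Hello world</p><p>extra keywords</p></body></html>"

print(same_text(prerendered, browser_dom))
print(same_text(cloaked, browser_dom))
```

    If the check fails for a page, the bot version contains text the user never sees (or the reverse), which is the situation the definition above describes.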

    2. Heavy JS rendered content

    Imagine you have a website that hosts PDF files, and for the end user you display each PDF in a JS-heavy viewer. Search engines are unable to see the content of that PDF, and you can’t expose the PDF directly because users would steal the files. How do you expose it to search engines / LLMs?

    In this context using an .md file would make logical sense, but my main concern is how LLMs would understand the relationship between the two versions, and which one they would reference in an answer. Would they ever reference a .md file? I think not.

    So again, the cleaned-up .html file would win here.

    • Only one URL for the same asset, which helps crawl budget and indexation

    Questions I Am Frequently Asked

    My site is rendered entirely client‑side with JavaScript. Should I consider publishing content in Markdown to make it accessible?

    No. A better approach is to ensure you serve a pre‑rendered HTML version and cache it properly. This way, the content remains crawlable to both LLM bots and traditional search crawlers.

    Are there cases where Markdown should be used as a direct LLM optimisation layer?

    Not really. Markdown is useful internally—for documentation or content maintenance—but optimisation for LLMs should focus on clean, structured HTML output, not Markdown.

    Do .md files improve visibility in large language models (LLMs) compared to standard HTML pages?

    No. Current evidence shows LLMs do not prioritize .md files over HTML. Well-structured HTML remains the most reliable format for visibility.

    If my site is client-side rendered (JS heavy), would exposing .md files help LLMs or search engines access my content more easily?

    No. A better solution is to use an HTML renderer and cache the results. This ensures both search engines and LLMs can properly access your content.

    Are there proven cases where .md files increased brand mentions or visibility in generative AI search results?

    Not yet. While some proposals suggest benefits, no third-party studies or log data confirm that .md exposure improves brand visibility in LLM outputs.

    How much additional maintenance effort would it take to manage .md versions of my pages alongside existing HTML—and is the ROI justified?

    Maintaining parallel .md and HTML versions increases workload and risk of outdated content. At this stage, the ROI is unproven.

    Should I list .md files in llms.txt to signal them to AI crawlers, or is it better to optimize the HTML output we already have?

    Optimize HTML. Listing .md files in llms.txt is experimental and currently unused by major LLM bots. HTML optimization offers a clearer path to results.


    Third‑Party Context

    Independent sources in both search and AI research echo these observations:

    • Yoast SEO notes that although llms.txt was proposed as a kind of robots.txt for AI crawlers, no major LLM provider currently supports it. GPTBot, Claude, and Google’s AI products do not read Markdown or llms.txt as part of their active crawling routines (Yoast, 2024).
    • Daydream’s analysis warns that managing .md‑based feeds can introduce risks of data divergence—where Markdown goes out of sync with published HTML. This could actually harm brand accuracy if models ingest outdated content (Daydream Library, 2024).
    • Academic work (HtmlRAG study, arXiv, 2024) tested retrieval‑augmented generation (RAG) pipelines and found that HTML retained semantic structure—headings, metadata, table layouts—that plain text or Markdown often strips away. These structural signals improved contextual knowledge retention and retrieval performance, supporting the argument that HTML delivers more value to LLM ingestion workflows.

    Collectively, these insights align with the practical results of the log‑file study.


    Recommendations based on my research, experience and observation

    • Do not publish directly in Markdown for LLM visibility. Keep Markdown for internal versioning and workflows.
    • Focus on HTML as the public output layer. Ensure semantic tags are used and the pages are properly cached.
    • Do not rely on llms.txt today. It is an experimental idea with very limited adoption.
    • Prioritise accessibility and clarity of HTML outputs over trying to second‑guess speculative AI standards.

    Conclusion

    This analysis, while narrow in scope, makes one point clear: LLMs are not actively crawling or requesting Markdown files, even when explicitly listed in llms.txt. Instead, industry evidence shows that AI ingestion pipelines focus on HTML‑rendered content, which provides richer context and stronger retrieval signals.

    For now, organisations should maintain emphasis on accessible, semantically‑structured HTML, coupled with robust caching strategies. Markdown remains valuable as an internal content authoring format, but it is not a shortcut to visibility within LLM ecosystems.


  • LLMs.txt: Why AI Crawlers Ignore It (2025 Audit)

    Updated: June 2026 · A new article on LLMs.txt has been published that extends my earlier write-up on llms.txt.

    This analysis aims to review the usage of LLMs.txt files in the context of LLMs.

    How the analysis was performed: I audited 30 days of raw CDN logs for 1,000 Adobe Experience Manager domains to see who actually requests the llms.txt file. The results were, frankly, brutal.

    Findings of the LLMs.txt audit:

    • LLM-specific bots stayed away. No GPTBot, ClaudeBot, PerplexityBot, or similar were seen at all.
    • Google still probes everything. Its desktop crawler accounted for 95% of all hits.
    • Bing is curious but inconsistent. Only seven requests—concentrated on one domain (out of one-thousand)
    • OpenAI’s search bot was minimal. Ten calls from OpenAIBotSearch. GPTBot itself was absent.
    • SEO tools inflated the logs. Tools like Semrush Mobile and SiteAudit caused many hits, unrelated to LLMs.
    Rank | User-agent | Share of all llms.txt hits
    1 | GoogleBotDesktop | 94.9%
    2 | OpenAIBotSearch | 1.1%
    3 | ScanPire | 0.8%
    4 | BingBot | 0.8%
    (rest) | Eight other bots | <1% each
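
    A share table like the one above can be produced with a simple tally. The sketch below assumes the log has already been reduced to (path, user_agent_name) pairs; that input shape, the function name, and the sample bot names are made up for illustration and are not the audit data.

```python
from collections import Counter


def llms_txt_shares(records):
    """Return (user_agent, share in percent) pairs for requests to /llms.txt."""
    hits = Counter(agent for path, agent in records if path == "/llms.txt")
    total = sum(hits.values())
    if total == 0:
        return []
    # most_common() sorts by hit count, highest first.
    return [(agent, round(100 * count / total, 1)) for agent, count in hits.most_common()]


# Made-up records, only to show the calculation.
records = (
    [("/llms.txt", "ExampleBotA")] * 3
    + [("/llms.txt", "ExampleBotB")]
    + [("/index.html", "ExampleBotA")]
)
print(llms_txt_shares(records))
```

    Grouping raw user-agent strings into names such as "GoogleBotDesktop" is a separate normalisation step that depends on how your CDN or log tool labels bots.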

    Why Aren’t LLMs going to the llms.txt file?

    1. The spec is still unofficial. No LLM lab has committed to honoring it yet.
    2. Most training uses pre-built datasets, such as Common Crawl or books, not live fetches.
    3. Robots.txt already covers them. Major labs honor standard tokens like GPTBot and ClaudeBot.
    4. It’s not cost-effective. Probing llms.txt on every domain wastes crawl budget.

    What are my recommendations for site owners in relation to llms.txt?

    This really depends on how difficult the llms.txt file is to implement. If you feel it would be relatively easy to create the file, then go for it. If it requires a large amount of resources, I’d recommend holding back until we clearly see benefits.

    For example, this domain uses the llms.txt file at https://www.longato.ch/llms.txt because it was easy to implement.

    • Use robots.txt instead. It’s the only widely respected barrier today.
    • Watch your logs. Use tools like Grafana or BigQuery to detect AI crawlers directly.
      • Remember: if you use a CDN, you need to look at the CDN’s own logs.
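
    To sanity-check what a robots.txt actually tells each crawler, Python’s standard urllib.robotparser module can evaluate the rules locally. The rules and URLs below are an invented example, not a recommended policy.

```python
from urllib.robotparser import RobotFileParser

# Invented example rules: block GPTBot entirely, keep ClaudeBot out of /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /private/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("GPTBot", "https://example.com/article"),
    ("ClaudeBot", "https://example.com/article"),
    ("ClaudeBot", "https://example.com/private/report"),
    ("Googlebot", "https://example.com/article"),
]
for agent, url in checks:
    print(agent, url, parser.can_fetch(agent, url))
```

    This only shows what the file asks for. Whether a given crawler honours it is something only your logs can confirm.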

    What Might Change Soon for LLMs.txt

    As of August 2025, no major LLM provider has made an announcement about llms.txt.

    Provider | Current stance on llms.txt | Signal to watch
    OpenAI | No support announced | GPTBot documentation updates
    Google / Gemini | Monitors but uses Google-Extended | Revisions to Google-Extended policy
    Microsoft / Copilot | Silent | Bing Blog crawler updates
    Meta | No mention | Meta crawler policy changes
    Anthropic | No mention | ClaudeBot UA policy

    Is there any external validation of my findings?

    Date | Key development | Who said / did it | Take-away
    17 Jun 2025 | “FWIW no AI system currently uses llms.txt.” | John Mueller, Google, on Bluesky | Google confirms zero support and no immediate plans. (Search Engine Roundtable)
    19 Jun 2025 | ScaleMath publishes an adoption-tracker deep-dive. | Independent analysts | Finds early enthusiasm among dev-doc sites but no proof of LLM consumption. (ScaleMath)
    02 Jul 2025 | PPC Land headline: “llms.txt adoption stalls as major AI platforms ignore proposed standard”. | Industry press | OpenAI, Google, Anthropic still not honoring the file. (PPC Land)
    22 Jul 2025 | Mueller advises adding X-Robots-Tag: noindex to llms.txt to avoid clutter in Google results. | Google | Tactical hygiene tip; doesn’t affect crawling behaviour. (Stan Ventures)
    24 Jul 2025 | Logs show OpenAI’s crawler fetching llms.txt every ~15 min on some sites. Google’s Gary Illyes repeats “we won’t support it.” | Search Engine Roundtable | Anecdotal evidence OpenAI is testing discovery, not an official endorsement. (Search Engine Roundtable)
    Late Jul 2025 | Server-log studies detect sporadic hits from other AI bots but no sustained utilisation. | ArcherEdu analytics | Suggests experiments, not production use. (archeredu.com)
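
    The X-Robots-Tag tip from the 22 Jul 2025 entry is a response-header change. In practice it would usually be set in the web server or CDN configuration; the tiny WSGI app below is only a sketch to show which header goes on which response, and the file body is a placeholder.

```python
# Placeholder file body, not a real site's llms.txt.
LLMS_TXT = b"# Example Site\n\n> Short summary of the site.\n"


def app(environ, start_response):
    """Tiny WSGI app that serves /llms.txt with a noindex header."""
    if environ.get("PATH_INFO") == "/llms.txt":
        headers = [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(LLMS_TXT))),
            # Asks search engines not to list the file itself in results.
            ("X-Robots-Tag", "noindex"),
        ]
        start_response("200 OK", headers)
        return [LLMS_TXT]
    start_response("404 Not Found", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Not found\n"]
```

    To try it locally, the app can be passed to wsgiref.simple_server.make_server from the standard library. As the table notes, the header affects indexing of the file, not whether bots crawl it.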

    Where to Go from Here

    • Automate deployment of llms.txt across all properties using your CMS or server configuration.
    • Audit quarterly. LLM behavior evolves fast—track what’s changed.
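
    If you do automate deployment, generating the file from data your CMS already holds keeps the effort low. The sketch below follows my understanding of the layout described at llmstxt.org (a title line, a short quoted summary, then sections of links); the function name and the sample site data are hypothetical.

```python
def build_llms_txt(site_name, summary, sections):
    """Build llms.txt text from a dict of {section heading: [(title, url, note), ...]}."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            # The note after the link is optional.
            suffix = f": {note}" if note else ""
            lines.append(f"- [{title}]({url}){suffix}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site data, e.g. exported from a CMS.
content = build_llms_txt(
    "Example Site",
    "Product documentation for Example.",
    {
        "Docs": [
            ("Setup", "https://example.com/docs/setup.html.md", "Install steps"),
            ("FAQ", "https://example.com/docs/faq.html.md", ""),
        ]
    },
)
print(content)
```

    Writing the result to the web root on each publish is then a one-line step in the deployment job.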

    Bottom line: llms.txt is a good idea in theory, but today’s bots don’t read it. Until adoption improves, your best defense remains robots.txt and a clear content policy backed by logs.

    FAQ: Understanding llms.txt

    What is llms.txt and who proposed it?

    llms.txt is a proposed text file format, put forward by Jeremy Howard of Answer.AI in September 2024, that website owners can place at the root of their domain, e.g. https://example.com/llms.txt. The goal is to help LLMs discover and index a site’s content.

    Large language models increasingly rely on website information, but face a critical limitation: context windows are too small to handle most websites in their entirety. Converting complex HTML pages with navigation, ads, and JavaScript into LLM-friendly plain text is both difficult and imprecise.
    Source: https://llmstxt.org/

    In addition to this, .md files are used to provide raw-text versions of pages, which is meant to let LLM bots crawl and read the content faster. This is said to be especially important for JS-heavy / client-side rendered sites.

    Why are they wrong?

    While well-meaning, this recommendation overestimates its real-world effect. As shown in our log analysis, none of the major LLM crawlers (OpenAI’s GPTBot, Anthropic’s ClaudeBot, PerplexityBot, Meta’s crawler, etc.) currently request the llms.txt file. Only traditional SEO crawlers like GoogleBot or BingBot made any contact—and not for training purposes.

    So while it may feel proactive, adding llms.txt today does almost nothing.
