As large language models (LLMs) such as ChatGPT, Claude and other generative AI systems reshape how people discover and consume information, a recurring question for digital marketers and content strategists is whether these models work directly with Markdown files (.md).
Markdown is widely used by developers and documentation teams as a lightweight, human‑readable authoring format. But does it play a role in how LLMs crawl and consume web content?
Recently, I carried out a targeted log‑file analysis to better understand how (or if) .md files are surfaced to LLM crawlers.
Summary of Findings:
- LLMs ignore Markdown files: log analysis showed no evidence that GPTBot, ClaudeBot, or similar crawlers request or prioritise .md content, even when listed in llms.txt.
- HTML remains the standard: structured, cached HTML is consistently the most reliable and supported format for both search engines and LLMs.
- No ROI for .md delivery: maintaining Markdown alongside HTML adds overhead without proven gains in visibility, brand mentions, or indexing.
- Use Markdown internally only: it remains valuable for documentation and workflows, but should not be treated as a delivery layer for AI optimisation.
- Optimise for what works today: focus efforts on clean, semantic HTML, caching strategies, and accessibility rather than speculative standards like llms.txt.
Why are we discussing markdown (.md) in the first place?
Because the llms.txt proposal itself recommends serving Markdown versions of pages:
…We furthermore propose that pages on websites that have information that might be useful for LLMs to read provide a clean markdown version of those pages at the same URL as the original page, but with .md appended. (URLs without file names should append index.html.md instead.)…
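The URL rule in that quote can be sketched in a few lines of Python. This is only an illustration: the function name `markdown_url` is hypothetical, and treating a path that ends in a slash as a "URL without a file name" is my own reading of the proposal.

```python
from urllib.parse import urlsplit, urlunsplit


def markdown_url(url: str) -> str:
    """Return the Markdown companion URL for a page, per the quoted rule."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        # Assumption: a trailing slash means the URL has no file name,
        # so index.html.md is appended.
        path += "index.html.md"
    else:
        # Otherwise .md is appended to the existing path.
        path += ".md"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(markdown_url("https://example.com/docs/page.html"))
print(markdown_url("https://example.com/docs/"))
```

Note that every page gains a second URL under this scheme, which is the duplication concern raised later in this article.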
– LLMs are not prioritizing .md files — log analysis showed no requests from GPTBot, ClaudeBot, or other AI crawlers, even when .md files were listed in llms.txt.
– Authoritative domains are no exception — two sites with DA > 90 and millions of daily visits had zero LLM traffic over a 24-hour window.
– HTML remains the standard — well-structured, cached HTML is still the most reliable format for both search engines and LLMs.
– No clear marketing ROI for .md — maintaining .md alongside HTML adds overhead, with no proven visibility or brand-mention benefits.
Recommendation: Focus optimization on HTML outputs; treat .md as an internal content format, not as a delivery layer for LLMs.
Scope of the Analysis
To keep the study practical but still representative, the parameters were defined as follows:
- Duration: 24 hours of raw CDN log data
- Sample: Two high‑traffic websites with millions of daily visits
- Domain authority: Both sites carried DA > 90
- Configuration: Each site exposed llms.txt files listing selected .md resources for potential crawling
- New content: At least one fresh page on each site was published within the analysis window
One limitation is the relatively short duration. A full week’s logs would have meant processing well over 150 GB of data, which was not technically feasible for this initial phase.
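For readers who want to run a similar check on their own logs, here is a minimal sketch of the idea. It is not the tooling used for this study. It assumes the common "combined" access-log layout (real CDN log formats differ), the names `LLM_BOT_TOKENS`, `LOG_LINE` and `summarise` are hypothetical, and the two sample lines are synthetic, with simplified user-agent strings.

```python
import re
from collections import Counter

# Substrings to look for in the User-Agent header (bot names from this article).
LLM_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")

# Assumption: "combined" access-log layout; adjust the pattern for your CDN.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarise(lines):
    """Count requests per LLM bot, and how many of those target .md paths."""
    bot_hits = Counter()
    md_hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        agent = match["agent"].lower()
        path = match["path"].split("?", 1)[0].lower()  # ignore query strings
        for token in LLM_BOT_TOKENS:
            if token.lower() in agent:
                bot_hits[token] += 1
                if path.endswith(".md"):
                    md_hits[token] += 1
    return bot_hits, md_hits


# Synthetic sample lines, for illustration only (not data from the study).
sample = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /docs/page.html.md HTTP/1.1" 200 512 "-" "ExampleScraper/1.0"',
    '203.0.113.9 - - [01/Jan/2025:00:00:02 +0000] "GET /docs/page.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]
bot_hits, md_hits = summarise(sample)
print(dict(bot_hits), dict(md_hits))
```

Because `summarise` accepts any iterable of lines, a large log file can be streamed through it line by line rather than loaded into memory.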
Analysis
- No identifiable LLM crawler traffic: Across both sites, no access requests were logged from recognised LLM bots such as GPTBot, ClaudeBot, or PerplexityBot.
- No .md file retrieval: Although .md files had been explicitly referenced within the llms.txt file, there was no evidence of these being fetched by any bot claiming to represent an LLM provider.
- What about other bots? I found random scraping bots hitting the files, but no requests from major traditional search engine crawlers.
- What are you trying to accomplish with an .md file? The biggest question really is why you would use .md files in the first place, and what your end goal would be.
- Increased context for LLMs: Some GEO specialists suggest adding extra textual information to .md files that would not otherwise be relevant to the end user. In my opinion this is not wise, as it can easily be considered a form of manipulation.
Why would you even consider using an .md file?
The question really comes down to: What is your goal? Why do you want to use .md files in the first place, and what problem are you trying to solve?
1. Improve crawling rates:
Imagine you have a website with millions of new UGC (user-generated content) assets being published each day. Search engines often struggle to keep up with indexing at that volume. Using .md files could look like a solution, but I don’t believe it would be a wise way to proceed. Why?
- You are generating an additional asset for LLMs to crawl, on top of the original URL
- .md files do not have an HTML head, where you could declare a relationship to the main document, such as rel=canonical.
- The canonical relationship would therefore need to be declared in the HTTP response headers, for example `Link: <https://example.com/page>; rel="canonical"`
- There are other formats, such as plain .html, that could be served just to the bot, i.e.
Process:
- A user agent that includes “bot” makes a request for a given page
- The system delivers a pre-rendered page:
What is a pre-rendered page? It is a page whose JS has already been fully rendered in a browser, with that rendered version cached. Therefore ZERO JS is needed to display the content on the page. The main limitation is that certain JS interactions will not work, but the content is visible.
- This file has exactly the same URL, so it is not an additional asset for the bot to fetch
- Cache the .html version of the page on a CDN to improve speed
| Steps | User | Bot |
| --- | --- | --- |
| User-agent detection | Gets the standard version of the page with JS | Gets a clean, optimised, cached version that needs no JS because all client-side JS has already been rendered. The cached file sits on a fast CDN and is syndicated |
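The process above can be sketched as a small routing function. This is a simplified illustration, not the production implementation: `PRERENDER_CACHE` is an in-memory stand-in for a CDN cache, and `select_response`, `fake_app_shell` and `fake_prerender` are hypothetical names. The substring match on "bot" mirrors the rule described above and is deliberately crude.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a CDN cache of pre-rendered HTML, keyed by URL path.
PRERENDER_CACHE: Dict[str, str] = {}


def is_bot(user_agent: str) -> bool:
    """Rule from the process above: any user agent that includes 'bot'."""
    return "bot" in user_agent.lower()


def select_response(
    path: str,
    user_agent: str,
    app_shell: Callable[[str], str],
    prerender: Callable[[str], str],
) -> str:
    """Serve the JS app to users and cached pre-rendered HTML to bots, same URL."""
    if not is_bot(user_agent):
        return app_shell(path)
    if path not in PRERENDER_CACHE:
        # Render once, then reuse the cached copy for later bot requests.
        PRERENDER_CACHE[path] = prerender(path)
    return PRERENDER_CACHE[path]


# Stand-in renderers so the sketch runs on its own.
render_calls = []


def fake_app_shell(path: str) -> str:
    return '<div id="app"></div><script src="/app.js"></script>'


def fake_prerender(path: str) -> str:
    render_calls.append(path)
    return f"<h1>Rendered content for {path}</h1>"


user_page = select_response("/item/1", "Mozilla/5.0 (Windows NT 10.0)", fake_app_shell, fake_prerender)
bot_page_1 = select_response("/item/1", "Mozilla/5.0 (compatible; GPTBot/1.0)", fake_app_shell, fake_prerender)
bot_page_2 = select_response("/item/1", "Mozilla/5.0 (compatible; GPTBot/1.0)", fake_app_shell, fake_prerender)
print(len(render_calls))
```

The second bot request is answered from the cache, so the expensive render step runs only once per URL.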
I’ve done this exact implementation for this type of website, and we were able to increase the indexing rate from 30% to 93% in an 18-month window. Why? Because it is estimated that for every 1 JS-rendered page you could crawl 100 non-JS pages with the same computing resources. Therefore, it is logical to have a version like this.
But is this not cloaking? No, we deliver EXACTLY the same textual version that the browser would see once JS is rendered.
Definition: Cloaking is when you intentionally try to manipulate search engines by hiding / adding content to a version that is only for them.
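One way to keep yourself honest on this point is to compare the visible text of both versions. The sketch below uses Python's standard `html.parser`. The helper names `visible_text` and `same_text` are hypothetical, and it assumes the "user" input is the DOM serialised after JS has finished rendering, not the raw page source.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of script/style/noscript."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the text content with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def same_text(user_html: str, bot_html: str) -> bool:
    """True when both versions expose the same visible text."""
    return visible_text(user_html) == visible_text(bot_html)


rendered_for_user = "<body><h1>Title</h1><p>Body text</p><script>render()</script></body>"
served_to_bot = "<body><h1>Title</h1>\n<p>Body   text</p></body>"
padded_for_bot = "<body><h1>Title</h1><p>Body text</p><p>Extra keywords</p></body>"

print(same_text(rendered_for_user, served_to_bot))
print(same_text(rendered_for_user, padded_for_bot))
```

A check like this only covers text. It says nothing about links, structured data, or media that might differ between the two versions.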
2. Heavy JS rendered content
Imagine you have a website that hosts PDF files, and for the end user you display each PDF in a JS-heavy viewer. Search engines are unable to see the content of that PDF, and you can’t expose the PDF directly because users would steal the files. How do you expose it to search engines / LLMs?
In this context using an .md file would make logical sense, but my main concern is how LLMs would understand the relationship between the two versions, and which one they would cite when answering a prompt. Would they ever reference a .md file? I think not.
So again, the cleaned-up .html file would win here.
- Only one URL for the same asset, which helps crawl budget and indexing
Questions I Am Frequently Asked
A: No. A better approach is to ensure you serve a pre‑rendered HTML version and cache it properly. This way, the content remains crawlable to both LLM bots and traditional search crawlers.
Not really. Markdown is useful internally—for documentation or content maintenance—but optimisation for LLMs should focus on clean, structured HTML output, not Markdown.
Do .md files improve visibility in large language models (LLMs) compared to standard HTML pages? No. Current evidence shows LLMs do not prioritize .md files over HTML. Well-structured HTML remains the most reliable format for visibility.
Do .md files help LLMs or search engines access my content more easily? No. A better solution is to use an HTML renderer and cache the results. This ensures both search engines and LLMs can properly access your content.
Have .md files increased brand mentions or visibility in generative AI search results? Not yet. While some proposals suggest benefits, no third-party studies or log data confirm that .md exposure improves brand visibility in LLM outputs.
Should I maintain .md versions of my pages alongside existing HTML, and is the ROI justified? Maintaining parallel .md and HTML versions increases workload and risk of outdated content. At this stage, the ROI is unproven.
Should we list .md files in llms.txt to signal them to AI crawlers, or is it better to optimize the HTML output we already have? Optimize HTML. Listing .md files in llms.txt is experimental and currently unused by major LLM bots. HTML optimization offers a clearer path to results. More info here
Third‑Party Context
Independent sources in both search and AI research echo these observations:
- Yoast SEO notes that although llms.txt was proposed as a kind of robots.txt for AI crawlers, no major LLM provider currently supports it. GPTBot, Claude, and Google’s AI products do not read Markdown or llms.txt as part of their active crawling routines (Yoast, 2024).
  - This is also validated by my LLMs.txt analysis used for GEO
- Daydream’s analysis warns that managing .md-based feeds can introduce risks of data divergence, where Markdown goes out of sync with published HTML. This could actually harm brand accuracy if models ingest outdated content (Daydream Library, 2024).
- Academic work (HtmlRAG study, arXiv, 2024) tested retrieval-augmented generation (RAG) pipelines and found that HTML retained semantic structure (headings, metadata, table layouts) that plain text or Markdown often strips away. These structural signals improved contextual knowledge retention and retrieval performance, supporting the argument that HTML delivers more value to LLM ingestion workflows.
Collectively, these insights align with the practical results of the log‑file study.
Recommendations based on my research, experience and observation
- Do not publish directly in Markdown for LLM visibility. Keep Markdown for internal versioning and workflows.
- Focus on HTML as the public output layer. Ensure semantic tags are used and the pages are properly cached.
- Do not rely on llms.txt today. It is an experimental idea with very limited adoption.
- Prioritise accessibility and clarity of HTML outputs over trying to second-guess speculative AI standards.
Conclusion
This analysis, while narrow in scope, makes one point clear: in the sampled logs, LLM crawlers did not request Markdown files, even when those files were explicitly listed in llms.txt. Instead, the third-party evidence cited above suggests that AI ingestion pipelines focus on HTML-rendered content, which provides richer context and stronger retrieval signals.
For now, organisations should maintain emphasis on accessible, semantically‑structured HTML, coupled with robust caching strategies. Markdown remains valuable as an internal content authoring format, but it is not a shortcut to visibility within LLM ecosystems.