Introduction
As large language models (LLMs) such as ChatGPT, Claude, and Gemini become widespread, many website operators are asking: “How much of my website can AI actually read?”
I had the same question and decided to test it myself. Being an AI training target carries both benefits (broader information dissemination and increased recognition for publishers of specialized information) and concerns (unintended usage and inaccurate generated information).
However, accurate information about actual LLM access behavior is limited; much of what circulates is speculation. So I used my own website to verify empirically whether LLMs can actually access it.
Test Site Architecture
The verification target was my own website.
- Domain: Security, privacy, and AI governance
- Technical architecture: Hybrid (multi-domain site: static hosting + CMS combination)
Architecture overview:
Root domain (static hosting service):
- /llms.txt (AI-oriented sitemap)
- /robots.txt (crawler control)
- /.well-known/security.txt (RFC 9116 compliant)
- Redirect configuration
www subdomain (no-code CMS):
- / (top page)
- /XXXX/ (content page 1)
- /YYYY/ (content page 2)
- /ZZZZ/ (content page 3)
Test Method and Results
Test Overview
- LLMs tested: ChatGPT, Claude (Anthropic), Gemini
- Test method: information retrieval through each LLM’s web access features (web search plugins, etc.)
- Test date: July 3, 2025 (setup performed on July 2)

Note: This tested not the LLMs’ direct web crawling capabilities, but their ability to retrieve information through the web access features they provide to users.
Successful Access (Excerpts)
The following content was successfully accessed (though results varied by LLM — e.g., some LLMs could not view CMS content):
| File | Hosting | Result | Content Retrieved |
|---|---|---|---|
| llms.txt | Static | Success | LLM content usage policy |
| robots.txt | Static | Success | Crawler permission settings |
| .well-known/security.txt | Static | Success | Security contact information |
| Top page | No-code CMS | Success | Full text retrieved |
| Content pages (multiple) | No-code CMS | Success | Detailed information |
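The static files in the table above can be spot-checked with a short script. This is a minimal sketch, not part of the original test: `example.com` is a placeholder domain, and the actual network fetch is left commented out.

```python
from urllib.request import Request, urlopen

# Discovery files an AI-aware site exposes at the domain root.
WELL_KNOWN_PATHS = ["/llms.txt", "/robots.txt", "/.well-known/security.txt"]

def discovery_urls(domain: str) -> list[str]:
    """Build the canonical URLs of the discovery files for a domain."""
    return [f"https://{domain}{path}" for path in WELL_KNOWN_PATHS]

def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a URL (raises on network errors)."""
    req = Request(url, headers={"User-Agent": "site-check/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.status

# Example (requires network access to your own domain):
# for url in discovery_urls("example.com"):
#     print(url, fetch_status(url))
```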
Access Failures
The following types of access failures were confirmed:
Inaccessible public pages with specific URL patterns: Some LLMs reported certain public pages with specific URL patterns as “inaccessible.” These pages actually required no authentication and had no crawl-blocking settings in robots.txt or llms.txt. The likely causes are temporary web retrieval failures on the LLM side, or a pattern-matching bias in which the model misclassified pages as “private” based on certain strings in the URL. Since access intermittently succeeded and failed, the exact cause remains unclear.
Temporary errors: Errors likely attributable to DNS propagation delays immediately after setup, delayed server certificate provisioning, and unexplained 500-series errors on the CMS side also occurred.
Test Results
This verification revealed the following about LLM website access:
LLMs Can Read No-Code CMS Content (With Occasional Errors)
Verified result: LLMs can access content built on major no-code CMS platforms. As long as the final HTML output is properly structured, LLMs can read the information without issues. However, errors occurred multiple times even after successful reads.
AI Can Read Dynamically Generated Sites
Verified result: LLMs successfully retrieved CMS content that relies on JavaScript for dynamic generation. Modern LLM crawlers have evolved to broadly support many dynamic sites. Again, errors after successful reads occurred multiple times.
Stumbling Points
Several technical challenges were encountered during verification:
1. Understanding Domain Architecture
Challenge: Difficulty determining how to structure the relationship between subdomains and root domains. Since system files couldn’t be placed directly in the CMS’s design section, domain separation was necessary.
Solution: Proper redirect configuration on the root domain side enabled integrated management and operation of both domains.
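The exact redirect mechanism depends on the hosting service, which the article does not name. As one illustration only, a Netlify-style `_redirects` file (all paths and domains here are placeholders) could forward content paths from the root domain to the CMS subdomain while the discovery files stay on static hosting:

```text
# Forward content requests on the root domain to the CMS subdomain;
# /llms.txt, /robots.txt and /.well-known/* remain on static hosting.
/XXXX/*   https://www.example.com/XXXX/:splat   301
/         https://www.example.com/              301
```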
2. Placing RFC-Compliant Files
Challenge: When placing RFC-compliant special files such as those in the .well-known directory, some hosting services treated files and directories beginning with "." as private (some SaaS platforms don’t display them at all).
Solution: Migrating to a hosting service that properly supports .well-known directory placement achieved standards compliance.
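For reference, a minimal file satisfying RFC 9116 (which requires the Contact and Expires fields) served at /.well-known/security.txt could look like the following; the address, dates, and domain are placeholders:

```text
Contact: mailto:security@example.com
Expires: 2026-07-02T00:00:00.000Z
Preferred-Languages: en
Canonical: https://example.com/.well-known/security.txt
```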
3. Automatic SSL Certificate Issuance
Challenge: DNS propagation delays prevented automatic SSL certificate issuance from completing immediately. Repeatedly reloading and re-triggering issuance didn’t change the situation.
Solution: Waiting for time to pass (up to approximately 24 hours) allowed DNS propagation and automatic SSL certificate issuance to complete, resolving the issue.
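Whether propagation has reached your local resolver can be checked from the Python standard library instead of repeatedly reloading; a small sketch (the hostname is a placeholder for your own subdomain):

```python
import socket

def resolves(hostname: str) -> bool:
    """Return True once DNS for the hostname is visible to this resolver."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

# Poll periodically; once this returns True, automatic SSL certificate
# issuance can be expected to proceed.
# resolves("www.example.com")
```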
Note: Since errors occurred even after LLMs could read the site, troubleshooting was sometimes difficult.
Controlling LLM Training Access
As website operators, there are cases where you want to show content to LLMs and cases where you don’t. Here are countermeasures for each scenario. (Since confirming whether content actually became a training target takes time, this article covers theory only.)
When You Want LLMs to See Your Content
- Explicit permission in robots.txt: Allow crawlers access to the entire site or specific paths. This is the fundamental instruction for LLMs to discover content.
- Organized HTML/URL structure: Use semantic HTML and logical URL structures so LLMs can easily understand content structure.
- Content quality: Providing high-quality content with strong expertise and trustworthiness is most important.
- Preparing for future AI optimization via llms.txt: Use llms.txt files to explicitly indicate your intention to permit LLM training. This complements robots.txt and conveys more granular intent. (Note: This is currently a proposed standard without official support from major LLMs.)
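The llms.txt proposal suggests a simple Markdown format: an H1 title, a blockquote summary, optional free-text policy notes, then sections of links. A sketch using this article’s placeholder paths (all names and URLs are hypothetical):

```markdown
# Example Site

> Articles on security, privacy, and AI governance.

Content on this site may be used for LLM training with attribution.

## Pages

- [Top page](https://www.example.com/): site overview
- [Content page 1](https://www.example.com/XXXX/): topic summary
```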
When You Don’t Want LLMs to See Your Content
- Explicit bot rejection in robots.txt: Deny specific LLM crawlers or all crawlers access to the entire site or specific paths.
- Authentication restrictions: Pages requiring login (member-only content) won’t be crawled by LLMs.
- JavaScript obfuscation (unverified): Heavily obfuscated content using JavaScript may be difficult to crawl. (However, this is unverified, and modern crawlers have strong JavaScript parsing capabilities, making this an unreliable method.)
Configuration Examples
Promoting training access — robots.txt example:
```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: *
Allow: /
```
Plus detailed guidance in llms.txt (e.g., training permission statements, credit display requests, citation rules).
Preventing training access — robots.txt example:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /
```
Plus a noindex meta tag on each page:

```html
<meta name="robots" content="noindex">
```

(Prevents search engines and LLMs from indexing the page.)
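The blocking robots.txt example above can be sanity-checked offline with Python’s standard `urllib.robotparser` before deployment. The robots.txt text and URL below are placeholders mirroring that example:

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the blocking example; URLs are placeholders.
BLOCKING_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
"""

def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Check whether `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

print(is_allowed(BLOCKING_ROBOTS, "GPTBot", "https://www.example.com/page/"))  # False
```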
Key Findings
This verification yielded the following important insights about the relationship between LLMs and websites:
- Modern LLMs have evolved: They flexibly accommodate diverse hosting formats and CMS technologies, with the ability to crawl many websites.
- Importance of robots.txt: The test reconfirmed that robots.txt is the most reliable and standard means of controlling LLM access.
- Effectiveness of hybrid architecture: Even hybrid configurations combining static hosting and CMS can be accessed by LLMs without technical issues.
- Declaring intent to LLMs: Using llms.txt to clearly communicate your intention to permit or deny training is worth considering, though its practical effectiveness requires future verification.
- Adapting to the AI era: Proper response to LLMs is connected not only to SEO (Search Engine Optimization) but also to new concepts like GEO (Generative Engine Optimization).
Conclusion
This empirical verification made the following points clear:
- Modern LLMs can access a wider variety of website technologies than imagined (though access capabilities and results may vary by LLM).
- Intentional configuration by site operators is more important than technical constraints for controlling access.
- With proper design, it is possible to strategically control the relationship with AI.
- Even when browser access works without issues, LLM access may hit 500-series errors after an initially successful (200) fetch; keep this in mind during verification.
As an AI-era web strategy, I strongly feel that balancing “information design assuming you will be learned from” with “a controllable publication policy” is becoming increasingly important.
About the Test Data
This article is written based on actual access test results.
Disclaimer
This article reflects verification results using specific services at a specific point in time, and the author assumes no responsibility for the accuracy, completeness, or applicability of its content to any particular situation.
The writing process for this article includes the use of AI. The ultimate validity and applicability of the content should be evaluated at the reader’s own judgment and responsibility.