Introduction
As large language models (LLMs) such as ChatGPT, Claude, and Gemini become widespread, many website operators are asking: “How much of my website can AI actually read?”
I had the same question and decided to test it myself. Being an AI training target carries both benefits (broader information dissemination and increased recognition for publishers of specialized information) and concerns (unintended usage and inaccurate generated information).
However, accurate information about actual LLM access behavior is limited; much of what circulates is speculation. So I used my own website to verify empirically whether LLMs can actually access it.
Test Site Architecture
The verification target was my own website.
- Domain: Security, privacy, and AI governance
- Technical architecture: Hybrid (multi-domain site: static hosting + CMS combination)
Architecture overview:
Root domain (static hosting service):
- /llms.txt (AI-oriented sitemap)
- /robots.txt (crawler control)
- /.well-known/security.txt (RFC 9116 compliant)
- Redirect configuration
www subdomain (no-code CMS):
- / (top page)
- /XXXX/ (content page 1)
- /YYYY/ (content page 2)
- /ZZZZ/ (content page 3)
Test Method and Results
Test Overview
- LLMs tested: ChatGPT, Claude (Anthropic), Gemini
- Test method: information retrieval through each LLM’s web access features (web search plugins, etc.)
- Test date: July 3, 2025 (setup performed on July 2)

Note: This tested not the LLMs’ direct web crawling capabilities, but their ability to retrieve information through the web access features they provide to users.
Successful Access (Excerpts)
The following content was successfully accessed (though results varied by LLM — e.g., some LLMs could not view CMS content):
| File | Hosting | Result | Content Retrieved |
|---|---|---|---|
| llms.txt | Static | Success | LLM content usage policy |
| robots.txt | Static | Success | Crawler permission settings |
| .well-known/security.txt | Static | Success | Security contact information |
| Top page | No-code CMS | Success | Full text retrieved |
| Content pages (multiple) | No-code CMS | Success | Detailed information |
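The static files in the table above can be spot-checked with a short script. This is a minimal sketch, not part of the original test: `example.com` is a placeholder domain, and the actual network fetch is left commented out.

```python
from urllib.request import Request, urlopen

# Discovery files an AI-aware site exposes at the domain root.
WELL_KNOWN_PATHS = ["/llms.txt", "/robots.txt", "/.well-known/security.txt"]

def discovery_urls(domain: str) -> list[str]:
    """Build the canonical URLs of the discovery files for a domain."""
    return [f"https://{domain}{path}" for path in WELL_KNOWN_PATHS]

def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a URL (raises on network errors)."""
    req = Request(url, headers={"User-Agent": "site-check/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.status

# Example (requires network access to your own domain):
# for url in discovery_urls("example.com"):
#     print(url, fetch_status(url))
```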
Access Failures
The following types of access failures were confirmed:
Inaccessible public pages with specific URL patterns: Some LLMs reported certain public pages with specific URL patterns as “inaccessible.” These pages actually required no authentication and had no crawl-blocking settings in robots.txt or llms.txt. The likely causes are temporary web retrieval failures on the LLM side, or a pattern-matching bias in which the model misclassified pages as “private” based on certain strings in the URL. Since access intermittently succeeded and failed, the exact cause remains unclear.
Temporary errors: Errors likely attributable to DNS propagation delays immediately after setup, delayed server certificate provisioning, and unexplained 500-series errors on the CMS side also occurred.
Test Results
This verification revealed the following about LLM website access:
LLMs Can Read No-Code CMS Content (With Occasional Errors)
Verified result: LLMs can access content built on major no-code CMS platforms. As long as the final HTML output is properly structured, LLMs can read the information without issues. However, errors occurred multiple times even after successful reads.
AI Can Read Dynamically Generated Sites
Verified result: LLMs successfully retrieved CMS content that relies on JavaScript for dynamic generation. Modern LLM crawlers have evolved to broadly support many dynamic sites. Again, errors after successful reads occurred multiple times.
Stumbling Points
Several technical challenges were encountered during verification:
1. Understanding Domain Architecture
Challenge: Difficulty determining how to structure the relationship between subdomains and root domains. Since system files couldn’t be placed directly in the CMS’s design section, domain separation was necessary.
Solution: Proper redirect configuration on the root domain side enabled integrated management and operation of both domains.
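The exact redirect mechanism depends on the hosting service, which the article does not name. As one illustration only, a Netlify-style `_redirects` file (all paths and domains here are placeholders) could forward content paths from the root domain to the CMS subdomain while the discovery files stay on static hosting:

```text
# Forward content requests on the root domain to the CMS subdomain;
# /llms.txt, /robots.txt and /.well-known/* remain on static hosting.
/XXXX/*   https://www.example.com/XXXX/:splat   301
/         https://www.example.com/              301
```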
2. Placing RFC-Compliant Files
Challenge: When placing RFC-compliant special files such as those in the .well-known directory, some hosting services treated files and directories beginning with "." as private (some SaaS platforms don’t display them at all).
Solution: Migrating to a hosting service that properly supports .well-known directory placement achieved standards compliance.
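For reference, a minimal file satisfying RFC 9116 (which requires the Contact and Expires fields) served at /.well-known/security.txt could look like the following; the address, dates, and domain are placeholders:

```text
Contact: mailto:security@example.com
Expires: 2026-07-02T00:00:00.000Z
Preferred-Languages: en
Canonical: https://example.com/.well-known/security.txt
```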
3. Automatic SSL Certificate Issuance
Challenge: DNS propagation delays prevented automatic SSL certificate issuance from completing immediately. Repeatedly reloading and re-triggering issuance didn’t change the situation.
Solution: Waiting for time to pass (up to approximately 24 hours) allowed DNS propagation and automatic SSL certificate issuance to complete, resolving the issue.
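Whether propagation has reached your local resolver can be checked from the Python standard library instead of repeatedly reloading; a small sketch (the hostname is a placeholder for your own subdomain):

```python
import socket

def resolves(hostname: str) -> bool:
    """Return True once DNS for the hostname is visible to this resolver."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

# Poll periodically; once this returns True, automatic SSL certificate
# issuance can be expected to proceed.
# resolves("www.example.com")
```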
Note: Since errors occurred even after LLMs could read the site, troubleshooting was sometimes difficult.
Controlling LLM Training Access
As website operators, there are cases where you want to show content to LLMs and cases where you don’t. Here are countermeasures for each scenario. (Since confirming whether content actually became a training target takes time, this article covers theory only.)
When You Want LLMs to See Your Content
- Explicit permission in robots.txt: Allow crawlers access to the entire site or specific paths. This is the fundamental instruction for LLMs to discover content.
- Organized HTML/URL structure: Use semantic HTML and logical URL structures so LLMs can easily understand content structure.
- Content quality: Providing high-quality content with strong expertise and trustworthiness is most important.
- Preparing for future AI optimization via llms.txt: Use llms.txt files to explicitly indicate your intention to permit LLM training. This complements robots.txt and conveys more granular intent. (Note: This is currently a proposed standard without official support from major LLMs.)
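The llms.txt proposal suggests a simple Markdown format: an H1 title, a blockquote summary, optional free-text policy notes, then sections of links. A sketch using this article’s placeholder paths (all names and URLs are hypothetical):

```markdown
# Example Site

> Articles on security, privacy, and AI governance.

Content on this site may be used for LLM training with attribution.

## Pages

- [Top page](https://www.example.com/): site overview
- [Content page 1](https://www.example.com/XXXX/): topic summary
```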
When You Don’t Want LLMs to See Your Content
- Explicit bot rejection in robots.txt: Deny specific LLM crawlers or all crawlers access to the entire site or specific paths.
- Authentication restrictions: Pages requiring login (member-only content) won’t be crawled by LLMs.
- JavaScript obfuscation (unverified): Heavily obfuscated content using JavaScript may be difficult to crawl. (However, this is unverified, and modern crawlers have strong JavaScript parsing capabilities, making this an unreliable method.)
Configuration Examples
Promoting training access — robots.txt example:
```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: *
Allow: /
```
Plus detailed guidance in llms.txt (e.g., training permission statements, credit display requests, citation rules).
Preventing training access — robots.txt example:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /
```
Plus a noindex meta tag on each page:

```html
<meta name="robots" content="noindex">
```

(Prevents search engines and LLMs from indexing the page.)
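The blocking robots.txt example above can be sanity-checked offline with Python’s standard `urllib.robotparser` before deployment. The robots.txt text and URL below are placeholders mirroring that example:

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the blocking example; URLs are placeholders.
BLOCKING_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
"""

def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Check whether `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

print(is_allowed(BLOCKING_ROBOTS, "GPTBot", "https://www.example.com/page/"))  # False
```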
Key Findings
This verification yielded the following important insights about the relationship between LLMs and websites:
- Modern LLMs have evolved: They flexibly accommodate diverse hosting formats and CMS technologies, with the ability to crawl many websites.
- Importance of robots.txt: The test reconfirmed that robots.txt is the most reliable and standard means of controlling LLM access.
- Effectiveness of hybrid architecture: Even hybrid configurations combining static hosting and CMS can be accessed by LLMs without technical issues.
- Declaring intent to LLMs: Using llms.txt to clearly communicate your intention to permit or deny training is worth considering, though its practical effectiveness requires future verification.
- Adapting to the AI era: Proper response to LLMs is connected not only to SEO (Search Engine Optimization) but also to new concepts like GEO (Generative Engine Optimization).
Conclusion
This empirical verification made the following points clear:
- Modern LLMs can access a wider variety of website technologies than imagined (though access capabilities and results may vary by LLM).
- Intentional configuration by site operators is more important than technical constraints for controlling access.
- With proper design, it is possible to strategically control the relationship with AI.
- Even when browser access works without issues, LLM access may hit 500-series errors after an initially successful (200) fetch; keep this in mind during verification.
As an AI-era web strategy, I strongly feel that balancing “information design assuming you will be learned from” with “a controllable publication policy” is becoming increasingly important.
About the Test Data
This article is written based on actual access test results.
Disclaimer
This article reflects verification results using specific services at a specific point in time, and the author assumes no responsibility for the accuracy, completeness, or applicability of its content to any particular situation.
The writing process for this article includes the use of AI. The ultimate validity and applicability of the content should be evaluated at the reader’s own judgment and responsibility.