Semantic Discovery: JSON-LD for RAG Citation Accuracy

Root Configuration: llms.txt and robots.txt

Deploy the following configuration files to your root directory to establish the primary handshake with AI crawlers and agentic systems. As of June 2026, the distinction between training crawlers and real-time search crawlers is the critical factor for citation visibility.

/llms.txt

Place this file at the root to provide a machine-readable index for Business-to-Agent (B2A) interactions. While major search engines do not use this for ranking, agentic browsers and IDE-based agents use it for context window optimization.

# Olwen
> GEO and AI search visibility platform for technical founders.

## Documentation
- [Core Features](https://www.olwen.io/features): Technical breakdown of AI tracking and CMS automation.
- [API Reference](https://www.olwen.io/api): Integration specs for CDN and repo connections.
- [GEO Implementation](https://www.olwen.io/guides/geo): Step-by-step guide for RAG optimization.

## Tools
- [FAQ Generator](https://www.olwen.io/tools/faq): Automated schema generation for product pages.

/robots.txt

Configure your robots.txt to prioritize search-specific bots that drive real-time citations. Use the following block to allow retrieval for search while managing training bandwidth.

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /private/

User-agent: ClaudeBot
Disallow: /private/

User-agent: Google-Extended
Allow: /

Separating OAI-SearchBot from GPTBot ensures that your content is available for real-time retrieval in ChatGPT Search without necessarily feeding the broader training pipeline if that is your preference. PerplexityBot remains highly aggressive in June 2026: ensure your server can handle the request volume or implement rate limiting at the CDN level.

Printed server log showing AI crawler user agents.

Schema Configuration: Organization and Product

Embed the following JSON-LD block in the head of your root template. This structure provides the foundational entities that Retrieval-Augmented Generation (RAG) systems use to verify brand claims and resolve entity ambiguity.

{
  "@context": "https://schema.org",
  "@type": "Organization",
  "@id": "https://www.olwen.io/#organization",
  "name": "Olwen",
  "url": "https://www.olwen.io",
  "logo": "https://www.olwen.io/logo.png",
  "description": "GEO and AI search visibility platform for founders.",
  "knowsAbout": [
    "Generative Engine Optimization",
    "Search Engine Optimization",
    "Retrieval-Augmented Generation",
    "Structured Data"
  ],
  "sameAs": [
    "https://x.com/olwen_io",
    "https://linkedin.com/company/olwen"
  ]
}

Field Mapping for Citations

Precision in field mapping reduces the computational overhead for LLMs during the retrieval phase. When an agent fetches a page, it looks for specific keys to populate its internal knowledge graph.

knowsAbout and Authority

The knowsAbout property is the primary signal for topical authority. List specific technical competencies rather than broad marketing terms. For Olwen, this includes "Generative Engine Optimization" and "RAG optimization." This property helps AI models like Claude 3.5 or GPT-4o (as of June 2026) categorize your brand as a primary source for these topics during the retrieval step.

featureList for Comparison Tables

Within the Product schema, use the featureList property to provide a comma-separated list of capabilities. AI agents frequently generate comparison tables for users. If your features are buried in prose, the agent may hallucinate or omit them. Structured features ensure your product is accurately represented in competitive sets.

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Olwen GEO Suite",
  "description": "Automated AI search visibility and tracking.",
  "brand": {
    "@type": "Brand",
    "name": "Olwen"
  },
  "featureList": [
    "AI crawler visit tracking",
    "Automated FAQ generation",
    "Competitor visibility monitoring",
    "CMS-repo synchronization"
  ]
}

Brand Consistency

Deploy the Brand property within all product and service schemas. This ensures that different AI models, which may have been trained on varying datasets, resolve your product to the correct parent entity. Use the @id field to link the Product to the Organization defined in your root template. This creates a connected graph that RAG systems can traverse to verify the legitimacy of a claim.

Developer typing JSON-LD code on a mechanical keyboard.

Agentic Readiness: WebMCP Implementation

As of June 2026, the Web Model Context Protocol (WebMCP) is the emerging standard for allowing AI agents to interact with web applications. Unlike traditional scraping, WebMCP allows you to expose specific functions as tools.

Registering Tools

Use the navigator.modelContext API to register tools that an agent can call. This is particularly useful for search features or calculators that are otherwise difficult for an LLM to parse from the DOM.

if ('modelContext' in navigator) {
  navigator.modelContext.registerTool({
    name: "check_visibility",
    description: "Check the AI search visibility score for a specific domain.",
    parameters: {
      type: "object",
      properties: {
        domain: { type: "string", description: "The domain to analyze." }
      },
      required: ["domain"]
    },
    execute: async ({ domain }) => {
      const response = await fetch(`/api/visibility?domain=${domain}`);
      return await response.json();
    }
  });
}

By exposing this tool, you allow an agentic browser to perform a real-time lookup using your backend logic rather than relying on its own potentially stale training data. This increases the likelihood of your brand being cited as the source of the data.

Automated Deployment via Olwen

Maintaining schema accuracy across thousands of pages is a significant technical burden. Olwen automates this by connecting directly to your repository and CMS.

Repo and CMS Integration

Olwen monitors your code changes and content updates to ensure that JSON-LD blocks remain synchronized with the visible content. When a new product feature is added to your documentation, Olwen updates the featureList in the schema and the llms.txt file automatically. This prevents the "schema drift" that often leads to AI hallucinations or citation failures.

Tracking AI Crawler Visits

Olwen integrates with your CDN (e.g., Cloudflare, Vercel, or Akamai) to monitor requests from AI user agents. By analyzing the logs for OAI-SearchBot, Claude-SearchBot, and PerplexityBot, Olwen provides a real-time dashboard of which pages are being indexed for RAG.

Bot Name	Purpose	Frequency	Impact on Citation
OAI-SearchBot	ChatGPT Search	High	Primary source for OpenAI citations
Claude-SearchBot	Claude Search	Medium	Primary source for Anthropic citations
PerplexityBot	Perplexity Index	Very High	Drives real-time answer engine results
Googlebot	Google AI Overviews	High	Foundation for Gemini-based answers

Generating Website Fixes and FAQs

Based on the crawler data, Olwen identifies content gaps where AI agents are failing to find structured answers. It then generates recommended FAQ sections and technical fixes. These are pushed directly to your CMS or as a Pull Request to your repository, ensuring that your site is always optimized for the latest retrieval patterns.

Technical manual for WebMCP protocol on a desk.

Validation and Monitoring

After deploying JSON-LD and WebMCP tools, validate the implementation using the following workflow:

Schema Validation: Use the Schema Markup Validator to ensure there are no syntax errors in your JSON-LD. RAG systems often discard malformed blocks entirely.
Agentic Testing: Use a WebMCP-compatible browser (such as Chrome 149+ in June 2026) to verify that registered tools are discoverable by the agent.
Log Analysis: Monitor your CDN logs for 404 errors on /llms.txt. If bots are requesting this file and it is missing, you are losing an opportunity to provide a low-token summary of your site.
Citation Tracking: Use Olwen to monitor how often your brand is cited in major AI systems. Compare this against your competitor visibility to identify which schema properties require further refinement.

Efficiency in data delivery is the primary driver of citation accuracy. By reducing the noise in the context window through structured JSON-LD and root-level markdown, you ensure that AI agents can retrieve and verify your brand's claims with minimal latency. Map the legalName and sameAs properties to establish a verifiable identity that persists across different model versions and retrieval architectures. End of technical instruction.