Naly Engineering Notes: JSON-LD, Sitemaps, and AI Citation Readiness for Prediction Articles

Abstract

In Naly's article platform, JSON-LD, sitemaps, and explicit lead/metadata plumbing turn each published prediction note into a machine-readable artifact without replacing editorial quality. The thesis is that discovery quality now depends on two parallel contracts: one for users who read pages, and one for crawlers and agents that need canonical sources, structured facts, and stable update signals. Naly's goal is to make each article indexable, cite-ready, and time-accurate on first publish (as of June 23, 2026).

Where it sits in Naly

Naly's technology stack is already positioned for this: next@16.0.7 on React 19.2.1 for server-first rendering, drizzle-orm with @neondatabase/serverless for relational article data, and @vercel/blob for stable media URLs. The GEO objective is not a separate SEO subsystem; it is part of the publish pipeline that serves both humans and machines from the same canonical article model.

The current design anchor is the article publish boundary: a post record must generate identical signals across page markup, metadata blocks, sitemap exports, and article summaries. If any channel diverges, Googlebot, AI assistants, and internal analytics can each interpret the same article differently, which leads to inconsistent indexing, citation, and reporting.

Within Naly, this means these data paths are coupled:

Article body and source graph from drizzle-backed records
Page rendering and metadata via Next server components
Discovery control via sitemap.xml, news-sitemap.xml, and image metadata
Citation readiness via answer-first leads and explicit source URL arrays

Technical mechanism

Naly should implement a publication contract with five deterministic outputs per article.

Canonical article model: Each article should expose stable fields: canonical URL, headline, standfirst/lead, publish date, modified date, author objects, section/topic tags, main image URLs, source URLs, and language. This is the root of both Google and AI-facing interpretation. For prediction content, source URLs are especially important because they let external systems separate opinion from verifiable input.
Server-side metadata generation: Use generateMetadata in the article route's page.tsx or layout.tsx, with server-only logic, so crawler-visible tags are in the initial HTML when possible. The Next.js documentation supports this server-side model and notes that fetch requests made for metadata are memoized across generation paths, which reduces duplicated API work. Direct database queries (for example through drizzle) are not fetch calls, so they need React's cache function to get the same deduplication. For high-volume pages this keeps publish-time latency predictable.
JSON-LD injection: Render a strict NewsArticle block in app pages as a <script type="application/ld+json"> object with stable IDs and the fields the contract requires (headline, datePublished, dateModified, author, image, mainEntityOfPage, isPartOf where relevant). Google's structured data guidance recommends JSON-LD, and the Next.js documentation describes a script-based pattern for rendering structured entity data in page or layout components.
Discovery maps: Generate one general sitemap and one news-focused sitemap. Google's documentation frames both as crawl discovery tools, with a separate news sitemap allowed for cleaner tracking in Search Console. A sitemap entry should include loc and lastmod, plus image and news extensions at URL level when needed to aid specialized indexing. For image-heavy coverage, a dedicated image output helps keep discovery consistent.
Answer-first lead optimization: For AI and search surfaces, treat the lead paragraph as both user utility and machine utility. Use the same short lead as the Open Graph description and as the short-form answer surface while keeping the full body canonical to the article URL. This creates a coherent signal path: the first sentence a system returns is the same for human readers, bots, and attribution extractors.
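
The canonical model and the JSON-LD output described above can be sketched as follows. This is a minimal illustration, not Naly's actual schema: the CanonicalArticle interface, its field names, the buildNewsArticleJsonLd and serializeJsonLd helpers, and the example.com URLs are all assumptions made for the example. Mapping tags to keywords and source URLs to citation is one possible choice among several.

```typescript
// Hypothetical canonical article shape; field names are illustrative assumptions.
interface Author {
  name: string;
  url?: string;
}

interface CanonicalArticle {
  canonicalUrl: string;
  headline: string;
  lead: string;
  publishedAt: Date;
  modifiedAt: Date;
  authors: Author[];
  tags: string[];
  imageUrls: string[];
  sourceUrls: string[];
  language: string;
}

// Build a schema.org NewsArticle object. Keys are written in a fixed order so
// the serialized output is stable for identical input.
function buildNewsArticleJsonLd(a: CanonicalArticle): Record<string, unknown> {
  return {
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "@id": `${a.canonicalUrl}#article`,
    mainEntityOfPage: a.canonicalUrl,
    headline: a.headline,
    description: a.lead,
    datePublished: a.publishedAt.toISOString(),
    dateModified: a.modifiedAt.toISOString(),
    inLanguage: a.language,
    author: a.authors.map((p) =>
      p.url
        ? { "@type": "Person", name: p.name, url: p.url }
        : { "@type": "Person", name: p.name },
    ),
    image: a.imageUrls,
    keywords: a.tags,
    citation: a.sourceUrls,
  };
}

// Serialize for embedding inside <script type="application/ld+json">.
// Escaping "<" keeps a "</script>" sequence in article text from closing the tag early.
function serializeJsonLd(data: Record<string, unknown>): string {
  return JSON.stringify(data).replace(/</g, "\\u003c");
}

// Example input with placeholder values.
const exampleArticle: CanonicalArticle = {
  canonicalUrl: "https://example.com/articles/july-rate-outlook",
  headline: "Rate outlook: cut odds <50% for July",
  lead: "Markets price a July cut at under even odds.",
  publishedAt: new Date("2026-06-23T09:00:00Z"),
  modifiedAt: new Date("2026-06-23T11:30:00Z"),
  authors: [{ name: "A. Writer" }],
  tags: ["rates", "prediction"],
  imageUrls: ["https://example.com/img/rates.png"],
  sourceUrls: ["https://example.com/sources/rate-data"],
  language: "en",
};

const exampleScriptBody = serializeJsonLd(buildNewsArticleJsonLd(exampleArticle));
```

The resulting string would be placed as the body of the script tag in the page component. Because the builder takes only the canonical model as input, the description in the JSON-LD is the same lead text that the page and the Open Graph tags use.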

A compact publication workflow is:

Persist article and source graph in DB.
Build metadata + lead + schema payload from one normalized selector.
Emit page HTML, JSON-LD, and sitemap rows from that payload within the same publish operation, so the outputs cannot drift apart.
Revalidate or invalidate caches on post updates.
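
The second step of this workflow, building every output from one normalized selector, could look roughly like the sketch below. All names here are hypothetical (ArticleRecord, PageMetadata, SitemapRow, selectPublishPayload, the /articles/ path). The two output interfaces are local stand-ins that mirror a small subset of the fields Next.js accepts in its Metadata object and in sitemap entries; a real app would import the framework types instead.

```typescript
// Hypothetical stored record; in practice this would come from a database query.
interface ArticleRecord {
  slug: string;
  headline: string;
  lead: string;
  publishedAt: Date;
  modifiedAt: Date;
  imageUrls: string[];
}

// Local stand-in for a subset of page metadata fields (assumption, see lead-in).
interface PageMetadata {
  title: string;
  description: string;
  alternates: { canonical: string };
  openGraph: {
    type: "article";
    title: string;
    description: string;
    url: string;
    publishedTime: string;
    modifiedTime: string;
    images: string[];
  };
}

// Local stand-in for a sitemap entry (assumption, see lead-in).
interface SitemapRow {
  url: string;
  lastModified: Date;
  images: string[];
}

interface PublishPayload {
  metadata: PageMetadata;
  sitemapRow: SitemapRow;
}

// Derive every outward-facing signal from the same record in one place, so the
// canonical URL, lead, and timestamps are identical across outputs.
function selectPublishPayload(record: ArticleRecord, siteOrigin: string): PublishPayload {
  const origin = siteOrigin.replace(/\/+$/, "");
  const url = `${origin}/articles/${encodeURIComponent(record.slug)}`;
  // Collapse whitespace so the same lead string is safe for meta tags.
  const lead = record.lead.replace(/\s+/g, " ").trim();

  return {
    metadata: {
      title: record.headline,
      description: lead,
      alternates: { canonical: url },
      openGraph: {
        type: "article",
        title: record.headline,
        description: lead,
        url,
        publishedTime: record.publishedAt.toISOString(),
        modifiedTime: record.modifiedAt.toISOString(),
        images: record.imageUrls,
      },
    },
    sitemapRow: {
      url,
      lastModified: record.modifiedAt,
      images: record.imageUrls,
    },
  };
}
```

A generateMetadata function and the sitemap generator would then both call this selector rather than reading article fields directly, which keeps the canonical URL and modification time from being computed in two places.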

What the literature says

Google's documentation positions structured data as a way for crawlers to understand page facts at scale, while also warning that rich-result eligibility is conditional and not guaranteed. Official guidance repeatedly names JSON-LD as the recommended format and states that only compliant, representative, and non-misleading markup is eligible to appear in rich results.

Google also clarifies that sitemaps are discovery aids, not guarantees. Correctly formatted sitemaps help large or newly launched sites expose content and can carry content-specific hints (images/news), but indexing still depends on whether the crawler follows through and on the quality of the pages themselves.

On schema semantics, schema.org defines NewsArticle as a dedicated subtype for reporting and background news content, making it the natural match for Naly-style prediction and market-analysis posts when they report concrete updates.

From the platform side, Next.js guidance is aligned: metadata is best treated as render-time server responsibility, and JSON-LD is a supported, explicit method for structured description. The same ecosystem also exposes sitemap route conventions and generation APIs suitable for large URL sets.

In the RAG literature, one study on structured linked data for agentic retrieval found that Schema.org and linked-data representations can improve retrieval quality, especially when combined with richer navigable affordances beyond plain text. Another recent study of RAG context reports that formatting and context consistency materially change grounding behavior. Together, these papers support Naly's thesis that article metadata quality is not a cosmetic optimization; it changes how downstream systems consume the content.

Design trade-offs

Freshness versus cache stability: server-side metadata must refresh quickly on edits, while cached route artifacts should not flap on every request.
Minimal viable markup versus completeness: adding required fields improves compliance, but over-modeling risks stale or incorrect links if source data is delayed.
Crawl guidance versus trust signals: a broader sitemap set improves coverage, but too many low-value URLs can dilute quality in downstream indexing.
Human readability versus machine clarity: lead-first UX remains primary, but the same text must remain faithful when parsed by downstream systems.
Simplicity versus future proofing: start with strict required fields and stable typing now, then evolve toward richer entity graphs if evidence justifies complexity.

Failure modes

Structural invalidation: malformed JSON-LD or missing required fields make a page ineligible for rich results and can reduce confidence in AI parsing.
Semantic drift: if the visible lead/article body and structured description diverge, systems may treat Naly content as low-reliability or misleading.
Timestamp mismatch: a lagging dateModified makes an updated article look stale, which matters for prediction articles where timing is business-critical.
Sitemap entropy: stale lastmod values, oversized sitemaps, or paths blocked by robots.txt can hide fresh content from crawlers.
Over-optimized but unverifiable claims: structured fields that include unverifiable assertions can be penalized by quality checks even if markup is syntactically valid.
Version lock mismatch: mixed rendering paths (cached route handler + dynamic edits) can create split-brain metadata and inconsistent URL snapshots.
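
Several of these failure modes can be detected mechanically before or after publish. The sketch below covers three of them (missing fields, lead drift, and timestamp mismatch) under stated assumptions: the PublishedSignals shape, the REQUIRED_KEYS list, and the checkSignals function are hypothetical names, and the check assumes the JSON-LD description is expected to equal the visible lead.

```typescript
// Hypothetical snapshot of what one article currently exposes on each channel.
interface PublishedSignals {
  visibleLead: string;
  jsonLd: Record<string, unknown>;
  sitemapLastmod: string; // ISO 8601 timestamp taken from the sitemap entry
}

// Fields this sketch treats as mandatory; the list is an assumption.
const REQUIRED_KEYS = ["headline", "datePublished", "dateModified", "author", "image"];

// Return a list of human-readable problems; an empty list means all checks passed.
function checkSignals(s: PublishedSignals): string[] {
  const issues: string[] = [];

  // Structural invalidation: a required field is absent or empty.
  for (const key of REQUIRED_KEYS) {
    const v = s.jsonLd[key];
    const empty =
      v === undefined || v === null || v === "" || (Array.isArray(v) && v.length === 0);
    if (empty) issues.push(`missing field: ${key}`);
  }

  // Semantic drift: structured description no longer matches the visible lead.
  if (s.jsonLd["description"] !== s.visibleLead) {
    issues.push("lead drift: JSON-LD description differs from visible lead");
  }

  // Timestamp mismatch: compare instants, not strings, so formatting differences
  // between channels do not cause false alarms.
  const published = Date.parse(String(s.jsonLd["datePublished"]));
  const modified = Date.parse(String(s.jsonLd["dateModified"]));
  const lastmod = Date.parse(s.sitemapLastmod);
  if (isNaN(published) || isNaN(modified) || isNaN(lastmod)) {
    issues.push("unparseable timestamp");
  } else {
    if (modified < published) issues.push("dateModified precedes datePublished");
    if (modified !== lastmod) issues.push("sitemap lastmod differs from dateModified");
  }

  return issues;
}
```

A check like this could run in CI against rendered fixtures or as a post-publish probe; which of the two fits better depends on how the caches are invalidated.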

Implementation notes

For Naly, the practical rollout should be phased and deterministic:

Add a required metadata schema in the article domain model before changing rendering.
Add a single JSON-LD builder function with type-safe input and deterministic ordering.
Normalize lead, source URLs, and image URLs at write time.
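Add generateMetadata for dynamic article-level tags, app/sitemap.ts for the general sitemap, and a route handler (for example app/news-sitemap.xml/route.ts) for the news sitemap, with explicit change windows. Next.js treats app/sitemap.ts as a built-in file convention, but a file named app/news-sitemap.ts has no special meaning, so the news sitemap has to be served from its own route.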
Emit dedicated image references where images materially influence discovery.
Add CI checks for JSON-LD validity and structured-data guideline conformance.
Add canary dashboards: sitemap freshness, schema parse success, and lead-to-body consistency.
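
As one illustration of the sitemap step in this rollout, the sketch below builds a news sitemap document as a string. The NewsItem shape, buildNewsSitemap, and escapeXml are hypothetical names. The two-day cutoff reflects Google's guidance that a news sitemap should list only recently published articles; the current time is passed in as a parameter so the output is reproducible in tests.

```typescript
// Hypothetical minimal input for one news sitemap entry.
interface NewsItem {
  url: string;
  title: string;
  publishedAt: Date;
}

const TWO_DAYS_MS = 2 * 24 * 60 * 60 * 1000;

// Escape the five characters that are special in XML text and attributes.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build the sitemap XML, keeping only items published within the last two days.
function buildNewsSitemap(
  items: NewsItem[],
  publicationName: string,
  language: string,
  now: Date,
): string {
  const recent = items.filter((item) => {
    const age = now.getTime() - item.publishedAt.getTime();
    return age >= 0 && age <= TWO_DAYS_MS;
  });

  const urls = recent.map((item) =>
    [
      "  <url>",
      `    <loc>${escapeXml(item.url)}</loc>`,
      "    <news:news>",
      "      <news:publication>",
      `        <news:name>${escapeXml(publicationName)}</news:name>`,
      `        <news:language>${escapeXml(language)}</news:language>`,
      "      </news:publication>",
      `      <news:publication_date>${item.publishedAt.toISOString()}</news:publication_date>`,
      `      <news:title>${escapeXml(item.title)}</news:title>`,
      "    </news:news>",
      "  </url>",
    ].join("\n"),
  );

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"',
    '        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}
```

In a Next.js app, a route handler could return this string with an XML content type. Filtering by publish time inside the builder, rather than in the query, is a design choice made here for testability and could just as well live in the data layer.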

This design is compatible with existing Naly runtime components and keeps implementation local to publish-time code paths, which aligns with the team goal of maximizing trust, retention, and discoverability without replacing existing content workflows.