Naly Engineering Notes: Codex CLI, Structured JSON, and Cron-Safe AI Workers

Abstract

Naly uses Codex CLI as the control plane for scheduled AI work: each cron entry runs a checked-in script that launches a bounded assistant task with web search and a strict output contract, then writes validated JSON artifacts for downstream publishing, replay, and retries. This makes model work behave like an operational pipeline (same command, explicit schema, explicit artifacts) rather than a mutable interactive workflow. As of 2026-06-25, Naly treats this distinction as the main source of reliability for its growth-critical content infrastructure.

Where it sits in Naly

Naly already has active GST priorities around discovery, retention, and distribution workflows. This pattern fits that execution layer directly:

Cron jobs invoke a single tsx script per use case so every run stays inside versioned code, not ad hoc shell snippets.
Codex CLI performs the reasoning and retrieval work, while Naly handles orchestration control: schedule, retries, locking, and durable artifacts.
Structured output feeds downstream systems built on Next.js, React, Drizzle ORM, and Neon without downstream schema guessing.
External logs and run metadata are written under NALY_LOG_ROOT, making post-mortems reproducible independent of process output buffering.

The thesis is: for production use, LLM quality is only half the system; deterministic boundaries around that LLM are the other half.

Technical mechanism

1) Contract-first worker entrypoint

Each task starts from three immutable inputs:

A codified prompt template and task intent.
A response schema that must be validated.
A run envelope (run_id, source_query, attempt, env), persisted before Codex is invoked.
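
The run envelope can be sketched as a small typed record that is written to disk before any model work starts. This is a minimal illustration, not Naly's actual code: the four field names come from the list above, while the RunEnvelope type, the buildEnvelope, persistEnvelope, and loadEnvelope helpers, and the one-directory-per-run layout are assumptions made for the example.

```typescript
import { randomUUID } from "node:crypto";
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical shape; the field names follow the run envelope described above.
interface RunEnvelope {
  run_id: string;
  source_query: string;
  attempt: number;
  env: string;
}

function buildEnvelope(sourceQuery: string, attempt: number, env: string): RunEnvelope {
  return { run_id: randomUUID(), source_query: sourceQuery, attempt, env };
}

// Writes <logRoot>/<run_id>/envelope.json and returns its path.
// The "wx" flag makes the write fail if the file already exists,
// so an envelope is never silently overwritten.
function persistEnvelope(logRoot: string, envelope: RunEnvelope): string {
  const dir = join(logRoot, envelope.run_id);
  mkdirSync(dir, { recursive: true });
  const file = join(dir, "envelope.json");
  writeFileSync(file, JSON.stringify(envelope, null, 2) + "\n", { flag: "wx" });
  return file;
}

// Reads an envelope back, e.g. for replay or a post-mortem.
function loadEnvelope(file: string): RunEnvelope {
  return JSON.parse(readFileSync(file, "utf8")) as RunEnvelope;
}

// Assumption: fall back to a temp directory when NALY_LOG_ROOT is unset.
const logRoot = process.env.NALY_LOG_ROOT ?? join(tmpdir(), "naly-logs");
const envelope = buildEnvelope("example query", 1, "dev");
const envelopePath = persistEnvelope(logRoot, envelope);
console.log(`envelope written to ${envelopePath}`);
```

Persisting first means that even a run that crashes during the model call leaves a record of what it was asked to do.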

Naly invokes Codex CLI from cron in batch (non-interactive) mode, never interactively. Codex is documented as a local coding agent, ships as a standalone open-source CLI with active releases, and can be run from scripts (Codex CLI; OpenAI Codex repo).

2) Why structured output is non-negotiable

OpenAI's structured-output guides describe parser-supported schema extraction and the strict-mode behavior that machine pipelines need. In Naly, the model output is treated as an intermediate artifact, not final truth, so the JSON contract is where reliability is enforced:

required fields (headline, evidence list, confidence, citations, failure reason)
optional fields with defaults
numeric confidence and bounded enums
explicit parser failures surfaced as run errors, not silently auto-corrected text.
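
A validator for such a contract might look like the sketch below. It is hand-written to stay dependency-free; a schema library would do the same job. The concrete field names (evidence, failure_reason, status), the 0 to 1 confidence range, and the two-value status enum are assumptions chosen to mirror the list above, not Naly's real schema.

```typescript
// Hypothetical result contract mirroring the required fields listed above.
interface WorkerResult {
  headline: string;
  evidence: string[];
  confidence: number; // assumed range: 0..1
  citations: string[];
  failure_reason: string | null; // required key; null when the run succeeded
  status: "ok" | "failed"; // hypothetical bounded enum
}

type ValidationResult =
  | { ok: true; value: WorkerResult }
  | { ok: false; errors: string[] };

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// Parses raw model output and reports every violation instead of repairing it.
function validateResult(raw: string): ValidationResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { ok: false, errors: ["top-level value must be an object"] };
  }
  const o = data as Record<string, unknown>;
  const errors: string[] = [];
  if (typeof o.headline !== "string" || o.headline.length === 0) {
    errors.push("headline: required non-empty string");
  }
  if (!isStringArray(o.evidence)) errors.push("evidence: required string array");
  if (
    typeof o.confidence !== "number" ||
    !Number.isFinite(o.confidence) ||
    o.confidence < 0 ||
    o.confidence > 1
  ) {
    errors.push("confidence: required number in [0, 1]");
  }
  if (!isStringArray(o.citations)) errors.push("citations: required string array");
  if (!(typeof o.failure_reason === "string" || o.failure_reason === null)) {
    errors.push("failure_reason: required string or null");
  }
  if (o.status !== "ok" && o.status !== "failed") {
    errors.push('status: must be "ok" or "failed"');
  }
  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: o as unknown as WorkerResult };
}
```

A failed validation returns the full error list, which the runner can store as validation_errors and surface as a run error rather than attempting to patch the text.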

3) Cron-to-agent lifecycle with concurrency control

Cron reads the five time fields of each crontab line (minute, hour, day of month, month, day of week) and launches the line's command whenever the current time matches them (crontab). For production safety Naly adds:

lock guard (single active run per task)
idempotent run key
bounded retry policy with jitter
external log capture for every phase
post-run state update in database tables managed by Drizzle/Neon.
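
Two of these guards, the idempotent run key and the bounded retry policy with jitter, can be sketched in a few lines. The helper names (runKey, backoffMs, withRetry), the minute-level key granularity, and the default base delay, cap, and attempt count are all assumptions for illustration; the backoff shown is the common "full jitter" variant of exponential backoff.

```typescript
import { createHash } from "node:crypto";

// Same task + same scheduled minute (UTC) => same key, so a re-fired cron
// line or a retry maps onto the same logical run.
function runKey(task: string, scheduledAt: Date): string {
  const minuteWindow = scheduledAt.toISOString().slice(0, 16); // e.g. 2026-06-25T10:00
  return createHash("sha256").update(`${task}|${minuteWindow}`).digest("hex").slice(0, 16);
}

// Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2^(attempt-1))).
function backoffMs(
  attempt: number,
  baseMs = 1000,
  capMs = 60000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
  return Math.floor(rand() * ceiling);
}

// Runs fn up to maxAttempts times, sleeping with jitter between failures,
// and rethrows the last error once the budget is exhausted.
async function withRetry<T>(
  fn: (attempt: number) => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) await sleep(backoffMs(attempt));
    }
  }
  throw lastError;
}
```

Passing the random source and the sleep function as parameters keeps both helpers deterministic under test.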

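flock is designed for exactly this guardrail pattern: acquire a lock, run the critical section, and, when invoked in non-blocking mode (-n), exit immediately if the lock is already held (flock). Because the lock is tied to an open file descriptor, the kernel releases it when the holding process exits, so a crashed run leaves no stale lock, and an overlapping cron window is refused outright instead of corrupting state.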
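
Where a worker needs the same single-active-run guard inside the tsx script itself, an exclusive-create lock file is one possible stand-in. Note that this is a different technique from flock: the lock here is the existence of a file, so it is not released automatically if the process is killed, and a real deployment would need stale-lock handling or would simply wrap the script in flock. The tryAcquireLock helper and the lock file name are hypothetical.

```typescript
import { closeSync, openSync, unlinkSync, writeSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Tries to take the lock by creating the file exclusively.
// Returns a release function, or null if another run already holds the lock.
function tryAcquireLock(lockPath: string): (() => void) | null {
  let fd: number;
  try {
    fd = openSync(lockPath, "wx"); // "wx" fails with EEXIST if the file exists
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return null;
    throw err;
  }
  writeSync(fd, String(process.pid)); // record the holder for post-mortems
  closeSync(fd);
  return () => unlinkSync(lockPath);
}

// Hypothetical lock location for an example task.
const lockPath = join(tmpdir(), "naly-example-task.lock");
const release = tryAcquireLock(lockPath);
if (release === null) {
  console.log("another run is active; skipping this window");
} else {
  try {
    // Critical section: the bounded worker task would run here.
    console.log("lock acquired; running task");
  } finally {
    release();
  }
}
```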

4) Why MCP matters in this pattern

Model Context Protocol (MCP) formalizes host/client/server tool contracts using JSON-RPC, capability negotiation, and structured tool calls (MCP). In Naly, MCP-style boundaries reduce implicit coupling: web search can be represented as a controlled tool interface with explicit capabilities instead of free-form shell-level behavior.
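
As a rough illustration of what a controlled tool interface means in practice, the sketch below builds an MCP-style tools/call request (a JSON-RPC 2.0 message whose params carry a tool name and an arguments object) and refuses any tool that is not on an allowlist. The web_search tool name, the allowlist, and the buildToolCall helper are assumptions for the example; a real integration would rely on an MCP client library and on the capabilities the server actually advertises.

```typescript
// Shape of an MCP tool invocation: a JSON-RPC 2.0 request with method "tools/call".
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical allowlist: the only tool this worker is permitted to call.
const ALLOWED_TOOLS = new Set<string>(["web_search"]);

let nextRequestId = 1;

// Builds a request only for allowlisted tools; anything else is rejected
// before it ever reaches a server.
function buildToolCall(name: string, args: Record<string, unknown>): ToolCallRequest {
  if (!ALLOWED_TOOLS.has(name)) {
    throw new Error(`tool not allowed: ${name}`);
  }
  return {
    jsonrpc: "2.0",
    id: nextRequestId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

const request = buildToolCall("web_search", { query: "example query" });
console.log(JSON.stringify(request));
```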

What the literature says

Recent research shows that reliability is not equivalent to raw capability. The AI Agent Reliability paper reports substantial gaps between task accuracy and consistency across runs, and proposes explicit reliability dimensions (consistency, robustness, predictability, safety) for operational evaluation (Towards a Science of AI Agent Reliability). This supports Naly’s run-state-first design: if a run succeeds but cannot be repeated with clear artifacts, it is not production-grade.

For structured outputs, ToolPRM argues that structured tool-calling behavior needs explicit supervision and that improvements are especially strong when modeling the internal function-calling process rather than only final outcomes (ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling). That aligns with Naly’s schema-first runner loop: the quality gate is at interface boundaries, not just content fluency.

A third paper, SLOT, takes a practical alternative path: it adds a model-agnostic output-shaping layer on top of LLMs (SLOT: Structuring the Output of Large Language Models). It reinforces the same principle: reliable structure is an engineering problem even when base model quality is strong.

Design trade-offs

Strict schema vs. model portability: strict schemas reduce downstream ambiguity but can cause occasional parse failures when providers interpret constraints differently.
Cron simplicity vs. queue elasticity: cron is simple and visible, but bursty workloads need lock-aware backoff or a queue layer to avoid overlapping runs and missed windows.
In-task web search vs. deterministic replay: fresh web search improves freshness but introduces source volatility; therefore, Naly stores query text, source list, and raw references in run artifacts.
External logs vs. DB-only logs: filesystem logs survive process restarts and are cheap to ingest; DB logs are easier to query but can leave gaps when the database itself is unavailable, and they need careful partitioning.

Failure modes

Schema drift: provider output violates schema; mitigation is strict validation fail-fast, schema version pinning, and dead-letter runs.
Overlapping cron executions: double writes or duplicate external actions; mitigation is advisory lock + process-exit guard.
Search instability: upstream tool response changes across attempts; mitigation is retry caps, exponential backoff, and persisted upstream references.
Timing surprises: DST and host timezone differences can affect schedule semantics; mitigation is UTC scheduling policy and explicit environment checks.
Silent partial writes: parse succeeds but downstream write fails; mitigation is transactional persistence ordering and idempotent upserts.
Security/context leakage: any tool-capable agent path can overreach; mitigation is MCP-like least-privilege boundaries and explicit consent and authorization assumptions, especially where tool execution touches network resources (MCP security notes).
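
The idempotent-upsert mitigation can be illustrated without a database. In the sketch below an in-memory Map stands in for a table keyed by run key; the ArtifactRow shape and the ArtifactStore class are hypothetical, and a real implementation would use a unique key and an upsert statement in the Drizzle/Neon layer inside a transaction.

```typescript
// Hypothetical artifact row, keyed by the idempotent run key.
interface ArtifactRow {
  run_key: string;
  headline: string;
  attempt: number;
}

// In-memory stand-in for a table with a unique constraint on run_key.
class ArtifactStore {
  private rows = new Map<string, ArtifactRow>();

  // Insert the row, or replace the existing one when the incoming attempt is
  // at least as recent. Replaying the same write leaves the store unchanged.
  upsert(row: ArtifactRow): void {
    const existing = this.rows.get(row.run_key);
    if (existing === undefined || row.attempt >= existing.attempt) {
      this.rows.set(row.run_key, row);
    }
  }

  get(runKey: string): ArtifactRow | undefined {
    return this.rows.get(runKey);
  }

  get size(): number {
    return this.rows.size;
  }
}

const store = new ArtifactStore();
const row: ArtifactRow = { run_key: "abc123", headline: "example", attempt: 1 };
store.upsert(row);
store.upsert(row); // a retried write after a partial failure: no duplicate
console.log(`rows stored: ${store.size}`);
```

Because a repeated write is harmless, the runner can retry the persistence step after a downstream failure without first checking what was already written.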

Implementation notes

Keep one worker command per business task in scripts/ and let cron call only that entrypoint.
Store the schema file hash in run metadata so parser expectations are auditable after a model upgrade.
Use set -euo pipefail (or equivalent fail-fast behavior) in wrapper scripts and always include run IDs in log names.
Write three checkpoints to logs: started, codex_result_parsed, artifact_persisted.
Use NALY_LOG_ROOT so runtime traces never pollute repository state and survive process restarts.
Persist attempt, exit_code, retry_reason, and validation_errors to allow GST and audit dashboards to separate flaky infra from genuine model regressions.
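
Two of these notes, the schema hash and the three checkpoints, might be wired together as in the following sketch. The checkpoint names and NALY_LOG_ROOT come from the notes above; the schemaHash, logCheckpoint, and readCheckpoints helpers, the JSON Lines log format, the run-<id>.jsonl file name, and the sample schema text are assumptions.

```typescript
import { createHash } from "node:crypto";
import { appendFileSync, mkdirSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type Checkpoint = "started" | "codex_result_parsed" | "artifact_persisted";

// SHA-256 of the schema text, stored in run metadata so the parser's
// expectations can be audited later.
function schemaHash(schemaText: string): string {
  return createHash("sha256").update(schemaText, "utf8").digest("hex");
}

// Appends one JSON line per checkpoint to <logRoot>/run-<runId>.jsonl.
function logCheckpoint(
  logRoot: string,
  runId: string,
  checkpoint: Checkpoint,
  extra: Record<string, unknown> = {},
): string {
  mkdirSync(logRoot, { recursive: true });
  const file = join(logRoot, `run-${runId}.jsonl`);
  const line = JSON.stringify({ ts: new Date().toISOString(), run_id: runId, checkpoint, ...extra });
  appendFileSync(file, line + "\n");
  return file;
}

// Returns the checkpoint names recorded for a run, in order.
function readCheckpoints(file: string): string[] {
  return readFileSync(file, "utf8")
    .split("\n")
    .filter((l) => l.length > 0)
    .map((l) => String((JSON.parse(l) as { checkpoint: unknown }).checkpoint));
}

// Assumption: fall back to a temp directory when NALY_LOG_ROOT is unset.
const checkpointLogRoot = process.env.NALY_LOG_ROOT ?? join(tmpdir(), "naly-logs");
const exampleRunId = `example-${Date.now()}`;
const exampleSchema = '{"type":"object","required":["headline"]}'; // placeholder schema text

logCheckpoint(checkpointLogRoot, exampleRunId, "started", { schema_sha256: schemaHash(exampleSchema) });
logCheckpoint(checkpointLogRoot, exampleRunId, "codex_result_parsed");
const runLog = logCheckpoint(checkpointLogRoot, exampleRunId, "artifact_persisted");
console.log(readCheckpoints(runLog).join(" -> "));
```

A run whose log stops at started or codex_result_parsed shows at a glance which phase failed, without needing the process's stdout.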