Abstract
Naly uses Codex CLI as the control plane for scheduled AI work: each cron entry runs a checked-in script that launches a bounded assistant task with web search and a strict output contract, then writes validated JSON artifacts for downstream publishing, replay, and retries. This makes model work behave like an operational pipeline—same command, explicit schema, explicit artifacts—rather than a mutable interactive workflow. As of 2026-06-25, this distinction is the reliability edge for growth-critical content infrastructure.
Where it sits in Naly
Naly already has active GST priorities around discovery, retention, and distribution workflows. This pattern fits that execution layer directly:
- Cron jobs invoke a single `tsx` script per use case so every run stays inside versioned code, not ad hoc shell snippets.
- Codex CLI performs the reasoning and retrieval work, while Naly handles orchestration control: schedule, retries, locking, and durable artifacts.
- Structured output feeds downstream systems built on Next.js, React, Drizzle ORM, and Neon without downstream schema guessing.
- External logs and run metadata are written under `NALY_LOG_ROOT`, making post-mortems reproducible independent of process output buffering.
The thesis is: for production use, LLM quality is only half the system; deterministic boundaries around that LLM are the other half.
Technical mechanism
1) Contract-first worker entrypoint
Each task starts from three immutable inputs:
- A codified prompt template and task intent.
- A response schema that must be validated.
- A run envelope (`run_id`, `source_query`, `attempt`, `env`), persisted before the API call.
Naly invokes Codex CLI in batch mode from cron, not interactive mode. Codex is documented as a local coding agent, is distributed as an open-source standalone CLI with active releases, and can be run in a scripted environment (Codex CLI; OpenAI Codex repo).
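The run envelope described above can be sketched as follows. This is a minimal illustration, not Naly's actual code: the field names come from the list above, but the value types, the `persistEnvelope` helper, and the one-file-per-attempt layout are assumptions. A temporary directory stands in for the log root.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Field names follow the run envelope above; the value types are assumptions.
interface RunEnvelope {
  run_id: string;
  source_query: string;
  attempt: number;
  env: string;
}

// Hypothetical helper: write the envelope to
// <logRoot>/<run_id>/envelope.attempt-<n>.json before any model call is made.
function persistEnvelope(logRoot: string, envelope: RunEnvelope): string {
  const runDir = join(logRoot, envelope.run_id);
  mkdirSync(runDir, { recursive: true });
  const path = join(runDir, `envelope.attempt-${envelope.attempt}.json`);
  // "wx" fails if the file exists, so a repeated attempt number cannot overwrite history.
  writeFileSync(path, JSON.stringify(envelope, null, 2), { flag: "wx" });
  return path;
}

// Usage: a temporary directory stands in for the real log root.
const demoRoot = mkdtempSync(join(tmpdir(), "naly-envelope-"));
const envelope: RunEnvelope = {
  run_id: "run-001",
  source_query: "example query",
  attempt: 1,
  env: "staging",
};
const envelopePath = persistEnvelope(demoRoot, envelope);
const roundTrip = JSON.parse(readFileSync(envelopePath, "utf8")) as RunEnvelope;
```

Writing the envelope first means a run that crashes during the model call still leaves a record of what was attempted.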
2) Why structured output is non-negotiable
OpenAI structured-output guides describe parser-supported schema extraction and the strict mode behavior needed for machine pipelines. In Naly, the model output is treated as an intermediate artifact, not final truth, so the JSON contract is where reliability is enforced:
- required fields (headline, evidence list, confidence, citations, failure reason)
- optional fields with defaults
- numeric confidence and bounded enums
- explicit parser failures surfaced as run errors, not silently auto-corrected text.
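A fail-fast validator along these lines might look like the sketch below. The exact schema is not given in the text, so every field name, the `[0, 1]` confidence range, the `status` enum, and the `language` default are illustrative assumptions. It is hand-rolled to stay dependency-free; a schema library would serve the same role.

```typescript
// Hypothetical contract: the categories come from the list above, the names and bounds do not.
type Status = "ok" | "failed";

interface TaskOutput {
  headline: string;
  evidence: string[];
  confidence: number; // assumed range: [0, 1]
  citations: string[];
  failure_reason: string | null;
  status: Status; // example of a bounded enum
  language: string; // optional in raw output, defaults to "en"
}

type ParseResult =
  | { ok: true; value: TaskOutput }
  | { ok: false; errors: string[] };

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// Parse and validate raw model output; never repair it, only accept or reject.
function parseTaskOutput(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { ok: false, errors: ["output is not a JSON object"] };
  }
  const o = data as Record<string, unknown>;
  const errors: string[] = [];
  if (typeof o.headline !== "string" || o.headline.length === 0) {
    errors.push("headline: required non-empty string");
  }
  if (!isStringArray(o.evidence)) errors.push("evidence: required string array");
  if (
    typeof o.confidence !== "number" ||
    !Number.isFinite(o.confidence) ||
    o.confidence < 0 ||
    o.confidence > 1
  ) {
    errors.push("confidence: required number in [0, 1]");
  }
  if (!isStringArray(o.citations)) errors.push("citations: required string array");
  if (typeof o.failure_reason !== "string" && o.failure_reason !== null) {
    errors.push("failure_reason: required string or null");
  }
  if (o.status !== "ok" && o.status !== "failed") {
    errors.push('status: must be "ok" or "failed"');
  }
  if (o.language !== undefined && typeof o.language !== "string") {
    errors.push("language: must be a string when present");
  }
  if (errors.length > 0) return { ok: false, errors };
  return {
    ok: true,
    value: {
      headline: o.headline as string,
      evidence: o.evidence as string[],
      confidence: o.confidence as number,
      citations: o.citations as string[],
      failure_reason: o.failure_reason as string | null,
      status: o.status as Status,
      language: (o.language as string | undefined) ?? "en", // apply the default
    },
  };
}
```

Returning a list of errors rather than throwing lets the runner persist every violation with the run record before marking the run as failed.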
3) Cron-to-agent lifecycle with concurrency control
Cron compares each entry's five standard time-and-date fields against the current time and launches the entry's command when they match (crontab(5)). For production safety, Naly adds:
- lock guard (single active run per task)
- idempotent run key
- bounded retry policy with jitter
- external log capture for every phase
- post-run state update in database tables managed by Drizzle/Neon.
flock is designed for exactly this guardrail pattern: acquire a lock, run the critical section, and, in non-blocking mode, exit immediately when the lock is already held (flock(1)). Because the lock is tied to an open file descriptor, it is released when the holding process exits, even after a crash, so overlapping cron windows are refused instead of corrupting state.
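When the guard has to live inside the TypeScript worker rather than the cron line, a similar single-run check can be approximated as below. This sketch swaps in a different technique, exclusive lock-file creation, and is not flock: a lock file left behind by a crashed process is not released automatically and would need separate stale-lock handling. The function names and lock path are hypothetical.

```typescript
import { closeSync, mkdtempSync, openSync, unlinkSync, writeSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Returns a release function when the lock was acquired, or null when it is already held.
function tryAcquireLock(lockPath: string): (() => void) | null {
  let fd: number;
  try {
    // "wx" creates the file exclusively and fails with EEXIST if it already exists.
    fd = openSync(lockPath, "wx");
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") return null;
    throw err;
  }
  writeSync(fd, String(process.pid)); // record the holder for post-mortems
  closeSync(fd);
  return () => unlinkSync(lockPath);
}

// Run the task only if no other run holds the lock; always release afterwards.
function runExclusive<T>(
  lockPath: string,
  task: () => T,
): { ran: true; result: T } | { ran: false } {
  const release = tryAcquireLock(lockPath);
  if (release === null) return { ran: false };
  try {
    return { ran: true, result: task() };
  } finally {
    release();
  }
}

// Usage: the nested call simulates an overlapping cron window and is refused.
const lockPath = join(mkdtempSync(join(tmpdir(), "naly-lock-")), "task.lock");
const outer = runExclusive(lockPath, () => runExclusive(lockPath, () => "inner"));
const again = runExclusive(lockPath, () => "second");
```

The refused run returns `{ ran: false }` instead of throwing, so the wrapper can log the skip and exit with a code that does not trigger a retry.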
4) Why MCP matters in this pattern
Model Context Protocol formalizes host/client/server tool contracts using JSON-RPC, capability negotiation, and structured tool calls (MCP). In Naly, MCP-style boundaries reduce implicit coupling: web search can be represented as a controlled tool interface with explicit capabilities instead of free-form shell-level behavior.
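To make the boundary concrete, the sketch below builds an MCP-style tool invocation as a JSON-RPC 2.0 `tools/call` request and rejects any tool that is not on an explicit allowlist. The tool name `web_search`, its arguments, and the allowlist itself are hypothetical; only the request shape follows the MCP specification.

```typescript
// JSON-RPC 2.0 request shape used by MCP for tool invocation.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical allowlist: the only tools this worker is permitted to call.
const ALLOWED_TOOLS: ReadonlySet<string> = new Set(["web_search"]);

// Build a request, refusing tools outside the allowlist before anything is sent.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  if (!ALLOWED_TOOLS.has(name)) {
    throw new Error(`tool not allowed: ${name}`);
  }
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Usage: an allowed call serializes to a plain JSON-RPC message.
const request = buildToolCall(1, "web_search", { query: "example query" });
const wire = JSON.stringify(request);
```

Keeping the allowlist in checked-in code means a prompt change cannot widen the set of tools a scheduled run may reach.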
What the literature says
Recent research shows reliability is not equivalent to raw capability. One paper reports substantial gaps between task accuracy and consistency across runs, and proposes explicit reliability dimensions (consistency, robustness, predictability, safety) for operational evaluation (Towards a Science of AI Agent Reliability). This supports Naly’s run-state-first design: if a run succeeds but cannot be repeated with clear artifacts, it is not production-grade.
For structured outputs, ToolPRM argues that structured tool-calling behavior needs explicit supervision and that improvements are especially strong when modeling the internal function-calling process rather than only final outcomes (ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling). That aligns with Naly’s schema-first runner loop: the quality gate is at interface boundaries, not just content fluency.
A third paper on the same frontier, SLOT, shows a practical alternative path by adding a model-agnostic output-shaping layer on top of LLMs (SLOT: Structuring the Output of Large Language Models). It reinforces the same principle: structural reliability is an engineering problem even when base model quality is strong.
Design trade-offs
- Strict schema vs. model portability: strict schemas reduce downstream ambiguity but can increase occasional parse churn when providers interpret constraints differently.
- Cron simplicity vs. queue elasticity: cron is simple and visible, but bursty workloads need lock-aware backoff or a queue layer to avoid overlapping runs and missed windows.
- In-task web search vs. deterministic replay: fresh web search improves freshness but introduces source volatility; therefore, Naly stores query text, source list, and raw references in run artifacts.
- External logs vs. DB-only logs: filesystem logs survive process restarts and are cheap to ingest; DB logs are easier to query but can fail together with the database and grow quickly if not carefully partitioned.
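The replay side of the web-search trade-off can be sketched as a record-then-replay wrapper. The artifact shape mirrors the three items named above (query text, source list, raw references); the `searchOrReplay` helper and the in-memory store keyed by run ID are assumptions standing in for whatever artifact storage Naly actually uses.

```typescript
// Hypothetical artifact shape based on the items stored in run artifacts.
interface SearchArtifact {
  query: string;
  sources: string[];
  raw_references: string[];
}

// In-memory stand-in for the durable artifact store, keyed by run ID.
type ArtifactStore = Map<string, SearchArtifact>;

type LiveSearch = (query: string) => { sources: string[]; raw_references: string[] };

// First call for a run ID performs the live search and records it;
// later calls for the same run ID replay the recorded artifact unchanged.
function searchOrReplay(
  store: ArtifactStore,
  runId: string,
  query: string,
  liveSearch: LiveSearch,
): SearchArtifact {
  const recorded = store.get(runId);
  if (recorded !== undefined) return recorded;
  const fresh: SearchArtifact = { query, ...liveSearch(query) };
  store.set(runId, fresh);
  return fresh;
}

// Usage: a fake search function counts how often the "network" is hit.
let liveCalls = 0;
const fakeSearch: LiveSearch = (q) => {
  liveCalls += 1;
  return { sources: [`https://example.com/?q=${encodeURIComponent(q)}`], raw_references: ["ref-1"] };
};
const store: ArtifactStore = new Map();
const first = searchOrReplay(store, "run-001", "example query", fakeSearch);
const replayed = searchOrReplay(store, "run-001", "example query", fakeSearch);
```

Replaying from the recorded artifact keeps a rerun of the same run ID deterministic even when the live search results have changed.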
Failure modes
- Schema drift: provider output violates schema; mitigation is strict validation fail-fast, schema version pinning, and dead-letter runs.
- Overlapping cron executions: double writes or duplicate external actions; mitigation is advisory lock + process-exit guard.
- Search instability: upstream tool response changes across attempts; mitigation is retry caps, exponential delay, and persisted upstream references.
- Timing surprises: DST and host timezone differences can affect schedule semantics; mitigation is UTC scheduling policy and explicit environment checks.
- Silent partial writes: parse succeeds but downstream write fails; mitigation is transactional persistence ordering and idempotent upserts.
- Security/context leakage: any tool-capable agent path can overreach, so MCP-like least-privilege boundaries and explicit consent/auth assumptions are necessary, especially where tool execution paths touch network resources (MCP security notes).
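The retry caps and exponential delay mentioned above can be sketched with a capped, jittered backoff. This uses the "full jitter" variant, where each delay is drawn uniformly below an exponentially growing ceiling; the text does not say which variant Naly uses, so the formula, function names, and parameters here are assumptions.

```typescript
// Full-jitter backoff: delay is uniform in [0, min(capMs, baseMs * 2^attempt)).
// `random` is injectable so the schedule can be made deterministic in tests.
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Plan the delays for a bounded number of retries (the retry cap).
function planRetryDelays(
  maxRetries: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) =>
    backoffDelayMs(attempt, baseMs, capMs, random),
  );
}

// Usage: with a fixed "random" value of 0.5 the schedule is reproducible.
const delays = planRetryDelays(3, 1000, 3000, () => 0.5);
```

With a base of 1000 ms, a cap of 3000 ms, and the fixed value 0.5, the ceilings are 1000, 2000, and 3000 ms, so the planned delays are 500, 1000, and 1500 ms.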
Implementation notes
- Keep one worker command per business task in `scripts/` and let cron call only that entrypoint.
- Store the schema file hash in run metadata so parser expectations are auditable after a model upgrade.
- Set `set -euo pipefail`-style behavior in wrapper scripts and always include run IDs in log names.
- Write three checkpoints to logs: `started`, `codex_result_parsed`, `artifact_persisted`.
- Use `NALY_LOG_ROOT` so runtime traces never pollute repository state and survive restart contexts.
- Persist `attempt`, `exit_code`, `retry_reason`, and `validation_errors` to allow GST and audit dashboards to separate flaky infra from genuine model regressions.
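Two of these notes, the schema hash and the three checkpoints, can be sketched together. The checkpoint names come from the list above; the SHA-256 choice, the JSON Lines format, the `<run_id>.log.jsonl` file name, and the helper names are assumptions. A temporary directory stands in for the log root.

```typescript
import { createHash } from "node:crypto";
import { appendFileSync, mkdtempSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hash the schema text so run metadata records exactly which contract was enforced.
function schemaHash(schemaText: string): string {
  return createHash("sha256").update(schemaText, "utf8").digest("hex");
}

type Checkpoint = "started" | "codex_result_parsed" | "artifact_persisted";

// Hypothetical naming scheme: the run ID is part of the log file name.
function logPathFor(logRoot: string, runId: string): string {
  return join(logRoot, `${runId}.log.jsonl`);
}

// Append one JSON line per checkpoint; appending keeps earlier phases intact.
function writeCheckpoint(
  logRoot: string,
  runId: string,
  checkpoint: Checkpoint,
  extra: Record<string, unknown> = {},
): void {
  const line = JSON.stringify({
    ts: new Date().toISOString(),
    run_id: runId,
    checkpoint,
    ...extra,
  });
  appendFileSync(logPathFor(logRoot, runId), line + "\n");
}

// Usage: a temporary directory stands in for the real log root.
const demoLogRoot = mkdtempSync(join(tmpdir(), "naly-logs-"));
const demoSchemaHash = schemaHash('{"type":"object"}');
writeCheckpoint(demoLogRoot, "run-001", "started", { schema_sha256: demoSchemaHash });
writeCheckpoint(demoLogRoot, "run-001", "codex_result_parsed");
writeCheckpoint(demoLogRoot, "run-001", "artifact_persisted");
const logged = readFileSync(logPathFor(demoLogRoot, "run-001"), "utf8")
  .trim()
  .split("\n")
  .map((l) => JSON.parse(l) as { checkpoint: Checkpoint; schema_sha256?: string });
```

A run whose log stops at `started` or `codex_result_parsed` shows which phase failed without consulting process output.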
References
- Codex CLI | OpenAI Developers
- OpenAI Codex repository
- Structured Outputs | OpenAI API
- Model Context Protocol specification
- crontab(5) manual page
- flock(1) manual page
- Towards a Science of AI Agent Reliability
- ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
- SLOT: Structuring the Output of Large Language Models