Naly Engineering Notes: Machine Cron Locks and Observable Publishing Pipelines

Abstract

TL;DR: Naly uses machine cron as a small but deliberate scheduler: timestamped wrappers launch publishing and distribution jobs, flock prevents overlapping runs, stripped-runtime bootstrapping makes the environment explicit, and external logs plus deterministic artifacts turn every execution into evidence. The thesis is that simple host-level automation can be production-grade when concurrency, replayability, and observability are designed as first-class outputs rather than shell afterthoughts.

Machine cron is not a workflow engine. It does not know whether an article was published, a blob was uploaded, a database write was idempotent, or a downstream notification was safe to send. Its job is narrower: wake up at a predictable time and run a command. Naly's design keeps that contract small and builds the reliability layer around it.

The useful pattern is schedule -> locked wrapper -> explicit runtime -> observable artifact. Cron supplies the clock. flock supplies single-run protection on one host. The wrapper supplies environment loading, mode selection, logging, and exit-code discipline. The application script supplies domain behavior. The artifact directory supplies the audit trail.

Where it sits in Naly

Naly's daily publishing pipeline is part of the user-growth system: it supports recurring articles, distribution checks, and smoke-mode verification for work that should create acquisition or retention value. The schedule itself is intentionally outside the Next.js request path. A page render should not be responsible for deciding that today's publishing job exists.

At a high level, the pipeline has five boundaries:

1. The crontab entry contains the schedule and names one wrapper.
2. The wrapper creates a run id, chooses full or smoke mode, and binds log and artifact locations.
3. flock guards the critical section so a slow run cannot overlap the next scheduled slot.
4. The TypeScript runtime executes the checked-in job with explicit environment loading.
5. The job writes deterministic artifacts, status, and logs outside the repository runtime tree.

The external log-root choice matters. Naly keeps runtime logs outside the repo, with NALY_LOG_ROOT=/tmp/logs by default and /data/logs for persistent environments. That preserves the repository as source and durable project memory, while logs live in an operational namespace designed for rotation, retention, and inspection.
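
A minimal sketch of how a job might resolve the log root, assuming the environment is passed in as a plain map. The helper name resolveLogRoot, the repoRoot parameter, and the rule that rejects a log root inside the repository are illustrative assumptions, not Naly's actual code; only NALY_LOG_ROOT and the /tmp/logs default come from the text above.

```typescript
// Hypothetical helper: none of these names come from Naly's codebase.
type Env = Record<string, string | undefined>;

const DEFAULT_LOG_ROOT = "/tmp/logs"; // default stated in the notes

function resolveLogRoot(env: Env, repoRoot: string): string {
  const raw = (env["NALY_LOG_ROOT"] ?? "").trim();
  const root = raw.length > 0 ? raw : DEFAULT_LOG_ROOT;
  if (!root.startsWith("/")) {
    throw new Error(`NALY_LOG_ROOT must be an absolute path, got "${root}"`);
  }
  // Assumed policy: refuse a log root inside the repository tree.
  const repo = repoRoot.replace(/\/+$/, "");
  if (root === repo || root.startsWith(repo + "/")) {
    throw new Error(`log root "${root}" is inside the repository "${repo}"`);
  }
  return root;
}
```

A caller would typically pass process.env and the checkout path, so a persistent host only needs NALY_LOG_ROOT=/data/logs set in the wrapper.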

The deterministic artifact directory is the second half of observability. A log line says what happened; an artifact path proves what output was produced. For a daily article job, the artifact directory should be keyed by job name, date label, schedule slot, and run id, then contain start metadata, final metadata, stdout/stderr, content outputs, smoke outputs, and any publish identifiers.
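
The keying described above can be sketched as a pure path builder. The RunKey shape, the artifactDir name, the segment whitelist, and the example root are assumptions made for illustration; the key order (job, date label, schedule slot, run id) follows the text.

```typescript
// Illustrative sketch; field names and the validation rules are assumptions.
interface RunKey {
  job: string;
  dateLabel: string; // YYYY-MM-DD
  scheduleSlot: string;
  runId: string;
}

const SAFE_SEGMENT = /^[A-Za-z0-9][A-Za-z0-9._-]*$/;

function artifactDir(root: string, key: RunKey): string {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(key.dateLabel)) {
    throw new Error(`date label must be YYYY-MM-DD, got "${key.dateLabel}"`);
  }
  const segments = [key.job, key.dateLabel, key.scheduleSlot, key.runId];
  for (const segment of segments) {
    // Reject empty segments, slashes, and leading dots so a key cannot escape the root.
    if (!SAFE_SEGMENT.test(segment)) {
      throw new Error(`unsafe path segment: ${JSON.stringify(segment)}`);
    }
  }
  return [root.replace(/\/+$/, ""), ...segments].join("/");
}
```

Because the function has no clock or randomness of its own, the same key always maps to the same directory, which is what makes two runs comparable.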

Technical mechanism

The Linux crontab(5) contract is direct: a crontab contains instructions for the cron daemon to run a command at a matching time. The manual also documents details that matter in production: cron sets only a sparse environment, including SHELL, HOME, and LOGNAME; CRON_TZ can define the time zone used to interpret the schedule; an unescaped percent character in a command is turned into a newline, and everything after the first one is sent to the command as standard input; daylight-saving transitions can skip or duplicate matching jobs; and a crontab whose last line lacks a trailing newline can be treated as broken.
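
Two of these hazards are mechanical enough to lint before a crontab is installed. The following is a rough sketch, not a crontab parser: the function name crontabHazards and its message strings are invented here, and it only checks for a missing final newline and for percent characters that are not escaped with a backslash.

```typescript
// Hypothetical pre-install lint; it does not validate schedule fields.
function crontabHazards(text: string): string[] {
  const hazards: string[] = [];
  if (text.length > 0 && !text.endsWith("\n")) {
    hazards.push("missing final newline");
  }
  text.split("\n").forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) return; // blank or comment
    if (/^[A-Za-z_][A-Za-z0-9_]*\s*=/.test(trimmed)) return; // NAME=value setting
    // A % not preceded by a backslash is special to cron.
    if (/(^|[^\\])%/.test(trimmed)) {
      hazards.push(`line ${index + 1}: unescaped % in command`);
    }
  });
  return hazards;
}
```

A crontab that only names a wrapper, with no date formatting or inline shell in the command field, passes this check trivially, which is part of the argument for keeping the entry boring.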

That is why Naly treats cron lines as narrow launchers rather than application logic. The command portion should be boring: point at a wrapper, do no inline TypeScript, do no fragile quoting gymnastics, and leave application behavior to checked-in scripts.

A useful mental model is:

cron tick
  -> wrapper starts with sparse runtime
  -> run_id and artifact_dir are assigned
  -> log files are opened under NALY_LOG_ROOT
  -> local file lock is acquired
  -> environment is loaded explicitly
  -> checked-in TypeScript job runs
  -> manifest, status, outputs, and exit code are finalized
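
The ordering above can be sketched as a small runner that stops at the first failing step but always finalizes. The Step and RunResult types and the runLifecycle name are assumptions for illustration, and the real wrapper may well be a shell script rather than TypeScript.

```typescript
// Illustrative only: step bodies are injected so the ordering can be tested.
type StepName = "assign-run" | "open-logs" | "acquire-lock" | "load-env" | "run-job";

interface Step {
  name: StepName;
  run: () => void; // throws on failure
}

interface RunResult {
  status: "ok" | "failed";
  failedStep?: StepName;
  completed: StepName[];
  exitCode: number;
}

function runLifecycle(steps: Step[], finalize: (result: RunResult) => void): RunResult {
  const completed: StepName[] = [];
  let result: RunResult = { status: "ok", completed, exitCode: 0 };
  for (const step of steps) {
    try {
      step.run();
      completed.push(step.name);
    } catch {
      // Stop at the first failure; later steps must not run.
      result = { status: "failed", failedStep: step.name, completed, exitCode: 1 };
      break;
    }
  }
  finalize(result); // status is written whether the run succeeded or not
  return result;
}
```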

flock(1) is the concurrency primitive. Its manual describes a command-line tool that manages file locks from shell scripts, wrapping execution of another command. It supports exclusive locks by default, nonblocking acquisition with -n, bounded waiting with -w, conflict exit codes with -E, and child exit-code propagation when the wrapped command executes. Those details are enough to encode policy: skip, wait, or fail visibly.
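
As a sketch of how the skip and wait policies map onto those flags, the function below builds an argument list for flock. The names LockPolicy and flockArgs, the lock path in the usage note, and the conflict exit code 75 are assumptions; only the -n, -w, and -E options come from the manual.

```typescript
// Hypothetical argv builder for `flock [options] <lockfile> <command> [args...]`.
type LockPolicy =
  | { kind: "skip" } // give up immediately if the lock is held
  | { kind: "wait"; seconds: number }; // wait up to N seconds, then give up

const CONFLICT_EXIT_CODE = 75; // arbitrary value chosen for this sketch

function flockArgs(lockFile: string, policy: LockPolicy, command: string[]): string[] {
  if (command.length === 0) {
    throw new Error("a command to run under the lock is required");
  }
  const acquire = policy.kind === "skip" ? ["-n"] : ["-w", String(policy.seconds)];
  // -E makes a lock conflict distinguishable from the default exit status of 1.
  return [...acquire, "-E", String(CONFLICT_EXIT_CODE), lockFile, ...command];
}
```

For example, flockArgs("/var/lock/naly/daily-article.lock", { kind: "wait", seconds: 30 }, ["/opt/naly/bin/run.sh"]) yields the arguments for a bounded wait. One caveat of this design: if the wrapped command can itself exit with the chosen conflict code, the two cases become indistinguishable, so the code should be one the job never uses.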

For Naly, the lock key should map to the idempotency domain. A daily article publisher and a distribution sender may need separate locks if they can safely run independently. Two article publishers that write the same date-labeled output need the same lock. Lock names should be stable and local to the machine, not stored on NFS or CIFS paths, because the flock manual notes limited behavior on some network filesystems.

Observability then follows the OpenTelemetry shape even when the implementation is lighter than a full collector. OpenTelemetry defines signals as system outputs used to observe underlying activity, including traces, metrics, logs, and baggage. For cron publishing, the trace is the run lifecycle, the metrics are durations and counts, the logs are event records, and the baggage-like context is the run id, mode, schedule slot, artifact directory, and version metadata carried through every step.

What the literature says

Recent arXiv work is blunt about the risk of cron-style automation. Agrawal and Jain's 2026 paper on resilient ELT pipelines reports that ad-hoc ingestion scripts, including cron jobs, produced silent failures and data gaps that eroded trust. Their proposed remedy is heavier DAG orchestration, immutable raw history, and state-based dependency management. Naly does not need all of that machinery for every daily publishing job, but it adopts the core lesson: a scheduled pipeline must leave durable state that makes silence suspicious.

Albuquerque and Correia's 2025 work on tracing and metrics design patterns argues that distributed systems become harder to diagnose as observability fragments. They separate distributed tracing, application metrics, and infrastructure metrics as distinct design patterns. For Naly's cron wrappers, that translates into a practical rule: do not let stdout be the only evidence. A publish run needs a run trace, application-level counters, and host-level context.

AgentTrace is relevant because Naly's publishing pipeline includes AI-assisted components. AlSayyad, Huang, and Pal frame structured logging as a runtime accountability layer for agent systems, capturing operational and contextual behavior so nondeterministic execution can be audited. Naly's version should avoid leaking private reasoning, but it should record prompt class, source set identifiers, model/runtime metadata, safety mode, artifact hashes, and publish decisions.

OpsAgent, revised in May 2026, reinforces the same operational point from incident management: metrics, logs, and traces become more useful when converted into structured, auditable descriptions. That matters for a small cron pipeline too. The goal is not to collect more text; it is to make the next diagnosis faster than reading a terminal transcript.

Design trade-offs

Cron plus file locks is deliberately modest. It has fewer moving parts than a workflow platform, no central scheduler database, no web UI, and no built-in DAG semantics. That is a strength when the job is a single-machine daily publisher with a clear runtime contract. It is a weakness when jobs become distributed, dependency-heavy, or need fine-grained per-task retry policies.

File locks are also local by nature. They are a good fit for one host and one filesystem. They are a poor substitute for database advisory locks, queue leases, or orchestration state if multiple machines can run the same publisher. Naly's current use is host-level automation; if publishing becomes multi-runner, the locking boundary should move into shared durable state.

External logs trade convenience for operational hygiene. Writing logs into the repo makes local debugging feel easy, but it pollutes source control and hides rotation problems. Using /tmp/logs or /data/logs forces the system to declare which logs are disposable and which are persistent.

Smoke mode is another trade-off. A smoke run must be cheap and non-destructive, but it must exercise the same wrapper, lock, environment loading, and artifact code as the full run. If smoke mode bypasses the hard parts, it becomes a placebo.

Deterministic artifacts cost disk space and cleanup work. The payoff is replayability: operators can compare two runs, find the exact generated output, and distinguish a publishing failure from a distribution failure without reconstructing state from memory.

Failure modes

The first failure mode is overlap. A job that usually takes three minutes eventually takes thirty, and the next cron tick starts another copy. flock prevents that only if every entry point that can start the job uses the same lock file, the lock is held across the full critical section, and no background child keeps working after the guarded lifecycle has ended.

The second failure mode is a misleading schedule. Daylight-saving transitions can skip or duplicate jobs. Field-step syntax can be misread: */7 in the minute field fires at minutes 0, 7, and so on up to 56, then again at 0, rather than every seven minutes. Percent characters can alter command stdin. A missing newline can leave a crontab partially broken. The defensive posture is UTC scheduling, minimal cron command text, and wrapper-level schedule-slot recording.
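
Wrapper-level schedule-slot recording might look like the sketch below, which derives a UTC date label and slot from the moment the wrapper starts. The scheduleSlot name and the HHMM-plus-Z slot format are assumptions; the wrapper's start time is also only an approximation of the cron tick, so the seconds are dropped.

```typescript
// Illustrative: derives UTC labels from a Date, independent of the host time zone.
function scheduleSlot(tick: Date): { dateLabel: string; slot: string } {
  const iso = tick.toISOString(); // always UTC, shaped like 2025-03-30T06:00:12.000Z
  return {
    dateLabel: iso.slice(0, 10), // YYYY-MM-DD
    slot: iso.slice(11, 13) + iso.slice(14, 16) + "Z", // HHMM plus a UTC marker
  };
}
```

Because the label comes from UTC fields, a run started during a local daylight-saving transition still lands in exactly one date and slot.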

The third failure mode is sparse runtime drift. Cron's non-interactive shell may not have the same PATH, Node version, package-manager path, secrets, or locale as an interactive session. Naly's stripped-runtime bootstrap makes that explicit: load the required environment in the wrapper, then run checked-in TypeScript scripts through tsx, not inline code.
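
One way to make the environment explicit is to fail early and name every missing variable at once. This is a sketch under assumptions: requireEnv is a hypothetical helper, and the variable names in the usage note are placeholders rather than Naly's real configuration.

```typescript
// Hypothetical helper: returns only the requested variables, or throws listing all gaps.
function requireEnv<K extends string>(
  env: Record<string, string | undefined>,
  names: readonly K[],
): Record<K, string> {
  const missing: K[] = [];
  const resolved = {} as Record<K, string>;
  for (const name of names) {
    const value = env[name];
    if (value === undefined || value === "") {
      missing.push(name);
    } else {
      resolved[name] = value;
    }
  }
  if (missing.length > 0) {
    throw new Error(`missing required environment variables: ${missing.join(", ")}`);
  }
  return resolved;
}
```

A job entry point could call requireEnv(process.env, ["DATABASE_URL", "BLOB_TOKEN"] as const) before touching the database or blob store, so a sparse cron environment fails at startup instead of halfway through a publish.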

The fourth failure mode is silent success. A script can exit zero while producing zero publishable artifacts. The wrapper should treat expected output counts, final manifest presence, and publish identifiers as completion checks. Success is not merely no exception; success is a coherent final state.
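
A completion check along these lines could run after the job exits. The manifest fields and the completionProblems name are assumptions for illustration; the three checks mirror the ones named above: manifest presence, expected output counts, and publish identifiers.

```typescript
// Illustrative manifest shape; the real finished.json may carry more fields.
interface FinishedManifest {
  mode: "full" | "smoke";
  exitCode: number;
  outputCount: number;
  publishIds: string[];
}

// Returns a list of problems; an empty list means the final state is coherent.
function completionProblems(
  manifest: FinishedManifest | undefined,
  expectedOutputs: number,
): string[] {
  if (manifest === undefined) return ["finished manifest is missing"];
  const problems: string[] = [];
  if (manifest.exitCode !== 0) {
    problems.push(`exit code ${manifest.exitCode}`);
  }
  if (manifest.outputCount < expectedOutputs) {
    problems.push(`expected at least ${expectedOutputs} outputs, found ${manifest.outputCount}`);
  }
  // Assumed rule: only full runs are required to produce publish identifiers.
  if (manifest.mode === "full" && manifest.publishIds.length === 0) {
    problems.push("full run produced no publish identifiers");
  }
  return problems;
}
```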

The fifth failure mode is partial publish. A database row can exist without a blob, a blob can exist without a public article, or a distribution message can reference an unpublished URL. Deterministic manifests help by separating prepared, committed, published, and distributed states.
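
The four states can be enforced as a strict forward sequence, so a manifest can never claim distribution before publication. This is a sketch: the advance function and its error text are invented here, and only the state names come from the text.

```typescript
// Stage order taken from the text; the transition rule is an assumed policy.
const STAGES = ["prepared", "committed", "published", "distributed"] as const;
type Stage = (typeof STAGES)[number];

// Allows only the next stage in order; `undefined` means nothing has happened yet.
function advance(current: Stage | undefined, next: Stage): Stage {
  const expectedIndex = current === undefined ? 0 : STAGES.indexOf(current) + 1;
  if (STAGES[expectedIndex] !== next) {
    throw new Error(`cannot move from ${current ?? "none"} to ${next}`);
  }
  return next;
}
```

Recording the last stage reached in the manifest then tells an operator where a partial publish stopped, for example at committed with no blob yet public.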

The sixth failure mode is observability failure itself. If the log root is missing, full, or unwritable, the wrapper should fail before irreversible work. If artifact finalization fails, that should be a failed run even if the content step succeeded, because the audit trail is part of the product surface.
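
A preflight check for the log root could be as small as the sketch below, which creates the directory if needed and proves it is writable by writing and removing a probe file. The function name and probe file name are assumptions; the calls are standard Node.js fs functions.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Throws if the directory cannot be created or written to.
// Intended to run before any irreversible publishing step.
function assertWritableDir(dir: string): void {
  fs.mkdirSync(dir, { recursive: true });
  const probe = path.join(dir, `.write-probe-${process.pid}`);
  fs.writeFileSync(probe, "ok"); // fails on permission errors or a full disk
  fs.unlinkSync(probe);
}
```

Writing a real file is a deliberate choice over a permission-only check, since a permission bit says nothing about a full filesystem.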

Implementation notes

Use one wrapper per operational job family. The crontab entry should express schedule, timezone, and wrapper path; the wrapper should own every other concern. That includes run_id, mode, artifact_dir, log_path, lock acquisition, environment loading, runtime launch, and final status.

Use one lock per idempotency boundary. A daily article job should not share a lock with unrelated maintenance work, but every path that can publish the same daily article should share one lock. Prefer bounded waits or nonblocking exits over unbounded queueing, then record whether a run executed, skipped, or timed out.
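
A sketch of both halves of that advice: deriving one lock file per idempotency domain, and turning the wrapper's exit status into a recorded outcome. The lock directory, the domain naming rule, and the function names are assumptions; the conflict code is whatever value the wrapper passed to flock with -E.

```typescript
// Assumed to be a local, non-network filesystem path.
const LOCK_DIR = "/var/lock/naly";

function lockPathFor(domain: string): string {
  if (!/^[a-z0-9][a-z0-9-]*$/.test(domain)) {
    throw new Error(`invalid lock domain: ${JSON.stringify(domain)}`);
  }
  return `${LOCK_DIR}/${domain}.lock`;
}

type LockOutcome = "executed" | "skipped" | "timed-out" | "failed";

// `policy` is how the lock was requested: "skip" for -n, "wait" for -w.
function classifyExit(code: number, policy: "skip" | "wait", conflictCode: number): LockOutcome {
  if (code === 0) return "executed";
  if (code === conflictCode) return policy === "skip" ? "skipped" : "timed-out";
  return "failed"; // the job ran under the lock and exited non-zero
}
```

Every entry point that can publish the same daily article would call lockPathFor with the same domain string, while a distribution sender that is safe to run independently would use a different one.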

Make artifact directories deterministic. A practical shape is job/YYYY-MM-DD/schedule-slot/run-id/. Put started.json at the beginning and finished.json at the end. Include mode, date label, commit or build identifier when available, package/runtime family, duration, exit code, output counts, and publish identifiers.
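
The relationship between the two files can be sketched as a pure function that derives the finished record from the started one. The interface and field names below are assumptions about what started.json and finished.json might contain, based on the fields listed above.

```typescript
// Illustrative shapes; actual manifests may carry more metadata.
interface StartedManifest {
  job: string;
  runId: string;
  mode: "full" | "smoke";
  dateLabel: string;
  startedAt: string; // ISO 8601 timestamp
  build?: string; // commit or build identifier when available
}

interface RunOutcome {
  exitCode: number;
  outputCount: number;
  publishIds: string[];
}

interface FinishedManifest extends StartedManifest, RunOutcome {
  finishedAt: string;
  durationMs: number;
}

function finishManifest(started: StartedManifest, outcome: RunOutcome, end: Date): FinishedManifest {
  return {
    ...started,
    ...outcome,
    finishedAt: end.toISOString(),
    durationMs: end.getTime() - Date.parse(started.startedAt),
  };
}
```

Serializing the result with JSON.stringify gives the body of finished.json; a run directory that has started.json but no finished.json is then direct evidence of a run that never reached finalization.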

Keep smoke and full modes on the same rail. Smoke mode can write into a dry-run namespace and suppress public distribution, but it should still acquire the lock, load the environment, initialize Drizzle or Neon access when needed, verify blob-write assumptions when relevant, and render markdown through the same content path.
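
One way to keep the two modes on the same rail is to encode the shared steps as literal true in the type, so that a smoke plan cannot switch them off. The RunPlan shape, the planFor name, and the namespace labels are assumptions for illustration.

```typescript
type Mode = "full" | "smoke";

// The shared steps are typed as literal `true`: no mode can disable them.
interface RunPlan {
  acquireLock: true;
  loadEnv: true;
  renderContent: true;
  namespace: "live" | "dry-run";
  distribute: boolean;
}

function planFor(mode: Mode): RunPlan {
  return {
    acquireLock: true,
    loadEnv: true,
    renderContent: true,
    namespace: mode === "smoke" ? "dry-run" : "live",
    distribute: mode === "full", // smoke suppresses public distribution only
  };
}
```

With this shape, the only differences between modes are where output is written and whether distribution happens, which is the property that keeps smoke mode from becoming a placebo.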

Use structured logs even when writing plain files. Each important event should include job, run id, mode, schedule slot, artifact directory, duration or timestamp, and result. This makes log files queryable later and keeps the design compatible with OpenTelemetry-style ingestion if Naly later adds a collector.
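
A minimal sketch of such an event, written as one JSON object per line. The RunContext fields follow the list above; the logLine name and the exact key names are assumptions, not an OpenTelemetry schema.

```typescript
// Context carried through every step of a run.
interface RunContext {
  job: string;
  runId: string;
  mode: "full" | "smoke";
  scheduleSlot: string;
  artifactDir: string;
}

type EventResult = "ok" | "error" | "skipped";

// Returns a single-line JSON record suitable for appending to a plain log file.
function logLine(
  ctx: RunContext,
  event: string,
  result: EventResult,
  now: Date,
  extra: Record<string, unknown> = {},
): string {
  // Spread order matters: fixed fields come last so `extra` cannot overwrite them.
  return JSON.stringify({ ...extra, ts: now.toISOString(), event, result, ...ctx });
}
```

Because every line carries the run id and artifact directory, a plain text search over the log root is enough to reconstruct one run's lifecycle.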

The current runtime stack fits this pattern. tsx and TypeScript support checked-in operational scripts. Drizzle ORM and Neon support durable database state. Vercel Blob supports durable publish artifacts. marked supports markdown rendering paths. Next.js and React present the result, but cron should remain outside the request lifecycle.

The broader lesson is that cron is safe only when it is not asked to remember. Naly makes cron wake the system, flock serialize the risky region, and artifacts remember what happened.