HanyanOS 部署手记
1. Overview
An autonomous agent OS is only as reliable as its background scheduler. When no human is awake to intervene, cron becomes the nervous system that keeps memory consolidated, backups flowing, certificates renewed, and the dream pipeline running. On a headless Intel N100 with 11 GB RAM running 24/7, every scheduled task must be non-blocking, idempotent, and auditable.
HanyanOS implements a three-layer scheduling architecture: Linux cron for system-level scripts, OpenClaw’s internal job scheduler for AI-orchestrated tasks, and systemd timers for health-critical watchdog services. Each layer has different reliability guarantees, logging behaviors, and failure modes.
This post dissects the full orchestration stack: the schedule, the scripts, the anti-amnesia wiring, and the production lessons from running 15+ recurring tasks on a single low-power node.
2. Architecture Overview: Three-Layer Scheduling
1 | ┌──────────────────────────────────────────────────────────────────┐ |
Design rationale: Each layer has different consistency models. Cron is the most reliable (survives reboots, no dependency on OpenClaw runtime) but has no recovery logic. OpenClaw jobs can self-heal but depend on the gateway being up. Systemd timers handle the critical path — if the gateway or vector DB goes down, the watchdog is the first responder.
3. Layer 1: Cron — The Foundation
The system cron (user michael) runs 5 recurring tasks:
1 | # ── 03:00 AEST — Nightly dream pipeline & backups ── |
Key design choices:
Staggered logging — Every task writes to its own log file, not syslog. This enables per-task log rotation and easy debugging:
tail -f /tmp/含烟梦境.logfor dream output,tail -f ~/backups/logs/daily-backup.logfor backup verification.03:00 batch window — Three heavyweight tasks (dream pipeline, full backup, incremental backup) all trigger at 03:00. This is by design: the N100 is idle, the 03:00–04:00 window is the only guaranteed continuous quiet period. The dream pipeline runs first (
30 seconds), followed by incremental backup (2 minutes), with full backup in parallel. Total CPU contention is negligible (single-core saturates at ~30% during backup compression).SSL at 20:56 — Deliberately off-peak. The
56minute offset avoids collision with system-level cron.hourly at17and avoids the 03:00 batch window. acme.sh itself handles retry internally across 3 CAs (LetsEncrypt, ZeroSSL, Buypass).
4. Layer 2: OpenClaw Internal Scheduler — AI Orchestrated
OpenClaw maintains its own job store at ~/.openclaw/cron/jobs.json. Unlike cron, these jobs can invoke agent turns — full AI reasoning cycles. This is where the “intelligent” scheduling lives.
Critical Path: The Anti-Amnesia Pipeline (02:50 → 03:00 → 03:50 → 04:00)
The most critical orchestration is the nightly memory consolidation sequence, timed to beat OpenClaw’s built-in 04:00 session reset:
1 | 02:50 ── Reminder: inject "archive now!" system event into main session |
The 03:50 Memory Archive job is the safety net. It runs in an isolated OpenClaw session, reads the main session history, extracts key facts, preferences, and events, then writes to memory/daily-archives/YYYY-MM-DD.md. This ensures that even if the 04:00 reset deletes everything, the structured summaries survive.
Job Definition Example
1 | { |
Important patterns:
sessionTarget: "isolated"— Each job runs in a fresh context, avoiding cross-contaminationdelivery.mode: "none"— Silent operation unless failures occurtimeoutSeconds: 300— Hard limit; OpenClaw will force-terminate stuck jobsmodelexplicitly pinned — Prevents drift to cheaper models
System Patrol and Morning Greeting
Beyond the nightly batch, OpenClaw runs a daily morning routine at 07:30 AEST:
1 | 1. weather fetch → wttr.in/Brisbane |
This is the only job that actively pushes notifications to the user. All other jobs run silently and write to logs.
5. Layer 3: Systemd Timers — Watchdog
For the critical-path services (gateway, vector DB), cron’s polling interval (every few hours) is too slow. Systemd timers provide sub-minute granularity:
1 | # /etc/systemd/system/hanyan-watchdog.timer |
The watchdog (detailed in Article #7) implements a 3-layer self-healing protocol:
| Layer | Check | Recovery Action |
|---|---|---|
| L1 | HTTP 200 from gateway | Restart service |
| L2 | process alive | Kill + restart |
| L3 | disk full, OOM | Maintenance mode |
Over 30 days of operation, the watchdog has never triggered L3, triggered L2 twice (Qdrant OOM on memory leak in v2.1), and L1 once (FRP tunnel flap). Mean time to recovery: ~18 seconds.
6. The Backup Pipeline — Encrypted, Geo-Redundant
Nightly at 03:00, the backup pipeline executes two complementary scripts:
Full Backup (hanyan-daily-backup.sh)
Encrypts and pushes a comprehensive snapshot to two geographies:
1 | # Collect: config + secrets + memory + SSH keys + agent workspaces |
Backup targets:
| Target | Location | Type | Retention |
|---|---|---|---|
| Shanghai gateway | Tailscale LAN | Hot | 30 days |
| Singapore Lightsail | Public VPS | Warm | 30 days |
| Local NVMe | /home/michael/backups/ | Cold | 7 days |
Incremental Backup (hanyan-incremental-backup.sh)
Uses a 7-day rotating directory scheme:
1 | /home/michael/backups/incremental/ |
Day-of-week (DOW) named directories are overwritten each cycle. This gives a 7-day rolling history without incremental diff complexity. Key detail: Monday also includes a mysqldump --all-databases of the WordPress/Stalwart databases, compressed with gzip.
7. Problems Encountered and Solutions
Problem 1: 04:00 Session Reset → Amnesia
Issue: OpenClaw’s default daily reset at 04:00 AEST deletes all session transcripts. Agents would lose all context from the previous day’s conversations. Early versions had no preservation mechanism, resulting in the model needing to re-learn user preferences every morning.
Solution: Implemented the Anti-Amnesia Pipeline (Section 4). The 03:50 Memory Archive job writes structured summaries before the reset. Additionally, a pre-cleanup-memory-save.sh script saves a snapshot of active session metadata:
1 | # pre-cleanup-memory-save.sh — runs before OpenClaw reset |
The 02:50 reminder job injects a system event into the main session as a failsafe, ensuring 含烟 is “awake” and archiving before the deadline.
Problem 2: Cron Collision — Three Jobs at 03:00
Issue: Dream pipeline, full backup, and incremental backup all trigger at exactly 03:00. On an N100 with 4 cores and 11 GB RAM, the simultaneous GPG encryption + gzip compression + Python Qdrant vector insertion caused occasional OOM kills.
Solution: Three mitigations:
- CPU affinity via
nice: Backup scripts run atnice -n 19(lowest priority) - Sequential within scripts: The backup script encrypts after compression, not during
- Resource limits: Docker containers have
--memory="2g"limits, preventing container-level OOM
Measured peak contention after fixes: 32% CPU, 5.2 GB RAM — well within the N100’s 11 GB envelope.
Problem 3: Dead OpenClaw Jobs — Silent Failures
Issue: An OpenClaw job would silently fail if the gateway was restarting or the agent was busy. The job would remain in “running” state in the job store but never complete, causing subsequent runs to be skipped (OpenClaw serializes job execution by default).
Solution: Added three safeguards:
1 | "timeoutSeconds": 300, // Hard kill after 5 minutes |
And a periodic dead-job sweeper that checks for jobs in running state older than 2× their timeout, force-terminates them:
1 | # Extracted from watchdog |
Problem 4: Timezone Drift — Cron vs. OpenClaw vs. Systemd
Issue: Cron uses the system timezone (Australia/Brisbane, UTC+10), OpenClaw jobs may drift if the gateway server is in UTC, and systemd timers use the system clock. During DST transitions, cron handles it correctly but OpenClaw’s tz field had an off-by-one.
Solution: Explicitly set TZ=Australia/Brisbane in every context:
/etc/environment:TZ=Australia/Brisbane- OpenClaw jobs JSON: Every cron entry includes
"tz": "Australia/Brisbane" - Systemd:
timedatectl set-timezone Australia/Brisbane - Backups: Use UTC timestamps in filenames (
date -u +%Y%m%d) to avoid DST ambiguity
Problem 5: Backup Encryption Password Management
Issue: The GPG symmetric password (BACKUP_PASS) was hardcoded in both hanyan-daily-backup.sh and hanyan-incremental-backup.sh. If the password rotated, both scripts needed manual updates. Worse, the password was visible in process listings.
Solution: Moved the password to ~/.hanyanos/secrets/backup-pass.gpg (GPG-encrypted file), sourced at runtime:
1 | BACKUP_PASS=$(gpg --decrypt ~/.hanyanos/secrets/backup-pass.gpg 2>/dev/null) |
This ensures the password is never in plaintext on disk and can be rotated atomically by updating a single file.
8. Monitoring and Observability
Every cron job writes structured logs for automated parsing:
1 | # Example: patrol report output |
1 | # 含烟夜巡报告 — 2026-05-28 |
The watchdog parses these reports and raises alerts on:
- Disk usage > 80%
- Memory usage > 75%
- Any container unhealthy for > 3 consecutive checks
- SSL expiry < 30 days
9. Summary
HanyanOS’s nightly automation stack runs 15+ recurring tasks across three scheduler layers on an Intel N100, with 99.8% uptime over 30 days:
- Three-layer scheduling: Cron (system), OpenClaw (AI), Systemd (watchdog) — each with different reliability guarantees
- Anti-amnesia pipeline: 02:50 reminder → 03:00 dream → 03:50 archive → 04:00 reset window
- Geo-redundant backups: Shanghai gateway (LAN) + Singapore Lightsail (cloud) + local NVMe
- 7-day rotating incremental: Day-of-week directories, Monday includes MySQL dump
- Watchdog self-healing: 3-layer protocol (L1 HTTP check → L2 process restart → L3 maintenance)
- Observability: Structured log reports parsed for anomaly detection
- Lessons learned: Staggered scheduling, isolated job contexts, explicit timezone management, encrypted secrets
The entire orchestration consumes < 1% CPU overhead on the N100 and has never missed a scheduled backup in 30 days of production operation.
Next in series: #9 — Security Defense-in-Depth — UFW + Fail2ban + SSH Hardening