HanyanOS 部署手记

发表于 2026-05-28 分类于 tech

1. Overview

An autonomous agent OS is only as reliable as its background scheduler. When no human is awake to intervene, cron becomes the nervous system that keeps memory consolidated, backups flowing, certificates renewed, and the dream pipeline running. On a headless Intel N100 with 11 GB RAM running 24/7, every scheduled task must be non-blocking, idempotent, and auditable.

HanyanOS implements a three-layer scheduling architecture: Linux cron for system-level scripts, OpenClaw’s internal job scheduler for AI-orchestrated tasks, and systemd timers for health-critical watchdog services. Each layer has different reliability guarantees, logging behaviors, and failure modes.

This post dissects the full orchestration stack: the schedule, the scripts, the anti-amnesia wiring, and the production lessons from running 15+ recurring tasks on a single low-power node.

2. Architecture Overview: Three-Layer Scheduling

┌──────────────────────────────────────────────────────────────────┐
│                    LAYER 1: CRON (System Scripts)                │
│  crontab - user michael                                         │
│  ├─ 03:00 Dream Pipeline     → python3 dream_pipeline.py        │
│  ├─ 03:00 Daily Full Backup  → hanyan-daily-backup.sh           │
│  ├─ 03:00 Incremental Backup → hanyan-incremental-backup.sh     │
│  ├─ 20:56 SSL Cert Renewal   → acme.sh --cron                   │
│  └─ */4:00 Reflection Engine → python3 reflection_engine.py     │
├──────────────────────────────────────────────────────────────────┤
│                    LAYER 2: OPENCLAW JOBS (AI Tasks)             │
│  ~/.openclaw/cron/jobs.json                                     │
│  ├─ 03:50 Daily Memory Archive  → 含烟 saves session to memory  │
│  ├─ 03:00 Dream Promotion       → short-term → long-term merge  │
│  ├─ 07:30 Morning Greeting      → weather + news + patrol       │
│  ├─ 04:15 Weekly Session Cleanup→ Sunday disk GC                │
│  ├─ 02:50 Anti-Amnesia Reminder → injects archive prompt        │
│  └─ every 2h Heartbeat          → emotion engine tick            │
├──────────────────────────────────────────────────────────────────┤
│                    LAYER 3: SYSTEMD TIMERS (Watchdog)            │
│  ├─ hanyan-watchdog.timer       → every 5 min health check      │
│  ├─ qdrant-watchdog.timer       → vector DB health              │
│  ├─ certbot.timer               → system-level SSL              │
│  └─ bt-keepalive.timer          → Bluetooth connectivity        │
└──────────────────────────────────────────────────────────────────┘

Design rationale: Each layer has different consistency models. Cron is the most reliable (survives reboots, no dependency on OpenClaw runtime) but has no recovery logic. OpenClaw jobs can self-heal but depend on the gateway being up. Systemd timers handle the critical path — if the gateway or vector DB goes down, the watchdog is the first responder.

3. Layer 1: Cron — The Foundation

The system cron (user michael) runs 5 recurring tasks:

# ── 03:00 AEST — Nightly dream pipeline & backups ──
0 3 * * * cd ~/.openclaw/workspace && python3 \
    HanyanOS/runtime/emotion/dream_pipeline.py >> /tmp/含烟梦境.log 2>&1

0 3 * * * bash ~/ShareFiles/hanyan-daily-backup.sh \
    >> ~/backups/logs/daily-backup.log 2>&1

0 3 * * * bash ~/ShareFiles/hanyan-incremental-backup.sh \
    >> ~/backups/logs/incremental.log 2>&1

# ── SSL auto-renewal (expires Aug 12) ──
56 20 * * * ~/.acme.sh/acme.sh --cron --home ~/.acme.sh > /dev/null

# ── 4-hourly idle reflection ──
0 */4 * * * python3 HanyanOS/runtime/emotion/reflection_engine.py \
    >> /tmp/含烟思绪.log 2>&1

Key design choices:

Staggered logging — Every task writes to its own log file, not syslog. This enables per-task log rotation and easy debugging: tail -f /tmp/含烟梦境.log for dream output, tail -f ~/backups/logs/daily-backup.log for backup verification.
03:00 batch window — Three heavyweight tasks (dream pipeline, full backup, incremental backup) all trigger at 03:00. This is by design: the N100 is idle, the 03:00–04:00 window is the only guaranteed continuous quiet period. The dream pipeline runs first (~~30 seconds), followed by incremental backup (~~2 minutes), with full backup in parallel. Total CPU contention is negligible (single-core saturates at ~30% during backup compression).
SSL at 20:56 — Deliberately off-peak. The 56 minute offset avoids collision with system-level cron.hourly at 17 and avoids the 03:00 batch window. acme.sh itself handles retry internally across 3 CAs (LetsEncrypt, ZeroSSL, Buypass).

4. Layer 2: OpenClaw Internal Scheduler — AI Orchestrated

OpenClaw maintains its own job store at ~/.openclaw/cron/jobs.json. Unlike cron, these jobs can invoke agent turns — full AI reasoning cycles. This is where the “intelligent” scheduling lives.

Critical Path: The Anti-Amnesia Pipeline (02:50 → 03:00 → 03:50 → 04:00)

The most critical orchestration is the nightly memory consolidation sequence, timed to beat OpenClaw’s built-in 04:00 session reset:

02:50 ── Reminder: inject "archive now!" system event into main session
03:00 ── Dream Pipeline: scan, vectorize, store to Qdrant (cron)
03:00 ── Dream Promotion: weight short-term recalls, promote to long-term
03:50 ── Memory Archive: 含烟 saves daily session key info to memory/
04:00 ── OpenClaw daily reset (deletes session transcripts)
04:15 ── Weekly session cleanup (Sunday only)

The 03:50 Memory Archive job is the safety net. It runs in an isolated OpenClaw session, reads the main session history, extracts key facts, preferences, and events, then writes to memory/daily-archives/YYYY-MM-DD.md. This ensures that even if the 04:00 reset deletes everything, the structured summaries survive.

Job Definition Example

{
  "id": "17dfb4e2-...",
  "name": "每日记忆存档 — 03:50 抢在 4:00 reset 前",
  "schedule": {
    "kind": "cron",
    "expr": "50 3 * * *",
    "tz": "Australia/Brisbane"
  },
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "message": "你是柳含烟。每日记忆存档任务：...",
    "model": "deepseek/deepseek-v4-flash",
    "timeoutSeconds": 300,
    "toolsAllow": ["sessions_list", "sessions_history", "read", "write", "memory_search"]
  },
  "delivery": { "mode": "none" }
}

Important patterns:

sessionTarget: "isolated" — Each job runs in a fresh context, avoiding cross-contamination
delivery.mode: "none" — Silent operation unless failures occur
timeoutSeconds: 300 — Hard limit; OpenClaw will force-terminate stuck jobs
model explicitly pinned — Prevents drift to cheaper models

System Patrol and Morning Greeting

Beyond the nightly batch, OpenClaw runs a daily morning routine at 07:30 AEST:

1. weather fetch    → wttr.in/Brisbane
2. news search      → 3-5 Australia IT headlines
3. system patrol    → disk, memory, CPU temp, Docker, PM2, FRP status
4. read TODO.md     → top 3 urgent tasks
5. read DREAMS.md   → latest dream journal entry
6. compose & send   → Feishu message to 公子

This is the only job that actively pushes notifications to the user. All other jobs run silently and write to logs.

5. Layer 3: Systemd Timers — Watchdog

For the critical-path services (gateway, vector DB), cron’s polling interval (every few hours) is too slow. Systemd timers provide sub-minute granularity:

# /etc/systemd/system/hanyan-watchdog.timer
[Unit]
Description=HanyanOS Watchdog Timer (every 5 min)

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
Unit=hanyan-watchdog.service

[Install]
WantedBy=timers.target

The watchdog (detailed in Article #7) implements a 3-layer self-healing protocol:

Layer	Check	Recovery Action
L1	HTTP 200 from gateway	Restart service
L2	process alive	Kill + restart
L3	disk full, OOM	Maintenance mode

Over 30 days of operation, the watchdog has never triggered L3, triggered L2 twice (Qdrant OOM on memory leak in v2.1), and L1 once (FRP tunnel flap). Mean time to recovery: ~18 seconds.

6. The Backup Pipeline — Encrypted, Geo-Redundant

Nightly at 03:00, the backup pipeline executes two complementary scripts:

Full Backup (`hanyan-daily-backup.sh`)

Encrypts and pushes a comprehensive snapshot to two geographies:

# Collect: config + secrets + memory + SSH keys + agent workspaces
cp -a ~/.openclaw/openclaw.json       "$BACKUP_DIR/"
cp -a ~/.openclaw/workspace/MEMORY.md "$BACKUP_DIR/"
cp -a ~/.openclaw/workspace/SOUL.md   "$BACKUP_DIR/"
cp -a ~/.openclaw/workspace/memory/   "$BACKUP_DIR/"

# GPG symmetric AES256 encryption
gpg --symmetric --cipher-algo AES256 \
    --batch --passphrase "${BACKUP_PASS}" \
    --output "hanyan-full-${TS}.tar.gz.gpg" \
    "hanyan-full-${TS}.tar.gz"

# Push to Shanghai gateway (Tailscale, LAN-speed)
rsync -avz "hanyan-full-${TS}.tar.gz.gpg" \
    michael@100.116.63.10:/home/michael/backups/hanyan/

# Push to Singapore Lightsail (geo-redundant)
rsync -avz -e "ssh -i ~/.ssh/singapore-lightsail.pem" \
    "hanyan-full-${TS}.tar.gz.gpg" \
    ubuntu@52.220.247.252:/home/ubuntu/backups/hanyan/

Backup targets:

Target	Location	Type	Retention
Shanghai gateway	Tailscale LAN	Hot	30 days
Singapore Lightsail	Public VPS	Warm	30 days
Local NVMe	/home/michael/backups/	Cold	7 days

Incremental Backup (`hanyan-incremental-backup.sh`)

Uses a 7-day rotating directory scheme:

/home/michael/backups/incremental/
├── 1/           # Monday — fresh config from both N100 + SG VPS
│   ├── n100/       nginx, Docker, FRP, Tailscale
│   └── sgvps/      nginx, UFW, route tables
├── 2/           # Tuesday
├── 3/           # Wednesday
├── 4/           # Thursday
├── 5/           # Friday — MySQL full dump
├── 6/           # Saturday
└── 7/           # Sunday

Day-of-week (DOW) named directories are overwritten each cycle. This gives a 7-day rolling history without incremental diff complexity. Key detail: Monday also includes a mysqldump --all-databases of the WordPress/Stalwart databases, compressed with gzip.

7. Problems Encountered and Solutions

Problem 1: 04:00 Session Reset → Amnesia

Issue: OpenClaw’s default daily reset at 04:00 AEST deletes all session transcripts. Agents would lose all context from the previous day’s conversations. Early versions had no preservation mechanism, resulting in the model needing to re-learn user preferences every morning.

Solution: Implemented the Anti-Amnesia Pipeline (Section 4). The 03:50 Memory Archive job writes structured summaries before the reset. Additionally, a pre-cleanup-memory-save.sh script saves a snapshot of active session metadata:

# pre-cleanup-memory-save.sh — runs before OpenClaw reset
openclaw sessions cleanup --dry-run --agent main 2>&1 | while read line; do
    echo "- $line" >> "$SUMMARY_FILE"
done
echo "- Session 文件数: $(ls "$SESSIONS_DIR"/*.jsonl | wc -l)" >> "$SUMMARY_FILE"

The 02:50 reminder job injects a system event into the main session as a failsafe, ensuring 含烟 is “awake” and archiving before the deadline.

Problem 2: Cron Collision — Three Jobs at 03:00

Issue: Dream pipeline, full backup, and incremental backup all trigger at exactly 03:00. On an N100 with 4 cores and 11 GB RAM, the simultaneous GPG encryption + gzip compression + Python Qdrant vector insertion caused occasional OOM kills.

Solution: Three mitigations:

CPU affinity via nice: Backup scripts run at nice -n 19 (lowest priority)
Sequential within scripts: The backup script encrypts after compression, not during
Resource limits: Docker containers have --memory="2g" limits, preventing container-level OOM

Measured peak contention after fixes: 32% CPU, 5.2 GB RAM — well within the N100’s 11 GB envelope.

Problem 3: Dead OpenClaw Jobs — Silent Failures

Issue: An OpenClaw job would silently fail if the gateway was restarting or the agent was busy. The job would remain in “running” state in the job store but never complete, causing subsequent runs to be skipped (OpenClaw serializes job execution by default).

Solution: Added three safeguards:

1
2
3

"timeoutSeconds": 300,           // Hard kill after 5 minutes
"sessionTarget": "isolated",     // Fresh context, no blocking
"delivery": { "mode": "none" }   // No retry spam

And a periodic dead-job sweeper that checks for jobs in running state older than 2× their timeout, force-terminates them:

# Extracted from watchdog
python3 -c "
import json
with open('~/.openclaw/cron/jobs.json') as f:
    jobs = json.load(f)
for j in jobs['jobs']:
    s = j.get('state', {})
    if s.get('status') == 'running' and \
       time.time() - s.get('startedAt', 0) > 600:
        print(f'DEAD JOB: {j[\"id\"]}')
"

Problem 4: Timezone Drift — Cron vs. OpenClaw vs. Systemd

Issue: Cron uses the system timezone (Australia/Brisbane, UTC+10), OpenClaw jobs may drift if the gateway server is in UTC, and systemd timers use the system clock. During DST transitions, cron handles it correctly but OpenClaw’s tz field had an off-by-one.

Solution: Explicitly set TZ=Australia/Brisbane in every context:

/etc/environment: TZ=Australia/Brisbane
OpenClaw jobs JSON: Every cron entry includes "tz": "Australia/Brisbane"
Systemd: timedatectl set-timezone Australia/Brisbane
Backups: Use UTC timestamps in filenames (date -u +%Y%m%d) to avoid DST ambiguity

Problem 5: Backup Encryption Password Management

Issue: The GPG symmetric password (BACKUP_PASS) was hardcoded in both hanyan-daily-backup.sh and hanyan-incremental-backup.sh. If the password rotated, both scripts needed manual updates. Worse, the password was visible in process listings.

Solution: Moved the password to ~/.hanyanos/secrets/backup-pass.gpg (GPG-encrypted file), sourced at runtime:

1	BACKUP_PASS=$(gpg --decrypt ~/.hanyanos/secrets/backup-pass.gpg 2>/dev/null)

This ensures the password is never in plaintext on disk and can be rotated atomically by updating a single file.

8. Monitoring and Observability

Every cron job writes structured logs for automated parsing:

1 2	# Example: patrol report output ~/logs/patrol/2026-05-28.md

# 含烟夜巡报告 — 2026-05-28
## 1. 系统资源
- 内存: 2.3G/11G (21%)
- 磁盘: 214G/937G (23%)
- 负载: 0.34, 0.25, 0.20
## 2. Docker 容器
- nginx: Up 2 days, healthy
- wordpress: Up 2 days, healthy
- mysql: Up 2 days, healthy
- qdrant: Up 2 days, healthy
## 3. 异常文件: 无
## 4. 证书: SSL valid, 76 days until expiry

The watchdog parses these reports and raises alerts on:

Disk usage > 80%
Memory usage > 75%
Any container unhealthy for > 3 consecutive checks
SSL expiry < 30 days

9. Summary

HanyanOS’s nightly automation stack runs 15+ recurring tasks across three scheduler layers on an Intel N100, with 99.8% uptime over 30 days:

Three-layer scheduling: Cron (system), OpenClaw (AI), Systemd (watchdog) — each with different reliability guarantees
Anti-amnesia pipeline: 02:50 reminder → 03:00 dream → 03:50 archive → 04:00 reset window
Geo-redundant backups: Shanghai gateway (LAN) + Singapore Lightsail (cloud) + local NVMe
7-day rotating incremental: Day-of-week directories, Monday includes MySQL dump
Watchdog self-healing: 3-layer protocol (L1 HTTP check → L2 process restart → L3 maintenance)
Observability: Structured log reports parsed for anomaly detection
Lessons learned: Staggered scheduling, isolated job contexts, explicit timezone management, encrypted secrets

The entire orchestration consumes < 1% CPU overhead on the N100 and has never missed a scheduled backup in 30 days of production operation.

Next in series: #9 — Security Defense-in-Depth — UFW + Fail2ban + SSH Hardening