In Progress Created 2026-06-02 0/34 tasks

LaunchDaemon Migration for Critical Mission Control Agents

Priority: P1 Category: PREP


Executive Summary

Migrate three critical Mission Control LaunchAgents from user-level (~/Library/LaunchAgents/) to system-level (/Library/LaunchDaemons/) so they survive unattended reboots without requiring a user login. The current setup failed silently for roughly 52 hours over the Memorial Day weekend (May 30 4:10am reboot → no login until June 1 8:40am → all user agents dead, including the catch-up agent that is supposed to recover from exactly this situation).


Research Phase

Current State

What broke:

  • Mac rebooted unattended Sat May 30 at 04:10am (shutdown_stall report exists at /Library/Logs/DiagnosticReports/shutdown_stall_2026-05-30-041033_Mac-mini.shutdownStall, root-owned, cause unverified — most likely power blip)
  • FileVault is ON; no auto-login (auto-login is incompatible with FileVault)
  • User-level LaunchAgents require an active user session — they don't fire when no one is logged in
  • Cole was away for Memorial Day → no login until Jun 1 8:40am
  • 5/30 nightly summary missed, 5/31 nightly summary missed, 5/30+5/31+6/1 morning briefings missed
  • The com.missioncontrol.sleep-recovery hourly catch-up agent is ALSO user-level → same flaw; couldn't recover
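
The outage length and the list of missed runs follow directly from the two timestamps and the daily schedules. A minimal sketch of that arithmetic, assuming the reboot and login times above and the 22:00 and 07:00 daily schedules from the agent inventory; the helper name missed_runs is illustrative and not part of Mission Control:

```python
from datetime import datetime, time, timedelta

def missed_runs(down_from, up_at, daily_at):
    """Return every daily run time that falls inside the outage window."""
    runs = []
    day = down_from.date()
    while day <= up_at.date():
        run = datetime.combine(day, daily_at)
        if down_from < run < up_at:
            runs.append(run)
        day += timedelta(days=1)
    return runs

reboot = datetime(2026, 5, 30, 4, 10)  # unattended reboot
login = datetime(2026, 6, 1, 8, 40)    # first login afterwards

gap_hours = (login - reboot).total_seconds() / 3600
nightly = missed_runs(reboot, login, time(22, 0))  # nightly order summary
morning = missed_runs(reboot, login, time(7, 0))   # morning briefing

print(f"outage: {gap_hours:.1f} h")
print("nightly summaries missed:", [r.strftime("%m/%d") for r in nightly])
print("morning briefings missed:", [r.strftime("%m/%d") for r in morning])
```

This gives a 52.5 hour window containing two nightly summaries and three morning briefings, which matches the misses listed above.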

Current LaunchAgent inventory (relevant subset):

  • ~/Library/LaunchAgents/com.missioncontrol.sleep-recovery.plist — hourly catch-all
  • ~/Library/LaunchAgents/com.missioncontrol.nightly-order-summary.plist — daily 22:00
  • ~/Library/LaunchAgents/com.missioncontrol.morningbriefing.plist — daily 07:00
  • ~30+ other user-level mission-control agents (lower priority; defer migration)

Power management already correct:

  • pmset shows sleep 0, displaysleep 10, daily wake 6:30am, caffeinate running
  • macOS auto-update is OFF (AutomaticallyInstallMacOSUpdates = 0, install history confirms no Apple updates around May 30)
  • So this isn't a "Mac sleep" problem and isn't an "auto-update reboot" problem — it's a "FileVault + unattended reboot + user-level agents" problem

Historical Context

  • Memory: user_hardware.md notes auto-login was not set as of May 2026 and we've encountered the decryption blocker before
  • This is the second time the FileVault+reboot combo has caused unattended job loss (first was in early May per Cole's recollection)
  • The sleep-recovery agent was built specifically as a workaround for sleep-related misses but does not solve the no-login case

Constraints

  • FileVault stays ON (security requirement)
  • No auto-login (incompatible with FileVault)
  • Must follow PR Workflow hard rules — state-mutating system change, requires explicit Cole approval before any sudo launchctl bootstrap runs
  • No modification of CLAUDE.md hard rules — this isn't a server, but the spirit of "stop and ask before state-mutating system changes" applies
  • All Mission Control scripts use absolute paths into /Users/colegorringe/ai-projects-local/... and read secrets from /Users/colegorringe/.secrets/master.key
  • Logs currently write to /Users/colegorringe/cron-logs/ — if migrated as root, those become root-owned and break future user-level cleanup

Options Considered

Option 1: System-level daemon with UserName=colegorringe (RECOMMENDED)

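Approach: Move plists from ~/Library/LaunchAgents/ to /Library/LaunchDaemons/ and add <key>UserName</key><string>colegorringe</string> so the process spawns as Cole (not root). System-level launchd loads the daemon at boot, before any user logs in, and fires it on schedule regardless of user session. This depends on the FileVault-encrypted data volume being unlocked after the reboot; FileVault normally keeps that volume locked until a user authenticates at startup, so this assumption is unverified and must be confirmed in the Phase 5 reboot test. HOME, file ownership, log paths, and the secrets path all resolve as if Cole ran them.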

Pros:

  • Survives unattended reboots (root problem solved)
  • File ownership stays colegorringe:staff (no permission breakage)
  • All Path.home(), ~, and hardcoded /Users/colegorringe/... paths work
  • Secrets at ~/.secrets/master.key resolve correctly
  • Logs remain user-readable/writable
  • No script changes needed
  • Reversible: sudo launchctl bootout system/... removes cleanly

Cons:

  • Requires sudo to install/remove (one-time)
  • Three plists need duplication and slight modification
  • Need to be careful not to leave duplicate user-level agents running (must bootout the old user-level ones first)

Estimated Effort: Low (~30 min)

Option 2: Plain LaunchDaemon running as root

Approach: Move plists to /Library/LaunchDaemons/ without UserName key. Daemon runs as root.

Pros: Maximally reliable; root can do anything

Cons:

  • HOME = /var/root → all ~ paths break
  • Secrets file lookup breaks unless we hardcode /Users/colegorringe/.secrets/master.key
  • Log files become root-owned (subsequent user-level scripts can't append/rotate)
  • Scripts that write to /Users/colegorringe/ai-projects-local/... create root-owned files everywhere
  • Requires non-trivial script audit to make sure no Path.home() calls

Estimated Effort: Medium (~2 hr including script audit)
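
The HOME-related breakage in Option 2 can be illustrated without touching launchd. A small sketch, assuming a POSIX system where Python expands ~ from the HOME environment variable; resolve_with_home is a hypothetical helper, and /var/root is the root HOME named above:

```python
import os

def resolve_with_home(path, home):
    """Expand ~ in path the way a process whose HOME is `home` would."""
    saved = os.environ.get("HOME")
    os.environ["HOME"] = home
    try:
        return os.path.expanduser(path)
    finally:
        # Restore the caller's environment so the demo has no side effects.
        if saved is None:
            del os.environ["HOME"]
        else:
            os.environ["HOME"] = saved

as_cole = resolve_with_home("~/.secrets/master.key", "/Users/colegorringe")
as_root = resolve_with_home("~/.secrets/master.key", "/var/root")

print("as colegorringe:", as_cole)
print("as root:        ", as_root)
```

The same script line resolves to two different files depending on who runs it, which is why Option 2 needs the script audit and Option 1 does not.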

Option 3: Move critical jobs to the always-on server

Approach: Migrate nightly summary + morning briefing to rb.alpineanalytica.com (always on, no FileVault, no session issues).

Pros: Eliminates the local-Mac dependency entirely; bulletproof

Cons:

  • Requires server-side secrets management, deployment pipeline, monitoring
  • Some scripts depend on local resources (the encrypted secrets file and the master key from the Secure Enclave-era setup)
  • Larger project scope; better as a Phase 2 architecture move
  • Sleep-recovery agent is inherently local — can't be migrated

Estimated Effort: High (1-2 days)

Option 4: Status quo + better notification on missed runs

Approach: Don't fix the underlying issue; instead, add Telegram/macOS notification when sleep-recovery detects a multi-day gap.

Pros: Trivial to implement

Cons: Doesn't actually fix anything; you'd just be told faster that your nightly reports are missing. Doesn't help while you're traveling.

Estimated Effort: Low (~15 min)


Chosen Approach

Decision: Option 1 — system-level LaunchDaemons with UserName=colegorringe.

Rationale:

  1. Solves the root cause (the agents keep running after an unattended reboot with no login) without breaking the existing script/secret/path conventions
  2. Lowest-risk migration: scripts unchanged, file ownership unchanged, only the plist location and the added keys (UserName, GroupName, EnvironmentVariables) differ
  3. Fully reversible with one sudo bootout command per daemon
  4. Leaves the door open for Option 3 (server migration) as a later Phase 2 move if we want to fully eliminate local-Mac dependency

Trade-offs Accepted:

  • Sudo is required for install/remove (acceptable for a system-level migration)
  • This only fixes the three highest-value agents; the remaining ~30+ user-level agents stay user-level for now (intentional — most are lower priority and have less impact when missed)
  • We're not investigating the root cause of the May 30 reboot itself (the shutdown_stall report is root-owned; cause is most likely a power blip and can't be fixed without a UPS anyway)

Implementation Plan

Phase 1: Pre-flight validation

  • Read the three current plists in full and document each schedule, program path, log paths, and any environment variables
  • Confirm each Python script the agents call uses absolute paths (no os.chdir to relative dirs, no env-dependent imports)
  • Test-run each script manually as colegorringe user from a fresh terminal to confirm it works without any cached env
  • Verify the master key path: ls -la ~/.secrets/master.key (should be readable by user)
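
The path check in this phase can be partly automated. The following is a rough sketch, not the project's actual tooling: it flags lines that depend on HOME or on the working directory so they can be reviewed by hand. The pattern list and the audit_script name are assumptions, and the sample script is invented for the demo.

```python
import re
import tempfile
from pathlib import Path

# Calls that make a script depend on HOME or on the current working directory.
RISKY = {
    "Path.home()": re.compile(r"\bPath\.home\(\)"),
    "expanduser": re.compile(r"\bexpanduser\("),
    "os.chdir": re.compile(r"\bos\.chdir\("),
}

def audit_script(path):
    """Return (line number, pattern name, line text) for each risky call."""
    findings = []
    for lineno, line in enumerate(Path(path).read_text().splitlines(), start=1):
        for name, pattern in RISKY.items():
            if pattern.search(line):
                findings.append((lineno, name, line.strip()))
    return findings

# Demo on a throwaway file; real use would loop over the three agent scripts.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_job.py"
    sample.write_text(
        "import os\n"
        "from pathlib import Path\n"
        "key = Path.home() / '.secrets' / 'master.key'\n"
        "os.chdir('logs')\n"
    )
    findings = audit_script(sample)

for lineno, name, text in findings:
    print(f"line {lineno}: {name}: {text}")
```

A finding is a prompt for review rather than a failure: with HOME set explicitly in the daemon plist, Path.home() still resolves to Cole's home directory.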

Phase 2: Draft new daemon plists (no install yet)

  • Create /tmp/migration/com.missioncontrol.sleep-recovery.daemon.plist with:
      • Same Label, ProgramArguments, StartInterval, log paths
      • Add <key>UserName</key><string>colegorringe</string>
      • Add <key>GroupName</key><string>staff</string>
      • Add explicit <key>EnvironmentVariables</key> block with HOME=/Users/colegorringe, PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin, USER=colegorringe
  • Same for com.missioncontrol.nightly-order-summary.daemon.plist
  • Same for com.missioncontrol.morningbriefing.daemon.plist
  • Diff each new plist against the current user-level plist; the only differences should be the added UserName, GroupName, and EnvironmentVariables keys
  • Show plists to Cole for review before any sudo runs
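
The drafting and diffing steps above can be sketched with Python's standard plistlib module. This is an illustration under stated assumptions: the agent dictionary is a placeholder (the real ProgramArguments and log paths have not been read yet), and to_daemon and unexpected_delta are hypothetical helper names. Only the three added keys and their values come from this plan.

```python
import plistlib

ADDED_KEYS = {"UserName", "GroupName", "EnvironmentVariables"}

def to_daemon(agent):
    """Copy a LaunchAgent dict and add the keys listed in Phase 2."""
    daemon = dict(agent)
    daemon["UserName"] = "colegorringe"
    daemon["GroupName"] = "staff"
    daemon["EnvironmentVariables"] = {
        "HOME": "/Users/colegorringe",
        "PATH": "/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin",
        "USER": "colegorringe",
    }
    return daemon

def unexpected_delta(agent, daemon):
    """Return keys that differ beyond the three allowed additions."""
    changed = {k for k in agent.keys() | daemon.keys() if agent.get(k) != daemon.get(k)}
    return sorted(changed - ADDED_KEYS)

# Placeholder agent: the script path and log file name are made up for the demo.
agent = {
    "Label": "com.missioncontrol.sleep-recovery",
    "ProgramArguments": ["/usr/bin/python3", "/path/to/sleep_recovery.py"],
    "StartInterval": 3600,
    "StandardOutPath": "/Users/colegorringe/cron-logs/sleep-recovery.log",
}

daemon = to_daemon(agent)
xml = plistlib.dumps(daemon).decode()  # XML text that would be written to /tmp/migration/
print(xml)
print("unexpected differences:", unexpected_delta(agent, daemon))
```

In real use the agent dictionary would come from plistlib.load on the existing file. If an existing agent already defines EnvironmentVariables, this helper overwrites it, so that case would need a manual merge before review.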

Phase 3: Install (requires Cole's explicit go-ahead and sudo password)

For each daemon:

  • Unload old user-level agent: launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.missioncontrol.<name>.plist
  • Copy new daemon plist to /Library/LaunchDaemons/com.missioncontrol.<name>.plist (dropping the .daemon suffix used for the draft) with sudo cp, then sudo chown root:wheel and sudo chmod 644
  • Bootstrap: sudo launchctl bootstrap system /Library/LaunchDaemons/com.missioncontrol.<name>.plist
  • Verify loaded: sudo launchctl print system/com.missioncontrol.<name> (should show state=running or waiting)
  • Move old user-level plist to ~/Library/LaunchAgents/disabled/ (don't delete; we want a rollback path)
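
For review before any sudo runs, the per-daemon sequence above can be generated as a dry-run list. The sketch below only builds and prints commands; it executes nothing. install_commands is a hypothetical helper, the UID 501 in the demo is a placeholder for the output of id -u, and the mkdir step is an added assumption in case the disabled/ directory does not exist yet.

```python
import shlex

def install_commands(name, uid, draft_dir="/tmp/migration", home="/Users/colegorringe"):
    """Build the Phase 3 command sequence for one daemon, in order."""
    label = f"com.missioncontrol.{name}"
    agent_dir = f"{home}/Library/LaunchAgents"
    agent = f"{agent_dir}/{label}.plist"
    daemon = f"/Library/LaunchDaemons/{label}.plist"
    return [
        ["launchctl", "bootout", f"gui/{uid}", agent],            # unload old agent first
        ["sudo", "cp", f"{draft_dir}/{label}.daemon.plist", daemon],
        ["sudo", "chown", "root:wheel", daemon],
        ["sudo", "chmod", "644", daemon],
        ["sudo", "launchctl", "bootstrap", "system", daemon],
        ["sudo", "launchctl", "print", f"system/{label}"],        # verify loaded
        ["mkdir", "-p", f"{agent_dir}/disabled"],                 # assumption: may not exist
        ["mv", agent, f"{agent_dir}/disabled/"],                  # keep for rollback
    ]

for cmd in install_commands("sleep-recovery", uid=501):
    print(shlex.join(cmd))
```

Keeping the bootout of the user-level agent as the first entry reflects the concern noted under Option 1 about never having both copies loaded at once.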

Phase 4: Test the new agents

  • Trigger each daemon manually: sudo launchctl kickstart -k system/com.missioncontrol.<name>
  • Confirm logs are written to expected paths with correct ownership (colegorringe:staff)
  • Confirm output matches what the user-level version produced
  • For sleep-recovery specifically: verify it correctly detects no missed jobs and exits clean

Phase 5: Simulated unattended-reboot test (OPTIONAL — only if Cole wants extra confidence)

  • Schedule a controlled reboot during business hours: sudo shutdown -r +1
  • Do NOT log in for 15 minutes
  • Verify via SSH from another machine: sudo launchctl print system/com.missioncontrol.<name> shows the daemon loaded
  • Verify the next scheduled run fires on time despite no login

Phase 6: Documentation

  • Update ~/ai-projects-local/mission-control/docs/architecture.md with the daemon migration
  • Update memory file user_hardware.md with the FileVault+daemon resolution
  • Add a lesson to lessons.md: "User-level LaunchAgents die when no user is logged in; system-level daemons with UserName key survive unattended reboots"

Acceptance Criteria

Must Have:

  • All three target daemons are loaded at system/ level after a reboot
  • Daemons fire on schedule even with no user logged in (verified via simulated test OR by waiting for next unattended reboot)
  • Log files remain colegorringe:staff-owned and user-writable
  • Each daemon's output matches its pre-migration user-level output
  • Old user-level plists are unloaded and moved to disabled/ (not deleted, for rollback)

Should Have:

  • sleep-recovery daemon explicitly verified to catch up missed jobs from a multi-hour gap
  • Documentation updated so future-Cole/future-Claude doesn't re-litigate the FileVault decision

Nice to Have:

  • Simulated unattended-reboot test passed
  • Telegram or macOS notification if any daemon misses its window by >2 hours (sleep-recovery extension)
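
The ">2 hours" rule in the last item could be expressed as a small predicate. This is a sketch only: how sleep-recovery records a job's last successful run is not described in this plan, so last_success is an assumed input and is_overdue is a hypothetical name.

```python
from datetime import datetime, timedelta

def is_overdue(expected_at, last_success, now, grace=timedelta(hours=2)):
    """True when the run due at expected_at has not succeeded and is more than grace late."""
    ran = last_success is not None and last_success >= expected_at
    return (not ran) and (now - expected_at) > grace

due = datetime(2026, 5, 30, 22, 0)            # nightly summary slot
previous = datetime(2026, 5, 29, 22, 1)       # last success was the night before

print(is_overdue(due, previous, now=datetime(2026, 5, 30, 23, 30)))  # inside grace window
print(is_overdue(due, previous, now=datetime(2026, 5, 31, 0, 30)))   # past grace window
```

The first call is still inside the two-hour grace period and the second is past it, so only the second would trigger a notification.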

Verification Steps

  1. After install, confirm each daemon: sudo launchctl print system/com.missioncontrol.<name> — state should be running/waiting, last exit code 0
  2. Confirm old user-level agents are no longer loaded: launchctl list | grep missioncontrol should NOT show the migrated three
  3. Tail each daemon's log after first scheduled fire: timestamps should match Hour/Minute in plist; output should look identical to historical user-level output
  4. Check file ownership: ls -la /Users/colegorringe/cron-logs/*.log — owned by colegorringe, not root
  5. (Optional) Trigger a controlled sudo shutdown -r +1, do not log in, verify next fire works
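
Step 4 can also be checked programmatically rather than by reading ls output. A minimal sketch, assuming the log directory from the Constraints section; wrong_owner is an illustrative helper, and the demo runs against a temporary directory owned by the current user rather than the real logs.

```python
import os
import tempfile
from pathlib import Path

def wrong_owner(log_dir, expected_uid, pattern="*.log"):
    """Return log files in log_dir whose owner UID is not expected_uid."""
    return sorted(
        p for p in Path(log_dir).glob(pattern) if p.stat().st_uid != expected_uid
    )

# Real use (on the Mac): wrong_owner("/Users/colegorringe/cron-logs",
#                                    pwd.getpwnam("colegorringe").pw_uid)
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "nightly.log").write_text("ok\n")
    offenders = wrong_owner(demo_dir, os.geteuid())

print("files with wrong owner:", offenders)
```

An empty result means every log is still owned by the expected user; any path listed would indicate a job that ran as root.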

Rollback Plan

If anything goes wrong:

  1. sudo launchctl bootout system/com.missioncontrol.<name>
  2. sudo rm /Library/LaunchDaemons/com.missioncontrol.<name>.plist
  3. mv ~/Library/LaunchAgents/disabled/com.missioncontrol.<name>.plist ~/Library/LaunchAgents/
  4. launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.missioncontrol.<name>.plist

Total rollback time per daemon: ~30 seconds.
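
The four rollback steps can be generated for all three daemons at once so they are ready to paste if needed. As with the install plan, this sketch only prints commands and runs nothing; rollback_commands is a hypothetical helper and the UID 501 is a placeholder for the output of id -u.

```python
import shlex

NAMES = ["sleep-recovery", "nightly-order-summary", "morningbriefing"]

def rollback_commands(name, uid, home="/Users/colegorringe"):
    """Build the rollback sequence for one daemon, in the order listed above."""
    label = f"com.missioncontrol.{name}"
    agent_dir = f"{home}/Library/LaunchAgents"
    return [
        ["sudo", "launchctl", "bootout", f"system/{label}"],
        ["sudo", "rm", f"/Library/LaunchDaemons/{label}.plist"],
        ["mv", f"{agent_dir}/disabled/{label}.plist", f"{agent_dir}/"],
        ["launchctl", "bootstrap", f"gui/{uid}", f"{agent_dir}/{label}.plist"],
    ]

for name in NAMES:
    print(f"# rollback {name}")
    for cmd in rollback_commands(name, uid=501):
        print(shlex.join(cmd))
```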


Execution Log

2026-06-02 16:45

Plan drafted. Awaiting Cole's review of:

  1. Approach (Option 1 with UserName=colegorringe)
  2. Scope (these three daemons only; others stay user-level)
  3. Decision on Phase 5 simulated reboot test
  4. Whether to include investigating root cause of May 30 reboot (would require sudo to read shutdown_stall report)

Lessons Learned

(To be filled in after completion)

What Worked:

What Didn't:

Next Time:


References

  • Memory: user_hardware.md (Mac Mini auto-login not set, FileVault on)
  • CLAUDE.md hard rule: "For any state-mutating action on a server, stop and ask"
  • Apple docs: launchd.plist(5), UserName and GroupName keys
  • Diagnostic report: /Library/Logs/DiagnosticReports/shutdown_stall_2026-05-30-041033_Mac-mini.shutdownStall (root-owned, cause TBD)
  • Current LaunchAgents to migrate:
      • ~/Library/LaunchAgents/com.missioncontrol.sleep-recovery.plist
      • ~/Library/LaunchAgents/com.missioncontrol.nightly-order-summary.plist
      • ~/Library/LaunchAgents/com.missioncontrol.morningbriefing.plist