What it does (ADO #582)
The TATER Agent (v2.4.15+) continuously watches OneDrive for Business sync health on Windows endpoints and reports problems centrally. It targets the "files aren't where I left them" / "OneDrive icon is gone" / "nothing is syncing" class of help-desk tickets, which are otherwise invisible until a user complains. It ships in three layers that build on each other:
- Detect & report - the agent runs 10 health checks every hour and reports the worst grade plus the specific failing signals.
- Admin alerting - confirmed problems automatically open an Ops task and fire a SIEM event, with per-device suppression.
- Self-heal & notify (opt-in) - safe inline fixes for the common cases, a desktop toast for the signed-in user, and admin-dispatched scripts for the destructive heals.
The 10 health checks
| Signal | Grade | Meaning | Suggested heal |
|---|---|---|---|
| `od.client.missing` | fail | OneDrive.exe not installed | REM_OD_001 |
| `od.client.not-running` | fail | OneDrive process not running for the active user | REM_OD_003 (Tier 1) |
| `od.client.buggy-version` | warn | Client is on a known data-loss version (e.g. 26.74, ADO #564) | REM_OD_002 |
| `od.registry.no-business1` | fail | No Business1 account registered - user not signed in | REM_OD_004 |
| `od.registry.sync-paused` | warn | Sync is paused | REM_OD_005 (Tier 1) |
| `od.cldflt.disabled` | critical | Cloud Files filter driver not running - placeholders inaccessible / 0-byte | REM_OD_006 (Tier 1) |
| `od.cldflt.reparse-orphans` | critical | Reparse points with no sync provider - "files not where I left them" | REM_OD_009 (destructive) |
| `od.kfm.misconfigured` | warn | Known Folder Move target mismatch | REM_OD_007 |
| `od.policy.conflicting` | warn | Local ClientPolicy.ini overrides managed Intune ADMX | REM_OD_008 |
| `od.hydration.fails` | fail | Attempted file hydration returned an error | REM_OD_010 (destructive) |
The overall device grade is the worst of any signal: critical > fail > warn > ok.
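The "worst grade wins" rule can be sketched in a few lines. This is an illustration only: the `SEVERITY` mapping and the `device_grade` helper are hypothetical names, not the agent's actual code; only the grade ordering comes from the text above.

```python
# Severity order taken from the text: critical > fail > warn > ok.
SEVERITY = {"ok": 0, "warn": 1, "fail": 2, "critical": 3}


def device_grade(signal_grades):
    """Return the worst grade among a device's signals, or "ok" if none fired."""
    return max(signal_grades.values(), key=SEVERITY.__getitem__, default="ok")


# Example: three signals firing at once on one device.
signals = {
    "od.registry.sync-paused": "warn",
    "od.client.not-running": "fail",
    "od.cldflt.disabled": "critical",
}
print(device_grade(signals))
```

A device with no firing signals falls through to `ok`, which matches the "returns to healthy" state that closes an Ops task.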
2-cycle hysteresis - no cry-wolf
OneDrive routinely reports a momentary warning during a reboot, network blip, or large download. To avoid paging IT on transient noise, a signal must appear on two consecutive hourly polls before it is marked confirmed. Only confirmed signals drive alerts, tasks, and heals. The agent persists its cycle state at %ProgramData%\TATER\onedrive-state.json.
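A minimal sketch of the two-poll confirmation, assuming the state file simply stores the set of signals seen on the previous poll. The `last_poll` key and the `confirm_signals` helper are assumptions for illustration; the real schema of onedrive-state.json is not documented here.

```python
import json
import os
import tempfile
from pathlib import Path


def confirm_signals(current, state_path):
    """Return signals seen on this poll AND the previous one, then persist this poll."""
    path = Path(state_path)
    previous = set()
    if path.exists():
        # "last_poll" is a hypothetical key, not the agent's real schema.
        previous = set(json.loads(path.read_text()).get("last_poll", []))
    confirmed = sorted(set(current) & previous)
    path.write_text(json.dumps({"last_poll": sorted(set(current))}))
    return confirmed


# Simulate three hourly polls against a throwaway state file.
with tempfile.TemporaryDirectory() as tmp:
    state = os.path.join(tmp, "onedrive-state.json")
    first = confirm_signals(["od.registry.sync-paused"], state)   # first sighting
    second = confirm_signals(["od.registry.sync-paused"], state)  # second sighting
    third = confirm_signals([], state)                            # signal cleared
    print(first, second, third)
```

A one-off blip never reaches the second sighting, so it never produces an alert, task, or heal.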
Admin alerting
When a confirmed signal reaches the alert threshold (default fail or worse), TATER automatically:
- Opens one Ops task per device (not per signal) summarizing every confirmed problem and the suggested heal script. The task is a living ticket: if the signal set changes, it is updated in place, and it auto-closes when the device returns to healthy.
- Fires an `onedrive.health.degraded` SIEM webhook event (and `onedrive.health.recovered` on recovery).
Per-org configuration
Everything works out of the box with no setup. To tune it, an Admin sets Settings → integrations → onedriveHealth:
| Field | Default | Effect |
|---|---|---|
| `enabled` | true | Master switch for alerting |
| `minGradeForAlert` | fail | Lowest grade that opens a task (`warn`, `fail`, or `critical`) |
| `autoCreateTasks` | true | Whether to open Ops tasks (SIEM event still fires when off) |
| `taskAssignee` | (none) | Optional email to assign auto-created tasks to |
| `suppressedDevices` | [] | Hostnames that never alert |
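To show how these fields interact, here is a rough sketch of the alert decision for one device. The field names and defaults come from the table; the `alert_actions` function, the action labels, and the case-insensitive hostname match are assumptions made for illustration, not TATER's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SEVERITY = {"ok": 0, "warn": 1, "fail": 2, "critical": 3}


@dataclass
class OnedriveHealthConfig:
    # Defaults mirror the table above.
    enabled: bool = True
    minGradeForAlert: str = "fail"
    autoCreateTasks: bool = True
    taskAssignee: Optional[str] = None
    suppressedDevices: List[str] = field(default_factory=list)


def alert_actions(cfg, hostname, confirmed_grade):
    """Return the actions to take for a device's worst confirmed grade."""
    # Assumption: hostnames are compared case-insensitively.
    suppressed = {h.casefold() for h in cfg.suppressedDevices}
    if not cfg.enabled or hostname.casefold() in suppressed:
        return set()
    if SEVERITY[confirmed_grade] < SEVERITY[cfg.minGradeForAlert]:
        return set()
    actions = {"siem_event"}  # the SIEM event fires even when tasks are off
    if cfg.autoCreateTasks:
        actions.add("ops_task")
    return actions


print(sorted(alert_actions(OnedriveHealthConfig(), "LAB-WS-07", "fail")))
```

With the defaults, a confirmed `warn` produces no actions, while `fail` or `critical` produces both the task and the event.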
Suppressing a noisy device
For a device with an expected, accepted OneDrive issue, suppress it:
POST /api/onedrive-health/suppress
{ "hostname": "LAB-WS-07", "suppress": true }
Suppressed devices still report health (they appear dimmed in the fleet view) but never open tasks or fire events.
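For scripting the suppression call, the request can be built with the standard library. This is a sketch under assumptions: the base URL is a placeholder, bearer-token authentication is a guess at how the API is protected, and sending `"suppress": false` to lift a suppression is inferred from the payload shape rather than documented.

```python
import json
import urllib.request


def build_suppress_request(base_url, hostname, suppress=True, token=None):
    """Build (but do not send) the POST request for the suppress endpoint."""
    body = json.dumps({"hostname": hostname, "suppress": suppress}).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/onedrive-health/suppress",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    if token:
        # Assumption: the API accepts a bearer token; adjust to your auth scheme.
        req.add_header("Authorization", f"Bearer {token}")
    return req


# "https://tater.example.com" is a placeholder host, not a real endpoint.
req = build_suppress_request("https://tater.example.com", "LAB-WS-07")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```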
Self-heal & user notification (opt-in)
Self-heal is off by default. Enable it per device by setting onedriveSelfHeal: true in the agent's config.json. When it is on, the agent attempts safe Tier 1 fixes as soon as a signal is confirmed:
| Signal | Tier 1 action | Context |
|---|---|---|
| sync-paused | clear the pause flag + nudge the client | user session |
| client not-running | relaunch OneDrive.exe /background | user session |
| cldflt disabled | start the cldflt service | machine (SYSTEM) |
Every heal is gated for safety: it is never attempted on a known-buggy client version, user-session heals are skipped when the agent runs as SYSTEM (service mode), and the same action is not retried within 6 hours. Outcomes (✅ success / ❌ failed / ⏭️ skipped, with reason) are reported back and visible in the fleet view and the list_onedrive_issues MCP tool.
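The three safety gates can be expressed as a small decision function. This is a sketch, not the agent's code: the `heal_decision` name, the order in which the gates are checked, and the reason strings are assumptions; the gates themselves and the heal contexts come from the text and table above.

```python
from datetime import datetime, timedelta

RETRY_WINDOW = timedelta(hours=6)

# Heal context per signal, from the Tier 1 table above.
HEAL_CONTEXT = {
    "od.registry.sync-paused": "user session",
    "od.client.not-running": "user session",
    "od.cldflt.disabled": "machine (SYSTEM)",
}


def heal_decision(signal, client_is_buggy, running_as_system, last_attempt, now):
    """Return ("attempt", None) or ("skipped", reason) for a confirmed signal."""
    if signal not in HEAL_CONTEXT:
        return ("skipped", "no Tier 1 heal for this signal")
    if client_is_buggy:
        return ("skipped", "known-buggy client version")
    if HEAL_CONTEXT[signal] == "user session" and running_as_system:
        return ("skipped", "user-session heal unavailable in service mode")
    if last_attempt is not None and now - last_attempt < RETRY_WINDOW:
        return ("skipped", "same action attempted within 6 hours")
    return ("attempt", None)


now = datetime(2025, 1, 1, 12, 0)
print(heal_decision("od.registry.sync-paused", False, False, None, now))
print(heal_decision("od.registry.sync-paused", False, True, None, now))
```

Returning a reason alongside each skip mirrors the "skipped, with reason" outcome that is reported back to the fleet view.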
User notification (Tier 2)
In tray mode, when a confirmed user-impacting issue persists, the signed-in user gets a non-alarming Windows toast. It is framed as "fixed automatically" if a heal succeeded, otherwise as "needs attention - IT has been notified; your files are safe." Toasts are throttled to at most one every 4 hours.
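The toast framing and throttle can be sketched as follows. The `next_toast` helper is hypothetical, and the message strings are the short framings quoted above rather than the agent's exact toast text.

```python
from datetime import datetime, timedelta

TOAST_INTERVAL = timedelta(hours=4)


def next_toast(heal_succeeded, last_toast, now):
    """Return the toast message to show, or None if the throttle suppresses it."""
    if last_toast is not None and now - last_toast < TOAST_INTERVAL:
        return None
    if heal_succeeded:
        return "fixed automatically"
    return "needs attention - IT has been notified; your files are safe."


now = datetime(2025, 1, 1, 12, 0)
print(next_toast(True, None, now))
print(next_toast(False, now - timedelta(hours=1), now))
```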
Destructive heals - admin-dispatch only
The two destructive remediations are never run automatically. They are shipped as scripts an admin dispatches deliberately (e.g. via execute_script targeting the hostname), and each self-gates against known-buggy client versions - running them on a buggy build is the exact data-loss scenario ADO #564 documented.
| Script | What it does | Safety |
|---|---|---|
| `OneDrive/REM_OD_009_UnregisterSyncRoot.ps1` | Clears orphaned cldflt reparse points via CfUnregisterSyncRoot | SYSTEM context required; version-gated; placeholders carry no local content so no user data is deleted |
| `OneDrive/REM_OD_010_ResetClient.ps1` | Full onedrive.exe /reset for deep corruption | Version-gated; non-destructive to cloud data; rebuilds the local cache |
Triage workflow (MCP)
For an AI agent (Claude / Copilot) connected to TATER's MCP, the help-desk path is:
1. `list_onedrive_issues` - see every endpoint with a OneDrive problem, the confirmed signals (✓), and recent heal outcomes.
2. `find_device_by_user(query)` - map a user's complaint ("Maria can't open her files") to a hostname.
3. `execute_script(hostname, "OneDrive/REM_OD_00X...")` - dispatch the suggested heal.
4. Per the Three-Doc Rule, file a Tasker task documenting the user impact and write a TATERpedia page if the pattern recurs across the fleet.
Platform & scope
OneDrive sync health is Windows-only - macOS and Linux agents run the monitor as a no-op (there is no Microsoft OneDrive sync engine to watch). Data is retained for 30 days per device (Cosmos TTL on the OnedriveHealthInventories container).
Related
- Agent Deployment - install the agent + the v2.4.16 heartbeat/health goroutines
- Fleet Management - the Devices page where OneDrive grades surface
- MCP Setup - connect Claude / Copilot for AI-driven triage