OneDrive Sync Health

What it does ADO #582

The TATER Agent (v2.4.15+) continuously watches OneDrive for Business sync health on Windows endpoints and reports problems centrally - the "files aren't where I left them" / "OneDrive icon is gone" / "nothing is syncing" class of help-desk tickets that are otherwise invisible until a user complains. It ships in three layers that build on each other:

Detect & report - the agent runs 10 health checks every hour and reports the worst grade plus the specific failing signals.
Admin alerting - confirmed problems automatically open an Ops task and fire a SIEM event, with per-device suppression.
Self-heal & notify (opt-in) - safe inline fixes for the common cases, a desktop toast for the signed-in user, and admin-dispatched scripts for the destructive heals.

The 10 health checks

Signal	Grade	Meaning	Suggested heal
`od.client.missing`	fail	OneDrive.exe not installed	REM_OD_001
`od.client.not-running`	fail	OneDrive process not running for the active user	REM_OD_003 (Tier 1)
`od.client.buggy-version`	warn	Client is on a known data-loss version (e.g. 26.74, ADO #564)	REM_OD_002
`od.registry.no-business1`	fail	No Business1 account registered - user not signed in	REM_OD_004
`od.registry.sync-paused`	warn	Sync is paused	REM_OD_005 (Tier 1)
`od.cldflt.disabled`	critical	Cloud Files filter driver not running - placeholders inaccessible / 0-byte	REM_OD_006 (Tier 1)
`od.cldflt.reparse-orphans`	critical	Reparse points with no sync provider - "files not where I left them"	REM_OD_009 (destructive)
`od.kfm.misconfigured`	warn	Known Folder Move target mismatch	REM_OD_007
`od.policy.conflicting`	warn	Local ClientPolicy.ini overrides managed Intune ADMX	REM_OD_008
`od.hydration.fails`	fail	Attempted file hydration returned an error	REM_OD_010 (destructive)

The overall device grade is the worst of any signal: critical > fail > warn > ok.

2-cycle hysteresis - no cry-wolf

OneDrive routinely reports a momentary warning during a reboot, network blip, or large download. To avoid paging IT on transient noise, a signal must appear on two consecutive hourly polls before it is marked confirmed. Only confirmed signals drive alerts, tasks, and heals. The agent persists its cycle state at %ProgramData%\TATER\onedrive-state.json.

Admin alerting

When a confirmed signal reaches the alert threshold (default fail or worse), TATER automatically:

Opens one Ops task per device (not per signal) summarizing every confirmed problem + the suggested heal script. The task is a living ticket - if the signal set changes it is updated in place, and it auto-closes when the device returns to healthy.
Fires a onedrive.health.degraded SIEM webhook event (and onedrive.health.recovered on recovery).

Per-org configuration

Everything works out of the box with no setup. To tune it, an Admin sets Settings → integrations → onedriveHealth:

Field	Default	Effect
`enabled`	true	Master switch for alerting
`minGradeForAlert`	fail	Lowest grade that opens a task (`warn` \| `fail` \| `critical`)
`autoCreateTasks`	true	Whether to open Ops tasks (SIEM event still fires when off)
`taskAssignee`	(none)	Optional email to assign auto-created tasks to
`suppressedDevices`	[]	Hostnames that never alert

Suppressing a noisy device

For a device with an expected, accepted OneDrive issue, suppress it:

POST /api/onedrive-health/suppress
{ "hostname": "LAB-WS-07", "suppress": true }

Suppressed devices still report health (they appear dimmed in the fleet view) but never open tasks or fire events.

Self-heal & user notification (opt-in)

Self-heal is off by default. Enable it per device by setting onedriveSelfHeal: true in the agent config.json. When on, the agent attempts safe Tier 1 fixes the moment a signal confirms:

Signal	Tier 1 action	Context
sync-paused	clear the pause flag + nudge the client	user session
client not-running	relaunch OneDrive.exe /background	user session
cldflt disabled	start the cldflt service	machine (SYSTEM)

Every heal is gated for safety: it is never attempted on a known-buggy client version, user-session heals are skipped when the agent runs as SYSTEM (service mode), and the same action is not retried within 6 hours. Outcomes (✅ success / ❌ failed / ⏭️ skipped, with reason) are reported back and visible in the fleet view and the list_onedrive_issues MCP tool.

User notification (Tier 2)

In tray mode, when a confirmed user-impacting issue persists the signed-in user gets a non-alarming Windows toast - framed as "fixed automatically" if a heal succeeded, otherwise "needs attention - IT has been notified; your files are safe." Throttled to at most once per 4 hours.

Destructive heals - admin-dispatch only

The two destructive remediations are never run automatically. They are shipped as scripts an admin dispatches deliberately (e.g. via execute_script targeting the hostname), and each self-gates against known-buggy client versions - running them on a buggy build is the exact data-loss scenario ADO #564 documented.

Script	What it does	Safety
`OneDrive/REM_OD_009_UnregisterSyncRoot.ps1`	Clears orphaned cldflt reparse points via `CfUnregisterSyncRoot`	SYSTEM context required; version-gated; placeholders carry no local content so no user data is deleted
`OneDrive/REM_OD_010_ResetClient.ps1`	Full `onedrive.exe /reset` for deep corruption	Version-gated; non-destructive to cloud data; rebuilds the local cache

Triage workflow (MCP)

For an AI agent (Claude / Copilot) connected to TATER's MCP, the help-desk path is:

list_onedrive_issues - see every endpoint with a OneDrive problem, the confirmed signals (✓), and recent heal outcomes.
find_device_by_user(query) - map a user's complaint ("Maria can't open her files") to a hostname.
execute_script(hostname, "OneDrive/REM_OD_00X...") - dispatch the suggested heal.
Per the Three-Doc Rule, file a Tasker task documenting the user impact and write a TATERpedia page if the pattern recurs across the fleet.

Platform & scope

OneDrive sync health is Windows-only - macOS and Linux agents run the monitor as a no-op (there is no Microsoft OneDrive sync engine to watch). Data is retained for 30 days per device (Cosmos TTL on the OnedriveHealthInventories container).

Agent Deployment - install the agent + the v2.4.16 heartbeat/health goroutines
Fleet Management - the Devices page where OneDrive grades surface
MCP Setup - connect Claude / Copilot for AI-driven triage