← Help & Docs

OneDrive Sync Health

Automatic OneDrive sync monitoring, admin alerting, opt-in self-heal, and user notification. Last updated 2026-06-03

What it does ADO #582

The TATER Agent (v2.4.15+) continuously watches OneDrive for Business sync health on Windows endpoints and reports problems centrally - the "files aren't where I left them" / "OneDrive icon is gone" / "nothing is syncing" class of help-desk tickets that are otherwise invisible until a user complains. It ships in three layers that build on each other:

The 10 health checks

SignalGradeMeaningSuggested heal
od.client.missingfailOneDrive.exe not installedREM_OD_001
od.client.not-runningfailOneDrive process not running for the active userREM_OD_003 (Tier 1)
od.client.buggy-versionwarnClient is on a known data-loss version (e.g. 26.74, ADO #564)REM_OD_002
od.registry.no-business1failNo Business1 account registered - user not signed inREM_OD_004
od.registry.sync-pausedwarnSync is pausedREM_OD_005 (Tier 1)
od.cldflt.disabledcriticalCloud Files filter driver not running - placeholders inaccessible / 0-byteREM_OD_006 (Tier 1)
od.cldflt.reparse-orphanscriticalReparse points with no sync provider - "files not where I left them"REM_OD_009 (destructive)
od.kfm.misconfiguredwarnKnown Folder Move target mismatchREM_OD_007
od.policy.conflictingwarnLocal ClientPolicy.ini overrides managed Intune ADMXREM_OD_008
od.hydration.failsfailAttempted file hydration returned an errorREM_OD_010 (destructive)

The overall device grade is the worst of any signal: critical > fail > warn > ok.

2-cycle hysteresis - no cry-wolf

OneDrive routinely reports a momentary warning during a reboot, network blip, or large download. To avoid paging IT on transient noise, a signal must appear on two consecutive hourly polls before it is marked confirmed. Only confirmed signals drive alerts, tasks, and heals. The agent persists its cycle state at %ProgramData%\TATER\onedrive-state.json.

Admin alerting

When a confirmed signal reaches the alert threshold (default fail or worse), TATER automatically:

Per-org configuration

Everything works out of the box with no setup. To tune it, an Admin sets Settings → integrations → onedriveHealth:

FieldDefaultEffect
enabledtrueMaster switch for alerting
minGradeForAlertfailLowest grade that opens a task (warn | fail | critical)
autoCreateTaskstrueWhether to open Ops tasks (SIEM event still fires when off)
taskAssignee(none)Optional email to assign auto-created tasks to
suppressedDevices[]Hostnames that never alert

Suppressing a noisy device

For a device with an expected, accepted OneDrive issue, suppress it:

POST /api/onedrive-health/suppress
{ "hostname": "LAB-WS-07", "suppress": true }

Suppressed devices still report health (they appear dimmed in the fleet view) but never open tasks or fire events.

Self-heal & user notification (opt-in)

Self-heal is off by default. Enable it per device by setting onedriveSelfHeal: true in the agent config.json. When on, the agent attempts safe Tier 1 fixes the moment a signal confirms:

SignalTier 1 actionContext
sync-pausedclear the pause flag + nudge the clientuser session
client not-runningrelaunch OneDrive.exe /backgrounduser session
cldflt disabledstart the cldflt servicemachine (SYSTEM)

Every heal is gated for safety: it is never attempted on a known-buggy client version, user-session heals are skipped when the agent runs as SYSTEM (service mode), and the same action is not retried within 6 hours. Outcomes (✅ success / ❌ failed / ⏭️ skipped, with reason) are reported back and visible in the fleet view and the list_onedrive_issues MCP tool.

User notification (Tier 2)

In tray mode, when a confirmed user-impacting issue persists the signed-in user gets a non-alarming Windows toast - framed as "fixed automatically" if a heal succeeded, otherwise "needs attention - IT has been notified; your files are safe." Throttled to at most once per 4 hours.

Destructive heals - admin-dispatch only

The two destructive remediations are never run automatically. They are shipped as scripts an admin dispatches deliberately (e.g. via execute_script targeting the hostname), and each self-gates against known-buggy client versions - running them on a buggy build is the exact data-loss scenario ADO #564 documented.

ScriptWhat it doesSafety
OneDrive/REM_OD_009_UnregisterSyncRoot.ps1Clears orphaned cldflt reparse points via CfUnregisterSyncRootSYSTEM context required; version-gated; placeholders carry no local content so no user data is deleted
OneDrive/REM_OD_010_ResetClient.ps1Full onedrive.exe /reset for deep corruptionVersion-gated; non-destructive to cloud data; rebuilds the local cache

Triage workflow (MCP)

For an AI agent (Claude / Copilot) connected to TATER's MCP, the help-desk path is:

  1. list_onedrive_issues - see every endpoint with a OneDrive problem, the confirmed signals (✓), and recent heal outcomes.
  2. find_device_by_user(query) - map a user's complaint ("Maria can't open her files") to a hostname.
  3. execute_script(hostname, "OneDrive/REM_OD_00X...") - dispatch the suggested heal.
  4. Per the Three-Doc Rule, file a Tasker task documenting the user impact and write a TATERpedia page if the pattern recurs across the fleet.

Platform & scope

OneDrive sync health is Windows-only - macOS and Linux agents run the monitor as a no-op (there is no Microsoft OneDrive sync engine to watch). Data is retained for 30 days per device (Cosmos TTL on the OnedriveHealthInventories container).

Related