Observability for Distributed Teams: What Actually Matters

Most observability problems aren't tooling problems. They're ownership problems.

The Starting Point

We have a lot of monitoring at Citrix. Prometheus, Grafana, Datadog, Elasticsearch, custom dashboards, Slack alert channels. The tooling isn't the issue. The issue is that four teams can be looking at the same incident and seeing completely different things, because they're each watching their own slice of the stack without context about what the other slices are doing.

This is a common pattern with distributed teams. Each group builds observability for their own component — and they do it well. The session broker team watches broker metrics. The infrastructure team watches host health. The networking team watches latency and throughput. Individually, these are solid. Together, they create a situation where nobody has the full picture during an incident, and the first 15 minutes of every outage are spent figuring out who should be looking at what.

I've been thinking about this for a while, and I don't think the answer is a single unified dashboard (we tried that — more on it below). The answer is more about how teams agree on shared signals and who's responsible for what when those signals go wrong.

Dashboards vs. Understanding

There's a natural tendency to solve observability gaps by building more dashboards. Team doesn't have visibility into something? Add a panel. Cross-team dependency is unclear? Create a combined view. Executive wants to know system health? Build a status page.

We went through a phase where we had dashboards for everything. Hundreds of them. The problem wasn't a lack of data visualization — it was that most dashboards were write-once, read-never. Someone builds a dashboard during an incident or a project, it goes stale within a month, and then it sits there giving everyone a false sense of coverage.

The dashboards that actually get used share a few traits:

  • They answer a specific question that someone asks regularly ("Are we ready to release?" or "How did sessions perform overnight?")
  • They have an owner who keeps them accurate
  • They stay small, ideally fewer than 10 panels. Once a dashboard crosses ~12, people stop reading it carefully

Everything else tends to decay. We've started pruning old dashboards quarterly, which feels wrong but has been helpful. Fewer dashboards, but the ones that remain are actually maintained.

Alert Fatigue Is a Coordination Problem

Alert fatigue comes up constantly in SRE conversations, usually framed as a tuning problem — thresholds are too sensitive, alerts fire too often, on-call engineers get paged for non-issues. Those are real, but in a distributed team setting, there's a subtler version: alerts that fire correctly but reach the wrong team.

Example: a storage latency spike on a DaaS host pool triggers an infra alert. The infra team investigates, finds nothing wrong with the storage layer, closes the alert. Meanwhile, the actual cause was a profile management change that the desktop team pushed, which doubled the I/O load during login. The infra alert was technically correct — storage latency did spike — but the team that received it couldn't fix it, and the team that caused it never saw an alert at all.

This happens more than you'd think. The symptom shows up in one team's metrics, but the root cause lives in another team's domain. Without cross-team correlation, you end up with a lot of "we investigated and it's not us" conversations.

We've had some success with what we call "symptom-to-cause" alert chains — when a symptom alert fires (e.g., session launch latency > threshold), it automatically pulls in metrics from adjacent systems (broker load, profile load time, network latency, host availability) and presents them together. It doesn't solve the ownership problem, but it shortens the "whose problem is this?" phase from 15 minutes to about 2.
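
The chain itself is a small correlation step: when the symptom alert fires, query each adjacent system and attach the readings to the alert payload. A minimal sketch; the metric names and the `query_metric` callable below are illustrative placeholders, not our actual schema:

```python
from typing import Callable

# Adjacent systems to pull in per symptom metric. Names are illustrative.
ADJACENT_METRICS = {
    "session_launch_latency": [
        "broker_active_sessions",
        "profile_load_time_p95",
        "network_rtt_ms",
        "host_pool_available_pct",
    ],
}

def build_alert_context(symptom: str, query_metric: Callable[[str], float]) -> dict:
    """Bundle the symptom with current readings from adjacent systems so
    the on-call engineer sees one correlated view instead of four silos."""
    return {
        "symptom": symptom,
        "adjacent": {m: query_metric(m) for m in ADJACENT_METRICS.get(symptom, [])},
    }
```

The key design point is that the adjacency list is static configuration, not inferred root cause: it only says "these systems are worth a glance," which is cheap to maintain and hard to get badly wrong.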

Who Owns the Metric?

This sounds like a simple question, but it caused more confusion than I expected. When session launch latency degrades, who owns that metric? The broker team (because the broker routes the session)? The infra team (because the hosts might be slow)? The desktop team (because profile load might be the bottleneck)? The network team (because latency between components might be the issue)?

The answer we landed on: the metric owner is the team closest to the user-facing outcome, regardless of root cause. For session launch latency, that's the broker team. They own the SLI, they triage the alert, and they pull in the right team based on what the data shows. They don't have to fix everything — they just have to figure out who should.

This was a deliberate choice. The alternative — routing alerts based on probable cause — requires accurate root cause detection, which is hard to automate and wrong often enough to cause problems. Having a stable owner for each user-facing metric creates accountability and avoids the "alert ping-pong" where tickets bounce between teams.

We formalized this in a document we call the Metric Ownership Map. It's a simple table: metric name, owner team, escalation path, and the 3-4 most common root causes with the corresponding team to pull in. Nothing fancy. It just needs to exist and be maintained.
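
The map can live as data next to the alerting config rather than in a wiki page, which makes it queryable at triage time. A minimal sketch with illustrative team names and causes (not our actual map):

```python
# Metric Ownership Map as data: metric name, owning team, escalation path,
# and the most common root causes with the team to pull in for each.
# All names here are illustrative examples.
OWNERSHIP_MAP = {
    "session_launch_latency": {
        "owner": "broker-team",
        "escalation": ["broker-oncall", "broker-lead", "platform-director"],
        "common_causes": {
            "host exhaustion": "infra-team",
            "profile load spike": "desktop-team",
            "inter-DC latency": "network-team",
        },
    },
}

def teams_to_pull_in(metric: str) -> list:
    """Owner first, then the teams associated with the common root causes."""
    entry = OWNERSHIP_MAP[metric]
    return [entry["owner"], *entry["common_causes"].values()]
```

Keeping it in version control also means ownership changes go through review, which is where stale entries get caught.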

Cross-Team Signals

The hardest part of observability in a distributed team isn't collecting data — it's getting the right data to the right people at the right time. Each team naturally instruments what they control. The gaps are in the spaces between teams.

A few patterns we've found useful:

Shared correlation IDs. Every request gets a correlation ID at the edge that propagates through every service. This is standard advice, but it's worth emphasizing how often it breaks down in practice. Legacy services that don't propagate the header, batch jobs that generate their own IDs, async processes where the correlation gets lost. Getting consistent correlation IDs across an entire stack takes sustained effort.
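
One way to keep the ID from getting lost in async code is to carry it in a context variable rather than threading it through every function signature. A minimal sketch, with hypothetical helper names:

```python
import uuid
import contextvars

# Request-scoped correlation ID. Set once at the edge, readable everywhere.
# contextvars survive async task switches, which is exactly where
# correlation most often gets dropped in practice.
_correlation_id = contextvars.ContextVar("correlation_id", default=None)

def ensure_correlation_id(incoming):
    """Reuse the caller's ID if the header was propagated; mint a new one
    only at the edge, so every hop downstream shares the same ID."""
    cid = incoming or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid

def current_correlation_id():
    """Read the ID anywhere (log formatters, outbound HTTP clients)."""
    return _correlation_id.get()
```

The batch-job and legacy-service failure modes still need their own fixes, but this pattern at least removes "someone forgot to pass the argument" as a cause.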

Change event feeds. When any team pushes a config change, deployment, or feature flag toggle, it shows up in a shared event feed that's overlaid on dashboards. This is one of the highest-value, lowest-cost observability improvements we've made. "Something changed at 10:14 AM" is often enough to immediately narrow down an incident.
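
If the dashboards are in Grafana, the overlay can be as simple as POSTing one annotation per change event to Grafana's annotations HTTP API. A sketch of the payload builder (team and tag names are illustrative; the `time` field is epoch milliseconds, which is what the API expects):

```python
import time

def change_event_annotation(team, change_type, description):
    """Build a Grafana annotation payload for a change event. POSTed to
    /api/annotations, it shows up as a vertical marker on every dashboard
    configured to display annotations matching these tags."""
    return {
        "time": int(time.time() * 1000),  # epoch milliseconds
        "tags": ["change-event", team, change_type],
        "text": description,
    }
```

Tagging by team and change type is what makes the feed filterable later: during an incident you can show only the changes from the last hour, across every team.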

SLO burn rate over threshold alerts. Instead of alerting on every metric threshold, we alert on SLO burn rate — how fast we're consuming our error budget. This naturally filters out noise (brief spikes don't burn much budget) and surfaces sustained degradation that individual metric alerts might miss. It also creates a shared language: "We've burned 40% of our monthly session launch error budget in 3 days" means something to every team, not just the one watching the metric.
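
The arithmetic behind that shared language is small enough to show. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target): a burn rate of 1.0 spends the budget exactly over the window, and "40% of the monthly budget in 3 days" corresponds to a burn rate of 4. A sketch:

```python
def burn_rate(error_rate, slo_target):
    """Burn rate = observed error rate / error budget (1 - SLO target).
    For a 99.9% SLO, an error rate of 0.0144 is a burn rate of 14.4,
    which would exhaust a 30-day budget in about 2 days."""
    return error_rate / (1.0 - slo_target)

def budget_consumed(error_rate, slo_target, days, window_days=30.0):
    """Fraction of the window's error budget consumed after `days`
    at a sustained error rate."""
    return burn_rate(error_rate, slo_target) * days / window_days
```

This is also why brief spikes get filtered out naturally: a one-minute spike, however tall, contributes almost nothing to `days`, so it barely moves the budget.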

What Helped

Incident reviews that focus on detection time. We already did post-incident reviews, but we added a specific section: "How long after the problem started did we detect it, and why did it take that long?" This surfaces observability gaps more reliably than any audit. The gaps that matter are the ones that actually slowed down an incident response.

On-call rotations that include a TPM. Not as a responder, but as an observer. Once a quarter, I shadow the on-call rotation for a week. I don't fix anything — I watch how people investigate. Where they go first, what data they look for, what's missing, where they get stuck. This has been a better source of observability improvement ideas than any planning session.

Killing dashboards. As mentioned above. Fewer dashboards, higher quality. We removed about 60% of our Grafana dashboards over six months. Nobody complained about any of them.

Weekly observability office hours. An open 30-minute slot where anyone can bring a question about metrics, alerts, or dashboards. Low attendance most weeks (3-5 people), but the questions that come up are always worth addressing. It's also where we catch stale alerts and orphaned dashboards.

What Didn't

The unified dashboard. We spent a few weeks building a "single pane of glass" for system health. It looked impressive in a demo. In practice, it was too high-level to be useful for investigation and too low-level to be useful for executives. It ended up in the dashboard graveyard within two months. I think these can work, but they need a very specific audience and use case. "Everyone" is not a use case.

Mandating structured logging without migration support. We said "all services must use structured JSON logging by Q2." Reasonable goal. But we didn't provide a library or migration guide, so each team implemented it differently. Field names weren't consistent, some teams used nested objects, others used flat keys. We got structured logs, but we couldn't query across them effectively. Should have started with a shared logging library.
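
In hindsight, the shared library could have started as something very small. A minimal sketch of a JSON logging helper built on the standard library (field names here are illustrative; the point is that every service imports the same one instead of inventing its own):

```python
import json
import logging
import sys

def configure_json_logging(service):
    """Return a logger that emits one flat JSON object per line, with a
    fixed field set so logs from different services are cross-queryable."""
    class JsonFormatter(logging.Formatter):
        def format(self, record):
            # Flat keys, no nesting: the consistency is the whole point.
            return json.dumps({
                "ts": self.formatTime(record),
                "level": record.levelname,
                "service": service,
                "msg": record.getMessage(),
            })

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    return logger
```

Even a helper this thin settles the arguments that bit us: flat keys vs. nested objects, and what the timestamp and level fields are called.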

Alert-driven SLOs. We tried defining SLOs based on existing alerts — if this alert fires X times per month, our SLO is Y. This gets it backwards. SLOs should come from user expectations, and alerts should be derived from SLOs. Starting from alerts just codifies your existing noise floor.