Splunk Observability Cloud provides pre-built templates and best practice guidelines for creating detectors that cover the most frequent and important operational scenarios.
By understanding these common use cases, you can:
Quickly build effective and reliable monitoring systems.
Reduce incident detection time.
Ensure operational health across your infrastructure and applications.
Good detectors are critical for keeping systems stable, identifying risks early, and minimizing service disruption.
Below are ten common use cases for detectors, along with examples of how to design them.
1. CPU Usage
Metric: cpu.utilization
Condition: Average CPU usage greater than 80 percent for five consecutive minutes.
Alert: Raise a critical-severity alert if the condition is sustained for ten minutes.
High CPU usage can lead to system slowdown or crashes if not addressed quickly.
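As an illustration, a minimal SignalFlow sketch of this detector might look like the following. The per-host grouping is an assumption, and severity is assigned to the published rule in the detector configuration rather than in SignalFlow itself.

```
# Average CPU utilization per host (the 'host' dimension is an assumption).
cpu = data('cpu.utilization').mean(by=['host'])
# Fire only when the 80 percent threshold is breached continuously for 10 minutes.
detect(when(cpu > 80, '10m')).publish('cpu_utilization_high')
```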
2. Memory Pressure
Metric: memory.used_percent
Condition: Memory usage exceeds 90 percent.
Action:
Start with a warning alert.
Escalate to critical if the memory pressure is not resolved within a set time.
Monitoring memory is important to prevent system instability and application crashes.
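A hedged SignalFlow sketch of this escalation could publish two rules from one detector; the warning and critical rules would be mapped to their severities in the detector configuration, and the 5- and 15-minute durations stand in for the unspecified "set time".

```
mem = data('memory.used_percent').mean(by=['host'])
# Warning rule: memory pressure sustained for 5 minutes.
detect(when(mem > 90, '5m')).publish('memory_pressure_warning')
# Critical rule: the same condition still unresolved after 15 minutes.
detect(when(mem > 90, '15m')).publish('memory_pressure_critical')
```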
3. Disk Space Monitoring
Metric: disk.fs.used_percent
Condition: Disk usage greater than 85 percent.
Additional Note:
Trigger the alert only if the condition persists for 15 minutes.
This avoids false positives from short-term spikes.
Running out of disk space can cause severe service disruptions, including system failures.
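A minimal SignalFlow sketch, assuming a per-host grouping, can express the 15-minute persistence requirement directly in the when() duration:

```
disk = data('disk.fs.used_percent').mean(by=['host'])
# Only fire if usage stays above 85 percent for a full 15 minutes,
# filtering out short-term spikes.
detect(when(disk > 85, '15m')).publish('disk_space_low')
```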
4. Host Availability
Metric: Heartbeat signal or system uptime metric.
Condition: No data received for more than five minutes.
Special Note:
Host down detection is essential for ensuring server availability and quick response to hardware or network failures.
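Heartbeat checks are typically built from the not_reporting module rather than a threshold condition. The sketch below assumes the code shape that Splunk Observability Cloud's built-in Heartbeat Check generates, with a placeholder uptime metric.

```
from signalfx.detectors.not_reporting import not_reporting
# 'host.uptime' is a placeholder for whatever heartbeat/uptime metric you emit.
hb = data('host.uptime').publish(label='hb', enable=False)
# Alert when a previously reporting time series goes silent for 5 minutes.
not_reporting.detector(stream=hb, resource_identifier=None, duration='5m').publish('host_down')
```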
5. Application Error Rates
Metric: http.server.errors
Condition: Error rate exceeds five percent over baseline.
Importance:
Tracking error rates is crucial for maintaining application reliability and user satisfaction.
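One hedged way to express "five percent over baseline" in SignalFlow is to compare the signal against a timeshifted copy of itself; the one-week shift and 10-minute duration below are assumptions.

```
errors = data('http.server.errors').sum(by=['service'])
# Baseline: the same signal at the same time one week earlier.
baseline = errors.timeshift('1w')
# Fire when the current error rate runs more than 5 percent above baseline.
detect(when(errors > baseline * 1.05, '10m')).publish('error_rate_above_baseline')
```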
6. Database Latency
Metric: db.query.latency
Condition: 95th percentile (p95) latency exceeds 300 milliseconds.
Importance:
This detector helps catch slow queries that could slow down entire systems.
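A minimal SignalFlow sketch, assuming latency is reported in milliseconds and grouped by a hypothetical 'database' dimension:

```
# p95 latency per database instance; the 5-minute duration is illustrative.
latency = data('db.query.latency').percentile(pct=95, by=['database'])
detect(when(latency > 300, '5m')).publish('db_latency_p95_high')
```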
7. Network Throughput
Metric: network.bytes_received_per_second
Condition: Network throughput drops by 50 percent compared to the previous ten minutes' average.
Purpose:
Network problems can lead to service latency or failures, making early detection very important.
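One hedged way to express the 50 percent drop in SignalFlow is to compare the live signal against its own rolling 10-minute mean, shifted back 10 minutes; the 5-minute trigger duration is an assumption.

```
rx = data('network.bytes_received_per_second').sum(by=['host'])
# The previous ten minutes' average: a rolling mean shifted back 10 minutes.
prev_avg = rx.mean(over='10m').timeshift('10m')
# Fire when current throughput falls below half of that average.
detect(when(rx < prev_avg * 0.5, '5m')).publish('network_throughput_drop')
```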
8. Kubernetes Health
Metrics:
kubernetes.node.ready
kubernetes.pod.status
Condition:
Node is not ready.
Pod enters CrashLoopBackOff state.
Kubernetes environments are dynamic, and detecting node or pod issues early helps prevent large-scale outages.
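A minimal sketch for the node-readiness half of this use case, assuming kubernetes.node.ready reports 1 for ready nodes and that nodes are identified by a 'kubernetes_node' dimension; pod-status conditions would be expressed the same way against kubernetes.pod.status.

```
node_ready = data('kubernetes.node.ready').min(by=['kubernetes_node'])
# Fire when a node reports not-ready (value below 1) for 5 minutes.
detect(when(node_ready < 1, '5m')).publish('k8s_node_not_ready')
```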
9. Service Availability
Metric: service.availability
Condition: Availability drops below 99.9 percent.
Context:
Monitoring availability ensures that user experience and business SLAs (Service Level Agreements) are maintained.
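A minimal SignalFlow sketch, assuming the metric is expressed as a percentage; the per-service grouping and 5-minute duration are illustrative.

```
avail = data('service.availability').mean(by=['service'])
# Fire when availability falls below the 99.9 percent SLA target.
detect(when(avail < 99.9, '5m')).publish('availability_below_slo')
```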
10. Deployment Health
Metric: Custom deployment markers tied to software release versions.
Condition:
Increased error rates.
Increased latency after new version deployment.
Monitoring deployments helps identify bad releases quickly so that they can be rolled back or corrected.
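Deployment markers themselves are events rather than metrics, so they are usually overlaid on charts for correlation. A hedged SignalFlow approximation of "worse after the release" compares current error rates to a pre-deployment baseline via timeshift; the one-hour shift and 20 percent margin are assumptions.

```
errors = data('http.server.errors').sum(by=['service'])
# Rough pre-deployment baseline: the same signal one hour earlier.
pre_deploy = errors.timeshift('1h')
# Fire if errors run 20 percent above the pre-deployment level for 10 minutes.
detect(when(errors > pre_deploy * 1.2, '10m')).publish('post_deploy_regression')
```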
To make sure your detectors are as effective and reliable as possible, follow these best practices:
1. Contextualize alerts
Include information such as:
Which host
Which service
Which Kubernetes pod
This extra context makes it much easier for teams to troubleshoot and resolve incidents quickly.
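One way to bake this context in is to group the signal by its identifying dimensions, so each alert fires per entity and carries those dimensions with it. The dimension names below are assumptions and vary by integration.

```
# Grouping by host, service, and pod makes each alert specific to one entity
# and attaches those dimensions to the alert for troubleshooting.
cpu = data('cpu.utilization').mean(by=['host', 'service', 'kubernetes_pod_name'])
detect(when(cpu > 80, '10m')).publish('cpu_high_with_context')
```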
2. Use dynamic thresholds
Instead of setting static values, use baseline comparisons or rate-of-change thresholds.
Dynamic thresholds adapt to normal changes in system behavior, reducing false positives.
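A hedged sketch of a baseline comparison: instead of a fixed number, the threshold is the signal's own behavior at the same time last week. The one-hour smoothing and 30 percent margin are illustrative assumptions.

```
signal = data('cpu.utilization').mean(by=['host'])
# Smooth the historical signal, then shift it back one week.
baseline = signal.mean(over='1h').timeshift('1w')
# Fire when the live signal runs 30 percent above its weekly baseline.
detect(when(signal > baseline * 1.3, '10m')).publish('above_weekly_baseline')
```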
3. Use multi-condition logic
Use multiple conditions to confirm issues before triggering an alert.
Example: Trigger an alert only when CPU usage is high and the application error rate is elevated at the same time (see the sketch below).
Multi-condition detectors reduce unnecessary alerts and help focus attention on real problems.
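SignalFlow supports combining when() clauses with and/or. A minimal sketch of the example above, with illustrative thresholds:

```
cpu = data('cpu.utilization').mean()
errors = data('http.server.errors').sum()
# Fire only when both conditions hold at the same time for 5 minutes.
detect(when(cpu > 80, '5m') and when(errors > 100, '5m')).publish('correlated_degradation')
```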
4. Test detectors carefully
Always validate your detector logic by:
Running in a test environment.
Simulating metric anomalies.
Reviewing historical data.
Testing prevents detectors from generating excessive false positives or missing true incidents after going live.
You now understand:
The most common monitoring use cases in Splunk Observability, such as CPU usage, memory pressure, disk space monitoring, host availability, application error rates, database latency, network throughput, Kubernetes health, service availability, and deployment health.
How to design specific detectors for each use case, including the appropriate metrics, conditions, and severity settings.
Best practices for making detectors more accurate, efficient, and useful through contextualization, dynamic thresholds, multi-condition logic, and careful testing.
Muting Rules During Maintenance
During planned maintenance activities, it is common for systems to behave abnormally or become temporarily unavailable.
To prevent irrelevant or unnecessary alerts during these periods, Muting Rules should be applied in Splunk Observability Cloud.
Muting Rules allow users to suppress specific detectors or entire groups of alerts based on time schedules, service names, environment tags, or other attributes.
Proper use of muting ensures that operations teams are not overwhelmed with false alarms during expected downtime, maintaining the credibility of the alerting system.
You may encounter a question like:
"How do you avoid detector noise during planned server maintenance?"
The correct answer is:
"Apply Muting Rules during maintenance windows to suppress irrelevant alerts."
Consider configuring muting rules to suppress detector alerts during scheduled maintenance periods, preventing unnecessary alert noise.
Evaluation Window Best Practices
Selecting an appropriate evaluation window is critical to building reliable detectors.
If detectors trigger based on single instantaneous threshold breaches (e.g., CPU usage spikes above 80% for just one moment), they are prone to false positives or alert flapping.
Instead, it is better to configure detectors to trigger only if the condition is met on average over a reasonable period, such as:
CPU usage greater than 80% averaged over 5 minutes
Memory usage above 90% sustained over a 10-minute window
This approach smooths out short-lived spikes and ensures that alerts represent sustained, meaningful issues.
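The contrast is visible directly in SignalFlow: when() without a duration evaluates each datapoint, while adding a duration requires the condition to hold for the whole window. Thresholds below are illustrative.

```
cpu = data('cpu.utilization').mean()
# Flap-prone: fires on any single datapoint above 80.
detect(when(cpu > 80)).publish('cpu_instantaneous')
# Stable: fires only when the condition holds for a full 5 minutes.
detect(when(cpu > 80, '5m')).publish('cpu_sustained_5m')
```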
You may encounter a question like:
"Why is it important to define an evaluation window in detector configuration?"
The correct answer should highlight:
"To prevent false positives caused by short-lived metric spikes."
Use reasonable evaluation windows to prevent false positives due to short-lived metric spikes and to ensure that alerts reflect sustained conditions.
| Topic | Key Points |
|---|---|
| Muting Rules During Maintenance | Configure muting rules to suppress detector alerts during planned maintenance to avoid unnecessary noise. |
| Evaluation Window Best Practices | Use time-averaged evaluation windows to avoid alert flapping caused by temporary metric fluctuations. |
What causes alert flapping in monitoring detectors?
Alert flapping occurs when metric values repeatedly cross alert thresholds within short time intervals.
When a metric fluctuates near a threshold, detectors may rapidly switch between alert and recovery states. This produces repeated notifications that can overwhelm operators. Flapping typically occurs when thresholds are too close to normal metric variation or when evaluation windows are too short. Stabilizing detectors often requires adjusting thresholds or increasing evaluation durations so alerts trigger only for sustained problems.
Demand Score: 86
Exam Relevance Score: 90
How can detectors be configured to reduce alert flapping?
Detectors can reduce flapping by adding duration conditions that require thresholds to be exceeded for a sustained period before triggering alerts.
Instead of triggering alerts immediately when a threshold is crossed, detectors can evaluate whether the condition persists for several minutes or evaluation cycles. This prevents alerts caused by temporary metric fluctuations. Combining duration requirements with appropriate thresholds and analytic smoothing functions can significantly reduce alert noise while preserving sensitivity to genuine incidents.
Demand Score: 84
Exam Relevance Score: 88
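As a hedged sketch of the techniques described above, the following combines rolling smoothing, a duration condition, and a separate clear condition (hysteresis) so the detector does not bounce between states; all thresholds and windows are illustrative.

```
# Rolling 5-minute mean smooths out momentary spikes.
mem = data('memory.used_percent').mean(over='5m')
# Trigger after 10 sustained minutes above 90; clear only after
# 10 sustained minutes below 85, leaving a gap that prevents flapping.
detect(on=when(mem > 90, '10m'), off=when(mem < 85, '10m')).publish('memory_pressure_stable')
```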
Why is monitoring ephemeral infrastructure challenging for detectors?
Ephemeral infrastructure frequently creates and removes resources, which changes the set of metric time series being monitored.
In environments such as container orchestration systems, instances may exist only briefly. Detectors configured for static infrastructure may fail to detect issues when new instances appear or old ones disappear. Monitoring strategies must therefore rely on dynamic dimension filtering or population-based monitoring rather than fixed host identifiers. This ensures detectors adapt automatically as infrastructure changes.
Demand Score: 82
Exam Relevance Score: 87
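A sketch of the difference, with hypothetical dimension names: scoping by a broad attribute and grouping by a dynamic dimension lets the detector pick up new instances automatically, where a fixed host filter would not.

```
# Brittle on ephemeral infrastructure: pinned to one host that may vanish.
# cpu = data('cpu.utilization', filter=filter('host', 'web-01'))

# Adaptive: scope by environment, group by pod; new pods join automatically.
cpu = data('cpu.utilization', filter=filter('environment', 'production')).mean(by=['kubernetes_pod_name'])
detect(when(cpu > 80, '5m')).publish('pod_cpu_high')
```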
How can detectors monitor large numbers of sources effectively?
Detectors can monitor populations of metric time series using aggregated or grouped signals instead of individual host metrics.
In large environments, creating separate detectors for every host is inefficient. Population-based detectors evaluate groups of metric time series simultaneously. For example, a detector may trigger if a percentage of hosts exceed a CPU threshold. This approach scales monitoring across large infrastructures while maintaining manageable detector configurations.
Demand Score: 80
Exam Relevance Score: 88
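A hedged sketch of a population-based detector: above() keeps only time series currently over the limit, and count() measures how many series are reporting, so their ratio approximates the share of the fleet under pressure. The 20 percent threshold is illustrative.

```
cpu = data('cpu.utilization').mean(by=['host'])
# Percentage of reporting hosts currently above 80 percent CPU.
pct_hot = cpu.above(80).count() / cpu.count() * 100
# Fire when more than 20 percent of the fleet is hot for 5 minutes.
detect(when(pct_hot > 20, '5m')).publish('fleet_cpu_pressure')
```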
Why might a detector fail to trigger alerts even when a metric threshold appears exceeded?
The detector may not trigger due to evaluation window settings, missing datapoints, or incorrect signal filters.
Detectors evaluate signals according to defined time windows and analytic functions. If datapoints arrive late or fall outside the evaluation window, the detector may not register the threshold breach. Filters applied to metric dimensions may also exclude the affected infrastructure entity. Reviewing detector signal configuration and evaluation settings is necessary when alerts fail to trigger as expected.
Demand Score: 79
Exam Relevance Score: 87