Detection Engineering is the art and science of:
Designing
Developing
Testing
Deploying
Maintaining
Detection logic that finds security threats inside systems.
Goal:
Detect malicious activities or unusual behaviors as early as possible,
and raise reliable and actionable alerts for security teams.
Building Correlation Searches: Smart saved searches that detect bad behavior.
Creating Notable Events: Alerts that stand out and demand attention.
Developing Use Cases: Specific attack scenarios you want to detect.
Fine-tuning alerts: Reducing false positives (wrong alerts) and increasing true positives (real threats).
Without good detection engineering, even the best logs are useless — because you won't see threats early enough.
Before you start writing detection rules, you need to think carefully:
What kinds of attacks do you want to catch?
What it means:
Study how real-world attackers behave.
Use frameworks like MITRE ATT&CK, which lists common attack techniques.
Credential Dumping (stealing passwords)
Lateral Movement (moving across systems)
Command and Control (C2) (remotely controlling infected machines)
Identify which of these techniques could happen in your environment.
Example:
What it means:
Not all threats are equally important.
Focus first on high-risk, high-impact threats.
Suspicious login attempts from foreign countries.
Privilege escalation (user becomes administrator without permission).
Unauthorized download of sensitive company data.
Always start with the attacks that could hurt your organization the most!
After identifying the types of attacks you want to catch (threat modeling and use cases),
the next step is to write the actual detection rules inside Splunk.
In Splunk, we call these rules Correlation Searches.
What it means:
You must write smart Splunk SPL (Search Processing Language) queries.
These searches must be:
Accurate (find the right events)
Fast (optimized to run quickly)
Efficient (use fewer system resources)
Use indexed fields:
Always filter on fields that are already indexed.
Example: index=auth sourcetype=WinEventLog:Security is faster than searching across all data.
Use tstats when possible:
tstats is faster than a raw search because it reads indexed fields and accelerated data model summaries instead of raw events.
Example:
| tstats summariesonly=true count from datamodel=Authentication where Authentication.action=success
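The same tstats pattern extends with a by clause when you need results per entity rather than one total. A hedged sketch, assuming the standard CIM Authentication data model fields:

```
| tstats summariesonly=true count from datamodel=Authentication
    where Authentication.action=success
    by Authentication.user, Authentication.src
| sort - count
```

This returns one row per user/source pair, which is usually what a detection needs to act on.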
What they are:
Saved Searches in Splunk that run automatically on a schedule.
When certain conditions are met (example: too many failed logins), the search triggers an alert.
Schedule:
Trigger conditions:
Urgency Level:
Tagging:
Risk Scores:
Adaptive Response Actions:
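The settings above are configured in the correlation search editor; the detection logic itself is plain SPL. A hedged sketch of a brute-force correlation search (CIM field names assumed; the schedule and threshold values are illustrative, not prescriptive):

```
| tstats summariesonly=true count from datamodel=Authentication
    where Authentication.action=failure
    by Authentication.src, Authentication.user
| where count >= 10
```

Saved as a correlation search, this might run every 15 minutes over the last 15 minutes of data, trigger when the number of results is greater than zero, and carry metadata such as urgency High and a tag like "Brute Force".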
When you build detection rules, a big problem you will face is false positives.
A false positive is an alert that triggers, but the event is not really malicious.
In other words:
The system cries "wolf," but there’s no wolf.
Too many false positives waste time, annoy security analysts, and cause real threats to be ignored.
There are three key techniques:
What it means:
If a login comes from a company office during working hours, maybe it's normal.
But if the same login happens at 3 AM from another country, it’s suspicious!
Enrich logs with:
Business hours (working time vs. off hours)
Trusted IP address lists
Known service accounts (accounts used by automated systems, not humans)
| lookup trusted_ips src_ip OUTPUT trusted
| where isnull(trusted) OR trusted!="yes"
(This search only alerts on IPs not in the trusted list; the isnull check explicitly keeps events where the lookup found no match at all.)
What it means:
If you alert on 3 failed logins, you may get hundreds of false alarms.
If you alert on 10 failed logins within 5 minutes, it’s more meaningful.
Study your normal environment behavior first.
Then set thresholds based on what is truly unusual.
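The thresholding idea above — alert on 10 failed logins within 5 minutes rather than on any 3 failures — can be sketched in raw SPL. The index, sourcetype, and field names here are assumptions; adjust them to your environment (EventCode 4625 is the Windows failed-logon event):

```
index=auth sourcetype=WinEventLog:Security EventCode=4625
| bin _time span=5m
| stats count AS failures BY _time, src_ip, user
| where failures >= 10
```

Start with a generous threshold, observe how often it fires against your baseline, then tighten it.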
What it means:
If analysts keep marking certain alerts as false positives, adjust the rule.
If they find a real threat you missed, improve the detection to catch it next time.
Detection engineering should be a living process, not "set and forget."
Regular reviews make your system smarter over time.
After building good detection rules and reducing false positives,
the next important job is to make sure you are covering enough types of attacks.
This is called Threat Coverage Mapping.
It means:
Checking which types of attacks you can detect.
Finding gaps where you still need to build detections.
This ensures your organization is protected against a wide variety of threats — not just a few.
There are two main steps:
What is MITRE ATT&CK?
A famous, open framework that lists real-world attacker techniques.
It’s like a map of how hackers operate during attacks.
T1078: Valid Accounts (stolen usernames and passwords).
T1059: Command and Scripting Interpreter (using command-line and scripting interpreters to execute malicious code).
T1110: Brute Force (guessing passwords).
Every detection rule you create should map to one or more ATT&CK techniques.
Example:
You can easily see what threats you are covering.
Helps explain to management how well protected you are.
Identifies where you still have detection gaps.
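One hedged way to see this coverage in practice is to count triggered detections per ATT&CK technique. In Enterprise Security deployments where correlation searches carry MITRE annotations, a sketch like the following can work — the risk index name and the annotation field vary by ES version and configuration:

```
index=risk
| stats count BY annotations.mitre_attack
| sort - count
```

Techniques that never appear in the output are candidates for your gap list.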
What it means:
You have good detection rules for credential dumping and phishing.
But no detection rules yet for lateral movement (e.g., Pass-the-Hash attacks).
Build new detections for uncovered techniques.
Prioritize high-risk or frequently used techniques first.
Attackers constantly change methods.
You should review your threat coverage every few months and update accordingly.
After you build your detection rules,
you must test them to make sure they actually work in real life.
Good detection engineering is not just about writing rules —
it’s about proving that your rules really detect real attacks and don't waste time with bad alerts.
An untested detection rule may miss real attacks (dangerous!).
Or it may cause too many false positives (wasting analyst time).
You must test and validate to build trust in your detection system.
There are two main techniques:
What it means:
You simulate (pretend) that an attacker is doing bad things inside your network.
Then you see if your detection rules trigger alerts correctly.
Use adversary emulation frameworks like:
Atomic Red Team (easy, open-source, many test cases)
MITRE Caldera (more advanced, automated testing)
Manually perform actions:
Example: Try a few wrong logins to simulate a brute-force attack.
Run PowerShell commands that are commonly used by attackers.
It's like a fire drill for your security system.
Shows which detections work — and which need improvement.
What it means:
After an alert is triggered, carefully check:
Was it a real threat?
Was it a normal behavior falsely flagged?
Was it missing important details?
Relevance: Is the alert about something meaningful?
Actionability: Can a security analyst take clear action based on the alert?
Documentation: Is there enough context in the alert for a quick investigation?
Update the detection rule.
Adjust thresholds.
Add more context enrichment.
After building, testing, and validating your detection rules,
the next important responsibility is to measure how well your detection system is performing.
This is done through Metrics and Reporting.
Good metrics help you prove to yourself — and to your company — that your detections are working and improving over time.
Metrics are numbers or statistics that show:
How fast you detect threats
How accurate your detections are
How much noise (false positives) your system is generating
What it means:
MTTD (Mean Time to Detect) measures the average time between when an attack begins and when your detections catch it.
A lower MTTD is better — it means you are finding problems faster.
What it means:
FPR (False Positive Rate) measures what share of your alerts turn out to be benign.
A lower FPR is better — you want real alerts, not noise.
Example:
What it means:
TPR (True Positive Rate) measures what share of real threats your detections successfully catch.
A higher TPR is better — it means you are catching real threats.
Example:
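As a hedged sketch, both rates can be approximated from Incident Review data — assuming analysts consistently close notables with a label like "Closed - False Positive" (the `notable` macro is standard in ES, but status and disposition labels depend on your configuration):

```
`notable`
| stats count AS total,
        count(eval(status_label="Closed - False Positive")) AS false_positives
| eval fpr = round(false_positives / total * 100, 1)
```

This only works if analysts record dispositions consistently, which is itself a practice worth enforcing.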
Detection Engineering is never "finished."
You should always:
Measure your KPIs regularly (weekly or monthly).
Analyze trends (is MTTD going up or down?).
Tune detection rules based on the results.
Build new detections for newly discovered attack techniques.
Retire detections that no longer provide value.
Treat detection engineering like a living system — constantly updated and improved.
Monthly Report showing:
Number of alerts triggered
Mean Time to Detect
Number of false positives
Coverage by ATT&CK techniques
Executive Dashboard with easy-to-read KPIs for management.
Now that you understand what Detection Engineering is and how it works,
you must also learn the best habits that real Detection Engineers follow.
Best practices help you:
Build better detections
Reduce mistakes
Make your security program stronger and more efficient
Let’s go through them step-by-step:
What it means:
High-fidelity detections are very accurate.
When they trigger, they almost always mean something bad is happening.
Low-noise means you avoid creating too many unnecessary alerts.
Analysts trust your alerts more.
Security teams are not overwhelmed by false alarms.
Real attacks are found faster.
Start by building detections for attacks that are easy to recognize and clearly bad (example: malware execution, unusual administrative actions).
Tune sensitivity carefully to avoid over-alerting.
What it means:
Write small, simple, efficient SPL queries.
Avoid giant complicated searches that are hard to read and maintain.
Easier to debug if something goes wrong.
Easier to reuse parts of a search for other detections.
Instead of writing one giant query with everything inside it,
create small searches for:
Detecting failed logins
Detecting suspicious file changes
Detecting command line abuse
Then combine them later if needed.
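One common way to keep searches modular in Splunk is with search macros: each small detection lives in its own macro, and larger searches compose them. A sketch, with a hypothetical macro name for illustration:

```
[failed_logins]
definition = index=auth sourcetype=WinEventLog:Security EventCode=4625
```

A detection can then reuse it as `failed_logins` | stats count BY src_ip | where count >= 10, and any fix to the macro definition propagates to every search that calls it.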
What it means:
Use Git repositories (just like software developers).
Or use Splunk Content Management apps.
Each change should have:
A reason (e.g., "Reduced false positives for detection X").
A version number.
You can rollback to old versions if new detections break things.
You keep history of why changes were made (very useful for audits).
What it means:
At least once per quarter (4 times a year).
More often if major changes happen (new data sources, major threat reports).
Keeps your detections sharp.
Validates that updates in IT systems (e.g., new Windows versions) didn’t break your detections.
Helps uncover weaknesses before attackers do.
What it means:
Ask them which alerts are most useful.
Get feedback on false positives they encounter.
Let them help you design better detections based on real-world experience.
Detection rules are not theoretical — they must work in real-world investigations.
IR teams help make your detections more practical and effective.
If you want to become a great Detection Engineer using Splunk,
you must master some special Splunk features that are designed to make detection building, management, and investigation easier.
What it is:
Splunk Enterprise Security (ES) is a premium app designed for security operations.
Correlation Searches are pre-built or custom saved searches inside ES that automatically detect suspicious activities.
Use out-of-the-box searches provided by Splunk.
Customize or create your own detections.
Add important metadata:
Urgency levels (Critical, High, Medium, Low)
Tags (e.g., "Brute Force", "Malware")
Risk scores
Trigger Notable Events (special alerts) when conditions are matched.
Enterprise Security makes building and managing detections much easier and more organized.
What it is:
A smarter method of alerting.
Instead of alerting on every small suspicious action,
you assign risk scores to actions and alert only when the total risk is high.
Small risky actions (e.g., failed login, suspicious PowerShell use) each add some risk points.
If the total points for a user or device go above a threshold (example: 100 points), an alert is triggered.
Reduces alert noise.
Focuses analyst attention on truly dangerous behavior.
Helps detect complex, multi-stage attacks more easily.
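The point-accumulation logic above can be sketched as an SPL query over the ES Risk data model. Field names follow recent ES versions and may differ in yours; the 100-point threshold mirrors the example above:

```
| tstats summariesonly=true sum(All_Risk.calculated_risk_score) AS total_risk
    from datamodel=Risk.All_Risk
    where earliest=-24h
    by All_Risk.risk_object, All_Risk.risk_object_type
| where total_risk > 100
```

Each row that survives the final filter represents a user or device whose accumulated risk justifies an alert.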
What it is:
A Notable Event is a special alert generated by a correlation search, enriched with:
Important fields (like username, IP address, time).
Tags and urgency.
Linked risk scores.
Direct links to related events or assets.
Notable Events are collected into a central dashboard called Incident Review.
Analysts can easily investigate, assign, comment, and close incidents from there.
It centralizes alert management and makes investigations smoother and faster.
What it is:
A regularly released content package from Splunk that delivers:
New detection rules based on the latest attack techniques.
Updated and improved older detections.
Threat intelligence integrations.
Attack techniques evolve all the time.
SCU helps you keep your detection content fresh without building everything from scratch.
Good Practice:
What it is:
A built-in Enterprise Security workspace where analysts can:
See related events and assets.
Build timelines of activity.
Attach notes and evidence.
Collaborate with other analysts during investigations.
Investigation Workbench helps analysts pivot quickly from an alert to the full context.
Speeds up investigation times significantly.
When building detection rules, it is important to map not only to main MITRE ATT&CK Techniques, but also to Sub-Techniques whenever possible for improved precision.
Sub-Techniques are specific, detailed variations of a broader technique.
Example:
T1059 = Command and Scripting Interpreter (general category)
T1059.001 = PowerShell (specific use of PowerShell interpreter)
Provides finer granularity for tracking detection coverage.
Enhances threat modeling accuracy by pinpointing exactly what behavior you are detecting.
Helps during audits and reporting by clearly showing detailed defensive coverage.
Supports building more specific, targeted correlation searches.
Instead of mapping a PowerShell-based malware execution detection to just T1059, you map it specifically to T1059.001.
This makes it clear that you are covering PowerShell abuse and not just general command-line activity.
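A hedged way to audit which techniques and sub-techniques your correlation searches claim is to query the saved-search configuration directly — assuming your detections store MITRE IDs in the standard ES annotations setting:

```
| rest /servicesNS/-/-/saved/searches splunk_server=local
| search action.correlationsearch.enabled=1
| table title, action.correlationsearch.annotations
```

Searches whose annotations stop at a parent technique like T1059 are candidates for remapping to the specific sub-technique they actually detect.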
Key takeaway:
Mapping detections to Sub-Techniques improves visibility into your environment’s strengths and gaps, and supports more precise security reporting.
In Splunk Enterprise Security (ES), Adaptive Response Actions are configured to automatically or semi-automatically react to detected threats.
Sending Alert Emails:
Creating Incident Tickets:
Running Automated Scripts for Containment:
Execute predefined scripts that perform actions like:
Disabling a user account suspected of compromise.
Isolating an infected endpoint from the network.
Blocking malicious IP addresses in firewalls.
Speeds up the response process.
Reduces manual effort during high-pressure incidents.
Standardizes initial response actions across different types of alerts.
Allows for scalable and consistent reaction mechanisms.
Key takeaway:
Adaptive Response Actions are crucial for bridging the gap between detection and initial containment inside Splunk environments.
Testing and validating detection rules with realistic attack data is essential. Besides Atomic Red Team and MITRE Caldera, another important tool is the Splunk Attack Range.
A Splunk-supported open-source project.
Provides an automated lab environment for:
Simulating real-world attack behaviors.
Generating authentic telemetry data (logs, alerts, events).
Validating detection rules under near-realistic conditions.
Deploys small, controlled environments using Terraform and Ansible.
Installs Splunk forwarders and indexers for realistic ingestion pipelines.
Runs adversary simulations based on frameworks like MITRE ATT&CK.
Collects data for detection engineering teams to analyze and improve rules.
Allows safe and structured testing without risking production environments.
Supports iterative development of high-fidelity, low-false-positive detection content.
Helps validate detection coverage against new or evolving attack techniques.
Key takeaway:
Splunk Attack Range enables controlled, repeatable, and realistic validation of cybersecurity detection strategies.
Risk-Based Alerting (RBA) is an advanced detection strategy in Splunk Enterprise Security that prioritizes security events based on accumulated risk rather than individual isolated actions.
Small suspicious events (e.g., failed logins, unusual PowerShell use) are assigned risk scores.
These risk scores accumulate over time for specific entities (users, endpoints, applications).
When the aggregate risk score crosses a defined threshold, a high-risk notable event is generated.
A user fails to log in three times (small score assigned).
The same user runs a suspicious script (additional score assigned).
Later, the user accesses a sensitive database unexpectedly (more score added).
The system identifies the total risk and triggers a notable event only when enough evidence of suspicious behavior accumulates.
Manages how risk scores are:
Calculated.
Stored.
Aggregated.
Supports:
Dynamic scoring models based on event type, time, frequency, and severity.
Flexible thresholds tailored to organizational risk tolerance.
Reduces alert fatigue by not generating separate alerts for every minor anomaly.
Focuses analysts on high-priority investigations.
Detects slow, stealthy attacks that might be missed with isolated event detection.
Key takeaway:
Risk-Based Alerting transforms detection from a single-event focus to a pattern and entity risk accumulation model, enhancing overall threat identification capabilities.
By mastering these additional concepts, you will:
Build more precise detections aligned with both ATT&CK Techniques and Sub-Techniques.
Configure Adaptive Response Actions to automate containment and communication steps.
Use Splunk Attack Range to simulate and validate detections in realistic conditions.
Implement Risk-Based Alerting to prioritize real threats more intelligently and reduce noise.
What is the primary purpose of tuning a correlation search in Splunk Enterprise Security?
To reduce false positives while preserving meaningful security detections.
Correlation searches generate alerts when conditions match detection logic. However, raw detections often trigger excessive alerts due to benign behavior or environmental noise. Tuning adjusts thresholds, adds contextual filters, or introduces asset and identity enrichment to distinguish legitimate activity from suspicious patterns. Effective tuning ensures alerts remain actionable for analysts without overwhelming security operations teams. A common tuning mistake is disabling detections entirely instead of refining conditions or adding contextual enrichment.
Demand Score: 90
Exam Relevance Score: 94
In Splunk Enterprise Security risk-based alerting, what role does a risk modifier play?
A risk modifier increases the calculated risk score for an entity when a defined behavior or condition is detected.
Risk modifiers attach risk scores to risk objects such as users or systems when suspicious activity occurs. Instead of triggering immediate alerts, multiple risk events accumulate over time. When the aggregated score exceeds a threshold, a notable event is generated. This model reduces alert fatigue by correlating weaker signals into higher-confidence detections. Engineers must carefully assign risk values to behaviors to ensure meaningful scoring without triggering premature alerts.
Demand Score: 86
Exam Relevance Score: 96
What advantage does risk-based alerting provide compared to traditional correlation searches?
It correlates multiple low-confidence events into a higher-confidence detection through cumulative risk scoring.
Traditional correlation searches often trigger alerts from a single detection condition. Risk-based alerting aggregates multiple signals related to an entity and evaluates the cumulative risk score before generating a notable event. This method reduces alert noise and improves detection fidelity. It is particularly effective for identifying slow or distributed attack behaviors where individual events may appear benign. Misconfigured risk thresholds can either suppress real threats or create excessive alerts.
Demand Score: 82
Exam Relevance Score: 93
Why should context such as asset and identity information be incorporated into detection logic?
Context improves detection accuracy by distinguishing critical systems or privileged users from normal activity.
Security detections often trigger on generic behaviors such as authentication failures or network access. Without context, these detections may generate numerous irrelevant alerts. Asset and identity frameworks enrich events with metadata such as system criticality, user roles, or department information. Detection rules can then prioritize suspicious activity involving sensitive systems or privileged users. A frequent mistake is deploying detections without contextual enrichment, leading to excessive alert volume and reduced analyst efficiency.
Demand Score: 80
Exam Relevance Score: 90
What triggers the creation of a notable event in a risk-based alerting workflow?
A notable event is generated when the accumulated risk score for a risk object exceeds a configured threshold.
In risk-based alerting, risk modifiers continuously assign scores to entities such as users or hosts. These risk scores accumulate within a defined time window. When the aggregated risk exceeds a configured threshold, the system generates a notable event for investigation. This mechanism ensures that alerts represent meaningful patterns rather than isolated signals. Incorrect threshold configuration is a common issue that either suppresses detections or floods analysts with alerts.
Demand Score: 84
Exam Relevance Score: 94
What lifecycle stage focuses on validating whether detections remain effective after deployment?
The detection maintenance and validation stage.
Detection engineering is not a one-time task. After deployment, detections must be continuously evaluated to ensure they remain effective as environments evolve. Validation includes reviewing alert quality, assessing false positives, and confirming that detections still align with emerging threats. Security teams often track detection performance metrics and refine logic based on analyst feedback. Neglecting the maintenance stage can result in outdated detections that fail to identify new attack techniques.
Demand Score: 78
Exam Relevance Score: 87