Detection Engineering is the art and science of:
Designing
Developing
Testing
Deploying
Maintaining
Detection logic that finds security threats inside systems.
Goal:
Detect malicious activities or unusual behaviors as early as possible,
and raise reliable and actionable alerts for security teams.
Building Correlation Searches: Smart saved searches that detect bad behavior.
Creating Notable Events: Alerts that stand out and demand attention.
Developing Use Cases: Specific attack scenarios you want to detect.
Fine-tuning alerts: Reducing false positives (wrong alerts) and increasing true positives (real threats).
Without good detection engineering, even the best logs are useless — because you won't see threats early enough.
Before you start writing detection rules, you need to think carefully:
What kinds of attacks do you want to catch?
What it means:
Study how real-world attackers behave.
Use frameworks like MITRE ATT&CK, which lists common attack techniques.
Credential Dumping (stealing passwords)
Lateral Movement (moving across systems)
Command and Control (C2) (remotely controlling infected machines)
Identify which of these techniques could happen in your environment.
Example:
What it means:
Not all threats are equally important.
Focus first on high-risk, high-impact threats.
Suspicious login attempts from foreign countries.
Privilege escalation (user becomes administrator without permission).
Unauthorized download of sensitive company data.
Always start with the attacks that could hurt your organization the most!
After identifying the types of attacks you want to catch (threat modeling and use cases),
the next step is to write the actual detection rules inside Splunk.
In Splunk, we call these rules Correlation Searches.
What it means:
You must write smart Splunk SPL (Search Processing Language) queries.
These searches must be:
Accurate (find the right events)
Fast (optimized to run quickly)
Efficient (use fewer system resources)
Use indexed fields:
Always filter on fields that are already indexed.
Example: index=auth sourcetype=WinEventLog:Security is faster than searching across all data.
Use tstats when possible:
tstats is faster than a raw search because it reads indexed fields and accelerated data model summaries instead of raw events.
Example:
| tstats summariesonly=true count from datamodel=Authentication where Authentication.action=success
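The same tstats pattern extends with a by clause when you need results per entity rather than one total. A hedged sketch, assuming the standard CIM Authentication data model fields:

```
| tstats summariesonly=true count from datamodel=Authentication
    where Authentication.action=success
    by Authentication.user, Authentication.src
| sort - count
```

This returns one row per user/source pair, which is usually what a detection needs to act on.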
What they are:
Saved Searches in Splunk that run automatically on a schedule.
When certain conditions are met (example: too many failed logins), the search triggers an alert.
Schedule:
Trigger conditions:
Urgency Level:
Tagging:
Risk Scores:
Adaptive Response Actions:
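The settings above are configured in the correlation search editor; the detection logic itself is plain SPL. A hedged sketch of a brute-force correlation search (CIM field names assumed; the schedule and threshold values are illustrative, not prescriptive):

```
| tstats summariesonly=true count from datamodel=Authentication
    where Authentication.action=failure
    by Authentication.src, Authentication.user
| where count >= 10
```

Saved as a correlation search, this might run every 15 minutes over the last 15 minutes of data, trigger when the number of results is greater than zero, and carry metadata such as urgency High and a tag like "Brute Force".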
When you build detection rules, a big problem you will face is false positives.
A false positive is an alert that triggers, but the event is not really malicious.
In other words:
The system cries "wolf," but there’s no wolf.
Too many false positives waste time, annoy security analysts, and cause real threats to be ignored.
There are three key techniques:
What it means:
If a login comes from a company office during working hours, maybe it's normal.
But if the same login happens at 3 AM from another country, it’s suspicious!
Enrich logs with:
Business hours (working time vs. off hours)
Trusted IP address lists
Known service accounts (accounts used by automated systems, not humans)
| lookup trusted_ips src_ip OUTPUT trusted
| where isnull(trusted) OR trusted!="yes"
(This search only alerts on IPs not in the trusted list; the isnull check explicitly keeps events where the lookup found no match at all.)
What it means:
If you alert on 3 failed logins, you may get hundreds of false alarms.
If you alert on 10 failed logins within 5 minutes, it’s more meaningful.
Study your normal environment behavior first.
Then set thresholds based on what is truly unusual.
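The thresholding idea above — alert on 10 failed logins within 5 minutes rather than on any 3 failures — can be sketched in raw SPL. The index, sourcetype, and field names here are assumptions; adjust them to your environment (EventCode 4625 is the Windows failed-logon event):

```
index=auth sourcetype=WinEventLog:Security EventCode=4625
| bin _time span=5m
| stats count AS failures BY _time, src_ip, user
| where failures >= 10
```

Start with a generous threshold, observe how often it fires against your baseline, then tighten it.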
What it means:
If analysts keep marking certain alerts as false positives, adjust the rule.
If they find a real threat you missed, improve the detection to catch it next time.
Detection engineering should be a living process, not "set and forget."
Regular reviews make your system smarter over time.
After building good detection rules and reducing false positives,
the next important job is to make sure you are covering enough types of attacks.
This is called Threat Coverage Mapping.
It means:
Checking which types of attacks you can detect.
Finding gaps where you still need to build detections.
This ensures your organization is protected against a wide variety of threats — not just a few.
There are two main steps:
What is MITRE ATT&CK?
A famous, open framework that lists real-world attacker techniques.
It’s like a map of how hackers operate during attacks.
T1078: Valid Accounts (stolen usernames and passwords).
T1059: Command and Scripting Interpreter (using command-line and scripting interpreters to execute malicious code).
T1110: Brute Force (guessing passwords).
Every detection rule you create should map to one or more ATT&CK techniques.
Example:
You can easily see what threats you are covering.
Helps explain to management how well protected you are.
Identifies where you still have detection gaps.
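One hedged way to see this coverage in practice is to count triggered detections per ATT&CK technique. In Enterprise Security deployments where correlation searches carry MITRE annotations, a sketch like the following can work — the risk index name and the annotation field vary by ES version and configuration:

```
index=risk
| stats count BY annotations.mitre_attack
| sort - count
```

Techniques that never appear in the output are candidates for your gap list.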
What it means:
You have good detection rules for credential dumping and phishing.
But no detection rules yet for lateral movement (e.g., Pass-the-Hash attacks).
Build new detections for uncovered techniques.
Prioritize high-risk or frequently used techniques first.
Attackers constantly change methods.
You should review your threat coverage every few months and update accordingly.
After you build your detection rules,
you must test them to make sure they actually work in real life.
Good detection engineering is not just about writing rules —
it’s about proving that your rules really detect real attacks and don't waste time with bad alerts.
An untested detection rule may miss real attacks (dangerous!).
Or it may cause too many false positives (wasting analyst time).
You must test and validate to build trust in your detection system.
There are two main techniques:
What it means:
You simulate (pretend) that an attacker is doing bad things inside your network.
Then you see if your detection rules trigger alerts correctly.
Use adversary emulation frameworks like:
Atomic Red Team (easy, open-source, many test cases)
MITRE Caldera (more advanced, automated testing)
Manually perform actions:
Example: Try a few wrong logins to simulate a brute-force attack.
Run PowerShell commands that are commonly used by attackers.
It's like a fire drill for your security system.
Shows which detections work — and which need improvement.
What it means:
After an alert is triggered, carefully check:
Was it a real threat?
Was it a normal behavior falsely flagged?
Was it missing important details?
Relevance: Is the alert about something meaningful?
Actionability: Can a security analyst take clear action based on the alert?
Documentation: Is there enough context in the alert for a quick investigation?
Update the detection rule.
Adjust thresholds.
Add more context enrichment.
After building, testing, and validating your detection rules,
the next important responsibility is to measure how well your detection system is performing.
This is done through Metrics and Reporting.
Good metrics help you prove to yourself — and to your company — that your detections are working and improving over time.
Metrics are numbers or statistics that show:
How fast you detect threats
How accurate your detections are
How much noise (false positives) your system is generating
What it means:
MTTD (Mean Time to Detect) measures the average time between when an attack begins and when your detections catch it.
A lower MTTD is better — it means you are finding problems faster.
What it means:
FPR (False Positive Rate) measures what share of your alerts turn out to be benign.
A lower FPR is better — you want real alerts, not noise.
Example:
What it means:
TPR (True Positive Rate) measures what share of real threats your detections successfully catch.
A higher TPR is better — it means you are catching real threats.
Example:
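As a hedged sketch, both rates can be approximated from Incident Review data — assuming analysts consistently close notables with a label like "Closed - False Positive" (the `notable` macro is standard in ES, but status and disposition labels depend on your configuration):

```
`notable`
| stats count AS total,
        count(eval(status_label="Closed - False Positive")) AS false_positives
| eval fpr = round(false_positives / total * 100, 1)
```

This only works if analysts record dispositions consistently, which is itself a practice worth enforcing.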
Detection Engineering is never "finished."
You should always:
Measure your KPIs regularly (weekly or monthly).
Analyze trends (is MTTD going up or down?).
Tune detection rules based on the results.
Build new detections for newly discovered attack techniques.
Retire detections that no longer provide value.
Treat detection engineering like a living system — constantly updated and improved.
Monthly Report showing:
Number of alerts triggered
Mean Time to Detect
Number of false positives
Coverage by ATT&CK techniques
Executive Dashboard with easy-to-read KPIs for management.
Now that you understand what Detection Engineering is and how it works,
you must also learn the best habits that real Detection Engineers follow.
Best practices help you:
Build better detections
Reduce mistakes
Make your security program stronger and more efficient
Let’s go through them step-by-step:
What it means:
High-fidelity detections are very accurate.
When they trigger, they almost always mean something bad is happening.
Low-noise means you avoid creating too many unnecessary alerts.
Analysts trust your alerts more.
Security teams are not overwhelmed by false alarms.
Real attacks are found faster.
Start by building detections for attacks that are easy to recognize and clearly bad (example: malware execution, unusual administrative actions).
Tune sensitivity carefully to avoid over-alerting.
What it means:
Write small, simple, efficient SPL queries.
Avoid giant complicated searches that are hard to read and maintain.
Easier to debug if something goes wrong.
Easier to reuse parts of a search for other detections.
Instead of writing one giant query with everything inside it,
create small searches for:
Detecting failed logins
Detecting suspicious file changes
Detecting command line abuse
Then combine them later if needed.
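One common way to keep searches modular in Splunk is with search macros: each small detection lives in its own macro, and larger searches compose them. A sketch, with a hypothetical macro name for illustration:

```
[failed_logins]
definition = index=auth sourcetype=WinEventLog:Security EventCode=4625
```

A detection can then reuse it as `failed_logins` | stats count BY src_ip | where count >= 10, and any fix to the macro definition propagates to every search that calls it.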
What it means:
Use Git repositories (just like software developers).
Or use Splunk Content Management apps.
Each change should have:
A reason (e.g., "Reduced false positives for detection X").
A version number.
You can rollback to old versions if new detections break things.
You keep history of why changes were made (very useful for audits).
What it means:
At least once per quarter (4 times a year).
More often if major changes happen (new data sources, major threat reports).
Keeps your detections sharp.
Validates that updates in IT systems (e.g., new Windows versions) didn’t break your detections.
Helps uncover weaknesses before attackers do.
What it means:
Ask them which alerts are most useful.
Get feedback on false positives they encounter.
Let them help you design better detections based on real-world experience.
Detection rules are not theoretical — they must work in real-world investigations.
IR teams help make your detections more practical and effective.
If you want to become a great Detection Engineer using Splunk,
you must master some special Splunk features that are designed to make detection building, management, and investigation easier.
What it is:
Splunk Enterprise Security (ES) is a premium app designed for security operations.
Correlation Searches are pre-built or custom saved searches inside ES that automatically detect suspicious activities.
Use out-of-the-box searches provided by Splunk.
Customize or create your own detections.
Add important metadata:
Urgency levels (Critical, High, Medium, Low)
Tags (e.g., "Brute Force", "Malware")
Risk scores
Trigger Notable Events (special alerts) when conditions are matched.
Enterprise Security makes building and managing detections much easier and more organized.
What it is:
A smarter method of alerting.
Instead of alerting on every small suspicious action,
you assign risk scores to actions and alert only when the total risk is high.
Small risky actions (e.g., failed login, suspicious PowerShell use) each add some risk points.
If the total points for a user or device go above a threshold (example: 100 points), an alert is triggered.
Reduces alert noise.
Focuses analyst attention on truly dangerous behavior.
Helps detect complex, multi-stage attacks more easily.
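The point-accumulation logic above can be sketched as an SPL query over the ES Risk data model. Field names follow recent ES versions and may differ in yours; the 100-point threshold mirrors the example above:

```
| tstats summariesonly=true sum(All_Risk.calculated_risk_score) AS total_risk
    from datamodel=Risk.All_Risk
    where earliest=-24h
    by All_Risk.risk_object, All_Risk.risk_object_type
| where total_risk > 100
```

Each row that survives the final filter represents a user or device whose accumulated risk justifies an alert.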
What it is:
A Notable Event is a special alert generated by a correlation search, enriched with:
Important fields (like username, IP address, time).
Tags and urgency.
Linked risk scores.
Direct links to related events or assets.
Notable Events are collected into a central dashboard called Incident Review.
Analysts can easily investigate, assign, comment, and close incidents from there.
It centralizes alert management and makes investigations smoother and faster.
What it is:
A regularly released content package from Splunk that delivers:
New detection rules based on the latest attack techniques.
Updated and improved older detections.
Threat intelligence integrations.
Attack techniques evolve all the time.
SCU helps you keep your detection content fresh without building everything from scratch.
Good Practice:
What it is:
A built-in Enterprise Security workspace where analysts can:
See related events and assets.
Build timelines of activity.
Attach notes and evidence.
Collaborate with other analysts during investigations.
Investigation Workbench helps analysts pivot quickly from an alert to the full context.
Speeds up investigation times significantly.
When building detection rules, it is important to map not only to main MITRE ATT&CK Techniques, but also to Sub-Techniques whenever possible for improved precision.
Sub-Techniques are specific, detailed variations of a broader technique.
Example:
T1059 = Command and Scripting Interpreter (general category)
T1059.001 = PowerShell (specific use of PowerShell interpreter)
Provides finer granularity for tracking detection coverage.
Enhances threat modeling accuracy by pinpointing exactly what behavior you are detecting.
Helps during audits and reporting by clearly showing detailed defensive coverage.
Supports building more specific, targeted correlation searches.
Instead of mapping a PowerShell-based malware execution detection to just T1059, you map it specifically to T1059.001.
This makes it clear that you are covering PowerShell abuse and not just general command-line activity.
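A hedged way to audit which techniques and sub-techniques your correlation searches claim is to query the saved-search configuration directly — assuming your detections store MITRE IDs in the standard ES annotations setting:

```
| rest /servicesNS/-/-/saved/searches splunk_server=local
| search action.correlationsearch.enabled=1
| table title, action.correlationsearch.annotations
```

Searches whose annotations stop at a parent technique like T1059 are candidates for remapping to the specific sub-technique they actually detect.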
Key takeaway:
Mapping detections to Sub-Techniques improves visibility into your environment’s strengths and gaps, and supports more precise security reporting.
In Splunk Enterprise Security (ES), Adaptive Response Actions are configured to automatically or semi-automatically react to detected threats.
Sending Alert Emails:
Creating Incident Tickets:
Running Automated Scripts for Containment:
Execute predefined scripts that perform actions like:
Disabling a user account suspected of compromise.
Isolating an infected endpoint from the network.
Blocking malicious IP addresses in firewalls.
Speeds up the response process.
Reduces manual effort during high-pressure incidents.
Standardizes initial response actions across different types of alerts.
Allows for scalable and consistent reaction mechanisms.
Key takeaway:
Adaptive Response Actions are crucial for bridging the gap between detection and initial containment inside Splunk environments.
Testing and validating detection rules with realistic attack data is essential. Besides Atomic Red Team and MITRE Caldera, another important tool is the Splunk Attack Range.
A Splunk-supported open-source project.
Provides an automated lab environment for:
Simulating real-world attack behaviors.
Generating authentic telemetry data (logs, alerts, events).
Validating detection rules under near-realistic conditions.
Deploys small, controlled environments using Terraform and Ansible.
Installs Splunk forwarders and indexers for realistic ingestion pipelines.
Runs adversary simulations based on frameworks like MITRE ATT&CK.
Collects data for detection engineering teams to analyze and improve rules.
Allows safe and structured testing without risking production environments.
Supports iterative development of high-fidelity, low-false-positive detection content.
Helps validate detection coverage against new or evolving attack techniques.
Key takeaway:
Splunk Attack Range enables controlled, repeatable, and realistic validation of cybersecurity detection strategies.
Risk-Based Alerting (RBA) is an advanced detection strategy in Splunk Enterprise Security that prioritizes security events based on accumulated risk rather than individual isolated actions.
Small suspicious events (e.g., failed logins, unusual PowerShell use) are assigned risk scores.
These risk scores accumulate over time for specific entities (users, endpoints, applications).
When the aggregate risk score crosses a defined threshold, a high-risk notable event is generated.
A user fails to log in three times (small score assigned).
The same user runs a suspicious script (additional score assigned).
Later, the user accesses a sensitive database unexpectedly (more score added).
The system identifies the total risk and triggers a notable event only when enough evidence of suspicious behavior accumulates.
Manages how risk scores are:
Calculated.
Stored.
Aggregated.
Supports:
Dynamic scoring models based on event type, time, frequency, and severity.
Flexible thresholds tailored to organizational risk tolerance.
Reduces alert fatigue by not generating separate alerts for every minor anomaly.
Focuses analysts on high-priority investigations.
Detects slow, stealthy attacks that might be missed with isolated event detection.
Key takeaway:
Risk-Based Alerting transforms detection from a single-event focus to a pattern and entity risk accumulation model, enhancing overall threat identification capabilities.
By mastering these additional concepts, you will:
Build more precise detections aligned with both ATT&CK Techniques and Sub-Techniques.
Configure Adaptive Response Actions to automate containment and communication steps.
Use Splunk Attack Range to simulate and validate detections in realistic conditions.
Implement Risk-Based Alerting to prioritize real threats more intelligently and reduce noise.
What is the primary purpose of tuning a correlation search in Splunk Enterprise Security?
To reduce false positives while preserving meaningful security detections.
Correlation searches generate alerts when conditions match detection logic. However, raw detections often trigger excessive alerts due to benign behavior or environmental noise. Tuning adjusts thresholds, adds contextual filters, or introduces asset and identity enrichment to distinguish legitimate activity from suspicious patterns. Effective tuning ensures alerts remain actionable for analysts without overwhelming security operations teams. A common tuning mistake is disabling detections entirely instead of refining conditions or adding contextual enrichment.
Demand Score: 90
Exam Relevance Score: 94
In Splunk Enterprise Security risk-based alerting, what role does a risk modifier play?
A risk modifier increases the calculated risk score for an entity when a defined behavior or condition is detected.
Risk modifiers attach risk scores to risk objects such as users or systems when suspicious activity occurs. Instead of triggering immediate alerts, multiple risk events accumulate over time. When the aggregated score exceeds a threshold, a notable event is generated. This model reduces alert fatigue by correlating weaker signals into higher-confidence detections. Engineers must carefully assign risk values to behaviors to ensure meaningful scoring without triggering premature alerts.
Demand Score: 86
Exam Relevance Score: 96
What advantage does risk-based alerting provide compared to traditional correlation searches?
It correlates multiple low-confidence events into a higher-confidence detection through cumulative risk scoring.
Traditional correlation searches often trigger alerts from a single detection condition. Risk-based alerting aggregates multiple signals related to an entity and evaluates the cumulative risk score before generating a notable event. This method reduces alert noise and improves detection fidelity. It is particularly effective for identifying slow or distributed attack behaviors where individual events may appear benign. Misconfigured risk thresholds can either suppress real threats or create excessive alerts.
Demand Score: 82
Exam Relevance Score: 93
Why should context such as asset and identity information be incorporated into detection logic?
Context improves detection accuracy by distinguishing critical systems or privileged users from normal activity.
Security detections often trigger on generic behaviors such as authentication failures or network access. Without context, these detections may generate numerous irrelevant alerts. Asset and identity frameworks enrich events with metadata such as system criticality, user roles, or department information. Detection rules can then prioritize suspicious activity involving sensitive systems or privileged users. A frequent mistake is deploying detections without contextual enrichment, leading to excessive alert volume and reduced analyst efficiency.
Demand Score: 80
Exam Relevance Score: 90
What triggers the creation of a notable event in a risk-based alerting workflow?
A notable event is generated when the accumulated risk score for a risk object exceeds a configured threshold.
In risk-based alerting, risk modifiers continuously assign scores to entities such as users or hosts. These risk scores accumulate within a defined time window. When the aggregated risk exceeds a configured threshold, the system generates a notable event for investigation. This mechanism ensures that alerts represent meaningful patterns rather than isolated signals. Incorrect threshold configuration is a common issue that either suppresses detections or floods analysts with alerts.
Demand Score: 84
Exam Relevance Score: 94
What lifecycle stage focuses on validating whether detections remain effective after deployment?
The detection maintenance and validation stage.
Detection engineering is not a one-time task. After deployment, detections must be continuously evaluated to ensure they remain effective as environments evolve. Validation includes reviewing alert quality, assessing false positives, and confirming that detections still align with emerging threats. Security teams often track detection performance metrics and refine logic based on analyst feedback. Neglecting the maintenance stage can result in outdated detections that fail to identify new attack techniques.
Demand Score: 78
Exam Relevance Score: 87