
SPLK-5002 Data Engineering

Data Engineering Detailed Explanation

1. Introduction to Data Engineering in Cybersecurity

In the context of cybersecurity, Data Engineering means setting up systems and processes to:

  • Collect security-related data from different systems.

  • Prepare and transform that data so it’s clean and organized.

  • Optimize it so that Splunk (or other tools) can quickly search, analyze, and detect security threats.

The main goals are to make sure that all incoming data is:

  • Accurate (no mistakes),

  • Normalized (consistent and standardized),

  • Searchable (easy and fast to find),

  • Efficient to process (doesn't overload the system).

Without good Data Engineering, security detection becomes slow, messy, and unreliable.

2. Core Areas of Data Engineering

2.1 Data Collection

This is the first and most important step: getting the right data into Splunk.

Identify Data Sources

You need to first know where security-related data is coming from.
Some common sources are:

  • Firewalls: Devices that control network traffic. They log when traffic is allowed or blocked.

  • IDS/IPS (Intrusion Detection/Prevention Systems): These detect or block suspicious network activities.

  • EDR (Endpoint Detection and Response): Tools installed on computers to monitor suspicious behavior (like CrowdStrike, SentinelOne).

  • Antivirus Logs: Reports from antivirus programs showing malware detection.

  • Active Directory Logs: Records of user authentication (login/logout) in Windows environments.

  • Cloud Security Logs: Monitoring activities happening inside cloud environments (AWS CloudTrail, Azure Security Center).

  • VPN Logs: Logs of users connecting remotely to the company’s network.

  • Email Security Tools: Logs of phishing attempts, malware attachments, etc.

In short:

Data Collection starts by understanding what systems you have and what data they produce.

Data Onboarding Techniques

Once you know what data you need, you must bring it into Splunk.
There are different methods:

  • Universal Forwarders:

    • Small, lightweight Splunk agents.

    • Installed on servers or devices.

    • They send raw logs (without modifying them) to Splunk Indexers.

    • Example: install a forwarder on a web server to send access logs.

  • Heavy Forwarders:

    • Bigger, more powerful Splunk agents.

    • Can parse, filter, and route data before sending it.

    • Used when you want to transform or reduce data before it reaches Splunk.

  • HTTP Event Collectors (HEC):

    • An API endpoint in Splunk.

    • Applications or services send data directly to Splunk over HTTP/HTTPS.

    • Example: a cloud app sending user activity logs directly into Splunk.

  • Syslog Servers:

    • Used especially for network devices (firewalls, routers, switches).

    • Devices send logs over the network using the Syslog protocol.

    • Syslog servers can collect these logs and forward them into Splunk.
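
As a rough illustration of the Universal Forwarder method above, here is a minimal sketch of the two configuration files involved on the forwarder side. The monitored path, sourcetype, index name, and indexer address are hypothetical placeholders:

# inputs.conf on the Universal Forwarder
[monitor:///var/log/secure]
sourcetype = linux_secure
index = auth

# outputs.conf on the Universal Forwarder
[tcpout:primary_indexers]
server = splunk-indexer.example.com:9997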

2.2 Data Normalization

Once you collect the raw data into Splunk, the next big step is Data Normalization.
Why?
Because every device sends data in its own format. If you don't normalize, it’s hard to search and correlate across different systems.

Common Information Model (CIM) Mapping

The Common Information Model (CIM) is a standard format created by Splunk.
It helps make data from different sources look the same.

What this means:
  • A firewall log might call an IP address "src_ip".

  • An EDR tool might call it "source_ip".

  • A Windows log might call it "SourceAddress".

Without normalization, searching across different logs is very messy.
With CIM mapping, you transform all of them into a common field, like src or dest_ip, so your searches become simple and unified.

Why CIM Mapping Matters:
  • You can write one search that works across multiple data sources.

  • You can use Splunk Enterprise Security (ES) features that expect standardized fields.

  • You can build correlation searches without worrying about device-specific differences.
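
For example, once authentication data from different vendors is CIM-mapped and tagged, one search can cover all of it. A minimal sketch, assuming the sources carry the CIM authentication tag:

tag=authentication action=failure
| stats count by src, user
| where count > 10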

Field Extraction

Field Extraction is about pulling important pieces of information from raw logs and labeling them.

Example:

A raw log might look like this:

May 1 10:00:00 server1 sshd[12345]: Accepted password for user1 from 192.168.1.10 port 55432

Important fields you want to extract are:

  • username = user1

  • source IP = 192.168.1.10

  • port = 55432

Splunk needs to know how to find these inside the raw text.

How to Extract Fields:
  • Field Extractor (UI tool):

    • Splunk provides an easy point-and-click tool called the Field Extractor.

    • You highlight examples of the data you want, and Splunk builds a regular expression (regex) for you.

  • Regular Expressions (Regex):

    • If you’re comfortable, you can manually write patterns to extract fields.

    • Example Regex for the above log:

      Accepted password for (?<username>\w+) from (?<src_ip>\d{1,3}(?:\.\d{1,3}){3}) port (?<port>\d+)
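
To apply a pattern like this as a search-time extraction, it can be placed in props.conf under the relevant sourcetype. A minimal sketch, assuming a hypothetical sourcetype named linux_secure:

# props.conf (search-time field extraction)
[linux_secure]
EXTRACT-ssh_accepted = Accepted password for (?<username>\w+) from (?<src_ip>\d{1,3}(?:\.\d{1,3}){3}) port (?<port>\d+)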
      

2.3 Data Enrichment

After your data is collected and normalized, the next important step is Data Enrichment.

What is Data Enrichment?
It means adding more useful information to your existing data.
This extra information makes your searches, detections, and investigations more powerful and more meaningful.

Asset and Identity Correlation
What it means:
  • Assets = devices, computers, servers, phones.

  • Identities = users, employees, accounts.

You connect your raw event data (like an IP address or username) to more details about the device or person involved.

Example:

Suppose a log shows:

Login attempt from IP 10.1.2.3 by user jsmith

With enrichment, you can also know:

  • 10.1.2.3 belongs to a laptop named Sales-Laptop-01

  • jsmith works in the Sales Department, is a Regional Manager, and is based in New York Office.

This makes a huge difference when investigating!
If you see a login from jsmith in Germany, but you know he normally works in New York — that could be a sign of compromise!

How Asset and Identity Correlation is done:
  • Create lookup tables in Splunk.

  • Lookups map IP addresses, MAC addresses, or usernames to extra details like:

    • Department

    • Location

    • Device type

    • Owner's manager

  • Splunk can automatically enrich event data using these lookups during search or data ingestion.
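
A minimal sketch of lookup-based enrichment at search time, assuming a hypothetical asset lookup file assets.csv with columns ip, owner, department, and location:

index=auth action=success
| lookup assets.csv ip AS src_ip OUTPUT owner, department, location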

Threat Intelligence Integration
What it means:

Bringing in external threat data into Splunk to enrich your logs with known bad indicators.

Example:

You subscribe to a threat feed that lists:

  • Known malicious IP addresses

  • Phishing domain names

  • Malware file hashes

Now, when you search your own logs:

  • If you see an outgoing connection to a malicious IP address, you can alert immediately.

  • If a user downloads a file matching a known malware hash, you can investigate fast.

How Threat Intelligence is integrated:
  • You can import threat feeds using:

    • HTTP Event Collectors (HEC)

    • Lookup files (CSV, JSON)

    • Splunk Threat Intelligence Management (Add-ons, ES apps)

  • Once the data is inside Splunk, you can compare your internal events to the threat indicators easily.
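
A simplified sketch of a search-time comparison against such indicators, assuming a hypothetical lookup file threat_intel_ips.csv with columns ip and threat_description:

index=network sourcetype=firewall
| lookup threat_intel_ips.csv ip AS dest_ip OUTPUT threat_description
| where isnotnull(threat_description)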

2.4 Data Quality Assurance

Now that we have collected, normalized, and enriched our data,
the next important step is to make sure that the data is good and trustworthy.
This is called Data Quality Assurance.

Why is it important?
Because if your data is incomplete, wrong, late, or inconsistent,
then your alerts, reports, and investigations could be wrong too.

Bad data = Bad security decisions.

Main Goals of Data Quality Assurance
Completeness
  • Make sure you are receiving all expected data.

  • Example: If you expect firewall logs every minute, and they suddenly stop, you need to know immediately.

  • Missing data can mean:

    • A system is down.

    • A device was compromised.

    • A network problem occurred.

Best Practice:

  • Set up alerts in Splunk to warn you if no data arrives for a certain time.
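
A common pattern for such an alert is to compare each host's most recent event against the current time. A sketch, assuming a hypothetical index named firewall and a search window of the last 24 hours:

| tstats max(_time) AS last_seen where index=firewall by host
| eval minutes_since_last_event = round((now() - last_seen) / 60)
| where minutes_since_last_event > 15
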
Accuracy
  • Make sure data fields are extracted correctly.

  • Example:

    • If src_ip is extracted as dst_ip by mistake, your searches and detections will be wrong.
  • Wrong field mappings can cause missed threats or false alarms.

Best Practice:

  • Randomly check sample events after new field extractions.

  • Validate extracted fields manually during onboarding.

Timeliness
  • Make sure data is arriving at Splunk quickly and without long delays.

  • Example:

    • If an alert about a malware infection arrives 6 hours late, it might be too late to act.

Best Practice:

  • Measure the difference between event time (when the event happened) and index time (when Splunk ingested it).

  • Set alerts for delays beyond acceptable limits (e.g., >5 minutes).
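
The lag between event time and index time can be measured directly in SPL using the internal _indextime field. A sketch:

index=* earliest=-1h
| eval lag_seconds = _indextime - _time
| stats avg(lag_seconds) AS avg_lag, max(lag_seconds) AS max_lag by sourcetype
| where max_lag > 300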

Consistency
  • Make sure field names and formats are standardized across all sources.

  • Example:

    • IP addresses should always use the same field name, like src_ip for source IPs.

    • Timestamps should use consistent formats (e.g., UTC time).

Best Practice:

  • Follow CIM mapping rules strictly.

  • Review data models regularly to enforce consistency.

2.5 Data Storage and Indexing

After the data is collected, normalized, enriched, and checked for quality,
we now need to store it properly so that it is:

  • Easy to search

  • Organized

  • Efficient to manage

In Splunk, data is stored in Indexes.

Index Strategy

An Index in Splunk is like a dedicated container where a specific type of data is stored.
Choosing how you organize your indexes is very important for performance and security.

How to plan Indexes:
  • Separate by Data Type:

    • Example:

      • Create one index for network logs (index=network).

      • Create one index for endpoint logs (index=endpoint).

      • Create one index for authentication logs (index=auth).

  • Benefits of separating data into different indexes:

    • Faster searches (search only the indexes you need).

    • Easier to manage data retention and backup.

    • Better access control (e.g., only the network team can see the network index).

Retention Policies:
  • Different types of data should be kept for different lengths of time.

  • Example Retention Periods:

    • Raw security event logs (e.g., firewall, VPN) → Keep for 90 days.

    • Critical audit logs (e.g., user login history) → Keep for 1 year or longer.

  • In Splunk, you can configure each index to automatically delete old data when it is no longer needed.
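
Retention is configured per index in indexes.conf via frozenTimePeriodInSecs. A minimal sketch with hypothetical index names:

# indexes.conf
# keep network logs for 90 days
[network]
frozenTimePeriodInSecs = 7776000

# keep authentication logs for 365 days
[auth]
frozenTimePeriodInSecs = 31536000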

Storage Optimization

Because data can grow very fast, we need to store it smartly to save space and money.

Two main techniques:
Summary Indexing
  • Instead of storing every tiny detail, you can store summarized results.

  • Example:

    • Instead of recounting every raw login event for a report, you save daily counts of successful logins.

Benefits:

  • Dramatically reduces storage needs.

  • Speeds up reports and dashboards.
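
One simple way to build such a summary is a scheduled search that writes aggregated results into a dedicated summary index using the collect command. A sketch, assuming a hypothetical index named summary_auth:

index=auth action=success earliest=-1d@d latest=@d
| stats count AS successful_logins by user
| collect index=summary_auth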

Data Aging Strategies

In Splunk, indexed data passes through different storage stages:

  1. Hot Bucket:

    • Newest data, being actively written.

  2. Warm Bucket:

    • Older data, not being written but still actively searchable.

  3. Cold Bucket:

    • Older data, stored more cheaply but still searchable.

  4. Frozen Bucket:

    • Very old data. Normally deleted or archived outside of Splunk (e.g., moved to Amazon S3).

You can configure how much data stays at each stage based on:

  • Storage costs

  • Search performance needs

  • Compliance requirements
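
Bucket locations and aging behavior are also set in indexes.conf. A rough sketch with hypothetical paths, including an archive directory for frozen data instead of deletion:

# indexes.conf
[network]
homePath   = $SPLUNK_DB/network/db
coldPath   = $SPLUNK_DB/network/colddb
thawedPath = $SPLUNK_DB/network/thaweddb
# archive frozen buckets instead of deleting them
coldToFrozenDir = /archive/splunk/network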

2.6 Data Security and Compliance

Once your data is collected, normalized, enriched, quality-checked, and stored,
you must also make sure that the data is protected and meets compliance rules.

In cybersecurity, keeping your security data secure is just as important as collecting it!

Encryption
What is Encryption?

Encryption means scrambling the data so that only authorized people or systems can read it.

In Splunk environments, you must encrypt:

Encryption in Transit
  • Protect data while it moves from one system to another.

  • Example:

    • When a Splunk Forwarder sends logs to a Splunk Indexer, that data should be encrypted using SSL/TLS.
  • Without encryption, attackers could intercept and read the logs during transmission.

Encryption at Rest
  • Protect data when it is stored on disk (in indexes, archives, buckets).

  • Example:

    • If someone steals a hard drive from your server, encrypted files would still be unreadable without the encryption key.
  • Methods include:

    • File system encryption (e.g., using LUKS, BitLocker, etc.).

    • Splunk’s native support for some encryption configurations.
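
A rough sketch of enabling TLS between a forwarder and an indexer. Attribute names vary between Splunk versions, and the certificate paths and passwords below are placeholders:

# inputs.conf on the indexer (receive encrypted forwarder traffic)
[splunktcp-ssl:9997]
disabled = 0

[SSL]
serverCert = /opt/splunk/etc/auth/mycerts/indexer.pem
sslPassword = <certificate password>

# outputs.conf on the forwarder (send over TLS)
[tcpout:primary_indexers]
server = splunk-indexer.example.com:9997
clientCert = /opt/splunkforwarder/etc/auth/mycerts/forwarder.pem
sslPassword = <certificate password>
sslVerifyServerCert = true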

Access Control
What is Access Control?

It means only allowing the right people to access certain types of data.

In Splunk, access control is role-based:

  • Roles:

    • A role defines what a user can see and do.

    • Example roles:

      • SOC Analyst: Can search indexes, view incidents.

      • SOC Manager: Can create reports, manage alerts.

      • Administrator: Full access to everything.

Important practices:
  • Only allow access to sensitive indexes (like authentication logs or financial data) to authorized users.

  • Use least privilege principle:

    • Give users the minimum access they need to do their jobs — no more.
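
Roles and their index access are defined in authorize.conf. A minimal sketch of a least-privilege analyst role, with hypothetical role and index names:

# authorize.conf
[role_soc_analyst]
importRoles = user
srchIndexesAllowed = network;endpoint;auth
srchIndexesDefault = network;endpoint
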
Auditing
What is Auditing?

Auditing means recording and monitoring who is accessing what, when, and how.

In Splunk:

  • Every login, search, dashboard access, and configuration change can be recorded.

  • These audit logs help answer questions like:

    • Who viewed sensitive data?

    • Who changed an important alert rule?

    • Was there any suspicious behavior inside the system?

Why is auditing important?
  • Helps detect insider threats.

  • Required for compliance with regulations like:

    • PCI-DSS

    • HIPAA

    • SOX

    • GDPR

  • Provides evidence during security investigations.

3. Important Best Practices for Data Engineering

Now that you understand the core tasks of Data Engineering,
you must also learn the best habits (best practices) that real professionals use.
Following these best practices ensures that your Splunk environment stays:

  • Reliable

  • Easy to manage

  • High-quality for security operations

Let’s go through each practice carefully:

Always validate and normalize data before using it in correlation searches

  • Never trust raw incoming data blindly.

  • Validate:

    • Check if the data format is correct.

    • Make sure important fields (like usernames, IPs, timestamps) are present.

  • Normalize:

    • Map the data fields correctly to the Common Information Model (CIM).
  • Why important?

    • Correlation searches and security alerts rely on accurate, standardized fields.

    • Wrong fields = Wrong alerts = Missed attacks.

Maintain detailed documentation for each data source

For every system that sends data to Splunk (firewalls, servers, EDR tools, etc.),
you should document important information, like:

  • What type of logs are sent

  • What fields are extracted

  • What index stores the data

  • Retention policy (how long to keep it)

Why important?

  • New team members can understand the setup easily.

  • Troubleshooting becomes faster when problems occur.

  • Helps during audits and compliance checks.

Regularly audit and review data source health

Even after setting everything up, things can break over time:

  • Forwarders can crash.

  • Network problems can interrupt log transmission.

  • New versions of devices may change log formats.

Best Practice:

  • Set up health checks in Splunk Monitoring Console:

    • Check if data is still flowing.

    • Alert if data volume suddenly drops.

  • Review ingestion dashboards weekly.

Continuously enrich data to improve detection capabilities

Threats evolve. Your enrichment must evolve too:

  • Update asset and identity lookups.

  • Add new threat intelligence feeds.

  • Integrate additional data sources (cloud security logs, IoT devices, etc.)

Why important?

  • New context = better, faster threat detection.

  • Staying current keeps your security operations strong.

Implement alerting on data pipeline issues

Don’t wait to find out by accident that data stopped flowing!

Set up alerts for:

  • No logs received from a critical device in the last 15 minutes.

  • Missing forwarder heartbeats (forwarders that stop reporting to the deployment server).

  • Excessive parsing errors (meaning log formats may have changed).

Good Example Alert:

Alert if any critical index receives 0 events in 15 minutes
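
A sketch of that alert as a scheduled tstats search, assuming a hypothetical index named firewall; schedule it every 15 minutes over the last 15 minutes and trigger when a result is returned:

| tstats count where index=firewall
| where count = 0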

Why important?

  • Early warnings help fix problems before they become security gaps.

4. Key Splunk Features to Master for Data Engineering

If you want to become good at Data Engineering for cybersecurity using Splunk,
you must learn to use certain tools and features inside Splunk.

Here’s a patient and beginner-friendly explanation of the essential ones:

Splunk Add-on Builder

What it is:
  • A tool to help you easily create custom Splunk add-ons.
What it helps with:
  • Data Collection: Configure how Splunk collects custom logs.

  • Field Extraction: Define how Splunk should pull important fields.

  • CIM Mapping: Help map your data to the Common Information Model (CIM).

Why it's useful:
  • If you're onboarding a new or custom log source (like a rare security tool), you can use this builder instead of writing complicated configurations manually.

  • Makes your data "ready" for Splunk Enterprise Security (ES) and other apps.

Example use:
  • Build a custom add-on for a new threat intelligence platform your company just started using.

Splunk Data Model Acceleration

What it is:
  • A way to make searches much faster by pre-computing results.
How it works:
  • Splunk builds summarized datasets behind the scenes.

  • When you run a search, it reads from the summary, not from raw data.

  • This makes dashboards, reports, and alerts run in seconds instead of minutes.

Why it's important for Data Engineering:
  • Keeps your security dashboards (especially in Splunk Enterprise Security) fast and responsive.

  • Helps scale to large amounts of data without slowing down searches.

Example use:
  • Accelerate the "Authentication" data model so that all login-related searches become super quick.
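
Acceleration is enabled per data model, either in the UI or in datamodels.conf. A minimal sketch:

# datamodels.conf
[Authentication]
acceleration = 1
acceleration.earliest_time = -30d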

Splunk Apps for Specific Sources

What they are:
  • Pre-built packages created by Splunk or the Splunk community.
What they provide:
  • Data collection configurations

  • Field extractions

  • CIM mappings

  • Dashboards and reports

Examples:
  • Splunk Add-on for AWS: For collecting AWS CloudTrail logs.

  • Splunk Add-on for Windows Infrastructure: For Active Directory and Windows server logs.

Why they are useful:
  • Save time: No need to build everything from scratch.

  • Best practices built-in: Mapping, extraction, and visualization are already optimized.

Good Practice:
  • Always search Splunkbase (the Splunk app store) when onboarding a new data source — chances are, someone already built a great add-on for it!

tstats and datamodel commands for efficient searches

tstats command:
  • A super-fast Splunk command that reads from data models instead of raw events.
Why tstats is amazing:
  • Much faster than normal searches (search or stats).

  • Great for big environments with lots of data.

Example of a normal search:

index=auth sourcetype=wineventlog:security action=success

Example of a faster tstats search:

| tstats summariesonly=true count from datamodel=Authentication where Authentication.action=success

Same result, but much faster and lighter on system resources!

Important:
  • To use tstats, your data must be properly mapped to a Data Model.

Splunk Monitoring Console

What it is:
  • A built-in Splunk app to monitor the health of your entire Splunk deployment.
What you can see:
  • Indexer and Search Head health

  • Forwarder connection status

  • Data ingestion rates

  • Storage usage

  • Search performance

Why it's critical:
  • Helps you catch problems early (e.g., slow searches, missing data, forwarders down).

  • Essential for maintaining a stable and healthy Splunk environment.

Good Practice:
  • Regularly check Monitoring Console dashboards — make it a part of your weekly routine!

Data Engineering (Additional Content)

1. Data Normalization – Relationship Between Data Models and CIM Mapping

In Splunk, Data Models and the Common Information Model (CIM) are closely related and critical for effective data normalization.

Data Models

  • Data Models in Splunk, such as Authentication, Network Traffic, and Malware, provide predefined, structured schemas that organize data for faster and more consistent searching.

  • These models define data object categories, expected field names, and field types.

CIM Mapping

  • CIM Mapping means ensuring that your ingested event data fields align with the standard field names defined by the CIM framework.

  • Proper CIM Mapping enables your data to fit correctly into Splunk’s Data Models.

  • For example, authentication logs must map usernames to user, IP addresses to src, and authentication outcomes to action.

Why this Relationship Matters

  • If CIM Mapping is accurate:

    • You can use powerful, accelerated datamodel searches like | tstats and | datamodel, making searches faster and more efficient.
  • If CIM Mapping is incorrect or missing:

    • Accelerated Data Models might not populate correctly.

    • tstats searches may return no results or incomplete results.

    • Security detection, dashboards, and reporting become unreliable.

Key takeaway:
Proper CIM Mapping is not just about standardized field names. It directly impacts search performance, detection accuracy, and Splunk Enterprise Security operations.

2. Threat Intelligence Integration – Auto-Lookups

In Splunk Enterprise Security (ES), Auto-Lookups are an important feature to automatically enrich security events with threat intelligence at search time.

How Auto-Lookups Work

  • Threat intelligence data (malicious IPs, domains, file hashes) is imported into Splunk as Lookup Tables.

  • Auto-Lookup configurations are applied to specific data sources or sourcetypes.

  • When a search runs, Splunk automatically joins relevant events with threat intelligence lookups without requiring manual intervention.

Example Scenario

  • A firewall log shows an outbound connection to an external IP.

  • During search time, Splunk checks if the IP matches a known malicious IP in the threat intelligence lookup table.

  • If a match is found, the event is enriched with extra fields such as threat_match_field and threat_collection_name.
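
Outside of the built-in Enterprise Security threat intelligence framework, the same mechanism can be wired up manually as an automatic lookup in props.conf and transforms.conf. A minimal sketch with hypothetical names:

# transforms.conf (lookup table definition)
[threat_intel_ips]
filename = threat_intel_ips.csv

# props.conf (apply the lookup automatically at search time)
[firewall_logs]
LOOKUP-threat_intel = threat_intel_ips ip AS dest_ip OUTPUT threat_description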

Why Auto-Lookups Are Important

  • They improve detection accuracy without needing to modify the raw event data.

  • They allow dynamic enrichment even if the original log source lacks complete threat context.

  • They save analyst time, as matches are highlighted immediately in search results.

Key takeaway:
Auto-Lookups enable seamless threat enrichment during searches, enhancing threat detection capability without increasing storage needs.

3. Data Storage and Indexing – Index-time vs Search-time Field Extraction

Understanding the difference between index-time and search-time field extraction is essential for building efficient Splunk deployments.

Index-time Field Extraction

  • Fields are extracted and stored during indexing, before the data is saved in Splunk.

  • Advantages:

    • Faster search performance because field values are already stored.
  • Disadvantages:

    • Increases index size significantly.

    • Field extraction errors are permanent and hard to fix later.

    • Modifications require re-indexing data, which is complex and resource-intensive.

Search-time Field Extraction

  • Fields are extracted dynamically during search execution.

  • Advantages:

    • More flexible. Changes to field extraction rules take effect immediately without re-indexing.

    • Saves storage space compared to index-time extraction.

  • Best Practice:

    • Prefer search-time field extractions unless index-time extraction is absolutely necessary (e.g., for extremely high-frequency fields needed in every search).

SPLK-5002 Exam Focus

  • Emphasis is placed on search-time field extraction best practices.

  • Candidates must understand when and why index-time extraction might be used sparingly.

Key takeaway:
Search-time extraction is the preferred method because it maintains flexibility, saves storage space, and is safer for maintaining long-term system performance.

4. Auditing – Using the _audit Index for System Audit

Splunk’s built-in _audit index is a key source of audit data for monitoring activities within Splunk itself.

What the _audit Index Tracks

  • User authentication activities (login success, login failures)

  • Search execution activities (who ran what search and when)

  • Configuration changes (modifying dashboards, alerts, saved searches, roles, etc.)

  • System access patterns

Key Fields in _audit

  • user: The username associated with the action.

  • action: The action performed, such as search, login, or edit.

  • info: The outcome of the action, such as success or failure.

  • search: If applicable, the SPL query that was run.

Why the _audit Index is Important

  • Helps detect insider threats by identifying suspicious user activity.

  • Supports incident investigations involving misuse of Splunk access.

  • Provides mandatory evidence for compliance audits, including frameworks like SOX, HIPAA, and PCI-DSS.

Example Query for Exam

Find users who failed to log in within the past 24 hours:

index=_audit action="login attempt" info="failed" earliest=-24h | stats count by user

This query counts failed login attempts by user, which can indicate account compromise attempts or misconfigurations.

Key takeaway:
Monitoring the _audit index is essential for maintaining operational security, detecting misuse, and fulfilling compliance requirements.

Final Summary

By mastering these additional knowledge points, you will:

  • Deepen your understanding of CIM Mapping’s role in data models.

  • Be able to configure dynamic threat enrichment through Auto-Lookups.

  • Correctly choose between index-time and search-time field extractions.

  • Confidently use Splunk’s audit data to meet security and compliance goals.

Frequently Asked Questions

During onboarding of Windows logs, analysts notice that fields exist in raw events but are not searchable as fields in Splunk. What configuration area should be investigated first?

Answer:

Search-time field extraction configurations such as props.conf and transforms.conf should be investigated first.

Explanation:

If fields appear in the raw event but not as searchable fields, the issue usually lies with field extraction rules rather than indexing. Splunk performs many extractions during search-time using configurations defined in props.conf referencing extraction logic in transforms.conf. Misconfigured regex, incorrect sourcetype assignment, or missing extraction rules can prevent field creation even though the data exists in the event. Analysts should verify sourcetype accuracy and confirm extraction rules match the event structure. A common mistake is assuming indexing issues when the actual problem occurs during search-time processing.
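
When extraction logic lives in transforms.conf, props.conf references it with a REPORT setting. A minimal sketch of that pattern, with hypothetical stanza, sourcetype, and field names:

# transforms.conf (extraction logic)
[extract_logon_type]
REGEX = Logon Type:\s+(\d+)
FORMAT = logon_type::$1

# props.conf (sourcetype references the transform at search time)
[my_windows_security]
REPORT-logon_type = extract_logon_type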

Demand Score: 68

Exam Relevance Score: 84

A security engineer must ensure that multiple security data sources follow a consistent schema for detection rules. What Splunk framework should be applied?

Answer:

The Splunk Common Information Model (CIM) should be applied.

Explanation:

The Common Information Model standardizes field names and event structures across different log sources. By mapping fields from various technologies—such as firewalls, authentication logs, or endpoint telemetry—to CIM-compliant data models, detection searches can operate consistently regardless of the original source format. This allows correlation searches, dashboards, and risk rules to function across heterogeneous datasets without needing source-specific logic. A common error during onboarding is failing to map fields properly to CIM-required fields, which prevents detections or ES dashboards from recognizing the data.

Demand Score: 63

Exam Relevance Score: 86

Why might a security engineer choose index-time field extraction instead of search-time extraction?

Answer:

Index-time extraction may be chosen to improve search performance when specific fields are frequently queried.

Explanation:

Index-time field extraction stores extracted fields during indexing, allowing faster search operations because the field values are already indexed. This approach benefits high-volume environments where the same fields are repeatedly used in detection searches or dashboards. However, index-time extractions increase indexing complexity and storage requirements and are harder to modify after ingestion. Search-time extractions are typically preferred unless performance or operational requirements justify index-time parsing. A frequent mistake is applying index-time extraction unnecessarily, leading to difficult configuration management.

Demand Score: 55

Exam Relevance Score: 78
