
SPLK-3003 Data Collection

Data Collection Detailed Explanation

Splunk is a data analytics platform, and its value begins with the ability to collect data from a wide range of sources. This data is ingested through inputs defined in configuration files, processed, and indexed for later searching.

1. Data Input Types

Splunk supports multiple input methods, and these are defined primarily using the inputs.conf file. Each input method is designed to collect data from different types of systems and formats.

File and Directory Monitoring

  • Monitors a specific file or a directory containing files.

  • Common for collecting application logs, web server logs, or custom log files.

  • Splunk watches for new content in the file and ingests it line by line.
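As a sketch, a minimal inputs.conf monitor stanza (the path, index, and sourcetype here are illustrative):

[monitor:///var/log/nginx/access.log]
index = web
sourcetype = nginx:access
disabled = 0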

TCP/UDP Streams

  • Accepts data over network ports.

  • Used in environments where systems send logs over the network, such as syslog servers.

  • TCP provides reliability, while UDP is faster but does not guarantee delivery.
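As a sketch, inputs.conf stanzas for network listeners (the ports, index, and sourcetype are illustrative):

[tcp://9514]
sourcetype = syslog
index = network

[udp://514]
sourcetype = syslog
index = network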

HTTP Event Collector (HEC)

  • Accepts data via REST API POST requests.

  • Data is typically sent in JSON format; a raw endpoint is also available for unstructured payloads.

  • Uses token-based authentication to secure input.

  • Common for modern apps and services, cloud environments, or IoT devices.

Example: A cloud app can send log events using HTTP POST to Splunk Cloud via HEC.
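On the receiving Splunk side, HEC is enabled and tokens are defined in inputs.conf; a minimal sketch (the token value, index, and sourcetype are placeholders):

[http]
disabled = 0

[http://cloud_app_token]
token = 11111111-2222-3333-4444-555555555555
index = cloud_events
sourcetype = cloud:app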

Scripted Inputs

  • Runs a script (written in Python, Bash, etc.) to collect data periodically.

  • Useful for running API calls or commands on remote systems and collecting the results.

  • The script's output becomes Splunk events.

Example: A script that queries a database and sends the results to Splunk every 10 minutes.
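The corresponding inputs.conf stanza might look like this (the script path, index, and sourcetype are hypothetical; interval is in seconds):

[script://$SPLUNK_HOME/etc/apps/my_app/bin/query_db.py]
interval = 600
sourcetype = db:results
index = main
disabled = 0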

Modular Inputs

  • Created using Splunk Add-on Builder or SDK.

  • More advanced and reusable than simple scripted inputs.

  • Allows integration with external systems through APIs in a structured, scalable way.

Example: A modular input to pull data from ServiceNow or Salesforce.

Windows Inputs

Splunk can natively collect data from Windows systems using built-in features such as:

  • WMI (Windows Management Instrumentation)

  • Event Logs (System, Application, Security)

  • Registry Monitoring

  • Performance Counters (CPU, memory, network usage)
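For example, Windows Event Log collection can be enabled with inputs.conf stanzas such as the following (valid on Windows instances):

[WinEventLog://Security]
disabled = 0

[WinEventLog://Application]
disabled = 0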

Cloud and Streaming Inputs (Kafka, AWS S3, Azure Blob)

Splunk integrates with cloud storage and streaming platforms using:

  • Splunk Connect for Kafka

  • Splunk Add-on for AWS

  • Splunk Add-on for Microsoft Cloud Services

These allow you to bring in data from cloud-native sources like:

  • Amazon S3 buckets (logs, CSVs, JSON data)

  • Kafka topics (real-time event streaming)

  • Azure Event Hub or Blob Storage

2. Forwarder Types

Splunk uses Forwarders to send data from source systems to Indexers.

Universal Forwarder (UF)

  • A lightweight Splunk agent.

  • Cannot parse, transform, or filter data.

  • Ideal for large-scale production environments.

  • Low resource usage and stable for 24/7 log collection.

Example: Installed on hundreds of servers to send logs to central Indexers.

Heavy Forwarder (HF)

  • A full Splunk instance with all capabilities, including parsing and indexing (though indexing is typically disabled).

  • Can apply props.conf and transforms.conf to:

    • Route events to different Indexers (see the routing sketch below)

    • Mask or extract fields

    • Modify timestamp formats

  • Used when preprocessing is required before the data reaches the Indexer.

Example: Used at the edge of a secure network to sanitize and forward logs to the core Splunk environment.
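A sketch of the routing capability mentioned above, assuming an outputs.conf group named security_indexers (the stanza names and source path are hypothetical):

props.conf

[source::/var/log/secure]
TRANSFORMS-route = route_to_security

transforms.conf

[route_to_security]
REGEX = .
DEST_KEY = _TCP_ROUTING
FORMAT = security_indexers

Here REGEX = . matches every event from that source, and FORMAT names the target [tcpout:security_indexers] group that must be defined in outputs.conf.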

3. Parsing Pipeline (Data Processing Flow)

Splunk processes incoming data through a pipeline that consists of several key phases. Each phase transforms the data and prepares it for storage and searching.

Input Phase

  • Splunk reads raw data from the input defined in inputs.conf.

  • It could be a file, a TCP stream, a script output, or an HEC POST.

Parsing Phase

  • Data is broken into individual events based on line breaks and timestamps.

  • Timestamp extraction happens here.

  • This phase uses configuration from props.conf.

Indexing Phase

  • Events are written to the index.

  • Metadata such as host, source, sourcetype, index, and timestamp is assigned and stored.

  • The data is compressed and saved in buckets.

Search Phase

  • Once the data is indexed, it becomes searchable using the Splunk Search Processing Language (SPL).

  • This phase happens when a user or scheduled job runs a query.

Understanding this pipeline helps in troubleshooting issues like delayed data, improperly formatted events, or wrong timestamps.

4. Timestamps and Time Zone Handling

Correct time handling is essential in Splunk because all search queries are based on time ranges.

Timestamp Extraction

Splunk uses the following settings in props.conf to identify the correct timestamp in the raw event:

  • TIME_FORMAT: Defines the exact format of the timestamp.

  • TIME_PREFIX: A string or regex that appears before the timestamp.

  • MAX_TIMESTAMP_LOOKAHEAD: How many characters to scan after the prefix.

Example: If your logs look like this:

[2024-04-25 10:15:30] User logged in

Then you would use:

  • TIME_PREFIX = \[

  • TIME_FORMAT = %Y-%m-%d %H:%M:%S

  • MAX_TIMESTAMP_LOOKAHEAD = 20
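Assembled into a props.conf stanza (the sourcetype name my_app_log is hypothetical):

[my_app_log]
TIME_PREFIX = \[
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 20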

Fallback Timestamp

If Splunk cannot extract a timestamp from the event:

  • It will use the file modification time (for file inputs)

  • Or the current system time (for other inputs)

Disable Timestamp Extraction

You can disable timestamp extraction entirely by setting:

DATETIME_CONFIG = NONE

This might be useful for binary data or metrics that do not require time-based indexing.

5. Event Line Breaking

Some logs contain multiple lines per event (e.g., Java stack traces), so Splunk must know where each event starts and ends. This is called line breaking, and it is configured in props.conf.

Key Settings

  • LINE_BREAKER: A regular expression whose capture group marks where the raw data stream is split into segments.

  • SHOULD_LINEMERGE: When set to false, each segment produced by LINE_BREAKER becomes its own event; when set to true, Splunk merges lines back into events using the BREAK_ONLY_* settings.

  • BREAK_ONLY_BEFORE: Defines a pattern that marks the start of a new event (applies when SHOULD_LINEMERGE = true).

  • BREAK_ONLY_AFTER: Defines a pattern that marks the end of the current event (applies when SHOULD_LINEMERGE = true).

Correct configuration ensures that:

  • One event is not split into multiple parts.

  • Multiple lines that belong together are not incorrectly separated.

Example: If each event starts with a date, like:

2024-04-25 10:10:00 ERROR something happened
Details:
Line 1
Line 2

Then:

  • Set BREAK_ONLY_BEFORE = ^\d{4}-\d{2}-\d{2}
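A complete props.conf stanza for this case might look like the following (the sourcetype name app_multiline is hypothetical; BREAK_ONLY_BEFORE takes effect when SHOULD_LINEMERGE is true):

[app_multiline]
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = ^\d{4}-\d{2}-\d{2}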

Data Collection (Additional Content)

1. Data Collection Performance Optimization

Optimizing how Splunk collects data from files can significantly impact indexing efficiency and prevent duplication. Several parameters in inputs.conf and props.conf can be tuned to improve performance:

crcSalt

  • Used to alter the file signature (checksum) that Splunk uses to determine if a file has already been indexed.

  • Prevents duplicate indexing when:

    • Multiple files have the same name and size (e.g., rotated logs)

    • Files are moved between systems with similar paths

  • Common usage: append the source path to the checksum

[monitor:///var/log/app.log]
crcSalt = <SOURCE>

Without this, Splunk might skip files it thinks it already indexed.

initCrcLength

  • Sets the number of bytes Splunk reads from the start of a file when computing the initial CRC checksum (default: 256).

  • Increase it when many files begin with an identical header longer than 256 bytes, so Splunk can tell the files apart instead of skipping them as already-indexed duplicates.

[monitor:///data/logs]
initCrcLength = 1024

This setting helps Splunk correctly identify and track files whose first bytes would otherwise look identical.

ignoreOlderThan

  • Tells Splunk to skip files whose last modification time is older than the specified time window.

  • Useful when indexing historical logs is unnecessary or undesirable.

[monitor:///data/archive/]
ignoreOlderThan = 30d

Prevents Splunk from spending resources reading outdated logs during startup or deployment.

2. Throughput Limiting on Universal Forwarder

When deploying Splunk Universal Forwarders (UF) across large environments, you may need to limit their bandwidth usage to avoid saturating network links.

This is configured in limits.conf on the Universal Forwarder:

[thruput]
maxKBps = 256

  • This example limits the UF to 256 KB per second.

  • On a full Splunk Enterprise instance the default is 0 (unlimited), but the Universal Forwarder ships with a default limit of 256 KBps, so constrained environments may need an even lower value.

  • This setting helps prevent performance bottlenecks on low-bandwidth WANs or remote offices.

This is frequently tested in exams via questions about how to “throttle” data ingestion at the source.

3. Pre-Indexing Data Cleansing on Heavy Forwarder (HF)

Heavy Forwarders (HF) can parse and transform data before it is forwarded to the Indexer. This is useful for masking sensitive data, rewriting fields, or extracting new fields.

Example: Masking sensitive information (e.g., credit card numbers)

This is done using a combination of props.conf and transforms.conf.

props.conf

[host::appserver*]
TRANSFORMS-mask = mask_cc

transforms.conf

[mask_cc]
REGEX = \d{4}-\d{4}-\d{4}-\d{4}
FORMAT = ****-****-****-****
DEST_KEY = _raw

  • REGEX matches the sensitive pattern (e.g., credit card number)

  • FORMAT replaces it with a masked version

  • DEST_KEY = _raw tells Splunk to rewrite the actual event

This method ensures sensitive data is never indexed, helping organizations meet compliance standards such as PCI DSS.

This type of config is commonly tested under the topic of data preprocessing or field manipulation.

Summary

  • File-level performance tuning: crcSalt, initCrcLength, and ignoreOlderThan optimize file monitoring.

  • Bandwidth throttling: [thruput] maxKBps limits the data forwarding rate from the UF.

  • Pre-index transformations: props.conf + transforms.conf on HFs clean or mask data.

Frequently Asked Questions

Why might a Universal Forwarder connect to an indexer but fail to send data?

Answer:

A forwarder may connect successfully but fail to send data if the input configuration is incorrect or the monitored files contain no new data.

Explanation:

The Universal Forwarder establishes a TCP connection to the receiving indexer, but data ingestion only occurs when configured inputs detect new events. If inputs.conf is misconfigured or the monitored file paths are incorrect, Splunk will not detect new events to send. Another common scenario occurs when the monitored file has already been indexed and no additional lines have been appended. Troubleshooting involves verifying input configuration, checking splunkd.log, and confirming that the forwarder recognizes the monitored file paths.
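One quick check from the forwarder's CLI (the path assumes a default install):

$SPLUNK_HOME/bin/splunk list monitor

This lists the files and directories the forwarder is actively monitoring, confirming whether the configured paths were recognized.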

Demand Score: 88

Exam Relevance Score: 90

When should HTTP Event Collector (HEC) be used instead of a Universal Forwarder?

Answer:

HEC should be used when applications need to send structured events directly to Splunk using HTTP or REST APIs.

Explanation:

The Universal Forwarder is typically used for collecting logs from operating systems or file systems. However, modern applications often generate event data programmatically. HTTP Event Collector allows these applications to send JSON-formatted events directly to Splunk via HTTP endpoints. This approach simplifies integration with cloud platforms, microservices, and serverless environments where installing a forwarder may not be feasible. HEC also supports token-based authentication and can accept batched events from application code or automation scripts.

Demand Score: 80

Exam Relevance Score: 83

What is the purpose of the outputs.conf configuration in a Universal Forwarder?

Answer:

outputs.conf defines where the forwarder sends collected data.

Explanation:

The configuration specifies the destination indexer or indexer cluster that receives forwarded events. It also defines load balancing settings, failover targets, and SSL configuration for secure communication. If outputs.conf is misconfigured or contains incorrect indexer addresses, the forwarder cannot deliver events to the receiving tier. Administrators typically verify the configuration using splunk list forward-server and review forwarder logs to confirm connectivity.
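A minimal outputs.conf sketch (the group name and indexer hostnames are hypothetical):

[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997

Listing multiple servers in one group enables automatic load balancing across the indexers.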

Demand Score: 82

Exam Relevance Score: 87

Why might a data input appear correctly configured but still produce no events in Splunk searches?

Answer:

The input may be configured correctly but assigned to the wrong index or sourcetype.

Explanation:

When events are indexed under an unexpected index or sourcetype, searches may fail to return results if the query filters exclude those values. Administrators often mistakenly assume ingestion failed when events were simply stored with different metadata. Verifying the index and sourcetype configuration in inputs.conf and searching across all indexes can help identify this issue.
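A quick way to see where recent events actually landed (an SPL sketch; run it over a short time range):

index=* | stats count by index, sourcetype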

Demand Score: 77

Exam Relevance Score: 81

Why is the Universal Forwarder preferred over heavy forwarders for most data collection scenarios?

Answer:

The Universal Forwarder is preferred because it uses minimal system resources and focuses solely on forwarding data.

Explanation:

Heavy forwarders include indexing and parsing components, which increase CPU and memory consumption. The Universal Forwarder removes these components and performs lightweight data forwarding to indexers. This design reduces overhead on monitored systems and improves scalability when deploying large numbers of forwarders across infrastructure environments.

Demand Score: 79

Exam Relevance Score: 82

What troubleshooting step should be performed first when no data is appearing in Splunk after configuring a new input?

Answer:

The first step is to verify whether the input is active and recognized by the Splunk instance.

Explanation:

Administrators should confirm that the input configuration is loaded correctly using CLI or the Splunk Web interface. Checking splunkd.log can reveal configuration errors or permission issues preventing the input from running. If the input is not active, events will never enter the ingestion pipeline. Confirming active inputs helps isolate whether the problem lies in configuration, connectivity, or indexing.
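As a sketch, two common first checks on the affected instance (paths assume a default Linux install):

$SPLUNK_HOME/bin/splunk btool inputs list --debug

grep -i error $SPLUNK_HOME/var/log/splunk/splunkd.log

btool shows how the layered configuration files combine, and splunkd.log records errors such as malformed stanzas or permission problems.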

Demand Score: 81

Exam Relevance Score: 85
