Splunk is a data analytics platform, and its value begins with the ability to collect data from a wide range of sources. This data is ingested through inputs defined in configuration files, processed, and indexed for later searching.
Splunk supports multiple input methods, and these are defined primarily using the inputs.conf file. Each input method is designed to collect data from different types of systems and formats.
The monitor input watches a specific file or a directory containing files.
Common for collecting application logs, web server logs, or custom log files.
Splunk watches for new content in the file and ingests it line by line.
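A minimal inputs.conf sketch for a monitor input (the index name app_logs and sourcetype nginx:access are assumptions for illustration):

[monitor:///var/log/nginx/access.log]
index = app_logs
sourcetype = nginx:access
disabled = 0

A directory path such as /var/log/nginx/ can be monitored instead, in which case Splunk watches every file inside it.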
TCP and UDP network inputs accept data over network ports.
Used in environments where systems send logs over the network, such as syslog servers.
TCP provides reliability, while UDP is faster but does not guarantee delivery.
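A sketch of network input stanzas in inputs.conf (the port numbers and sourcetype are illustrative):

[tcp://9514]
sourcetype = syslog

[udp://514]
sourcetype = syslog

Receiving syslog directly on Splunk works, though placing a dedicated syslog server in front of Splunk is a common production pattern.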
The HTTP Event Collector (HEC) accepts data via HTTP POST requests to a REST endpoint.
Data is typically sent as JSON; a separate raw endpoint also accepts unstructured payloads.
Uses token-based authentication to secure input.
Common for modern apps and services, cloud environments, or IoT devices.
Example: A cloud app can send log events using HTTP POST to Splunk Cloud via HEC.
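A sketch of the corresponding HEC configuration on the receiving instance (the token name and value are placeholders, not real credentials):

[http://app_events]
token = 11111111-2222-3333-4444-555555555555
disabled = 0

Clients then POST JSON to https://<host>:8088/services/collector/event with an Authorization: Splunk <token> header.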
A scripted input runs a script (written in Python, Bash, etc.) to collect data on a schedule.
Useful for running API calls or commands on remote systems and collecting the results.
The script's output becomes Splunk events.
Example: A script that queries a database and sends the results to Splunk every 10 minutes.
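A minimal scripted-input sketch matching the example above (the script path, index, and sourcetype are assumptions):

[script://$SPLUNK_HOME/bin/scripts/query_db.py]
interval = 600
index = db_metrics
sourcetype = db:results

interval = 600 runs the script every 10 minutes; whatever the script writes to stdout is ingested as events.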
Modular inputs are created using the Splunk Add-on Builder or one of the Splunk SDKs.
More advanced and reusable than simple scripted inputs.
Allows integration with external systems through APIs in a structured, scalable way.
Example: A modular input to pull data from ServiceNow or Salesforce.
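Modular input stanzas use a scheme name defined by the add-on; a hypothetical ServiceNow-style input might look like:

[snow://incident]
interval = 300

The exact scheme name and parameters depend entirely on how the add-on was built, so consult the add-on's documentation rather than this sketch.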
Splunk can natively collect data from Windows systems using built-in features such as:
WMI (Windows Management Instrumentation)
Event Logs (System, Application, Security)
Registry Monitoring
Performance Counters (CPU, memory, network usage)
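A sketch of native Windows inputs in inputs.conf, using the built-in stanza types (counter and interval values are illustrative):

[WinEventLog://Security]
disabled = 0

[perfmon://CPU]
object = Processor
counters = % Processor Time
instances = _Total
interval = 60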
Splunk integrates with cloud storage and streaming platforms using:
Splunk Connect for Kafka
Splunk Add-on for AWS
Splunk Add-on for Microsoft Cloud Services
These allow you to bring in data from cloud-native sources like:
Amazon S3 buckets (logs, CSVs, JSON data)
Kafka topics (real-time event streaming)
Azure Event Hub or Blob Storage
Splunk uses Forwarders to send data from source systems to Indexers.
The Universal Forwarder is a lightweight Splunk agent.
Cannot parse, transform, or filter data.
Ideal for large-scale production environments.
Low resource usage and stable for 24/7 log collection.
Example: Installed on hundreds of servers to send logs to central Indexers.
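Forwarding destinations are defined in outputs.conf on the forwarder; a minimal sketch (host names are illustrative):

[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997

With more than one server listed, the forwarder automatically load-balances across them.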
The Heavy Forwarder is a full Splunk instance with all capabilities, including parsing and indexing (though indexing is typically disabled).
Can apply props.conf and transforms.conf to:
Route events to different Indexers
Mask or extract fields
Modify timestamp formats
Used when preprocessing is required before the data reaches the Indexer.
Example: Used at the edge of a secure network to sanitize and forward logs to the core Splunk environment.
Splunk processes incoming data through a pipeline that consists of several key phases. Each phase transforms the data and prepares it for storage and searching.
In the input phase, Splunk reads raw data from the inputs defined in inputs.conf.
It could be a file, a TCP stream, a script output, or an HEC POST.
In the parsing phase, data is broken into individual events based on line breaks and timestamps.
Timestamp extraction happens here.
This phase uses configuration from props.conf.
In the indexing phase, events are written to the index.
Metadata such as host, source, sourcetype, index, and timestamp is assigned and stored.
The data is compressed and saved in buckets.
In the search phase, the indexed data becomes searchable using the Splunk Search Processing Language (SPL).
This phase happens when a user or scheduled job runs a query.
Understanding this pipeline helps in troubleshooting issues like delayed data, improperly formatted events, or wrong timestamps.
Correct time handling is essential in Splunk because all search queries are based on time ranges.
Splunk uses the following settings in props.conf to identify the correct timestamp in the raw event:
TIME_FORMAT: Defines the exact format of the timestamp.
TIME_PREFIX: A string or regex that appears before the timestamp.
MAX_TIMESTAMP_LOOKAHEAD: How many characters to scan after the prefix.
Example: If your logs look like this:
[2024-04-25 10:15:30] User logged in
Then you would use:
TIME_PREFIX = \[
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 20
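Put together, these settings live in a props.conf stanza keyed to the data, for example by sourcetype (the sourcetype name my_app:log is an assumption):

[my_app:log]
TIME_PREFIX = \[
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 20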
If Splunk cannot extract a timestamp from the event:
It will use the file modification time (for file inputs)
Or the current system time (for other inputs)
You can disable timestamp extraction entirely by setting:
DATETIME_CONFIG = NONE
This might be useful for binary data or metrics that do not require time-based indexing.
Some logs contain multiple lines per event (e.g., Java stack traces), so Splunk must know where each event starts and ends. This is called line breaking, and it is configured in props.conf.
LINE_BREAKER: A regular expression that defines the exact point where a new event starts.
SHOULD_LINEMERGE: When set to false, tells Splunk to use LINE_BREAKER instead of guessing.
BREAK_ONLY_BEFORE: Defines a pattern that marks the start of a new event.
BREAK_ONLY_AFTER: Defines a pattern that marks the end of the current event.
Correct configuration ensures that:
One event is not split into multiple parts.
Multiple lines that belong together are not incorrectly separated.
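A typical multiline-event stanza in props.conf, assuming events that start with a date (the sourcetype name java_app:log is illustrative):

[java_app:log]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)(?=\d{4}-\d{2}-\d{2})

LINE_BREAKER with SHOULD_LINEMERGE = false is generally faster than the BREAK_ONLY_* settings, because Splunk breaks events in a single pass instead of merging lines after the fact.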
Example: If each event starts with a date, like:
2024-04-25 10:10:00 ERROR something happened
Details:
Line 1
Line 2
Then:
BREAK_ONLY_BEFORE = ^\d{4}-\d{2}-\d{2}

Optimizing how Splunk collects data from files can significantly impact indexing efficiency and prevent duplication. Several parameters in inputs.conf and props.conf can be tuned to improve performance:
crcSalt: Used to alter the file signature (CRC checksum) that Splunk uses to determine whether a file has already been indexed.
Prevents duplicate indexing when:
Multiple files have the same name and size (e.g., rotated logs)
Files are moved between systems with similar paths
Common usage: append the source path to the checksum
[monitor:///var/log/app.log]
crcSalt = <SOURCE>
Without this, Splunk might skip files it thinks it already indexed.
initCrcLength: Sets the number of bytes Splunk reads when computing a file's initial CRC checksum.
The default is 256 bytes; increasing it helps when many files begin with identical content (such as a common header), so Splunk can still tell them apart.
[monitor:///data/logs]
initCrcLength = 1024
This setting affects how reliably Splunk identifies and tracks files during discovery.
ignoreOlderThan: Tells Splunk to skip files whose last modification time is older than the specified time window.
Useful when indexing historical logs is unnecessary or undesirable.
[monitor:///data/archive/]
ignoreOlderThan = 30d
Prevents Splunk from spending resources reading outdated logs during startup or deployment.
When deploying Splunk Universal Forwarders (UF) across large environments, you may need to limit their bandwidth usage to avoid saturating network links.
This is configured in limits.conf on the Universal Forwarder:
[thruput]
maxKBps = 256
This example limits the UF to 256 KB per second.
A value of 0 means unlimited; note that the Universal Forwarder ships with a default limit of 256 KBps, while full Splunk instances default to 0.
This setting helps prevent performance bottlenecks on low-bandwidth WANs or remote offices.
This is frequently tested in exams via questions about how to “throttle” data ingestion at the source.
Heavy Forwarders (HF) can parse and transform data before it is forwarded to the Indexer. This is useful for masking sensitive data, rewriting fields, or extracting new fields.
This is done using a combination of props.conf and transforms.conf.
props.conf:
[host::appserver*]
TRANSFORMS-mask = mask_cc
transforms.conf:
[mask_cc]
REGEX = (.*?)\d{4}-\d{4}-\d{4}-\d{4}(.*)
FORMAT = $1****-****-****-****$2
DEST_KEY = _raw
REGEX matches the sensitive pattern (here, a credit-card-style number) and captures the text around it
FORMAT rebuilds the event with the number replaced by a masked version
DEST_KEY = _raw tells Splunk to rewrite the raw event itself
Note that without the capture groups, FORMAT would replace the entire _raw value with the literal mask, destroying the rest of the event.
This method ensures sensitive data is never indexed, helping organizations meet compliance requirements such as PCI DSS.
This type of config is commonly tested under the topic of data preprocessing or field manipulation.
| Area | Enhancement |
|---|---|
| File-level performance tuning | crcSalt, initCrcLength, and ignoreOlderThan optimize file monitoring |
| Bandwidth throttling | [thruput] maxKBps limits data forwarding rate from UF |
| Pre-index transformations | Use props.conf + transforms.conf on HFs to clean or mask data |
Why might a Universal Forwarder connect to an indexer but fail to send data?
A forwarder may connect successfully but fail to send data if the input configuration is incorrect or the monitored files contain no new data.
The Universal Forwarder establishes a TCP connection to the receiving indexer, but data ingestion only occurs when configured inputs detect new events. If inputs.conf is misconfigured or the monitored file paths are incorrect, Splunk will not detect new events to send. Another common scenario occurs when the monitored file has already been indexed and no additional lines have been appended. Troubleshooting involves verifying input configuration, checking splunkd.log, and confirming that the forwarder recognizes the monitored file paths.
Demand Score: 88
Exam Relevance Score: 90
When should HTTP Event Collector (HEC) be used instead of a Universal Forwarder?
HEC should be used when applications need to send structured events directly to Splunk using HTTP or REST APIs.
The Universal Forwarder is typically used for collecting logs from operating systems or file systems. However, modern applications often generate event data programmatically. HTTP Event Collector allows these applications to send JSON-formatted events directly to Splunk via HTTP endpoints. This approach simplifies integration with cloud platforms, microservices, and serverless environments where installing a forwarder may not be feasible. HEC also supports token-based authentication and can accept batched events from application code or automation scripts.
Demand Score: 80
Exam Relevance Score: 83
What is the purpose of the outputs.conf configuration in a Universal Forwarder?
outputs.conf defines where the forwarder sends collected data.
The configuration specifies the destination indexer or indexer cluster that receives forwarded events. It also defines load balancing settings, failover targets, and SSL configuration for secure communication. If outputs.conf is misconfigured or contains incorrect indexer addresses, the forwarder cannot deliver events to the receiving tier. Administrators typically verify the configuration using splunk list forward-server and review forwarder logs to confirm connectivity.
Demand Score: 82
Exam Relevance Score: 87
Why might a data input appear correctly configured but still produce no events in Splunk searches?
The input may be configured correctly but assigned to the wrong index or sourcetype.
When events are indexed under an unexpected index or sourcetype, searches may fail to return results if the query filters exclude those values. Administrators often mistakenly assume ingestion failed when events were simply stored with different metadata. Verifying the index and sourcetype configuration in inputs.conf and searching across all indexes can help identify this issue.
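A sketch of explicit metadata assignment in inputs.conf, which avoids events landing in an unexpected index (the names are illustrative):

[monitor:///var/log/app.log]
index = app_logs
sourcetype = app:log

If index is omitted, events go to the default index (main), which is a frequent reason that searches scoped to another index return nothing.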
Demand Score: 77
Exam Relevance Score: 81
Why is the Universal Forwarder preferred over heavy forwarders for most data collection scenarios?
The Universal Forwarder is preferred because it uses minimal system resources and focuses solely on forwarding data.
Heavy forwarders include indexing and parsing components, which increase CPU and memory consumption. The Universal Forwarder removes these components and performs lightweight data forwarding to indexers. This design reduces overhead on monitored systems and improves scalability when deploying large numbers of forwarders across infrastructure environments.
Demand Score: 79
Exam Relevance Score: 82
What troubleshooting step should be performed first when no data is appearing in Splunk after configuring a new input?
The first step is to verify whether the input is active and recognized by the Splunk instance.
Administrators should confirm that the input configuration is loaded correctly using CLI or the Splunk Web interface. Checking splunkd.log can reveal configuration errors or permission issues preventing the input from running. If the input is not active, events will never enter the ingestion pipeline. Confirming active inputs helps isolate whether the problem lies in configuration, connectivity, or indexing.
Demand Score: 81
Exam Relevance Score: 85