SPLK-1003 Getting Data In

Detailed list of SPLK-1003 knowledge points

Getting Data In Detailed Explanation

Splunk’s ability to ingest data from various sources is one of its core features. This guide explores input methods, metadata assignment, and best practices for efficient data ingestion.

1. Input Methods

Splunk offers several ways to ingest data, depending on the data source and format. Let’s break down the most common input methods.

1.1 File and Directory Monitoring

  • Overview:

    • This method is used to monitor log files, CSVs, JSON files, or any other file-based data. Splunk tracks changes and ingests new data as it appears.
  • Key Configuration:

    • Use the inputs.conf file to specify file paths or directories for monitoring.
  • Example:

    • Monitor a single log file:

      [monitor:///var/log/syslog]
      disabled = false
      index = main
      sourcetype = syslog
      
    • Monitor all files in a directory (including subdirectories):

      [monitor:///var/log/app/]
      disabled = false
      index = app_logs
      sourcetype = app_log
      recursive = true
      
  • Use Cases:

    • System log files (/var/log/).
    • Application logs for troubleshooting (e.g., Apache access logs).

1.2 Network Inputs

  • Overview:

    • Collect data sent over the network, such as Syslog messages or data from IoT devices.
    • Splunk listens on specified TCP or UDP ports.
  • Key Configuration:

    • Use inputs.conf to configure network inputs.
  • Example:

    • Listen for Syslog messages on UDP port 514:

      [udp://514]
      disabled = false
      index = syslog_index
      sourcetype = syslog
      
    • Listen for custom data on TCP port 10514:

      [tcp://10514]
      disabled = false
      index = custom_index
      sourcetype = custom_data
      
  • Use Cases:

    • Collecting real-time logs from network devices like routers and firewalls.
    • Receiving application telemetry over TCP.
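To see what a UDP input receives, you can simulate the sender and listener on the loopback interface. This sketch binds a throwaway local port instead of 514 (which requires root) and stands in for Splunk only to illustrate the transport:

```python
import socket

# Listener: stands in for a Splunk [udp://...] input
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))          # port 0 = pick any free port
port = listener.getsockname()[1]

# Sender: stands in for a network device emitting a syslog message
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"<13>Test syslog message", ("127.0.0.1", port))

data, addr = listener.recvfrom(1024)
print(data.decode())
listener.close()
sender.close()
```

Note that UDP is fire-and-forget: the sender gets no confirmation, which is why TCP inputs are preferred when delivery matters.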

1.3 Scripted Inputs

  • Overview:

    • Scripted inputs run custom scripts to generate or collect data dynamically.
    • Commonly used for fetching external data (e.g., calling APIs or executing system commands).
  • Key Configuration:

    • Define the script path in inputs.conf.
  • Example:

    • Run a Python script to fetch data every minute:

      [script://./bin/fetch_data.py]
      disabled = false
      interval = 60
      sourcetype = custom_script
      index = script_index
      
  • Use Cases:

    • Fetching weather data from an API.
    • Executing system commands to collect performance metrics.
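A scripted input is simply a program whose standard output Splunk captures at each interval. A minimal sketch of such a script (the metric name and value below are illustrative placeholders, not real measurements):

```python
import json
import time

# Each run emits one event on stdout; Splunk captures it at the
# interval configured in inputs.conf.
event = {
    "time": int(time.time()),
    "metric": "open_file_handles",   # hypothetical metric name
    "value": 42,                     # hypothetical reading
}
output = json.dumps(event)
print(output)
```

Emitting one JSON object per line keeps event boundaries unambiguous when Splunk parses the script's output.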

1.4 APIs

  • Overview:

    • Use Splunk’s REST API or HTTP Event Collector (HEC) to send data programmatically.
  • Key Configuration:

    • Enable the HTTP Event Collector in Splunk Web:
      • Navigate to Settings > Data Inputs > HTTP Event Collector.
      • Create a token for authentication.
  • Example:

    • Send JSON data using HEC:

      curl -k "https://splunk-server:8088/services/collector/event" \
      -H "Authorization: Splunk <token>" \
      -d '{"event": "test_event", "sourcetype": "json_data", "index": "api_index"}'
      
  • Use Cases:

    • Sending logs from custom applications.
    • Integrating Splunk with third-party tools.
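The curl call above posts a single JSON envelope. Building that envelope programmatically makes the required fields explicit; this sketch only constructs the body (the token is a placeholder and the actual HTTP POST is omitted):

```python
import json

def hec_event(event, sourcetype, index):
    """Build a single HEC event envelope as a JSON string."""
    return json.dumps({
        "event": event,
        "sourcetype": sourcetype,
        "index": index,
    })

body = hec_event("test_event", "json_data", "api_index")
print(body)
# To send it: POST `body` to https://<splunk-server>:8088/services/collector/event
# with the header  Authorization: Splunk <token>
```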

2. Metadata Assignment

Metadata defines key attributes of the data being ingested. Proper metadata assignment ensures that data is parsed and categorized correctly.

2.1 Host

  • Definition:

    • Identifies the source system where the data originated.
  • Default Behavior:

    • Splunk assigns the hostname of the system where the data was collected.
  • Custom Assignment:

    • Define a custom host in inputs.conf:

      [monitor:///var/log/syslog]
      host = custom_host
      

2.2 Source

  • Definition:
    • Specifies the origin of the data, such as the file path, port, or API endpoint.
  • Example:
    • Default source for /var/log/syslog: source=/var/log/syslog.

2.3 Sourcetype

  • Definition:

    • Determines how Splunk parses and extracts fields from the data.
  • Examples:

    • Predefined sourcetypes:

      • syslog: For system logs.
      • access_combined: For Apache access logs.
    • Custom sourcetype:

      [monitor:///var/log/custom.log]
      sourcetype = custom_log
      

3. Best Practices

3.1 Validate Data Using Regular Expressions

  • Use regex to ensure incoming data matches expected patterns.

  • Example:

    • Validate IP addresses in logs:

      ^(\d{1,3}\.){3}\d{1,3}$
      

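The pattern above can be exercised directly in Python before deploying it. Note that it only checks digit grouping, so out-of-range octets like 999.1.1.1 still match; a stricter check would also compare each octet against 255:

```python
import re

IP_PATTERN = re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")

valid = bool(IP_PATTERN.match("192.168.1.1"))
invalid = bool(IP_PATTERN.match("not-an-ip"))
loose = bool(IP_PATTERN.match("999.1.1.1"))   # matches despite being out of range
print(valid, invalid, loose)
```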
3.2 Monitor Ingestion Performance

  • Use the Monitoring Console to track:

    • Data ingestion rates.
    • Indexer performance.
  • Example SPL Query:

    index=_internal source=*metrics.log group=per_host_thruput
    | stats sum(kbps) as total_kbps by host
    

3.3 Use Index-Specific Inputs

  • Assign specific data to dedicated indexes for better organization and performance:

    • Example:

      [monitor:///var/log/webserver.log]
      index = web_logs
      

Hands-On Exercises

Exercise 1: Monitor a Directory for Log Files

Goal: Configure Splunk to monitor and ingest logs from /var/log/app.

Steps:
  1. Edit inputs.conf:

    • Add the following configuration:

      [monitor:///var/log/app/]
      disabled = false
      index = app_logs
      sourcetype = app_log
      recursive = true
      
  2. Restart Splunk:

    • Restart to apply the changes:

      ./splunk restart
      
  3. Verify Data Ingestion:

    • Use the following SPL query to search for the ingested data:

      index=app_logs | stats count by sourcetype
      

Exercise 2: Ingest Syslog Data from Network Devices

Goal: Configure Splunk to listen for Syslog messages on UDP port 514.

Steps:
  1. Edit inputs.conf:

    • Add the following configuration:

      [udp://514]
      disabled = false
      index = syslog_index
      sourcetype = syslog
      
  2. Verify Listening Port:

    • Check if Splunk is listening on port 514:

      netstat -tuln | grep 514
      
  3. Send Test Data:

    • Use a tool like logger to send a Syslog message:

      logger -n <splunk_server_ip> -P 514 "Test syslog message"
      
  4. Search the Data:

    • Use SPL to find the test message:

      index=syslog_index sourcetype=syslog
      

Exercise 3: Use Scripted Inputs to Fetch Data

Goal: Use a custom Python script to fetch data from an API and ingest it into Splunk.

Steps:
  1. Create a Script:

    • Save the following Python script as fetch_data.py in the $SPLUNK_HOME/bin/scripts/ directory:

      import json

      import requests  # third-party library: pip install requests

      # Fetch data from a sample API; fail fast on network errors
      response = requests.get(
          "https://jsonplaceholder.typicode.com/todos/1", timeout=10
      )
      response.raise_for_status()
      data = response.json()

      # Print the data as single-line JSON so Splunk ingests one event
      print(json.dumps(data))
      
  2. Configure inputs.conf:

    • Add the following entry:

      [script://$SPLUNK_HOME/bin/scripts/fetch_data.py]
      disabled = false
      interval = 60
      sourcetype = json_data
      index = api_index
      
  3. Restart Splunk:

    • Restart to activate the input:

      ./splunk restart
      
  4. Verify Data:

    • Search for the fetched data:

      index=api_index sourcetype=json_data
      

Real-World Scenarios

Scenario 1: Centralized Log Monitoring for Multiple Servers

Goal: Collect logs from multiple servers using Splunk Universal Forwarders.

Steps:
  1. Install Universal Forwarder:

    • Install the Splunk Universal Forwarder on each server.
  2. Configure Forwarders:

    • Edit outputs.conf on the forwarders:

      [tcpout]
      defaultGroup = indexer_group
      
      [tcpout:indexer_group]
      server = <indexer_ip>:9997
      
  3. Add Inputs:

    • Define inputs in inputs.conf on the forwarders:

      [monitor:///var/log/server_logs/]
      disabled = false
      sourcetype = server_logs
      index = main
      
  4. Verify Data on Indexer:

    • Run a search to confirm logs are received:

      index=main sourcetype=server_logs
      

Scenario 2: Parsing and Extracting Fields from Custom Logs

Goal: Parse a custom log format to extract specific fields.

Steps:
  1. Define Parsing Rules in props.conf:

    • Example log: 192.168.1.1 - - [01/Jan/2025:12:00:00] "GET /index.html" 200

    • Add this to props.conf:

      [custom_log]
      REPORT-custom_fields = extract_custom_fields
      
  2. Create Extraction Rules in transforms.conf:

    • Add the following:

      [extract_custom_fields]
      REGEX = ^(?P<ip>\d+\.\d+\.\d+\.\d+) .+ \[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<uri>[^\s]+)" (?P<status>\d+)
      FORMAT = ip::$1 timestamp::$2 method::$3 uri::$4 status::$5
      
  3. Verify Field Extraction:

    • Run a search:

      index=custom_index sourcetype=custom_log
      
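The transforms.conf regex above can be checked offline against the sample log line before deploying it, since Splunk's extraction regexes use the same named-group syntax as Python's re module:

```python
import re

# Same pattern as the [extract_custom_fields] stanza in transforms.conf
PATTERN = re.compile(
    r'^(?P<ip>\d+\.\d+\.\d+\.\d+) .+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<uri>[^\s]+)" (?P<status>\d+)'
)

line = '192.168.1.1 - - [01/Jan/2025:12:00:00] "GET /index.html" 200'
fields = PATTERN.match(line).groupdict()
print(fields)
```

Validating the pattern this way avoids a deploy/restart cycle just to discover a regex typo.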

Troubleshooting Data Ingestion Issues

Common Issues and Fixes

Issue 1: Data Not Appearing in Splunk
  • Cause:

    • Incorrect inputs.conf configuration.
  • Solution:

    • Verify inputs.conf settings using btool:

      splunk cmd btool inputs list --debug
      
Issue 2: Duplicate Events
  • Cause:

    • File rotation or re-ingestion of data.
  • Solution:

    • Ensure crcSalt is not used unnecessarily in inputs.conf; with crcSalt = <SOURCE>, a rotated file that reappears under a new path is reindexed in full.

    • Setting to review (remove it if rotation is causing duplicates):

      [monitor:///var/log/syslog]
      crcSalt = <SOURCE>
      
Issue 3: High Latency in Data Ingestion
  • Cause:

    • Indexer or forwarder bottlenecks.
  • Solution:

    • Monitor ingestion performance:

      index=_internal source=*metrics.log group=per_host_thruput
      

Best Practices for Data Ingestion

  1. Use Index-Specific Inputs:

    • Assign specific inputs to dedicated indexes for better organization and performance.
  2. Enable Compression for Forwarders:

    • Compress data sent from forwarders to reduce bandwidth usage:

      [tcpout]
      compressed = true
      
  3. Monitor and Tune Regularly:

    • Use the Monitoring Console to identify bottlenecks and optimize configurations.

Getting Data In (Additional Content)

Getting data into Splunk involves much more than simply configuring inputs. Advanced configurations, such as crcSalt, whitelist/blacklist filters, and HEC performance tuning, are essential for large-scale, reliable, and efficient ingestion.

1. Understanding crcSalt

The crcSalt setting plays a critical role in how Splunk determines whether a file has been previously indexed.

What is CRC?

Splunk uses a CRC (cyclic redundancy check) hash calculated from the beginning of a file (the first 256 bytes by default, adjustable with initCrcLength) to determine if a file is new or has already been indexed.

Why Use crcSalt?

  • To force Splunk to reindex a file even if its contents are mostly the same.

  • Helpful in environments where files rotate but retain similar headers.

Configuration Syntax:

[monitor:///var/log/my_app.log]
crcSalt = <SOURCE>

Effect of crcSalt = <SOURCE>:

  • Appends the full file path to the CRC hash calculation.

  • Ensures that files with the same content but different paths are treated as distinct files.

  • Can prevent false de-duplication during ingestion.

Warning:

  • If used incorrectly (e.g., with static log file names across environments), it may cause unintended duplication of data.

Use Case:

You have multiple files with the same content but different file names or directories:

/var/log/host1/app.log  
/var/log/host2/app.log

By setting crcSalt = <SOURCE>, Splunk indexes both files independently.
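The effect of crcSalt = <SOURCE> can be sketched with a plain CRC32. Splunk's internal hashing differs in detail; this only illustrates why mixing the path into the hash separates files with identical leading bytes:

```python
import zlib

header = b"Jan 01 00:00:00 app started\n"  # identical first bytes in both files

paths = ["/var/log/host1/app.log", "/var/log/host2/app.log"]

# Without salting, identical leading bytes hash identically, so the
# second file would be treated as already indexed.
unsalted = [zlib.crc32(header) for p in paths]

# crcSalt = <SOURCE> mixes the file path into the hash input, giving
# the two files distinct fingerprints.
salted = [zlib.crc32(p.encode() + header) for p in paths]

print(unsalted, salted)
```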

2. Input Filtering with Whitelist and Blacklist

When using monitor stanzas, it is important to ingest only the necessary files to avoid wasted storage and performance issues.

Whitelist / Blacklist Overview:

  • whitelist: a regex pattern to include files.

  • blacklist: a regex pattern to exclude files.

They can be combined for precise control over file selection.

Configuration Example:

[monitor:///var/log/]
whitelist = \.log$
blacklist = debug.*

Explanation:

  • Only files ending with .log will be considered (whitelist).

  • Files matching debug.* (e.g., debug.log, debug_trace.log) will be excluded (blacklist).

Tip: whitelist and blacklist patterns are matched against the full file path, so anchor on the end of the path to target an exact file name:

blacklist = debug\.log$

Use Case:

  • In a large /var/log/ directory, include only .log files except for debug.log.
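The filtering logic can be mirrored in a few lines: a file is monitored only if it matches the whitelist and does not match the blacklist. This sketch uses the same regex patterns as the example stanza above:

```python
import re

WHITELIST = re.compile(r"\.log$")
BLACKLIST = re.compile(r"debug.*")

def should_monitor(path):
    """Apply whitelist-then-blacklist filtering to a candidate path."""
    return bool(WHITELIST.search(path)) and not BLACKLIST.search(path)

paths = ["/var/log/app.log", "/var/log/debug.log", "/var/log/notes.txt"]
kept = [p for p in paths if should_monitor(p)]
print(kept)
```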

3. HEC: High-Performance Settings (batchMode, enableAck)

When using the HTTP Event Collector (HEC) for scalable, reliable ingestion from cloud-native or custom apps, advanced parameters can optimize throughput and delivery assurance.

3.1 batchMode

  • Used to group multiple events into a single HTTP POST.

  • Reduces HTTP overhead and increases throughput.

How to Use:

In the payload, concatenate the event envelopes in a single POST body. HEC accepts them back to back, optionally newline-separated; it does not accept a JSON array:

{"event": "first_event"}
{"event": "second_event"}

API Behavior:

One HTTP request carries many events, amortizing connection and header overhead across the batch.

Use Case:
  • IoT or log aggregators sending thousands of small events.

  • Use batching to optimize network usage and reduce indexing delay.
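A batched HEC request is one POST whose body is several event envelopes concatenated back to back. A sketch of building such a body (the sourcetype and index are the same placeholders used earlier; the HTTP call itself is omitted):

```python
import json

def hec_batch(events, sourcetype, index):
    """Concatenate HEC event envelopes into one request body."""
    return "\n".join(
        json.dumps({"event": e, "sourcetype": sourcetype, "index": index})
        for e in events
    )

body = hec_batch(["first_event", "second_event"], "json_data", "api_index")
print(body)
```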

3.2 enableAck

  • Enables event delivery acknowledgment between sender and HEC.

  • Ensures that data is not lost due to network or system failure.

  • Recommended in high-assurance environments (e.g., financial systems, compliance-critical logs).

Enable it in Splunk:

Splunk Web → Settings → Data Inputs → HTTP Event Collector → Select Token → Enable Index Acknowledgment

Sender-side Header:

"X-Splunk-Request-Channel": "<UUID>"

Expected Workflow:
  1. Client sends data with a client-generated channel ID in the request header.

  2. Splunk responds with an ackId for the request.

  3. Client polls the /services/collector/ack endpoint (with the same channel) to confirm the ackId has been indexed.

Use Case for enableAck:

  • Gives the sender delivery confirmation; note this is at-least-once rather than "exactly once" delivery, since retrying unacknowledged batches can duplicate events.

  • Use it when logs must not be lost under any circumstance (e.g., audit logs).
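The acknowledgment handshake comes down to two pieces the client constructs: the channel header on the original POST, and the ack-status query body. A sketch of those payloads, assuming the endpoint details above (the token, channel UUID, and ackIds are illustrative, and the HTTP calls themselves are omitted):

```python
import json
import uuid

# 1. Every POST to /services/collector/event carries a client-chosen channel
channel = str(uuid.uuid4())
headers = {
    "Authorization": "Splunk <token>",       # placeholder token
    "X-Splunk-Request-Channel": channel,
}

# 2. Splunk's response to each POST contains an ackId; the client later
#    asks whether those ackIds have been indexed by POSTing this body
#    to /services/collector/ack (with the same channel)
ack_ids = [0, 1]                             # illustrative ackIds
ack_query = json.dumps({"acks": ack_ids})
print(ack_query)
```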

Summary Best Practices

  • crcSalt: Use crcSalt = <SOURCE> when filenames differ but file contents are similar. Avoid unintentional duplication.

  • Whitelist/Blacklist: Use both for precise file filtering and to reduce ingest volume.

  • HEC batching: Enable for bulk data ingestion from microservices or high-volume pipelines.

  • HEC enableAck: Use when event acknowledgment is critical; monitor /services/collector/ack for tracking.

Frequently Asked Questions

What is the primary function of a Splunk Universal Forwarder?

Answer:

To collect and forward raw data to Splunk indexers.

Explanation:

The Universal Forwarder is a lightweight Splunk component designed specifically for collecting data from source systems and forwarding it to indexers for processing and storage. It consumes minimal system resources and does not perform indexing or advanced processing locally. Instead, it sends raw event data to indexers where parsing and indexing occur. Because of its efficiency, the Universal Forwarder is typically installed on production servers, application hosts, and infrastructure systems where performance impact must be minimized. A common misunderstanding is that forwarders can perform indexing; however, indexing occurs only on indexer instances.

Demand Score: 88

Exam Relevance Score: 94

When should a Heavy Forwarder be used instead of a Universal Forwarder?

Answer:

When data requires parsing, filtering, or transformation before forwarding.

Explanation:

A Heavy Forwarder runs a full Splunk Enterprise instance and includes parsing and processing capabilities that are not available in the Universal Forwarder. It can perform tasks such as event filtering, routing, and data transformation before sending events to indexers. This makes Heavy Forwarders useful when organizations need intermediate data processing or when data must be forwarded to multiple destinations. However, because Heavy Forwarders require more system resources than Universal Forwarders, they are typically deployed only when advanced processing is required rather than on every data source.

Demand Score: 86

Exam Relevance Score: 92

Which configuration file is used to define data inputs in Splunk?

Answer:

inputs.conf.

Explanation:

The inputs.conf file defines the data sources that Splunk monitors and ingests. Administrators configure this file to specify inputs such as monitored files, directories, network ports, scripts, and other data sources. Each input is defined within a stanza that identifies the input type and includes parameters such as file paths, index destinations, and sourcetypes. Proper configuration of inputs.conf ensures that Splunk collects the correct data from the desired sources. Misconfigured inputs may result in missing data or duplicate ingestion.

Demand Score: 84

Exam Relevance Score: 93

Which CLI command can be used to add a file monitoring input on a forwarder?

Answer:

splunk add monitor.

Explanation:

The Splunk CLI provides commands that allow administrators to configure inputs directly from the command line. The splunk add monitor command adds a file or directory monitoring input that enables Splunk to track changes and ingest new data from the specified path. When executed, the command updates the inputs.conf configuration to include the new monitoring stanza. Administrators may also specify additional options such as the target index and sourcetype. This method is commonly used during automated deployments or scripting scenarios where manual configuration through Splunk Web is not practical.

Demand Score: 81

Exam Relevance Score: 91

In the Splunk data ingestion pipeline, which component typically receives data forwarded from Universal Forwarders?

Answer:

Indexers.

Explanation:

Universal Forwarders send collected data directly to indexers in most Splunk architectures. The indexers are responsible for parsing the incoming data, extracting timestamps, creating searchable indexes, and storing the events on disk. In larger deployments, multiple indexers work together in clusters to handle high ingestion volumes and provide redundancy. While forwarders gather and transmit data, they do not perform indexing or long-term storage. This separation of roles allows Splunk deployments to scale efficiently as data volumes increase.

Demand Score: 78

Exam Relevance Score: 92
