SPLK-1003 Getting Data In

Detailed list of SPLK-1003 knowledge points

Getting Data In Detailed Explanation

Splunk’s ability to ingest data from various sources is one of its core features. This guide explores input methods, metadata assignment, and best practices for efficient data ingestion.

1. Input Methods

Splunk offers several ways to ingest data, depending on the data source and format. Let’s break down the most common input methods.

1.1 File and Directory Monitoring

  • Overview:

    • This method is used to monitor log files, CSVs, JSON files, or any other file-based data. Splunk tracks changes and ingests new data as it appears.
  • Key Configuration:

    • Use the inputs.conf file to specify file paths or directories for monitoring.
  • Example:

    • Monitor a single log file:

      [monitor:///var/log/syslog]
      disabled = false
      index = main
      sourcetype = syslog
      
    • Monitor all files in a directory (including subdirectories):

      [monitor:///var/log/app/]
      disabled = false
      index = app_logs
      sourcetype = app_log
      recursive = true
      
  • Use Cases:

    • System log files (/var/log/).
    • Application logs for troubleshooting (e.g., Apache access logs).

1.2 Network Inputs

  • Overview:

    • Collect data sent over the network, such as Syslog messages or data from IoT devices.
    • Splunk listens on specified TCP or UDP ports.
  • Key Configuration:

    • Use inputs.conf to configure network inputs.
  • Example:

    • Listen for Syslog messages on UDP port 514:

      [udp://514]
      disabled = false
      index = syslog_index
      sourcetype = syslog
      
    • Listen for custom data on TCP port 10514:

      [tcp://10514]
      disabled = false
      index = custom_index
      sourcetype = custom_data
      
  • Use Cases:

    • Collecting real-time logs from network devices like routers and firewalls.
    • Receiving application telemetry over TCP.
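To see what a UDP input receives, you can simulate the sender and listener on the loopback interface. This sketch binds a throwaway local port instead of 514 (which requires root) and stands in for Splunk only to illustrate the transport:

```python
import socket

# Listener: stands in for a Splunk [udp://...] input
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))          # port 0 = pick any free port
port = listener.getsockname()[1]

# Sender: stands in for a network device emitting a syslog message
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"<13>Test syslog message", ("127.0.0.1", port))

data, addr = listener.recvfrom(1024)
print(data.decode())
listener.close()
sender.close()
```

Note that UDP is fire-and-forget: the sender gets no confirmation, which is why TCP inputs are preferred when delivery matters.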

1.3 Scripted Inputs

  • Overview:

    • Scripted inputs run custom scripts to generate or collect data dynamically.
    • Commonly used for fetching external data (e.g., calling APIs or executing system commands).
  • Key Configuration:

    • Define the script path in inputs.conf.
  • Example:

    • Run a Python script to fetch data every minute:

      [script://./bin/fetch_data.py]
      disabled = false
      interval = 60
      sourcetype = custom_script
      index = script_index
      
  • Use Cases:

    • Fetching weather data from an API.
    • Executing system commands to collect performance metrics.
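A scripted input is simply a program whose standard output Splunk captures at each interval. A minimal sketch of such a script (the metric name and value below are illustrative placeholders, not real measurements):

```python
import json
import time

# Each run emits one event on stdout; Splunk captures it at the
# interval configured in inputs.conf.
event = {
    "time": int(time.time()),
    "metric": "open_file_handles",   # hypothetical metric name
    "value": 42,                     # hypothetical reading
}
output = json.dumps(event)
print(output)
```

Emitting one JSON object per line keeps event boundaries unambiguous when Splunk parses the script's output.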

1.4 APIs

  • Overview:

    • Use Splunk’s REST API or HTTP Event Collector (HEC) to send data programmatically.
  • Key Configuration:

    • Enable the HTTP Event Collector in Splunk Web:
      • Navigate to Settings > Data Inputs > HTTP Event Collector.
      • Create a token for authentication.
  • Example:

    • Send JSON data using HEC:

      curl -k "https://splunk-server:8088/services/collector/event" \
      -H "Authorization: Splunk <token>" \
      -d '{"event": "test_event", "sourcetype": "json_data", "index": "api_index"}'
      
  • Use Cases:

    • Sending logs from custom applications.
    • Integrating Splunk with third-party tools.
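The curl call above posts a single JSON envelope. Building that envelope programmatically makes the required fields explicit; this sketch only constructs the body (the token is a placeholder and the actual HTTP POST is omitted):

```python
import json

def hec_event(event, sourcetype, index):
    """Build a single HEC event envelope as a JSON string."""
    return json.dumps({
        "event": event,
        "sourcetype": sourcetype,
        "index": index,
    })

body = hec_event("test_event", "json_data", "api_index")
print(body)
# To send it: POST `body` to https://<splunk-server>:8088/services/collector/event
# with the header  Authorization: Splunk <token>
```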

2. Metadata Assignment

Metadata defines key attributes of the data being ingested. Proper metadata assignment ensures that data is parsed and categorized correctly.

2.1 Host

  • Definition:

    • Identifies the source system where the data originated.
  • Default Behavior:

    • Splunk assigns the hostname of the system where the data was collected.
  • Custom Assignment:

    • Define a custom host in inputs.conf:

      [monitor:///var/log/syslog]
      host = custom_host
      

2.2 Source

  • Definition:
    • Specifies the origin of the data, such as the file path, port, or API endpoint.
  • Example:
    • Default source for /var/log/syslog: source=/var/log/syslog.

2.3 Sourcetype

  • Definition:

    • Determines how Splunk parses and extracts fields from the data.
  • Examples:

    • Predefined sourcetypes:

      • syslog: For system logs.
      • access_combined: For Apache access logs.
    • Custom sourcetype:

      [monitor:///var/log/custom.log]
      sourcetype = custom_log
      

3. Best Practices

3.1 Validate Data Using Regular Expressions

  • Use regex to ensure incoming data matches expected patterns.

  • Example:

    • Validate IP addresses in logs:

      ^(\d{1,3}\.){3}\d{1,3}$
      

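The pattern above can be exercised directly in Python before deploying it. Note that it only checks digit grouping, so out-of-range octets like 999.1.1.1 still match; a stricter check would also compare each octet against 255:

```python
import re

IP_PATTERN = re.compile(r"^(\d{1,3}\.){3}\d{1,3}$")

valid = bool(IP_PATTERN.match("192.168.1.1"))
invalid = bool(IP_PATTERN.match("not-an-ip"))
loose = bool(IP_PATTERN.match("999.1.1.1"))   # matches despite being out of range
print(valid, invalid, loose)
```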
3.2 Monitor Ingestion Performance

  • Use the Monitoring Console to track:

    • Data ingestion rates.
    • Indexer performance.
  • Example SPL Query:

    index=_internal source=*metrics.log group=per_host_thruput
    | stats sum(kbps) as total_kbps by host
    

3.3 Use Index-Specific Inputs

  • Assign specific data to dedicated indexes for better organization and performance:

    • Example:

      [monitor:///var/log/webserver.log]
      index = web_logs
      

Hands-On Exercises

Exercise 1: Monitor a Directory for Log Files

Goal: Configure Splunk to monitor and ingest logs from /var/log/app.

Steps:
  1. Edit inputs.conf:

    • Add the following configuration:

      [monitor:///var/log/app/]
      disabled = false
      index = app_logs
      sourcetype = app_log
      recursive = true
      
  2. Restart Splunk:

    • Restart to apply the changes:

      ./splunk restart
      
  3. Verify Data Ingestion:

    • Use the following SPL query to search for the ingested data:

      index=app_logs | stats count by sourcetype
      

Exercise 2: Ingest Syslog Data from Network Devices

Goal: Configure Splunk to listen for Syslog messages on UDP port 514.

Steps:
  1. Edit inputs.conf:

    • Add the following configuration:

      [udp://514]
      disabled = false
      index = syslog_index
      sourcetype = syslog
      
  2. Verify Listening Port:

    • Check if Splunk is listening on port 514:

      netstat -tuln | grep 514
      
  3. Send Test Data:

    • Use a tool like logger to send a Syslog message:

      logger -n <splunk_server_ip> -P 514 "Test syslog message"
      
  4. Search the Data:

    • Use SPL to find the test message:

      index=syslog_index sourcetype=syslog
      

Exercise 3: Use Scripted Inputs to Fetch Data

Goal: Use a custom Python script to fetch data from an API and ingest it into Splunk.

Steps:
  1. Create a Script:

    • Save the following Python script as fetch_data.py in the $SPLUNK_HOME/bin/scripts/ directory:

      import json

      import requests  # third-party library: pip install requests

      # Fetch data from a sample API; fail fast on network errors
      response = requests.get(
          "https://jsonplaceholder.typicode.com/todos/1", timeout=10
      )
      response.raise_for_status()
      data = response.json()

      # Print the data as single-line JSON so Splunk ingests one event
      print(json.dumps(data))
      
  2. Configure inputs.conf:

    • Add the following entry:

      [script://$SPLUNK_HOME/bin/scripts/fetch_data.py]
      disabled = false
      interval = 60
      sourcetype = json_data
      index = api_index
      
  3. Restart Splunk:

    • Restart to activate the input:

      ./splunk restart
      
  4. Verify Data:

    • Search for the fetched data:

      index=api_index sourcetype=json_data
      

Real-World Scenarios

Scenario 1: Centralized Log Monitoring for Multiple Servers

Goal: Collect logs from multiple servers using Splunk Universal Forwarders.

Steps:
  1. Install Universal Forwarder:

    • Install the Splunk Universal Forwarder on each server.
  2. Configure Forwarders:

    • Edit outputs.conf on the forwarders:

      [tcpout]
      defaultGroup = indexer_group
      
      [tcpout:indexer_group]
      server = <indexer_ip>:9997
      
  3. Add Inputs:

    • Define inputs in inputs.conf on the forwarders:

      [monitor:///var/log/server_logs/]
      disabled = false
      sourcetype = server_logs
      index = main
      
  4. Verify Data on Indexer:

    • Run a search to confirm logs are received:

      index=main sourcetype=server_logs
      

Scenario 2: Parsing and Extracting Fields from Custom Logs

Goal: Parse a custom log format to extract specific fields.

Steps:
  1. Define Parsing Rules in props.conf:

    • Example log: 192.168.1.1 - - [01/Jan/2025:12:00:00] "GET /index.html" 200

    • Add this to props.conf:

      [custom_log]
      REPORT-custom_fields = extract_custom_fields
      
  2. Create Extraction Rules in transforms.conf:

    • Add the following:

      [extract_custom_fields]
      REGEX = ^(?P<ip>\d+\.\d+\.\d+\.\d+) .+ \[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<uri>[^\s]+)" (?P<status>\d+)
      FORMAT = ip::$1 timestamp::$2 method::$3 uri::$4 status::$5
      
  3. Verify Field Extraction:

    • Run a search:

      index=custom_index sourcetype=custom_log
      
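The transforms.conf regex above can be checked offline against the sample log line before deploying it, since Splunk's extraction regexes use the same named-group syntax as Python's re module:

```python
import re

# Same pattern as the [extract_custom_fields] stanza in transforms.conf
PATTERN = re.compile(
    r'^(?P<ip>\d+\.\d+\.\d+\.\d+) .+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<uri>[^\s]+)" (?P<status>\d+)'
)

line = '192.168.1.1 - - [01/Jan/2025:12:00:00] "GET /index.html" 200'
fields = PATTERN.match(line).groupdict()
print(fields)
```

Validating the pattern this way avoids a deploy/restart cycle just to discover a regex typo.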

Troubleshooting Data Ingestion Issues

Common Issues and Fixes

Issue 1: Data Not Appearing in Splunk
  • Cause:

    • Incorrect inputs.conf configuration.
  • Solution:

    • Verify inputs.conf settings using btool:

      splunk cmd btool inputs list --debug
      
Issue 2: Duplicate Events
  • Cause:

    • File rotation or re-ingestion of data.
  • Solution:

    • Ensure crcSalt is not used unnecessarily in inputs.conf; with crcSalt = <SOURCE>, a rotated file that reappears under a new path is reindexed in full.

    • Setting to review (remove it if rotation is causing duplicates):

      [monitor:///var/log/syslog]
      crcSalt = <SOURCE>
      
Issue 3: High Latency in Data Ingestion
  • Cause:

    • Indexer or forwarder bottlenecks.
  • Solution:

    • Monitor ingestion performance:

      index=_internal source=*metrics.log group=per_host_thruput
      

Best Practices for Data Ingestion

  1. Use Index-Specific Inputs:

    • Assign specific inputs to dedicated indexes for better organization and performance.
  2. Enable Compression for Forwarders:

    • Compress data sent from forwarders to reduce bandwidth usage:

      [tcpout]
      compressed = true
      
  3. Monitor and Tune Regularly:

    • Use the Monitoring Console to identify bottlenecks and optimize configurations.

Getting Data In (Additional Content)

Getting data into Splunk involves much more than simply configuring inputs. Advanced configurations, such as crcSalt, whitelist/blacklist filters, and HEC performance tuning, are essential for large-scale, reliable, and efficient ingestion.

1. Understanding crcSalt

The crcSalt setting plays a critical role in how Splunk determines whether a file has been previously indexed.

What is CRC?

Splunk uses a CRC (cyclic redundancy check) hash calculated from the beginning of a file (the first 256 bytes by default, adjustable with initCrcLength) to determine if a file is new or has already been indexed.

Why Use crcSalt?

  • To force Splunk to reindex a file even if its contents are mostly the same.

  • Helpful in environments where files rotate but retain similar headers.

Configuration Syntax:

[monitor:///var/log/my_app.log]
crcSalt = <SOURCE>

Effect of crcSalt = <SOURCE>:

  • Appends the full file path to the CRC hash calculation.

  • Ensures that files with the same content but different paths are treated as distinct files.

  • Can prevent false de-duplication during ingestion.

Warning:

  • If used incorrectly (e.g., with static log file names across environments), it may cause unintended duplication of data.

Use Case:

You have multiple files with the same content but different file names or directories:

/var/log/host1/app.log  
/var/log/host2/app.log

By setting crcSalt = <SOURCE>, Splunk indexes both files independently.
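The effect of crcSalt = <SOURCE> can be sketched with a plain CRC32. Splunk's internal hashing differs in detail; this only illustrates why mixing the path into the hash separates files with identical leading bytes:

```python
import zlib

header = b"Jan 01 00:00:00 app started\n"  # identical first bytes in both files

paths = ["/var/log/host1/app.log", "/var/log/host2/app.log"]

# Without salting, identical leading bytes hash identically, so the
# second file would be treated as already indexed.
unsalted = [zlib.crc32(header) for p in paths]

# crcSalt = <SOURCE> mixes the file path into the hash input, giving
# the two files distinct fingerprints.
salted = [zlib.crc32(p.encode() + header) for p in paths]

print(unsalted, salted)
```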

2. Input Filtering with Whitelist and Blacklist

When using monitor stanzas, it is important to ingest only the necessary files to avoid wasted storage and performance issues.

Whitelist / Blacklist Overview:

  • whitelist: a regex pattern to include files.

  • blacklist: a regex pattern to exclude files.

They can be combined for precise control over file selection.

Configuration Example:

[monitor:///var/log/]
whitelist = \.log$
blacklist = debug.*

Explanation:

  • Only files ending with .log will be considered (whitelist).

  • Files matching debug.* (e.g., debug.log, debug_trace.log) will be excluded (blacklist).

Tip: whitelist and blacklist patterns are matched against the full file path, so anchor on the end of the path to target an exact file name:

blacklist = debug\.log$

Use Case:

  • In a large /var/log/ directory, include only .log files except for debug.log.
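The filtering logic can be mirrored in a few lines: a file is monitored only if it matches the whitelist and does not match the blacklist. This sketch uses the same regex patterns as the example stanza above:

```python
import re

WHITELIST = re.compile(r"\.log$")
BLACKLIST = re.compile(r"debug.*")

def should_monitor(path):
    """Apply whitelist-then-blacklist filtering to a candidate path."""
    return bool(WHITELIST.search(path)) and not BLACKLIST.search(path)

paths = ["/var/log/app.log", "/var/log/debug.log", "/var/log/notes.txt"]
kept = [p for p in paths if should_monitor(p)]
print(kept)
```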

3. HEC: High-Performance Settings (batchMode, enableAck)

When using the HTTP Event Collector (HEC) for scalable, reliable ingestion from cloud-native or custom apps, advanced parameters can optimize throughput and delivery assurance.

3.1 batchMode

  • Used to group multiple events into a single HTTP POST.

  • Reduces HTTP overhead and increases throughput.

How to Use:

In the payload, concatenate the event envelopes in a single POST body. HEC accepts them back to back, optionally newline-separated; it does not accept a JSON array:

{"event": "first_event"}
{"event": "second_event"}

API Behavior:

One HTTP request carries many events, amortizing connection and header overhead across the batch.

Use Case:
  • IoT or log aggregators sending thousands of small events.

  • Use batching to optimize network usage and reduce indexing delay.
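A batched HEC request is one POST whose body is several event envelopes concatenated back to back. A sketch of building such a body (the sourcetype and index are the same placeholders used earlier; the HTTP call itself is omitted):

```python
import json

def hec_batch(events, sourcetype, index):
    """Concatenate HEC event envelopes into one request body."""
    return "\n".join(
        json.dumps({"event": e, "sourcetype": sourcetype, "index": index})
        for e in events
    )

body = hec_batch(["first_event", "second_event"], "json_data", "api_index")
print(body)
```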

3.2 enableAck

  • Enables event delivery acknowledgment between sender and HEC.

  • Ensures that data is not lost due to network or system failure.

  • Recommended in high-assurance environments (e.g., financial systems, compliance-critical logs).

Enable it in Splunk:

Splunk Web → Settings → Data Inputs → HTTP Event Collector → Select Token → Enable Index Acknowledgment

Sender-side Header:

"X-Splunk-Request-Channel": "<UUID>"

Expected Workflow:
  1. Client sends data with a client-generated channel ID in the request header.

  2. Splunk responds with an ackId for the request.

  3. Client polls the /services/collector/ack endpoint (with the same channel) to confirm the ackId has been indexed.

Use Case for enableAck:

  • Gives the sender delivery confirmation; note this is at-least-once rather than "exactly once" delivery, since retrying unacknowledged batches can duplicate events.

  • Use it when logs must not be lost under any circumstance (e.g., audit logs).
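The acknowledgment handshake comes down to two pieces the client constructs: the channel header on the original POST, and the ack-status query body. A sketch of those payloads, assuming the endpoint details above (the token, channel UUID, and ackIds are illustrative, and the HTTP calls themselves are omitted):

```python
import json
import uuid

# 1. Every POST to /services/collector/event carries a client-chosen channel
channel = str(uuid.uuid4())
headers = {
    "Authorization": "Splunk <token>",       # placeholder token
    "X-Splunk-Request-Channel": channel,
}

# 2. Splunk's response to each POST contains an ackId; the client later
#    asks whether those ackIds have been indexed by POSTing this body
#    to /services/collector/ack (with the same channel)
ack_ids = [0, 1]                             # illustrative ackIds
ack_query = json.dumps({"acks": ack_ids})
print(ack_query)
```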

Summary Best Practices

  • crcSalt: Use crcSalt = <SOURCE> when filenames differ but file contents are similar. Avoid unintentional duplication.

  • Whitelist/Blacklist: Use both for precise file filtering and to reduce ingest volume.

  • HEC batching: Enable for bulk data ingestion from microservices or high-volume pipelines.

  • HEC enableAck: Use when event acknowledgment is critical; monitor /services/collector/ack for tracking.

Frequently Asked Questions

What is the primary function of a Splunk Universal Forwarder?

Answer:

To collect and forward raw data to Splunk indexers.

Explanation:

The Universal Forwarder is a lightweight Splunk component designed specifically for collecting data from source systems and forwarding it to indexers for processing and storage. It consumes minimal system resources and does not perform indexing or advanced processing locally. Instead, it sends raw event data to indexers where parsing and indexing occur. Because of its efficiency, the Universal Forwarder is typically installed on production servers, application hosts, and infrastructure systems where performance impact must be minimized. A common misunderstanding is that forwarders can perform indexing; however, indexing occurs only on indexer instances.

Demand Score: 88

Exam Relevance Score: 94

When should a Heavy Forwarder be used instead of a Universal Forwarder?

Answer:

When data requires parsing, filtering, or transformation before forwarding.

Explanation:

A Heavy Forwarder runs a full Splunk Enterprise instance and includes parsing and processing capabilities that are not available in the Universal Forwarder. It can perform tasks such as event filtering, routing, and data transformation before sending events to indexers. This makes Heavy Forwarders useful when organizations need intermediate data processing or when data must be forwarded to multiple destinations. However, because Heavy Forwarders require more system resources than Universal Forwarders, they are typically deployed only when advanced processing is required rather than on every data source.

Demand Score: 86

Exam Relevance Score: 92

Which configuration file is used to define data inputs in Splunk?

Answer:

inputs.conf.

Explanation:

The inputs.conf file defines the data sources that Splunk monitors and ingests. Administrators configure this file to specify inputs such as monitored files, directories, network ports, scripts, and other data sources. Each input is defined within a stanza that identifies the input type and includes parameters such as file paths, index destinations, and sourcetypes. Proper configuration of inputs.conf ensures that Splunk collects the correct data from the desired sources. Misconfigured inputs may result in missing data or duplicate ingestion.

Demand Score: 84

Exam Relevance Score: 93

Which CLI command can be used to add a file monitoring input on a forwarder?

Answer:

splunk add monitor.

Explanation:

The Splunk CLI provides commands that allow administrators to configure inputs directly from the command line. The splunk add monitor command adds a file or directory monitoring input that enables Splunk to track changes and ingest new data from the specified path. When executed, the command updates the inputs.conf configuration to include the new monitoring stanza. Administrators may also specify additional options such as the target index and sourcetype. This method is commonly used during automated deployments or scripting scenarios where manual configuration through Splunk Web is not practical.

Demand Score: 81

Exam Relevance Score: 91

In the Splunk data ingestion pipeline, which component typically receives data forwarded from Universal Forwarders?

Answer:

Indexers.

Explanation:

Universal Forwarders send collected data directly to indexers in most Splunk architectures. The indexers are responsible for parsing the incoming data, extracting timestamps, creating searchable indexes, and storing the events on disk. In larger deployments, multiple indexers work together in clusters to handle high ingestion volumes and provide redundancy. While forwarders gather and transmit data, they do not perform indexing or long-term storage. This separation of roles allows Splunk deployments to scale efficiently as data volumes increase.

Demand Score: 78

Exam Relevance Score: 92
