Monitor and Maintain Azure Resources

This domain focuses on observability, alerting, backup, disaster recovery, and ongoing maintenance of Azure environments.

1. Monitor Resources Using Azure Monitor

Azure Monitor is a comprehensive platform for collecting, analyzing, and acting on telemetry data from Azure and non-Azure resources.

1.1 Azure Monitor Core Components

1.1.1 Metrics

Numerical, time-series data.
Examples:
- CPU usage
- Disk IOPS
- Network traffic

Used for:

Real-time monitoring
Triggering alerts

1.1.2 Logs

Detailed events and telemetry, collected from:
- Azure resources
- Virtual machines (via agent)
- Applications
Stored in Log Analytics Workspaces

Logs support complex queries using KQL (Kusto Query Language).

1.1.3 Workbooks

Visual dashboards that combine:
- Metrics
- Logs
- Text
- Queries
Interactive and shareable for analysis or reporting

1.2 Diagnostic Settings

1.2.1 What They Do

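Capture the resource logs and platform metrics that Azure resources emit (e.g., Storage, App Services). For VMs, a diagnostic setting covers host-level metrics; guest OS logs are collected by an agent.
Allow routing to one or more destinations.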

1.2.2 Where You Can Send Logs

Log Analytics Workspace: for queries and analysis
Event Hubs: for streaming logs to external SIEM tools (like Splunk)
Azure Storage: for archiving long-term

Example:

Send storage account logs to Log Analytics for querying
Send VM logs to Event Hub for off-Azure security analysis
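
As a rough illustration of the routing options above, the sketch below assembles the kind of JSON body a diagnostic setting carries. The helper name `build_diagnostic_setting` is hypothetical and the resource ID is a placeholder; the property names (`workspaceId`, `eventHubAuthorizationRuleId`, `eventHubName`, `storageAccountId`, `logs`) are believed to match the ARM diagnostic settings schema, but should be checked against the current API reference before use.

```python
import json


def build_diagnostic_setting(categories, workspace_id=None,
                             event_hub_rule_id=None, event_hub_name=None,
                             storage_account_id=None):
    """Return a dict shaped like a diagnostic setting request body (illustrative only)."""
    properties = {
        "logs": [{"category": c, "enabled": True} for c in categories],
    }
    # Each destination is optional, but at least one must be supplied.
    if workspace_id:
        properties["workspaceId"] = workspace_id
    if event_hub_rule_id:
        properties["eventHubAuthorizationRuleId"] = event_hub_rule_id
        if event_hub_name:
            properties["eventHubName"] = event_hub_name
    if storage_account_id:
        properties["storageAccountId"] = storage_account_id
    if len(properties) == 1:  # only "logs" is present, so nothing would be routed
        raise ValueError("at least one destination is required")
    return {"properties": properties}


# Example: storage (blob) log categories routed to a Log Analytics workspace.
# The workspace ID below is a placeholder, not a real resource.
setting = build_diagnostic_setting(
    categories=["StorageRead", "StorageWrite", "StorageDelete"],
    workspace_id="/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
                 "Microsoft.OperationalInsights/workspaces/<workspace>",
)
print(json.dumps(setting, indent=2))
```

The same helper could be called with `event_hub_rule_id` or `storage_account_id` to model the Event Hubs and Azure Storage destinations.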

1.3 Data Sources

Azure Monitor collects data from:

Azure resources: e.g., metrics from VMs, SQL DBs, storage
Guest OS metrics/logs: via Azure Monitor Agent (AMA)
Subscriptions and tenants: for audit logs, activity logs

Supports both platform and custom telemetry.

2. Configure and Interpret Metrics and Logs

Understanding and working with metrics and logs is essential for diagnosing performance issues, identifying trends, and responding to incidents.

2.1 Metrics Explorer

2.1.1 What is Metrics Explorer?

Metrics Explorer is a tool within Azure Monitor that allows you to:

Visualize metrics in near real time
Build charts (line, bar, etc.)
Set up alerts based on metrics

2.1.2 How to Use

Go to a resource (e.g., VM, Storage Account)
Select Monitoring > Metrics
Choose:
- Metric namespace (e.g., “Virtual Machine Host”)
- Metric (e.g., Percentage CPU)
- Aggregation: avg, max, min, sum
- Time range
(Optional) Apply filters, or split the chart by a dimension such as instance

Helps identify performance trends and anomalies

2.2 Log Analytics Queries (KQL)

2.2.1 What is KQL?

Kusto Query Language (KQL) is used to query and analyze logs collected by Azure Monitor.

Similar to SQL, but optimized for time-series and diagnostic data.

2.2.2 Basic Query Example

Perf  
| where ObjectName == "Processor"  
| where CounterName == "% Processor Time"  
| where InstanceName == "_Total"  
| summarize AvgCPU = avg(CounterValue) by bin(TimeGenerated, 5m)

This shows the average CPU usage every 5 minutes.
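
To make the `summarize ... by bin(TimeGenerated, 5m)` step concrete, here is a small plain-Python sketch of the same idea: floor each timestamp to the start of its 5-minute bin, then average the values in each bin. It is not how Log Analytics runs the query, and the sample readings are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def summarize_avg_by_bin(samples, bin_size=timedelta(minutes=5)):
    """Average (timestamp, value) samples per fixed-size time bin."""
    epoch = datetime(1970, 1, 1)
    buckets = defaultdict(list)
    for ts, value in samples:
        # Floor the timestamp to the start of its bin, like bin(TimeGenerated, 5m).
        offset = (ts - epoch) // bin_size
        buckets[epoch + offset * bin_size].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Made-up CPU readings: two fall in the 10:00 bin, one in the 10:05 bin.
samples = [
    (datetime(2024, 1, 1, 10, 0, 30), 20.0),
    (datetime(2024, 1, 1, 10, 4, 0), 40.0),
    (datetime(2024, 1, 1, 10, 6, 0), 90.0),
]
avg_cpu = summarize_avg_by_bin(samples)
for start, avg in avg_cpu.items():
    print(start, avg)
```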

2.2.3 Where to Run KQL Queries

In the Log Analytics workspace
From Azure Monitor > Logs
From VM > Logs

2.3 Custom Logs

2.3.1 What Are Custom Logs?

You can collect your own application or system log files (e.g., logs from a web server) into Log Analytics.

2.3.2 Setup Steps

Install the Azure Monitor Agent (AMA) on the machine
Define the custom log source in a data collection rule (DCR) that targets a custom table (name ending in _CL) in the Log Analytics workspace
Define:
- File path (e.g., C:\logs\myapp.log)
- Delimiters or record patterns
Azure ingests the logs and lets you query them via KQL

2.3.3 Parse Logs with KQL

Once collected, use the parse operator or functions such as split() and extract() to pull fields out of the raw text.

Example:

CustomLog_CL  
| extend Level = extract("Level=([A-Z]+)", 1, RawData)  
| summarize count() by Level
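
The extract-then-count pattern in the query above can be mimicked in plain Python, which may help when testing a regular expression before pasting it into KQL. This is only a sketch: the log lines are invented samples, and an empty string stands in for the empty result that `extract()` gives when nothing matches.

```python
import re
from collections import Counter


def count_by_level(raw_lines):
    """Pull the Level=... token out of each line and count lines per level."""
    pattern = re.compile(r"Level=([A-Z]+)")
    levels = []
    for line in raw_lines:
        match = pattern.search(line)
        # Capture group 1 corresponds to the "1" argument of extract().
        levels.append(match.group(1) if match else "")
    return Counter(levels)


# Invented sample lines standing in for the RawData column.
raw_data = [
    "2024-01-01 10:00:00 Level=ERROR Disk full",
    "2024-01-01 10:00:05 Level=INFO Heartbeat ok",
    "2024-01-01 10:00:09 Level=ERROR Disk full",
    "malformed line",
]
counts = count_by_level(raw_data)
print(counts)
```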

3. Configure Alerts and Actions

Azure Monitor enables you to create alerts based on metrics, logs, or activity logs, and tie them to automated actions like notifications or remediation steps.

3.1 Alerts

3.1.1 Types of Alerts

Alert Type	Triggers on	Example
Metric Alert	Real-time performance data	CPU > 80%
Log Alert	KQL query result	Number of 500 errors > 10
Activity Log Alert	Azure control-plane events	Resource deletion or creation

3.1.2 Scope of Alerts

Alerts can be scoped to:

A single resource
A resource group
An entire subscription

3.1.3 Key Alert Configuration Options

Thresholds: e.g., CPU > 80%
Frequency: How often the condition is evaluated
Evaluation period: Over what time range data is analyzed
Severity levels:
- 0 – Critical
- 1 – Error
- 2 – Warning
- 3 – Informational
- 4 – Verbose

Alerts are managed under:
Azure Monitor > Alerts
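
How threshold, frequency, and evaluation period interact is easier to see with a toy model. The sketch below is plain Python rather than an Azure API: `MetricRule` and `evaluate` are hypothetical names, the CPU series is made up, and the rule is assumed to use an average aggregation over its window.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MetricRule:
    threshold: float     # e.g. 80 for "CPU > 80%"
    frequency_min: int   # how often the condition is evaluated
    window_min: int      # evaluation period: how much history each check looks at
    severity: int = 2    # 0 (Critical) to 4 (Verbose)


def evaluate(rule, per_minute_values):
    """Return the minutes at which the rule's condition is met."""
    fired = []
    for now in range(rule.window_min, len(per_minute_values) + 1, rule.frequency_min):
        window = per_minute_values[now - rule.window_min:now]
        if mean(window) > rule.threshold:
            fired.append(now)
    return fired


# One made-up CPU sample per minute: 10 quiet minutes, then 10 busy ones.
cpu = [30.0] * 10 + [95.0] * 10
rule = MetricRule(threshold=80, frequency_min=5, window_min=5)
print(evaluate(rule, cpu))
```

A longer `window_min` smooths out short spikes, at the cost of reacting later to a sustained problem.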

3.2 Action Groups

3.2.1 What is an Action Group?

An action group is a reusable set of response actions that are triggered when an alert fires.

You define:

Who to notify
What action to take

3.2.2 Notification Methods

Email
SMS
Push notification (via Azure app)
Voice call

3.2.3 Automation Actions

Webhook: Notify a 3rd-party service
Azure Function: Run serverless code in response
Logic App: Trigger a workflow (e.g., post to Teams or Slack)

3.2.4 How to Create and Use

Go to Azure Monitor > Alerts > Action groups
Click + Create
Add:
- Recipients
- Actions
- (Optional) Tags
When creating an alert rule, select this action group

Reuse the same action group across multiple alerts.
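
The reuse point can be pictured with a small model in which two alert rules reference one action group object. This is an illustrative sketch only; the class, the contact details, and the webhook URL are all made up and do not correspond to a real Azure SDK type.

```python
from dataclasses import dataclass, field


@dataclass
class ActionGroup:
    name: str
    notifications: list = field(default_factory=list)  # (kind, target) pairs, e.g. email
    actions: list = field(default_factory=list)        # (kind, target) pairs, e.g. webhook

    def fire(self, alert_name):
        """Return a description of everything triggered for the given alert."""
        targets = self.notifications + self.actions
        return [f"{kind} -> {target} [{alert_name}]" for kind, target in targets]


ops = ActionGroup(
    name="ops-oncall",
    notifications=[("email", "ops@example.com"), ("sms", "+15550100")],
    actions=[("webhook", "https://example.com/hook")],
)

# Two different alert rules point at the same action group.
alert_rules = {"High CPU": ops, "VM deleted": ops}
for alert_name, group in alert_rules.items():
    for line in group.fire(alert_name):
        print(line)
```

Changing the recipients in `ops` once would affect both rules, which is the practical benefit of reusing a group.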

4. Implement Backup and Recovery

Azure provides built-in backup and restore services to protect your workloads and data. These tools support both Azure-native resources and on-premises systems.

4.1 Azure Backup

4.1.1 What Is Azure Backup?

A cloud-based backup solution that eliminates the need for on-prem backup infrastructure. It uses a component called the Recovery Services Vault.

4.1.2 Supported Backup Workloads

Azure VMs: Full snapshot-based backups
On-premises servers:
- Using MARS Agent (Microsoft Azure Recovery Services)
- Or Azure Backup Server (for more advanced workloads)
SQL Server in Azure VMs: App-consistent backups using VSS

4.1.3 Steps to Back Up a VM

Create a Recovery Services Vault
Register the Azure VM
Define a backup policy
Initiate backup now or wait for scheduled backup

VMs can be restored to a new instance, or you can restore individual files.

4.2 Backup Policies

4.2.1 Define Retention Rules

Daily / Weekly / Monthly / Yearly backup points
Retention duration configurable (e.g., keep weekly backups for 12 weeks)

Helps organizations meet data retention requirements
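
To show how overlapping retention rules combine, the following sketch keeps a recovery point if any rule still covers it. It is a simplification written for illustration, not Azure Backup's actual pruning logic; the function name, the "age in days" comparison, and the choice of Sunday as the weekly point are assumptions.

```python
from datetime import date, timedelta


def retained_points(points, today, daily_days, weekly_weeks, weekly_day=6):
    """Return the recovery points still kept under a daily + weekly policy.

    weekly_day uses date.weekday() numbering (0 = Monday, 6 = Sunday).
    """
    keep = []
    for p in points:
        age = (today - p).days
        is_daily = age < daily_days
        is_weekly = p.weekday() == weekly_day and age < weekly_weeks * 7
        # A point survives if at least one retention rule still covers it.
        if is_daily or is_weekly:
            keep.append(p)
    return keep


today = date(2024, 3, 31)  # a Sunday
points = [today - timedelta(days=n) for n in range(100)]  # one backup per day
kept = retained_points(points, today, daily_days=7, weekly_weeks=12)
print(len(kept), "of", len(points), "points retained")
```

Monthly and yearly rules would follow the same pattern with their own "is this point covered" checks.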

4.2.2 Instant Restore

Retains the backup snapshots locally, alongside the disks, for a configurable period (1 to 5 days, 2 by default, under the standard policy).
Enables quick recovery without needing to fetch from vault storage.

4.3 Restore Options

4.3.1 File-Level Restore

For Azure VMs:
- Mount the backup as a virtual drive
- Browse and copy needed files

4.3.2 Full VM Restore

Options:
- Create new VM
- Restore disks only
- Replace existing VM

Use when recovering from ransomware, accidental deletion, or configuration failure.

5. Implement and Configure Azure Site Recovery (ASR)

Azure Site Recovery (ASR) is a disaster recovery solution that replicates workloads to a secondary location, allowing business continuity in case of failures.

5.1 Replication Setup

5.1.1 Azure VM to Azure Region

You can replicate an Azure VM from one region to another.
Requires:
- Recovery Services Vault
- Enabling replication settings per VM
Azure replicates OS disks and data disks asynchronously to the secondary region.

Use case: Protect production VMs from regional outages

5.1.2 On-Prem to Azure

ASR supports replication from:
- Hyper-V (with or without System Center VMM)
- VMware
- Physical servers

Steps:

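Deploy the on-premises ASR components: the Configuration Server and Process Server for VMware and physical servers, or the Site Recovery Provider for Hyper-V
Install the ASR agent (Mobility service) on each VMware VM or physical server to be protected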
Create replication policy
Enable replication to Azure

Azure becomes your disaster recovery site, reducing physical infrastructure needs.

5.2 Recovery Plans

What Are Recovery Plans?

Recovery plans allow you to orchestrate the failover process.

They support:

Grouping VMs by application tier
Sequencing startup
Custom scripts or manual steps

Useful for multi-VM applications where order of startup matters (e.g., DB → App → Web)
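
The ordering behaviour of a recovery plan can be sketched as an ordered list of groups, each with optional actions before and after its VMs start. This is a toy model in plain Python, not the Site Recovery API; the class, VM names, and manual step are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    vms: list
    pre_actions: list = field(default_factory=list)   # scripts or manual steps
    post_actions: list = field(default_factory=list)


def failover_sequence(groups):
    """Flatten a recovery plan into the order its steps would run."""
    steps = []
    for group in groups:  # groups run strictly one after another
        steps += [f"action: {a}" for a in group.pre_actions]
        steps += [f"start: {vm}" for vm in group.vms]  # VMs within a group start together
        steps += [f"action: {a}" for a in group.post_actions]
    return steps


plan = [
    Group("Database", ["sql-01"], post_actions=["verify database is online (manual)"]),
    Group("Application", ["app-01", "app-02"]),
    Group("Web", ["web-01"]),
]
sequence = failover_sequence(plan)
for step in sequence:
    print(step)
```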

5.3 Failover and Failback

5.3.1 Types of Failover

Type	Description
Planned Failover	Controlled, no data loss, used for maintenance or migration
Unplanned Failover	Used during unexpected outages
Test Failover	Non-disruptive; validates DR setup in isolation

All three types can be started manually from the Azure portal or automated with PowerShell.

5.3.2 Failback

Once the primary site is restored:
- Replicate the changed data back from Azure to the on-premises site or the original region
- Then fail back the workloads to the primary site

6. Perform Maintenance Tasks

These tasks help ensure your Azure environment remains secure, up-to-date, and cost-efficient over time.

6.1 Update Management

6.1.1 What Is Update Management?

A solution in Azure Automation (since retired in favor of Azure Update Manager, which covers the same tasks) that helps:

Track missing updates
Schedule patch installation
Monitor compliance

Supports:

Windows VMs
Linux VMs

Helps maintain security compliance by ensuring systems are patched.

6.1.2 How to Set Up

Enable Update Management from your VM or Automation Account
Link the VM to Log Analytics
Define schedule and maintenance window
Select:
- Update types (critical, security)
- Reboot options

6.1.3 Monitoring

Use Update Compliance reports to track:

Successful/failed patch attempts
Overall compliance percentage
Time of last scan

6.2 Monitor Service Health

6.2.1 Azure Service Health Overview

Provides personalized alerts and status notifications for:

Azure outages
Planned maintenance
Regional issues

Access via:
Azure Portal > Service Health

6.2.2 Types of Alerts

Health Advisories: Best practices, changes to service behavior
Security Advisories: Threat detections or patches
Maintenance Notifications: Scheduled updates
Service Incidents: Real-time issue tracking

You can subscribe to email/SMS alerts for your services and regions.

6.3 Resource Optimization

Azure Advisor

Azure Advisor analyzes your environment and provides recommendations for:

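Category	Example
Reliability (formerly High Availability)	Enable backup for critical VMs
Security	Enable MFA or restrict traffic with NSGs
Performance	Apply SQL Database index tuning recommendations
Cost	Resize or shut down underutilized VMs, remove idle resources, or use Reserved Instances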

Each recommendation includes:

Potential impact
Estimated cost savings
Remediation steps

6.4 Performance Tuning

6.4.1 VM Performance

Resize VMs if:
- CPU/memory usage is consistently high or low
Switch to different VM series based on workload

6.4.2 App Service Tuning

Upgrade or downgrade App Service Plans
Tune autoscale rules and diagnostic settings

6.4.3 Database and Storage Tuning

Modify SQL Database DTUs/vCores or Service Tiers
Tune Blob storage access tiers or replication settings for cost-performance balance

Monitor and Maintain Azure Resources (Additional Content)

1. Alert Suppression (Alert Rule Suppression Configuration)

What is Alert Suppression?

Alert suppression allows you to control the frequency of alert notifications to avoid alert storms during high-frequency conditions.

Key Parameters

Suppression Interval:
The minimum duration between successive notifications for the same alert condition.
Use case:
For example, if CPU stays above 90% for 2 hours, you may want only one notification per hour, not one per evaluation cycle.
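
The use case above can be checked with a few lines of Python. This is only a sketch of the idea behind a suppression interval, not the Azure Monitor implementation; the function name and the 5-minute evaluation frequency are assumptions.

```python
def notifications(fired_minutes, suppression_min):
    """Return the evaluation times that actually send a notification."""
    sent = []
    last_sent = None
    for minute in fired_minutes:
        # Notify only if no notification went out within the suppression interval.
        if last_sent is None or minute - last_sent >= suppression_min:
            sent.append(minute)
            last_sent = minute
    return sent


# CPU stays above 90% for 2 hours; the rule is evaluated every 5 minutes.
fired = list(range(0, 120, 5))
sent = notifications(fired, suppression_min=60)
print(len(fired), "evaluations fired,", len(sent), "notifications sent")
```

With a 60-minute interval, the 24 firing evaluations collapse to notifications at minute 0 and minute 60.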

How to Configure

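In the alert rule creation wizard, open the Actions section and set a "Suppression" time window (e.g., 30 minutes, 1 hour).
Depending on the portal version and alert type, the same control may appear as "Mute actions" on the rule, or be configured separately as an alert processing rule.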

Why It Matters

Avoids alert fatigue
Helps teams focus on unique or meaningful alerts

Exam Tip: Understand how to reduce alert noise without disabling alerts entirely.

2. Azure Managed Grafana

Azure Managed Grafana provides a fully managed Grafana environment integrated with Azure Monitor data sources.

Key Features

Native integration with:
- Azure Monitor
- Log Analytics
- Application Insights
Supports:
- Custom dashboards
- Role-based access control
- Team collaboration
No need to maintain infrastructure or apply patches

Benefits

Advanced, custom visualizations (beyond Workbooks)
Ideal for SRE and Ops teams
Supports mixed sources (e.g., Prometheus + Azure Monitor)

Deployment Path

Search for Azure Managed Grafana in Azure Marketplace
Assign users with proper roles (e.g., Viewer, Editor)

Note: AZ-104 does not test Grafana directly, but awareness of monitoring extensibility can help in hybrid environments.

3. KQL Advanced Functions (Join Example)

Kusto Query Language (KQL) supports powerful analytics features like joins, parsing, and data shaping.

Join Example: Combine Performance and Heartbeat Data

Heartbeat
| where TimeGenerated > ago(1h)
| join kind=inner (
    Perf
    | where TimeGenerated > ago(1h)  // limit the right side to the same window
    | where ObjectName == "Processor"
    | where CounterName == "% Processor Time"
) on Computer
| project TimeGenerated, Computer, CounterName, CounterValue

Explanation

Heartbeat shows VM availability.
Perf shows performance metrics.
The query joins both tables on the Computer name, allowing correlation of CPU usage with availability.
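
For readers less familiar with joins, the plain-Python sketch below reproduces what `join kind=inner ... on Computer` does: every Heartbeat row is paired with every Perf row for the same computer, and computers present on only one side drop out. The rows are invented samples and the function name is hypothetical.

```python
def inner_join_on_computer(heartbeat_rows, perf_rows):
    """Pair each heartbeat row with every perf row for the same Computer."""
    perf_by_computer = {}
    for row in perf_rows:
        perf_by_computer.setdefault(row["Computer"], []).append(row)

    joined = []
    for hb in heartbeat_rows:
        for perf in perf_by_computer.get(hb["Computer"], []):
            joined.append({
                "TimeGenerated": hb["TimeGenerated"],  # timestamp from the left (Heartbeat) side
                "Computer": hb["Computer"],
                "CounterName": perf["CounterName"],
                "CounterValue": perf["CounterValue"],
            })
    return joined


# Invented rows: vm-b has no Perf data and vm-c has no heartbeat.
heartbeat = [
    {"TimeGenerated": "10:00", "Computer": "vm-a"},
    {"TimeGenerated": "10:00", "Computer": "vm-b"},
]
perf = [
    {"Computer": "vm-a", "CounterName": "% Processor Time", "CounterValue": 42.0},
    {"Computer": "vm-c", "CounterName": "% Processor Time", "CounterValue": 7.0},
]
rows = inner_join_on_computer(heartbeat, perf)
print(rows)
```

Because every matching pair is emitted, summarizing one side per computer before joining keeps the result small on busy workspaces.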

Other Advanced Functions (Optional)

extend — Add calculated columns
parse_json — Extract structured data from JSON fields
summarize — Aggregate over time windows or categories

Exam Tip: AZ-104 focuses on basic KQL, but join examples may appear in practical or case-study questions.

4. Azure Resource Graph (Optional but Valuable)

Azure Resource Graph is a service for high-performance querying across large-scale Azure environments.

Use Cases

Query across subscriptions or management groups
Inventory reporting (e.g., list all VMs not in a backup vault)
Governance and compliance auditing

Query Example

Resources  
| where type == "microsoft.compute/virtualmachines"  
| project name, location, resourceGroup

Key Benefits

Instant, read-only access to resource metadata
Uses KQL-like syntax
Supports filtering by tags, properties, policies

Access Methods

Azure Portal → Resource Graph Explorer
Azure CLI: az graph query -q "<query>"

Note: Not directly tested in AZ-104, but helpful for real-world automation, inventory, and governance.
