Microscopic technical focus: Mapping test datasets to groundedness, relevance, coherence, fluency, safety, built-in metrics, and custom metrics.
Beginner explanation: Evaluation converts subjective response quality into repeatable checks. Dataset mapping is the first dependency because metrics are meaningless if fields are wrong.
Operational split for this point: start with Evaluation dataset, then verify Data mapping and Groundedness metric before trusting any production outcome. The exam is testing whether the candidate can locate the missing dependency, not whether the candidate recognizes every service name in the scenario.
For this knowledge point, the target objects are Evaluation dataset, Data mapping, Groundedness metric, Safety evaluator, Custom metric, Evaluation workflow. The exam usually describes one broken link in that chain. The correct answer is the option that restores the missing operational dependency rather than the option that only describes the platform at a high level.
Why-layer: An evaluation dataset matters on the exam only when the surrounding dependency chain can run. In this topic, scores are misleading when question, context, expected answer, and labels are mapped to the wrong evaluator fields. The correct configuration matters because it changes the state that controls evaluation; a nearby but unrelated action leaves the same failure mode in place.
Decision tree: if the scenario describes access failure, inspect identity and RBAC before changing compute or code; if it describes unresolved assets, inspect name, version, and scope; if it describes runtime failure, inspect logs, endpoint invocation, metrics, or evaluation output; if it describes quality degradation, inspect data, retrieval, evaluation, and monitoring evidence before changing the model.
Common mistakes: Selecting a familiar Azure service without checking the missing dependency in the scenario. Treating a successful create operation as proof of runtime behavior. Choosing a monitoring action when the scenario asks for configuration or access remediation.
Practice question: A GenAI app returns plausible but unsupported answers, and the team needs a repeatable evaluation gate before release.
A. Map question, context, expected answer, and safety labels correctly, then run quality and safety evaluators with release thresholds.
B. Review five generated responses manually and approve the prompt if they look reasonable.
C. Increase model temperature to improve groundedness scores.
D. Add more application logs instead of creating an evaluation dataset.
Correct Answer: A
Explanation: A is correct because evaluation depends on mapped test data and measurable gates. B is not repeatable, C can worsen consistency, and D observes requests but does not score output quality.
The common decision point is: Manual answer review cannot replace repeatable evaluation when prompt and model versions must be compared before release. Therefore, read every scenario for the actor, the resource scope, the object version, the network path, the metric threshold, and the expected observable result.
| Object | Attribute | Value Range | Default State | Dependency | Failure State |
|---|---|---|---|---|---|
| Evaluation dataset | Field mapping | Input, context, expected answer, labels | Rows are inert until mapped | Evaluator schema | Scores are invalid or missing |
| Groundedness metric | Context support | Score or pass/fail threshold | No release gate | Retrieved context or reference data | Unsupported answers pass review |
| Safety evaluator | Risk category | Hate, violence, self-harm, sexual, custom policy | Not enforced until configured | Safety test set | Unsafe output reaches production |
| Custom metric | Domain-specific rule | Function, rubric, or evaluator prompt | No domain gate | Business acceptance criteria | Correct-looking answer violates domain rule |
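A custom metric of the "function" kind listed in the table can be as small as one rule applied to every response. The sketch below assumes a hypothetical domain rule (each answer must carry a `[doc-N]` citation marker) and a hypothetical `response` field; neither comes from a specific evaluator SDK.

```python
import re

# Hypothetical domain rule: an acceptable answer cites at least one source as [doc-N].
CITATION_PATTERN = re.compile(r"\[doc-\d+\]")


def cites_source(response: str) -> bool:
    """Return True when the response contains at least one citation marker."""
    return CITATION_PATTERN.search(response) is not None


def citation_pass_rate(rows: list[dict]) -> float:
    """Fraction of rows whose 'response' field satisfies the citation rule."""
    if not rows:
        return 0.0
    passed = sum(cites_source(row["response"]) for row in rows)
    return passed / len(rows)


rows = [
    {"response": "Refunds are accepted within 30 days [doc-2]."},
    {"response": "Refunds are accepted within 30 days."},
]
print(citation_pass_rate(rows))  # 0.5
```

Both responses read as correct, but only the first satisfies the domain rule, which is the failure state the table describes for a missing custom metric.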
python evaluate.py --dataset evaluations/app.jsonl --dry-run
Command type: local lab rehearsal script, not an official Azure command.
Reason: Dry-run mapping first because a mislabeled context or expected-answer field makes every score misleading.
Checkpoint: Output shows required evaluator fields mapped.
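What a dry-run mapping check might do can be sketched as follows. The evaluator input names (`query`, `context`, `ground_truth`) and the dataset column names are assumptions for illustration; the real `evaluate.py` lab script and evaluator schema may differ.

```python
import json

# Hypothetical mapping: evaluator input name -> dataset column name.
REQUIRED_MAPPING = {
    "query": "question",
    "context": "context",
    "ground_truth": "expected_answer",
}


def check_mapping(jsonl_text, mapping=REQUIRED_MAPPING):
    """Return (row_number, missing_columns) for every row that cannot be fully mapped."""
    problems = []
    for row_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # Ignore blank lines in the JSONL text.
        row = json.loads(line)
        missing = [
            column
            for column in mapping.values()
            if not (isinstance(row.get(column), str) and row[column].strip())
        ]
        if missing:
            problems.append((row_number, missing))
    return problems


sample = "\n".join([
    json.dumps({"question": "What is the refund window?",
                "context": "Refunds are accepted within 30 days.",
                "expected_answer": "30 days"}),
    json.dumps({"question": "Who approves exceptions?",
                "expected_answer": "The regional manager"}),
])
print(check_mapping(sample))  # [(2, ['context'])]
```

The second row would still produce a groundedness score if it were run, but that score would be computed against no context, which is why the check runs before any scoring.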
python evaluate.py --dataset evaluations/app.jsonl --metrics groundedness relevance coherence fluency
Command type: local lab rehearsal script, not an official Azure command.
Reason: Run quality metrics so the release decision is based on repeatable scores rather than manual impressions.
Checkpoint: Aggregate metric report is produced for the prompt or app version.
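The aggregate report in the checkpoint can be pictured as a per-metric average over row-level scores. This is a minimal sketch; the 1 to 5 score scale and the row layout are assumptions, not the output format of any particular evaluator.

```python
from statistics import mean

METRICS = ("groundedness", "relevance", "coherence", "fluency")


def aggregate(per_row_scores):
    """Average each metric across rows; a row missing a metric is skipped for that metric."""
    report = {}
    for metric in METRICS:
        values = [row[metric] for row in per_row_scores if metric in row]
        report[metric] = round(mean(values), 2) if values else None
    return report


# Hypothetical row-level scores on an assumed 1-5 scale.
per_row_scores = [
    {"groundedness": 5, "relevance": 4, "coherence": 4, "fluency": 5},
    {"groundedness": 2, "relevance": 4, "coherence": 5, "fluency": 5},
]
print(aggregate(per_row_scores))
```

Storing this report per prompt or app version is what makes two versions comparable later.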
python evaluate.py --dataset evaluations/safety.jsonl --metrics safety
Command type: local lab rehearsal script, not an official Azure command.
Reason: Run safety cases separately because functional correctness does not prove safe behavior under adversarial prompts.
Checkpoint: Unsafe cases are flagged and threshold failures return nonzero status.
gh workflow run evaluate-prompts.yml
Command type: GitHub CLI workflow verification.
Reason: Execute evaluation in CI so release approval depends on the same gate every time.
Checkpoint: Workflow summary stores metric artifacts and pass/fail status.
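A release gate of the kind the CI step relies on reduces to comparing the aggregate report against minimum thresholds and turning any failure into a nonzero status. The threshold values and function names below are hypothetical.

```python
def failed_gates(report, thresholds):
    """Return {metric: (score, minimum)} for each metric that is missing or below its minimum."""
    failures = {}
    for metric, minimum in thresholds.items():
        score = report.get(metric)
        if score is None or score < minimum:
            failures[metric] = (score, minimum)
    return failures


def exit_status(report, thresholds):
    """Return 0 when every gate passes and 1 otherwise, so a CI step can fail the run."""
    return 1 if failed_gates(report, thresholds) else 0


thresholds = {"groundedness": 4.0, "relevance": 4.0}  # Hypothetical minimums.
report = {"groundedness": 3.5, "relevance": 4.2}
print(failed_gates(report, thresholds))  # {'groundedness': (3.5, 4.0)}
print(exit_status(report, thresholds))   # 1
```

A script would typically pass the status to `sys.exit` so that the workflow run is marked failed and the release is blocked.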
A user, workflow, or deployment command targets the evaluation dataset and submits configuration to the Azure control plane or a project runtime. Azure validates identity, resource scope, quota, version references, and network reachability because the runtime cannot safely use an object that is not authorized, versioned, reachable, or measurable. The configured object then participates in the runtime path through data mapping, the groundedness metric, and the safety evaluator. This sequence works because each object unlocks the next dependency: identity allows access, versioning allows reproducibility, network resolution allows execution, and telemetry allows verification. When the workload executes, telemetry, status output, logs, API responses, or evaluation metrics prove whether the chain is complete. If the chain breaks, the failure appears as the operational symptom described in the scenario: scores are misleading when question, context, expected answer, and labels are mapped to the wrong evaluator fields. An incorrect configuration leaves the observed failure in place because it changes a nearby object while the actual missing dependency stays unresolved.
| Task | Precise Command or Path | Verification Standard |
|---|---|---|
| Dry-run evaluation mapping | `python evaluate.py --dataset evaluations/app.jsonl --dry-run` | Required fields map to evaluator inputs before scoring. Command type: local lab rehearsal script, not an official Azure command. |
| Review quality metrics | `python evaluate.py --dataset evaluations/app.jsonl --metrics groundedness relevance coherence fluency` | Metric output is produced for the prompt or app version. Command type: local lab rehearsal script, not an official Azure command. |
| Review safety metrics | `python evaluate.py --dataset evaluations/safety.jsonl --metrics safety` | Unsafe cases are flagged and threshold failures are visible. Command type: local lab rehearsal script, not an official Azure command. |
| Inspect CI evaluation result | `gh run view <run-id> --log` | CI output stores metric artifacts and pass/fail status. Command type: GitHub CLI workflow verification. |
Microscopic technical focus: Monitoring latency, throughput, token consumption, resource usage, logs, traces, and production debugging signals.
Beginner explanation: Observability connects one user request to retrieval, prompt creation, model call, tool calls, token cost, and response. HTTP success is only one small signal.
Operational split for this point: start with Trace span, then verify Correlation ID and Latency metric before trusting any production outcome. The exam is testing whether the candidate can locate the missing dependency, not whether the candidate recognizes every service name in the scenario.
For this knowledge point, the target objects are Trace span, Correlation ID, Latency metric, Token metric, Application log, Alert rule. The exam usually describes one broken link in that chain. The correct answer is the option that restores the missing operational dependency rather than the option that only describes the platform at a high level.
Why-layer: A trace span matters on the exam only when the surrounding dependency chain can run. In this topic, production incidents cannot be reconstructed when retrieval, prompt assembly, model call, tool call, and response serialization lack a shared correlation ID. The correct configuration matters because it changes the state that controls observability; a nearby but unrelated action leaves the same failure mode in place.
Practice question: A GenAI app has intermittent poor answers and high latency, but logs only show HTTP 200 responses.
A. Instrument correlation IDs and traces across retrieval, prompt assembly, model calls, tool calls, latency, token usage, and errors.
B. Track only HTTP 200 responses because successful transport proves response quality.
C. Switch model versions before collecting trace evidence.
D. Increase max tokens to troubleshoot latency spikes.
Correct Answer: A
Explanation: A is correct because GenAI incidents span multiple runtime stages. B ignores quality/cost/tool behavior, C changes the system before diagnosis, and D may increase latency and cost.
The common decision point is: A 200 response only proves transport success; quality, cost, trace completeness, and model latency require separate telemetry. Therefore, read every scenario for the actor, the resource scope, the object version, the network path, the metric threshold, and the expected observable result.
| Object | Attribute | Value Range | Default State | Dependency | Failure State |
|---|---|---|---|---|---|
| Correlation ID | Trace linkage | Request-level unique ID | Absent unless generated | App instrumentation | Incident cannot be reconstructed |
| Trace span | Operation boundary | Retrieval, prompt, model, tool, response | No span until instrumented | Telemetry SDK or Microsoft Foundry tracing | Latency source is hidden |
| Token metric | Cost signal | Prompt, completion, total tokens | Untracked by default in app logs | Model response telemetry | Cost spike has no owner |
| Alert rule | Operational threshold | Latency, failure, token, safety, quality | No notification | Metric source and action group | Degradation remains dashboard-only |
curl -H "x-correlation-id: <id>" <app-endpoint>
Command type: network/API rehearsal command for the selected app or Azure OpenAI endpoint; confirm URL, header, and API version before use.
Reason: Send a known correlation ID so the full request path can be searched across app logs, traces, and model telemetry.
Checkpoint: The same ID appears in downstream logs or traces.
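How a single correlation ID can tie the stages of one request together is sketched below with the standard library only. The `RequestTrace` class and the span record layout are hypothetical; a real application would normally use a telemetry SDK rather than an in-memory list.

```python
import time
import uuid
from contextlib import contextmanager


class RequestTrace:
    """Collects timed spans that all share one correlation ID."""

    def __init__(self, correlation_id=None):
        # Reuse the caller's ID (for example from an x-correlation-id header) or create one.
        self.correlation_id = correlation_id or str(uuid.uuid4())
        self.spans = []

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the span even if the stage raised, so failures stay visible.
            self.spans.append({
                "correlation_id": self.correlation_id,
                "span": name,
                "duration_ms": (time.perf_counter() - start) * 1000.0,
            })


trace = RequestTrace(correlation_id="demo-123")
for stage in ("retrieval", "prompt", "model", "tool", "response"):
    with trace.span(stage):
        pass  # Placeholder for the real stage work.

for record in trace.spans:
    print(record["correlation_id"], record["span"])
```

Because every record carries the same ID, a later search by that ID returns the whole request path rather than isolated log lines.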
az monitor log-analytics query --workspace <workspace-id> --analytics-query "AppTraces | where CorrelationId == '<id>'"
Command type: Azure Monitor Logs verification; confirm workspace schema and query table names in the lab environment.
Reason: Query traces by correlation ID because HTTP status alone cannot show where the request spent time.
Checkpoint: Results show retrieval, prompt, model, tool, and response spans.
az monitor metrics list --resource <resource-id> --metrics Latency TotalTokens
Command type: Azure Monitor CLI verification; confirm metric names for the selected Azure resource type.
Reason: Inspect latency and token metrics together because slow or expensive responses can come from prompt size, model latency, or tool calls.
Checkpoint: Metrics show time-series latency and token usage.
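Reading latency and token usage side by side can be sketched on exported request records. The field names (`latency_ms`, `prompt_tokens`, `completion_tokens`) are assumptions, and the percentile uses the simple nearest-rank method rather than whatever aggregation the monitoring backend applies.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(requests):
    """Summarize tail latency and total token consumption for a batch of requests."""
    return {
        "p95_latency_ms": p95([r["latency_ms"] for r in requests]),
        "total_tokens": sum(r["prompt_tokens"] + r["completion_tokens"] for r in requests),
    }


# Hypothetical records; the slow request is also the one with the very large prompt.
requests = [
    {"latency_ms": 100, "prompt_tokens": 200, "completion_tokens": 50},
    {"latency_ms": 120, "prompt_tokens": 300, "completion_tokens": 60},
    {"latency_ms": 900, "prompt_tokens": 4000, "completion_tokens": 80},
    {"latency_ms": 110, "prompt_tokens": 250, "completion_tokens": 40},
]
print(summarize(requests))  # {'p95_latency_ms': 900, 'total_tokens': 4980}
```

In this sample the latency outlier coincides with the largest prompt, which points at prompt size rather than the model or a tool call.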
az monitor metrics alert create --name genai-latency-token-alert --resource-group rg-ai300 --scopes <resource-id> --condition "avg Latency > <threshold>" --action <action-group-id>
Command type: Azure Monitor CLI configuration; confirm the metric name, aggregation, and threshold for the selected Azure resource type.
Reason: Create alerts so degradation is actionable instead of discovered after user complaints.
Checkpoint: Alert rule is enabled and bound to an action group.
A user, workflow, or deployment command targets the trace span and submits configuration to the Azure control plane or a project runtime. Azure validates identity, resource scope, quota, version references, and network reachability because the runtime cannot safely use an object that is not authorized, versioned, reachable, or measurable. The configured object then participates in the runtime path through the correlation ID, latency metric, and token metric. This sequence works because each object unlocks the next dependency: identity allows access, versioning allows reproducibility, network resolution allows execution, and telemetry allows verification. When the workload executes, telemetry, status output, logs, API responses, or evaluation metrics prove whether the chain is complete. If the chain breaks, the failure appears as the operational symptom described in the scenario: production incidents cannot be reconstructed when retrieval, prompt assembly, model call, tool call, and response serialization lack a shared correlation ID. An incorrect configuration leaves the observed failure in place because it changes a nearby object while the actual missing dependency stays unresolved.
| Task | Precise Command or Path | Verification Standard |
|---|---|---|
| Send correlated request | `curl -H "x-correlation-id: <id>" <app-endpoint>` | The same correlation ID appears in downstream traces. Command type: network/API rehearsal command for the selected app or Azure OpenAI endpoint; confirm URL, header, and API version before use. |
| Query traces by correlation ID | `az monitor log-analytics query --workspace <workspace-id> --analytics-query "AppTraces \| where CorrelationId == '<id>'"` | Results show retrieval, prompt, model, tool, and response spans. Command type: Azure Monitor Logs verification; confirm workspace schema and query table names in the lab environment. |
| Review latency and token metrics | `az monitor metrics list --resource <resource-id> --metrics Latency TotalTokens` | Metrics expose latency and token consumption together. Command type: Azure Monitor CLI verification; confirm metric names for the selected Azure resource type. |
| Verify alert rule | `az monitor metrics alert show --name genai-latency-token-alert --resource-group rg-ai300` | Alert rule exists at the intended scope. Command type: Azure Monitor CLI verification; confirm metric names for the selected Azure resource type. |
What should be included in an evaluation dataset for a generative AI application or agent?
Include representative inputs, expected outcomes or grading criteria, data mappings, edge cases, safety cases, and source references when groundedness matters.
Generative AI evaluation requires more than random prompts. The dataset should reflect real user tasks, expected behavior, domain-specific risks, and the data sources the application is supposed to use. Proper data mapping ensures evaluation metrics judge the right fields and outputs. AI-300 tests this because reliable quality assurance starts with evaluation data that represents production behavior.
Demand Score: 91
Exam Relevance Score: 97
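For the dataset question above, one quick check is whether the dataset covers each kind of case at all. The sketch assumes a hypothetical `case_type` label on every row with the values `representative`, `edge`, and `safety`; real datasets may tag cases differently.

```python
from collections import Counter

# Hypothetical labels for the kinds of cases an evaluation dataset should contain.
REQUIRED_CASE_TYPES = {"representative", "edge", "safety"}


def missing_case_types(rows):
    """Return the required case types that have no rows in the dataset."""
    counts = Counter(row.get("case_type") for row in rows)
    return sorted(case_type for case_type in REQUIRED_CASE_TYPES if counts[case_type] == 0)


rows = [
    {"case_type": "representative", "question": "What is the refund window?"},
    {"case_type": "representative", "question": "How do I reset my password?"},
    {"case_type": "edge", "question": "Refund for an order placed 30 days ago at 23:59?"},
]
print(missing_case_types(rows))  # ['safety']
```

A dataset that passes this check is not automatically representative, but one that fails it is missing a whole category of behavior.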
How should groundedness, relevance, coherence, and fluency be used in quality evaluation?
Use them as complementary metrics that measure whether responses are source-supported, on-task, logically consistent, and readable.
Each metric captures a different quality dimension. Groundedness checks whether the answer is supported by source context. Relevance checks whether it addresses the user request. Coherence checks logical consistency, and fluency checks readability. A response can be fluent but ungrounded, or grounded but not relevant. AI-300 scenarios often require choosing evaluation metrics that match the failure mode.
Demand Score: 88
Exam Relevance Score: 95
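The point that a response can be fluent but ungrounded can be illustrated with a deliberately crude lexical proxy: the share of response tokens that also occur in the context. This is a toy under stated assumptions, not how a built-in groundedness evaluator works, and it would misjudge paraphrases.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_support(response, context):
    """Share of response tokens found in the context (toy proxy, not a real groundedness score)."""
    response_tokens = tokens(response)
    if not response_tokens:
        return 0.0
    context_tokens = set(tokens(context))
    supported = sum(token in context_tokens for token in response_tokens)
    return supported / len(response_tokens)


context = "Refunds are accepted within 30 days of purchase."
supported_answer = "Refunds are accepted within 30 days."
fluent_but_unsupported = "Refunds are accepted within 90 days with free shipping."

print(lexical_support(supported_answer, context))        # 1.0
print(lexical_support(fluent_but_unsupported, context))  # lower: "90", "free", "shipping" are unsupported
```

Both answers are equally readable, so a fluency score alone would not separate them; only a check against the source context does.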
When should risk and safety evaluations be added to a generative AI evaluation workflow?
Add them whenever the application may produce harmful, unsafe, policy-violating, or sensitive outputs in production.
Quality metrics alone do not prove that a generative AI system is safe for users. Risk and safety evaluations help detect harmful content, unsafe completions, policy violations, and other output risks before and after deployment. The exam may present a scenario where a model meets relevance goals but creates safety exposure. The correct answer adds automated safety evaluation and monitoring rather than optimizing only for task accuracy.
Demand Score: 89
Exam Relevance Score: 96
Why are automated evaluation workflows important in GenAIOps?
They allow prompt, model, retrieval, and application changes to be tested consistently before promotion.
Manual review cannot scale across frequent GenAI changes. Automated workflows can run built-in and custom metrics against the same datasets, compare variants, and block unsafe or low-quality releases. This is especially important when prompts, retrieval settings, model versions, or agents change in CI/CD. AI-300 emphasizes automation because operational quality must be repeatable, measurable, and connected to deployment decisions.
Demand Score: 86
Exam Relevance Score: 94
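Comparing variants on the same dataset, as described in the answer above, can be reduced to a regression check between two aggregate reports. The metric values and the tolerance are hypothetical.

```python
def regressions(baseline, candidate, tolerance=0.1):
    """Metrics where the candidate is missing or lower than the baseline by more than the tolerance."""
    return {
        metric: (baseline[metric], candidate.get(metric))
        for metric in baseline
        if candidate.get(metric) is None or baseline[metric] - candidate[metric] > tolerance
    }


# Hypothetical aggregate scores for the current prompt and a proposed prompt.
baseline = {"groundedness": 4.4, "relevance": 4.1, "fluency": 4.8}
candidate = {"groundedness": 3.9, "relevance": 4.3, "fluency": 4.75}
print(regressions(baseline, candidate))  # {'groundedness': (4.4, 3.9)}
```

The candidate improves relevance but loses groundedness, which a single overall impression from manual review could easily miss.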
Which observability signals are most useful when a production generative AI application becomes slow or expensive?
Review latency, throughput, response time, token consumption, model calls, retrieval calls, errors, and cost metrics.
Performance and cost issues can come from model selection, prompt length, retrieval depth, tool calls, rate limits, or inefficient agent behavior. Observability should separate latency and cost by component so the team can identify whether the model, retrieval layer, application code, or downstream service is responsible. AI-300 expects candidates to troubleshoot with telemetry instead of making blind model or infrastructure changes.
Demand Score: 93
Exam Relevance Score: 98
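Separating latency by component, as the answer above recommends, can be sketched by summing span durations per component. The span layout (`component`, `duration_ms`) is an assumption about how trace data has been exported.

```python
from collections import defaultdict


def latency_by_component(spans):
    """Total duration in milliseconds for each component."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["component"]] += span["duration_ms"]
    return dict(totals)


def slowest_component(spans):
    """Name of the component with the largest total duration."""
    totals = latency_by_component(spans)
    return max(totals, key=totals.get)


# Hypothetical spans from one slow agent request.
spans = [
    {"component": "retrieval", "duration_ms": 120.0},
    {"component": "model", "duration_ms": 850.0},
    {"component": "tool", "duration_ms": 2300.0},
    {"component": "model", "duration_ms": 640.0},
]
print(latency_by_component(spans))
print(slowest_component(spans))  # tool
```

Here the tool call dominates, so switching to a faster model would not fix the slow request.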
How do logs and traces support troubleshooting for generative AI applications and agents?
They show the request path, prompt version, retrieved context, model calls, tool calls, errors, and timing for each execution.
Agents and RAG applications can fail at several layers. Logs and traces help reconstruct what happened, including which prompt ran, what context was retrieved, which tools were called, and where time or errors occurred. This evidence is needed to distinguish retrieval problems, prompt defects, model behavior, and integration failures. AI-300 scenarios often reward answers that collect detailed debugging evidence before changing the system.
Demand Score: 90
Exam Relevance Score: 97