Designing and implementing GenAIOps infrastructure

Designing and implementing GenAIOps infrastructure Detailed Explanation

Implement Foundry environments and platform configuration

Exam Radar

Official Blueprint Mapping: Create and configure Foundry resources and project environments; Configure identity and access management with managed identities and role-based access control (RBAC); Implement network security and private networking configurations; Deploy infrastructure using Bicep templates and Azure CLI.
Domain Weight: Design and implement a GenAIOps infrastructure represents 20-25% of the official skills measured.
Core Priority: The exam tests whether the candidate can operate this sub-skill as a concrete Azure workflow inside Design and implement a GenAIOps infrastructure, not merely identify the service name.
High Frequency: Expect scenario wording that combines Microsoft Foundry resource, Microsoft Foundry project, Managed identity, and verification evidence from commands, logs, metrics, or portal state.
Confusion Alert: A valid token is insufficient when the client resolves the public endpoint while public network access is disabled.
Scenario Logic: Choose the answer that creates a verifiable dependency chain: configure the object, bind identity or version, run the operation, then prove the resulting state.
Version Delta: AI-300 combines Azure Machine Learning and Microsoft Foundry under AIOps, so this point must be read as an operational platform task rather than a standalone concept.
Failure Trigger: GenAI applications fail before inference when project RBAC, managed identity binding, private endpoint approval, or DNS integration is incomplete.
Operational Dependency: The task depends on Microsoft Foundry resource, Microsoft Foundry project, Managed identity, RBAC assignment, Private endpoint, Bicep deployment and a validation step that proves the configured state is actually usable.
How the Exam Asks It: A GenAI application cannot call a Microsoft Foundry model deployment from a private network even though the application has a valid identity token.
How Distractors Are Designed: Distractors rotate keys, change the prompt, select another model, or add application logging. These do not fix RBAC scope or private endpoint name resolution.
Why the Correct Answer Works: The correct answer configures project/resource access, managed identity permissions, private endpoint approval, and DNS resolution because inference requires both authorization and reachable network path.

Atomic Deconstruction - Operational Level

Microscopic technical focus: Configuring Microsoft Foundry resources, projects, managed identities, RBAC, private networking, Bicep, and Azure CLI deployment.

Beginner explanation: Microsoft Foundry configuration gives GenAI apps a governed project, identity boundary, and network path. A model call must pass both permission and connectivity checks.

Operational split for this point: start with Microsoft Foundry resource, then verify Microsoft Foundry project and Managed identity before trusting any production outcome. The exam is testing whether the candidate can locate the missing dependency, not whether the candidate recognizes every service name in the scenario.

For this knowledge point, the target objects are Microsoft Foundry resource, Microsoft Foundry project, Managed identity, RBAC assignment, Private endpoint, Bicep deployment. The exam usually describes one broken link in that chain. The correct answer is the option that restores the missing operational dependency rather than the option that only describes the platform at a high level.

Why-layer: Microsoft Foundry resource becomes exam-relevant only when the surrounding dependency chain can run. In this topic, GenAI applications fail before inference when project RBAC, managed identity binding, private endpoint approval, or DNS integration is incomplete. The correct configuration matters because it changes the state that controls execution, authorization, resolution, evaluation, or observability; a nearby but unrelated action leaves the same failure mode in place.

Decision tree: if the scenario describes access failure, inspect identity and RBAC before changing compute or code; if it describes unresolved assets, inspect name, version, and scope; if it describes runtime failure, inspect logs, endpoint invocation, metrics, or evaluation output; if it describes quality degradation, inspect data, retrieval, evaluation, and monitoring evidence before changing the model.

Common mistakes: Selecting a familiar Azure service without checking the missing dependency in the scenario. Treating a successful create operation as proof of runtime behavior. Choosing a monitoring action when the scenario asks for configuration or access remediation.

Practice question: A GenAI application cannot call a Microsoft Foundry model deployment from a private network even though the application has a valid identity token.

A. Assign the application identity the required Microsoft Foundry or Azure OpenAI resource role and validate private endpoint DNS resolution from the application network.
B. Rotate the API key because every Microsoft Foundry access issue is a credential leak.
C. Switch to a different foundation model deployment without changing network or RBAC settings.
D. Add more application logging before testing private endpoint connectivity.

Correct Answer: A

Explanation: A is correct because model invocation requires both authorization and network reachability. B, C, and D do not repair the missing RBAC or private DNS dependency.

The common decision point is: A valid token is insufficient when the client resolves the public endpoint while public network access is disabled. Therefore, read every scenario for the actor, the resource scope, the object version, the network path, the metric threshold, and the expected observable result.

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Microsoft Foundry project	Access boundary	User, group, service principal, managed identity	No project operation until role assigned	Microsoft Foundry resource and Entra ID	Deployment or evaluation action is denied
Managed identity	Token use	System-assigned or user-assigned	Not bound to caller by default	Application hosting service	API request receives 401 or 403
Private endpoint	Connection state	Pending, approved, rejected	No private route until approved	VNet and resource networking	Client cannot reach endpoint
Private DNS	Name resolution	Private IP for service FQDN	Public resolution by default	DNS zone link	Requests route to blocked public endpoint

Step-by-Step Execution Path

Execute the operational step.

az role assignment create --assignee <principal-id> --role "Cognitive Services User" --scope <foundry-resource-id>

Command type: Azure CLI RBAC verification for Entra identity and Azure AI resource scope.

Reason: Assign the caller identity at the resource scope because token possession alone does not grant model invocation rights.

Checkpoint: Role assignment list shows Cognitive Services User for the managed identity.

Execute the operational step.

az network private-endpoint-connection list --id <foundry-resource-id>

Command type: Azure CLI network verification for the Microsoft Foundry or Azure OpenAI resource; confirm the exact resource ID from the active environment.

Reason: Check private endpoint approval because a pending connection does not create a usable private route; confirm the exact resource ID from Microsoft Foundry management center or Azure resource properties.

Checkpoint: Connection state is Approved.

Execute the operational step.

nslookup <resource-name>.openai.azure.com

Command type: network/DNS rehearsal command for private endpoint validation.

Reason: Validate DNS from the application network because private endpoint traffic depends on resolving the public FQDN to a private IP.

Checkpoint: Name resolves to a private address.

Execute the operational step.

curl -H "Authorization: Bearer <token>" https://<resource-name>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=<version>

Command type: network/API rehearsal command for the selected app or Azure OpenAI endpoint; confirm URL, header, and API version before use.

Reason: Call the endpoint after RBAC and DNS checks to prove inference works from the intended network.

Checkpoint: API returns a model response rather than 401, 403, or network timeout.

Technical Chain

A user, workflow, or deployment command targets Microsoft Foundry resource and submits configuration to Azure control plane or a project runtime. Azure validates identity, resource scope, quota, version references, and network reachability because the runtime cannot safely use an object that is not authorized, versioned, reachable, or measurable. The configured object then participates in the runtime path through Microsoft Foundry project, Managed identity, RBAC assignment. This sequence works because each object unlocks the next dependency: identity allows access, versioning allows reproducibility, network resolution allows execution, and telemetry allows verification. When the workload executes, telemetry, status output, logs, API response, or evaluation metrics prove whether the chain is complete. If the chain breaks, the failure appears as the operational symptom described in the scenario: GenAI applications fail before inference when project RBAC, managed identity binding, private endpoint approval, or DNS integration is incomplete. An incorrect configuration creates the observed failure because it changes a nearby object while leaving the actual missing dependency unresolved.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Verify caller authorization	`az role assignment list --assignee <principal-id> --scope <foundry-resource-id> -o table`	The application identity has the required Microsoft Foundry or Azure OpenAI resource role at the expected scope. Command type: Azure CLI RBAC verification for Entra identity and Azure AI resource scope.
Check private endpoint approval	`az network private-endpoint-connection list --id <foundry-resource-id>`	The private endpoint connection is Approved before private traffic is expected to work. Command type: Azure CLI network verification for the Microsoft Foundry or Azure OpenAI resource; confirm the exact resource ID from the active environment.
Validate private DNS from workload network	`nslookup <resource-name>.openai.azure.com`	The service FQDN resolves to a private address from the application network. Command type: network/DNS rehearsal command for private endpoint validation.
Prove endpoint reachability	`curl -H "Authorization: Bearer <token>" https://<resource-name>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=<version>`	The call returns a model response instead of authorization or network errors. Command type: network/API rehearsal command for the selected app or Azure OpenAI endpoint; confirm URL, header, and API version before use.

Deploy and manage foundation models for production workloads

Exam Radar

Official Blueprint Mapping: Deploy foundation models by using serverless API endpoints and managed compute options; Select appropriate models for specific use cases; Implement model versioning and production deployment strategies; Configure provisioned throughput units for high-volume workloads.
Domain Weight: Design and implement a GenAIOps infrastructure represents 20-25% of the official skills measured.
Core Priority: The exam tests whether the candidate can operate this sub-skill as a concrete Azure workflow inside Design and implement a GenAIOps infrastructure, not merely identify the service name.
High Frequency: Expect scenario wording that combines Foundation model, Model deployment, Serverless API endpoint, and verification evidence from commands, logs, metrics, or portal state.
Confusion Alert: Model selection must balance latency, context window, modality, region, cost, and throughput instead of choosing the largest model by default.
Scenario Logic: Choose the answer that creates a verifiable dependency chain: configure the object, bind identity or version, run the operation, then prove the resulting state.
Version Delta: AI-300 combines Azure Machine Learning and Microsoft Foundry under AIOps, so this point must be read as an operational platform task rather than a standalone concept.
Failure Trigger: High-volume workloads receive throttling or unstable latency when provisioned throughput and quota are not planned.
Operational Dependency: The task depends on Foundation model, Model deployment, Serverless API endpoint, Managed compute option, Provisioned throughput unit, Model version and a validation step that proves the configured state is actually usable.
How the Exam Asks It: A workload needs predictable latency and capacity for a selected foundation model, and the team must decide between available deployment and throughput options.
How Distractors Are Designed: Distractors choose the largest model, ignore region availability, increase prompt length, or monitor only total calls. These do not reserve capacity or match model constraints.
Why the Correct Answer Works: The correct answer selects a supported model/version/region and configures the appropriate deployment capacity or provisioned throughput based on latency, volume, quota, and cost constraints.

Atomic Deconstruction - Operational Level

Microscopic technical focus: Selecting models, deploying serverless endpoints, managing versions, and configuring provisioned throughput.

Beginner explanation: A foundation model deployment is not just model selection. It is a capacity, version, region, quota, and endpoint decision.

Operational split for this point: start with Foundation model, then verify Model deployment and Serverless API endpoint before trusting any production outcome. The exam is testing whether the candidate can locate the missing dependency, not whether the candidate recognizes every service name in the scenario.

For this knowledge point, the target objects are Foundation model, Model deployment, Serverless API endpoint, Managed compute option, Provisioned throughput unit, Model version. The exam usually describes one broken link in that chain. The correct answer is the option that restores the missing operational dependency rather than the option that only describes the platform at a high level.

Why-layer: Foundation model becomes exam-relevant only when the surrounding dependency chain can run. In this topic, High-volume workloads receive throttling or unstable latency when provisioned throughput and quota are not planned. The correct configuration matters because it changes the state that controls execution, authorization, resolution, evaluation, or observability; a nearby but unrelated action leaves the same failure mode in place.

Practice question: A workload needs predictable latency and capacity for a selected foundation model, and the team must decide between available deployment and throughput options.

A. Select a supported model/version/region and configure deployment capacity or provisioned throughput based on latency and volume requirements.
B. Choose the largest available model because larger models always reduce production latency.
C. Increase max output tokens to prevent deployment throttling.
D. Monitor total calls only after users begin receiving 429 responses.

Correct Answer: A

Explanation: A is correct because production deployment is a model, region, quota, and capacity decision. B can increase latency/cost, C affects generation length, and D detects throttling after capacity planning failed.

The common decision point is: Model selection must balance latency, context window, modality, region, cost, and throughput instead of choosing the largest model by default. Therefore, read every scenario for the actor, the resource scope, the object version, the network path, the metric threshold, and the expected observable result.

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Model deployment	Model and version	Supported model catalog entry and version	No endpoint until deployed	Region availability and quota	Application references deployment that does not exist
Serverless endpoint	Capacity mode	Provider-managed throughput	No reserved isolation	Model support and endpoint access	Variable latency under load
Provisioned throughput	Reserved capacity	Throughput unit count	Unavailable until quota approved	Supported model and region	429 throttling or missed latency target
Deployment metric	Operational signal	Total calls, latency, throttling, token usage	Unobserved until monitored	Azure Monitor metric stream	Capacity issue is misdiagnosed

Step-by-Step Execution Path

Execute the operational step.

az cognitiveservices account deployment list -g rg-ai300-genai -n <account>

Command type: Azure CLI verification for Azure OpenAI/Cognitive Services deployment state; confirm current parameters, region support, and model availability.

Reason: List deployments first so the application uses a deployment name that actually exists in the target account.

Checkpoint: The intended deployment appears with succeeded state.

Execute the operational step.

az cognitiveservices account show-usage -g rg-ai300-genai -n <account>

Command type: Azure CLI verification for Azure OpenAI/Cognitive Services deployment state; confirm current parameters, region support, and model availability.

Reason: Check quota before deployment because capacity failures are quota constraints, not prompt or application bugs.

Checkpoint: Usage output shows available quota for the selected model family.

Execute the operational step.

az cognitiveservices account deployment create --name <account> --resource-group rg-ai300-genai --deployment-name gpt-prod --model-name <model> --model-version <version> --model-format OpenAI --sku-name Standard --sku-capacity 30

Command type: Azure CLI verification for Azure OpenAI/Cognitive Services deployment state; confirm current parameters, region support, and model availability.

Reason: Create the deployment with explicit model and capacity so runtime calls target a stable endpoint configuration.

Checkpoint: Deployment provisioning state is Succeeded.

Execute the operational step.

az monitor metrics list --resource <deployment-resource-id> --metric TotalCalls,ThrottledCalls,Latency

Command type: Azure Monitor CLI verification; confirm metric names for the selected Azure resource type.

Reason: Monitor deployment metrics because successful creation does not prove capacity is adequate under production traffic.

Checkpoint: Metrics show call volume, throttling, and latency by time window.

Technical Chain

A user, workflow, or deployment command targets Foundation model and submits configuration to Azure control plane or a project runtime. Azure validates identity, resource scope, quota, version references, and network reachability because the runtime cannot safely use an object that is not authorized, versioned, reachable, or measurable. The configured object then participates in the runtime path through Model deployment, Serverless API endpoint, Managed compute option. This sequence works because each object unlocks the next dependency: identity allows access, versioning allows reproducibility, network resolution allows execution, and telemetry allows verification. When the workload executes, telemetry, status output, logs, API response, or evaluation metrics prove whether the chain is complete. If the chain breaks, the failure appears as the operational symptom described in the scenario: High-volume workloads receive throttling or unstable latency when provisioned throughput and quota are not planned. An incorrect configuration creates the observed failure because it changes a nearby object while leaving the actual missing dependency unresolved.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
List model deployments	`az cognitiveservices account deployment list -g rg-ai300-genai -n <account>`	The intended deployment name appears in the target Azure OpenAI account. Command type: Azure CLI verification for Azure OpenAI/Cognitive Services deployment state; confirm current parameters, region support, and model availability.
Inspect account quota	`az cognitiveservices account show-usage -g rg-ai300-genai -n <account>`	Usage and limit values show whether requested capacity is available. Command type: Azure CLI verification for Azure OpenAI/Cognitive Services deployment state; confirm current parameters, region support, and model availability.
Verify deployment state	`az cognitiveservices account deployment show --name <account> --resource-group rg-ai300-genai --deployment-name gpt-prod`	Provisioning state and model/version match the release plan. Command type: Azure CLI verification for Azure OpenAI/Cognitive Services deployment state; confirm current parameters, region support, and model availability.
Review serving metrics	`az monitor metrics list --resource <deployment-resource-id> --metric TotalCalls,ThrottledCalls,Latency`	Metrics expose traffic, throttling, and latency after deployment. Command type: Azure Monitor CLI verification; confirm metric names for the selected Azure resource type.

Implement prompt versioning and management with source control

Exam Radar

Official Blueprint Mapping: Design and develop prompts; Create prompt variants and compare performance across different prompts; Implement version control for prompts by using Git repositories.
Domain Weight: Design and implement a GenAIOps infrastructure represents 20-25% of the official skills measured.
Core Priority: The exam tests whether the candidate can operate this sub-skill as a concrete Azure workflow inside Design and implement a GenAIOps infrastructure, not merely identify the service name.
High Frequency: Expect scenario wording that combines Prompt file, Prompt variant, Evaluation dataset, and verification evidence from commands, logs, metrics, or portal state.
Confusion Alert: Editing a production prompt in a UI gives speed but loses repeatable comparison, review, rollback, and release traceability.
Scenario Logic: Choose the answer that creates a verifiable dependency chain: configure the object, bind identity or version, run the operation, then prove the resulting state.
Version Delta: AI-300 combines Azure Machine Learning and Microsoft Foundry under AIOps, so this point must be read as an operational platform task rather than a standalone concept.
Failure Trigger: A prompt regression cannot be diagnosed when the active prompt text, model deployment, dataset, and metric output are not tied to a commit.
Operational Dependency: The task depends on Prompt file, Prompt variant, Evaluation dataset, Git branch, Pull request, Release tag and a validation step that proves the configured state is actually usable.
How the Exam Asks It: A prompt change improves one test case but breaks production behavior, and the team needs review, evaluation evidence, and rollback.
How Distractors Are Designed: Distractors edit the prompt directly in a portal, rename the prompt file, or rely on chat history screenshots. These do not provide release traceability.
Why the Correct Answer Works: The correct answer stores prompt variants, evaluation datasets, and metric outputs in source control with pull-request review because prompt changes are deployable artifacts.

Atomic Deconstruction - Operational Level

Microscopic technical focus: Designing prompt variants, comparing prompt performance, and controlling releases through Git repositories.

Beginner explanation: A prompt is production logic. Treat prompt text like code: version it, review it, evaluate it, and keep a rollback point.

Operational split for this point: start with Prompt file, then verify Prompt variant and Evaluation dataset before trusting any production outcome. The exam is testing whether the candidate can locate the missing dependency, not whether the candidate recognizes every service name in the scenario.

For this knowledge point, the target objects are Prompt file, Prompt variant, Evaluation dataset, Git branch, Pull request, Release tag. The exam usually describes one broken link in that chain. The correct answer is the option that restores the missing operational dependency rather than the option that only describes the platform at a high level.

Why-layer: Prompt file becomes exam-relevant only when the surrounding dependency chain can run. In this topic, A prompt regression cannot be diagnosed when the active prompt text, model deployment, dataset, and metric output are not tied to a commit. The correct configuration matters because it changes the state that controls execution, authorization, resolution, evaluation, or observability; a nearby but unrelated action leaves the same failure mode in place.

Practice question: A prompt change improves one test case but breaks production behavior, and the team needs review, evaluation evidence, and rollback.

A. Commit the prompt variant and evaluation dataset to Git, run evaluation in CI, and tag the approved prompt version.
B. Edit the production prompt directly in the portal and document the change in chat history.
C. Rename the prompt file so reviewers can distinguish it from the old version.
D. Rely on a few manual chat transcripts as release evidence.

Correct Answer: A

Explanation: A is correct because prompt changes need review, evaluation evidence, and rollback. B bypasses source control, C is naming not governance, and D is anecdotal testing.

The common decision point is: Editing a production prompt in a UI gives speed but loses repeatable comparison, review, rollback, and release traceability. Therefore, read every scenario for the actor, the resource scope, the object version, the network path, the metric threshold, and the expected observable result.

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Prompt file	Version source	Git commit, branch, tag	Uncontrolled text until committed	Repository and review process	Active prompt cannot be traced
Evaluation dataset	Regression cases	JSONL or tabular test cases	Not linked until committed	Prompt evaluation workflow	Prompt passes anecdotal testing only
Pull request	Review gate	Approvals, checks, comments	No gate on direct edit	Branch protection and CI	Regression enters production
Release tag	Rollback pointer	Semantic tag or commit SHA	No rollback marker	Deployment process	Previous stable prompt cannot be restored

Step-by-Step Execution Path

Execute the operational step.

git checkout -b prompt/claims-routing-v2

Command type: Git source-control verification.

Reason: Use a branch so prompt experimentation is isolated from the production prompt path.

Checkpoint: Branch name appears in git status.

Execute the operational step.

git add prompts/claims-routing.prompt.yml evaluations/claims-routing.dataset.jsonl

Command type: Git source-control verification.

Reason: Commit prompt and evaluation data together because a prompt version without its test evidence is not release-ready.

Checkpoint: Git status shows both files staged.

Execute the operational step.

gh workflow run evaluate-prompts.yml --ref prompt/claims-routing-v2

Command type: GitHub CLI workflow verification.

Reason: Run evaluation before merge so quality and safety regressions block the prompt release.

Checkpoint: Workflow result contains metric output for the branch.

Execute the operational step.

git tag prompt-claims-routing-v2-approved

Command type: Git source-control verification.

Reason: Tag the approved prompt commit so rollback can target a known stable version.

Checkpoint: Git log and tag list show the approved release point.

Technical Chain

A user, workflow, or deployment command targets Prompt file and submits configuration to Azure control plane or a project runtime. Azure validates identity, resource scope, quota, version references, and network reachability because the runtime cannot safely use an object that is not authorized, versioned, reachable, or measurable. The configured object then participates in the runtime path through Prompt variant, Evaluation dataset, Git branch. This sequence works because each object unlocks the next dependency: identity allows access, versioning allows reproducibility, network resolution allows execution, and telemetry allows verification. When the workload executes, telemetry, status output, logs, API response, or evaluation metrics prove whether the chain is complete. If the chain breaks, the failure appears as the operational symptom described in the scenario: A prompt regression cannot be diagnosed when the active prompt text, model deployment, dataset, and metric output are not tied to a commit. An incorrect configuration creates the observed failure because it changes a nearby object while leaving the actual missing dependency unresolved.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Check active branch	`git status --short --branch`	Prompt work is isolated on the expected feature branch. Command type: Git source-control verification.
Verify prompt and evaluation files	`git diff --name-only --cached`	Prompt and evaluation dataset changes are staged together. Command type: Git source-control verification.
Inspect evaluation workflow result	`gh run view <run-id> --log`	Evaluation workflow logs include metric output for the prompt change. Command type: GitHub CLI workflow verification.
Confirm release tag	`git tag --list prompt-claims-routing-v2-approved`	The approved prompt version has a rollback marker. Command type: Git source-control verification.

Shopping cart

Subtotal:

AI-300 Designing and implementing GenAIOps infrastructure

Detailed list of AI-300 knowledge points