Designing and implementing MLOps infrastructure


Create and manage resources in an Azure Machine Learning workspace

Exam Radar

Official Blueprint Mapping: Create and manage a workspace; Create and manage datastores; Create and manage compute targets; Configure identity and access management for workspaces.
Domain Weight: Design and implement an MLOps infrastructure represents 15-20% of the official skills measured.
Core Priority: The exam tests whether the candidate can carry out this sub-skill as a concrete Azure workflow within the "Design and implement an MLOps infrastructure" domain, not merely identify the service name.
High Frequency: Expect scenario wording that combines Machine Learning workspace, Workspace managed identity, Datastore, and verification evidence from commands, logs, metrics, or portal state.
Confusion Alert: A contributor can create workspace objects but still fail data reads when the job identity lacks storage data-plane access.
Scenario Logic: Choose the answer that creates a verifiable dependency chain: configure the object, bind identity or version, run the operation, then prove the resulting state.
Version Delta: AI-300 combines Azure Machine Learning and Microsoft Foundry under AIOps, so this point must be read as an operational platform task rather than a standalone concept.
Failure Trigger: Training jobs queue or fail when compute quota, datastore identity, private DNS, or storage firewall dependencies are missing.
Operational Dependency: The task depends on Machine Learning workspace, Workspace managed identity, Datastore, Compute cluster, Storage account, Private endpoint and a validation step that proves the configured state is actually usable.
How the Exam Asks It: A company creates an Azure Machine Learning workspace, but training jobs cannot mount data from a locked-down storage account. The question asks which workspace dependency or identity permission must be configured before jobs can run.
How Distractors Are Designed: Distractors point to workspace Contributor permissions, increasing compute size, enabling Application Insights, or changing the experiment name. These are plausible Azure ML actions, but none grants the job identity data-plane access through the storage boundary.
Why the Correct Answer Works: The correct answer binds the workspace or compute managed identity to the storage account with the required Storage Blob Data role and validates datastore access, because the runtime needs data-plane authorization after the workspace exists.

Atomic Deconstruction - Operational Level

Microscopic technical focus: Provisioning workspace dependencies, datastore bindings, compute targets, and IAM boundaries.

Beginner explanation: Think of the workspace as the operations control room. It records jobs and assets, but the actual data, secrets, images, and compute capacity live in dependent Azure resources that must be permissioned separately.

Operational split for this point: start with Machine Learning workspace, then verify Workspace managed identity and Datastore before trusting any production outcome. The exam is testing whether the candidate can locate the missing dependency, not whether the candidate recognizes every service name in the scenario.

For this knowledge point, the target objects are Machine Learning workspace, Workspace managed identity, Datastore, Compute cluster, Storage account, Private endpoint. The exam usually describes one broken link in that chain. The correct answer is the option that restores the missing operational dependency rather than the option that only describes the platform at a high level.

Why-layer: The Machine Learning workspace becomes exam-relevant only when the surrounding dependency chain can run. In this topic, training jobs queue or fail when compute quota, datastore identity, private DNS, or storage firewall dependencies are missing. The correct configuration matters because it changes the state that controls execution, authorization, resolution, evaluation, or observability; a nearby but unrelated action leaves the same failure mode in place.

Decision tree: if the scenario describes access failure, inspect identity and RBAC before changing compute or code; if it describes unresolved assets, inspect name, version, and scope; if it describes runtime failure, inspect logs, endpoint invocation, metrics, or evaluation output; if it describes quality degradation, inspect data, retrieval, evaluation, and monitoring evidence before changing the model.
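The decision tree above can be sketched as a small lookup. This is only an illustration: the function name `first_check` and the four category keywords are hypothetical labels chosen for this sketch, not exam or Azure terminology.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a symptom category to the first thing to inspect,
# following the order given in the decision tree.
first_check() {
  case "$1" in
    access)     echo "identity and RBAC" ;;
    unresolved) echo "asset name, version, and scope" ;;
    runtime)    echo "logs, endpoint invocation, metrics, or evaluation output" ;;
    quality)    echo "data, retrieval, evaluation, and monitoring evidence" ;;
    *)          echo "unknown symptom category: $1" >&2; return 1 ;;
  esac
}
```

For the storage mount failure in this topic, `first_check access` points at identity and RBAC, which is why compute size and experiment names are distractors.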

Common mistakes: Confusing workspace Contributor with Storage Blob Data Contributor. Creating a datastore before verifying that the job identity can read the storage container. Forgetting private DNS when public access is disabled.

Practice question: A company creates an Azure Machine Learning workspace, but training jobs cannot mount data from a locked-down storage account. Which workspace dependency or identity permission must be configured before jobs can run?

A. Assign Storage Blob Data Contributor to the workspace or compute managed identity on the storage scope, then validate the datastore with a smoke-test job.
B. Assign Contributor on the Azure Machine Learning workspace to the data scientist group.
C. Increase the compute cluster maximum node count to provide more training capacity.
D. Enable Application Insights on the workspace and review request telemetry.

Correct Answer: A

Explanation: A is correct because the failure is data-plane storage authorization during job execution. B affects management-plane workspace operations, C affects capacity, and D improves observability but does not grant blob access.

The common decision point is: A contributor can create workspace objects but still fail data reads when the job identity lacks storage data-plane access. Therefore, read every scenario for the actor, the resource scope, the object version, the network path, the metric threshold, and the expected observable result.

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Machine Learning workspace	Managed identity	System-assigned or user-assigned	System-assigned identity created with the workspace unless a user-assigned identity is specified	Entra ID principal creation	Jobs cannot authenticate to protected datastores
Datastore	Credential mode	Account key, SAS, service principal, user identity, managed identity	Metadata only until credential or identity path works	Storage account, container, RBAC, firewall	Mount or download step fails with authorization error
Compute cluster	Node range	min=0 or greater, max within quota	Scaled down when idle	Regional VM quota and workspace	Job remains queued or provisioning fails
Private endpoint	DNS resolution	Approved private IP with linked private DNS zone	Public endpoint used unless restricted	VNet, subnet, DNS zone, firewall	Studio or job traffic cannot reach workspace or storage

Step-by-Step Execution Path

Step 1: Read the workspace managed identity.

az ml workspace show -g rg-ai300 -n mlw-ai300-dev --query identity.principalId -o tsv

Command type: Azure ML CLI read operation (`ml` extension); confirm the extension version in the lab environment.

Reason: Read the workspace principal ID first because role assignment cannot target an identity that has not materialized in Entra ID.

Checkpoint: A non-empty principal ID is returned.
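A minimal sketch of this checkpoint, assuming a Bash shell with the Azure CLI `ml` extension installed. The function name `get_workspace_principal` is hypothetical; it wraps the same command and fails early when no principal ID comes back.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the workspace principal ID, or fail if it is empty.
get_workspace_principal() {
  local rg="$1" ws="$2" pid
  pid="$(az ml workspace show -g "$rg" -n "$ws" --query identity.principalId -o tsv)"
  if [ -z "$pid" ]; then
    echo "workspace $ws returned no managed identity principal ID" >&2
    return 1
  fi
  echo "$pid"
}
```

A call such as `get_workspace_principal rg-ai300 mlw-ai300-dev` would then feed the role assignment step.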

Step 2: Grant the identity data-plane access to storage.

az role assignment create --assignee <principal-id> --role "Storage Blob Data Contributor" --scope <storage-id>

Command type: Azure CLI RBAC assignment; the assignee is the workspace's Entra ID principal and the scope is the storage account or container.

Reason: Assign data-plane storage access because workspace Contributor does not authorize blob reads inside the training container.

Checkpoint: Role assignment list shows Storage Blob Data Contributor at the storage account or container scope.
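One way to fill in the `<principal-id>` and `<storage-id>` placeholders is sketched here, under assumptions. The function name `grant_blob_data_role` and its arguments are hypothetical. The sketch passes `--assignee-object-id` with `--assignee-principal-type ServicePrincipal` instead of `--assignee`, which avoids a directory lookup for the managed identity; that is a choice made for this example, not a requirement of the step.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: resolve the storage account resource ID, then assign
# the data-plane role to the given principal at that scope.
grant_blob_data_role() {
  local principal_id="$1" storage_rg="$2" storage_account="$3" storage_id
  storage_id="$(az storage account show -g "$storage_rg" -n "$storage_account" \
    --query id -o tsv)" || return 1
  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "Storage Blob Data Contributor" \
    --scope "$storage_id"
}
```

If the job only needs to read data, a narrower role such as Storage Blob Data Reader may be enough; the guide uses Contributor throughout, so the sketch keeps it.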

Step 3: Register the datastore.

az ml datastore create --file datastore.yml -g rg-ai300 -w mlw-ai300-dev

Command type: Azure ML CLI create operation (`ml` extension); confirm the extension version in the lab environment.

Reason: Register the datastore after RBAC is available so the datastore reference resolves to a storage path the job identity can actually use.

Checkpoint: Datastore show output contains the expected account, container, and identity-based access mode.
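The contents of `datastore.yml` are not shown in this guide. A minimal sketch of an identity-based blob datastore definition might look like the following. The storage account and container names are hypothetical, and the sketch assumes that leaving out the `credentials` block is what selects identity-based access.

```shell
#!/usr/bin/env bash
# Write a hypothetical datastore definition; the quoted heredoc keeps $schema literal.
cat > datastore.yml <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: trainingdata
type: azure_blob
description: Identity-based blob datastore for training data
account_name: stai300dev        # hypothetical storage account name
container_name: training-data   # hypothetical container name
EOF
```

The datastore name `trainingdata` matches the one inspected with `az ml datastore show` in the skills matrix.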

Step 4: Submit a smoke-test job.

az ml job create --file train-smoke-test.yml -g rg-ai300 -w mlw-ai300-dev

Command type: Azure ML CLI create operation (`ml` extension); confirm the extension version in the lab environment.

Reason: Submit a small smoke-test job because datastore creation alone does not prove runtime mount behavior.

Checkpoint: The job reaches Completed and logs show successful data access.
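A smoke-test job only has to mount the datastore and list it. The following is a sketch of what `train-smoke-test.yml` could contain, not the guide's actual file: the compute name `cpu-cluster`, the `smoke/` folder, and the container image are all assumptions.

```shell
#!/usr/bin/env bash
# Write a hypothetical command job that mounts the datastore read-only and lists it.
cat > train-smoke-test.yml <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
display_name: datastore-smoke-test
command: ls -R ${{inputs.training_data}}
inputs:
  training_data:
    type: uri_folder
    path: azureml://datastores/trainingdata/paths/smoke/   # hypothetical folder
    mode: ro_mount
environment:
  image: library/python:latest   # assumed small public image
compute: azureml:cpu-cluster      # hypothetical compute cluster name
EOF
```

If the identity lacks data-plane access, a job like this would be expected to fail at the mount step rather than inside training code, which keeps the failure easy to attribute.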

Technical Chain

A user, workflow, or deployment command targets the Machine Learning workspace and submits configuration to the Azure control plane or a project runtime. Azure validates identity, resource scope, quota, version references, and network reachability because the runtime cannot safely use an object that is not authorized, versioned, reachable, or measurable. The configured object then participates in the runtime path through the workspace managed identity, datastore, and compute cluster. This sequence works because each object unlocks the next dependency: identity allows access, versioning allows reproducibility, network resolution allows execution, and telemetry allows verification. When the workload executes, telemetry, status output, logs, API responses, or evaluation metrics prove whether the chain is complete. If the chain breaks, the failure appears as the operational symptom described in the scenario: training jobs queue or fail when compute quota, datastore identity, private DNS, or storage firewall dependencies are missing. An incorrect configuration leaves the observed failure in place because it changes a nearby object while the actual missing dependency stays unresolved.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Confirm workspace identity	`az ml workspace show -g rg-ai300 -n mlw-ai300-dev --query identity.principalId -o tsv`	A principal ID is returned. Command type: Azure ML CLI verification; confirm extension/version in the lab environment.
Verify storage data-plane role	`az role assignment list --assignee <principal-id> --scope <storage-id> -o table`	Storage Blob Data Contributor appears at the expected scope; add `--include-inherited` if the role was granted at a parent scope. Command type: Azure CLI RBAC verification for the Entra ID principal at the storage scope.
Inspect datastore binding	`az ml datastore show -n trainingdata -g rg-ai300 -w mlw-ai300-dev`	Account, container, and credential mode match the intended design. Command type: Azure ML CLI verification; confirm extension/version in the lab environment.
Prove runtime data access	`az ml job show --name <smoke-test-job> -g rg-ai300 -w mlw-ai300-dev --query status`	Smoke-test job reaches Completed after reading the datastore. Command type: Azure ML CLI verification; confirm extension/version in the lab environment.
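The last row of the matrix reads the job status once. A small polling loop, sketched here under assumptions, repeats that command until the job reaches a terminal state. The function name `wait_for_job` and the 30-second default interval are hypothetical; the terminal states checked are Completed, Failed, and Canceled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll the job status until it is terminal.
# Returns 0 on Completed, 1 on Failed or Canceled, 2 if the CLI call itself fails.
wait_for_job() {
  local job="$1" rg="$2" ws="$3" interval="${4:-30}" status
  while true; do
    status="$(az ml job show --name "$job" -g "$rg" -w "$ws" --query status -o tsv)" || return 2
    case "$status" in
      Completed)       echo "$status"; return 0 ;;
      Failed|Canceled) echo "$status"; return 1 ;;
      *)               sleep "$interval" ;;
    esac
  done
}
```

A Completed status alone does not show that data was read, so the job logs still need a look for the listing output.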

Create and manage assets in an Azure Machine Learning workspace

Exam Radar

Official Blueprint Mapping: Create and manage data assets; Create and manage environments; Create and manage components; Share assets across workspaces by using registries.
Domain Weight: Design and implement an MLOps infrastructure represents 15-20% of the official skills measured.
Core Priority: The exam tests whether the candidate can carry out this sub-skill as a concrete Azure workflow within the "Design and implement an MLOps infrastructure" domain, not merely identify the service name.
High Frequency: Expect scenario wording that combines Data asset, Environment, Component, and verification evidence from commands, logs, metrics, or portal state.
Confusion Alert: A local workspace component cannot be reused in another workspace until it is published or referenced through a registry.
Scenario Logic: Choose the answer that creates a verifiable dependency chain: configure the object, bind identity or version, run the operation, then prove the resulting state.
Version Delta: AI-300 combines Azure Machine Learning and Microsoft Foundry under AIOps, so this point must be read as an operational platform task rather than a standalone concept.
Failure Trigger: Pipelines fail when an asset version is omitted, archived, or resolved from the wrong workspace scope.
Operational Dependency: The task depends on Data asset, Environment, Component, Registry, Model asset, Workspace asset reference and a validation step that proves the configured state is actually usable.
How the Exam Asks It: A pipeline in a production workspace cannot reuse a component or environment created in a development workspace. The question asks how to make the asset version discoverable and reproducible across workspaces.
How Distractors Are Designed: Distractors suggest copying source files manually, renaming the component, editing the pipeline display name, or giving users workspace Reader. These do not create a versioned asset contract that another workspace can resolve.
Why the Correct Answer Works: The correct answer registers or publishes the asset with an explicit version, preferably through a registry for cross-workspace reuse, because pipeline resolution depends on asset name, version, and scope.

Atomic Deconstruction - Operational Level

Microscopic technical focus: Versioning data assets, environments, components, and registry-shared artifacts.

Beginner explanation: An asset is a versioned contract. A file path, Conda file, or script folder becomes exam-relevant only when Azure ML can resolve it by name, version, and scope.
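To make "name, version, and scope" concrete, the sketch below builds the two reference strings commonly used in Azure ML YAML: a workspace-scoped reference and a registry-scoped one. The function name `asset_ref` and the sample asset name `train_model` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build an asset reference from kind, name, version, and
# an optional registry. kind is a plural such as components or environments.
asset_ref() {
  local kind="$1" name="$2" version="$3" registry="${4:-}"
  if [ -z "$version" ]; then
    echo "an explicit version is required" >&2
    return 1
  fi
  if [ -n "$registry" ]; then
    echo "azureml://registries/${registry}/${kind}/${name}/versions/${version}"
  else
    echo "azureml:${name}:${version}"
  fi
}
```

For example, `asset_ref components train_model 1 ai300registry` yields `azureml://registries/ai300registry/components/train_model/versions/1`, while omitting the registry yields the workspace-scoped `azureml:train_model:1`.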

Operational split for this point: start with Data asset, then verify Environment and Component before trusting any production outcome. The exam is testing whether the candidate can locate the missing dependency, not whether the candidate recognizes every service name in the scenario.

For this knowledge point, the target objects are Data asset, Environment, Component, Registry, Model asset, Workspace asset reference. The exam usually describes one broken link in that chain. The correct answer is the option that restores the missing operational dependency rather than the option that only describes the platform at a high level.

Why-layer: A data asset becomes exam-relevant only when the surrounding dependency chain can run. In this topic, pipelines fail when an asset version is omitted, archived, or resolved from the wrong workspace scope. The correct configuration matters because it changes the state that controls execution, authorization, resolution, evaluation, or observability; a nearby but unrelated action leaves the same failure mode in place.

Common mistakes: Omitting the asset version in a pipeline reference. Copying component YAML between repositories instead of publishing a versioned asset to a registry. Treating a successful create operation as proof that the consuming workspace can resolve the asset.

Practice question: A pipeline in a production workspace cannot reuse a component or environment created in a development workspace. How should the team make the asset version discoverable and reproducible across workspaces?

A. Publish the component or environment as a versioned asset in an Azure ML registry and reference that registry version from the production pipeline.
B. Copy the component YAML file into the production repository and keep the same display name.
C. Give the production workspace Reader access to the development workspace.
D. Rename the pipeline job so it matches the development job name.

Correct Answer: A

Explanation: A is correct because cross-workspace reuse depends on asset name, version, and registry scope. B copies text but not a registered asset, C does not create a resolvable component contract, and D changes only metadata.

The common decision point is: A local workspace component cannot be reused in another workspace until it is published or referenced through a registry. Therefore, read every scenario for the actor, the resource scope, the object version, the network path, the metric threshold, and the expected observable result.

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Data asset	Version	Immutable named version or latest label	Unversioned local path until created	Datastore or URI path	Pipeline input resolves to stale or missing data
Environment	Image and dependencies	Curated image, custom image, Conda specification	Draft YAML before registration	ACR/base image/package feed	Job fails during image build or import
Component	Interface contract	Inputs, outputs, command, environment	Local YAML until registered	Workspace or registry scope	Pipeline compilation cannot resolve component
Registry asset	Sharing scope	Registry name plus asset name/version	Unavailable outside workspace	Registry permissions and region support	Production workspace cannot consume approved asset

Step-by-Step Execution Path

Step 1: Register the environment in the development workspace.

az ml environment create --file environment.yml -g rg-ai300 -w mlw-ai300-dev

Command type: Azure ML CLI create operation (`ml` extension); confirm the extension version in the lab environment.

Reason: Register the environment first because component execution must reference a reproducible runtime image.

Checkpoint: Environment list shows the expected name and version.
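The guide does not include `environment.yml`. A minimal sketch of a versioned environment built from a base image plus a Conda specification could look like this; the environment name `train-env`, the base image, and the package list are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Write a hypothetical environment definition and the Conda file it points to.
cat > environment.yml <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: train-env            # hypothetical environment name
version: "1"
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest   # assumed base image
conda_file: conda.yml
description: Reproducible runtime for the training component
EOF

cat > conda.yml <<'EOF'
name: train-env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
      - scikit-learn
EOF
```

Pinning `version` in the file is what lets a component refer to `train-env` at a fixed version instead of whatever was registered most recently.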

Step 2: Register the component in the development workspace.

az ml component create --file component.yml -g rg-ai300 -w mlw-ai300-dev

Command type: Azure ML CLI create operation (`ml` extension); confirm the extension version in the lab environment.

Reason: Create the component after its environment exists so the component contract can resolve its runtime dependency.

Checkpoint: Component show output contains inputs, outputs, command, and environment reference.
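A sketch of a `component.yml` that carries the four parts named in the checkpoint (inputs, outputs, command, environment reference). The component name `train_model`, the `./src` folder, the script name, and the `train-env` environment are hypothetical.

```shell
#!/usr/bin/env bash
# Write a hypothetical command component with an explicit version and
# a workspace-scoped environment reference.
cat > component.yml <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_model          # hypothetical component name
version: "1"
type: command
inputs:
  training_data:
    type: uri_folder
outputs:
  model_output:
    type: uri_folder
code: ./src                # hypothetical source folder containing train.py
environment: azureml:train-env:1
command: >-
  python train.py
  --data ${{inputs.training_data}}
  --out ${{outputs.model_output}}
EOF
```

One caveat, stated as an assumption: when the same component is published to a registry, a workspace-scoped environment reference such as `azureml:train-env:1` would probably need to be replaced with a registry-scoped environment so the consuming workspace can resolve it.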

Step 3: Publish the component to the registry.

az ml component create --file component.yml --registry-name ai300registry

Command type: Azure ML CLI create operation against a registry (`ml` extension); confirm the extension version in the lab environment.

Reason: Publish to registry when another workspace must consume the same component without copying YAML by hand.

Checkpoint: Registry component list shows the approved version.

Step 4: Run the pipeline in the production workspace.

az ml job create --file pipeline.yml -g rg-ai300-prod -w mlw-ai300-prod

Command type: Azure ML CLI create operation (`ml` extension); confirm the extension version in the lab environment.

Reason: Run the pipeline in the consuming workspace to prove registry-scoped asset resolution works.

Checkpoint: Pipeline job graph resolves the registry component version and starts execution.
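A sketch of a `pipeline.yml` for the production workspace that consumes the component by its registry path instead of a workspace-local name. The component name `train_model`, version `1`, the compute name, and the datastore path are assumptions; only the registry name `ai300registry` comes from the commands above.

```shell
#!/usr/bin/env bash
# Write a hypothetical pipeline job whose single step references a registry component.
cat > pipeline.yml <<'EOF'
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: prod-train-pipeline
settings:
  default_compute: azureml:cpu-cluster   # hypothetical compute in the production workspace
inputs:
  training_data:
    type: uri_folder
    path: azureml://datastores/trainingdata/paths/smoke/   # hypothetical path
jobs:
  train:
    type: command
    component: azureml://registries/ai300registry/components/train_model/versions/1
    inputs:
      training_data: ${{parent.inputs.training_data}}
EOF
```

Because the reference carries registry, name, and version, the production workspace does not need any access to the development workspace to resolve it.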

Technical Chain

A user, workflow, or deployment command targets a data asset and submits configuration to the Azure control plane or a project runtime. Azure validates identity, resource scope, quota, version references, and network reachability because the runtime cannot safely use an object that is not authorized, versioned, reachable, or measurable. The configured object then participates in the runtime path through the environment, component, and registry. This sequence works because each object unlocks the next dependency: identity allows access, versioning allows reproducibility, network resolution allows execution, and telemetry allows verification. When the workload executes, telemetry, status output, logs, API responses, or evaluation metrics prove whether the chain is complete. If the chain breaks, the failure appears as the operational symptom described in the scenario: pipelines fail when an asset version is omitted, archived, or resolved from the wrong workspace scope. An incorrect configuration leaves the observed failure in place because it changes a nearby object while the actual missing dependency stays unresolved.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Show registered environment version	`az ml environment show -n <env-name> --version <version> -g rg-ai300 -w mlw-ai300-dev`	Environment resolves with expected image and dependencies. Command type: Azure ML CLI verification; confirm extension/version in the lab environment.
Inspect component contract	`az ml component show -n <component-name> --version <version> -g rg-ai300 -w mlw-ai300-dev`	Inputs, outputs, command, and environment reference are present. Command type: Azure ML CLI verification; confirm extension/version in the lab environment.
Verify registry asset	`az ml component show -n <component-name> --version <version> --registry-name ai300registry`	Registry returns the shared component version. Command type: Azure ML CLI verification; confirm extension/version in the lab environment.
Confirm consuming pipeline state	`az ml job show --name <pipeline-job> -g rg-ai300-prod -w mlw-ai300-prod --query status`	Pipeline compiles and starts with registry asset references. Command type: Azure ML CLI verification; confirm extension/version in the lab environment.

Implement infrastructure as code for Azure Machine Learning

Exam Radar

Official Blueprint Mapping: Configure GitHub integration with Machine Learning to enable secure access; Deploy Machine Learning workspaces and resources by using Bicep and Azure CLI; Automate resource provisioning by using GitHub Actions workflows; Restrict network access to Machine Learning workspaces; Manage source control for machine learning projects by using Git.
Domain Weight: Design and implement an MLOps infrastructure represents 15-20% of the official skills measured.
Core Priority: The exam tests whether the candidate can carry out this sub-skill as a concrete Azure workflow within the "Design and implement an MLOps infrastructure" domain, not merely identify the service name.
High Frequency: Expect scenario wording that combines Bicep module, GitHub Actions workflow, Federated credential, and verification evidence from commands, logs, metrics, or portal state.
Confusion Alert: Secret-based deployment can work but creates rotation and leakage risk; OIDC federation provides short-lived workflow authentication.
Scenario Logic: Choose the answer that creates a verifiable dependency chain: configure the object, bind identity or version, run the operation, then prove the resulting state.
Version Delta: AI-300 combines Azure Machine Learning and Microsoft Foundry under AIOps, so this point must be read as an operational platform task rather than a standalone concept.
Failure Trigger: Automation fails before resource deployment when the workflow lacks id-token permission or the federated credential subject does not match the branch.
Operational Dependency: The task depends on Bicep module, GitHub Actions workflow, Federated credential, Azure CLI deployment, Network rule, Git repository and a validation step that proves the configured state is actually usable.
How the Exam Asks It: A team needs repeatable Azure ML environments and wants GitHub Actions to deploy workspaces, storage, network rules, and compute without storing long-lived secrets.
How Distractors Are Designed: Distractors use manual portal creation, broad subscription Owner secrets, post-deployment screenshots, or local CLI commands from an engineer laptop. These approaches cannot prove repeatability or secure pipeline identity.
Why the Correct Answer Works: The correct answer uses Bicep or ARM plus GitHub Actions OIDC/federated credentials and validates deployment outputs, because the infrastructure state must be reproducible and attributable to CI.

Atomic Deconstruction - Operational Level

Microscopic technical focus: Deploying secure Machine Learning infrastructure with Bicep, Azure CLI, GitHub Actions, and source control.

Beginner explanation: IaC means the environment can be recreated by a pipeline. The exam cares less about the template language itself and more about whether identity, network, and resource dependencies are repeatable.

Operational split for this point: start with Bicep module, then verify GitHub Actions workflow and Federated credential before trusting any production outcome. The exam is testing whether the candidate can locate the missing dependency, not whether the candidate recognizes every service name in the scenario.

For this knowledge point, the target objects are Bicep module, GitHub Actions workflow, Federated credential, Azure CLI deployment, Network rule, Git repository. The exam usually describes one broken link in that chain. The correct answer is the option that restores the missing operational dependency rather than the option that only describes the platform at a high level.

Why-layer: A Bicep module becomes exam-relevant only when the surrounding dependency chain can run. In this topic, automation fails before resource deployment when the workflow lacks id-token permission or the federated credential subject does not match the branch. The correct configuration matters because it changes the state that controls execution, authorization, resolution, evaluation, or observability; a nearby but unrelated action leaves the same failure mode in place.
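The subject mismatch is easier to spot once the expected string is written out. The sketch below builds the subject claim GitHub places in its OIDC token for a branch, tag, environment, or pull request trigger. The function name `github_oidc_subject` and the sample repository `contoso/ml-infra` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the OIDC subject claim for a GitHub repository.
# repo is "owner/name"; kind is branch, tag, environment, or pull_request.
github_oidc_subject() {
  local repo="$1" kind="$2" value="${3:-}"
  case "$kind" in
    branch)       echo "repo:${repo}:ref:refs/heads/${value}" ;;
    tag)          echo "repo:${repo}:ref:refs/tags/${value}" ;;
    environment)  echo "repo:${repo}:environment:${value}" ;;
    pull_request) echo "repo:${repo}:pull_request" ;;
    *)            echo "unsupported subject kind: $kind" >&2; return 1 ;;
  esac
}
```

A credential created for `repo:contoso/ml-infra:ref:refs/heads/main` would not match a run triggered from another branch, or a job that targets a GitHub environment, because the token's subject takes the environment form in that case.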

Practice question: A team needs repeatable Azure ML environments and wants GitHub Actions to deploy workspaces, storage, network rules, and compute without storing long-lived secrets. Which approach meets these requirements?

A. Use Bicep or ARM deployment from GitHub Actions with OIDC federation and validate the resource-group deployment output.
B. Create the workspace manually in the portal and export screenshots for audit evidence.
C. Store an Owner client secret in GitHub repository secrets and reuse it for every environment.
D. Run Azure CLI commands locally from an engineer workstation after each merge.

Correct Answer: A

Explanation: A is correct because it provides repeatable infrastructure and short-lived CI identity. B and D are not repeatable pipeline controls, while C works technically but creates long-lived credential risk.

The common decision point is: Secret-based deployment can work but creates rotation and leakage risk; OIDC federation provides short-lived workflow authentication. Therefore, read every scenario for the actor, the resource scope, the object version, the network path, the metric threshold, and the expected observable result.

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Bicep module	Resource graph	Workspace, storage, key vault, ACR, networking	Template only until deployed	Resource group and provider registration	Deployment fails or creates incomplete dependency graph
Federated credential	Subject claim	Repository, branch, environment, or workflow subject	No trust relationship	Entra app and GitHub OIDC token	azure/login fails before deployment
GitHub workflow	Permissions	id-token: write and contents: read	Token unavailable by default	OIDC federation and Azure login	Workflow cannot request Azure token
Deployment output	State evidence	Succeeded, failed, outputs JSON	No evidence until queried	Azure Resource Manager deployment record	Pipeline cannot prove created resources
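The GitHub workflow row above can be sketched as a workflow file. This is an illustrative outline, not the guide's pipeline: the file name, the secret names `AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, and `AZURE_SUBSCRIPTION_ID`, and the action versions are assumptions. The identifiers stored in those secrets are IDs, not credentials; no client secret appears anywhere.

```shell
#!/usr/bin/env bash
# Write a hypothetical workflow that logs in with OIDC and deploys the Bicep template.
mkdir -p .github/workflows
cat > .github/workflows/deploy-infra.yml <<'EOF'
name: deploy-infra
on:
  push:
    branches: [main]
permissions:
  id-token: write   # lets the job request an OIDC token
  contents: read    # lets checkout read the repository
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az deployment group create -g rg-ai300 -f infra/main.bicep
EOF
```

Without the `id-token: write` line the login step has no token to exchange, which matches the failure state in the table.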

Step-by-Step Execution Path

Step 1: Validate the template.

az deployment group validate -g rg-ai300 -f infra/main.bicep

Command type: Azure CLI validation; confirm parameters against the active Azure CLI version.

Reason: Validate before deployment because syntax and parameter errors should fail before any resource mutation occurs.

Checkpoint: Validation returns no template errors.

Step 2: Deploy the template.

az deployment group create -g rg-ai300 -f infra/main.bicep

Command type: Azure CLI deployment; confirm parameters against the active Azure CLI version.

Reason: Deploy the full dependency graph together so workspace resources and dependent services remain consistent.

Checkpoint: Deployment provisioningState is Succeeded.
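The validate and deploy steps can be chained so that a failed validation stops the run before anything is created. A minimal sketch, assuming Bash and the Azure CLI; the function name `deploy_infra` is hypothetical, and the explicit `-n` deployment name is added here so the later `az deployment group show -n main` lookup does not depend on the default name.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: validate first, deploy only if validation passes,
# and print the resulting provisioning state.
deploy_infra() {
  local rg="$1" template="$2" name="${3:-main}"
  if ! az deployment group validate -g "$rg" -f "$template" >/dev/null; then
    echo "validation failed; nothing was deployed" >&2
    return 1
  fi
  az deployment group create -g "$rg" -n "$name" -f "$template" \
    --query properties.provisioningState -o tsv
}
```

Calling `deploy_infra rg-ai300 infra/main.bicep` would be expected to print Succeeded on a clean deployment.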

Step 3: Create the federated credential.

az ad app federated-credential create --id <app-id> --parameters credential.json

Command type: Azure CLI Entra ID configuration; confirm parameters against the active Azure CLI version.

Reason: Create OIDC trust so GitHub can obtain short-lived Azure tokens without storing a client secret.

Checkpoint: The federated credential subject matches the repository branch or environment.
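The `credential.json` file is not shown in this guide. A sketch of what it might contain for a workflow that runs on the `main` branch follows; the credential name and the repository `contoso/ml-infra` are hypothetical, while the issuer and audience are the values GitHub Actions and Entra ID use for workload identity federation.

```shell
#!/usr/bin/env bash
# Write a hypothetical federated credential definition for a GitHub branch trigger.
cat > credential.json <<'EOF'
{
  "name": "github-main",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "repo:contoso/ml-infra:ref:refs/heads/main",
  "description": "GitHub Actions deployments from the main branch",
  "audiences": ["api://AzureADTokenExchange"]
}
EOF
```

The `subject` value has to equal the subject claim in the workflow's token exactly; a workflow that targets a GitHub environment would need the environment form of the subject instead.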

Step 4: Inspect the workflow run.

gh run view <run-id> --log

Command type: GitHub CLI workflow verification.

Reason: Inspect workflow logs because CI evidence must show the template was deployed by the pipeline identity.

Checkpoint: Logs show azure/login and deployment steps completed successfully.

Technical Chain

A user, workflow, or deployment command targets a Bicep module and submits configuration to the Azure control plane or a project runtime. Azure validates identity, resource scope, quota, version references, and network reachability because the runtime cannot safely use an object that is not authorized, versioned, reachable, or measurable. The configured object then participates in the runtime path through the GitHub Actions workflow, federated credential, and Azure CLI deployment. This sequence works because each object unlocks the next dependency: identity allows access, versioning allows reproducibility, network resolution allows execution, and telemetry allows verification. When the workload executes, telemetry, status output, logs, API responses, or evaluation metrics prove whether the chain is complete. If the chain breaks, the failure appears as the operational symptom described in the scenario: automation fails before resource deployment when the workflow lacks id-token permission or the federated credential subject does not match the branch. An incorrect configuration leaves the observed failure in place because it changes a nearby object while the actual missing dependency stays unresolved.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Validate template syntax	`az deployment group validate -g rg-ai300 -f infra/main.bicep`	Validation completes without template errors. Command type: Azure CLI verification; confirm parameters against the active Azure CLI version.
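Inspect deployment state	`az deployment group show -g rg-ai300 -n main --query properties.provisioningState`	Provisioning state is Succeeded. The name `main` is the default deployment name taken from the template file name when `-n` is not passed at create time. Command type: Azure CLI verification; confirm parameters against the active Azure CLI version.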
Verify federated credential	`az ad app federated-credential list --id <app-id> -o table`	Credential subject matches GitHub repository branch or environment. Command type: Azure CLI verification; confirm parameters against the active Azure CLI version.
Inspect workflow evidence	`gh run view <run-id> --log`	Log shows azure/login and deployment steps succeeded. Command type: GitHub CLI workflow verification.
