Microscopic technical focus: Tuning chunk sizes, similarity thresholds, retrieval strategies, embedding models, hybrid search, relevance metrics, and A/B tests.
Beginner explanation: RAG quality usually fails before the model answers: the wrong chunks are retrieved, ranked, filtered, or packed into the prompt.
Operational split for this point: start with Chunking policy, then verify Vector index and Embedding model before trusting any production outcome. The exam is testing whether the candidate can locate the missing dependency, not whether the candidate recognizes every service name in the scenario.
For this knowledge point, the target objects are Chunking policy, Vector index, Embedding model, Similarity threshold, Hybrid search profile, A/B test variant. The exam usually describes one broken link in that chain. The correct answer is the option that restores the missing operational dependency rather than the option that only describes the platform at a high level.
Why-layer: Chunking policy becomes exam-relevant only when the surrounding dependency chain can run. In this topic, Answers become ungrounded when thresholds exclude relevant chunks, chunking breaks context, or embedding vectors are generated with a different model than the index. The correct configuration matters because it changes the state that controls execution, authorization, resolution, evaluation, or observability; a nearby but unrelated action leaves the same failure mode in place.
Decision tree: if the scenario describes access failure, inspect identity and RBAC before changing compute or code; if it describes unresolved assets, inspect name, version, and scope; if it describes runtime failure, inspect logs, endpoint invocation, metrics, or evaluation output; if it describes quality degradation, inspect data, retrieval, evaluation, and monitoring evidence before changing the model.
Common mistakes: Choosing fine-tuning when retrieved context is missing or poorly ranked. Changing the embedding model without rebuilding the vector index. Raising top-k without measuring latency and context noise.
Practice question: A RAG app gives ungrounded answers or misses exact product and policy terms, and the team must decide whether to tune retrieval or fine-tune the model.
A. Evaluate retrieval quality, tune chunking/thresholds/embedding or hybrid search, and compare variants before considering fine-tuning. B. Fine-tune immediately because all ungrounded answers indicate model knowledge gaps. C. Increase top-k without measuring context noise, latency, or groundedness. D. Change the embedding model while keeping the old vector index.
Correct Answer: A
Explanation: A is correct because RAG failures often occur before generation. B solves the wrong layer, C may add irrelevant context, and D creates incompatible vector comparisons.
The common decision point is: Fine-tuning is the wrong first fix when the root cause is missing retrieval coverage or poorly ranked source chunks. Therefore, read every scenario for the actor, the resource scope, the object version, the network path, the metric threshold, and the expected observable result.
| Object | Attribute | Value Range | Default State | Dependency | Failure State |
|---|---|---|---|---|---|
| Chunking policy | Window and overlap | Token/character size plus overlap | Default splitter | Document structure and tokenizer | Relevant evidence is split or diluted |
| Embedding model | Vector space | Model-specific dimension | Existing index vector space | Re-embedding pipeline | Query vectors cannot compare correctly |
| Similarity threshold | Candidate filter | Engine-specific score range | Untuned default | Evaluation dataset | Relevant chunks excluded or noisy chunks included |
| Hybrid search | Ranking blend | Vector, keyword, semantic reranker | Single retrieval mode | Search index fields | Exact identifiers or paraphrases are missed |
python rag_eval.py --dataset evaluations/rag-gold.jsonl --index policies-v1 --metrics recall_at_k mrr groundedness
Command type: local lab rehearsal script, not an official Azure command.
Reason: Establish a baseline so changes can be judged by retrieval and answer quality instead of anecdotal samples.
Checkpoint: Report contains recall@k, MRR, and groundedness for the current index.
python build_index.py --source data/policies --chunk-size 800 --chunk-overlap 120 --embedding-model text-embedding-3-large --index policies-v2
Command type: local lab rehearsal script, not an official Azure command.
Reason: Rebuild the index after chunk or embedding changes because stored vectors and chunk boundaries define retrieval behavior.
Checkpoint: Index metadata shows expected document count and vector dimension.
python rag_eval.py --dataset evaluations/rag-gold.jsonl --index policies-v2 --retrieval hybrid --semantic-rerank true
Command type: local lab rehearsal script, not an official Azure command.
Reason: Test hybrid retrieval when exact IDs and semantic paraphrases both matter.
Checkpoint: Metrics improve without unacceptable latency growth.
python route_variant.py --variant-a policies-v1 --variant-b policies-v2 --split 50 --metric groundedness
Command type: local lab rehearsal script, not an official Azure command.
Reason: Run a controlled variant test so production decision uses measured quality, latency, and token impact.
Checkpoint: Telemetry is segmented by variant ID and shows winning configuration.
A user, workflow, or deployment command targets Chunking policy and submits configuration to Azure control plane or a project runtime. Azure validates identity, resource scope, quota, version references, and network reachability because the runtime cannot safely use an object that is not authorized, versioned, reachable, or measurable. The configured object then participates in the runtime path through Vector index, Embedding model, Similarity threshold. This sequence works because each object unlocks the next dependency: identity allows access, versioning allows reproducibility, network resolution allows execution, and telemetry allows verification. When the workload executes, telemetry, status output, logs, API response, or evaluation metrics prove whether the chain is complete. If the chain breaks, the failure appears as the operational symptom described in the scenario: Answers become ungrounded when thresholds exclude relevant chunks, chunking breaks context, or embedding vectors are generated with a different model than the index. An incorrect configuration creates the observed failure because it changes a nearby object while leaving the actual missing dependency unresolved.
| Task | Precise Command or Path | Verification Standard |
|---|---|---|
| Measure retrieval baseline | python rag_eval.py --dataset evaluations/rag-gold.jsonl --index policies-v1 --metrics recall_at_k mrr groundedness |
Baseline report contains retrieval and groundedness metrics. Command type: local lab rehearsal script, not an official Azure command. |
| Inspect rebuilt index metadata | python inspect_index.py --index policies-v2 --show-metadata |
Chunk count, embedding model, and vector dimension match the intended configuration. Command type: local lab rehearsal script, not an official Azure command. |
| Compare hybrid retrieval | python rag_eval.py --dataset evaluations/rag-gold.jsonl --index policies-v2 --retrieval hybrid --semantic-rerank true |
Metrics improve without unacceptable latency growth. Command type: local lab rehearsal script, not an official Azure command. |
| Check variant telemetry | python compare_variants.py --metric groundedness --variants policies-v1 policies-v2 |
Telemetry is segmented by variant and identifies the better configuration. Command type: local lab rehearsal script, not an official Azure command. |
Microscopic technical focus: Preparing supervised data, synthetic data, validation sets, fine-tuned model versions, monitoring, and production promotion.
Beginner explanation: Fine-tuning changes model behavior. It is appropriate after prompting and retrieval limits are proven, not as a shortcut for missing knowledge.
Operational split for this point: start with Fine-tuning dataset, then verify Synthetic data and Validation split before trusting any production outcome. The exam is testing whether the candidate can locate the missing dependency, not whether the candidate recognizes every service name in the scenario.
For this knowledge point, the target objects are Fine-tuning dataset, Synthetic data, Validation split, Fine-tuned model, Deployment version, Performance monitor. The exam usually describes one broken link in that chain. The correct answer is the option that restores the missing operational dependency rather than the option that only describes the platform at a high level.
Why-layer: Fine-tuning dataset becomes exam-relevant only when the surrounding dependency chain can run. In this topic, A customized model regresses in production when synthetic examples contain label noise, duplicate patterns, unsafe outputs, or no validation baseline. The correct configuration matters because it changes the state that controls execution, authorization, resolution, evaluation, or observability; a nearby but unrelated action leaves the same failure mode in place.
Decision tree: if the scenario describes access failure, inspect identity and RBAC before changing compute or code; if it describes unresolved assets, inspect name, version, and scope; if it describes runtime failure, inspect logs, endpoint invocation, metrics, or evaluation output; if it describes quality degradation, inspect data, retrieval, evaluation, and monitoring evidence before changing the model.
Common mistakes: Selecting a familiar Azure service without checking the missing dependency in the scenario. Treating a successful create operation as proof of runtime behavior. Choosing a monitoring action when the scenario asks for configuration or access remediation.
Practice question: Prompting and retrieval are stable, but the model still fails a repeated format, domain language, or task behavior requirement.
A. Validate curated training and holdout data, inspect synthetic examples, evaluate the customized model, and monitor the promoted deployment. B. Generate a large synthetic dataset and fine-tune without manual quality review. C. Use fine-tuning to add missing facts that should have been retrieved from source documents. D. Skip validation because training loss is enough to prove production readiness.
Correct Answer: A
Explanation: A is correct because customization requires clean examples, holdout evaluation, and production monitoring. B risks noisy behavior, C misuses fine-tuning for knowledge injection, and D ignores overfitting and safety regression.
The common decision point is: Fine-tuning should target behavior, structure, or domain patterns after retrieval and prompting limits are proven with evaluation evidence. Therefore, read every scenario for the actor, the resource scope, the object version, the network path, the metric threshold, and the expected observable result.
| Object | Attribute | Value Range | Default State | Dependency | Failure State |
|---|---|---|---|---|---|
| Fine-tuning dataset | Example quality | Prompt/completion or chat format | Unusable until schema-valid | Model fine-tuning format | Training fails or learns noisy behavior |
| Synthetic data | Augmentation source | Generated examples with review status | Untrusted until inspected | Quality and safety review | Model learns unrealistic patterns |
| Validation split | Holdout evidence | Representative labeled examples | Absent until separated | Evaluation workflow | Overfit model appears successful |
| Fine-tuned deployment | Promotion state | Candidate, staging, production | Not serving until deployed | Endpoint and monitor | Regression reaches users without rollback path |
python validate_finetune_data.py --train data/train.jsonl --validation data/valid.jsonl
Command type: local lab rehearsal script, not an official Azure command.
Reason: Validate schema and split before training because malformed or duplicated examples create unreliable customization.
Checkpoint: Report shows schema pass, duplicate rate, and label distribution.
python inspect_synthetic_data.py --input data/synthetic.jsonl --report reports/synthetic-quality.json
Command type: local lab rehearsal script, not an official Azure command.
Reason: Inspect synthetic data because generated examples can amplify wrong labels or unsafe patterns.
Checkpoint: Quality report flags duplicates, leakage, and safety concerns.
python evaluate.py --dataset evaluations/finetune-valid.jsonl --model <fine-tuned-model>
Command type: local lab rehearsal script, not an official Azure command.
Reason: Evaluate the customized model on holdout cases so improvement is proven beyond the training examples.
Checkpoint: Evaluation report shows target behavior improves without safety regression.
az monitor metrics list --resource <deployment-resource-id> --metric Latency,TotalCalls,Failures
Command type: Azure Monitor CLI verification; confirm metric names for the selected Azure resource type.
Reason: Monitor the fine-tuned deployment because customization can change latency, failure rate, and output shape.
Checkpoint: Metrics show stable production behavior after promotion.
A user, workflow, or deployment command targets Fine-tuning dataset and submits configuration to Azure control plane or a project runtime. Azure validates identity, resource scope, quota, version references, and network reachability because the runtime cannot safely use an object that is not authorized, versioned, reachable, or measurable. The configured object then participates in the runtime path through Synthetic data, Validation split, Fine-tuned model. This sequence works because each object unlocks the next dependency: identity allows access, versioning allows reproducibility, network resolution allows execution, and telemetry allows verification. When the workload executes, telemetry, status output, logs, API response, or evaluation metrics prove whether the chain is complete. If the chain breaks, the failure appears as the operational symptom described in the scenario: A customized model regresses in production when synthetic examples contain label noise, duplicate patterns, unsafe outputs, or no validation baseline. An incorrect configuration creates the observed failure because it changes a nearby object while leaving the actual missing dependency unresolved.
| Task | Precise Command or Path | Verification Standard |
|---|---|---|
| Validate fine-tuning data | python validate_finetune_data.py --train data/train.jsonl --validation data/valid.jsonl |
Report shows schema pass, duplicate rate, and label distribution. Command type: local lab rehearsal script, not an official Azure command. |
| Inspect synthetic examples | python inspect_synthetic_data.py --input data/synthetic.jsonl --report reports/synthetic-quality.json |
Quality report flags duplicates, leakage, and safety concerns. Command type: local lab rehearsal script, not an official Azure command. |
| Evaluate customized model | python evaluate.py --dataset evaluations/finetune-valid.jsonl --model <fine-tuned-model> |
Holdout results prove improvement without safety regression. Command type: local lab rehearsal script, not an official Azure command. |
| Review deployment metrics | az monitor metrics list --resource <deployment-resource-id> --metric Latency,TotalCalls,Failures |
Production telemetry remains stable after promotion. Command type: Azure Monitor CLI verification; confirm metric names for the selected Azure resource type. |
What should be tuned first when a RAG application gives ungrounded answers even though the source documents contain the correct information?
Evaluate and tune retrieval settings such as chunking, similarity thresholds, top-k, embedding model, and hybrid search before fine-tuning.
Ungrounded answers often happen because the generator never receives the right context. Chunk boundaries may split important evidence, thresholds may exclude relevant passages, or retrieval may miss exact terms that keyword search would find. Fine-tuning is usually the wrong first fix when the root cause is missing or poorly ranked context. AI-300 expects candidates to diagnose the retrieval layer before changing the model behavior.
Demand Score: 95
Exam Relevance Score: 99
Why must a vector index usually be rebuilt after changing the embedding model?
Because stored vectors and query vectors must come from the same embedding space to be compared correctly.
Embedding models create vectors with model-specific dimensions and semantic spaces. If a query uses a new embedding model against an index built with an old model, similarity scores may be invalid or fail entirely. Rebuilding the index ensures that documents and queries use compatible vectors. This is a high-value exam detail because changing only the query model looks plausible but leaves the retrieval system inconsistent.
Demand Score: 87
Exam Relevance Score: 95
When should hybrid search be considered for a RAG solution?
Use hybrid search when the application must handle both semantic meaning and exact keywords, identifiers, product names, or policy terms.
Vector search is strong for semantic similarity, but it can miss exact strings or rare identifiers. Keyword search can capture exact matches but may miss paraphrases. Hybrid search combines both approaches and can improve retrieval coverage for enterprise documents, policies, product catalogs, and technical references. AI-300 may ask this as an optimization decision when relevant context is available but retrieval misses exact domain terms.
Demand Score: 86
Exam Relevance Score: 94
How should A/B testing be used to improve a RAG system?
Compare retrieval or prompt variants with controlled traffic or evaluation datasets and measure groundedness, relevance, latency, and cost.
RAG optimization should be measured rather than assumed. A/B testing can compare chunk sizes, retrieval strategies, rerankers, prompts, or embedding models while tracking quality and operational trade-offs. The winning variant should improve the target metrics without creating unacceptable latency or token cost. AI-300 emphasizes this because production GenAI optimization requires evidence, not isolated sample inspection.
Demand Score: 84
Exam Relevance Score: 93
When is fine-tuning a better choice than prompt or retrieval tuning?
Fine-tuning is appropriate when evaluation proves the model needs consistent behavior, format, style, or domain task patterns that prompting and retrieval cannot reliably produce.
Fine-tuning should not be used to store facts that should be retrieved from source data. It is better suited for learned behavior patterns, structured output style, domain language, or repeated task execution after simpler controls have been evaluated. AI-300 scenarios often contrast retrieval fixes with fine-tuning. The correct choice depends on evidence about where the failure occurs.
Demand Score: 89
Exam Relevance Score: 96
What controls are needed before promoting a fine-tuned model to production?
Validate training and holdout data, review synthetic data quality, evaluate the candidate model, monitor deployment metrics, and keep a rollback path.
Fine-tuned models can regress if training data is noisy, synthetic examples are unrealistic, validation data is weak, or production monitoring is absent. The team should prove improvement on holdout cases and monitor quality, safety, latency, failures, and cost after promotion. AI-300 treats fine-tuning as a lifecycle operation, so the correct answer includes data quality, evaluation, deployment control, and ongoing monitoring.
Demand Score: 92
Exam Relevance Score: 98