In AI-102, this knowledge area focuses on building systems that can:
Ingest data from many sources (both structured and unstructured)
Extract information from that data using AI
Enrich the data with additional insights (entities, key phrases, vectors, classifications)
Make the data searchable and explorable for applications and users
The core service used for this is Azure AI Search, often combined with AI enrichment and document extraction services provided by Microsoft.
From an exam perspective, this domain is about engineering a search and extraction pipeline, not about training ML models.
You are expected to understand how to:
Provision search resources
Design indexes
Configure indexers and skillsets
Run and monitor ingestion
Query data effectively
Support semantic and vector-based retrieval
Project enriched data for downstream use
Azure AI Search is the backbone of knowledge mining in AI-102. It is not just “search”; it is a data ingestion + enrichment + retrieval platform.
Each component has a clear responsibility. Understanding how they fit together is critical.
A search resource is the top-level Azure service instance.
It:
Hosts all your indexes
Manages indexers and skillsets
Exposes query endpoints for applications
Beginner analogy:
Think of the search resource as the search server that everything lives inside.
Important exam note:
You must provision this before anything else.
Capacity, region, and pricing tier matter for performance and features.
An index defines how your data is stored and queried.
It is similar to a database schema, but optimized for search.
An index defines:
Fields and their capabilities
Analyzers
Scoring and semantic behavior
Vector configurations (when used)
Each field has:
A name
A data type (string, number, date, boolean, complex type)
Capabilities:
searchable – full-text search
filterable – exact matching in filters
sortable – ordering results
facetable – aggregations for UI filters
Beginner example:
title: searchable, sortable
category: filterable, facetable
content: searchable
publishDate: filterable, sortable
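A minimal sketch of this example index using the Python azure-search-documents SDK (the index name, endpoint, and key are placeholders):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchFieldDataType,
    SearchIndex,
    SearchableField,
    SimpleField,
)

index_client = SearchIndexClient(
    endpoint="https://<your-service>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

index = SearchIndex(
    name="articles",
    fields=[
        # Every index needs exactly one key field.
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="title", sortable=True),
        SimpleField(
            name="category",
            type=SearchFieldDataType.String,
            filterable=True,
            facetable=True,
        ),
        SearchableField(name="content"),
        SimpleField(
            name="publishDate",
            type=SearchFieldDataType.DateTimeOffset,
            filterable=True,
            sortable=True,
        ),
    ],
)
index_client.create_index(index)
```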
Analyzers control:
Tokenization
Language rules
Stemming and normalization
Why this matters:
The same text can behave very differently depending on the analyzer.
Choosing the correct language analyzer improves relevance.
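For example, with the Python SDK a language analyzer is assigned per field (field name illustrative):

```python
from azure.search.documents.indexes.models import SearchableField

# "en.microsoft" applies Microsoft's English language rules at index and query time.
content_field = SearchableField(name="content", analyzer_name="en.microsoft")
```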
Scoring profiles and semantic configurations define:
Which fields matter more (for example: title > body)
How semantic ranking behaves (when enabled)
When using vector search:
You define fields that store embeddings
These fields are used for similarity search
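A sketch of a vector field definition (the field name and profile name are illustrative; 1536 dimensions matches some OpenAI embedding models, others differ):

```python
from azure.search.documents.indexes.models import SearchField, SearchFieldDataType

vector_field = SearchField(
    name="contentVector",
    type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
    searchable=True,  # vector fields must be searchable
    vector_search_dimensions=1536,
    # Must reference a profile defined in the index's vector_search configuration:
    vector_search_profile_name="my-vector-profile",
)
```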
Key beginner takeaway:
The index is the contract between ingestion and querying.
A bad index design leads to poor search results.
A data source tells Azure AI Search where your content lives.
Common sources include:
Azure Blob Storage (documents, PDFs, images)
Azure Data Lake Storage
Azure SQL Database
Azure Cosmos DB
Other supported connectors
A data source defines:
Connection details
Authentication method
Container/table/query to read from
Beginner note:
A data source does not move data by itself.
It is used by an indexer.
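A sketch of registering a Blob Storage container as a data source (all names and the connection string are placeholders):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
)

indexer_client = SearchIndexerClient(
    endpoint="https://<your-service>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

data_source = SearchIndexerDataSourceConnection(
    name="docs-blob-datasource",
    type="azureblob",
    connection_string="<storage-connection-string>",
    container=SearchIndexerDataContainer(name="documents"),
)
indexer_client.create_data_source_connection(data_source)
```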
An indexer is the ingestion engine.
It:
Reads data from a data source
Optionally applies AI enrichment via a skillset
Writes processed documents into an index
Key properties:
Can run on demand
Can run on a schedule
Supports incremental updates
Beginner analogy:
Think of the indexer as an automated ETL job: it extracts from the data source, transforms content via the skillset, and loads the results into the index.
Exam-relevant idea:
The indexer is the component that ties the data source, skillset, and index together.
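A sketch of an indexer that runs hourly, reusing the names from the earlier sketches:

```python
from datetime import timedelta
from azure.search.documents.indexes.models import IndexingSchedule, SearchIndexer

# indexer_client as constructed in the data source sketch above.
indexer = SearchIndexer(
    name="docs-indexer",
    data_source_name="docs-blob-datasource",
    target_index_name="articles",
    schedule=IndexingSchedule(interval=timedelta(hours=1)),
)
indexer_client.create_indexer(indexer)

# Or trigger a run on demand:
indexer_client.run_indexer("docs-indexer")
```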
A skillset defines how AI is applied during indexing.
It is a pipeline of “skills” that:
Read raw content
Extract information
Enrich documents with new fields
Common skill categories include:
OCR and text extraction
Language detection
Key phrase extraction
Entity recognition
Classification
Embedding generation for vector search
Custom skills (your own code)
Important:
Skillsets run only during indexing
They do not run at query time
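A sketch of a small skillset that detects language first and then extracts key phrases (names are illustrative; note the second skill consumes the first skill's output):

```python
from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    KeyPhraseExtractionSkill,
    LanguageDetectionSkill,
    OutputFieldMappingEntry,
    SearchIndexerSkillset,
)

skillset = SearchIndexerSkillset(
    name="docs-skillset",
    description="Detect language, then extract key phrases",
    skills=[
        LanguageDetectionSkill(
            inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
            outputs=[OutputFieldMappingEntry(name="languageCode", target_name="languageCode")],
        ),
        KeyPhraseExtractionSkill(
            inputs=[
                InputFieldMappingEntry(name="text", source="/document/content"),
                # Consumes the output of the language detection skill:
                InputFieldMappingEntry(name="languageCode", source="/document/languageCode"),
            ],
            outputs=[OutputFieldMappingEntry(name="keyPhrases", target_name="keyPhrases")],
        ),
    ],
)

# indexer_client as constructed in the data source sketch above.
indexer_client.create_skillset(skillset)
```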
Exam emphasis:
Because skills run only at indexing time, changing a skillset requires re-running the indexer before results change.
This section is about architecture thinking.
Before configuring anything, you must understand what content types you are ingesting.
Examples:
PDFs
Word documents
Excel files
HTML pages
Images and scanned documents
Emails
Different content types require different extraction approaches.
You must also understand the query experiences required. You may need:
Keyword search (exact terms)
Faceted navigation (filter by category, date, author)
Semantic ranking (meaning-based)
Q&A experiences
Vector similarity search
Hybrid search (keyword + vector)
Non-functional requirements are often tested indirectly in case studies:
Security trimming – users see only what they are allowed to see
Latency – response time requirements
Update frequency – near real-time vs batch indexing
Beginner rule:
Always design from requirements → index → ingestion, not the other way around.
Index design is one of the most important skills in AI-102.
Key decisions include:
Searchable fields:
Used for full-text queries
Usually large text fields (content, description)
Filterable fields:
Used in filter expressions
Usually IDs, categories, flags, dates
Facetable fields:
Used for UI filters (“show counts by category”)
Must be filterable as well
Sortable fields:
Used for ordering results
Dates, numbers, titles
Best practice:
Keep original content
Store extracted/enriched fields separately
This makes debugging and tuning much easier.
This step connects everything together.
Authentication:
Managed identity is preferred where supported
Avoid embedding secrets
Extraction settings control how content is read. Examples:
How to extract text from PDFs
Whether to include metadata
File inclusion/exclusion filters
Change detection is critical for production:
Detects changes
Avoids reprocessing everything
Run on a schedule appropriate for freshness needs
Monitor failures and warnings
Exam insight:
Skillsets are where “AI” enters the pipeline.
Text extraction and preparation skills:
Extract text from documents
Normalize encoding
Clean whitespace
Split content into chunks or pages
Why this matters:
Downstream skills produce better results when they receive clean, well-prepared text.
Language and entity skills identify:
People
Organizations
Locations
Key phrases
Language
Sentiment (when applicable)
Beginner example:
From the sentence “Contoso opened an office in Seattle,” entity recognition extracts Contoso (organization) and Seattle (location).
Other skills help understand document structure:
Headings and sections
Tables (often via document intelligence)
Document classification labels
This is important for:
Structured search
Downstream analytics
RAG scenarios
Built-in skills are not always enough.
Custom skills are used for:
Domain-specific extraction (legal clauses, product SKUs)
Custom PII detection
Proprietary classification rules
To implement a custom skill (sketched below), you:
Host your logic (commonly as Azure Functions)
Accept JSON input
Return JSON output
Map outputs into index fields
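A minimal sketch of a custom skill hosted as an Azure Function (Python v2 programming model). The SKU pattern and field names are hypothetical; the values/recordId/data envelope is the custom Web API skill contract:

```python
import json
import re

import azure.functions as func

app = func.FunctionApp()

@app.route(route="extract-skus", auth_level=func.AuthLevel.FUNCTION)
def extract_skus(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    results = []
    for record in body.get("values", []):
        text = record.get("data", {}).get("text") or ""
        # Hypothetical domain rule: SKUs look like ABC-1234.
        skus = re.findall(r"\b[A-Z]{3}-\d{4}\b", text)
        results.append({
            "recordId": record["recordId"],  # must be echoed back unchanged
            "data": {"skus": skus},
            "errors": [],
            "warnings": [],
        })
    return func.HttpResponse(
        json.dumps({"values": results}),
        mimetype="application/json",
    )
```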
Exam expectation:
Recognize when built-in skills are insufficient and a custom Web API skill is required.
A Knowledge Store lets you persist enriched outputs into:
Objects (for example, JSON documents in Blob Storage)
Tables (rows in Azure Table Storage)
Files (for example, images extracted from documents)
This allows:
BI tools
Analytics workloads
ML pipelines
to reuse extracted data without re-running indexing.
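A sketch of attaching a knowledge store to a skillset so enriched documents are also projected as JSON objects to Blob Storage (names are placeholders; verify model names against your azure-search-documents version):

```python
from azure.search.documents.indexes.models import (
    SearchIndexerKnowledgeStore,
    SearchIndexerKnowledgeStoreObjectProjectionSelector,
    SearchIndexerKnowledgeStoreProjection,
)

knowledge_store = SearchIndexerKnowledgeStore(
    storage_connection_string="<storage-connection-string>",
    projections=[
        SearchIndexerKnowledgeStoreProjection(
            objects=[
                SearchIndexerKnowledgeStoreObjectProjectionSelector(
                    storage_container="enriched-docs",
                    source="/document",  # project the whole enriched document
                )
            ]
        )
    ],
)

# Assign to the skillset before creating it:
# skillset.knowledge_store = knowledge_store
```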
Typical scenarios:
Reporting on extracted entities
Training downstream ML models
Auditing extraction quality
Debugging skillsets
Beginner note:
The Knowledge Store is for reuse outside of search; queries still run against the index.
You should understand:
Search syntax
Sorting
Filtering
Wildcards
These are fundamental and frequently tested.
Important patterns include:
Combining text search with filters
Faceted navigation
Pagination
Result shaping (select specific fields)
Highlighting snippets
Synonyms for domain terms
Scoring profiles to boost important fields
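A sketch combining several of these patterns in one query (index and field names follow the earlier sketches):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="articles",
    credential=AzureKeyCredential("<query-key>"),
)

results = search_client.search(
    search_text="contract renewal",
    filter="category eq 'legal' and publishDate ge 2024-01-01T00:00:00Z",
    facets=["category"],                  # counts for UI filters
    select=["title", "publishDate"],      # result shaping
    highlight_fields="content",           # snippet highlighting
    order_by=["publishDate desc"],
    top=10,
    skip=0,                               # increase for pagination
)
for doc in results:
    print(doc["title"])
```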
Exam insight:
Expect questions that map a UI requirement (facets, highlighting, boosting) to the correct query feature.
Semantic search:
Improves ranking using meaning
Works on top of traditional text fields
Requires semantic configuration
It improves relevance but does not replace index design.
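A sketch of a semantic query, assuming a semantic configuration named "default-semantic" already exists on the index (reusing search_client from the query sketch above):

```python
results = search_client.search(
    search_text="how do I cancel my subscription",
    query_type="semantic",
    semantic_configuration_name="default-semantic",
)
```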
Vector search:
Uses embeddings
Finds semantically similar content
Requires vector fields in the index
Key design decisions:
Embedding model
Chunk size and overlap
Vector field configuration
Hybrid retrieval strategy
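A sketch of pure vector and hybrid queries (continues the query sketch above; query_vector is assumed to come from the same embedding model used at indexing time):

```python
from azure.search.documents.models import VectorizedQuery

vector_query = VectorizedQuery(
    vector=query_vector,        # the embedding of the user's question, produced elsewhere
    k_nearest_neighbors=5,
    fields="contentVector",
)

# Pure vector search:
results = search_client.search(search_text=None, vector_queries=[vector_query])

# Hybrid search (keyword + vector):
results = search_client.search(
    search_text="cancel subscription",
    vector_queries=[vector_query],
)
```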
Exam emphasis:
Be able to choose between keyword, semantic, vector, and hybrid retrieval for a given scenario.
Typical case:
Thousands of documents
Need searchable portal with facets and extraction
Solution pattern:
Indexer + skillset (OCR + extraction)
Store results in index and Knowledge Store
Build UI on top
Typical case:
Business documents (contracts, invoices)
Need structured fields
Solution pattern:
Specialized extraction service
Normalize via skillsets/custom skills
Index for semantic/vector retrieval
You must ensure:
Search results respect permissions
ACLs are stored and filtered
Secrets are not indexed
Common issues:
Indexer failures
Skill timeouts
Schema mismatches
Stale data
Poor relevance
Beginner takeaway:
A production-ready knowledge mining solution is designed, monitored, and tuned, not just built once.
This section explains how indexers behave in real systems, beyond the basic “connect data source and run” model. These details are frequently tested in AI-102 case studies.
Change detection determines how Azure AI Search knows what has changed in your data source.
In Azure AI Search, indexers do not automatically understand updates unless you configure a strategy.
An indexer periodically checks the data source and decides:
Which documents are new
Which documents have changed
Which documents should be removed
Without change detection, the only option is a full re-index, which is costly and slow.
A high watermark field is a column such as:
LastModifiedTime
UpdatedAt
VersionNumber
The indexer stores the last processed value and only retrieves records with a higher value.
This approach:
Is simple and widely used
Requires a reliable timestamp or version field
Works well for append-and-update scenarios
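A sketch of a data source with a high watermark policy (the column name is illustrative):

```python
from azure.search.documents.indexes.models import (
    HighWaterMarkChangeDetectionPolicy,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
)

data_source = SearchIndexerDataSourceConnection(
    name="sql-datasource",
    type="azuresql",
    connection_string="<sql-connection-string>",
    container=SearchIndexerDataContainer(name="Documents"),
    # Only rows with a higher LastModifiedTime than the stored watermark are re-read:
    data_change_detection_policy=HighWaterMarkChangeDetectionPolicy(
        high_water_mark_column_name="LastModifiedTime",
    ),
)
```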
Some data sources support native change tracking mechanisms.
When available:
The indexer asks the source system directly for changes
Performance and accuracy improve
Configuration complexity may increase
Users expect search results to reflect recent updates. Poor change detection leads to stale results.
Reprocessing unchanged data:
Consumes compute
Increases indexing time
Increases operational cost
Full re-indexing:
Is disruptive
Can temporarily degrade search quality
Should be avoided in production systems
Many systems do not physically delete records. Instead, they mark them as deleted.
Logical deletion usually means:
A record still exists
A flag or status indicates it should not be visible
Examples:
isDeleted = true
status = inactive
You must explicitly configure the indexer to:
Detect the deletion indicator
Remove or hide the document in the index
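A sketch of a soft delete policy using the isDeleted flag from the example above:

```python
from azure.search.documents.indexes.models import (
    SoftDeleteColumnDeletionDetectionPolicy,
)

deletion_policy = SoftDeleteColumnDeletionDetectionPolicy(
    soft_delete_column_name="isDeleted",
    soft_delete_marker_value="true",  # documents matching this value are removed from the index
)

# Assign on the data source definition:
# data_source.data_deletion_detection_policy = deletion_policy
```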
Without this configuration:
Deleted content remains searchable
Users may see unauthorized or outdated information
Soft delete detection is essential for:
Security compliance
Accurate search results
User trust
Field mappings control how data flows from source to index.
Field mappings allow you to:
Rename fields
Combine fields
Normalize data formats
This is useful when:
Source schema does not match index schema
Naming conventions differ
Multiple sources feed the same index
During ingestion, you may:
Convert data types
Flatten structures
Apply simple transformations
This reduces complexity later in querying.
Output field mappings connect:
Skill outputs
Index fields
They define where enriched data (entities, chunks, embeddings) is stored.
Without correct output mappings:
Skills may run successfully
But results never appear in the index
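A sketch showing both mapping types on an indexer (names follow the earlier sketches; metadata_storage_name is a standard blob metadata field):

```python
from azure.search.documents.indexes.models import FieldMapping, SearchIndexer

indexer = SearchIndexer(
    name="docs-indexer",
    data_source_name="docs-blob-datasource",
    target_index_name="articles",
    skillset_name="docs-skillset",
    # Field mappings: rename a source field to match the index schema.
    field_mappings=[
        FieldMapping(source_field_name="metadata_storage_name",
                     target_field_name="title"),
    ],
    # Output field mappings: route enriched values into index fields.
    output_field_mappings=[
        FieldMapping(source_field_name="/document/keyPhrases",
                     target_field_name="keyPhrases"),
    ],
)
```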
Indexers can fail partially or completely.
Partial success means some documents are indexed while others fail
Total failure means the indexer stops entirely
In production, partial success is common and expected.
Unreadable or malformed files can cause individual failures.
Complex enrichment steps may exceed execution limits.
Expired credentials or permission changes can block access to data sources or skills.
You should regularly review:
Execution logs
Warning and error counts
Failed document samples
Monitoring helps detect issues early.
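A sketch of checking the most recent run (reusing indexer_client from the earlier sketches; attribute names per the azure-search-documents status models):

```python
status = indexer_client.get_indexer_status("docs-indexer")
last = status.last_result
if last is not None:
    print(last.status, last.item_count, last.failed_item_count)
    for error in last.errors:
        print("ERROR:", error.error_message)
    for warning in last.warnings:
        print("WARNING:", warning.message)
```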
Well-designed pipelines:
Tolerate individual document failures
Continue processing valid content
Avoid blocking the entire index because of a few bad documents
Skillsets are pipelines, not collections of independent steps.
Skills execute in the order they are defined.
Later skills:
Depend on outputs from earlier skills
Cannot access data that was not produced upstream
Incorrect ordering can cause:
Missing inputs
Empty outputs
Poor extraction quality
Correct ordering improves:
Accuracy
Performance
Cost efficiency
Skillsets often form chains.
OCR → text extraction
Text extraction → chunking
Chunking → entity extraction or embeddings
Each step builds on the previous one.
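A sketch of such a chain: split first, then extract key phrases from each chunk (paths follow the enrichment tree convention; sizes are illustrative):

```python
from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    KeyPhraseExtractionSkill,
    OutputFieldMappingEntry,
    SplitSkill,
)

split = SplitSkill(
    text_split_mode="pages",
    maximum_page_length=2000,
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="textItems", target_name="pages")],
)

# context="/document/pages/*" runs this skill once per chunk produced above.
key_phrases = KeyPhraseExtractionSkill(
    context="/document/pages/*",
    inputs=[InputFieldMappingEntry(name="text", source="/document/pages/*")],
    outputs=[OutputFieldMappingEntry(name="keyPhrases", target_name="keyPhrases")],
)
```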
Common ordering mistakes include:
Referencing fields that do not exist yet
Skipping required preprocessing steps
Incorrect field paths
These mistakes often cause silent failures.
Chunking splits large text into manageable units.
Placing chunking early ensures:
Later skills operate on smaller, focused text
Better extraction accuracy
Lower token usage
Entities are easier to detect in focused text segments.
Embeddings represent meaning more accurately when text is concise.
Smaller chunks reduce processing cost and latency.
Embedding generation is usually placed near the end.
Embeddings should reflect:
Cleaned text
Final chunk boundaries
Meaningful semantic units
Generating embeddings too early leads to poor retrieval quality.
This is a classic AI-102 trade-off question: index whole documents, or split them into chunks?
Indexing whole documents offers:
Simple schema
Fewer indexed records
Lower storage cost
But it comes with drawbacks:
Poor relevance for long documents
Weak vector similarity performance
Difficult grounding and citation
This approach struggles with modern semantic and RAG use cases.
Chunk-level indexing offers:
Higher semantic relevance
Strong vector search results
Precise citations for RAG and agents
At the cost of:
Larger index size
Higher ingestion and storage cost
More complex schema design
Prefer chunking for:
Long documents such as manuals, PDFs, and policies
RAG and agentic retrieval scenarios
Applications requiring precise citations
Prefer whole-document indexing for:
Short, well-structured records
Metadata-focused search experiences
Low semantic complexity
Search systems are eventually consistent by nature, so indexing frequency is a trade-off.
Infrequent (batch) indexing offers:
Lower cost
Simpler operations
Suitable for static or slowly changing data
Frequent (near real-time) indexing brings:
Higher operational complexity
Required for frequently updated content
Demands careful monitoring
You should expect:
Delays between source updates and visibility
Temporary inconsistencies during re-indexing
These delays affect:
User trust
Application behavior
Business workflows
Common risks include:
Orphaned documents that should be deleted
Outdated metadata
Security trimming mismatches after permission changes
These risks must be actively managed.
Indexers are not suitable for:
Very high write-frequency data
Scenarios requiring strict transactional consistency
In those cases, consider alternatives (see the sketch below):
Push indexing, where your application writes documents to the index directly
Hybrid architectures combining push and pull models
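A sketch of the push model (index and field names follow the earlier sketches):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="articles",
    credential=AzureKeyCredential("<admin-key>"),
)

# The application writes documents directly; no indexer or data source involved.
result = search_client.upload_documents(documents=[
    {"id": "42", "title": "New policy", "content": "...", "category": "hr"},
])
print(result[0].succeeded)
```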
Relevance tuning is continuous work.
You may adjust:
Field weights
Analyzers
Synonyms
Scoring profiles
Real user queries should guide these changes.
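For example, a scoring profile that boosts title matches (a sketch; weights are illustrative):

```python
from azure.search.documents.indexes.models import ScoringProfile, TextWeights

boost_title = ScoringProfile(
    name="boost-title",
    text_weights=TextWeights(weights={"title": 5, "content": 1}),
)

# Attach to the index definition, then reference at query time:
# index.scoring_profiles = [boost_title]
# search_client.search(..., scoring_profile="boost-title")
```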
Warning signs include:
Low click-through rates
Incorrect top-ranked results
Over-matching on irrelevant fields
Relevance tuning:
Is not a one-time task
Evolves with content and users
Depends heavily on early schema decisions
A well-designed search solution improves over time through measurement and tuning, not static configuration.
When creating an Azure AI Search skillset using AzureOpenAIEmbeddingSkill, the service returns the error: “uri parameter cannot be null or empty.” What configuration issue causes this error?
The resource URI for the Azure OpenAI service was not provided or is incorrectly configured.
The AzureOpenAIEmbeddingSkill requires the resourceUri parameter that points to the Azure OpenAI endpoint hosting the embedding model. If this value is missing, empty, or incorrectly formatted, the skillset validation fails during creation. The skill relies on this URI to send text content to the embedding model for vector generation. Developers sometimes mistakenly provide the search endpoint instead of the Azure OpenAI endpoint or omit the parameter entirely when configuring the skill programmatically. Ensuring the correct Azure OpenAI resource endpoint is supplied resolves the issue and allows the embedding pipeline to generate vectors successfully.
A vector indexing pipeline in Azure AI Search fails with the error “EDM.Double cannot be mapped to EDM.Single.” What is the underlying cause?
The embedding vector type returned by the skill does not match the index schema vector type.
Azure AI Search vector fields require a specific numeric type, typically Collection(Edm.Single) representing float values. However, some embedding pipelines return vectors interpreted as Edm.Double. Because Azure Search enforces strict schema compatibility, this mismatch causes indexing failures. The index schema must match the numeric precision expected by the embedding output. Developers often encounter this when custom skills or serialization layers produce double-precision numbers instead of single-precision floats. Aligning the index schema or converting values to the correct type ensures compatibility and allows successful vector ingestion.
Why might an Azure AI Search indexer show warnings such as “Could not generate projection from input /document/pages/*”?
The skillset projection references an incorrect source path in the enrichment tree.
Skillsets operate on an enrichment tree that defines intermediate data structures produced during processing. When projections reference paths such as /document/pages/*, the path must correspond to output fields produced by earlier skills. If the path is incorrect or the referenced data was never created, Azure Search cannot generate the projection and produces warnings during indexing. These errors commonly occur when developers modify skillset outputs without updating downstream projections. Ensuring consistent field names and verifying enrichment paths across skills resolves projection failures.
During AI enrichment, when should a ConditionalSkill be used in Azure AI Search?
A ConditionalSkill should be used when processing should occur only if specific metadata or conditions are met.
Azure AI Search enrichment pipelines often process heterogeneous content such as multilingual documents or files with optional metadata. ConditionalSkill allows the pipeline to evaluate an expression and route documents to different processing paths. For example, OCR can be triggered only when a document language is unknown or when images are present. Without conditional logic, enrichment pipelines may run unnecessary skills, increasing processing time and cost. Conditional skills ensure that computationally expensive operations run only when needed and help maintain efficient enrichment pipelines.
Why might an Azure AI Search indexer stop processing documents with the message “free skillset execution quota has been reached”?
The indexer exceeded the free enrichment execution quota without an attached Cognitive Services resource.
Azure AI Search allows limited free execution of built-in cognitive skills for enrichment. Once the quota is exceeded, the indexer stops processing additional documents. This occurs because enrichment operations require cognitive processing capacity. To continue processing large datasets, a Cognitive Services resource must be attached to the search service. This enables the indexer to run cognitive skills at scale beyond the free quota limits. Developers frequently encounter this when initial testing expands into production-scale indexing without provisioning a cognitive resource.