
AI-102: Implement knowledge mining and information extraction solutions

Detailed explanation of the AI-102 knowledge points in this domain

1. What this domain covers in AI-102

In AI-102, this knowledge area focuses on building systems that can:

  • Ingest data from many sources (both structured and unstructured)

  • Extract information from that data using AI

  • Enrich the data with additional insights (entities, key phrases, vectors, classifications)

  • Make the data searchable and explorable for applications and users

The core service used for this is Azure AI Search, typically combined with Microsoft's AI enrichment capabilities (Azure AI services) and document extraction services such as Azure AI Document Intelligence.

From an exam perspective, this domain is about engineering a search and extraction pipeline, not about training ML models.

You are expected to understand how to:

  • Provision search resources

  • Design indexes

  • Configure indexers and skillsets

  • Run and monitor ingestion

  • Query data effectively

  • Support semantic and vector-based retrieval

  • Project enriched data for downstream use

2. Azure AI Search fundamentals (must-know concepts)

Azure AI Search is the backbone of knowledge mining in AI-102. It is not just “search”; it is a data ingestion + enrichment + retrieval platform.

2.1 Core components and their roles

Each component has a clear responsibility. Understanding how they fit together is critical.

2.1.1 Search resource

A search resource is the top-level Azure service instance.

It:

  • Hosts all your indexes

  • Manages indexers and skillsets

  • Exposes query endpoints for applications

Beginner analogy:
Think of the search resource as the search server that everything lives inside.

Important exam note:

  • You must provision this before anything else.

  • Capacity, region, and pricing tier matter for performance and features.

2.1.2 Index

An index defines how your data is stored and queried.

It is similar to a database schema, but optimized for search.

An index defines:

Fields

Each field has:

  • A data type (string, number, date, boolean, complex type)

  • Capabilities:

    • searchable – full-text search

    • filterable – exact matching in filters

    • sortable – ordering results

    • facetable – aggregations for UI filters

Beginner example:

  • title: searchable, sortable

  • category: filterable, facetable

  • content: searchable

  • publishDate: filterable, sortable
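
As an illustration, here is a minimal sketch of that schema using the azure-search-documents Python SDK; the service endpoint, admin key, and index name are placeholder assumptions:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SearchableField, SimpleField, SearchFieldDataType,
)

# Placeholder endpoint and admin key for a hypothetical search service.
index_client = SearchIndexClient(
    endpoint="https://<your-service>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

index = SearchIndex(
    name="articles",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="title", sortable=True),             # full-text + ordering
        SimpleField(name="category", type=SearchFieldDataType.String,
                    filterable=True, facetable=True),             # filters + facet counts
        SearchableField(name="content"),                          # full-text only
        SimpleField(name="publishDate", type=SearchFieldDataType.DateTimeOffset,
                    filterable=True, sortable=True),
    ],
)
index_client.create_or_update_index(index)
```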

Analyzers

Analyzers control:

  • Tokenization

  • Language rules

  • Stemming and normalization

Why this matters:

  • The same text can behave very differently depending on the analyzer.

  • Choosing the correct language analyzer improves relevance.

Scoring profiles and semantic configuration

These define:

  • Which fields matter more (for example: title > body)

  • How semantic ranking behaves (when enabled)

Vector fields

When using vector search:

  • You define fields that store embeddings

  • These fields are used for similarity search

Key beginner takeaway:

The index is the contract between ingestion and querying.
A bad index design leads to poor search results.

2.1.3 Data source

A data source tells Azure AI Search where your content lives.

Common sources include:

  • Azure Blob Storage (documents, PDFs, images)

  • Azure Data Lake Storage

  • Azure SQL Database

  • Azure Cosmos DB

  • Other supported connectors

A data source defines:

  • Connection details

  • Authentication method

  • Container/table/query to read from

Beginner note:

  • A data source does not move data by itself.

  • It is used by an indexer.
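
Continuing the sketch, a data source definition for Blob Storage (a connection string is shown for brevity; a managed identity reference can replace it where supported):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataSourceConnection, SearchIndexerDataContainer,
)

indexer_client = SearchIndexerClient(
    endpoint="https://<your-service>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

data_source = SearchIndexerDataSourceConnection(
    name="docs-blob",
    type="azureblob",                                  # connector type
    connection_string="<storage-connection-string>",
    container=SearchIndexerDataContainer(name="documents"),
)
indexer_client.create_or_update_data_source_connection(data_source)
```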

2.1.4 Indexer

An indexer is the ingestion engine.

It:

  • Reads data from a data source

  • Optionally applies AI enrichment via a skillset

  • Writes processed documents into an index

Key properties:

  • Can run on demand

  • Can run on a schedule

  • Supports incremental updates

Beginner analogy:

  • The indexer is the assembly line that moves data from storage into searchable form.

Exam-relevant idea:

  • Most real solutions use indexers, not manual uploads.
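
A minimal indexer tying the data source to the index, with an hourly schedule; names reuse the earlier sketches:

```python
from datetime import timedelta
from azure.search.documents.indexes.models import SearchIndexer, IndexingSchedule

indexer = SearchIndexer(
    name="docs-indexer",
    data_source_name="docs-blob",     # where to read
    target_index_name="articles",     # where to write
    skillset_name="docs-skillset",    # optional AI enrichment (defined below)
    schedule=IndexingSchedule(interval=timedelta(hours=1)),
)
# indexer_client as created in the data source sketch above.
indexer_client.create_or_update_indexer(indexer)
indexer_client.run_indexer("docs-indexer")   # or trigger an on-demand run
```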

2.1.5 Skillset (AI enrichment pipeline)

A skillset defines how AI is applied during indexing.

It is a pipeline of “skills” that:

  • Read raw content

  • Extract information

  • Enrich documents with new fields

Common skill categories include:

  • OCR and text extraction

  • Language detection

  • Key phrase extraction

  • Entity recognition

  • Classification

  • Embedding generation for vector search

  • Custom skills (your own code)

Important:

  • Skillsets run only during indexing

  • They do not run at query time

Exam emphasis:

  • You must know how skillsets, indexers, and indexes work together.
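
A sketch of a small skillset with two built-in skills; input paths read from the enrichment tree, and target names become nodes that output field mappings can reference later:

```python
from azure.search.documents.indexes.models import (
    SearchIndexerSkillset, EntityRecognitionSkill, KeyPhraseExtractionSkill,
    InputFieldMappingEntry, OutputFieldMappingEntry,
)

skillset = SearchIndexerSkillset(
    name="docs-skillset",
    skills=[
        EntityRecognitionSkill(
            inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
            outputs=[OutputFieldMappingEntry(name="organizations",
                                             target_name="organizations")],
        ),
        KeyPhraseExtractionSkill(
            inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
            outputs=[OutputFieldMappingEntry(name="keyPhrases",
                                             target_name="keyPhrases")],
        ),
    ],
)
indexer_client.create_or_update_skillset(skillset)
```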

3. Designing a knowledge mining pipeline end-to-end

This section is about architecture thinking.

3.1 Step 1: Understand the content and search experience

Before configuring anything, you must understand:

Content types

Examples:

  • PDFs

  • Word documents

  • Excel files

  • HTML pages

  • Images and scanned documents

  • Emails

Different content types require different extraction approaches.

Required search experiences

You may need:

  • Keyword search (exact terms)

  • Faceted navigation (filter by category, date, author)

  • Semantic ranking (meaning-based)

  • Q&A experiences

  • Vector similarity search

  • Hybrid search (keyword + vector)

Non-functional constraints

These are often tested indirectly in case studies:

  • Security trimming – users see only what they are allowed to see

  • Latency – response time requirements

  • Update frequency – near real-time vs batch indexing

Beginner rule:

Always design from requirements → index → ingestion, not the other way around.

3.2 Step 2: Define the index schema

Index design is one of the most important skills in AI-102.

Key decisions include:

Searchable fields
  • Used for full-text queries

  • Usually large text fields (content, description)

Filterable fields
  • Used in filter expressions

  • Usually IDs, categories, flags, dates

Facetable fields
  • Used for UI filters (“show counts by category”)

  • Must be filterable as well

Sortable fields
  • Used for ordering results

  • Dates, numbers, titles

Complex types
  • Nested objects (for example: extracted entities with properties)

Raw vs enriched separation

Best practice:

  • Keep original content

  • Store extracted/enriched fields separately

This makes debugging and tuning much easier.

3.3 Step 3: Configure ingestion (data source + indexer)

This step connects everything together.

Choose connector and authentication
  • Managed identity is preferred where supported

  • Avoid embedding secrets

Configure parsing options

Examples:

  • How to extract text from PDFs

  • Whether to include metadata

  • File inclusion/exclusion filters

Incremental indexing

Critical for production:

  • Detects changes

  • Avoids reprocessing everything

Scheduling and monitoring
  • Run on a schedule appropriate for freshness needs

  • Monitor failures and warnings

Exam insight:

  • Many questions test whether you choose incremental indexing instead of full rebuilds.

4. Information extraction and enrichment (skillsets)

Skillsets are where “AI” enters the pipeline.

4.1 Built-in cognitive skills (typical categories)

4.1.1 Text extraction and normalization

These skills:

  • Extract text from documents

  • Normalize encoding

  • Clean whitespace

  • Split content into chunks or pages

Why this matters:

  • Clean text improves downstream extraction and search quality.

4.1.2 Entity and metadata extraction

These skills identify:

  • People

  • Organizations

  • Locations

  • Key phrases

  • Language

  • Sentiment (when applicable)

Beginner example:

  • From a contract, extract company names and dates.

4.1.3 Document structure enrichment

These skills help understand document structure:

  • Headings and sections

  • Tables (often via document intelligence)

  • Document classification labels

This is important for:

  • Structured search

  • Downstream analytics

  • RAG scenarios

4.2 Custom skills (bring-your-own logic)

Built-in skills are not always enough.

Custom skills are used for:

  • Domain-specific extraction (legal clauses, product SKUs)

  • Custom PII detection

  • Proprietary classification rules

Implementation pattern
  • Host your logic (commonly as Azure Functions)

  • Accept JSON input

  • Return JSON output

  • Map outputs into index fields
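
A minimal sketch of that contract as a plain Python handler; the SKU pattern and field names are hypothetical, and hosting (for example, in Azure Functions) is omitted:

```python
import json
import re

# Hypothetical domain rule: SKUs look like three letters, a dash, four digits.
SKU_PATTERN = re.compile(r"\b[A-Z]{3}-\d{4}\b")

def handle_custom_skill_request(body: str) -> str:
    """Custom skills receive and return {"values": [...]} batches as JSON."""
    request = json.loads(body)
    results = []
    for record in request["values"]:
        text = record["data"].get("text", "")
        results.append({
            "recordId": record["recordId"],          # must echo the incoming recordId
            "data": {"skus": SKU_PATTERN.findall(text)},
            "errors": None,
            "warnings": None,
        })
    return json.dumps({"values": results})
```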

Exam expectation:

  • You should know when to use a custom skill and how it fits into a skillset.

5. Knowledge Store projections (turn enriched content into usable artifacts)

5.1 What Knowledge Store is for

A Knowledge Store lets you persist enriched outputs into:

  • Files (for example, JSON in Blob Storage)

  • Objects

  • Tables

This allows BI tools, analytics workloads, and ML pipelines to reuse extracted data without re-running indexing.

5.2 When to use it

Typical scenarios:

  • Reporting on extracted entities

  • Training downstream ML models

  • Auditing extraction quality

  • Debugging skillsets

Beginner note:

  • Knowledge Store is optional, but powerful.
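
A hedged sketch of attaching a Knowledge Store to the skillset from earlier, projecting each enriched document as a JSON object into a Blob container (the container name is a placeholder):

```python
from azure.search.documents.indexes.models import (
    SearchIndexerKnowledgeStore, SearchIndexerKnowledgeStoreProjection,
    SearchIndexerKnowledgeStoreObjectProjectionSelector,
)

skillset.knowledge_store = SearchIndexerKnowledgeStore(
    storage_connection_string="<storage-connection-string>",
    projections=[
        SearchIndexerKnowledgeStoreProjection(
            objects=[
                SearchIndexerKnowledgeStoreObjectProjectionSelector(
                    storage_container="enriched-docs",  # JSON blobs land here
                    source="/document",                 # enrichment-tree node to persist
                ),
            ],
        ),
    ],
)
indexer_client.create_or_update_skillset(skillset)
```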

6. Querying an index (what you must be comfortable with)

6.1 Query features called out in the exam

You should understand:

  • Search syntax

  • Sorting

  • Filtering

  • Wildcards

These are fundamental and frequently tested.

6.2 Practical query design concepts

Important patterns include:

  • Combining text search with filters

  • Faceted navigation

  • Pagination

  • Result shaping (select specific fields)

  • Highlighting snippets

  • Synonyms for domain terms

  • Scoring profiles to boost important fields

Exam insight:

  • Many questions are about choosing the right query approach, not writing exact syntax.
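
Most of these patterns map to parameters of a single query call. A sketch against the index defined earlier (query key and values are placeholders):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="articles",
    credential=AzureKeyCredential("<query-key>"),
)

results = search_client.search(
    search_text="cloud migration",
    filter="category eq 'whitepaper' and publishDate ge 2024-01-01T00:00:00Z",
    facets=["category"],                     # counts for UI filters
    select=["id", "title", "publishDate"],   # result shaping
    highlight_fields="content",              # snippet highlighting
    order_by=["publishDate desc"],
    top=10, skip=0,                          # pagination
)
for doc in results:
    print(doc["title"])
```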

7. Semantic search and vector search (modern retrieval capabilities)

7.1 Semantic search

Semantic search:

  • Improves ranking using meaning

  • Works on top of traditional text fields

  • Requires semantic configuration

It improves relevance but does not replace index design.

7.2 Vector search

Vector search:

  • Uses embeddings

  • Finds semantically similar content

  • Requires vector fields in the index

Key design decisions:

  • Embedding model

  • Chunk size and overlap

  • Vector field configuration

  • Hybrid retrieval strategy

Exam emphasis:

  • Vector search is often combined with keyword search for best results.
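
A sketch of a hybrid query combining keyword and vector retrieval; `get_embedding` is a hypothetical helper that must call the same embedding model used at indexing time, and `contentVector` is an assumed vector field:

```python
from azure.search.documents.models import VectorizedQuery

query_vector = get_embedding("how do I migrate to the cloud?")  # hypothetical helper

results = search_client.search(
    search_text="cloud migration",           # keyword side of the hybrid query
    vector_queries=[
        VectorizedQuery(
            vector=query_vector,
            k_nearest_neighbors=5,
            fields="contentVector",          # assumed vector field in the index
        ),
    ],
    top=5,
)
```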

8. Patterns that combine search + extraction (common exam scenarios)

8.1 “Knowledge mining” scenario

Typical case:

  • Thousands of documents

  • Need searchable portal with facets and extraction

Solution pattern:

  • Indexer + skillset (OCR + extraction)

  • Store results in index and Knowledge Store

  • Build UI on top

8.2 “Information extraction” scenario

Typical case:

  • Business documents (contracts, invoices)

  • Need structured fields

Solution pattern:

  • Specialized extraction service

  • Normalize via skillsets/custom skills

  • Index for semantic/vector retrieval

9. Security, governance, and operations

9.1 Security trimming / access control

You must ensure:

  • Search results respect permissions

  • ACLs are stored alongside documents and applied as query-time filters

  • Secrets are not indexed

9.2 Monitoring and troubleshooting

Common issues:

  • Indexer failures

  • Skill timeouts

  • Schema mismatches

  • Stale data

  • Poor relevance

Beginner takeaway:

A production-ready knowledge mining solution is designed, monitored, and tuned, not just built once.

Implement knowledge mining and information extraction solutions (Additional Content)

1. Advanced Indexer Configuration and Behavior (Exam-Focused)

This section explains how indexers behave in real systems, beyond the basic “connect data source and run” model. These details are frequently tested in AI-102 case studies.

1.1 Change detection policies

Change detection determines how Azure AI Search knows what has changed in your data source.

In Azure AI Search, indexers do not automatically understand updates unless you configure a strategy.

How new, updated, and deleted data is detected

An indexer periodically checks the data source and decides:

  • Which documents are new

  • Which documents have changed

  • Which documents should be removed

Without change detection, the only option is a full re-index, which is costly and slow.

Common approaches
High watermark fields

A high watermark field is a column such as:

  • LastModifiedTime

  • UpdatedAt

  • VersionNumber

The indexer stores the last processed value and only retrieves records with a higher value.

This approach:

  • Is simple and widely used

  • Requires a reliable timestamp or version field

  • Works well for append-and-update scenarios
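
A sketch of configuring a high watermark policy on an Azure SQL data source (table and column names are assumptions; `indexer_client` is from the earlier sketches):

```python
from azure.search.documents.indexes.models import (
    SearchIndexerDataSourceConnection, SearchIndexerDataContainer,
    HighWaterMarkChangeDetectionPolicy,
)

sql_source = SearchIndexerDataSourceConnection(
    name="orders-sql",
    type="azuresql",
    connection_string="<sql-connection-string>",
    container=SearchIndexerDataContainer(name="Orders"),
    data_change_detection_policy=HighWaterMarkChangeDetectionPolicy(
        high_water_mark_column_name="LastModifiedTime",  # must only ever increase
    ),
)
indexer_client.create_or_update_data_source_connection(sql_source)
```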

Native change tracking

Some data sources support native change tracking mechanisms.

When available:

  • The indexer asks the source system directly for changes

  • Performance and accuracy improve

  • Configuration complexity may increase

Why proper change detection is critical
Data freshness

Users expect search results to reflect recent updates. Poor change detection leads to stale results.

Cost control

Reprocessing unchanged data:

  • Consumes compute

  • Increases indexing time

  • Increases operational cost

Avoiding full re-indexing

Full re-indexing:

  • Is disruptive

  • Can temporarily degrade search quality

  • Should be avoided in production systems

1.2 Soft delete detection

Many systems do not physically delete records. Instead, they mark them as deleted.

Handling logical deletions

Logical deletion usually means:

  • A record still exists

  • A flag or status indicates it should not be visible

Examples:

  • isDeleted = true

  • status = inactive

Mapping deletion fields

You must explicitly configure the indexer to:

  • Detect the deletion indicator

  • Remove or hide the document in the index

Without this configuration:

  • Deleted content remains searchable

  • Users may see unauthorized or outdated information

Preventing stale or unauthorized documents

Soft delete detection is essential for:

  • Security compliance

  • Accurate search results

  • User trust
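
A sketch of adding a soft delete policy to the same data source (column name and marker value are assumptions):

```python
from azure.search.documents.indexes.models import SoftDeleteColumnDeletionDetectionPolicy

sql_source.data_deletion_detection_policy = SoftDeleteColumnDeletionDetectionPolicy(
    soft_delete_column_name="isDeleted",
    soft_delete_marker_value="true",   # records with this value are removed from the index
)
indexer_client.create_or_update_data_source_connection(sql_source)
```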

1.3 Field mappings and output field mappings

Field mappings control how data flows from source to index.

Mapping source fields to index fields

Field mappings allow you to:

  • Rename fields

  • Combine fields

  • Normalize data formats

This is useful when:

  • Source schema does not match index schema

  • Naming conventions differ

  • Multiple sources feed the same index

Transforming fields during ingestion

During ingestion, you may:

  • Convert data types

  • Flatten structures

  • Apply simple transformations

This reduces complexity later in querying.

Output field mappings

Output field mappings connect:

  • Skill outputs

  • Index fields

They define where enriched data (entities, chunks, embeddings) is stored.

Without correct output mappings:

  • Skills may run successfully

  • But results never appear in the index
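
A sketch of both mapping kinds on the indexer from earlier; field mappings rename a source column, while output field mappings route an enrichment-tree path into an index field (the names are assumptions):

```python
from azure.search.documents.indexes.models import FieldMapping

# Source column "DocTitle" feeds the index field "title".
indexer.field_mappings = [
    FieldMapping(source_field_name="DocTitle", target_field_name="title"),
]
# Skill output "/document/keyPhrases" feeds the index field "keyPhrases".
indexer.output_field_mappings = [
    FieldMapping(source_field_name="/document/keyPhrases",
                 target_field_name="keyPhrases"),
]
indexer_client.create_or_update_indexer(indexer)
```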

1.4 Indexer failure handling

Indexers can fail partially or completely.

Partial success vs total failure
  • Partial success means some documents are indexed while others fail

  • Total failure means the indexer stops entirely

In production, partial success is common and expected.

Common causes of failures
Corrupt documents

Unreadable or malformed files can cause individual failures.

Skill timeouts

Complex enrichment steps may exceed execution limits.

Authentication issues

Expired credentials or permission changes can block access to data sources or skills.

Monitoring indexer execution history

You should regularly review:

  • Execution logs

  • Warning and error counts

  • Failed document samples

Monitoring helps detect issues early.

Designing resilient pipelines

Well-designed pipelines:

  • Tolerate individual document failures

  • Continue processing valid content

  • Avoid blocking the entire index because of a few bad documents
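
A sketch of both ideas: tolerating a bounded number of bad documents, and inspecting the execution history (the thresholds are illustrative):

```python
from azure.search.documents.indexes.models import IndexingParameters

# Keep indexing when a few documents fail, instead of stopping the run.
indexer.parameters = IndexingParameters(max_failed_items=10,
                                        max_failed_items_per_batch=5)
indexer_client.create_or_update_indexer(indexer)

status = indexer_client.get_indexer_status("docs-indexer")
last = status.last_result
if last is not None:
    print(last.status, last.item_count, last.failed_item_count)
    for error in last.errors:
        print(error.key, error.error_message)   # sample of failed documents
```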

2. Skillset Execution Order and Dependency Design

Skillsets are pipelines, not collections of independent steps.

2.1 Skill execution sequence

Skills execute in the order they are defined.

Later skills:

  • Depend on outputs from earlier skills

  • Cannot access data that was not produced upstream

Why ordering matters

Incorrect ordering can cause:

  • Missing inputs

  • Empty outputs

  • Poor extraction quality

Correct ordering improves:

  • Accuracy

  • Performance

  • Cost efficiency

2.2 Input–output chaining between skills

Skillsets often form chains.

Common skill chains
  • OCR → text extraction

  • Text extraction → chunking

  • Chunking → entity extraction or embeddings

Each step builds on the previous one.

Common mistakes
  • Referencing fields that do not exist yet

  • Skipping required preprocessing steps

  • Incorrect field paths

These mistakes often cause silent failures.

2.3 Chunking placement in skillsets

Chunking splits large text into manageable units.

Why chunking is usually early

Placing chunking early ensures:

  • Later skills operate on smaller, focused text

  • Better extraction accuracy

  • Lower token usage

Impact of chunking
Entity extraction accuracy

Entities are easier to detect in focused text segments.

Vector embedding quality

Embeddings represent meaning more accurately when text is concise.

Token and cost efficiency

Smaller chunks reduce processing cost and latency.

2.4 Embedding generation placement

Embedding generation is usually placed near the end.

Why embeddings come later

Embeddings should reflect:

  • Cleaned text

  • Final chunk boundaries

  • Meaningful semantic units

Generating embeddings too early leads to poor retrieval quality.
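
A sketch of the split-then-embed chain; note that the embedding skill's context and input read from /document/pages/*, the chunks produced by the split skill. Endpoint, deployment, and some parameter names are assumptions and vary slightly across SDK versions:

```python
from azure.search.documents.indexes.models import (
    SplitSkill, AzureOpenAIEmbeddingSkill,
    InputFieldMappingEntry, OutputFieldMappingEntry,
)

split = SplitSkill(
    text_split_mode="pages",
    maximum_page_length=2000,    # chunk size in characters
    page_overlap_length=200,     # overlap preserves context across chunk boundaries
    inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
    outputs=[OutputFieldMappingEntry(name="textItems", target_name="pages")],
)

embed = AzureOpenAIEmbeddingSkill(
    # Must be the Azure OpenAI endpoint, not the search endpoint (see the FAQ below).
    resource_url="https://<your-openai>.openai.azure.com",
    deployment_name="text-embedding-3-small",   # hypothetical deployment
    model_name="text-embedding-3-small",
    context="/document/pages/*",                # run once per chunk from the split skill
    inputs=[InputFieldMappingEntry(name="text", source="/document/pages/*")],
    outputs=[OutputFieldMappingEntry(name="embedding", target_name="vector")],
)
```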

3. Document-Level vs Chunk-Level Indexing (High-Frequency Design Decision)

This is a classic AI-102 trade-off question.

3.1 Document-level indexing

Advantages
  • Simple schema

  • Fewer indexed records

  • Lower storage cost

Limitations
  • Poor relevance for long documents

  • Weak vector similarity performance

  • Difficult grounding and citation

This approach struggles with modern semantic and RAG use cases.

3.2 Chunk-level indexing

Advantages
  • Higher semantic relevance

  • Strong vector search results

  • Precise citations for RAG and agents

Trade-offs
  • Larger index size

  • Higher ingestion and storage cost

  • More complex schema design

3.3 Exam-oriented decision guidance

When chunk-level indexing is preferred
  • Long documents such as manuals, PDFs, and policies

  • RAG and agentic retrieval scenarios

  • Applications requiring precise citations

When document-level indexing is acceptable
  • Short, well-structured records

  • Metadata-focused search experiences

  • Low semantic complexity

4. Data Freshness, Consistency, and Update Strategy

Search systems are eventually consistent by nature.

4.1 Near-real-time vs batch indexing

Batch indexing
  • Lower cost

  • Simpler operations

  • Suitable for static or slowly changing data

Near-real-time indexing
  • Higher operational complexity

  • Required for frequently updated content

  • Demands careful monitoring

4.2 Index consistency considerations

You should expect:

  • Delays between source updates and visibility

  • Temporary inconsistencies during re-indexing

These delays affect:

  • User trust

  • Application behavior

  • Business workflows

4.3 Delete and update consistency risks

Common risks include:

  • Orphaned documents that should be deleted

  • Outdated metadata

  • Security trimming mismatches after permission changes

These risks must be actively managed.

4.4 When not to use indexers

Indexers are not suitable for:

  • Very high write-frequency data

  • Scenarios requiring strict transactional consistency

Alternatives
  • Push indexing

  • Hybrid architectures combining push and pull models
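
For contrast, a sketch of the push model, in which the application writes documents directly through the SDK with no indexer involved:

```python
# Reuses search_client from the querying sketch; the key must allow writes.
search_client.merge_or_upload_documents(documents=[
    {"id": "42", "title": "Updated title", "category": "news"},
])
```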

5. Operational Relevance Tuning (Often Tested Indirectly)

Relevance tuning is continuous work.

5.1 Iterative relevance tuning

You may adjust:

  • Field weights

  • Analyzers

  • Synonyms

  • Scoring profiles

Real user queries should guide these changes.
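
Field weights, for example, are adjusted through a scoring profile on the index. A sketch, reusing the earlier index object (the weights are illustrative):

```python
from azure.search.documents.indexes.models import ScoringProfile, TextWeights

index.scoring_profiles = [
    ScoringProfile(
        name="boost-title",
        text_weights=TextWeights(weights={"title": 5, "content": 1}),  # title counts 5x
    ),
]
index_client.create_or_update_index(index)

# Opt in per query:
results = search_client.search(search_text="cloud migration",
                               scoring_profile="boost-title")
```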

5.2 Observing relevance issues

Warning signs include:

  • Low click-through rates

  • Incorrect top-ranked results

  • Over-matching on irrelevant fields

5.3 Production mindset

Relevance tuning:

  • Is not a one-time task

  • Evolves with content and users

  • Depends heavily on early schema decisions

A well-designed search solution improves over time through measurement and tuning, not static configuration.

Frequently Asked Questions

When creating an Azure AI Search skillset using AzureOpenAIEmbeddingSkill, the service returns the error: “uri parameter cannot be null or empty.” What configuration issue causes this error?

Answer:

The resource URI for the Azure OpenAI service was not provided or is incorrectly configured.

Explanation:

The AzureOpenAIEmbeddingSkill requires the resourceUri parameter that points to the Azure OpenAI endpoint hosting the embedding model. If this value is missing, empty, or incorrectly formatted, the skillset validation fails during creation. The skill relies on this URI to send text content to the embedding model for vector generation. Developers sometimes mistakenly provide the search endpoint instead of the Azure OpenAI endpoint or omit the parameter entirely when configuring the skill programmatically. Ensuring the correct Azure OpenAI resource endpoint is supplied resolves the issue and allows the embedding pipeline to generate vectors successfully.


A vector indexing pipeline in Azure AI Search fails with the error “EDM.Double cannot be mapped to EDM.Single.” What is the underlying cause?

Answer:

The embedding vector type returned by the skill does not match the index schema vector type.

Explanation:

Azure AI Search vector fields require a specific numeric type, typically Collection(Edm.Single) representing float values. However, some embedding pipelines return vectors interpreted as Edm.Double. Because Azure Search enforces strict schema compatibility, this mismatch causes indexing failures. The index schema must match the numeric precision expected by the embedding output. Developers often encounter this when custom skills or serialization layers produce double-precision numbers instead of single-precision floats. Aligning the index schema or converting values to the correct type ensures compatibility and allows successful vector ingestion.
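
A sketch of a correctly typed vector field definition (dimensions and profile name are assumptions):

```python
from azure.search.documents.indexes.models import SearchField, SearchFieldDataType

# The field type must match what the enrichment pipeline emits: vector fields
# are typically Collection(Edm.Single), i.e. single-precision floats.
vector_field = SearchField(
    name="contentVector",
    type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
    vector_search_dimensions=1536,                 # must match the embedding model
    vector_search_profile_name="default-profile",  # assumed profile on the index
)
```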


Why might an Azure AI Search indexer show warnings such as “Could not generate projection from input /document/pages/*”?

Answer:

The skillset projection references an incorrect source path in the enrichment tree.

Explanation:

Skillsets operate on an enrichment tree that defines intermediate data structures produced during processing. When projections reference paths such as /document/pages/*, the path must correspond to output fields produced by earlier skills. If the path is incorrect or the referenced data was never created, Azure Search cannot generate the projection and produces warnings during indexing. These errors commonly occur when developers modify skillset outputs without updating downstream projections. Ensuring consistent field names and verifying enrichment paths across skills resolves projection failures.


During AI enrichment, when should a ConditionalSkill be used in Azure AI Search?

Answer:

A ConditionalSkill should be used when processing should occur only if specific metadata or conditions are met.

Explanation:

Azure AI Search enrichment pipelines often process heterogeneous content such as multilingual documents or files with optional metadata. ConditionalSkill allows the pipeline to evaluate an expression and route documents to different processing paths. For example, OCR can be triggered only when a document language is unknown or when images are present. Without conditional logic, enrichment pipelines may run unnecessary skills, increasing processing time and cost. Conditional skills ensure that computationally expensive operations run only when needed and help maintain efficient enrichment pipelines.


Why might an Azure AI Search indexer stop processing documents with the message “free skillset execution quota has been reached”?

Answer:

The indexer exceeded the free enrichment execution quota without an attached Cognitive Services resource.

Explanation:

Azure AI Search allows limited free execution of built-in cognitive skills for enrichment. Once the quota is exceeded, the indexer stops processing additional documents. This occurs because enrichment operations require cognitive processing capacity. To continue processing large datasets, a Cognitive Services resource must be attached to the search service. This enables the indexer to run cognitive skills at scale beyond the free quota limits. Developers frequently encounter this when initial testing expands into production-scale indexing without provisioning a cognitive resource.

