AI‑900 Ultra‑Deep Dive: AI Workloads + How They Really Work (Generative AI, NLP, Vision, Speech) & Responsible AI
This article goes beyond recognition—into mechanics. We’ll cover how Generative AI, NLP, Vision, and Speech actually work (tokens, semantics, attention, OCR pipelines, STT/TTS), while staying within AI‑900 scope. We’ll also map everything to Azure services and Responsible AI principles used on the exam.
- AI Workloads Overview
- Generative AI (Deep Dive: tokens, context, attention, sampling, RAG)
- NLP (Deep Dive: semantics, PII, summarization, intents)
- Computer Vision (Deep Dive: image analysis, OCR, faces, video)
- Speech (Deep Dive: STT/TTS/Translation, diarization)
- Document Intelligence (deep)
- Responsible AI (fairness, transparency, privacy, inclusiveness, accountability)
- Service Chooser (scenario → workload → service)
- New‑Learner Q&A
- References
AI Workloads Overview (in Azure)
| Workload | What it solves | Azure services to recognize |
|---|---|---|
| Generative AI | Create new content (text, images, audio, code); augment apps with LLMs + RAG. | Azure OpenAI Service (GPT‑family, DALL·E image models, embeddings). |
| NLP | Understand text: sentiment, entities/PII, summarization, intents/Q&A. | Azure Language (NER/PII, Sentiment, Summarization, CLU, Question Answering). |
| Vision | Understand images/video: tags, objects, captions, OCR, faces, video topics. | Azure Vision (Image Analysis/OCR/Face/Spatial), Video Indexer, Vision Studio. |
| Speech | Speech↔Text, translation, diarization, neural TTS, speaker recognition. | Azure Speech (real‑time/fast/batch STT, TTS, Translation, Speaker Recognition). |
| Document Intelligence | Extract fields/tables from PDFs/images; classify/split docs. | Azure AI Document Intelligence (prebuilt/custom; containers; studio). |
Generative AI — How It Works (Tokens, Attention, Context, Sampling, RAG)
What “generative” means in practice
Generative models learn statistical patterns over tokens (sub‑word units) and predict the next token given the previous context, producing coherent text, code, or even image descriptions. In Azure, this is delivered via Azure OpenAI Service with enterprise controls for security, regions, and throughput (standard, provisioned, batch).
Key technical terms (with context)
- Token — the atomic unit models process (≈ sub‑word pieces like "inter", "national"). Tokenization (e.g., BPE) converts text to token IDs; models output a probability distribution (logits) over the next token.
- Embedding — dense vector representation of a token/word/sentence capturing semantics used for search, clustering, and RAG retrieval.
- Context window — maximum tokens the model attends to (prompt + response). Longer windows support longer conversations & documents, but still require retrieval for large corpora.
- Attention — mechanism (Transformer) that lets each token weigh others to compute the next‑token distribution; enables long‑range dependencies vs. older RNNs/CNNs.
- Sampling — converts logits to text. Temperature scales randomness; top‑p (nucleus) restricts to the smallest set of tokens whose cumulative probability ≥ p; top‑k limits to k highest probability tokens.
- System/User messages — structured prompts that set behavior/instructions vs. user inputs; supported by Azure OpenAI chat APIs.
- Hallucination — plausible‑sounding but incorrect output; mitigated with grounding via RAG and content filtering.
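The sampling terms above (temperature, top‑p) can be made concrete with a toy sketch. Everything here is illustrative: real models emit logits over vocabularies of ~100k tokens, and the token names and values below are invented.

```python
# Toy next-token sampling: temperature scaling + nucleus (top-p) filtering.
# Logits and token names are made up for illustration.
import math
import random

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def sample(logits, temperature=1.0, top_p=1.0, seed=None):
    """Scale logits by temperature, keep the nucleus, then draw one token."""
    rng = random.Random(seed)
    scaled = {t: v / temperature for t, v in logits.items()}
    probs = softmax(scaled)
    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    kept, cum = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = p
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the kept tokens and sample.
    total = sum(kept.values())
    r, acc = rng.random() * total, 0.0
    for tok, p in kept.items():
        acc += p
        if r <= acc:
            return tok
    return tok

logits = {"cat": 3.0, "dog": 2.5, "car": 0.5}  # hypothetical logits
print(sample(logits, temperature=0.2, top_p=0.9, seed=0))
```

Note how a low temperature sharpens the distribution so much that the nucleus collapses to a single token, making output effectively deterministic; a high temperature flattens it, letting unlikely tokens through.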
RAG (Retrieval‑Augmented Generation): why it matters
RAG combines a search layer (vector or hybrid) with the LLM: retrieve relevant chunks (via embeddings) and inject them into the prompt so answers are grounded in your data. This boosts factuality and reduces hallucinations—recommended in Microsoft’s Azure architecture guidance for enterprise apps.
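The retrieve-then-inject flow above can be sketched end to end. This uses a toy bag-of-words "embedding" and cosine similarity; a real system would use a learned embeddings model (e.g., via Azure OpenAI) and a vector index such as Azure AI Search.

```python
# Minimal RAG sketch: embed chunks, rank by similarity to the query,
# inject the best chunk into the prompt. The "embedding" here is a toy.
import math
from collections import Counter

def embed(text):
    """Toy embedding: a word-count vector (real embeddings are dense floats)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=1):
    """Rank document chunks by similarity to the query; keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Refund requests must be filed within 30 days of purchase.",
    "Our offices are closed on public holidays.",
]
context = retrieve("how long do I have to request a refund", chunks)
prompt = f"Answer using only this context:\n{context[0]}\nQuestion: ..."
print(context[0])
```

The grounded prompt constrains the model to the retrieved text, which is what "reduces hallucinations" means in practice.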
New‑learner: “Is token = word? Why does context window matter?”
Not exactly. Tokens are sub‑word pieces; “internationalization” may be several tokens. The model can only “see” up to its context limit (prompt+response). Long policies or PDFs require chunking + RAG to remain within limits while staying factual.
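The chunking step mentioned above can be sketched as follows. One "token" is approximated by one whitespace word here; production code would count tokens with the model's real tokenizer (e.g., tiktoken) instead.

```python
# Hedged sketch: split a long document into overlapping chunks that fit
# a token budget. Word count stands in for the real tokenizer.
def chunk(text, max_tokens=50, overlap=10):
    """Split text into overlapping word-count chunks for retrieval."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
        start += max_tokens - overlap  # overlap keeps context across edges
    return chunks

doc = " ".join(f"word{i}" for i in range(120))  # stand-in for a long policy
parts = chunk(doc, max_tokens=50, overlap=10)
print(len(parts), len(parts[0].split()))
```

The overlap matters: a sentence cut at a chunk boundary would otherwise be unretrievable as a whole.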
NLP — Semantics, Summarization, PII, Intents (CLU)
Azure Language unifies traditional text analytics (sentiment, key phrases, entities/PII, language detection), summarization (documents & conversations), question answering, and Conversational Language Understanding (CLU) for intents/entities—all accessible via REST/SDK and Language Studio.
How NLP features work conceptually
- Semantics via embeddings: Text is mapped to vectors so “refund” is near “return,” enabling grouping, retrieval, and clustering. Prebuilt features hide the math; you use high‑level APIs.
- Sentiment & opinion mining: Classifies per‑document/sentence polarity and associates sentiments with targets (e.g., “shipping was slow”). Useful for VoC analytics.
- Named Entity Recognition (NER) & PII redaction: Extract entities (Person, Org, Date, etc.) and redact sensitive data before storage or search—key for privacy.
- Summarization: Extractive picks key sentences; abstractive generates concise prose. Azure Language offers both for documents and conversations.
- CLU: Train intents/entities with labeled utterances; great for routing and slot‑filling in bots.
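The extractive idea from the list above ("pick key sentences") can be shown with a toy frequency-based scorer. Azure Language's extractive summarization is far more sophisticated; this only illustrates the concept, and the sample text is invented.

```python
# Toy extractive summarizer: score sentences by how many frequent words
# they contain, keep the top-scoring ones in original order.
from collections import Counter

def extractive_summary(text, n=1):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w.lower() for s in sentences for w in s.split())
    scored = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w.lower()] for w in sentences[i].split()),
        reverse=True,
    )
    keep = sorted(scored[:n])  # restore original document order
    return ". ".join(sentences[i] for i in keep) + "."

text = (
    "Refunds are processed within five days. "
    "Refunds require a receipt. "
    "Our mascot is a llama."
)
print(extractive_summary(text, n=1))
```

An abstractive summarizer, by contrast, would generate a new sentence such as "Refunds need a receipt and take five days", which is exactly why it needs a generative model.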
New‑learner: “Language vs. Azure OpenAI for summarization?”
Use Language when you want task‑specific, predictable summarization with minimal tuning. Use Azure OpenAI when you need flexible generative summaries and are ready to add guardrails/grounding. Many production apps mix both.
Computer Vision — Image Analysis, OCR, Face, Video
Azure’s Vision family provides Image Analysis (tags/objects/captions), OCR for text extraction, Face for face detection/analysis, and Video Indexer for higher‑level video insights. Vision Studio gives you a no‑code way to try features before coding.
How core pieces work
- Image analysis: Modern vision backbones (CNN/Transformers) output tags, object boxes, and captions; the API returns JSON with confidence scores.
- OCR: The pipeline detects text regions → lines → words → outputs text + coordinates; works for printed and many handwritten cases.
- Face: Detects faces and can return attributes (e.g., landmarks). Identity features are gated under Responsible AI approvals.
- Video Indexer: Combines speech transcription, face/object detection, and topic extraction across time.
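Since the vision APIs above return JSON with confidence scores, a common client-side step is filtering by a threshold. The response shape below is a simplified stand-in, not the exact Azure Image Analysis schema; real code would use the Azure SDK.

```python
# Hedged sketch: filter an image-analysis-style response by confidence.
# The JSON is a simplified, invented example of the general shape.
import json

response = json.loads("""
{
  "caption": {"text": "a dog on a beach", "confidence": 0.91},
  "tags": [
    {"name": "dog", "confidence": 0.98},
    {"name": "beach", "confidence": 0.87},
    {"name": "frisbee", "confidence": 0.42}
  ]
}
""")

def confident_tags(resp, threshold=0.8):
    """Keep only tags the model is reasonably sure about."""
    return [t["name"] for t in resp["tags"] if t["confidence"] >= threshold]

print(confident_tags(response))  # low-confidence "frisbee" is dropped
```

Choosing the threshold is an application decision: higher values trade recall for precision.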
Speech — STT/TTS/Translation, Diarization, Customization
Azure Speech supports real‑time, fast, and batch transcription; natural‑sounding neural TTS (including custom neural voice with governance); translation; speaker recognition; and avatar/voice‑live features—deployable in cloud or containers.
How STT works (conceptually)
- Acoustic model turns waveform → phonetic probabilities; language model resolves words from likely sequences; diarization separates speakers in meetings/calls.
- Modes: real‑time for streaming captions; fast for synchronous file transcription; batch for archives/large volumes.
- Customization: domain lexicons and custom speech models improve accuracy in noisy/jargon contexts.
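The acoustic-model/language-model split above can be illustrated with a toy re-ranker: the acoustic model proposes phonetically similar hypotheses, and a simple language model (here, invented word-pair counts) picks the one that reads like real language.

```python
# Conceptual sketch of LM re-ranking in STT. Hypotheses and counts are
# invented; real systems use neural language models, not bigram tables.
def lm_score(sentence, bigram_counts):
    """Score a sentence by how common its adjacent word pairs are."""
    words = sentence.lower().split()
    return sum(bigram_counts.get((a, b), 0) for a, b in zip(words, words[1:]))

bigram_counts = {("recognize", "speech"): 12, ("wreck", "a"): 1}
hypotheses = ["recognize speech", "wreck a nice beach"]  # classic example
best = max(hypotheses, key=lambda h: lm_score(h, bigram_counts))
print(best)
```

This is also why domain lexicons help: adding your jargon to the language model raises the score of the correct, otherwise-unlikely transcription.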
New‑learner: “Is Whisper available on Azure?”
Yes—Azure Speech/Azure OpenAI offer Whisper‑based transcription scenarios (availability evolves; check current docs).
Document Intelligence — Field/Tables Extraction, Classification & Splitting
Azure AI Document Intelligence (formerly Form Recognizer) extracts key‑value pairs and tables from documents using prebuilt models (invoice, receipt, ID, business card) or custom models trained on your layouts. It also supports document classification and splitting for multi‑doc files. Deploy in cloud or containers and integrate with Power Automate/Logic Apps.
New‑learner: “Why not just Vision OCR for invoices?”
OCR gives you text. Document Intelligence gives you semantics (Vendor, Invoice #, Due Date, line items) and optional custom models tuned to your layout—much more useful downstream.
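The text-vs-semantics distinction above can be made concrete: raw OCR output is just lines of text, and turning it into named fields takes extra logic. The regex sketch below is a deliberately crude stand-in; Document Intelligence's prebuilt Invoice model does this (and tables, line items, layout variation) with trained models, not regexes.

```python
# Hedged sketch: raw OCR text -> structured invoice fields.
# The invoice text is invented for illustration.
import re

ocr_text = """ACME Corp
Invoice #: INV-1042
Due Date: 2025-08-01
Total: $1,250.00"""

def extract_invoice_fields(text):
    """Pull a couple of named fields out of flat OCR text."""
    fields = {}
    m = re.search(r"Invoice #:\s*(\S+)", text)
    if m:
        fields["invoice_number"] = m.group(1)
    m = re.search(r"Total:\s*\$([\d,.]+)", text)
    if m:
        fields["total"] = float(m.group(1).replace(",", ""))
    return fields

print(extract_invoice_fields(ocr_text))
```

The fragility is the point: a vendor whose invoices say "Inv. No." instead of "Invoice #" breaks the regex, whereas a trained custom model generalizes across layouts.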
Responsible AI — Fairness, Transparency, Privacy, Inclusiveness, Accountability
Microsoft implements Responsible AI via six principles: fairness, reliability & safety, privacy & security, inclusiveness, transparency, and accountability—with tools/governance across Azure services. AI‑900 expects you to map scenarios to these principles and suggest safeguards.
| Principle | What to remember | Exam‑style safeguard |
|---|---|---|
| Fairness | Comparable performance across groups; reduce biased outcomes. | Run cohort error analysis; rebalance data; monitor drift. |
| Transparency | Explain behavior/limits; documentation, transparency notes. | Provide explanation summaries and usage disclosures. |
| Privacy | Minimize, protect, and govern personal data (PII). | Use Language PII redaction; encrypt at rest/in transit; RBAC. |
| Inclusiveness | Design for diverse users/abilities; accessibility. | Captions, TTS, accent support; UX testing across abilities. |
| Accountability | Humans own outcomes; auditable; escalation paths. | Human‑in‑the‑loop for consequential decisions; audit logs. |
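The "cohort error analysis" safeguard in the fairness row can be sketched in a few lines: compute accuracy per group and look for gaps. Data and group labels below are invented for illustration.

```python
# Minimal cohort error analysis: per-group accuracy from labeled predictions.
from collections import defaultdict

def accuracy_by_group(records):
    """records: (group, predicted, actual) tuples -> {group: accuracy}."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, pred, actual in records:
        total[group] += 1
        correct[group] += int(pred == actual)
    return {g: correct[g] / total[g] for g in total}

records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 1, 0),
    ("group_b", 1, 0), ("group_b", 0, 1), ("group_b", 1, 1), ("group_b", 0, 0),
]
print(accuracy_by_group(records))  # a large gap here signals a fairness issue
```

Exam scenarios phrase this as "the model performs worse for one demographic": the remedy is to measure per cohort (as above), then rebalance data or retrain, and keep monitoring for drift.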
Service Chooser — Scenario → Workload → Azure Service
| Scenario | Workload | Service |
|---|---|---|
| Generate concise FAQs from a long policy and chat over it. | Generative + NLP | Azure OpenAI + RAG; optional Language (PII redaction). |
| Tag product images, detect objects, and caption them. | Vision | Azure Vision Image Analysis; try in Vision Studio first. |
| Extract vendor, totals, and line items from invoices. | Document Intelligence | Prebuilt Invoice or Custom model in Document Intelligence. |
| Real‑time multilingual captions for webinars. | Speech | Speech‑to‑Text + Translation (real‑time). |
| Detect sentiment and redact PII from customer reviews. | NLP | Language (Sentiment/Opinion + PII detection). |
New‑Learner Q&A (short, exam‑aligned)
Is AI‑900 about coding?
No. It’s about recognizing workloads, mapping to Azure services, and applying Responsible AI in scenario questions. Hands‑on via studios is encouraged but not required.
How do I try features without writing code?
Use Vision Studio, Language Studio, and Speech Studio to experiment with images, text, and audio—great for intuition and demos.
What’s the difference between extractive and abstractive summaries?
Extractive picks existing sentences; abstractive writes new sentences. Azure Language supports both.
When do I need approvals?
Sensitive face identification and custom neural voice scenarios require use‑case review under Microsoft’s Responsible AI processes.
References (Official Docs & Study)
- Official AI‑900 study guide (skills & weights): Microsoft Learn.
- Responsible AI concepts & principles: Responsible AI in Azure ML, Microsoft Principles & Approach.
- Vision: Vision Studio, Azure Vision, Vision learning path.
- Language: Language overview, Language learning path.
- Document Intelligence: Connector overview, Renaming & updates.
- Speech: Speech overview, STT modes.
- Generative AI (Azure OpenAI): Developer resources hub, Pricing & deployment models.
- Architecture patterns for AI & RAG: Azure Architecture Center.