Domain 2: Fundamentals of GenAI
Topic 2 of 5 · Study notes
AWS Certified AI Practitioner — Domain 2: Fundamentals of Generative AI
Exam Code: AIF-C01 | Level: Foundational
Domain Weight: 24% | Total Domains: 5 | Passing Score: 700/1000
Table of Contents
- What is Generative AI?
- Foundation Models
- Large Language Models
- The Transformer Architecture
- Tokens and Context Windows
- Embeddings and Vector Databases
- Prompt Engineering
- Model Customization Techniques
- Types of Generative AI Models
- Amazon Bedrock
- Amazon Q
- Hallucination and Output Quality
- Exam Tips and Quick Reference
1. What is Generative AI?
Generative AI is a subset of artificial intelligence that creates new content — text, images, audio, video, code, or synthetic data — by learning patterns from vast training datasets. Unlike traditional ML, which classifies or predicts, generative AI produces novel outputs.
1.1 Traditional AI vs. Generative AI
| Aspect | Traditional ML / AI | Generative AI |
|---|---|---|
| Primary task | Classify, predict, detect | Create, generate, converse |
| Output type | Label, number, decision | Text, image, code, audio |
| Training data | Task-specific labeled dataset | Massive general-purpose corpus |
| Flexibility | Narrow, one task | Broad, many tasks |
| Example | Spam filter, fraud detection | Claude, DALL-E, Amazon Titan |
1.2 What Generative AI Can Produce
| Modality | Examples |
|---|---|
| Text | Articles, summaries, code, conversations, translations |
| Images | Photorealistic images, product visuals, artwork |
| Audio | Music, speech synthesis, sound effects |
| Video | Video generation, animation |
| Code | Write, debug, explain, and refactor source code |
| Synthetic Data | Training data for other ML models |
| Multi-modal | Combined text and image understanding or generation |
2. Foundation Models
2.1 Definition and Key Characteristics
A Foundation Model (FM) is a large AI model trained on broad, internet-scale datasets using self-supervision, designed to be adapted across a wide range of downstream tasks. The term was introduced by Stanford HAI in 2021.
| Characteristic | Description |
|---|---|
| Scale | Billions to trillions of parameters |
| Pre-training | Trained on massive, general-purpose datasets |
| Emergent abilities | Capabilities not explicitly trained for appear at scale |
| Adaptability | Can be fine-tuned or prompted for specific applications |
| Few-shot learning | Learns new tasks from only a few examples |
Emergent Abilities
Capabilities that appear in large models but are absent in smaller versions include multi-step reasoning, code generation, language translation, in-context (few-shot) learning, and chain-of-thought reasoning.
2.2 Foundation Models vs. Traditional ML Models
| Property | Traditional ML Model | Foundation Model |
|---|---|---|
| Training data | Task-specific, labeled | Broad, often unlabeled, massive scale |
| Training cost | Thousands to millions USD | Tens to hundreds of millions USD |
| Flexibility | One task | Many tasks |
| Customization | Retrain from scratch | Fine-tune or prompt |
| Access model | Build your own | API or download weights |
2.3 Models Available on Amazon Bedrock
| Provider | Model Family | Key Strength |
|---|---|---|
| Anthropic | Claude Haiku, Sonnet, Opus | Long context, safety, complex reasoning |
| Amazon | Titan Text, Titan Embeddings, Titan Image | Native AWS; data not shared for training |
| Meta | Llama 2, Llama 3 | Open weights; strong at code |
| Mistral AI | Mistral, Mixtral | Efficient; multilingual |
| AI21 Labs | Jamba | Long context; summarization |
| Cohere | Command, Embed | Enterprise text generation and embeddings |
| Stability AI | Stable Diffusion XL | High-quality image generation |
Exam Tip: Amazon Titan models are AWS-native. Customer data submitted to Bedrock is never used to train the underlying foundation models — this is a critical security guarantee.
3. Large Language Models
Large Language Models (LLMs) are foundation models specifically pre-trained on massive text corpora to understand and generate human language.
- "Large" — billions to trillions of parameters
- "Language" — trained primarily on text
- "Model" — a statistical representation of language
3.1 How LLMs Are Pre-Trained
LLMs are trained using self-supervised learning on internet-scale text without manual labels.
| Pre-training Approach | Style | Example Models |
|---|---|---|
| Next token prediction | Predict the next word given all previous words | GPT, Claude, Llama |
| Masked language modeling | Predict randomly masked words in a sentence | BERT, RoBERTa |
| RLHF | Fine-tune with human preference rankings to improve helpfulness | ChatGPT, Claude |
RLHF — Reinforcement Learning from Human Feedback
1. Pre-train LLM on massive text corpus
↓
2. Supervised Fine-Tuning (SFT) on high-quality examples
↓
3. Train a Reward Model from human preference rankings
↓
4. Optimize LLM with RL (PPO) to maximize reward
↓
Aligned, helpful, and safer model
3.2 LLM Capabilities
| Capability | Description |
|---|---|
| Text generation | Write articles, emails, and creative content |
| Summarization | Condense long documents into key points |
| Question answering | Answer questions based on provided context |
| Translation | Convert text across languages |
| Code generation | Write, debug, and explain source code |
| Classification | Categorize text without task-specific training |
| Information extraction | Pull structured data from unstructured text |
| Reasoning | Perform multi-step logical problem solving |
4. The Transformer Architecture
The Transformer, introduced in "Attention Is All You Need" (Vaswani et al., Google, 2017), is the architecture underlying all modern LLMs. Its key innovation is the self-attention mechanism, which allows every token to attend to every other token in parallel.
4.1 Core Mechanism
Self-Attention
For each token, self-attention computes how much weight to assign to every other token when producing its representation. This creates Query (Q), Key (K), and Value (V) vectors.
Attention(Q, K, V) = softmax(QKᵀ / √d) × V
Multi-Head Attention
Multiple attention heads run in parallel, each learning different types of relationships — syntactic, semantic, co-reference, and positional.
Positional Encoding
Because transformers process all tokens simultaneously (not sequentially), positional encodings are added to each token embedding to convey order information.
4.2 Encoder vs. Decoder vs. Encoder-Decoder
| Architecture | Representative Models | Best For |
|---|---|---|
| Encoder-only | BERT, RoBERTa | Text understanding — classification, NER, embeddings |
| Decoder-only | GPT-4, Claude, Llama | Text generation, chat, code |
| Encoder-Decoder | T5, BART | Sequence-to-sequence — translation, summarization |
Key Concept: Most modern chat and code models use the decoder-only architecture. They generate text one token at a time, each new token conditioned on all previous tokens.
4.3 Text Generation and Decoding Strategies
LLMs generate text one token at a time, left to right. At each step the model outputs a probability distribution over the entire vocabulary.
| Strategy | Description | Best For |
|---|---|---|
| Greedy | Always selects the highest-probability token | Fast; often repetitive |
| Beam Search | Maintains K candidate sequences simultaneously | Structured, deterministic output |
| Top-K Sampling | Samples randomly from the K most probable tokens | Creative, varied text |
| Top-P (Nucleus) | Samples from tokens summing to probability P | Creative with coherence |
| Temperature Scaling | Scales the logits before sampling | Applied on top of any sampling strategy |
5. Tokens and Context Windows
5.1 Tokenization
A token is the basic unit of text that an LLM processes. Tokenization splits raw text into tokens before the model sees it.
| Approximation | Value |
|---|---|
| Characters per token (English) | ~4 |
| Words per token | ~0.75 |
| Tokens per page of text | ~750 |
| Tokens per 100 words | ~133 |
Common Tokenization Methods
| Method | Description | Used By |
|---|---|---|
| Byte-Pair Encoding (BPE) | Iteratively merges the most frequent character pairs | GPT, Llama |
| WordPiece | Similar to BPE but uses likelihood instead of frequency | BERT |
| SentencePiece | Language-agnostic subword tokenization | T5, multilingual models |
Key Concept: Token count directly controls both context window usage and API cost. Input tokens and output tokens are each priced separately, with output tokens typically costing more.
5.2 Context Window
The context window is the maximum number of tokens — input plus output — the model can process in a single request.
| Model Family | Context Window |
|---|---|
| Older models | 2K–4K tokens |
| GPT-4 (standard) | 8K–128K tokens |
| Claude 3 | Up to 200K tokens |
| Gemini 1.5 Pro | 1M+ tokens |
6. Embeddings and Vector Databases
6.1 What are Embeddings?
Embeddings are dense numerical vectors that encode the semantic meaning of text (or other data). Semantically similar content produces vectors that are close together in the embedding space.
"king" → [ 0.21, -0.40, 0.80, 0.11, ... ]
"queen" → [ 0.20, -0.39, 0.79, 0.18, ... ]
"apple" → [-0.32, 0.61, -0.10, 0.92, ... ]
Key Concept: Embeddings enable semantic search — finding documents by meaning rather than exact keyword match. They are the foundation of Retrieval Augmented Generation (RAG).
AWS Embedding Models (via Bedrock)
| Model | Provider | Capability |
|---|---|---|
| Amazon Titan Embeddings | Amazon | Text embeddings for English and multilingual use |
| Cohere Embed | Cohere | High-quality multilingual embeddings |
6.2 Vector Databases
Vector databases store embeddings and support fast Approximate Nearest Neighbor (ANN) search to find the most similar vectors to a query.
| Database | Notes |
|---|---|
| Amazon OpenSearch (vector engine) | AWS-native; recommended for Bedrock Knowledge Bases |
| Amazon Aurora PostgreSQL (pgvector) | Relational DB with vector extension |
| Amazon RDS PostgreSQL (pgvector) | Managed relational DB with vector support |
| Pinecone | Purpose-built, fully managed vector database |
| Redis Enterprise | In-memory vector search |
| MongoDB Atlas | Document database with vector search |
7. Prompt Engineering
Prompt engineering is the practice of designing and optimizing input text to get the best possible output from a generative AI model — without modifying model weights.
7.1 Prompting Techniques
Zero-Shot Prompting
Ask the model to perform a task with no examples provided.
Classify the sentiment of this review: "The shipping was late but the product is great."
Few-Shot Prompting
Provide labeled examples before the actual request to guide the model.
"I love this!" → Positive
"Terrible experience." → Negative
"It's okay." → Neutral
"This product changed my life!" →
Chain-of-Thought (CoT) Prompting
Instruct the model to reason step by step before answering. Dramatically improves performance on reasoning and math tasks.
Q: A store has 5 apples. 2 are sold and 3 more are added. How many remain?
A: Let me think step by step.
Start: 5 apples
After selling 2: 5 − 2 = 3 apples
After adding 3: 3 + 3 = 6 apples
Answer: 6
Other Key Techniques
| Technique | Description |
|---|---|
| Role / Persona Prompting | Assign the model a role: "You are a senior AWS architect…" |
| System Prompt | Persistent instructions defining behavior, constraints, and persona |
| RAG Prompting | Inject retrieved document chunks into the prompt as context |
| ReAct | Model alternates between reasoning and calling external tools |
| Self-Consistency | Generate multiple CoT outputs and take the majority-vote answer |
7.2 Inference Parameters
| Parameter | Description | Effect |
|---|---|---|
| Temperature | Scales the probability distribution before sampling | 0 = deterministic; >1 = more creative and varied |
| Top-P | Sample from the smallest set of tokens summing to probability P | Lower = more focused |
| Top-K | Sample from only the K most probable tokens | Lower = more conservative |
| Max Tokens | Maximum number of tokens in the generated response | Limits output length and cost |
| Stop Sequences | Strings that immediately halt generation | Controls format and length |
Exam Tip: Temperature = 0 produces deterministic, consistent output — ideal for factual Q&A and classification. Temperature > 0 introduces randomness — ideal for creative writing and brainstorming.
8. Model Customization Techniques
When zero-shot and few-shot prompting are insufficient, the model itself can be customized. Techniques vary in cost, complexity, and whether they modify model weights.
8.1 Technique Comparison
| Technique | Modifies Weights | Cost | Data Required | Best When |
|---|---|---|---|---|
| Prompt Engineering | No | Inference only | None | Always start here |
| RAG | No | Inference + retrieval | Documents / data source | Need current or proprietary information |
| Fine-Tuning | Yes | Training compute | Labeled examples | Need specific style, format, or domain vocabulary |
| Continued Pre-Training | Yes | Very high | Large unlabeled corpus | Entirely new domain with unique terminology |
Fine-Tuning Methods
| Method | Description | Use Case |
|---|---|---|
| Full Fine-Tuning | Update all model parameters | Best quality; highest cost |
| LoRA | Train small low-rank matrices; freeze most weights | Efficient; near-full-fine-tuning quality |
| QLoRA | LoRA combined with model quantization | Most memory-efficient fine-tuning |
| Instruction Fine-Tuning | Train on instruction-response pairs | Improve instruction-following behavior |
| RLHF | Train using human preference rankings | Align model with human values |
8.2 RAG vs. Fine-Tuning Decision Guide
| Scenario | Recommended Approach |
|---|---|
| Need up-to-date or real-time information | RAG |
| Need access to private company documents | RAG |
| Need to update knowledge frequently | RAG |
| Need a specific writing style or tone | Fine-Tuning |
| Need domain-specific vocabulary or terminology | Fine-Tuning |
| Have very limited labeled training data | RAG |
9. Types of Generative AI Models
| Model Type | Architecture | How It Works | AWS Example |
|---|---|---|---|
| Autoregressive LLM | Decoder-only transformer | Predicts next token left to right | Claude, Titan Text |
| Masked LM | Encoder-only transformer | Predicts masked tokens in context | BERT (embeddings, not generation) |
| Seq2Seq | Full encoder-decoder | Maps an input sequence to an output sequence | Translation, summarization |
| Diffusion Model | Iterative denoising | Learns to reverse a noise-addition process | Titan Image Generator, Stable Diffusion |
| GAN | Generator + Discriminator | Two networks compete until generator produces realistic output | Image synthesis |
| VAE | Encoder + Probabilistic Decoder | Encodes input to a latent distribution; samples to generate | Generative modeling |
| Multi-modal Model | Combined architectures | Processes and generates across text and images | Claude 3 Vision, Amazon Nova |
Key Concept: Diffusion models are the dominant architecture for image generation. They work by learning to reverse a noise-addition process — starting from pure noise and iteratively denoising toward a coherent image conditioned on a text prompt.
10. Amazon Bedrock
Amazon Bedrock is a fully managed service that provides access to high-performing foundation models from multiple providers through a single, unified API — without requiring infrastructure management.
10.1 Core Features
Knowledge Bases (Managed RAG)
Bedrock Knowledge Bases provide a fully managed RAG pipeline.
Data Sources (S3, SharePoint, Confluence, web)
↓ Automatic chunking + embedding
Vector Store (OpenSearch Serverless, Aurora pgvector, Pinecone)
↓ At inference time
User Query → Embed → Retrieve top-K chunks → Inject into prompt → FM → Response
| Supported Vector Store | Type |
|---|---|
| Amazon OpenSearch Serverless | Recommended; serverless |
| Amazon Aurora PostgreSQL | Relational with pgvector |
| Amazon RDS PostgreSQL | Managed relational with pgvector |
| Pinecone | Third-party, purpose-built |
| Redis Enterprise | Third-party, in-memory |
| MongoDB Atlas | Third-party, document DB |
Agents
Bedrock Agents enable autonomous, multi-step task execution by orchestrating reasoning, knowledge base queries, and API calls via Lambda functions.
| Agent Component | Description |
|---|---|
| Foundation Model | The reasoning engine for the agent |
| Instructions | System prompt defining the agent's role and constraints |
| Action Groups | APIs the agent can call, defined by an OpenAPI schema and backed by Lambda |
| Knowledge Bases | Document collections the agent can query |
Guardrails
Bedrock Guardrails provide customizable safety controls applied at both input and output.
| Guardrail Feature | Description |
|---|---|
| Content Filters | Block harmful content (hate, insults, sexual, violence, misconduct, prompt attacks) |
| Denied Topics | Block specific topics defined in natural language |
| Word Filters | Block exact words or phrases; includes a built-in profanity list |
| PII Redaction | Detect and mask personally identifiable information |
| Grounding Check | Detect hallucinations by comparing responses to retrieved source material |
Model Evaluation
| Evaluation Type | Description |
|---|---|
| Automatic Evaluation | Built-in metrics — accuracy, robustness, toxicity |
| Human Evaluation | Route responses to human reviewers; compare models side by side |
Additional Bedrock Features
| Feature | Description |
|---|---|
| Fine-Tuning | Further train select models (Titan, Llama, Cohere) on your data stored in S3 |
| Continued Pre-Training | Pre-train select Titan models on unlabeled domain-specific data |
| Provisioned Throughput | Reserve dedicated model capacity for consistent latency; billed hourly |
| Prompt Caching | Cache static prompt prefixes to reduce cost and latency for repeated prefixes |
10.2 Bedrock Security and Pricing
| Dimension | Detail |
|---|---|
| Data privacy | Customer prompts and responses are not used to train AWS base models |
| Encryption | TLS for data in transit; SSE-KMS for fine-tuning data at rest |
| Access control | IAM policies at the model ARN level |
| Network | VPC interface endpoints via AWS PrivateLink available |
| Audit | All API calls logged in AWS CloudTrail |
| On-demand pricing | Billed per input token + output token; output tokens typically cost more |
| Provisioned pricing | Fixed hourly rate; 1-month or 6-month commitment |
11. Amazon Q
Amazon Q is AWS's family of generative AI–powered assistants tailored to specific contexts.
| Product | Target User | Key Capabilities |
|---|---|---|
| Amazon Q Business | Enterprise employees | Connect to 40+ data sources; respects existing access controls; custom plugins |
| Amazon Q Developer | Developers | Inline code suggestions, generation, security scanning, code transformation, CLI completions |
| Amazon Q in QuickSight | Business analysts | Natural language queries for BI dashboards: "Show top 10 products by revenue" |
| Amazon Q in AWS Console | AWS users | In-console assistant for service questions, troubleshooting, and environment queries |
Exam Tip: Amazon Q Developer replaced Amazon CodeWhisperer. It adds security scanning, code transformation (e.g., upgrade Java 8 → 17), and CLI completions beyond simple code suggestions.
12. Hallucination and Output Quality
Hallucination occurs when an LLM confidently generates plausible but factually incorrect information.
Hallucination Types
| Type | Description |
|---|---|
| Factual Hallucination | Incorrect facts stated with confidence |
| Source Hallucination | Cites non-existent papers, links, or documents |
| Logical Hallucination | Internally inconsistent reasoning |
Mitigation Strategies
| Strategy | How It Helps |
|---|---|
| RAG | Grounds responses in retrieved, verifiable facts |
| Grounding Check (Bedrock Guardrails) | Detects and blocks responses not supported by retrieved context |
| Lower Temperature | Reduces randomness; model stays closer to likely facts |
| Require Citations | Prompt the model to cite sources for all factual claims |
| Human Review | Mandatory for high-stakes or irreversible decisions |
LLM Evaluation Metrics
| Metric | Description | Used For |
|---|---|---|
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation | Summarization quality |
| BLEU | Bilingual Evaluation Understudy | Translation quality |
| Perplexity | Measure of model uncertainty over a text sequence | Language modeling quality |
| LLM-as-Judge | Use a powerful LLM to score outputs from another model | Scalable automated evaluation |
Exam Tips & Quick Reference
Scenario-to-Answer Mapping
| Scenario Keyword / Requirement | Correct Answer |
|---|---|
| "Access multiple foundation models via a single API" | Amazon Bedrock |
| "Ground LLM responses in proprietary company documents" | Bedrock Knowledge Bases (RAG) |
| "LLM must call external APIs and complete multi-step tasks" | Bedrock Agents |
| "Filter harmful content from model inputs and outputs" | Bedrock Guardrails |
| "Customize model behavior without changing model weights" | Prompt engineering |
| "Customize model for specific domain vocabulary and style" | Fine-tuning via Bedrock |
| "Responses are confidently wrong" | Hallucination; mitigate with RAG and grounding check |
| "AI coding assistant that also scans for security vulnerabilities" | Amazon Q Developer |
| "Enterprise chatbot connected to SharePoint and Confluence" | Amazon Q Business |
| "Consistent, deterministic responses for factual Q&A" | Temperature = 0 |
| "More creative, varied responses for brainstorming" | Higher temperature (0.7–1.0) |
| "Reduce cost on repeated prompts with long system prompts" | Bedrock prompt caching |
| "Reserve model capacity for consistent latency SLA" | Bedrock Provisioned Throughput |
Common Traps
- RAG vs. Fine-Tuning: RAG is for grounding in current or proprietary data and requires no training. Fine-tuning is for teaching a specific style, tone, or vocabulary and requires training data and compute. The exam frequently conflates them.
- Temperature = 0 is not random: Temperature = 0 makes generation deterministic — the model always picks the highest-probability token. It does not mean the model refuses to answer.
- Tokens ≠ Words: Tokens are subword units. A single English word may be one or more tokens. API costs are billed per token, not per word.
- Amazon Q Developer ≠ Amazon Q Business: Q Developer is for writing code; Q Business is for enterprise knowledge management. They are different products with different pricing.
- Bedrock does not train on your data: Customer prompts and responses submitted to Bedrock are never used to train or improve AWS base models — a fundamental privacy guarantee.
Key Terms — Domain 2
| Term | One-Line Definition |
|---|---|
| Foundation Model | A large model pre-trained on broad data and adaptable to many tasks |
| LLM | A language-specific foundation model trained on massive text corpora |
| Token | The basic unit of text a model processes (~4 characters in English) |
| Context Window | The maximum number of tokens a model can process in one request |
| Hallucination | A model generating confident but factually incorrect information |
| Embedding | A dense numerical vector encoding the semantic meaning of text |
| Vector Database | A database optimized for similarity search over embedding vectors |
| RAG | Grounding LLM responses by retrieving relevant context from a knowledge base |
| Temperature | A parameter controlling the randomness of token sampling |
| RLHF | Training technique that uses human preference rankings to align a model |
| Fine-Tuning | Continuing to train a foundation model on domain-specific data |
| Guardrails | Configurable filters applied to model inputs and outputs for safety |
| Agent | An LLM-powered system that plans and executes multi-step tasks using tools |
| Zero-Shot | Prompting a model to perform a task with no examples provided |
| Few-Shot | Providing a small number of labeled examples in the prompt |
| Chain-of-Thought | Prompting the model to reason step by step before answering |
End of Domain 2. Continue to Domain 3: Applications of Foundation Models →
Ready to test yourself?
Practice questions for this topic