Prompt engineering is the process of writing clear, structured instructions (called prompts) that tell a generative AI model what to do. Instead of retraining the model, we guide its behavior using words, examples, and formatting.
Good prompt engineering helps you get:
More accurate answers
More useful formats (tables, bullet points, summaries)
Safer and more reliable results
There are several styles of prompting, each with different strengths. Let’s look at the most common types.
Zero-shot prompting: You give the model a task without showing any examples.
Example:
"Translate this sentence into French: ‘How are you?’"
This is simple and fast, but may not be reliable for complex or ambiguous tasks.
Few-shot prompting: You show the model a few examples of input-output pairs to help it “understand the pattern.”
Example:
"Translate the following:
English: ‘Hello’ → French: ‘Bonjour’
English: ‘Goodbye’ → French: ‘Au revoir’
English: ‘Thank you’ → French:"
This method works better when:
The task needs context
The expected format is non-obvious
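Few-shot prompts are also easy to assemble programmatically. Below is a minimal sketch of that pattern; the commented-out `call_model` is a hypothetical placeholder for whatever model client you use.

```python
# Minimal sketch: building a few-shot prompt from example pairs.
# call_model() is a hypothetical placeholder for your model client.
examples = [
    ("Hello", "Bonjour"),
    ("Goodbye", "Au revoir"),
]

def build_few_shot_prompt(pairs, query):
    """Show the pattern with solved examples, then leave the last answer blank."""
    lines = ["Translate the following:"]
    for english, french in pairs:
        lines.append(f"English: '{english}' -> French: '{french}'")
    lines.append(f"English: '{query}' -> French:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(examples, "Thank you")
print(prompt)
# response = call_model(prompt)
```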
Chain-of-thought prompting: You ask the model to “think step-by-step” before answering.
Example:
"John has 3 apples. He buys 2 more. How many does he have now? Let’s think step-by-step."
This helps the model reason more clearly, especially for tasks involving logic, math, or comparisons.
Here are some tips to make your prompts more effective and consistent:
Be specific: Tell the model exactly what you want.
Bad: “Help me out.”
Good: “Summarize this article into three bullet points.”
Assign a role: Define who the AI is acting as.
Example:
"You are a helpful customer support agent."
This helps the model choose the right tone and vocabulary.
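As a minimal sketch, assuming the google-generativeai Python SDK and a Gemini model that accepts a system instruction (adapt the client and model name to your setup):

```python
# Sketch: assigning a role via a system instruction.
# Assumes: pip install google-generativeai, plus a valid API key.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

model = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction="You are a helpful customer support agent.",
)
response = model.generate_content("My order arrived damaged. What should I do?")
print(response.text)
```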
Specify the output format: Tell the model what kind of output you want.
Examples:
JSON format
A table with headers
Bullet-point list
Markdown summary
Don’t say: “Give me some suggestions.”
Do say: “List three recommendations, and explain why each one is helpful.”
Clear instructions reduce mistakes and make responses easier to read or automate.
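For instance, here is a sketch of requesting JSON and validating it; `call_model` is a stub standing in for a real client, returning canned output so the snippet runs on its own.

```python
# Sketch: asking for machine-readable output, then parsing it defensively.
import json

def call_model(prompt: str) -> str:
    """Stub for a real model client; returns canned JSON for demonstration."""
    return '[{"recommendation": "Right-size VMs", "reason": "Avoids idle capacity"}]'

prompt = (
    "List three recommendations for reducing cloud costs. "
    "Respond ONLY with a JSON array of objects with keys "
    "'recommendation' and 'reason'."
)

raw = call_model(prompt)
try:
    for item in json.loads(raw):
        print(f"- {item['recommendation']}: {item['reason']}")
except json.JSONDecodeError:
    print("Model returned malformed JSON:", raw)  # always handle this case
```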
These are settings you can adjust to change how the model behaves. You don’t need to change the prompt — just tweak these values to make the output more creative, precise, or shorter.
Temperature controls how random or creative the output is.
Low value (e.g., 0.2) → more predictable, fact-based, consistent.
High value (e.g., 0.8 or 1.0) → more varied, imaginative, and exploratory.
Use case examples:
Use low temperature for coding, legal advice, or factual Q&A.
Use high temperature for story writing, brainstorming, marketing.
Top-k sampling limits word choices to the top “k” most likely options at each step.
Lower k → less variety, more focused answers.
Higher k → more surprising, possibly less accurate answers.
Example:
If k = 5, the model picks words only from the top 5 most likely choices.
Top-p (nucleus) sampling selects words from the smallest set whose total probability is ≥ p.
How it compares with Top-k is covered in the detailed side-by-side summary later in this section.
Max output tokens sets a hard limit on how long the output can be.
Useful when:
You need short summaries
You want to control response size for cost or speed
You’re building apps with display limits (e.g., chat windows)
Note: One token is roughly ¾ of a word in English.
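A minimal sketch of setting these values, again assuming the google-generativeai SDK (the parameter names below follow that SDK; other clients use similar ones):

```python
# Sketch: sampling parameters for a focused, short, factual answer.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-flash")

response = model.generate_content(
    "Summarize what a mutex is in one sentence.",
    generation_config={
        "temperature": 0.2,       # low randomness: predictable output
        "top_k": 40,              # sample only from the 40 most likely tokens
        "top_p": 0.9,             # nucleus sampling cutoff
        "max_output_tokens": 64,  # hard cap on response length
    },
)
print(response.text)
```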
When using generative models for conversations, long documents, or ongoing workflows, managing context becomes essential.
Most language models have a limit on how many tokens they can process at once:
Older models: ~8,000 tokens
Newer models (like Gemini 1.5): up to 1 million tokens
Strategies for managing context:
Trim history: Remove parts of previous messages that are no longer needed.
Summarize: Replace long dialogue with a short summary to save space.
Chunk input: Break large documents into sections and handle one at a time.
This helps you avoid cutoff errors and hallucinations caused by lost context.
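Here is a small sketch of two of these strategies. The 4-characters-per-token estimate is a rough heuristic, not a real tokenizer:

```python
# Sketch: trimming history and chunking input under a token budget.

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per English token."""
    return max(1, len(text) // 4)

def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages that fit in the token budget."""
    kept, used = [], 0
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))

def chunk_document(text: str, chunk_tokens: int) -> list[str]:
    """Split a long document into roughly token-sized chunks."""
    size = chunk_tokens * 4
    return [text[i:i + size] for i in range(0, len(text), size)]

history = ["Hi!", "Tell me about travel insurance.", "What does policy A cover?"]
print(trim_history(history, budget=12))
```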
In a chatbot or ongoing conversation, you want the model to “remember” previous user questions or tasks.
Techniques:
Use system prompts like:
“You are talking to a user who asked about travel insurance. Continue helping them.”
Store key variables from earlier turns (e.g., name, location, preferences).
Use message history and structured memory to manage sessions.
This enables the AI to hold more natural, helpful, and consistent conversations.
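A sketch of session memory, assuming the google-generativeai chat interface; the `profile` dict is a hypothetical app-side store for key variables:

```python
# Sketch: multi-turn conversation with stored user variables.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction="You are helping a user who asked about travel insurance.",
)

profile = {"name": "Ana", "destination": "Japan"}  # remembered from earlier turns

chat = model.start_chat(history=[])  # the SDK tracks message history per session
chat.send_message(f"{profile['name']} is traveling to {profile['destination']}.")
reply = chat.send_message("What coverage should they look for?")
print(reply.text)
```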
RAG is a technique that enhances generative AI with external knowledge. Instead of relying only on what the model was trained on, RAG lets it retrieve documents in real time and use them to generate more accurate and up-to-date responses.
How RAG works:
1. User input is vectorized: the question is converted into a numeric vector (a mathematical representation).
2. Search in a vector database: the system finds documents or passages that are most similar to the input. Popular tools for this include FAISS, Pinecone, and Weaviate.
3. Inject retrieved content into the prompt: the retrieved documents are added to the model’s input as context.
4. Model generates a response: the AI uses both the user input and the retrieved information to craft an answer.
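The whole flow, as a toy sketch: FAISS handles the vector search, while `embed` below is a hypothetical stand-in for a real embedding model.

```python
# Sketch of the RAG flow. Only FAISS is real here; embed() is a toy placeholder.
import faiss
import numpy as np

DIM = 8  # toy embedding size; real embeddings use hundreds of dimensions

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding: toy vector derived from the text, for demo only."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(DIM, dtype=np.float32)

documents = [
    "Policy A covers trip cancellation up to $5,000.",
    "Policy B covers medical emergencies abroad.",
]

# Steps 1-2: vectorize documents and the question, then search the index.
index = faiss.IndexFlatL2(DIM)
index.add(np.stack([embed(d) for d in documents]))
query = "Which policy covers cancellations?"
_, ids = index.search(embed(query).reshape(1, -1), k=1)

# Step 3: inject the retrieved passage into the prompt as context.
context = documents[ids[0][0]]
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

# Step 4: prompt is sent to the generative model (client call omitted).
print(prompt)
```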
Benefits of RAG:
Up-to-date knowledge: No need to retrain the model on new content.
Reduces hallucination: Answers are based on actual documents, not guesses.
Efficient: Keeps the base model smaller by storing large amounts of knowledge outside.
Common Use Cases:
Legal or financial Q&A using internal documents
Customer support based on product manuals
Research assistants that cite specific sources
Even with good prompts, outputs need to be tested and improved over time. This involves reviewing how well the model performs and making adjustments.
Human-in-the-loop review: People score outputs for accuracy, clarity, helpfulness, tone, and safety.
Automated metrics:
BLEU: Compares generated text to a reference (for translation).
ROUGE: Measures overlap with a reference summary (for summarization).
F1 score: Used for classification accuracy.
A/B testing: Try two versions of a prompt and compare which one works better with real users.
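As a concrete example of the automated metrics above, here is a minimal token-overlap F1 in the style used for QA scoring (a simplification of what evaluation libraries compute):

```python
# Sketch: token-overlap F1 between a generated answer and a reference.
from collections import Counter

def f1_overlap(prediction: str, reference: str) -> float:
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(f1_overlap("John has 5 apples", "5 apples"))  # partial credit, ~0.67
```

Beyond automated metrics, a few iteration habits help: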
Change one thing at a time: If results are bad, tweak just one part of the prompt so you can see the impact.
Be consistent: Use the same style or structure in your examples for few-shot prompting.
Add step-by-step instructions: For complex reasoning tasks, ask the model to break things down.
Sometimes, even advanced prompt engineering isn’t enough. Google Cloud also supports deeper ways to adapt models, from lightweight options to full retraining.
Prompt templates: You can save well-crafted prompts as reusable templates.
This lets your team use a consistent approach across projects.
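A minimal sketch using Python’s standard library; the template text itself is illustrative:

```python
# Sketch: a reusable prompt template shared across a team.
from string import Template

SUMMARY_TEMPLATE = Template(
    "You are a $role. Summarize the following text into $count bullet points:\n\n$text"
)

prompt = SUMMARY_TEMPLATE.substitute(
    role="technical writer",
    count=3,
    text="Generative AI models predict the next token...",
)
print(prompt)
```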
Adapter tuning adds small, trainable components to the model that adjust its behavior.
Benefits:
Requires less data than full fine-tuning
Faster and cheaper to train
Can be used for specific domains (e.g., legal, medical)
Full fine-tuning: This means training the entire model again on your own dataset.
It's powerful, but:
Needs a lot of high-quality data
Is more expensive and time-consuming
May require infrastructure for training and evaluation
Use only when necessary, such as for highly specialized language or brand tone control.
| Technique | Purpose |
|---|---|
| Prompt Engineering | Direct model using structured instructions |
| Temperature / Top-p | Adjust creativity vs. consistency |
| RAG | Add up-to-date external knowledge |
| Multi-turn Prompting | Maintain memory and conversation flow |
| Evaluation | Test and improve prompt quality |
| Fine-tuning | Customize long-term model behavior for specific needs |
Top-k and Top-p (nucleus) sampling are both decoding strategies used to control randomness and diversity in generative AI outputs. They limit the pool of next-token candidates, but in different ways.
Top-k Sampling:
Restricts choices to the k most likely tokens at each step.
Example: If k = 10, only the top 10 probable tokens are considered.
Best for: Structured tasks, where predictability is important.
Risk: A fixed k may exclude important tokens if the probability distribution is flat.
Top-p Sampling:
Dynamically selects the smallest set of tokens whose cumulative probability is at least p (e.g., p = 0.9).
The actual number of tokens considered may vary.
Best for: Creative or open-ended tasks, where balance between coherence and variety is needed.
Risk: May occasionally pick less relevant tokens if p is too high.
Comparison Summary:
| Aspect | Top-k | Top-p |
|---|---|---|
| Fixed size | Yes | No |
| Based on | Number of tokens | Cumulative probability |
| Predictability | Higher | Adaptive |
| Use case | QA, code generation | Storytelling, dialogue |
| Flexibility | Low | High |
In practice, top-p is often preferred due to its adaptability.
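The mechanics are easy to see in a toy sketch over a made-up next-token distribution (numpy only, no model involved):

```python
# Sketch: how top-k and top-p narrow a next-token distribution.
import numpy as np

probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])  # toy distribution

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep the k most likely tokens and renormalize."""
    kept = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[kept] = probs[kept]
    return out / out.sum()

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens with cumulative probability >= p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # include the token crossing p
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

print(top_k_filter(probs, k=3))    # always exactly 3 candidates
print(top_p_filter(probs, p=0.8))  # candidate count adapts to the distribution
```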
Model output behavior can be controlled more precisely by combining these parameters.
Example settings and their effects:
Temperature = 0.2, Top-p = 0.9
Output is focused, deterministic, and safe.
Use case: Legal content, code explanation, compliance-sensitive outputs.
Temperature = 0.7, Top-k = 40
Adds moderate randomness and variety, while avoiding extreme token choices.
Use case: Product description generation, creative marketing copy.
Temperature = 1.0, Top-p = 0.95
High creativity and linguistic exploration.
Use case: Story writing, brainstorming sessions.
Best practices:
For factual or mission-critical tasks: lower temperature and stricter sampling.
For creative tasks: higher temperature with adaptive sampling (top-p preferred).
Avoid setting both top-k and top-p unless you have a clear use case, as it may overconstrain output.
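One practical pattern is to capture these combinations as named presets (parameter names here follow common SDK conventions; adjust to your client):

```python
# Sketch: named sampling presets matching the examples above.
PRESETS = {
    "factual":  {"temperature": 0.2, "top_p": 0.9},   # legal, compliance, code explanation
    "balanced": {"temperature": 0.7, "top_k": 40},    # product copy, marketing
    "creative": {"temperature": 1.0, "top_p": 0.95},  # stories, brainstorming
}

def config_for(task_type: str) -> dict:
    """Unknown task types fall back to the safest preset."""
    return PRESETS.get(task_type, PRESETS["factual"])

print(config_for("creative"))
```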
To ensure safe, compliant, and ethical use of generative AI, especially in customer-facing applications, guardrails are necessary.
Types of safety mechanisms:
Prompt filters: Block or sanitize input prompts that contain offensive, prohibited, or harmful terms.
Output moderation: Screen responses for hate speech, misinformation, sexual content, or sensitive topics using:
Regular expressions
Toxicity classifiers
Third-party content moderation APIs
Blocklists and allowlists: Restrict or allow specific tokens, phrases, or patterns.
Content labeling: Tag AI-generated output with disclaimers or metadata for transparency.
Audit trails: Log input-output pairs for accountability and post-hoc review.
Example in production:
A chatbot in healthcare may include:
Prompt blocklist for terms like “diagnose” or “prescribe”
Output moderation that filters unsafe suggestions
User warning when discussing health-related topics
These safeguards reduce legal risk, brand harm, and user distrust.
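A toy sketch of the first two mechanisms, in the spirit of the healthcare example above (real systems would layer toxicity classifiers or a moderation API on top):

```python
# Sketch: a prompt blocklist plus a regex-based output screen.
import re

PROMPT_BLOCKLIST = {"diagnose", "prescribe"}
UNSAFE_OUTPUT = re.compile(r"\b(take|stop taking)\b.*\bmedication\b", re.IGNORECASE)

def screen_prompt(prompt: str) -> bool:
    """Return False when a prompt contains a blocked term."""
    return not (set(prompt.lower().split()) & PROMPT_BLOCKLIST)

def screen_output(text: str) -> str:
    """Swap unsafe suggestions for a safe fallback with a disclaimer."""
    if UNSAFE_OUTPUT.search(text):
        return "I can't give medical advice. Please consult a licensed professional."
    return text

print(screen_prompt("Can you diagnose my rash?"))               # False: blocked
print(screen_output("You should stop taking your medication.")) # replaced
```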
Reinforcement Learning from Human Feedback (RLHF) is a training and evaluation method where human judgments guide model refinement.
How it works:
Humans rank or score model outputs on metrics such as helpfulness, tone, and accuracy.
A reward model is trained from this feedback to predict human preferences.
Reinforcement learning (e.g., Proximal Policy Optimization) then updates the model so it favors outputs the reward model scores highly.
Why it matters:
Enhances alignment between model behavior and human expectations.
Reduces undesirable outputs like hallucinations or toxicity.
Improves performance on complex, nuanced tasks.
Note: While RLHF is mainly used at model training time, similar human feedback loops can be used post-deployment to evaluate and improve prompts, workflows, and agent design.
Prompt iteration is the process of refining prompts based on observed output quality. Below is a simple illustration using a math word problem.
Prompt Version A (basic):
“How many apples does John have if he buys 2 more and already has 3?”
Model Output:
“5 apples.”
Prompt Version B (chain-of-thought):
“John has 3 apples. He buys 2 more. Let’s think step-by-step: How many apples does he have now?”
Model Output:
“Step 1: John starts with 3 apples.
Step 2: He buys 2 more, so 3 + 2 = 5.
Answer: 5 apples.”
Comparison:
| Prompt Version | Strength | Use case |
|---|---|---|
| A | Concise, but brittle | Simple lookups |
| B | More reliable and explainable | Reasoning tasks |
Lesson: By iterating and observing response quality, users can select or build better prompts tailored to task complexity and model behavior.
A development team notices that responses from a generative AI model are inconsistent and sometimes vague. Which technique should they apply first to improve response quality?
Improve the prompt structure using prompt engineering.
Prompt engineering is the practice of designing inputs that guide the model toward producing accurate and relevant outputs. Clear instructions, context, examples, and formatting requirements help the model better understand the task. For example, specifying the role of the model, providing step-by-step instructions, or including examples can significantly improve response quality. Poor prompts often lead to ambiguous outputs because the model must infer the intended task. By refining prompts, teams can achieve better results without retraining or modifying the model itself. This makes prompt engineering one of the most efficient techniques for improving generative AI outputs.
Demand Score: 85
Exam Relevance Score: 87
Which prompt engineering technique improves model performance by providing example inputs and outputs within the prompt?
Few-shot prompting.
Few-shot prompting involves including several examples of the desired task within the prompt so the model can learn the expected format or behavior. Instead of relying only on instructions, the model observes patterns in the provided examples and replicates them when generating responses. This technique is especially useful when the task requires specific formatting or reasoning patterns. Compared with zero-shot prompting, few-shot prompting often produces more consistent and accurate results because the model receives clearer guidance about how outputs should look.
Demand Score: 82
Exam Relevance Score: 86
An organization wants to reduce hallucinations by ensuring the model uses up-to-date enterprise data when generating answers. Which architecture should they implement?
Retrieval-Augmented Generation (RAG).
Retrieval-Augmented Generation combines information retrieval with generative AI. Instead of relying only on the model’s training data, the system first retrieves relevant documents from a trusted knowledge source such as a database, document repository, or enterprise knowledge base. These documents are then included as context in the model prompt before generating the response. By grounding the model in verified information, RAG significantly reduces hallucinations and improves response accuracy. This approach is widely used in enterprise AI assistants and knowledge search systems.
Demand Score: 83
Exam Relevance Score: 90
Which technique adjusts a pre-trained foundation model using additional task-specific training data?
Fine-tuning.
Fine-tuning modifies an existing foundation model by training it further on a smaller, specialized dataset. This allows the model to adapt to specific tasks, industries, or organizational requirements. For example, a company may fine-tune a language model using domain-specific documents to improve performance in legal, healthcare, or technical contexts. Compared with prompt engineering, fine-tuning requires additional training resources but can provide deeper customization. Organizations often choose fine-tuning when prompt engineering alone cannot achieve the desired accuracy or behavior.
Demand Score: 78
Exam Relevance Score: 85