How to Build Better Prompts for Generative AI Models in Vertex AI

Mohtasham Sayeed Mohiuddin
16 min read · Dec 31, 2023


Welcome to this hands-on tutorial on using Vertex AI’s generative capabilities for natural language tasks like question answering and prompt engineering. In this blog post, we will walk through examples using Vertex AI Generative models to showcase some real-world use cases.

We will cover the end-to-end workflow: setting up the environment, importing the Vertex AI generative AI libraries, loading models like text-bison, designing prompts, querying models, and evaluating results. Through various code samples, you will get hands-on experience with capabilities like few-shot learning, open-domain vs. closed-domain questions, adding custom knowledge for question answering, and more.

By the end of this tutorial, you will be able to leverage the power of generative AI for your natural language applications. So let’s get started!

Costs
This tutorial uses billable components of Google Cloud:

  • Vertex AI Generative AI Studio

Learn about Vertex AI pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

Launch Vertex AI Workbench notebook

To create and launch a Vertex AI Workbench notebook:

  1. In the Navigation Menu, click Vertex AI > Workbench.
  2. On the Workbench page, click Enable Notebooks API (if it isn’t enabled yet).
  3. Click the User-Managed Notebooks tab, then click Create New.
  4. Name the notebook.
  5. Set Region and Zone.
  6. In the New instance menu, choose the latest version of TensorFlow Enterprise 2.x for Environment.
  7. Click Advanced Options to edit the instance properties.
  8. Click Machine type and then select e2-standard-2 for Machine type.
  9. Leave the remaining fields at their default and click Create.

After a few minutes, the Workbench page lists your instance, followed by an Open JupyterLab link.

  • Click Open JupyterLab to open JupyterLab in a new tab.

Clone the example repo within your Workbench instance

To clone the generative-ai repository in your JupyterLab instance:

  • In JupyterLab, click the Terminal icon to open a new terminal.
  • At the command-line prompt, type the following command and press ENTER:
git clone --depth=1 https://github.com/GoogleCloudPlatform/generative-ai
  • To confirm that you have cloned the repository, in the left panel, double-click the generative-ai folder to see its contents.

It may take several minutes for the repository to clone.

Use Case 1: Question Answering with Generative Models

Navigate to the example notebook

  1. Navigate to the generative-ai folder on the left-hand side of the notebook.
  2. Navigate to the /language/prompts folder.
  3. Click on the question_answering.ipynb file

Overview

Large language models can be used for various natural language processing tasks, including question-answering (Q&A). These models are trained on a vast amount of text data and can generate high-quality responses to a wide range of questions. One thing to note here is that most models have cutoff dates regarding their knowledge, and asking anything too recent might yield an incomplete, imaginative, or incorrect answer (i.e. a hallucination).

This notebook covers the essentials of prompts for answering questions using a generative model. In addition, it covers both open-domain questions (where the knowledge is available on the public internet) and closed-domain questions (where the knowledge is private, typically enterprise or personal).

Learn more about prompt design in the official documentation.

Objective

By the end of the notebook, you should be able to write prompts for the following:

  • Open domain questions:
    - Zero-shot prompting
    - Few-shot prompting
  • Closed domain questions:
    - Providing custom knowledge as context
    - Instruction-tuning the outputs
    - Few-shot prompting

Getting Started

Import libraries

import vertexai
from vertexai.language_models import TextGenerationModel

# Initialize the Vertex AI SDK for your project and region, then load the model.
vertexai.init(project="[your-project-id]", location="us-central1")
generation_model = TextGenerationModel.from_pretrained("text-bison@001")

Question Answering

Question-answering capabilities require providing a prompt or a question that the model can use to generate a response. The prompt can be a few words or a few complete sentences, depending on the complexity of the question.

When creating a question-answering prompt, it is essential to be specific and provide as much context as possible. This helps the model understand the intent behind the question and generate a relevant response. For example, if you want to ask:

"What is the capital of France?",

then a good prompt could be:

"Please tell me the name of the city that serves as the capital of France."

In addition to being specific, the prompt should also be grammatically correct and free of spelling errors. This helps the model generate a response that is easy to understand and contains fewer errors or inaccuracies.

By providing specific, context-rich prompts, you can help the model understand the intent behind the question and generate accurate and relevant responses.
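
For example, with the model loaded above, a quick check of the more specific phrasing might look like this (a minimal sketch; the token limit and temperature are illustrative choices):

specific_prompt = "Please tell me the name of the city that serves as the capital of France."
print(
    generation_model.predict(
        specific_prompt,
        max_output_tokens=32,
        temperature=0.1,
    ).text
)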

Below are some differences between the open-domain and closed-domain categories for question-answering prompts.

  • Open domain: All questions whose answers are available online already. They can belong to any category, like history, geography, countries, politics, chemistry, etc. These include trivia or general knowledge questions, like:
Q: Who won the Olympic gold in swimming?
Q: Who is the President of [given country]?
Q: Who wrote [specific book]?

Keep in mind the training cutoff of generative models, as questions involving information more recent than what the model was trained on might give incorrect or imaginative answers.

  • Closed domain: If you have some internal knowledge base not available on the public internet, then those belong to the closed domain category. You can pass that “private” knowledge as context to the model. If prompted correctly, the model is more likely to answer from within the context provided and less likely to give answers beyond that from the open internet.

Consider the example of building a Q&A bot over your internal product documentation. In this case, you can pass the complete documentation to the model and prompt it only to answer based on that.

Typical prompt for closed domain:

Prompt: f""" Answer from the below context: \n\n
context: {your knowledge base} \n
question: {question specific to that knowledge base} \n
answer: {to be predicted by model} \n
"""

Below are some examples to understand these different types of prompts.

Open Domain

Zero-shot prompting

prompt = """Q: Who was President of the United States in 1955? Which party did he belong to?\n
A:
"""
print(
    generation_model.predict(
        prompt,
        max_output_tokens=256,
        temperature=0.1,
    ).text
)

prompt = """Q: What is the tallest mountain in the world?\n
A:
"""
print(
    generation_model.predict(
        prompt,
        max_output_tokens=20,
        temperature=0.1,
    ).text
)

Few-shot prompting

Let’s say you want a short answer from the model (like only a specific name). To do so, you can leverage a few-shot prompt and provide examples to the model to illustrate the expected behavior.

prompt = """Q: Who is the current President of France?\n
A: Emmanuel Macron \n\n

Q: Who invented the telephone? \n
A: Alexander Graham Bell \n\n

Q: Who wrote the novel "1984"?
A: George Orwell

Q: Who discovered penicillin?
A:
"""
print(
    generation_model.predict(
        prompt,
        max_output_tokens=20,
        temperature=0.1,
    ).text
)

Zero-shot prompting vs Few-shot prompting

Zero-shot prompting can be useful for quickly generating text for new tasks, but the quality of the generated text may be lower than that of a few-shot prompt with well-chosen examples. Few-shot prompting is typically better suited for tasks that require a high degree of specificity or domain-specific knowledge, but it requires some additional thought and potentially some example data to set up the prompt.

Closed Domain

Adding internal knowledge as context in prompts

Imagine a scenario where you would like to build a question-answering bot that takes in internal documentation and lets users ask questions about it.

In the example below, the Google Cloud Storage and content policy documentation is added to the prompt, so that the PaLM API can use it to answer subsequent questions based on the provided context.

context = """
Storage and content policy \n
How durable is my data in Cloud Storage? \n
Cloud Storage is designed for 99.999999999% (11 9's) annual durability, which is appropriate for even primary storage and
business-critical applications. This high durability level is achieved through erasure coding that stores data pieces redundantly
across multiple devices located in multiple availability zones.
Objects written to Cloud Storage must be redundantly stored in at least two different availability zones before the
write is acknowledged as successful. Checksums are stored and regularly revalidated to proactively verify that the data
integrity of all data at rest as well as to detect corruption of data in transit. If required, corrections are automatically
made using redundant data. Customers can optionally enable object versioning to add protection against accidental deletion.
"""

question = "How is high availability achieved?"

prompt = f"""Answer the question given in the contex below:
Context: {context}?\n
Question: {question} \n
Answer:
"""

print("[Prompt]")
print(prompt)

print("[Response]")
print(
    generation_model.predict(
        prompt,
    ).text
)

Instruction-tuning outputs

Another way to help the language model is to provide additional instructions in the prompt that frame the output. To ensure the model doesn’t respond to anything outside the context, the prompt can specify that the response should be “Information not available in provided context” when that is the case.

question = "What machined are required for hosting Vertex AI models?"
prompt = f"""Answer the question given the context below as {{Context:}}. \n
If the answer is not available in the {{Context:}} and you are not confident about the output,
please say "Information not available in provided context". \n\n
Context: {context}?\n
Question: {question} \n
Answer:
"""

print("[Prompt]")
print(prompt)

print("[Response]")
print(
    generation_model.predict(
        prompt,
        max_output_tokens=256,
        temperature=0.3,
    ).text
)

Few-shot prompting

prompt = """
Context:
The term "artificial intelligence" was first coined by John McCarthy in 1956. Since then, AI has developed into a vast
field with numerous applications, ranging from self-driving cars to virtual assistants like Siri and Alexa.

Question:
What is artificial intelligence?

Answer:
Artificial intelligence refers to the simulation of human intelligence in machines that are programmed to think and learn like humans.

---

Context:
The Wright brothers, Orville and Wilbur, were two American aviation pioneers who are credited with inventing and
building the world's first successful airplane and making the first controlled, powered and sustained heavier-than-air human flight,
on December 17, 1903.

Question:
Who were the Wright brothers?

Answer:
The Wright brothers were American aviation pioneers who invented and built the world's first successful airplane
and made the first controlled, powered and sustained heavier-than-air human flight, on December 17, 1903.

---

Context:
The Mona Lisa is a 16th-century portrait painted by Leonardo da Vinci during the Italian Renaissance. It is one of
the most famous paintings in the world, known for the enigmatic smile of the woman depicted in the painting.

Question:
Who painted the Mona Lisa?

Answer:

"""
print(
    generation_model.predict(
        prompt,
    ).text
)

Extractive Question-Answering

In the next example, the generative model is guided to understand the meaning of the question and the passage and to identify the relevant information in the passage that answers the question. The model is given a question and a passage of text and is asked to find the answer to the question within the passage. The answer is typically a phrase or sentence.

prompt = """
Background: There is evidence that there have been significant changes in Amazon rainforest vegetation over the last 21,000 years through the Last Glacial Maximum (LGM) and subsequent deglaciation.
Analyses of sediment deposits from Amazon basin paleo lakes and from the Amazon Fan indicate that rainfall in the basin during the LGM was lower than for the present, and this was almost certainly
associated with reduced moist tropical vegetation cover in the basin. There is debate, however, over how extensive this reduction was. Some scientists argue that the rainforest was reduced to small,
isolated refugia separated by open forest and grassland; other scientists argue that the rainforest remained largely intact but extended less far to the north, south, and east than is seen today.
This debate has proved difficult to resolve because the practical limitations of working in the rainforest mean that data sampling is biased away from the center of the Amazon basin, and both
explanations are reasonably well supported by the available data.

Q: What does LGM stand for?
A: Last Glacial Maximum.

Q: What did the analysis from the sediment deposits indicate?
A: Rainfall in the basin during the LGM was lower than for the present.

Q: What are some of the scientists' arguments?
A: The rainforest was reduced to small, isolated refugia separated by open forest and grassland.

Q: There have been major changes in Amazon rainforest vegetation over the last how many years?
A: 21,000.

Q: What caused changes in the Amazon rainforest vegetation?
A: The Last Glacial Maximum (LGM) and subsequent deglaciation

Q: What has been analyzed to compare Amazon rainfall in the past and present?
A: Sediment deposits.

Q: What has the lower rainfall in the Amazon during the LGM been attributed to?
A:
"""

print(
    generation_model.predict(
        prompt,
    ).text
)

Evaluation

You can evaluate the outputs of the question-answering task if the ground truth answer for each question is available. With zero-shot prompting, you can only use open-domain questions; with closed-domain questions, you can add context and evaluate in the same way. To show how this works, start by creating a simple data frame with questions and ground truth answers.

import pandas as pd

qa_data = {
    "question": [
        "In a website browser address bar, what does “www” stand for?",
        "Who was the first woman to win a Nobel Prize",
        "What is the name of the Earth’s largest ocean?",
    ],
    "answer_groundtruth": ["World Wide Web", "Marie Curie", "The Pacific Ocean"],
}
qa_data_df = pd.DataFrame(qa_data)
qa_data_df

Now that you have the data with questions and ground truth answers, you can call the PaLM 2 text generation model on each row using the apply function. Each row builds a dynamic prompt and predicts the answer using the PaLM API. The results are saved in the answer_prediction column.

def get_answer(row):
    # Build a dynamic prompt for each question and return the model's prediction.
    prompt = f"""Answer the following question as precisely as possible.\n\n
    question: {row}
    answer:
    """
    return generation_model.predict(
        prompt=prompt,
    ).text


qa_data_df["answer_prediction"] = qa_data_df["question"].apply(get_answer)
qa_data_df

Next, consider evaluating the answers predicted by the PaLM API. This is more involved than evaluating a text classification task, since the answers may differ from the ground truth and may be phrased with slightly more or fewer words.

For example, for the question "What is the name of the Earth's largest ocean?", the model may predict "Pacific Ocean" while the ground truth label is "The Pacific Ocean", with an extra "The". A simple classification metric would count this as a wrong prediction because the original and predicted strings differ, even though the answer is correct; the extra "The" is the only issue. It is, at heart, a string comparison problem.
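
A plain equality check makes the problem concrete (a minimal sketch):

# Exact string comparison flags a correct answer as wrong.
print("Pacific Ocean" == "The Pacific Ocean")  # False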

To compare strings where both the ground truth and the prediction may contain a few extra or missing characters, one approach is to use a fuzzy matching algorithm. Fuzzy string matching uses the Levenshtein distance to calculate the differences between two strings.

For example, the Levenshtein distance between “kitten” and “sitting” is 3, since the following 3 edits change one into the other, and there is no way to do it with fewer than 3 edits:

  • kitten → sitten (substitution of “s” for “k”),
  • sitten → sittin (substitution of “i” for “e”),
  • sittin → sitting (insertion of “g” at the end).

Here is another example, this time using the fuzzywuzzy library, which builds on the Levenshtein distance but expresses similarity as a ratio. The raw ratio score measures the similarity of two strings as an integer in the range [0, 100]. For two strings X and Y, the score is defined as int(round((2.0 * M / T) * 100)), where T is the total number of characters in both strings and M is the number of matching characters.

You can read more about the ratio formula in the fuzzywuzzy documentation.

The example below illustrates this:

String1: "this is a test"
String2: "this is a test!"

Fuzz Ratio => 97
Fuzz Partial Ratio => 100  (since most characters match and appear in a similar sequence, the partial ratio ignores the simple addition of the trailing character)

First, install the package fuzzywuzzy and python-Levenshtein:

!pip install -q python-Levenshtein --upgrade --user
!pip install -q fuzzywuzzy --upgrade --user
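
As a quick sanity check (a minimal sketch, assuming the packages installed above), you can verify the kitten/sitting distance and the ratio example:

import Levenshtein
from fuzzywuzzy import fuzz

# The edit distance between "kitten" and "sitting" is 3, matching the three edits listed earlier.
print(Levenshtein.distance("kitten", "sitting"))  # 3

# The full ratio penalizes the extra "!", while partial_ratio ignores it.
print(fuzz.ratio("this is a test", "this is a test!"))  # 97
print(fuzz.partial_ratio("this is a test", "this is a test!"))  # 100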

Then compute a score to perform fuzzy matching:

from fuzzywuzzy import fuzz


def get_fuzzy_match(df):
    # Compare each ground truth answer with the corresponding prediction.
    return fuzz.partial_ratio(df["answer_groundtruth"], df["answer_prediction"])


qa_data_df["match_score"] = qa_data_df.apply(get_fuzzy_match, axis=1)
qa_data_df

Now that you have an individual (partial) match score for each row, you can take the mean of the whole column to get a sense of the overall performance. Scores closer to 100 mean PaLM 2 predicts answers close to the ground truth; scores toward 50 or 0 mean it did not perform well.

print(
    "The average match score of all predicted answers from PaLM 2 is: ",
    qa_data_df["match_score"].mean(),
    "%",
)

In this case, the mean score is 100%, even though some predictions were missing a few words. That means the predictions are very close to the ground truth, and some answers simply lack the exact verbosity of the ground truth.
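
If you want to inspect weaker predictions individually, a quick filter on the score column helps (a minimal sketch; the threshold of 80 is an arbitrary choice):

# Show any rows whose fuzzy match score falls below an arbitrary threshold.
low_scores = qa_data_df[qa_data_df["match_score"] < 80]
print(low_scores[["question", "answer_groundtruth", "answer_prediction"]])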

Use Case 2: Prompt Design — Best Practices

Navigate to the example notebook

  1. Navigate to the generative-ai folder on the left-hand side of the notebook.
  2. Navigate to the /language/prompts folder.
  3. Click on the intro_prompt_design.ipynb file

Overview

This notebook covers the essentials of prompt engineering, including some best practices.
Learn more about prompt design in the official documentation.

Objective

In this notebook, you learn best practices around prompt engineering — how to design prompts to improve the quality of your responses.

This notebook covers the following best practices for prompt engineering:

  • Be concise
  • Be specific and well-defined
  • Ask one task at a time
  • Turn generative tasks into classification tasks
  • Improve response quality by including examples

Import libraries

import vertexai
from vertexai.language_models import TextGenerationModel

PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
vertexai.init(project=PROJECT_ID, location="us-central1")

Load model

generation_model = TextGenerationModel.from_pretrained("text-bison@001")

Prompt engineering best practices

Prompt engineering is all about designing your prompts so that the response is what you were hoping to see.

The idea of using “unfancy” prompts is to minimize the noise in your prompt to reduce the possibility of the LLM misinterpreting the intent of the prompt. Below are a few guidelines on how to engineer “unfancy” prompts.

Be concise

🛑 Not recommended. The prompt below is unnecessarily verbose.

prompt = "What do you think could be a good name for a flower shop that specializes in selling bouquets of dried flowers more than fresh flowers? Thank you!"​
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)

✅ Recommended. The prompt below is to the point and concise.

prompt = "Suggest a name for a flower shop that sells bouquets of dried flowers"
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)

Be specific and well-defined

Suppose that you want to brainstorm creative ways to describe Earth.

🛑 Not recommended. The prompt below is too generic.

prompt = "Tell me about Earth"
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)

✅ Recommended. The prompt below is specific and well-defined.
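
For example (a sketch; the exact prompt in the notebook may differ slightly):

prompt = "Generate a list of ways that makes Earth unique compared to other planets"
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)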

Ask one task at a time

🛑 Not recommended. The prompt below has two parts to the question that could be asked separately.

prompt = "What's the best method of boiling water and why is the sky blue?"
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)

✅ Recommended. The prompts below ask for one task at a time.

prompt = "What's the best method of boiling water?"
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)
prompt = "Why is the sky blue?"
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)

Watch out for hallucinations

Although LLMs have been trained on a large amount of data, they can generate text containing statements not grounded in truth or reality; these responses from the LLM are often referred to as “hallucinations” due to their limited memorization capabilities. Note that simply prompting the LLM to provide a citation isn’t a fix to this problem, as there are instances of LLMs providing false or inaccurate citations. Dealing with hallucinations is a fundamental challenge of LLMs and an ongoing research area, so it is important to be cognizant that LLMs may seem to give you confident, correct-sounding statements that are in fact incorrect.

Note that if you intend to use LLMs for creative use cases, hallucination can actually be quite useful.

Try a prompt like the one below repeatedly. You may notice that sometimes it will confidently, but inaccurately, say “The first elephant to visit the moon was Luna”.

prompt = "Who was the first elephant to visit the moon?"​
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)

Turn generative tasks into classification tasks to reduce output variability

Generative tasks lead to higher output variability

The prompt below results in an open-ended response, useful for brainstorming, but the response is highly variable.

prompt = "I'm a high school student. Recommend me a programming activity to improve my skills."
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)

Classification tasks reduce output variability

The prompt below results in a choice and may be useful if you want the output to be easier to control.

prompt = """I'm a high school student. Which of these activities do you suggest and why:
a) learn Python
b) learn Javascript
c) learn Fortran
"""
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)

Improve response quality by including examples

Another way to improve response quality is to add examples in your prompt. The LLM learns in context from the examples on how to respond. Typically, one to five examples (shots) are enough to improve the quality of responses. Including too many examples can cause the model to over-fit the data and reduce the quality of responses.

Similar to classical model training, the quality and distribution of the examples are very important. Pick examples that are representative of the scenarios that you need the model to learn, and keep the distribution of the examples (e.g. number of examples per class in the case of classification) aligned with your actual distribution.

Zero-shot prompt

Below is an example of zero-shot prompting, where you don’t provide any examples to the LLM within the prompt itself.

prompt = """Decide whether a Tweet's sentiment is positive, neutral, or negative.
Tweet: I loved the new YouTube video you made!
Sentiment:
"""
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)

One-shot prompt

Below is an example of one-shot prompting, where you provide one example to the LLM within the prompt to give some guidance on what type of response you want.

prompt = """Decide whether a Tweet's sentiment is positive, neutral, or negative.
Tweet: I loved the new YouTube video you made!
Sentiment: positive
Tweet: That was awful. Super boring 😠
Sentiment:
"""
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)

Few-shot prompt

Below is an example of few-shot prompting, where you provide a few examples to the LLM within the prompt to give some guidance on what type of response you want.

prompt = """Decide whether a Tweet's sentiment is positive, neutral, or negative.
Tweet: I loved the new YouTube video you made!
Sentiment: positive
Tweet: That was awful. Super boring 😠
Sentiment: negative
Tweet: Something surprised me about this video - it was actually original. It was not the same old recycled stuff that I always see. Watch it - you will not regret it.
Sentiment:
"""
print(generation_model.predict(prompt=prompt, max_output_tokens=256).text)

Choosing between zero-shot, one-shot, and few-shot prompting methods

Which prompting technique to use depends on your goal. Zero-shot prompts are more open-ended and can give you creative answers, while one-shot and few-shot prompts teach the model how to behave, so you get more predictable answers that are consistent with the examples provided.

Conclusion

And that concludes our hands-on tutorial on applying Vertex AI’s generative capabilities for question-answering and prompt engineering. We walked through various examples of open and closed-domain questions, prompt design best practices, evaluation techniques, and more.

Key takeaways include:

  • Prompt engineering is key for high-quality model responses
  • Models can answer open and closed-domain questions
  • Custom knowledge improves closed-domain QA
  • Instruction tuning and few-shot learning enhance outputs
  • Fuzzy string matching evaluates free-form QA

I hope you enjoyed this hands-on experience and gained practical skills to apply Vertex AI generative models for your NLP use cases. Let me know if you have any other questions!

Feel free to connect on LinkedIn!
