A big part of my work at Adobe is working with customers to ensure that Adobe LLM Optimizer delivers value for them. This often involves my team and me auditing the prompts in their accounts. From the hundreds of customer meetings I’ve had, I’d estimate that around 90% of customers don’t quite understand the change from SEO queries to LLM prompts.
So I’ve been investing my time in answering the one question that always comes up when I speak to marketing teams about generative engine optimisation (GEO):
How do you write prompts that give you reliable, repeatable insights?
“If you are measuring visibility inside large language models, your prompts are not casual questions. They are instruments. They shape the data you collect. If they are vague or inconsistent, your results will drift. If they are precise and structured, your results become stable and meaningful.” – Flavio Longato
In my experience working with GEO, the key is simple: treat prompts like test cases.
Think of Prompts Like Test Cases
Most marketers still treat prompts like search queries. That’s usually where GEO measurement starts to break down.
A prompt is closer to a test case than a keyword. A good test case is realistic, specific, and repeatable. If the input changes too much, you don’t know whether the output changed because of the system or because of the test itself.
Recent research highlights why this matters more than many teams expect. A large study from SparkToro ran identical prompts across multiple AI systems and compared the brand recommendations returned each time. Even when nothing changed in the input, the results were highly inconsistent. Brand lists shifted, ordering changed, and sometimes completely different companies appeared. In many scenarios, there was less than a 1% chance of receiving the same set of brands twice.
This doesn’t mean AI visibility is unreliable. It means the input needs stronger structure.
When a prompt is too broad, the model has many valid directions it can explore. One run might emphasise pricing, another might focus on features, and a third might lean on brand familiarity. From a GEO perspective, that variability looks like ranking movement, but in reality it’s just different reasoning paths.
That’s why I recommend building prompts with three consistent elements:
- A clear goal – what the user is trying to achieve
- A constraint – experience level, budget, region, or use case
- Context – comparison framing or requirements
For example:
Instead of:
“Best PDF software”
Use:
“I need a PDF tool for a beginner that lets me convert to Word and edit files. I’m comparing two options and want something simple.”
The second version behaves like a controlled experiment. It narrows interpretation and reduces randomness across runs.
The SparkToro findings reinforce this approach. Their data suggests that tracking visibility across repeated, structured prompts is far more reliable than evaluating a single response or focusing only on position. Brands that appear consistently across many executions are more likely to be part of the model’s core consideration set.
Consistency doesn’t come from the model. It comes from the way you design the prompt.
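As a rough illustration of tracking visibility across repeated runs rather than a single response, the sketch below computes the share of runs in which each brand appears. It assumes the brand names have already been extracted from each response; the function name `inclusion_rates` and the sample data are hypothetical and not part of any particular tool.

```python
from collections import Counter


def inclusion_rates(runs: list[list[str]]) -> dict[str, float]:
    """Return the share of runs in which each brand appears at least once."""
    if not runs:
        return {}
    counts: Counter[str] = Counter()
    for brands in runs:
        # Use a set so a brand mentioned twice in one response counts once.
        counts.update({brand.strip().lower() for brand in brands})
    return {brand: n / len(runs) for brand, n in counts.items()}


# Hypothetical brand lists extracted from four runs of the same prompt.
runs = [
    ["Tool A", "Tool B", "Tool C"],
    ["Tool B", "Tool A"],
    ["Tool A", "Tool D"],
    ["Tool A", "Tool B", "Tool D"],
]

rates = inclusion_rates(runs)
for brand, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{brand}: {rate:.0%}")
```

A brand with a high inclusion rate across many runs is a more dependable signal than its position in any one answer.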
Why Treat It as a Software Test
In software testing, a good test case is:
- realistic
- specific
- repeatable
The same logic applies to GEO.
A realistic test case mirrors real user behaviour. It reflects how someone would genuinely ask for help. A specific test case defines intent clearly. A repeatable test case produces consistent outputs when run under the same conditions.
If your prompt is too broad, the AI assistant has too many valid directions it can take. Each direction may be reasonable, but your visibility measurement becomes unstable. One day your brand appears. The next day it does not. Nothing meaningful has changed except the interpretation space.
That is not a visibility shift. It is measurement noise.
Why Broad Prompts Create Random Visibility
When a prompt lacks structure, the model fills in the gaps. It guesses intent. It assumes context. It selects one of many possible frames.
For example:
- “What’s the best CRM?”
- “How should I improve my marketing?”
- “Which tool is better?”
Each of these prompts is valid. Each has multiple reasonable answers. But from a GEO perspective, they are weak test cases. The output can vary based on subtle sampling changes, updates in model training, or shifting internal weighting.
Your visibility score becomes volatile because the prompt itself is unstable.
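One possible way to put a number on that instability is to compare the brand sets returned by repeated runs of the same prompt. The sketch below uses mean pairwise Jaccard similarity as a stability score. This metric is my own choice for illustration, not one prescribed by the research mentioned above, and the sample runs are invented.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two brand sets: 1.0 is identical, 0.0 is disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stability(runs: list[set[str]]) -> float:
    """Mean Jaccard similarity over every pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented brand sets: three runs of a broad prompt and three of a structured one.
broad_runs = [{"a", "b", "c"}, {"c", "d", "e"}, {"a", "e", "f"}]
structured_runs = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}]

print(f"broad prompt stability:      {stability(broad_runs):.2f}")
print(f"structured prompt stability: {stability(structured_runs):.2f}")
```

A low score suggests that apparent visibility changes may be coming from the prompt itself rather than from any real shift.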
As discussed in industry research around generative search and model behaviour, consistency of input is essential for consistency of output. This principle also appears in discussions about large language model evaluation in resources such as OpenAI research publications.
The Structure of a Reliable GEO Prompt
In practice, reliable prompts usually include three elements:
1. A Clear Goal
The assistant needs to know what it is helping with.
Examples:
- “Help me choose”
- “Recommend the best option”
- “Compare these two tools”
- “Rank these solutions”
Without a clear goal, the model may default to explanation rather than decision support.
2. A Constraint
Constraints narrow the solution space. They reduce ambiguity.
Examples:
- “For a beginner”
- “For a small B2B marketing team”
- “With a limited budget”
- “For an e-commerce company in Switzerland”
Constraints anchor the response to a defined persona or situation. This increases repeatability because the model does not need to infer who the user is.
3. Context
Context defines the frame of comparison.
Examples:
- “I am comparing HubSpot and Pipedrive.”
- “I need email automation and CRM integration.”
- “We have five employees and no technical team.”
When context is explicit, the assistant does not need to guess requirements. Fewer assumptions lead to more stable outputs.
In short, a strong GEO prompt looks like this:
“Help me choose between Tool A and Tool B for a beginner marketing manager at a small B2B company. We need CRM integration and simple reporting.”
That is a test case. It is realistic. It has intent. It has constraints. It has context. It can be run again and compared over time.
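The three elements can be captured in a small data structure so that every prompt is assembled the same way. This is a minimal sketch, and the class name `GeoPrompt` and its fields are hypothetical. It simply refuses to build a prompt when one of the three elements is missing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoPrompt:
    goal: str        # what the assistant is asked to help with
    constraint: str  # persona, budget, region, or use case
    context: str     # comparison set or explicit requirements

    def __post_init__(self) -> None:
        # Reject prompts that leave any element for the model to guess.
        for name in ("goal", "constraint", "context"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")

    def render(self) -> str:
        """Assemble the prompt text in a fixed order."""
        return f"{self.goal} {self.constraint}. {self.context}"


prompt = GeoPrompt(
    goal="Help me choose between Tool A and Tool B",
    constraint="for a beginner marketing manager at a small B2B company",
    context="We need CRM integration and simple reporting.",
)
print(prompt.render())
```

Because the object is frozen, a change to any element means creating a new prompt, which makes accidental edits harder.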
Consistency Is More Important Than Creativity
In SEO, creativity can help content stand out. In GEO measurement, creativity can damage reliability.
If you rewrite your prompts every week, you are not tracking model visibility. You are testing new scenarios.
I recommend using a consistent template. For example:
- Goal: the job to be done (JTBD) of the page
- Persona: the target audience of the brand, site, and landing page
- Constraints: the friction point the page is trying to resolve
- Comparison set: what competitors do on similar pages
By keeping the structure stable, you isolate changes. If results shift, you can more confidently attribute that shift to model behaviour rather than prompt variation.
This is especially important when measuring brand inclusion or ranking within generative responses, a topic increasingly discussed in the context of generative engine optimisation.
Version Your Prompts
Over time, your understanding of GEO will improve. Your prompts will evolve. That is normal. But evolution must be controlled.
I always recommend versioning prompts. Keep a simple log:
- Prompt v1.0 – Initial baseline
- Prompt v1.1 – Added constraint
- Prompt v1.2 – Refined persona
- Prompt v2.0 – New comparison set
When visibility changes, you can check whether:
- The model changed
- Your configuration changed
- The prompt changed
Without versioning, you lose traceability.
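A version log does not need special tooling. The sketch below keeps each version with its text and a note, and adds a short content hash so that an unlogged edit to the prompt text is easy to spot. The names `PromptVersion` and `prompt_changed` and the sample prompts are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    version: str
    text: str
    note: str

    @property
    def fingerprint(self) -> str:
        """Short hash of the prompt text, used to detect silent edits."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


def prompt_changed(old: PromptVersion, new: PromptVersion) -> bool:
    """True when the prompt text differs between two log entries."""
    return old.fingerprint != new.fingerprint


# Hypothetical log entries following the v1.0, v1.1 pattern.
log = [
    PromptVersion("1.0", "Recommend a CRM for a small team.", "Initial baseline"),
    PromptVersion(
        "1.1",
        "Recommend a CRM for a small team with a limited budget.",
        "Added constraint",
    ),
]

for entry in log:
    print(entry.version, entry.fingerprint, entry.note)
```

Storing the fingerprint alongside each measurement run makes it possible to rule the prompt in or out when visibility changes.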
This approach mirrors good experimental practice. In evaluation frameworks such as those discussed by Google AI research, reproducibility is central. GEO should follow the same discipline.
Avoid Frequent Structural Changes
There is another practical issue: historical comparability.
If you continuously add and delete topics, entities, or comparison options in your GEO tracking setup, your visibility baseline shifts. You may see score drops that reflect structural changes rather than performance issues.
For example:
- Adding new competitors changes ranking distribution.
- Removing requirements alters response framing.
- Switching persona definitions shifts relevance weighting.
When you make large structural edits, treat them as a new measurement phase. Do not compare them blindly to old data.
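One way to keep phases apart in practice is to tag every observation with a phase label and only summarise within a phase. The sketch below assumes scores are recorded in chronological order. The phase labels, scores, and function name are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarise_phases(
    rows: list[tuple[str, float]],
) -> dict[str, dict[str, float]]:
    """Group (phase, score) rows and summarise each phase on its own."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for phase, score in rows:
        grouped[phase].append(score)
    return {
        phase: {
            "mean": mean(scores),
            # Change from first to last observation inside the phase only.
            "change": scores[-1] - scores[0],
        }
        for phase, scores in grouped.items()
    }


# Invented visibility scores; phase-2 starts after a new competitor was added.
observations = [
    ("phase-1", 0.60), ("phase-1", 0.64), ("phase-1", 0.62),
    ("phase-2", 0.41), ("phase-2", 0.45),
]

summary = summarise_phases(observations)
for phase, stats in summary.items():
    print(phase, f"mean={stats['mean']:.2f}", f"change={stats['change']:+.2f}")
```

In this invented data the drop between phases coincides with the structural edit, so it is read as a new baseline rather than a decline.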
Stable input produces stable trend lines.
Build a Prompt Library
In my work, I build a prompt library rather than a loose collection of questions. Each prompt:
- Has a defined intent
- Targets a clear user scenario
- Uses consistent structure
- Is version controlled
- Is tied to a measurement objective
This transforms GEO from experimentation into systematic analysis.
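The checklist above can be enforced with a simple validation pass over the library. This is a sketch under assumptions: the field names and the sample entries are hypothetical, and a real library might live in a spreadsheet or database rather than a dictionary.

```python
# Hypothetical field names mirroring the five properties listed above.
REQUIRED_FIELDS = ("intent", "scenario", "template", "version", "objective")


def missing_fields(entry: dict[str, str]) -> list[str]:
    """Return the required fields that are absent or blank in an entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field, "").strip()]


library = {
    "crm-beginner-choice": {
        "intent": "choose between two tools",
        "scenario": "beginner marketing manager at a small B2B company",
        "template": "goal + constraint + context",
        "version": "1.0",
        "objective": "brand inclusion rate",
    },
    "crm-loose-question": {
        "intent": "recommend",
        "template": "What's the best CRM?",
    },
}

problems: dict[str, list[str]] = {}
for name, entry in library.items():
    missing = missing_fields(entry)
    if missing:
        problems[name] = missing

print(problems)
```

Entries that fail the check are the loose questions that tend to add noise to the library.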
Over time, patterns emerge:
- Which prompts consistently surface your brand?
- Which personas trigger competitor mentions?
- Where does the model hesitate or diversify?
Those patterns only appear when your inputs are disciplined.
From Keywords to Intent-Based Test Cases
In traditional SEO, we optimised for keywords. In GEO, we optimise for intent expressions.
A keyword like “best CRM” is not enough. A structured prompt that simulates a real buying decision is far more powerful.
This shift aligns with broader industry commentary on search evolution, including perspectives shared on platforms such as Search Engine Land.
GEO is not about ranking for fragments. It is about appearing in structured decision contexts.
Ground Everything
Grounding is the process of linking LLM responses to real-world data and information. If you simply hand a task to an LLM without supporting material, it will often hallucinate. To reduce that risk, I ground each task in:
- The website and content itself
- Branding material for the website
- SEO data (query data, page metrics, backlink information and competitor data)
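As a rough sketch of what grounding can look like when assembling a task for an LLM, the function below places labelled source material ahead of the task and tells the model to stay within it. The function name, the instruction wording, and the sample sources are assumptions for illustration. They do not describe how any specific product implements grounding.

```python
def build_grounded_prompt(task: str, sources: dict[str, str]) -> str:
    """Prepend labelled source material to a task so the model can cite it."""
    if not sources:
        raise ValueError("at least one grounding source is required")
    sections = [f"[{label}]\n{text.strip()}" for label, text in sources.items()]
    return (
        "Use only the material below. If it does not cover something, say so.\n\n"
        + "\n\n".join(sections)
        + f"\n\nTask: {task}"
    )


# Hypothetical snippets standing in for page content, brand material and SEO data.
grounded = build_grounded_prompt(
    task="Suggest three buyer prompts for this page.",
    sources={
        "Page content": "Our PDF tool converts files to Word and supports editing.",
        "Brand guidelines": "Tone: plain and friendly. Audience: beginners.",
        "SEO data": "Top query: convert pdf to word.",
    },
)
print(grounded)
```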
Final Thoughts
Reliable GEO insights do not come from clever phrasing. They come from disciplined design.
I treat every prompt as a test case:
- Realistic
- Specific
- Repeatable
I include a clear goal, defined constraints, and explicit context. I keep templates consistent. I version changes. I avoid unnecessary structural edits.
When you approach prompts this way, your visibility data becomes meaningful. Trends become interpretable. Optimisation becomes strategic rather than reactive.
In GEO, measurement quality starts with prompt quality. If you control the input, you can trust the insight.