Tag: LLM Tracking

How to Write GEO Prompts for Reliable LLM Insights
A big part of my work at Adobe is working with customers to ensure that Adobe LLM Optimizer delivers value for them. This often involves my team and me auditing the prompts in their accounts. Based on the hundreds of customer meetings I’ve had, I’d say that 90% of them don’t quite understand the change from SEO queries to LLM prompts.

Therefore, I’ve been investing my time in trying to answer the one question that always comes up when I speak to marketing teams about GEO:

How do you write prompts that give you reliable, repeatable insights?

“If you are measuring visibility inside large language models, your prompts are not casual questions. They are instruments. They shape the data you collect. If they are vague or inconsistent, your results will drift. If they are precise and structured, your results become stable and meaningful.” – Flavio Longato

In my experience working with GEO, the key is simple: treat prompts like test cases.

Think of Prompts Like Test Cases

Most marketers still treat prompts like search queries. That’s usually where GEO measurement starts to break down.

A prompt is closer to a test case than a keyword. A good test case is realistic, specific, and repeatable. If the input changes too much, you don’t know whether the output changed because of the system or because of the test itself.

Recent research highlights why this matters more than many teams expect. A large study from SparkToro ran identical prompts across multiple AI systems and compared the brand recommendations returned each time. Even when nothing changed in the input, the results were highly inconsistent. Brand lists shifted, ordering changed, and sometimes completely different companies appeared. In many scenarios, there was less than a 1% chance of receiving the same set of brands twice.

This doesn’t mean AI visibility is unreliable. It means the input needs stronger structure.

When a prompt is too broad, the model has many valid directions it can explore. One run might emphasise pricing, another might focus on features, and a third might lean on brand familiarity. From a GEO perspective, that variability looks like ranking movement, but in reality it’s just different reasoning paths.

That’s why I recommend building prompts with three consistent elements:
- A clear goal – what the user is trying to achieve
- A constraint – experience level, budget, region, or use case
- Context – comparison framing or requirements
For example:

Instead of:
“Best PDF software”

Use:
“I need a PDF tool for a beginner that lets me convert to Word and edit files. I’m comparing two options and want something simple.”

The second version behaves like a controlled experiment. It narrows interpretation and reduces randomness across runs.

The SparkToro findings reinforce this approach. Their data suggests that tracking visibility across repeated, structured prompts is far more reliable than evaluating a single response or focusing only on position. Brands that appear consistently across many executions are more likely to be part of the model’s core consideration set.

Consistency doesn’t come from the model. It comes from the way you design the prompt.
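One way to make "appears consistently across many executions" measurable is to count, for each brand, the share of runs in which it is mentioned. The sketch below assumes you have already extracted a list of brand names from each response; the function name `appearance_rates` and the sample data are hypothetical and are not taken from the SparkToro study or from any tool.

```python
from collections import Counter

def appearance_rates(runs):
    """Return the share of runs in which each brand is mentioned at least once."""
    if not runs:
        return {}
    counts = Counter()
    for brands in runs:
        counts.update(set(brands))  # count a brand once per run, even if repeated
    return {brand: counts[brand] / len(runs) for brand in counts}

# Hypothetical brand lists from four executions of the same structured prompt.
runs = [
    ["Brand A", "Brand B", "Brand C"],
    ["Brand B", "Brand A"],
    ["Brand A", "Brand D"],
    ["Brand C", "Brand A", "Brand B"],
]
rates = appearance_rates(runs)
for brand, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{brand}: {rate:.0%}")
```

A brand that shows up in every run is a stronger signal than one that ranked first in a single response, which is why the rate ignores position entirely.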

Why Treat It as a Software Test

In software testing, a good test case is:
- realistic
- specific
- repeatable
The same logic applies to GEO. A realistic test case mirrors real user behaviour. It reflects how someone would genuinely ask for help. A specific test case defines intent clearly. A repeatable test case produces consistent outputs when run under the same conditions.

If your prompt is too broad, the AI assistant has too many valid directions it can take. Each direction may be reasonable, but your visibility measurement becomes unstable. One day your brand appears. The next day it does not. Nothing meaningful has changed except the interpretation space.

That is not a visibility shift. It is measurement noise.
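To tell noise from a real shift, it helps to put a number on how much the brand set moves between runs of the same prompt. A minimal sketch, assuming each run has been reduced to a list of brand names: the mean pairwise Jaccard overlap is one possible stability score, chosen here for illustration rather than drawn from any GEO tool. All names and data are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Overlap of two brand sets: 1.0 means identical, 0.0 means disjoint."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def stability(runs):
    """Mean Jaccard overlap across every pair of runs of the same prompt."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical outputs: a broad prompt versus a structured one.
broad_runs = [["A", "B"], ["C", "D"], ["A", "D"]]
structured_runs = [["A", "B"], ["A", "B"], ["A", "B", "C"]]

print(round(stability(broad_runs), 2))
print(round(stability(structured_runs), 2))
```

If a prompt scores low on a day when nothing else changed, a brand dropping out of one response says little about visibility.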

Why Broad Prompts Create Random Visibility

When a prompt lacks structure, the model fills in the gaps. It guesses intent. It assumes context. It selects one of many possible frames.

For example:
- “What’s the best CRM?”
- “How should I improve my marketing?”
- “Which tool is better?”
Each of these prompts is valid. Each has multiple reasonable answers. But from a GEO perspective, they are weak test cases. The output can vary based on subtle sampling changes, updates in model training, or shifting internal weighting.

Your visibility score becomes volatile because the prompt itself is unstable.

As discussed in industry research around generative search and model behaviour, consistency of input is essential for consistency of output. This principle also appears in discussions about large language model evaluation in resources such as OpenAI research publications.

The Structure of a Reliable GEO Prompt

In practice, reliable prompts usually include three elements:

1. A Clear Goal

The assistant needs to know what it is helping with.

Examples:
- “Help me choose”
- “Recommend the best option”
- “Compare these two tools”
- “Rank these solutions”
Without a clear goal, the model may default to explanation rather than decision support.

2. A Constraint

Constraints narrow the solution space. They reduce ambiguity.

Examples:
- “For a beginner”
- “For a small B2B marketing team”
- “With a limited budget”
- “For an e-commerce company in Switzerland”
Constraints anchor the response to a defined persona or situation. This increases repeatability because the model does not need to infer who the user is.

3. Context

Context defines the frame of comparison.

Examples:
- “I am comparing HubSpot and Pipedrive.”
- “I need email automation and CRM integration.”
- “We have five employees and no technical team.”
When context is explicit, the assistant does not need to guess requirements. Fewer assumptions lead to more stable outputs.

In short, a strong GEO prompt looks like this:

“Help me choose between Tool A and Tool B for a beginner marketing manager at a small B2B company. We need CRM integration and simple reporting.”

That is a test case. It is realistic. It has intent. It has constraints. It has context. It can be run again and compared over time.
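The three elements can be kept as separate fields and assembled the same way every time. The following is only a sketch: the `GeoPrompt` class and its `render` method are hypothetical names, and the sentence pattern is one possible layout, not a required one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeoPrompt:
    goal: str        # what the user is trying to achieve
    constraint: str  # persona, budget, region, or use case
    context: str     # comparison framing or requirements

    def render(self) -> str:
        """Assemble the prompt, refusing to run if any element is empty."""
        for name in ("goal", "constraint", "context"):
            if not getattr(self, name).strip():
                raise ValueError(f"missing {name}")
        return f"{self.goal} {self.constraint}. {self.context}"

prompt = GeoPrompt(
    goal="Help me choose between Tool A and Tool B",
    constraint="for a beginner marketing manager at a small B2B company",
    context="We need CRM integration and simple reporting.",
)
print(prompt.render())
```

Because the class is frozen, a stored prompt cannot be edited in place; changing any element means creating a new object, which fits naturally with logging a new version.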

Consistency Is More Important Than Creativity

In SEO, creativity can help content stand out. In GEO measurement, creativity can damage reliability.

If you rewrite your prompts every week, you are not tracking model visibility. You are testing new scenarios.

I recommend using a consistent template. For example:
- Goal: Job to Be Done (JTBD) of the page
- Persona: Who is the target audience of the brand, site and landing page
- Constraints: What friction point is this page trying to resolve
- Comparison set: What do other competitors do on similar pages
By keeping the structure stable, you isolate changes. If results shift, you can more confidently attribute that shift to model behaviour rather than prompt variation.

This is especially important when measuring brand inclusion or ranking within generative responses, a topic increasingly discussed in the context of generative engine optimisation.

Version Your Prompts

Over time, your understanding of GEO will improve. Your prompts will evolve. That is normal. But evolution must be controlled.

I always recommend versioning prompts. Keep a simple log:
- Prompt v1.0 – Initial baseline
- Prompt v1.1 – Added constraint
- Prompt v1.2 – Refined persona
- Prompt v2.0 – New comparison set
When visibility changes, you can check whether:
- The model changed
- Your configuration changed
- The prompt changed
Without versioning, you lose traceability.
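A version log only gives you traceability if each entry records when it took effect. The sketch below assumes a simple in-memory log; the `PromptVersion` class, the helper names, and every date are hypothetical, while the version labels come from the log above.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class PromptVersion:
    version: str
    effective_from: date
    note: str

def version_on(log, day):
    """Return the version that was active on a given day, or None."""
    active = [v for v in log if v.effective_from <= day]
    if not active:
        return None
    return max(active, key=lambda v: v.effective_from)

def prompt_changed_between(log, start, end):
    """True if a different prompt version was active at the two dates."""
    return version_on(log, start) != version_on(log, end)

# Dates are illustrative only.
log = [
    PromptVersion("1.0", date(2026, 1, 5), "Initial baseline"),
    PromptVersion("1.1", date(2026, 1, 19), "Added constraint"),
    PromptVersion("1.2", date(2026, 2, 2), "Refined persona"),
    PromptVersion("2.0", date(2026, 2, 16), "New comparison set"),
]

print(version_on(log, date(2026, 1, 25)).version)
print(prompt_changed_between(log, date(2026, 1, 30), date(2026, 2, 5)))
```

When visibility moves between two dates, a `False` result rules out the prompt as the cause and leaves the model or the configuration to check.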

This approach mirrors good experimental practice. In evaluation frameworks such as those discussed by Google AI research, reproducibility is central. GEO should follow the same discipline.

Avoid Frequent Structural Changes

There is another practical issue: historical comparability.

If you continuously add and delete topics, entities, or comparison options in your GEO tracking setup, your visibility baseline shifts. You may see score drops that are not performance issues, but structural changes.

For example:
- Adding new competitors changes ranking distribution.
- Removing requirements alters response framing.
- Switching persona definitions shifts relevance weighting.
When you make large structural edits, treat them as a new measurement phase. Do not compare them blindly to old data.

Stable input produces stable trend lines.
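Treating a structural edit as a new measurement phase can be as simple as splitting the score history at the date of the change and summarising each side separately. A minimal sketch, assuming a daily visibility score expressed as a percentage; the function name, the dates, and the scores are all hypothetical.

```python
from bisect import bisect_right
from datetime import date
from statistics import mean

def phase_means(scores, change_dates):
    """Average the score within each phase between structural changes.

    scores: list of (day, score) pairs.
    change_dates: days on which a structural edit took effect.
    Phase 0 is everything before the first change.
    """
    cuts = sorted(change_dates)
    phases = {}
    for day, score in scores:
        phase = bisect_right(cuts, day)  # a change day belongs to the new phase
        phases.setdefault(phase, []).append(score)
    return {phase: mean(values) for phase, values in sorted(phases.items())}

# Illustrative data: two competitors were added to the comparison set on 3 February.
scores = [
    (date(2026, 2, 1), 60),
    (date(2026, 2, 2), 64),
    (date(2026, 2, 3), 40),
    (date(2026, 2, 4), 44),
]
result = phase_means(scores, [date(2026, 2, 3)])
print(result)
```

Here the drop between phases coincides with the edit, so it should be read as a new baseline rather than as a loss of performance.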

Build a Prompt Library

In my work, I build a prompt library rather than a loose collection of questions. Each prompt:
- Has a defined intent
- Targets a clear user scenario
- Uses consistent structure
- Is version controlled
- Is tied to a measurement objective
This transforms GEO from experimentation into systematic analysis.
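The five properties above can be enforced with a small audit that flags incomplete entries before they are used for tracking. This is a sketch under assumptions: the field names, the prompt IDs, and the dictionary layout are hypothetical and do not reflect how any particular tool stores prompts.

```python
REQUIRED_FIELDS = ("intent", "scenario", "template", "version", "objective")

def missing_fields(entry):
    """List the required fields that are absent or blank in one library entry."""
    return [name for name in REQUIRED_FIELDS if not str(entry.get(name, "")).strip()]

def audit_library(library):
    """Map each incomplete prompt ID to the fields it still needs."""
    problems = {}
    for prompt_id, entry in library.items():
        missing = missing_fields(entry)
        if missing:
            problems[prompt_id] = missing
    return problems

# Illustrative library with one complete and one incomplete entry.
library = {
    "crm-beginner": {
        "intent": "Help me choose",
        "scenario": "Beginner marketing manager at a small B2B company",
        "template": "goal + constraint + context",
        "version": "1.1",
        "objective": "Brand inclusion rate",
    },
    "crm-broad": {
        "intent": "Recommend the best option",
        "version": "1.0",
    },
}
problems = audit_library(library)
print(problems)
```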

Over time, patterns emerge:
- Which prompts consistently surface your brand?
- Which personas trigger competitor mentions?
- Where does the model hesitate or diversify?
Those patterns only appear when your inputs are disciplined.

From Keywords to Intent-Based Test Cases

In traditional SEO, we optimised for keywords. In GEO, we optimise for intent expressions.

A keyword like “best CRM” is not enough. A structured prompt that simulates a real buying decision is far more powerful.

This shift aligns with broader industry commentary on search evolution, including perspectives shared on platforms such as Search Engine Land.

GEO is not about ranking for fragments. It is about appearing in structured decision contexts.

Ground Everything

Grounding is the process of ensuring that LLM responses are linked to real-world data and information. If you simply hand an LLM a task without supporting data, it will often hallucinate. To reduce that risk, I ground each task in:
- The website and content itself
- Branding material for the website
- SEO data (query data, page metrics, backlink information and competitor data)

Final Thoughts

Reliable GEO insights do not come from clever phrasing. They come from disciplined design.

I treat every prompt as a test case:
- Realistic
- Specific
- Repeatable
I include a clear goal, defined constraints, and explicit context. I keep templates consistent. I version changes. I avoid unnecessary structural edits.

When you approach prompts this way, your visibility data becomes meaningful. Trends become interpretable. Optimisation becomes strategic rather than reactive.

In GEO, measurement quality starts with prompt quality. If you control the input, you can trust the insight.

February 17, 2026

Bing AI Performance Report: GEO Impact Analysis

Microsoft has introduced a new AI Performance Report inside Bing Webmaster Tools. In my view, this marks one of the first real steps toward measuring visibility in AI experiences, not just traditional search rankings.

In this article, I want to summarise what I explained in my video: what the report shows, how the metrics work, where the gaps still are, and why this matters if you care about Generative Engine Optimization (GEO).

Why Microsoft Released the AI Performance Report

For years we measured success using clicks, impressions, and rankings. That model starts to break down once content is summarized directly inside Copilot answers or AI summaries.

The new report introduces an analytics layer focused on AI citations instead of classic SERP performance.

From my perspective, the goal is clear:

- Help publishers understand when their content is used as a source in AI answers
- Provide visibility into which topics trigger citations
- Move measurement closer to influence, not just traffic

Microsoft describes this as giving publishers insight into how content appears across AI experiences within Bing.

What the New Metrics Actually Mean

Inside the report, there are a few core metrics that matter.

Total Citations

This shows how often pages from your website appear as sources in AI-generated responses during a selected time period.

This is not a ranking signal and it is not traffic. It is simply confirmation that Bing’s AI systems referenced your content.

Average Cited Pages

This metric represents the average number of unique pages cited per day.

I see this as a rough indicator of topical depth. If more pages are cited, it often means Bing recognizes broader authority around a subject.
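As a worked example of the arithmetic, the sketch below computes an average of unique cited pages per day from a list of (day, URL) citation rows. It follows the definition as worded above and is not Bing's implementation; the row format, the function name, and the data are hypothetical. One assumption to note: days with no citations do not appear in the rows, so they are left out of the average.

```python
from collections import defaultdict

def average_cited_pages(rows):
    """Average number of unique URLs cited per day, over days with citations."""
    pages_by_day = defaultdict(set)
    for day, url in rows:
        pages_by_day[day].add(url)  # repeated citations of a URL count once per day
    if not pages_by_day:
        return 0.0
    return sum(len(pages) for pages in pages_by_day.values()) / len(pages_by_day)

# Illustrative rows: three citations on the first day, one on the second.
rows = [
    ("2026-02-01", "/guide"),
    ("2026-02-01", "/pricing"),
    ("2026-02-01", "/guide"),
    ("2026-02-02", "/guide"),
]
avg = average_cited_pages(rows)
print(avg)
```

Total citations for the same rows would be four, which shows why the two metrics can move independently.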

Page-Level Citation Data

You can drill down to see:

- Which URLs are cited
- How frequently they appear
- The query themes connected to those citations

One important detail: Bing does not show the actual prompts. Instead, it shows the “fan-out” search queries that likely contributed to the AI response.

The Biggest Limitation: No Prompt Data

One thing I was really hoping for was access to the actual prompts.

Right now:

- You do not see the original AI question
- You do not see click-through rate
- You do not see user engagement from the AI answer itself

Instead, Bing exposes the expanded queries derived from prompts.

This is useful, but it means analysts still need to reverse-engineer intent rather than measure it directly.

How This Differs From Traditional Search Performance

Here is how I personally separate the two reporting models.

Classic Search Performance	AI Performance Report
Focus on clicks and rankings	Focus on citations
Measures SERP behavior	Measures AI usage
Keyword-driven analysis	Prompt fan-out analysis
Visibility tied to traffic	Visibility tied to influence

In short, we are moving from measuring “Did someone click?” to “Was my content used as a source?”

That is a major shift in how discovery works.

Why Citations Matter Even Without Clicks

One of the key points I make in the video is that influence now happens even when there is no visit.

If your content is cited:

- Your brand or expertise shapes the answer
- Your information influences user decisions
- But analytics may show zero traffic

This is exactly why GEO is becoming critical. Visibility is no longer limited to blue links.

How This Connects to Adobe LLM Optimizer and GEO Workflows

Even with this new report, I still see tools like Adobe LLM Optimizer as highly relevant.

Why?

Because Bing still does not provide:

- Prompt data
- Cross-platform visibility (ChatGPT, Gemini, etc.)
- Deep competitive insights

In my opinion, the real opportunity is combining Bing’s citation data with:

- Log file analysis
- Prompt simulations
- LLM monitoring tools

My team is already exploring how to ingest these grounded queries and use them to better understand prompt behavior.

Practical Takeaways From the Report

If you are working on GEO or AI visibility, here is how I would approach this new data:

- Identify URLs with high citation counts and expand those topic clusters.
- Look at fan-out queries to understand how prompts branch into multiple searches.
- Compare citation activity with crawl logs to validate AI usage patterns.
- Treat citations as an influence metric, not a traffic metric.
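The first and third takeaways can be sketched with two small helpers: one ranks URLs by citation count, the other lists cited URLs that show no crawler hits in the same period. Both inputs are assumed to be plain dictionaries you have built yourself from a report export and from your own log analysis; the function names, URLs, and numbers are hypothetical.

```python
def top_cited(citations, n=3):
    """Return the n most cited URLs, most cited first."""
    return sorted(citations, key=citations.get, reverse=True)[:n]

def cited_but_not_crawled(citations, crawl_hits):
    """Return cited URLs with no recorded crawler hits, most cited first."""
    missing = [
        url for url, count in citations.items()
        if count > 0 and crawl_hits.get(url, 0) == 0
    ]
    return sorted(missing, key=lambda url: citations[url], reverse=True)

# Illustrative counts for the same date range.
citations = {"/pricing": 40, "/guide": 25, "/blog/post": 5}
crawl_hits = {"/pricing": 120, "/blog/post": 0}

print(top_cited(citations, n=2))
print(cited_but_not_crawled(citations, crawl_hits))
```

A mismatch is a prompt for investigation rather than a finding: the sketch cannot tell you whether the logs are incomplete, the date ranges differ, or the content was fetched earlier.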

What This Report Does Not Cover (Yet)

It is important to set expectations.

Right now the report only reflects:

- Bing Copilot and Bing AI experiences
- Bing’s own ecosystem

It does not include:

- ChatGPT
- Perplexity
- Gemini
- Other LLM platforms

So while it is a big step forward, it is still just one piece of the AI visibility puzzle.

My Conclusion

I see this release as the first official GEO-style reporting feature from a major search platform.

It shows that measurement is shifting away from rankings and toward AI usage and citations.

But we are still early.

Without prompts, cross-platform data, or CTR visibility, we need to combine this report with external tooling and deeper analysis.

Still, this is a strong signal of where search analytics is heading next.

February 11, 2026
