Why AI Answers Change Every Time (and How to Track GEO/AIO Progress Anyway)

If you’ve ever asked ChatGPT (or Google’s AI Overviews, or Claude) the exact same question twice and gotten different recommendations, you’re not imagining things.

This is one of the biggest reasons GEO (Generative Engine Optimization) and “AI Optimization” (AIO) are confusing for business owners right now. In traditional SEO, you can at least try to measure progress with rankings. In AI-driven answers, the output can shift run to run, even when nothing about your business changed.

The good news: that doesn’t mean GEO/AIO is pointless. It means you need a different way to think about visibility, and a smarter way to measure progress.

Let’s break down why AI answers change, what SparkToro’s research found, and a practical tracking framework you can use without chasing ghosts.

The SparkToro finding that should reset expectations

SparkToro ran a large study to test how consistent AI tools are when giving lists of brand or product recommendations. Their volunteers ran 12 prompts across ChatGPT, Claude, and Google’s AI experiences nearly 3,000 times, then normalized the recommendations into comparable lists. 

The punchline: these AI tools were highly inconsistent.

  • SparkToro found there’s less than a 1 in 100 chance that ChatGPT or Google’s AI will produce the same list of recommended brands twice across repeated runs. 
  • Getting the same list in the same order was even less likely, closer to 1 in 1,000 in their analysis. 
  • The number of recommendations also varied (sometimes 2–3, sometimes 10+), which makes “ranking position” even less meaningful. 

That one study explains why so many business owners feel like GEO/AIO is slippery: if you test “Am I recommended?” once or twice, you can convince yourself of almost anything.

So what do you do instead?

SparkToro’s key takeaway is the right mental model: don’t track “rank.” Track “appearance rate.” They show that even when rankings and order are chaotic, some brands appear in a high percentage of runs for a given intent, and that percentage can be meaningful. 

Why AI answers change (even when you don’t change anything)

There are multiple reasons AI tools don’t behave like traditional search rankings. Some are “by design,” and some are “because the world changes.”

1.) Language models aren’t built to repeat themselves perfectly

Many AI systems generate text using probabilistic decoding (sampling). Even small differences in how the system selects the next token can lead to different paths and different lists.

OpenAI explicitly notes that when randomness is present (temperature above 0), variation is expected, and even at temperature 0 you can still see differences in certain situations. 

Also, “deterministic settings” in real-world inference aren’t always truly deterministic because of implementation details (hardware, floating point math, parallelism, tie-breaking). 
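To make the first point concrete, here is a minimal illustration of temperature sampling, the mechanism behind most of this variation. The brand names and probabilities are hypothetical; real models work over tokens, not whole brand names, but the effect is the same: identical inputs, different outputs on repeated runs.

```python
import math
import random

def sample_next(token_probs, temperature=0.8, rng=random):
    """Sample one item from a probability distribution after
    rescaling it with a temperature parameter (higher = more random)."""
    # Rescale log-probabilities by temperature, then re-normalize (softmax).
    logits = {t: math.log(p) / temperature for t, p in token_probs.items()}
    max_logit = max(logits.values())
    exp = {t: math.exp(l - max_logit) for t, l in logits.items()}
    total = sum(exp.values())
    probs = {t: e / total for t, e in exp.items()}
    # Draw one item according to the rescaled distribution.
    r = rng.random()
    cumulative = 0.0
    for token, p in probs.items():
        cumulative += p
        if r < cumulative:
            return token
    return token  # fallback for floating-point rounding

# Hypothetical "next recommendation" distribution for one prompt.
probs = {"Acme": 0.45, "BrandCo": 0.35, "OtherInc": 0.20}
print([sample_next(probs) for _ in range(10)])  # mix of brands, varies per run
```

Note that even the second-most-likely option gets picked regularly at normal temperatures, which is exactly why the “same” question can surface different brand lists.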

2.) Retrieval and sources can change between runs

Some tools use retrieval-augmented generation (RAG): they fetch or reference outside content and then generate an answer. If the retrieved set shifts (even slightly), the list shifts.

SparkToro also points out that recommendation ordering is not a stable “best-to-worst ranking,” but more like a probabilistic selection and ordering from a candidate pool. 

3.) The system is constantly being updated

Models and ranking systems get updated, safety filters evolve, and the underlying web changes daily. Google’s AI experiences, for example, have been expanding and changing presentation and sourcing over time. 

4.) Context and personalization factors creep in

Location, previous context, or subtle differences in phrasing can shift results. Even when you think you’re using the “same prompt,” tiny changes like punctuation, added context, or even the way the UI frames the query can alter the output.

That’s why trying to “lock in” one exact prompt is usually a dead end.

The GEO/AIO measurement mistake: treating AI like a “10 blue links” leaderboard

A lot of early “AI visibility” talk borrowed SEO language:

  • “What position are we in ChatGPT?”
  • “Are we #1 in AI Overviews?”
  • “How do we outrank competitors in Perplexity?”

SparkToro’s data strongly suggests that ranking position is a trap for recommendations. The results aren’t stable enough, and the list length varies too much for “#1” to mean what people want it to mean. 

Instead, the goal should be closer to this:

Increase the probability that your brand appears (and is described accurately) across a set of relevant intents, across multiple tools, over time.

That’s not as sexy as “#1,” but it’s far more measurable and far more real.

A practical way to track GEO/AIO progress (without losing your mind)

Here’s a tracking framework you can implement with a spreadsheet and consistency.

Step 1: Define 8–15 “buyer intent” prompts (not just one)

Pick prompts that reflect the real questions your customers ask, across the funnel.

Examples (adjust to your niche):

  • “What’s the best [service] company in [city] for [type of customer]?”
  • “Who are the top [category] agencies for [industry]?”
  • “Recommend a [service] provider that specializes in [specific problem].”
  • “What should I look for when hiring a [service provider]?”

Include:

  • 2–4 “near me / local” prompts
  • 2–4 “best / top” prompts
  • 2–4 “comparison / alternatives” prompts
  • 2–3 “how to choose” prompts

This matters because AI answers change, but intent categories give you a stable structure to measure against.

Step 2: Test in batches, not one-offs

If you only run each prompt once, you’re basically flipping a coin.

SparkToro’s research ran each prompt 60–100 times per model to understand variance. You don’t need to go that far to get value, but you do need repetition.

A realistic approach for an SMB:

  • Run each prompt 5 times
  • Across 2–3 tools (example: ChatGPT + Google AI + one other you care about)
  • Repeat weekly or biweekly

That gives you trend lines without turning measurement into a full-time job.

Step 3: Track “appearance rate,” not rank

For each prompt batch, track:

A.) Appearance rate (Visibility %)

  • Did your brand appear anywhere in the recommendations? (Yes/No)
  • Over 5 runs, that becomes a score of 0%, 20%, 40%, 60%, 80%, or 100%

This aligns with SparkToro’s conclusion that visibility percentage across repeated runs is more meaningful than rank order. 

B.) Position band (optional, simplified)
Instead of exact rank, use bands:

  • Mentioned in the first 3
  • Mentioned in positions 4–7
  • Mentioned at position 8+

It’s less brittle than “we were #2” and still captures whether you’re trending upward.

C.) Description accuracy
Score it quickly:

  • Accurate
  • Mostly accurate
  • Mixed/confusing
  • Wrong

This is huge. Being recommended with the wrong positioning can attract the wrong leads (or create trust issues).
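If you log each run’s recommendation list, the scores above take only a few lines to compute. A minimal sketch of the appearance-rate and position-band scoring (the brand names and run data are hypothetical):

```python
def appearance_rate(runs, brand):
    """Percentage of runs in which the brand appears anywhere in the list."""
    hits = sum(1 for rec_list in runs if brand in rec_list)
    return 100 * hits / len(runs)

def position_band(rec_list, brand):
    """Coarse position band instead of an exact (and unstable) rank."""
    if brand not in rec_list:
        return "not mentioned"
    pos = rec_list.index(brand) + 1
    if pos <= 3:
        return "first 3"
    if pos <= 7:
        return "4-7"
    return "8+"

# Five hypothetical runs of the same prompt in one tool.
runs = [
    ["Acme", "BrandCo", "OtherInc"],
    ["BrandCo", "Acme"],
    ["OtherInc", "BrandCo", "ThirdCo", "Acme"],
    ["BrandCo", "OtherInc"],
    ["Acme", "ThirdCo"],
]

print(appearance_rate(runs, "Acme"))   # 80.0 (appeared in 4 of 5 runs)
print(position_band(runs[2], "Acme"))  # "4-7" (4th in that run's list)
```

The same numbers drop straight into a spreadsheet column per prompt, which is all the tooling most SMBs need.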

Step 4: Add a “share of voice” view

If you can, track:

  • How often do you appear compared to 3–5 competitors?

You’re not measuring one output. You’re measuring how often you show up in the AI tool’s “consideration set.”
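A simple way to sketch that consideration-set view, assuming the same kind of per-run recommendation logs (all names and numbers here are hypothetical):

```python
def share_of_voice(runs, brands):
    """For each tracked brand, the percentage of runs that mention it."""
    return {
        brand: round(100 * sum(1 for r in runs if brand in r) / len(runs), 1)
        for brand in brands
    }

# Four hypothetical runs of one prompt, you vs. two competitors.
runs = [
    ["Acme", "BrandCo"],
    ["BrandCo", "OtherInc", "Acme"],
    ["BrandCo"],
    ["Acme", "OtherInc"],
]
print(share_of_voice(runs, ["Acme", "BrandCo", "OtherInc"]))
# {'Acme': 75.0, 'BrandCo': 75.0, 'OtherInc': 50.0}
```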

Step 5: Tie it back to business outcomes (or you’ll optimize the wrong thing)

GEO/AIO metrics are leading indicators. What you actually want is:

  • More qualified inquiries
  • Better-fit leads
  • Higher close rates
  • Lower friction in the sales process

So pair your GEO/AIO tracking with:

  • Organic lead volume (forms/calls)
  • Lead quality notes (CRM)
  • Sales cycle length
  • Branded search volume trends (if you track that)

If AI visibility increases but lead quality drops, something in your messaging or positioning may be off.

What to do when your results are “inconsistent” (they will be)

Here’s the key mindset shift:

You’re not trying to win one prompt.
You’re trying to become the obvious answer across the web.

That usually comes down to four buckets:

1.) Clarity (make it easy to understand what you do)

AI tools summarize what they can parse.

  • Clear service pages
  • Strong internal linking
  • Explicit “who we help / where we work / what we specialize in”

2.) Consistency (same story everywhere)

Your site, your listings, your profiles, and third-party mentions should reinforce the same entity information:

  • Name/brand consistency
  • Service categories
  • Locations
  • Proof points (certifications, years, awards)

3.) Credibility (the web has to back you up)

AI systems tend to be more confident recommending entities with stronger “trust signals,” which often includes off-site corroboration:

  • Reviews
  • Associations
  • Press / local features
  • Partner mentions
  • Case studies and named results (where appropriate)

4.) Coverage (answer the questions your buyers ask)

GEO/AIO isn’t only about “keywords.”
It’s about being a great source for the kinds of questions that trigger recommendations.

If you publish helpful comparison and “how to choose” content, you give AI systems more surfaces to pull from and cite.

FAQ: the two questions everyone asks after reading this

Does GEO/AIO guarantee we’ll show up in AI answers every time?

No. SparkToro’s research shows recommendation lists are highly variable, even when prompts are identical, so the realistic goal is improving your likelihood of appearing across many runs and many prompts, not “locking in” a single query. 

How fast can we see results?

Some improvements (better clarity, better on-site structure) can help quickly. But because AI answers vary and the ecosystem updates constantly, the most reliable way to evaluate progress is over multiple runs and multiple weeks, using appearance rate trends rather than one-time snapshots. 

The bottom line

AI answers change because the systems are probabilistic, dynamic, and constantly evolving. SparkToro’s data makes it clear that trying to measure GEO & AIO like SEO rankings is a recipe for confusion. 

But if you track GEO/AIO the right way (appearance rate across a prompt set, measured over repeated runs and tied to real business outcomes), you can absolutely make progress and prove it.



About Us

Johnny Flash Productions

Johnny Flash Productions is a creative agency based outside of Washington D.C. that focuses on digital strategy, web design and development, graphic design and event production that helps businesses get better results from their marketing.