AI Visibility Tracking Is Broken: Here Is What to Measure Instead

AI visibility tracking as most teams practice it is statistically invalid. If you ask ChatGPT the same question 100 times, you will get the same list of brands roughly once at best, and the same list in the same order almost never. That is not a flaw in the tracking tools. It is a fundamental property of how large language models work.

Yet the market for AI visibility tracking is already estimated at over $100 million per year, according to Search Engine Land. Brands are spending serious money on dashboards that show them “rankings” in AI responses. Rankings that change every time you run the same prompt.

This is not an anti-tracking argument. Tracking AI visibility is essential. The problem is that most teams are measuring the wrong thing, the wrong way, and drawing conclusions that fall apart under statistical scrutiny.

Here is what the data actually says about AI consistency, why traditional ranking metrics fail, and what you should measure instead.

The SparkToro Study That Should Change Everything

In early 2026, Rand Fishkin of SparkToro and Patrick O’Donnell of Gumshoe.ai published research that should be required reading for every marketing team investing in GEO. They recruited 600 volunteers to run 12 different prompts through ChatGPT, Claude, and Google AI (Overviews and AI Mode). The volunteers ran the prompts a combined 2,961 times and recorded every response.

The findings were brutal for anyone selling “AI rank tracking.”

Fewer than 1 in 100 runs produced the same list of brands. ChatGPT and Google AI returned identical brand lists in fewer than 1% of runs. Claude was slightly more consistent but still below 2%. When you factor in ordering, the odds of getting the same list in the same sequence drop to roughly 1 in 1,000.

Let that sink in. If your AI visibility tool tells you that you “rank #3 for [your keyword] in ChatGPT,” that ranking is a single data point from a distribution so wide that the next run might not include you at all. Or might put you first.

Fishkin put it plainly: if you do not like an AI answer about your brand, just ask a few more times. You will get a different answer.

Why AI Rankings Are Not Search Rankings

Traditional search rankings work as a metric because Google is deterministic enough to make them useful. The same query from the same location generally returns the same results. Position 1 means position 1. Position 3 means position 3. The variance exists but is bounded.

AI models do not work this way. They generate answers probabilistically. Each response is a fresh synthesis drawn from the model’s training data, weighted by context, system prompts, and random sampling. There is no fixed index. There is no stable ordering algorithm.

Three structural problems make AI “rankings” unreliable:

1. List composition changes every run. The same prompt about “best project management tools” might return Asana, Monday, and Trello on one run, then Notion, ClickUp, and Basecamp on the next. The pool of candidates is drawn from a large set, and the selection varies with each generation.

2. List length varies unpredictably. Sometimes ChatGPT gives you 3 recommendations. Sometimes 10. You cannot compare position 2 in a 3-item list to position 2 in a 10-item list. They mean completely different things.

3. Sentiment and framing vary. Even when the same brand appears, the description changes. One run calls your product “a solid choice for small teams.” The next calls it “enterprise-focused but complex.” The position does not capture this nuance at all.

This is why treating AI citations like Google rankings is a category error. You are measuring a probability distribution with a deterministic metric.

What the Smartest Teams Actually Track

Despite the chaos, AI visibility is absolutely trackable. You just need to use the right metrics and enough data. The SparkToro research itself hints at the solution: individual rankings are noise, but visibility percentages over many runs are signal.

Here is the framework that works, based on what high-maturity GEO teams are doing in 2026.

Metric 1: Share of Visibility

Instead of tracking your “position” in a single AI response, track how often your brand appears across a large enough sample of runs. If you run 50 variations of “best [your category] tool” across ChatGPT, Perplexity, and Gemini, and your brand appears in 22 of those responses, your share of visibility is 44%.

This is analogous to share of voice in traditional media monitoring, and it works for the same reason: it smooths out the randomness of any single data point into a meaningful trend.

Share of visibility has three properties that make it useful:

It is directionally accurate. If your share goes from 15% to 35% over three months, something you did worked.
It is comparable across brands. You can benchmark against competitors meaningfully.
It accounts for list variability. Whether the AI returns 3 brands or 10, your presence or absence is binary per run.
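
As a minimal sketch of the calculation described above, the following Python counts the runs in which a brand is mentioned at least once. The function name `share_of_visibility`, the brand “Acme”, and the sample responses are illustrative assumptions, and whole-word matching is only one simple way to detect a mention.

```python
import re


def share_of_visibility(responses, brand):
    """Return the fraction of responses that mention `brand` at least once.

    Presence is binary per run, so list length and position are ignored.
    """
    if not responses:
        raise ValueError("need at least one response")
    # Whole-word, case-insensitive match (a simplifying assumption).
    pattern = re.compile(r"\b" + re.escape(brand) + r"\b", re.IGNORECASE)
    hits = sum(1 for text in responses if pattern.search(text))
    return hits / len(responses)


# Hypothetical data: 50 runs, 22 of which mention the made-up brand "Acme".
responses = ["Try Acme or Globex."] * 22 + ["Try Globex or Initech."] * 28
print(f"{share_of_visibility(responses, 'Acme'):.0%}")  # 44%
```

Because each run contributes only a yes or a no, a 3-item answer and a 10-item answer count the same way.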

Metric 2: Citation Sentiment

Not all AI mentions are equal. A brand mentioned with positive framing is worth more than one mentioned as an afterthought. Track the sentiment and context of each citation.

Conductor’s 2026 CMO investment report, based on surveys of 250+ enterprise leaders, found that conversions, brand sentiment, and AI search market share are replacing traffic as the KPIs that matter. Aleyda Solis, one of the report’s expert commentators, noted that brand sentiment is a key differentiator: you might get mentioned in an answer, but if it is negative, the impact on your brand is not positive.

This is harder to automate than simple presence detection, but it is far more valuable. A mention that says “Brand X is the leading solution for [use case]” drives different outcomes than “Brand X also exists in this space.”

Metric 3: Entity Coverage Across Models

Track your visibility separately for each AI model. ChatGPT, Gemini, Perplexity, and Claude all have different training data, different weighting, and different behavior. Your visibility in one tells you nothing about your visibility in another.

The 94% of enterprises planning to increase GEO investment this year (per the Conductor report) are learning this the hard way. Teams that optimized only for ChatGPT discover they are invisible in Perplexity. Or they dominate Google AI Mode but never appear in Claude.

Cross-model coverage matters because user behavior is fragmenting across AI tools. Being the #1 recommendation in ChatGPT means nothing for the growing segment of users who default to Perplexity or Gemini.

Metric 4: Prompt Category Coverage

Instead of tracking individual keywords, track categories of prompts. For a CRM tool, the relevant categories might include: “best CRM for small business,” “Salesforce alternatives,” “CRM comparison,” “affordable CRM,” and a dozen others.

Within each category, measure your share of visibility across multiple prompt phrasings and multiple runs. This gives you a map of where you are strong and where you are absent.

SparkToro’s research on prompt variation found that real users almost never phrase AI prompts the same way, even when they have the same goal. Tracking exact-match prompts misses the vast majority of real-world queries. Category-level tracking captures the behavior that matters.

How Many Runs Do You Actually Need?

One of the practical questions teams face is sample size. How many times do you need to run a prompt before the data is reliable?

Based on the variance levels in the SparkToro study, here are rough guidelines:

For directional trends (are we going up or down?): 20 to 30 runs per prompt category per model is a reasonable minimum.
For competitive benchmarking (comparing brands): 50+ runs gives you enough data to separate signal from noise with reasonable confidence.
For granular sentiment analysis: 100+ runs per category gives you enough variation to see patterns in how the AI frames your brand.

These numbers are rules of thumb, but they are not arbitrary. They reflect the observed variance in AI responses. With a 1 in 100 chance of getting the same list twice, you need enough samples to approximate the underlying distribution. Anything below 20 runs is essentially a screenshot, not data.
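
One way to see why small samples behave like screenshots is to put a confidence interval around an observed appearance rate. The sketch below uses the Wilson score interval, a standard choice for proportions. It is an illustration added here, not the method used in the SparkToro study, and the 40% observed rate is a made-up example.

```python
import math


def wilson_interval(hits, runs, z=1.96):
    """Approximate 95% Wilson score interval for an appearance rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = hits / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Same hypothetical 40% appearance rate at three sample sizes.
for hits, runs in [(8, 20), (20, 50), (40, 100)]:
    low, high = wilson_interval(hits, runs)
    print(f"{runs:>3} runs: {low:.1%} to {high:.1%}")
```

At 20 runs, an observed 40% is compatible with a true rate anywhere from roughly 22% to 61%. At 100 runs the range tightens to roughly 31% to 50%, which is why the finer-grained questions need the larger samples.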

The $100M Problem Nobody Is Solving

The irony of the AI visibility tracking market is that the hardest problem is not the tracking itself. It is the interpretation. Anyone can run prompts and record results. The difficult part is turning those results into decisions.

Most teams fall into one of two traps:

Trap 1: Over-indexing on single data points. Running a prompt once, seeing your brand at #2, and celebrating. Then running it again and panicking when you do not appear. Both reactions are wrong. Single runs tell you almost nothing.

Trap 2: Aggregating away the signal. Averaging across so many prompts and models that the data becomes smooth and meaningless. “We have 28% visibility across all models” sounds impressive but hides the fact that you have 60% in ChatGPT and 0% in Perplexity.

The solution is disaggregated aggregation. Track enough runs to be statistically meaningful, but keep the data sliced by model, by prompt category, and by sentiment. Look for patterns in the slices, not just the top-line number.
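
A rough sketch of what slicing looks like in practice: the same run log is summarized once as a top-line number and once per model. The record layout (`model`, `brand_present`) and the sample figures are assumptions made for this example only.

```python
from collections import defaultdict


def appearance_rates(runs, keys):
    """Appearance rate per slice, where a slice is a tuple of the given keys.

    Passing an empty `keys` tuple yields the single top-line rate.
    """
    totals = defaultdict(lambda: [0, 0])  # slice -> [hits, runs]
    for run in runs:
        slice_key = tuple(run[k] for k in keys)
        totals[slice_key][0] += int(run["brand_present"])
        totals[slice_key][1] += 1
    return {key: hits / n for key, (hits, n) in totals.items()}


# Hypothetical log: 6 of 10 ChatGPT runs mention the brand, 0 of 10 Perplexity runs do.
runs = (
    [{"model": "chatgpt", "brand_present": i < 6} for i in range(10)]
    + [{"model": "perplexity", "brand_present": False} for _ in range(10)]
)
print(appearance_rates(runs, ()))          # {(): 0.3}
print(appearance_rates(runs, ("model",)))  # {('chatgpt',): 0.6, ('perplexity',): 0.0}
```

The top-line 30% hides a complete absence in one model. Adding a category or sentiment field to each record and passing it in `keys` slices the same log further.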

This is the approach that searchless.ai takes with its visibility scoring. By running systematic queries across multiple AI engines and measuring both presence and context, the score reflects the probability that an AI engine will recommend your brand, not a single snapshot.

Practical Steps to Build a Real AI Visibility Dashboard

If you are building or buying an AI visibility tracking system, here is what to prioritize.

Step 1: Define Your Prompt Categories

List the 10 to 20 prompt categories that represent how real users discover brands in your space. Include direct comparison prompts (“Brand A vs Brand B”), problem-based prompts (“how to solve X”), and category prompts (“best tools for Y”).

Step 2: Create Prompt Variations

For each category, write 5 to 10 natural-language variations. SparkToro’s research shows that prompt phrasing significantly affects results. Cover the range of how real users ask questions, not just how your marketing team would phrase them.
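
To get a feel for how quickly variations multiply once models and repeated runs are included, here is a small sketch that expands categories and phrasings into a run plan. The categories, prompts, model labels, and the `build_run_plan` helper are all hypothetical placeholders, and the model names are plain labels rather than API identifiers.

```python
from itertools import product

# Hypothetical categories and phrasings; replace with your own.
PROMPTS = {
    "best-for-small-business": [
        "What is the best CRM for a small business?",
        "Which CRM would you recommend for a five-person company?",
    ],
    "alternatives": [
        "What are good Salesforce alternatives?",
        "I want to move off Salesforce. What should I look at?",
    ],
}
MODELS = ["chatgpt", "gemini", "perplexity"]  # labels only


def build_run_plan(prompts, models, runs_per_variation=20):
    """Expand categories x variations x models x repeats into a flat list."""
    plan = []
    for category, variations in prompts.items():
        for prompt, model, run in product(variations, models, range(runs_per_variation)):
            plan.append({"category": category, "prompt": prompt, "model": model, "run": run})
    return plan


plan = build_run_plan(PROMPTS, MODELS)
print(len(plan))  # 2 categories x 2 variations x 3 models x 20 runs = 240
```

Even this tiny example needs 240 calls, so it is worth sizing the plan before committing to a category list.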

Step 3: Run Systematically

Execute each prompt variation multiple times across each AI model you care about. Minimum 20 runs per variation per model for directional data. Log every response with timestamp, model version, and full text.
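
A minimal sketch of the logging step, assuming one JSON object per line as the storage format. The `RunRecord` fields mirror the items listed above (timestamp, model version, full text) plus the prompt and its category. The class name, field names, and sample values are illustrative, and the call to the AI model itself is deliberately left out.

```python
import io
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now():
    """Current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class RunRecord:
    model: str
    model_version: str
    prompt_category: str
    prompt: str
    response_text: str
    timestamp: str = field(default_factory=_utc_now)


def append_record(stream, record):
    """Write one record as a single JSON line to an open text stream."""
    stream.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


# Usage with an in-memory buffer; a real log would use open(path, "a").
buffer = io.StringIO()
append_record(
    buffer,
    RunRecord(
        model="chatgpt",
        model_version="unknown",  # record whatever version string you can observe
        prompt_category="alternatives",
        prompt="What are good Salesforce alternatives?",
        response_text="Consider HubSpot or Pipedrive.",
    ),
)
```

Keeping the full response text means presence, sentiment, and competitor analysis can be rerun later without repeating the queries.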

Step 4: Measure Presence and Sentiment

For each response, record: (a) whether your brand appeared, (b) the framing and sentiment, (c) which competing brands appeared alongside you, and (d) the position in the list (as secondary context, not primary metric).
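
Items (a), (c), and (d) can be sketched with simple text matching, as below. Item (b), sentiment, is left out on purpose because it usually needs human review or a separate classifier. The helper names and the sample response are assumptions, and “position” here means the order of first mention among the brands you supply, which is only a rough stand-in for list position.

```python
import re


def first_mention(text, name):
    """Character offset of the first whole-word mention of `name`, or None."""
    match = re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE)
    return match.start() if match else None


def analyze_response(text, brand, competitors):
    """Record presence, co-mentioned competitors, and order of first mention."""
    offsets = {name: first_mention(text, name) for name in [brand, *competitors]}
    mentioned = sorted(
        (name for name, offset in offsets.items() if offset is not None),
        key=lambda name: offsets[name],
    )
    return {
        "brand_present": brand in mentioned,
        "competitors_present": [name for name in mentioned if name != brand],
        # Secondary context only: 1-based order among the tracked brands.
        "position": mentioned.index(brand) + 1 if brand in mentioned else None,
    }


result = analyze_response(
    "Top picks: Asana, Trello, and Notion.",
    brand="Trello",
    competitors=["Asana", "Notion", "ClickUp"],
)
print(result)
# {'brand_present': True, 'competitors_present': ['Asana', 'Notion'], 'position': 2}
```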

Step 5: Calculate Share of Visibility

For each prompt category and model, calculate your appearance rate. Track this over time. Set benchmarks against competitors. This is your north-star GEO metric.

Step 6: Track Trends, Not Snapshots

Review your visibility data weekly or monthly, not daily. Daily fluctuations are noise. Weekly trends in share of visibility, cross-model coverage, and sentiment are signal.
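
As a sketch of weekly review, the snippet below buckets runs by ISO week and reports the appearance rate for each week. The input shape (a date plus a presence flag per run) and the sample dates are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def weekly_share(runs):
    """Map (ISO year, ISO week) -> appearance rate, in chronological order."""
    buckets = defaultdict(lambda: [0, 0])  # week -> [hits, runs]
    for day, brand_present in runs:
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        buckets[week][0] += int(brand_present)
        buckets[week][1] += 1
    return {week: hits / n for week, (hits, n) in sorted(buckets.items())}


# Hypothetical runs spread over two weeks in January 2026.
runs = [
    (date(2026, 1, 5), True),
    (date(2026, 1, 6), False),
    (date(2026, 1, 12), True),
    (date(2026, 1, 13), True),
]
print(weekly_share(runs))  # {(2026, 2): 0.5, (2026, 3): 1.0}
```

Two runs per week are far too few to act on. In practice each weekly bucket should hold at least the minimum run counts discussed earlier before a change is read as a trend.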

Why 93% of Teams Are Building This In-House

The Conductor CMO report revealed that 93% of enterprise leaders are building AEO and GEO capabilities in-house rather than outsourcing. That number is striking because it is the opposite of what happened with traditional SEO, where agencies dominated.

The reason is data ownership. AI visibility data is strategic intelligence about how your brand is perceived by the most important new distribution channel since Google. Handing that to a third-party agency creates dependency and limits how fast you can iterate.

Building in-house also lets you customize the tracking to your specific competitive landscape. Generic AI ranking tools use generic prompt sets. Your brand competes in specific categories against specific competitors. Custom tracking captures that nuance.

That said, not every team has the resources to build from scratch. Tools like searchless.ai provide a starting point with standardized visibility scores that you can supplement with your own custom tracking as your program matures.

The Metrics That Do Not Matter (Yet)

Some metrics sound important but are premature to track at this stage of AI search evolution.

AI referral traffic. Most analytics tools cannot reliably attribute traffic from AI engines yet. ChatGPT referrals often show up as “direct” or “unattributed.” Until analytics tools catch up, measuring AI traffic directly is unreliable. Use visibility as a proxy.

AI “rankings” in any form. As the SparkToro data proves, individual position data is statistically meaningless. Tracking position 1 versus position 3 in AI responses is like tracking the flight path of a single bee to understand the hive.

Single-model scores. A visibility score that only covers ChatGPT tells you nothing about the 60% of AI users who prefer other tools. Cross-model coverage is essential.

FAQ

Is AI visibility tracking actually possible?

Yes, but not the way most tools do it. Individual “rankings” in AI responses are statistically unreliable. However, tracking your brand’s appearance rate across a large enough number of runs, models, and prompt variations gives you a reliable measure of how likely AI engines are to recommend you. Think of it as probability tracking, not position tracking.

How many times should I run the same prompt to get reliable data?

At minimum 20 to 30 runs per prompt variation per AI model for directional trends. For competitive benchmarking, aim for 50 or more. The SparkToro research showed that you need substantial samples because the variance in AI responses is extremely high. A single run tells you nothing. Two runs that agree might be coincidence.

What is the best metric for AI visibility?

Share of visibility. This measures how often your brand appears in AI responses across a category of prompts, expressed as a percentage of total runs. It is robust to the randomness of individual responses, comparable across brands and models, and tracks meaningfully over time. Position-based metrics are not reliable for AI.

Should I track AI visibility for every AI model separately?

Absolutely. ChatGPT, Gemini, Perplexity, and Claude have different training data and different behavior. Your visibility in one does not predict your visibility in another. Track each model separately and look for gaps where you are strong in one but absent in another.

Why do AI engines give different answers each time?

Large language models generate responses probabilistically, not deterministically. Each response is a fresh synthesis weighted by context and random sampling. There is no fixed index or stable ranking. This is a feature of how LLMs work, not a bug. It is why measuring AI “rankings” is fundamentally different from measuring search rankings.

Is AI visibility tracking worth the investment?

94% of enterprises plan to increase their GEO investment in 2026, according to Conductor’s survey of 250+ marketing leaders. The investment is happening regardless. The question is whether you track the right metrics. Done well, AI visibility tracking tells you whether your content, entity authority, and structured data are working. Done poorly, it gives you misleading dashboards that drive bad decisions.

The Bottom Line

AI visibility tracking is not broken. The way most teams do it is.

Stop tracking AI “rankings.” Start tracking share of visibility across enough runs, models, and prompt categories to be statistically meaningful. That is the difference between a dashboard that looks impressive and a metric that drives decisions.

The data is clear. AI engines are inconsistent on any single run. But across hundreds of runs, patterns emerge. Those patterns tell you whether AI engines see your brand as a credible, citable entity in your category or whether you are invisible to the fastest-growing discovery channel since Google.

If you do not know where you stand, start with a baseline. Check your free AI Visibility Score at audit.searchless.ai and see what the major AI engines actually say about your brand.

The brands that measure AI visibility correctly will optimize for it correctly. Everyone else will be optimizing for noise.
