
How LLM reporting and tracking works. Here's the full breakdown.

Everyone is rushing to get their brand into LLM responses. But how do you actually track it?

There is only one method. And at scale, it's expensive to run.

Let me break down exactly how it works - and what every LLM analytics tool is doing under the hood.

There is only one method.

None of the LLM platforms give you native analytics right now. No data on what users are typing in. No search query reports. Nothing. So, reporting on whether your brand shows up in AI responses can only be done one way.

I first worked this out when I was at CXL. I started manually running a set of prompts - things like "b2b marketing training" - and sending the exact same prompt to every model. Gemini, Perplexity, ChatGPT, Claude, Grok. I'd save each response, run the next prompt, save again, and keep going until I had a full set. Then I'd run analysis on the collected responses through another LLM.

It worked well. You could see how many times your brand came up. You could even rerun the same prompt to check consistency - does the brand appear again, or was it a one-off? There's a surprising amount of insight sitting inside that simple loop.
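The same loop is easy to automate. Here's a minimal sketch using the Vercel AI SDK - my choice of client, any multi-model setup would do - with placeholder model IDs:

```ts
// A minimal sketch of the loop: same prompt, every model, keep the raw text.
// Model IDs are illustrative - swap in whatever you want to track.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";
import { google } from "@ai-sdk/google";

const models = {
  chatgpt: openai("gpt-4o"),
  claude: anthropic("claude-3-5-sonnet-latest"),
  gemini: google("gemini-1.5-pro"),
};

export async function runPrompt(prompt: string) {
  const results: Record<string, string> = {};
  for (const [name, model] of Object.entries(models)) {
    // Send the exact same prompt to each model and keep the full response.
    const { text } = await generateText({ model, prompt });
    results[name] = text;
  }
  return results; // save these, then analyse the whole set afterwards
}
```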

That manual process is the method. It is the only way to see whether your brand appears in an LLM response. Yes or no.

What are the tools doing under the hood?

Every LLM analytics platform runs the same architecture because there is no alternative.

They take a set of prompts - could be 100, could be 1,000, could be 10,000. They store them in a system. Then they send each prompt out through a multi-model API key (from providers like Vercel or Abacus) and ping every model: Claude, Grok, Gemini, DeepSeek, ChatGPT, Google's AI snippet, and Google AI Mode (those last two come through Google's API).

For each prompt, they pull back six to seven responses - one per model. They save every single response into a database. Then they run an analysis on those stored responses using deterministic code. Then they build a reporting layer that reads from the database and presents the findings back to the user in a clean dashboard.

That is the entire product. Prompt runner, multi-model API, database, deterministic analysis, UI.
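The deterministic part is genuinely simple. Assuming responses are stored as plain rows (my own shape here, not any particular tool's schema), the core check is just: does the brand string appear in the response?

```ts
// A sketch of the deterministic analysis step. Real tools layer more on top
// (brand aliases, competitors, position in the answer), but the core is this.
type StoredResponse = { prompt: string; model: string; text: string };

export function brandVisibility(rows: StoredResponse[], brand: string) {
  const needle = brand.toLowerCase();
  const mentions = rows.filter((r) => r.text.toLowerCase().includes(needle));
  const models = [...new Set(rows.map((r) => r.model))];
  return {
    totalResponses: rows.length,
    mentions: mentions.length,
    // share of all model responses that mention the brand at least once
    visibilityRate: rows.length ? mentions.length / rows.length : 0,
    // mention counts broken down per model
    byModel: Object.fromEntries(
      models.map((m) => [m, mentions.filter((r) => r.model === m).length])
    ),
  };
}
```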

You can build the exact same thing, run it locally, and get the same results.

I took this hypothesis and built it using our GTM OS at Backbase, which is a collection of agents we've built out. Tested it with a few prompts first. It worked. Then I ran the exact same prompt set we use inside AirOps.

It produced exactly the same results.

Right now I'm running 744 prompts per cycle. Each run costs about $100-$200 and takes roughly two hours to complete, because the job times out if you push too many at once. It has to run them in sequential batches, saving everything along the way. It's a long run, but it's fully automated.
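The batching itself is unremarkable. A sketch, assuming the runPrompt() helper above and a saveResponses() helper like the Supabase one further down (the paths and the batch size are mine):

```ts
// A sketch of the batched run: small parallel batches, persisted as you go,
// so no single request runs long enough to time out and no batch is lost.
import { runPrompt } from "@/lib/run-prompt";          // sketched earlier
import { saveResponses } from "@/lib/save-responses";  // sketched further down

export async function runCycle(prompts: string[], batchSize = 25) {
  const runDate = new Date().toISOString();
  for (let i = 0; i < prompts.length; i += batchSize) {
    const batch = prompts.slice(i, i + batchSize);
    const rows = await Promise.all(
      batch.map(async (prompt) => {
        const responses = await runPrompt(prompt);
        return Object.entries(responses).map(([model, text]) => ({
          prompt,
          model,
          text,
          run_date: runDate,
        }));
      })
    );
    // Save after every batch so a failure partway through loses almost nothing.
    await saveResponses(rows.flat());
  }
}
```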

I'm now setting it to run on autopilot on the first and twenty-eighth of every month at 3 am. When you open your laptop that morning, the data is there.
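On Vercel that's just a cron entry in vercel.json pointing at an API route - the schedule "0 3 1,28 * *" means 3 am on the 1st and 28th. A sketch of the route it calls (file paths and module names are mine):

```ts
// app/api/run-cycle/route.ts - the endpoint the Vercel cron job hits.
import { runCycle } from "@/lib/run-cycle"; // the batching sketch above
import { prompts } from "@/lib/prompts";    // hypothetical: the stored prompt set

export async function GET() {
  await runCycle(prompts);
  return new Response("cycle complete", { status: 200 });
}
```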

Why monthly, not daily?

I don't think daily tracking matters. Not enough changes on a day-to-day basis in the world of LLMs. It takes Google at least 48-72 hours to index a page; LLMs always lag a few days behind on normal blogs.

That said, models can pick things up fast. Last week, we won the Forrester Wave with Backbase. By the next morning, it was already in the LLMs - ask Gemini who the leader is and it will tell you. So yes, there are moments where things move quickly, but from a reporting perspective you're still working on a lag.

Monthly runs give you everything you need. You see where the gaps are, you turn those gaps into content, you can feed them straight into a blog writer agent, and you can track progress over time. Over the course of a year, you build a clear picture of how your visibility went up or down.

In between runs, you work off standard SEO metrics. Get to the top of Google. Good old Google Search Console is your best friend. We know there's a correlation between ranking at the top of organic search and having your brand cited by LLMs.

What it costs versus what you're paying now.

You can run a set of 1,000 prompts for roughly $500 to $600 per month. That is significantly less than what most of these tools charge for tracking that many prompts, because they're capitalizing on the insights and running prompts almost daily. You're paying for every single prompt, and the token cost is high - every response from every model for every prompt needs to be analysed word by word, and tracked over time.
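For scale, a back-of-the-envelope from my own numbers: 744 prompts across roughly seven models is about 5,200 responses per run, and at $100-$200 a run that works out to somewhere around two to four cents per collected and analysed response.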

The markup is on the analysis and the convenience. The underlying mechanics are not complex at all.

The stack I used.

Five components, nothing exotic:

Claude Code and Cursor for the development (Claude Code native, VS Code, or the terminal would work just as well), Vercel for hosting and deployment with cron jobs for automation, and Supabase as the database.
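For completeness, the Supabase write is only a few lines too. A sketch, assuming a responses table with prompt, model, text, and run_date columns (the table and column names are mine):

```ts
// A sketch of the persistence helper used by the batching loop above.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export async function saveResponses(
  rows: { prompt: string; model: string; text: string; run_date: string }[]
) {
  const { error } = await supabase.from("responses").insert(rows);
  if (error) throw error; // fail loudly so a bad batch doesn't vanish silently
}
```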

The whole build took about two hours, plus a few more hours of testing. That's it.

If you're running LLM brand tracking through a paid tool right now, you should know what you're paying for. And if you've got someone technical on your team - or you're comfortable with "vibe coding" - this is a very buildable project.

Have fun.