You paste 200 NPS comments into ChatGPT, ask for the top five themes, and get a clean summary. You run the same prompt again ten minutes later—same comments, same wording—and the themes are different.
Paste the same comments into Claude, and you get a third interpretation.
This isn't user error or a glitch. It's how large language models work, and it has real consequences for anyone trying to build reliable customer insights.
This guide breaks down the technical reasons behind the inconsistency and where each model drifts. It also covers what CX teams can do to get reproducible feedback analysis at scale.
Why ChatGPT and Claude give you different answers every time you analyze feedback
ChatGPT and Claude give different answers each time you analyze feedback for three main reasons: probabilistic generation, temperature settings that leave room for randomness, and context windows that may not hold all of your data. Each session also processes the data independently, so a small difference in one early token choice can carry through to a different final result.
Here's what that looks like in practice: paste the same 50 NPS comments into ChatGPT twice, and you'll get two different theme lists. Run the same batch through Claude, and you'll see a third interpretation.
The cause is structural. LLMs are prediction engines, not calculators: they generate statistically plausible text rather than repeatable calculations.
- Expectation: Same feedback in, same insights out
- Reality: LLMs sample from probability distributions, producing varied outputs each run
For quick exploration—getting a rough sense of what customers are saying—this variability is tolerable. For reporting to stakeholders or tracking trends over time, it becomes a real problem.
Why large language models are non-deterministic by design
The variability you see isn't accidental. It's built into how LLMs generate text. Understanding the mechanics helps explain why no amount of prompt tweaking fully eliminates inconsistency.
Temperature and probabilistic sampling
Temperature is the dial controlling randomness in word selection. High temperature means more creative, varied outputs. Low temperature tightens around the most probable words.
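A 2025 ACL study found that even at temperature zero, the most deterministic setting, no LLM consistently delivers the same outputs. When two candidate words have nearly identical probabilities, tiny numerical differences between runs (for example, from how the hardware orders floating-point operations) can tip the choice either way. Every later word then builds on that choice, so the outputs diverge.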
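To make the temperature dial concrete, here is a minimal sketch of temperature-scaled sampling over three near-synonymous theme labels. The scores in `logits` are invented for illustration, and a real model samples over its whole vocabulary one token at a time, so treat this as a toy picture of the mechanism rather than how ChatGPT or Claude is implemented.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; temperature must be greater than zero."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_label(labels, logits, temperature, rng):
    """Draw one label at random according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(labels, weights=probs, k=1)[0]

# Hypothetical scores: the first two labels are almost equally likely.
labels = ["Shipping issues", "Delivery complaints", "Logistics problems"]
logits = [2.0, 1.9, 0.5]

rng = random.Random(0)
high_t = {sample_label(labels, logits, 1.0, rng) for _ in range(200)}
low_t = {sample_label(labels, logits, 0.001, rng) for _ in range(200)}
```

At the very low temperature, `low_t` collapses to the single top-scoring label, while at temperature 1.0 the near-tie between the first two labels means repeated runs typically surface both. That is the "Shipping issues" versus "Delivery complaints" flip in miniature.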
Context window limits and truncated feedback
Both ChatGPT and Claude have finite context windows—the amount of text they can "see" at once. When your feedback data exceeds this limit, part of it never reaches the model: a chat interface may truncate a long paste or drop older conversation turns, often without an obvious warning.
What gets dropped can vary between sessions. One session might lose early survey responses; another might cut off the end of your dataset. The result is inconsistent theme coverage depending on what the model actually processed.
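One way to keep truncation from deciding what gets analyzed is to split the feedback into batches that each fit a token budget before sending anything to a model. The sketch below assumes roughly four characters per token, which is a crude heuristic and not a real tokenizer; `estimate_tokens` and `batch_comments` are hypothetical helper names.

```python
def estimate_tokens(text):
    """Rough token estimate; assumes ~4 characters per token (heuristic only)."""
    return max(1, len(text) // 4)

def batch_comments(comments, max_tokens_per_batch):
    """Group comments, in order, into batches that stay within the token budget.

    A single comment larger than the budget still gets a batch of its own.
    """
    batches, current, used = [], [], 0
    for comment in comments:
        cost = estimate_tokens(comment)
        if current and used + cost > max_tokens_per_batch:
            batches.append(current)
            current, used = [], 0
        current.append(comment)
        used += cost
    if current:
        batches.append(current)
    return batches

comments = [
    "Delivery was late again." * 3,
    "Love the new dashboard.",
    "Support never replied to my ticket." * 2,
]
batches = batch_comments(comments, max_tokens_per_batch=25)
```

Because the batches are built the same way every time, each run analyzes the same comments in the same groups, and nothing is dropped without you knowing about it.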
Prompt sensitivity and instruction drift
Minor wording changes in prompts shift outputs dramatically. "Summarize the key themes" and "Extract the main topics" sound similar to humans but can trigger different response patterns.
Even copy-pasting the exact same prompt can yield different results. The model interprets instructions probabilistically, not literally.
Silent model updates and version changes
OpenAI and Anthropic update the models behind their chat products, and those changes are easy to miss unless you follow the release notes. A prompt that produced consistent results last month may behave differently today because the underlying model changed, with nothing in the chat window to warn you.
How ChatGPT analyzes customer feedback and where its answers drift
ChatGPT brings strong general knowledge and summarization capabilities to feedback analysis. It handles straightforward sentiment well and produces readable summaries quickly.
Where it drifts:
- Theme labeling: "Shipping issues" one run, "Delivery complaints" the next—same feedback, different category names
- Sentiment edge cases: Sarcasm or mixed feedback gets classified inconsistently
- Prioritization: Different "top themes" surface depending on the run
With labels, sentiment calls, and rankings all shifting between runs, product teams find it difficult to trust any single analysis.
How Claude analyzes customer feedback and where its answers drift
Claude's larger context window helps with longer verbatims and batch analysis. It tends toward more cautious, hedged responses—which can be helpful for nuanced feedback but introduces its own inconsistencies.
Where it drifts:
- Granularity shifts: Broad categories one run, hyper-specific sub-themes the next
- Hedged outputs: "Possibly negative" versus "Negative" on identical feedback
- Quote attribution: Different verbatim quotes surfaced as "representative" each time
The hedging tendency means Claude sometimes refuses to commit to a sentiment classification, even when the feedback is clearly positive or negative.
Where ChatGPT and Claude behave the same on feedback data
Both models handle certain tasks reliably. Acknowledging overlap helps calibrate expectations.
For straightforward positive/negative sentiment on clear feedback, both models perform adequately. Problems emerge with ambiguity, scale, and the need for reproducibility.
Shared limitations that make both models inconsistent for feedback analysis
Beyond model-specific quirks, fundamental LLM limitations affect any general-purpose model used for VoC analytics. Think of these as architectural constraints, not bugs to be fixed.
Hallucinated themes and fabricated quotes
LLMs can generate plausible-sounding themes or "quote" feedback that doesn't exist in the source data. Hallucination—the model confidently producing false information—is particularly dangerous when executives make decisions based on fabricated evidence. A 2026 study found that GPT-4o and Claude 3.7 hallucinate at 15–20% rates on factual citation tasks, with rates climbing higher on niche topics.
You might see a theme labeled "billing confusion" with a supporting quote that sounds real but doesn't appear anywhere in your actual feedback.
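A simple guard against fabricated quotes is to check every quote the model returns against the source comments. The sketch below does a case-insensitive, whitespace-tolerant substring match; the sample comments and quotes are invented for illustration, and paraphrased quotes would need fuzzier matching than this.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()

def find_unsupported_quotes(quotes, source_comments):
    """Return the quotes that do not appear verbatim in any source comment."""
    haystack = [normalize(c) for c in source_comments]
    return [q for q in quotes if not any(normalize(q) in c for c in haystack)]

source_comments = [
    "I was charged twice this month and nobody could explain why.",
    "The app is fast, but checkout keeps logging me out.",
]
model_quotes = [
    "charged twice this month",
    "The billing page is confusing and misleading",  # does not appear in the source
]
unsupported = find_unsupported_quotes(model_quotes, source_comments)
```

Here `unsupported` contains only the second quote, which is the one worth removing or checking by hand before it reaches a report.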
Unstable sentiment scores and theme counts
Running the same analysis twice produces different sentiment distributions and theme frequencies. One run shows 40% negative sentiment; the next shows 35%.
Trend analysis becomes impossible when the baseline keeps shifting.
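You can put a number on how much the baseline shifts by comparing the theme lists from repeated runs. The sketch below uses Jaccard similarity (shared themes divided by all themes seen) as one possible stability measure; the three runs are made-up examples, and exact-match comparison treats "Shipping" and "Delivery" as different themes even though a human would merge them.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of themes two runs have in common, from 0.0 (none) to 1.0 (identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_overlap(runs):
    """Average Jaccard similarity across every pair of runs."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)

# Hypothetical "top five themes" from three runs on the same comments.
run_1 = ["Shipping", "Pricing", "Support", "Usability", "Product Quality"]
run_2 = ["Shipping", "Pricing", "Support", "Onboarding", "Billing"]
run_3 = ["Delivery", "Pricing", "Support", "Usability", "Bugs"]

overlap_1_2 = jaccard(run_1, run_2)
stability = mean_pairwise_overlap([run_1, run_2, run_3])
```

A score near 1.0 means the runs agree; a score well below that is a sign the theme list is not yet stable enough to trend.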
Loss of nuance on long verbatims
Detailed customer stories get compressed into generic summaries. Subtle signals—feature requests buried in complaints, churn indicators hidden in praise—get lost when the model simplifies.
A customer writing three paragraphs about frustration with a specific workflow might get reduced to "user experience issues." The actionable detail disappears.
No persistent memory across analyses
Neither ChatGPT nor Claude remembers previous analyses. Every session starts fresh. Tracking changes over time requires manual workarounds—exporting results, maintaining spreadsheets, comparing outputs yourself.
Why inconsistent LLM outputs break voice of customer reporting
Technical inconsistency translates directly into business consequences—74% of organizations identify AI inaccuracy as a highly relevant risk. When theme counts fluctuate randomly, stakeholders lose trust in the data. When sentiment scores drift, NPS drivers become impossible to isolate.
- Eroded stakeholder trust: Leadership questions data validity when reports show different "top issues" each week
- Impossible trend tracking: Baselines shift with every analysis, making improvement unmeasurable
- Misallocated resources: Teams chase phantom issues that don't persist in subsequent analyses
- Audit failures: No reproducible evidence trail for compliance or executive review
You can't build a reliable VoC program on a foundation that gives different answers to the same question.
How to get reproducible insights from customer feedback at scale
Teams that need consistent, auditable VoC insights can adopt operational practices that constrain variability—or move to purpose-built platforms designed for reproducibility.
1. Standardize the feedback taxonomy before any analysis
Define theme categories, sentiment labels, and scoring rubrics upfront. A fixed taxonomy constrains model outputs and enables apples-to-apples comparisons over time.
Instead of asking the model to "identify themes," provide a specific list: Pricing, Usability, Shipping, Support, Product Quality. The model classifies into your categories rather than inventing new ones each run.
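As a rough sketch of what a fixed taxonomy looks like in practice, the code below builds a classification prompt from the category list above and validates whatever label comes back. The `"Other"` fallback and the helper names are assumptions added for illustration, and the call to the model itself is deliberately left out.

```python
TAXONOMY = ["Pricing", "Usability", "Shipping", "Support", "Product Quality"]
FALLBACK = "Other"  # assumed catch-all for replies that are not in the taxonomy

def build_prompt(comment, taxonomy=TAXONOMY):
    """Build a prompt that asks for exactly one label from the fixed list."""
    options = ", ".join(taxonomy)
    return (
        "Classify the customer comment into exactly one of these themes: "
        f"{options}. Reply with the theme name only.\n\nComment: {comment}"
    )

def validate_label(raw_label, taxonomy=TAXONOMY):
    """Map the model's reply onto the taxonomy; anything off-list becomes FALLBACK."""
    lookup = {theme.lower(): theme for theme in taxonomy}
    cleaned = raw_label.strip().strip(".").lower()
    return lookup.get(cleaned, FALLBACK)

prompt = build_prompt("The courier left my parcel in the rain.")
clean = validate_label(" shipping. ")
invented = validate_label("Delivery complaints")
```

The validation step is what keeps an invented label such as "Delivery complaints" out of your counts: it lands in the fallback bucket, where you can review it, instead of becoming a new category.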
2. Use fine-tuned models for theme and sentiment classification
Fine-tuned or domain-specific models trained on your feedback corpus produce more stable outputs than general-purpose LLMs. They learn your terminology, your product names, your common complaint patterns—applying structured text analysis rather than open-ended generation.
This is what enterprise feedback platforms do under the hood—train specialized models optimized for classification, not generation.
3. Version and audit every prompt and output
Log prompt versions, model versions, and raw outputs for every analysis. This creates an audit trail and helps diagnose when drift occurs.
When stakeholders question why reports differ month to month, you can trace the change to a specific variable. No more shrugging at LLM randomness.
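A minimal version of that audit trail can be a JSON Lines file with one record per analysis. The field names, the version strings, and the `make_audit_record` and `append_record` helpers below are illustrative assumptions, not a standard schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_hex(text):
    """Fingerprint a string so any change to it is detectable."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def make_audit_record(prompt, prompt_version, model_version, raw_output):
    """Bundle everything needed to explain a result later."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_version": prompt_version,
        "prompt_sha256": sha256_hex(prompt),
        "model_version": model_version,
        "output_sha256": sha256_hex(raw_output),
        "raw_output": raw_output,
    }

def append_record(path, record):
    """Append one record as a line of JSON."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Hypothetical values; use whatever identifiers your own tooling reports.
record = make_audit_record(
    prompt="Classify each comment into one of: Pricing, Usability, Shipping.",
    prompt_version="v3",
    model_version="example-model-2025-01",
    raw_output='{"comment_1": "Shipping"}',
)
# append_record("audit_log.jsonl", record)  # uncomment to write to disk
```

If two monthly reports disagree, comparing `prompt_sha256` and `model_version` across their records shows whether the prompt or the model changed, or whether neither did.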
4. Layer LLMs on top of a deterministic feedback pipeline
Use LLMs for generative tasks—summarization, exploration, answering ad-hoc questions—but route classification and scoring through deterministic systems.
Platforms like Chattermill combine AI flexibility with analytical consistency: the creative power of language models for insight generation, the reproducibility of structured classification for reporting.
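To show what "deterministic" means for the classification half of that split, here is a deliberately simple keyword-rule classifier. It is a stand-in chosen for illustration, not how any particular platform works; production systems would more likely use a trained model with fixed weights. The keyword lists are invented.

```python
import re

# Hypothetical keyword rules; a real taxonomy would be far richer.
KEYWORD_RULES = {
    "Shipping": ["delivery", "shipping", "courier", "late"],
    "Pricing": ["price", "expensive", "refund", "charged"],
    "Support": ["support", "agent", "ticket"],
}

def classify(comment, rules=KEYWORD_RULES):
    """Return every theme whose keywords appear in the comment, sorted by name."""
    words = set(re.findall(r"[a-z']+", comment.lower()))
    return sorted(theme for theme, keywords in rules.items() if words & set(keywords))

def theme_counts(comments):
    """Count how many comments mention each theme."""
    counts = {}
    for comment in comments:
        for theme in classify(comment):
            counts[theme] = counts.get(theme, 0) + 1
    return counts

comments = [
    "Delivery was late and support never answered.",
    "Too expensive for what you get.",
]
counts = theme_counts(comments)
```

The same input produces the same counts on every run, which is the property reporting needs. An LLM can then be asked to summarize or explain the comments inside each counted theme without being able to change the numbers.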
When to move from ChatGPT and Claude to a dedicated analytics platform
Certain signals indicate a team has outgrown ad-hoc LLM analysis. This isn't a criticism of ChatGPT or Claude—they're powerful tools for many tasks. But feedback analytics at scale requires different capabilities.
- Multiple stakeholders depend on VoC data—inconsistency erodes trust
- Trend tracking spans quarters or years—LLMs lack persistent memory
- Compliance or audit requirements exist—no reproducible evidence trail
- Feedback volume exceeds manual prompting capacity—scale demands automation
For teams in this position, Chattermill unifies feedback from every channel with consistent, reproducible AI-powered analysis. It maintains stable taxonomies, tracks trends over time, and provides the audit trail enterprise VoC programs require.
Book a demo to see how purpose-built feedback analytics compares to general-purpose LLMs.
Frequently asked questions about ChatGPT and Claude for feedback analysis
Why does ChatGPT give different answers to the same customer feedback prompt?
ChatGPT uses probabilistic sampling to generate responses, selecting words based on weighted probabilities rather than fixed rules. Identical prompts can produce varied outputs because the model samples differently each time.
Is Claude more accurate than ChatGPT for analyzing customer feedback?
Neither model is inherently more accurate. Both are general-purpose LLMs with similar architectural limitations for feedback analysis. Claude's larger context window may help with longer verbatims, while ChatGPT's custom GPTs and integrations (which replaced its earlier plugin system) offer more ways to connect other tools.
Does setting temperature to zero make outputs fully consistent?
Lowering temperature reduces variability but does not eliminate it. Tie-breaking logic, context window handling, and silent model updates can still introduce drift between runs.
How much feedback can ChatGPT or Claude analyze per prompt?
Both models have finite context windows that limit how much text they process at once. Exceeding this limit causes silent truncation, which changes which feedback gets analyzed—and therefore which themes and sentiments get surfaced.
Can CX teams still use ChatGPT and Claude for customer feedback analysis?
General-purpose LLMs remain useful for exploratory analysis and ad-hoc questions. For reproducible, auditable insights at scale, teams typically benefit from purpose-built feedback analytics platforms that combine AI capabilities with analytical consistency.










