Beyond the Hype: How I Actually Evaluate Large Language Models

Everywhere I look these days, someone’s talking about the latest LLM breakthrough. OpenAI releases GPT-4o, Google drops Gemini Pro, and suddenly everyone’s an expert on which model is “best.” But here’s the thing I’ve learned after evaluating dozens of these systems for product development: most people are asking the wrong questions.

When I first started working with LLMs, I made the same mistake everyone else does. I’d run benchmarks, check accuracy scores, and compare token limits. But then I remembered something fundamental from The Qgenius Golden Rules of Product Development: “New technologies are sources of wealth, but you need to find their cognitive path.” In other words, technical specs alone don’t tell you whether a model will actually solve real problems for real users.

So here’s my current framework for evaluation, built around three core principles from product thinking: user-centric design, problem orientation, and system thinking. First, I look at cognitive load reduction. Does this model make complex tasks simpler for my target users? I recently tested two different models for a customer service application. One had higher accuracy scores, but the other responded in language that required less mental effort to understand. Guess which one users preferred?
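
If you want to make that preference signal concrete, here’s a minimal sketch of the kind of blind preference tally I mean. Everything in it is illustrative: the model labels and the vote data are placeholders, not the actual results from our customer service test.

```python
from collections import Counter

# Each vote records which model's response a rater found easier to act on,
# collected blind (raters never see which model wrote which answer).
# These votes are made-up placeholder data for illustration.
rater_votes = [
    "model_b", "model_b", "model_a", "model_b", "model_b",
    "model_a", "model_b", "model_b", "model_b", "model_a",
]

def preference_share(votes: list[str]) -> dict[str, float]:
    """Return each model's share of blind preference votes."""
    counts = Counter(votes)
    total = sum(counts.values())
    return {model: count / total for model, count in counts.items()}

print(preference_share(rater_votes))
# e.g. {'model_b': 0.7, 'model_a': 0.3}: the plainer-language model wins on clarity
```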

Second, I evaluate based on time value. As Qgenius reminds us, “The measure of value in innovation isn’t money, but time.” I track how much time each model saves users compared to existing solutions. For a legal document review tool we built, one model cut analysis time from 45 minutes to 3 minutes. That’s not just efficiency—that’s fundamentally changing how people work.
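
Here’s a rough sketch of the time-value tracking I mean, reusing the 45-minutes-to-3-minutes figure above. The task names and monthly volumes are made-up assumptions; swap in whatever workflow you’re actually measuring.

```python
from dataclasses import dataclass

@dataclass
class TaskTiming:
    name: str
    baseline_minutes: float   # time the task takes without the model
    assisted_minutes: float   # time it takes with the model in the loop
    monthly_volume: int       # how often users run this task per month (assumed)

    @property
    def minutes_saved_per_month(self) -> float:
        return (self.baseline_minutes - self.assisted_minutes) * self.monthly_volume

tasks = [
    TaskTiming("contract review", baseline_minutes=45, assisted_minutes=3, monthly_volume=120),
    TaskTiming("clause lookup", baseline_minutes=10, assisted_minutes=2, monthly_volume=300),
]

total_hours = sum(t.minutes_saved_per_month for t in tasks) / 60
print(f"Estimated time returned to users: {total_hours:.0f} hours/month")
```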

Third, and this might be controversial, I don’t obsess over hallucination rates. Don’t get me wrong—accuracy matters. But I’ve seen teams reject models with 2% hallucination rates while ignoring that human experts in their field make errors at similar rates. The key question isn’t whether the model makes mistakes, but whether the mistakes are catastrophic versus manageable.
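
One way to make that distinction operational is to weight mistakes by severity instead of just counting them. The sketch below is illustrative only: the severity labels and weights are my own assumptions, not any standard scale.

```python
from collections import Counter

# Assumed severity scale; tune the weights to your own risk tolerance.
SEVERITY_WEIGHTS = {
    "cosmetic": 0.1,       # wording is off, meaning intact
    "manageable": 1.0,     # wrong, but a reviewing human catches it
    "catastrophic": 25.0,  # wrong in a way that would harm the user if shipped
}

def weighted_error_score(labeled_mistakes: list[str], total_responses: int) -> float:
    """Penalize rare catastrophic errors far more than frequent minor ones."""
    counts = Counter(labeled_mistakes)
    penalty = sum(SEVERITY_WEIGHTS[label] * n for label, n in counts.items())
    return penalty / total_responses

# Two hypothetical models, both at a 2% raw error rate over 1,000 responses:
model_a = ["manageable"] * 18 + ["cosmetic"] * 2
model_b = ["manageable"] * 14 + ["catastrophic"] * 6
print(weighted_error_score(model_a, 1000))  # modest penalty: mistakes are recoverable
print(weighted_error_score(model_b, 1000))  # far higher despite the same error rate
```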

Here’s what most evaluation frameworks miss: context. A model that performs brilliantly for creative writing might fail miserably for technical documentation. That’s why I always test in the specific domain where the product will actually operate. I once saw a team choose a model based on general benchmarks, only to discover it couldn’t handle their industry-specific terminology.
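
A lightweight way to catch that early is a domain-specific smoke test built from your own prompts and terminology. This sketch assumes a hypothetical `generate` callable wired to whichever model client you’re evaluating; the legal-flavored prompts and required terms are just examples.

```python
from typing import Callable

# (prompt, terms a correct answer should use) -- placeholders for your own domain
domain_cases = [
    ("Summarize the indemnification clause in plain English.",
     ["indemnification", "liability"]),
    ("Explain what 'force majeure' covers in this agreement.",
     ["force majeure"]),
]

def domain_coverage(generate: Callable[[str], str], cases=domain_cases) -> float:
    """Fraction of domain prompts whose responses use the required terminology."""
    hits = 0
    for prompt, required_terms in cases:
        answer = generate(prompt).lower()
        if all(term.lower() in answer for term in required_terms):
            hits += 1
    return hits / len(cases)

# Usage: pass in whichever client you are evaluating, e.g.
#   score = domain_coverage(lambda p: my_client.complete(p))  # my_client is hypothetical
```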

My advice? Stop treating LLM evaluation like a spec sheet comparison. Start thinking like a product manager. Ask: Does this model create an unequal value exchange, where users get more than they give? Does it match our target users’ mental models? Can our team work with its limitations?

The best LLM isn’t the one with the highest scores—it’s the one that disappears into the background while making your users’ lives meaningfully better. So next time you’re evaluating models, ask yourself: Are you choosing the technically impressive option, or the one that will actually flow through your users’ lives?