Vibe Checking: a Valid Subjective Approach to LLM Quality Checks

Vibe checking is a subjective approach that evaluates LLMs using tailored quality metrics. This method complements traditional LLM analytics, focusing on user satisfaction and specific use cases rather than relying solely on standard metrics.

Artificial Intelligence (AI) is rapidly growing in adoption across industries. Companies integrate Large Language Models (LLMs) to add AI capabilities for tasks ranging from customer support to content generation.

In fact, the list of LLM use cases is endless, but the abundance of options leaves us with one question:

How do we know which LLM is best for our specific use case?

No single LLM can excel at every use case because the demands vary so widely. Taking our example tasks from above: a customer support chatbot may need to prioritise speed, whereas a content generation tool might prioritise creative output quality.

As surprising as it might seem, we can’t rely on standardised quality metrics to determine which LLM is best for us, because they can fail to capture the nuances of different use cases.

This is where ‘vibe checking’ comes in: assessing an LLM’s viability against more subjective KPIs, prioritising the metrics most relevant to our users and our specific use case.

The Limitations of Universal Quality Metrics in LLM Evaluation

We know one-size-fits-all approaches don’t always work, and evaluating LLMs for your AI SaaS is no different.

Universal metrics are useful for comparing the performance of LLMs on a broad, macro level. For a great example of this, read our previous article comparing Gemini Pro to GPT-4, two big LLMs on the market, here.

But for specific use cases we need a subjective set of metrics that lets us quantitatively determine which LLM is best under particular circumstances and scenarios.

Let’s look at some examples to get a better idea:

  • For real-time customer support, quick responses might be more valued than perfect accuracy.
  • For use cases that demand high accuracy, like healthcare, the greater computational resources required must be considered, and they come at a higher cost.
  • If an LLM is fine-tuned for a specific domain, e.g. legal, it might underperform on general tasks or in other industries.

The main downside of universal quality metrics should now be clear: they don’t capture the end-user experience. That matters because the metrics most relevant to the end-user experience are far more indicative of success for LLM-driven apps.

This means that if we want to accurately evaluate the quality of any LLM, we need a different set of subjective quality metrics for EACH use case.

Introducing Vibe Checking: A Subjective, Yet Crucial, Quality Check

Let’s first reiterate what vibe checking is.

It’s a way of evaluating an LLM based on subjective KPIs that matter most to the team and the end users.

By adopting tailored KPIs and metrics that are more applicable to a given use case, you can align your evaluation approach with the end-user experience.

For example, in creative or customer-facing use cases, user satisfaction and emotional understanding can be more relevant metrics for monitoring an LLM’s performance.

However, it’s important to note that vibe checking shouldn’t replace the use of traditional metrics, but complement them in order to fill the gaps left by objective metrics and provide a more holistic view of LLM performance.

Creating a balance that’s right for your specific requirements allows you to take a flexible, adaptable approach that accommodates the unique needs of your industry or use case.

Three Rational Reasons Why Vibe Checking is a Valid Approach

By exploring the promise and value behind subjective KPIs, we can see three main reasons to adopt vibe checking in any LLM evaluation approach:

1) Complements Objective Data

We know that quantitative metrics offer valuable data, but they often miss the nuanced aspects of user interactions and overall satisfaction.

If we combine vibe checking with traditional metrics, for instance by measuring response accuracy alongside user satisfaction through sentiment analysis, we can assess the LLM’s effectiveness more comprehensively.

This holistic approach of complementing objective and subjective data ensures our LLMs are not only meeting technical benchmark standards, but also satisfying users’ expectations.
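
As a minimal sketch of what this combination could look like, the snippet below reports an objective exact-match accuracy next to a crude keyword-based sentiment proxy. Both scoring functions are illustrative placeholders that you would swap for your own benchmark and sentiment model:

```python
import re

def accuracy_score(expected: str, actual: str) -> float:
    """Objective check: exact-match accuracy (0.0 or 1.0)."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def sentiment_score(user_reply: str) -> float:
    """Subjective proxy: crude keyword-based sentiment in [-1.0, 1.0]."""
    positives = {"thanks", "great", "perfect", "helpful"}
    negatives = {"wrong", "useless", "frustrated", "unhelpful"}
    words = set(re.findall(r"[a-z']+", user_reply.lower()))
    return (len(words & positives) - len(words & negatives)) / max(len(words), 1)

def evaluate(expected: str, actual: str, user_reply: str) -> dict:
    """Report the objective and subjective scores side by side."""
    return {
        "accuracy": accuracy_score(expected, actual),
        "user_sentiment": sentiment_score(user_reply),
    }

print(evaluate("Paris", "Paris", "Great, thanks!"))
# {'accuracy': 1.0, 'user_sentiment': 1.0}
```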

2) Aligns with User-Centric Development

As a natural follow on to the previous reason, our more well-rounded understanding of performance naturally leads to more user-centric development. This approach ensures LLMs resonate with our users in practice, not just perform well on paper.

For a guide on how to adopt a user centric approach when building on top of AI, read our previous article here.

3) Drives Continuous Improvement

Vibe checking is more dynamic and allows evaluation to evolve with user needs and expectations, facilitating iterative improvements based on real-world usage.

A simple practical application of this would be to use analytics tools like Phospho to log and collect user interactions with your LLM. You could then analyse any trends, pain points, or preferences and use these insights to guide further development and fine-tuning.
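
As a minimal sketch, here is how that logging step could look with Phospho’s Python client, based on its init/log quickstart at the time of writing; check the current docs, as the client API may have changed:

```python
import phospho

# Initialise the client with your project credentials.
phospho.init(api_key="YOUR_API_KEY", project_id="YOUR_PROJECT_ID")

user_query = "My order hasn't arrived yet."
# In practice this would come from your actual LLM call; hardcoded here
# so the example is self-contained.
llm_response = "I'm sorry to hear that. Let me check the status for you."

# Each logged input/output pair becomes a task you can later analyse
# for trends, pain points, and preferences.
phospho.log(input=user_query, output=llm_response)
```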

For a deeper read on how to adopt a product led approach with Phospho, read this previous article here.

How to Implement Vibe Checking in LLM Evaluation Processes

To consider this approach more practically, here’s how you can implement vibe checking when evaluating LLMs.

Firstly, you’ll need to define clear subjective KPIs that align with your specific use case.

Define Clear Subjective KPIs:

For example, if you’re developing a customer service chatbot, your goal might be to improve user satisfaction.

How can we measure this? Here are three hypothetical metrics we could monitor:

  1. Response appropriateness: You could assess how well the LLM’s responses match the context and intent of user queries by measuring how many follow-up questions are needed to reach a final answer (see the sketch after this list).
  2. Classification of literal and figurative speech: By measuring and reducing the rate at which the LLM misclassifies literal and figurative speech, we can fine-tune its ability to produce outputs aligned with user intent and minimise the need for repeated inputs from the user.
  3. Emotional intelligence: This metric gauges the LLM’s ability to recognise and respond appropriately to the varied user emotions that arise in customer service. We can apply sentiment analysis to user responses and track how effectively the chatbot de-escalates frustrated users.
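
To make the first metric concrete, here is a minimal sketch that approximates response appropriateness by counting follow-up user turns per conversation; the conversation schema (role/content dictionaries) is an illustrative assumption, not a fixed format:

```python
from statistics import mean

def follow_up_count(conversation: list[dict]) -> int:
    """Count user turns after the first one in a single conversation.

    Fewer follow-ups suggests the first answer matched the user's intent.
    """
    user_turns = [turn for turn in conversation if turn["role"] == "user"]
    return max(len(user_turns) - 1, 0)

conversations = [
    # Resolved in one exchange: 0 follow-ups.
    [{"role": "user", "content": "How do I reset my password?"},
     {"role": "assistant", "content": "Click 'Forgot password' on the login page."}],
    # Needed one clarification: 1 follow-up.
    [{"role": "user", "content": "Where is my order?"},
     {"role": "assistant", "content": "Could you share your order number?"},
     {"role": "user", "content": "It's #1234."},
     {"role": "assistant", "content": "It ships tomorrow."}],
]

print(mean(follow_up_count(c) for c in conversations))  # 0.5
```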

It’s important to have these KPIs well defined because clear definitions allow for consistent evaluation over time; without them, it’s hard to gauge progress or identify areas needing refinement.

Regular Feedback Loops:

Vibe checking allows us to measure the metrics most relevant to end-user experiences. By strategically implementing feedback loops, we can gather insights that directly correlate with what our users really need.

You can use AI analytics tools like Phospho to define custom KPIs that are continuously monitored and automatically ‘flagged’. By investigating the flagged user interactions, you can gain clear insights that inform iteration cycles and help optimise your LLM for your use case.
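
As an illustration of the flagging idea, independent of any particular tool, here is a minimal sketch that marks interactions for review when user sentiment drops below a threshold; the interaction fields and the threshold value are assumptions for the example:

```python
# Interactions below this sentiment threshold get flagged for human review.
FLAG_THRESHOLD = -0.2

def flag_interactions(interactions: list[dict]) -> list[dict]:
    """Return interactions whose user sentiment falls below the threshold."""
    return [i for i in interactions if i["user_sentiment"] < FLAG_THRESHOLD]

interactions = [
    {"id": 1, "user_sentiment": 0.6},
    {"id": 2, "user_sentiment": -0.7},  # frustrated user -> flagged
]

for item in flag_interactions(interactions):
    print(f"Review interaction {item['id']} (sentiment {item['user_sentiment']})")
```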

For a deeper dive on how to use Phospho for evaluating and improving your LLM integration, read our previous article about its features and use cases here.

Combining With Objective Metrics and Data Visualisation:

To create a balanced evaluation, combine vibe checking with traditional metrics. Start by identifying the key objective metrics for your use case, such as response latency or accuracy.

Next, determine how to weigh them against each other: with chatbots, for example, you might weight response speed more heavily than accuracy, whereas for data analysis you might prioritise accuracy. You can then visualise this with a dashboard that displays both quantitative and qualitative metrics side by side, giving you more complete insights at a glance.
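
As a minimal sketch of this weighting step, the snippet below combines normalised metric values under two hypothetical weighting schemes, one for a chatbot and one for data analysis; the weights and metric values are illustrative assumptions to tune for your own product:

```python
def weighted_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of normalised metric values (all in [0, 1])."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total

# Hypothetical weighting schemes for two different use cases.
chatbot_weights = {"speed": 0.5, "accuracy": 0.2, "user_satisfaction": 0.3}
analysis_weights = {"speed": 0.1, "accuracy": 0.6, "user_satisfaction": 0.3}

# The same measured values score differently depending on the use case.
metrics = {"speed": 0.9, "accuracy": 0.7, "user_satisfaction": 0.8}

print(round(weighted_score(metrics, chatbot_weights), 2))   # 0.83
print(round(weighted_score(metrics, analysis_weights), 2))  # 0.75
```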

If you want to see how you can create customised dashboards with subjective metrics, read our previous article here.

Conclusion: Embracing Subjectivity for Better LLM Performance

With quite literally any use case possible when integrating AI into our products, the need for more diverse evaluation metrics should now be clear. We need a balance of both objective and subjective metrics to fully capture an LLM’s performance.

By integrating vibe checking into our evaluation processes, we can evaluate our LLM’s performance with more relevance to our users’ real needs, whatever the use case. Without this flexibility, over-relying on standardised benchmark metrics risks producing products that are undifferentiated and fail to resonate with our users.

You can start creating your own subjective metrics with Phospho’s custom KPIs by signing up here for free.