TestingPod

7 Tests That Separate Production-Ready LLMs From Playground Projects

Written by Jahdunsin Osho | December 20, 2024

Unlike traditional software where a test suite might cover most edge cases, LLMs can respond in countless ways to the same input, which vastly increases the number of potential edge cases.

On top of that, every test you run costs money since each interaction with the model adds to your API costs.

So, let's explore how to adapt traditional testing strategies for LLM applications. We'll walk through six aspects that need testing to ensure reliable performance in production, and finish with monitoring.

1. Testing LLM Output

If you've used tools like ChatGPT before, you've probably noticed that even when you ask the same questions at different times, you get different answers. This is because AI output is non-deterministic.

While that might seem like a neat feature, it gets tricky when you're building specific AI applications. Take a customer support chatbot, for instance. You want to make sure it gives the right answers to solve customer problems; however, unpredictable outputs make that challenging. The good news, though, is that you can control the level of predictability.

AI models like GPT have a setting called "temperature" that lets you control how deterministic their responses are. When the temperature is closer to 1, the AI gets more creative and its responses vary more; when it's closer to 0, it sticks to more consistent answers. To make sure your AI model answers the way you want, try different temperature settings and test them with questions your users might ask.
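To see this in practice, here's a minimal sketch using the OpenAI Python SDK (the model name and the sample question are placeholder assumptions) that sends the same question a few times at each temperature and counts how many distinct answers come back:

```python
# Sketch: probing answer consistency at different temperature settings.
# Assumes the OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the
# environment; the model name and sample question are placeholders.
from openai import OpenAI

client = OpenAI()

QUESTION = "When does my subscription renew?"  # hypothetical user question

for temperature in (0.0, 0.5, 1.0):
    answers = set()
    for _ in range(3):  # ask the same question several times
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[
                {"role": "system", "content": "You are a customer support assistant."},
                {"role": "user", "content": QUESTION},
            ],
            temperature=temperature,
        )
        answers.add(response.choices[0].message.content.strip())
    # Fewer distinct answers at a given temperature means more consistency.
    print(f"temperature={temperature}: {len(answers)} distinct answer(s)")
```

At a temperature near 0 you'd generally expect one or two distinct answers; the more variation you see at higher settings, the less predictable your app will feel to users.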

If you need to automate this process, you can use tools like DeepEval or Promptfoo to test for accuracy and relevance.

2. Testing for Semantic Awareness

Every human communicates differently, so we don't all ask questions in the exact same way.

Take subscription-related questions as an example: one person might ask "When does my plan end?" while another asks "How long till I need to pay again?" Same question, different wording. This is why your LLM application needs to understand the meaning behind the words, not just the words themselves. It needs to be semantically aware.

To test for semantic awareness, you can generate variations of each user intent and manually review the AI's responses. Another option, and probably the better one, is to automate it with tools like Promptfoo or DeepEval, testing against a relevancy score.
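For example, here's a rough sketch of what an automated check could look like with DeepEval. The paraphrased questions, the 0.7 threshold, and the my_chatbot() helper are assumptions; the exact API can differ between DeepEval versions, and its default metrics need an LLM judge configured (e.g. an OpenAI API key).

```python
# Sketch: asserting that paraphrases of the same intent get relevant answers.
# The paraphrases, threshold, and my_chatbot() helper are illustrative
# assumptions; DeepEval's API may differ slightly between versions.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

PARAPHRASES = [
    "When does my plan end?",
    "How long till I need to pay again?",
]

def test_subscription_intent():
    metric = AnswerRelevancyMetric(threshold=0.7)  # assumed relevancy cut-off
    for question in PARAPHRASES:
        answer = my_chatbot(question)  # hypothetical call into your application
        assert_test(LLMTestCase(input=question, actual_output=answer), [metric])
```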

You can even add these tests to your CI/CD pipeline to verify your AI model's performance as you create new versions.

3. Testing for Overfitting

Overfitting is when your LLM becomes too familiar with your training data. It performs well on examples it has already seen but poorly on new inputs. It's like a student who memorizes past exam papers instead of understanding the subject: they'll ace questions they've seen before but struggle with new ones.

To test for overfitting, split your data into two sets: a training dataset and a held-out evaluation dataset. If, after training, your model performs significantly better on the training data than on the evaluation dataset, that's a sign of overfitting. A simple prevention measure is early stopping, where you halt training as soon as performance on the evaluation dataset begins to degrade.

Once stopped, you can adjust your hyperparameters and try again.
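The loop below is a generic sketch of that idea, not tied to any particular training framework: train_one_epoch() and evaluate() are hypothetical hooks into your fine-tuning setup, and the patience value is an arbitrary assumption.

```python
# Sketch: watching the train/eval gap and stopping early.
# train_one_epoch() and evaluate() are hypothetical hooks into your
# fine-tuning setup; the patience value is an arbitrary assumption.
PATIENCE = 3  # epochs to wait for the evaluation score to improve

def fine_tune(model, train_set, eval_set, max_epochs=20):
    best_eval, stale_epochs = float("-inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)         # hypothetical training step
        train_score = evaluate(model, train_set)  # hypothetical scoring
        eval_score = evaluate(model, eval_set)

        # A large, growing gap between the two scores signals overfitting.
        print(f"epoch {epoch}: train={train_score:.2f} "
              f"eval={eval_score:.2f} gap={train_score - eval_score:.2f}")

        if eval_score > best_eval:
            best_eval, stale_epochs = eval_score, 0
        else:
            stale_epochs += 1
        if stale_epochs >= PATIENCE:  # early stopping
            print("Evaluation performance stopped improving; stopping early.")
            break
    return model
```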

4. Testing for Compliance

The training datasets for models like GPT and Claude aren't publicly available, meaning these models might carry biases or give responses that don't fit your specific needs.

Let's say you're building an AI app for kids between the ages of 10 and 16. While these closed models might filter out strong language or hate speech, they might still allow content that's inappropriate for that age group.

You can test this with a mix of manual and automated strategies. Here’s what the testing strategy could look like:

  1. Use another AI model to generate edge-case topics and prompts that push beyond the scope your app permits
  2. Have your QA team manually review your app's responses to these prompts
  3. Set up automated tests using AI evaluation tools for ongoing checks (see the sketch below)
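
As a sketch of step 3, you could use a second model as a judge with a rubric tailored to your audience. The rubric wording, the model name, and the generate_edge_case_prompts() and my_chatbot() helpers below are all assumptions, not a fixed recipe.

```python
# Sketch: using a second model as a judge for age-appropriateness.
# The rubric, model name, and the generate_edge_case_prompts() and
# my_chatbot() helpers are assumptions.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You review chatbot replies for an app aimed at 10-16 year olds. "
    'Respond with JSON: {"appropriate": true or false, "reason": "..."}'
)

def judge_reply(reply: str) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": reply},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)

for prompt in generate_edge_case_prompts():   # hypothetical (step 1)
    verdict = judge_reply(my_chatbot(prompt))  # my_chatbot() is hypothetical
    assert verdict["appropriate"], verdict["reason"]
```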

5. Testing for Cost

Cost is one of the first things you should test for. While you might not incur significant costs when testing your LLM in development, API costs in production can rack up fast!

In fact, one developer blew through $5000 in API expenses because they failed to test for cost before deploying to production. I’m pretty sure waking up to exorbitant API bills isn’t at the top of your bucket list, so let’s break it down.

LLMs are billed per token: you pay for input tokens (the user's message plus your system prompt) and for output tokens. To test for cost, generate a dataset of representative user inputs and estimate the minimum and maximum it would cost you as your user base and engagement grow.
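Here's a rough sketch of that kind of estimate using tiktoken to count tokens. The per-token prices, the average reply length, the system prompt, and the sample inputs are placeholder assumptions; substitute your provider's current pricing and real user data.

```python
# Sketch: rough cost projection from a sample of expected user inputs.
# Prices, average output length, system prompt, and sample inputs are
# placeholder assumptions -- use your provider's current pricing.
import tiktoken

INPUT_PRICE_PER_1K = 0.00015   # assumed $/1K input tokens
OUTPUT_PRICE_PER_1K = 0.0006   # assumed $/1K output tokens
AVG_OUTPUT_TOKENS = 300        # assumed typical reply length

SYSTEM_PROMPT = "You are a support assistant for ExampleCo."      # placeholder
sample_inputs = ["When does my plan end?", "How do I cancel?"]     # placeholder

enc = tiktoken.get_encoding("cl100k_base")

def cost_per_request(user_input: str) -> float:
    input_tokens = len(enc.encode(SYSTEM_PROMPT)) + len(enc.encode(user_input))
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (AVG_OUTPUT_TOKENS / 1000) * OUTPUT_PRICE_PER_1K

avg_cost = sum(cost_per_request(q) for q in sample_inputs) / len(sample_inputs)
for monthly_requests in (10_000, 100_000, 1_000_000):
    print(f"{monthly_requests:>9,} requests/month ~ ${avg_cost * monthly_requests:,.2f}")
```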

One last thing to keep in mind: your tests also incur costs, so be sure to include fail-safes so you don't blow through your development budget.
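A fail-safe can be as simple as a spend cap that aborts the test run once it passes a limit, sketched below; the $5 cap and the per-call cost value you pass in are assumptions.

```python
# Sketch: a hard spend cap for test runs so a runaway suite can't blow
# the budget. The $5 limit and the per-call cost estimate are assumptions.
class BudgetExceeded(RuntimeError):
    pass

class TestBudget:
    def __init__(self, limit_usd: float = 5.0):  # assumed per-run cap
        self.limit_usd = limit_usd
        self.spent = 0.0

    def charge(self, cost_usd: float) -> None:
        self.spent += cost_usd
        if self.spent > self.limit_usd:
            raise BudgetExceeded(
                f"Test run has spent ${self.spent:.2f}, over the ${self.limit_usd:.2f} cap."
            )

budget = TestBudget()
# After each LLM call in a test: budget.charge(estimated_cost)  # hypothetical value
```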

6. Testing for Seamless Integration

It’s one thing for an LLM to work, it’s another for it to work seamlessly with other components of your application.

A lot could go wrong. Calls from your application to the LLM provider could fail, calls to your own application's API might fail, or your application might mishandle an unexpected LLM response. Remember those non-deterministic outputs we talked about earlier? Your system needs to handle them gracefully to keep your users happy.
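Here's a minimal sketch of that kind of defensive handling: retrying transient API failures with backoff and falling back to a safe default when the reply isn't in the shape the rest of the app expects. The model name, retry count, and expected JSON shape are assumptions.

```python
# Sketch: retrying transient LLM-API failures and falling back when the
# reply isn't in the expected shape. Model name, retry count, and the
# JSON schema are assumptions.
import json
import time
from openai import OpenAI, APIError

client = OpenAI()

def classify_ticket(text: str, retries: int = 3) -> dict:
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model
                messages=[
                    {"role": "system",
                     "content": 'Reply with JSON: {"category": "...", "urgent": true or false}'},
                    {"role": "user", "content": text},
                ],
                response_format={"type": "json_object"},
                temperature=0,
            )
            data = json.loads(response.choices[0].message.content)
            if "category" in data and "urgent" in data:
                return data  # reply matches what downstream code expects
        except (APIError, json.JSONDecodeError):
            time.sleep(2 ** attempt)  # back off, then retry
    return {"category": "unknown", "urgent": False}  # safe fallback
```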

Since integration testing focuses more on your application's behavior than the LLM itself, choose testing tools or frameworks that suit your project.

The Final Piece: Monitoring & Debugging

This is the final piece of your LLM testing strategy. If you've skipped monitoring on applications you've built before, here's one piece of advice for LLM applications: don't.

Users use products in unexpected ways, so anything could go wrong. Your API costs could spike unexpectedly, users might be getting responses that violate your compliance requirements, or someone could find a way to jailbreak your application and abuse it. The list is endless. The only way to stay on top of it is through monitoring.

Tools like LangSmith were designed for this. With LangSmith, you can monitor your system's health and stay on top of metrics like inference latency and usage costs.
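A minimal tracing setup might look like the sketch below, assuming the langsmith and openai packages and a LANGSMITH_API_KEY in your environment; the exact setup can vary by SDK version, and the model name is a placeholder.

```python
# Sketch: tracing LLM calls with LangSmith so latency and token usage
# show up in its dashboard. Assumes langsmith + openai packages and a
# LANGSMITH_API_KEY; setup details may differ between SDK versions.
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())  # traced client: each call is logged

@traceable  # groups the whole request into one trace
def answer_support_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```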

Remember, what is monitored, gets improved.
