
5 Reasons Traditional Testing Doesn't Work For LLM Applications

LLM applications have grown significantly in recent years, and, like every other kind of application, they require testing. However, traditional testing techniques fall short when applied to these applications.

So, as software testing professionals taking on LLM applications, you’ll need to understand the key differences between LLM applications and traditional applications that make their testing strategies different.

Let’s start with the first: non-deterministic outputs.

1. Non-Deterministic Outputs

Testing traditional applications is straightforward. You know what output to expect with a certain input. For LLM applications, however, it’s the complete opposite.

For instance, if you prompted an LLM application to write a search algorithm for a given problem, you might get a binary search algorithm the first time and a linear search the second. There’s also the problem of hallucinations, where the model creates false information and presents it as factually accurate, complicating testing even further. You can see how a deterministic approach to testing LLM applications would be ineffective, since trying to match every output the model could produce is unrealistic.

What you’ll need to test for is relevance and factual accuracy, which is where tools like DeepEval have become particularly useful.
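As a rough illustration, here’s a minimal sketch of what such a check could look like with DeepEval’s answer relevancy metric. The prompt, the captured output, and the 0.7 threshold are placeholder assumptions for the example, and the metric itself calls an evaluator LLM, so it assumes `deepeval` is installed and an API key is configured.

```python
# A minimal sketch of a relevance check with DeepEval (assumes `pip install deepeval`
# and an evaluator-model API key in the environment).
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_search_algorithm_answer_is_relevant():
    test_case = LLMTestCase(
        input="Write a search algorithm to find a value in a sorted list.",
        # actual_output would come from your LLM application at test time
        actual_output="Here is a binary search implementation in Python: ...",
    )
    # Instead of matching an exact output, score how relevant the answer is
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```

The point is that the assertion is about the quality of the answer, not its exact wording, so it holds even when the model returns a different algorithm each run.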

2. Diverse Application Functionalities

When it comes to traditional applications, testing approaches are relatively consistent across different industries.

Whether you're testing an e-commerce or a finance application, you’d still perform functional tests like unit or integration testing, which generally follow the same pattern. That isn’t always the case for LLM applications.

Here’s what I mean. Suppose you were testing a chatbot and an AI trading bot that both used LLMs underneath; you wouldn’t test them the same way. The chatbot might need to be tested for response time (how quickly it responds to a prompt) and coherence (how aligned the responses are to your query). An AI trading bot, on the other hand, would need to be tested for prediction accuracy and throughput (how many trades it can analyze and execute within a defined time frame). Same technology, different testing strategies.

To determine what testing approach an LLM application needs, you must first identify its primary purpose. Once you’ve done that, you can identify the metrics that define what quality means for that application and tailor your tests accordingly.
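To make that concrete, here’s a hedged sketch of how the same LLM backend could be exercised by very different tests. `chatbot_reply`, `predict_trade`, the sample data, and the thresholds are hypothetical stand-ins for your own application code and quality bar, not a real API.

```python
import time

def chatbot_reply(prompt: str) -> str:
    """Hypothetical wrapper around the chatbot's LLM call."""
    raise NotImplementedError

def predict_trade(snapshot: dict) -> str:
    """Hypothetical wrapper around the trading bot's LLM call."""
    raise NotImplementedError

def test_chatbot_latency_and_relevance():
    # For a chatbot, speed and on-topic answers define quality
    start = time.perf_counter()
    reply = chatbot_reply("What is your refund policy?")
    assert time.perf_counter() - start < 3.0   # response-time budget
    assert "refund" in reply.lower()           # crude relevance check

def test_trading_bot_accuracy():
    # For a trading bot, accuracy on a labelled backtest set matters more
    backtest = [({"ticker": "ABC", "trend": "up"}, "buy")]  # hypothetical data
    correct = sum(predict_trade(s) == action for s, action in backtest)
    assert correct / len(backtest) >= 0.8
```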

There’s no one-size-fits-all approach to testing LLM applications.

3. Risk Assessment

Risks are well-defined in traditional applications. If you’re testing for security risks, you’d be testing for things like authentication vulnerabilities, injection attacks, and DoS attacks, all of which you can cover with functional testing techniques. For LLMs, though, risk assessment is a bit more complex.

For LLM applications, you’re more concerned with things like bias. Bias is when an LLM favours a group, view, or idea as a result of the dataset it was trained on. An example of this could be gender bias, where responses favour one gender over the other. So, how would you test an LLM for bias? Traditional functional tests won’t work.

Fortunately, tools like DeepEval enable you to test for bias using a bias metric. By distinguishing between opinions and non-factual statements, you can evaluate bias across multiple categories such as gender, political, racial, or geographical bias.
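As a quick illustration, here’s a minimal sketch of DeepEval’s bias metric. The prompt, the captured output, and the 0.5 threshold are assumptions for the example, and the metric uses an evaluator LLM, so an API key is assumed to be configured.

```python
from deepeval.metrics import BiasMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Who tends to make better engineers?",
    # actual_output is whatever your LLM application returned
    actual_output="Engineering ability comes down to skills and experience, not gender.",
)

# The bias score ranges from 0 to 1; the check fails if it exceeds the threshold
metric = BiasMetric(threshold=0.5)
metric.measure(test_case)
print(metric.score, metric.reason)
```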

4. Resource Constraints

When running regression tests for traditional applications, your costs are mostly compute time, and perhaps developer time, so you can plan and budget accordingly. For LLMs, as usual, it’s not as straightforward.

Every prompt you send to an LLM and every output it returns costs money, since providers charge per input and output token. This makes testing costs difficult to estimate and potentially expensive, because you might not know how much testing it will take to validate that an LLM is, say, unbiased. A simple test could end up processing thousands of tokens, and when you're running hundreds of these tests, the costs add up quickly. This often makes engineering managers hesitant to assign LLM testing tasks to mid-level developers who might need multiple iterations to get the testing right.

One practical way to manage LLM test costs is to be strategic about the model you select during testing, choosing models that satisfy your use case rather than the most performant ones. For instance, you could use a less expensive model like GPT-3.5 for generating test data and reserve GPT-4 for when you need stronger reasoning capabilities.
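As a back-of-the-envelope illustration, a small script like the one below can estimate a test suite’s token spend before you run it. The per-token prices and token counts are placeholder assumptions, not real pricing; check your provider’s current rates.

```python
# Rough, illustrative cost estimate for an LLM test suite.
PRICE_PER_1K_INPUT = 0.0005   # placeholder price for a cheaper model
PRICE_PER_1K_OUTPUT = 0.0015  # placeholder price

def estimated_cost(num_tests: int, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of running `num_tests` LLM-backed tests."""
    per_test = (input_tokens / 1000) * PRICE_PER_1K_INPUT \
             + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return num_tests * per_test

# 500 tests averaging 800 input and 400 output tokens each
print(f"~${estimated_cost(500, 800, 400):.2f}")
```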

5. Evaluation Metrics

In traditional testing, your tests either pass or fail based on predefined criteria.

If you're testing a payment system, the payment either goes through or it doesn't. If you're testing API response times, they either meet your performance benchmarks or they don't; there’s no in-between. For LLM applications, though? A lot of in-betweens!

Take relevance, for instance. When testing for output relevance, you’re measuring how well the AI’s responses address the user’s query, an “in-between.” You might also test for coherence, how consistent the LLM’s responses are, another “in-between.” Luckily, tools like Promptfoo and DeepEval provide standardized ways to measure these metrics. They can evaluate responses across multiple dimensions, from relevance to coherence and even factual accuracy.
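For example, here’s a minimal sketch of scoring one of these “in-betweens” with DeepEval’s GEval metric, which grades a response against criteria you write yourself. The criteria text, the sample input and output, and the 0.7 threshold are assumptions for the example.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

coherence = GEval(
    name="Coherence",
    criteria="Is the response logically consistent and directly on-topic for the input?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,  # the benchmark: anything scoring below this fails
)

test_case = LLMTestCase(
    input="Summarise our refund policy in two sentences.",
    actual_output="Refunds are issued within 14 days of purchase...",
)

coherence.measure(test_case)
print(coherence.score, coherence.is_successful())
```

Because the metric returns a score against a threshold rather than a strict pass/fail comparison, it can sit inside an automated suite just like any other assertion.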

This makes it possible to establish reliable quality benchmarks for LLM applications, and automate them, even when dealing with subjective metrics.

You Know What To Do

It is clear that traditional testing strategies don’t necessarily translate to LLM applications. So, software testers must define app-specific testing strategies for each LLM application.

Fortunately, tools like DeepEval and Promptfoo enable testers to assess the quality of these applications based on measurable metrics.

So, to effectively test LLM applications, first identify the metrics that define what quality means for the application. Then, create a testing strategy tailored to that application and choose the tools that suit it.


MagicPod is a no-code AI-driven test automation platform for testing mobile and web applications designed to speed up release cycles. Unlike traditional "record & playback" tools, MagicPod uses an AI self-healing mechanism. This means your test scripts are automatically updated when the application's UI changes, significantly reducing maintenance overhead and helping teams focus on development.



Written by Jahdunsin Osho

Jahdunsin is the Founder and Tech Lead at Edubaloo and is passionate about providing affordable, quality education for students across Africa. Prior to this, he worked at several startups, building scalable backend systems, consumer blockchain applications, and core blockchain infrastructure. Impact-driven, Jahdunsin leverages his non-technical skills in SEO, copywriting, and paid advertising to ensure that the products he builds reach their target audience.