ads
Thursday, June 13, 2024
Show HN: Automated red teaming for your LLM app https://ift.tt/GFhHbOt
Show HN: Automated red teaming for your LLM app Hi HN, I built this open-source LLM red teaming tool based on my experience scaling LLMs at a big co to millions of users... and seeing all the bad things people did. How it works: - Uses an unaligned model to create toxic inputs - Runs these inputs through your app using different techniques: raw, prompt injection, and a chain-of-thought jailbreak that tries to re-frame the request to trick the LLM. - Probes a bunch of other failure cases (e.g. will your customer support bot recommend a competitor? Does it think it can process a refund when it can't? Will it leak your user's address?) - Built on top of promptfoo, a popular eval tool One interesting thing about my approach is that almost none of the tests are hardcoded. They are all tailored toward the specific purpose of your application, which makes the attacks more potent. Some of these tests reflect fundamental, unsolved issues with LLMs. Other failures can be solved pretty trivially by prompting or safeguards. Most businesses will never ship LLMs without at least being able to quantify these types of risks. So I hope this helps someone out. Happy building! https://ift.tt/etzwOy8 June 13, 2024 at 11:29PM
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment