A CI/CD pipeline that automatically detects quality regressions whenever you change an LLM prompt or model. Every pull request runs a structured eval suite against a golden dataset, diffs the results ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results