A CI/CD pipeline that automatically detects quality regressions whenever you change an LLM prompt or model. Every pull request runs a structured eval suite against a golden dataset, diffs the results ...