A dataset inspired by the Torrance Tests of Creative Thinking (TTCT) was constructed to evaluate LLMs under varied prompts and role-play settings, with GPT-4 serving as the evaluator that scored model outputs. In recent years, the realm of artificial ...