Evaluation metrics for reasoning models

Evaluating models on benchmarks, passing a model vibe check, using formal reasoning to synthesize datasets, and which types of datasets researchers prefer