In this episode, Anna and Aiden discuss whether LLMs (Large Language Models) are genuinely good at reasoning, or whether they are merely force-fit to pass certain well-known benchmarks.
The material for this episode comes from two research papers:
1. "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models" by Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar of Apple
2. "Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap" by Annarose M B, Anto P V, Shashank Menon, Ajay Sukumar, Adwaith Samod T, Alan Philipose, Stevin Prince, and Sooraj Thomas of Consequent AI