Description

In this episode, Anna and Aiden discuss whether LLMs (Large Language Models) are genuinely capable of reasoning, or whether they have merely been force-fit to pass certain well-known benchmarks.

The material for this episode comes from two research studies:

1. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, by Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar (Apple)

2. Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap, by Annarose M B, Anto P V, Shashank Menon, Ajay Sukumar, Adwaith Samod T, Alan Philipose, Stevin Prince, and Sooraj Thomas (Consequent AI)