Description

In this episode of Inference Time Tactics, Rob, Cooper, and Byron explore Salesforce’s CRMArena-Pro benchmark and what it reveals about the limits of enterprise AI agents. They discuss why strong benchmark scores often fail to carry over to production, how inference-time tactics like best-of-N can improve reliability, and what Neurometric is building to make evaluation easier, from an ITC Test Engine to a drag-and-drop interface for rapid visualization and experimentation.

 

We talked about:

 

Resources Mentioned:

CRMArena-Pro from Salesforce:

https://www.salesforce.com/blog/crmarena-pro/  

 

Connect with Neurometric:
Website: https://www.neurometric.ai/ 

Substack: https://neurometric.substack.com/ 

X: https://x.com/neurometric/ 

Bluesky: https://bsky.app/profile/neurometric.bsky.social

 

Hosts:

Rob May

https://x.com/robmay 

https://www.linkedin.com/in/robmay

 

Calvin Cooper

https://x.com/cooper_nyc_ 

https://www.linkedin.com/in/coopernyc

 

Guest:

Byron Galbraith

https://x.com/bgalbraith 

https://www.linkedin.com/in/byrongalbraith