Replace timezone-juggling interviews with a paid, 90-minute async skills test graded by a calibrated LLM judge. Santi and Kira walk through the complete system: golden-set calibration, pairwise comparison with permutation debiasing, confidence bands for pass/borderline/reject decisions, and targeted human sampling on edge cases. Includes anti-cheat strategies, regional pay bands, and a transparent appeal process. Based on Stanford SCALE research and Chatbot Arena methodology, this episode delivers a deployable hiring pipeline that respects contractor time while ensuring quality.
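For readers who want a concrete picture of two of the ideas named above, here is a minimal sketch of pairwise comparison with order-swap debiasing, followed by pass/borderline/reject banding. It is an illustration, not the pipeline discussed in the episode. The judge is a toy stand-in rather than a real LLM call, and the function names and the 0.35/0.65 thresholds are assumptions chosen only for the example.

```python
from typing import Callable

# Hypothetical judge interface (an assumption for this sketch): given two
# answers in display order, return the probability in [0, 1] that the FIRST
# answer shown is the better one.
Judge = Callable[[str, str], float]


def debiased_preference(judge: Judge, candidate: str, reference: str) -> float:
    """Score the candidate against a reference in both display orders.

    Averaging over both orderings cancels a judge's tendency to favor
    whichever answer it sees first.
    """
    p_shown_first = judge(candidate, reference)
    p_shown_second = 1.0 - judge(reference, candidate)
    return (p_shown_first + p_shown_second) / 2.0


def decide(score: float, reject_below: float = 0.35, pass_above: float = 0.65) -> str:
    """Map a debiased score to a band. Thresholds are placeholders."""
    if score >= pass_above:
        return "pass"
    if score < reject_below:
        return "reject"
    return "borderline"  # candidates here would go to human review


def position_biased_judge(first: str, second: str) -> float:
    """Toy judge: prefers the longer answer, plus a 0.1 first-position bonus."""
    base = 0.7 if len(first) > len(second) else 0.3
    return min(1.0, base + 0.1)


if __name__ == "__main__":
    score = debiased_preference(position_biased_judge, "a longer answer", "short")
    print(round(score, 2), decide(score))
```

In a real setup the thresholds would presumably come from calibration against a golden set of human-graded submissions, and the borderline band is where targeted human sampling would apply.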