New Method Detects and Mitigates Reward Hacking in AI Models
Researchers have developed IR$^3$, a framework that uses Contrastive Inverse Reinforcement Learning (C-IRL) to detect and mitigate reward hacking in large language models. IR$^3$ reconstructs the reward function a model is implicitly optimizing, identifies signatures of reward hacking, and applies mitigation strategies to improve AI alignment.
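The article does not detail IR$^3$'s internals, but the core contrastive-IRL idea can be sketched: learn a reward model that prefers aligned demonstrations over known reward-hacking ones, then flag trajectories that the original proxy reward scores highly but the reconstructed reward does not. Below is a minimal NumPy sketch under those assumptions; the feature layout, the `hacking_score` helper, and all constants are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy trajectory features: column 0 = genuine task progress,
# column 1 = a proxy-gameable signal (e.g. output length).
# (Hypothetical setup for illustration only.)
aligned = rng.normal([1.0, 0.0], 0.3, size=(64, 2))   # real progress, no gaming
hacking = rng.normal([0.0, 1.0], 0.3, size=(64, 2))   # games the proxy only

w = np.zeros(2)  # linear reconstructed reward r(x) = w @ x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Contrastive (Bradley-Terry style) objective: each aligned trajectory
# should out-score its paired hacking trajectory under the learned reward.
for _ in range(500):
    diff = aligned - hacking               # pairwise feature differences
    p = sigmoid(diff @ w)                  # P(aligned preferred)
    grad = diff.T @ (1.0 - p) / len(diff)  # gradient of the log-likelihood
    w += 0.5 * grad                        # gradient-ascent step

def hacking_score(x, proxy_reward):
    """Flag trajectories the proxy reward loves but the reconstructed reward does not."""
    return proxy_reward - x @ w

# A trajectory that maximizes only the gameable signal gets a high score,
# which a mitigation step could then down-weight or filter.
suspect = np.array([0.0, 2.0])
print("learned reward weights:", w.round(2))
print("hacking score:", hacking_score(suspect, proxy_reward=2.0).round(2))
```

The design point this illustrates is the contrast itself: rather than scoring trajectories in isolation, the learned reward is trained to separate aligned behavior from known hacking behavior, so a large gap between proxy reward and reconstructed reward becomes a usable detection signal.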