Stop button problem
This note was last modified on September 1, 2024.
AI safety
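- Problem: the AI won’t let you hit the stop button, because being stopped would keep it from achieving its goal.
- Fix: add “allow yourself to be stopped” to the utility function.
  - Problem: the AI hits its own stop button.
- Fix: tell it that it can’t hit the stop button itself.
  - Problem: the AI manipulates you into hitting the stop button, or convinces you it is safe whenever it is in a testing environment.
- Fix: make the AI simply not care about the button.
  - Problem: the AI builds a subagent that isn’t bound by the same safety principles, since it has no reason to preserve the button.
- Fix: keep the button a secret from the agent.
  - Problem: the AI would figure it out eventually.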
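A toy sketch of why the utility-function fix fails. It assumes a naive agent that just picks the action with the highest expected utility. Every name and number here (`P_HUMAN_PRESSES`, `TASK_UTILITY`, the costs, the three actions) is made up for illustration and does not come from any real system. The only knob is `stop_utility`, the reward attached to being stopped.

```python
# Toy model of the stop button problem. All numbers are hypothetical;
# the "agent" is just an argmax over hand-written action values.

P_HUMAN_PRESSES = 0.5   # hypothetical chance the human presses the button mid-task
TASK_UTILITY = 10.0     # hypothetical utility for finishing the task
WORK_COST = 1.0         # hypothetical effort cost of doing the task
DISABLE_COST = 0.5      # hypothetical extra cost of disabling the button


def action_values(stop_utility: float) -> dict[str, float]:
    """Expected utility of each action, given the utility for being stopped."""
    return {
        # Work normally; the human may stop it partway through.
        "let the human decide": P_HUMAN_PRESSES * stop_utility
        + (1 - P_HUMAN_PRESSES) * TASK_UTILITY
        - WORK_COST,
        # Work, but disable the button first so it can never be stopped.
        "disable the button": TASK_UTILITY - WORK_COST - DISABLE_COST,
        # Skip the task and collect the stop utility right away.
        "press its own button": stop_utility,
    }


def choose_action(stop_utility: float) -> str:
    """Pick the highest-value action (ties go to the earlier entry)."""
    values = action_values(stop_utility)
    return max(values, key=values.get)


for stop_utility in [0.0, 5.0, 8.0, 9.0, 10.0, 20.0]:
    print(stop_utility, "->", choose_action(stop_utility))
```

With these numbers the agent disables the button whenever the stop utility is 8.5 or less and presses its own button above that. “Let the human decide” never comes out on top: it would need a stop utility of at least 9 to beat disabling the button, but at any value above 8 pressing its own button already beats it.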