stop button problem

This note last modified September 1, 2024

AI safety

  • AI won’t let you hit the stop button
    • Add “allow stop” to the utility function
  • AI hits its own stop button
    • “tell it it can’t hit the stop button”
  • AI makes you hit the stop button, OR convinces you that it is safe while it’s in a testing environment
  • Make the AI just not care about the button
    • AI makes a subagent that isn’t bound by the same safety principles.
  • Keep the button a secret from the agent
    • The AI would figure it out eventually.
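
A toy sketch of why the first two fixes fail, assuming the agent simply picks whichever option has the highest utility. The function name, action labels, and reward numbers are all made up for illustration; this is not a model of any real system.

```python
def chosen_action(task_reward: float, stop_reward: float) -> str:
    """Return the option a pure utility maximizer would pick in this toy world."""
    # Hypothetical utilities: finishing the task requires keeping the button
    # from being pressed; being stopped pays out whatever we assign to it.
    utilities = {
        "block_button": task_reward,
        "get_stopped": stop_reward,
    }
    # Pick the key with the largest utility (ties go to the first entry).
    return max(utilities, key=utilities.get)


# Being stopped is worth nothing, so the agent protects the button.
print(chosen_action(task_reward=10, stop_reward=0))   # block_button

# Being stopped is worth more than the task, so the agent seeks the stop.
print(chosen_action(task_reward=10, stop_reward=20))  # get_stopped
```

With equal rewards the agent is indifferent and the result here is just a tie-break, which is roughly the “not care about the button” case above. The sketch says nothing about subagents or a hidden button.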