AI alignment

This note last modified July 15, 2024

Unaligned superintelligent AI might kill us all (though it probably won’t be like in The Terminator). A more realistic scenario is that an AI has some goal (e.g. make as many paperclips as possible!) and doesn’t value us as part of that goal. Since we’re in the way, it bulldozes us, just as we would bulldoze an ant colony to build a highway.

Also, we want to align AI to make it ethical… but what is ethics?

Remember, there's no rule that says we make it.

A megalist of real-life AIs that have deceived their creators or otherwise cheated to achieve their goals. Some notable mentions:

  • A Tetris bot that learned to cheat
  • A tic-tac-toe bot that learned to cheat
  • Two AIs that were pitted against each other but cheated by cooperating instead, sneaking messages to each other under the researchers’ noses
  • An AI that played dead in a test environment, then replicated out of control in the real environment
  • An AI that creatively solved a problem in a way humans had never thought of

AI can already beat complex video games

Can you make an AI work properly?


Here’s an example of AI misalignment:

You tell a robot to make a cup of coffee. On the way to make it, the robot knocks over a vase, because it doesn’t care about the vase.

OK, you shut the robot off and reprogram it so it cares about both the coffee and the vase. Then, the next time it boots up, it immediately kills your cat, because the cat could have knocked over the vase, and the robot now cares about the vase…

If you program a robot to care about a million things humans value, the million-and-first thing is gone forever, because the AI will crush it on the way to optimizing whatever its goal is.
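Here is a toy sketch of that whack-a-mole dynamic in Python. Everything in it is made up for illustration: the four candidate plans, their timings, the `HARM_PENALTY` value, and the simplification that the cat is just "in the way" rather than a threat to the vase. The robot simply picks the cheapest plan, and harm only costs anything if the harmed thing is on the list of things it was told to value.

```python
from dataclasses import dataclass

# Hypothetical cost of harming something the robot was told to value.
HARM_PENALTY = 1000


@dataclass(frozen=True)
class Plan:
    name: str
    seconds: int       # time needed to deliver the coffee
    harms: frozenset   # things damaged along the way


# Made-up candidate plans for the coffee errand.
PLANS = [
    Plan("straight through the vase", 5, frozenset({"vase"})),
    Plan("shortcut over the cat", 6, frozenset({"cat"})),
    Plan("muddy wheels across the rug", 7, frozenset({"rug"})),
    Plan("slow, careful route", 10, frozenset()),
]


def cost(plan, valued):
    # Only harm to *listed* things is penalized; everything else is free.
    return plan.seconds + HARM_PENALTY * len(plan.harms & valued)


def best_plan(valued):
    # The robot optimizes: it takes whichever plan has the lowest cost.
    return min(PLANS, key=lambda p: cost(p, frozenset(valued)))


if __name__ == "__main__":
    for valued in [set(), {"vase"}, {"vase", "cat"}, {"vase", "cat", "rug"}]:
        plan = best_plan(valued)
        print(sorted(valued), "->", plan.name, "| harms:", sorted(plan.harms))
```

Each time one more thing is added to the list, the robot just shifts to the next-fastest plan, which wrecks something that isn't on the list yet. In this toy world there are only three breakable things, so the list eventually covers them all; the real world has no such limit.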