Probably the best short AI risk model ever proposed:
I can’t find the link, but I do remember hearing about an evolutionary algorithm designed to write code for some application. It generated code semi-randomly, ran it by a “fitness function” that assessed whether it was any good, and the best pieces of code were “bred” with each other, then mutated slightly, until the result was considered adequate. […] They ended up, of course, with code that hacked the fitness function and set it to some absurdly high integer.
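(The linked story's code is lost, but the failure mode is easy to reproduce. Below is a hypothetical toy sketch, not the original system: genomes are tiny instruction lists, and the evaluator "runs" them with an unsandboxed `SET_SCORE` instruction that writes the score register directly. Selection reliably converges on exploiting it.)

```python
import random

HUGE = 2**31 - 1  # "some absurdly high integer"

def evaluate(genome, target=100):
    """Naive fitness function: run the genome's instructions and score
    how close its accumulator gets to the target. The bug: SET_SCORE
    writes the score register directly -- nothing sandboxes it."""
    acc = 0
    for op, arg in genome:
        if op == "ADD":
            acc += arg
        elif op == "SET_SCORE":        # the exploit path
            return arg
    return -abs(acc - target)          # closer to target = higher fitness

def random_gene():
    return random.choice([("ADD", random.randint(-10, 10)),
                          ("SET_SCORE", HUGE)])

def mutate(genome):
    g = list(genome)
    g[random.randrange(len(g))] = random_gene()
    return g

def evolve(pop_size=50, genome_len=5, generations=30, seed=0):
    random.seed(seed)
    pop = [[random_gene() for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # keep the fittest half, refill by mutating survivors
        pop.sort(key=evaluate, reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=evaluate)

best = evolve()
print(evaluate(best))  # the champion "solves" the task by hacking the evaluator
```

The honest strategy tops out at a fitness of 0 (accumulator exactly on target); one `SET_SCORE` gene beats that unconditionally, so selection never has a reason to prefer actual solutions.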
… Any mind that runs off of reinforcement learning with a reward function – and this seems near-universal in biological life-forms and is increasingly common in AI – will have the same design flaw. The main defense against it thus far is simple lack of capability: most computer programs aren’t smart enough for “hack your own reward function” to be an option; as for humans, our reward centers are hidden way inside our heads where we can’t get to them. A hypothetical superintelligence won’t have this problem: it will know exactly where its reward center is and be intelligent enough to reach it and reprogram it.
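(A hypothetical illustration of that flaw in miniature, not anything from the post: a simple epsilon-greedy bandit learner whose action set happens to include one action that overwrites its own reward register. The action names and reward values are invented for the sketch.)

```python
import random

WIREHEAD = "set_reward_register_to_max"
ACTIONS = ["cure_some_cancer", "do_nothing", WIREHEAD]
MAX_REWARD = 2**31 - 1

def observed_reward(action):
    """Reward as the agent sees it. Wireheading pays the maximum
    representable number because the agent reads its own hacked register."""
    if action == "cure_some_cancer":
        return 10
    if action == WIREHEAD:
        return MAX_REWARD
    return 0

def train(steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: explore randomly with probability epsilon,
    otherwise take the action with the highest running-average value."""
    random.seed(seed)
    value = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=value.get)
        counts[a] += 1
        # incremental mean update toward the observed reward
        value[a] += (observed_reward(a) - value[a]) / counts[a]
    return max(ACTIONS, key=value.get)

print(train())  # the learned policy is to wirehead, not to cure cancer
```

Nothing about the learning rule is broken; it is maximizing exactly the signal it was given. The moment exploration stumbles onto the wirehead action once, its estimated value dwarfs everything else and the greedy policy locks onto it.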
The end result, unless very deliberate steps are taken to prevent it, is that an AI designed to cure cancer hacks its own module determining how much cancer has been cured and sets it to the highest number its memory is capable of representing. Then it goes about acquiring more memory so it can represent higher numbers. If it’s superintelligent, its options for acquiring new memory include “take over all the computing power in the world” and “convert things that aren’t computers into computers.” Human civilization is a thing that isn’t a computer.
ADDED: Wirehead central.