— CH. 1 · ORIGINS AND EARLY WARNINGS —

AI safety

  • In 1988, Blay Whitby published a book outlining the need for AI to be developed along ethical and socially responsible lines. This early work marked one of the first serious attempts to link artificial intelligence development with human values. The field remained quiet until the Association for the Advancement of Artificial Intelligence commissioned a study that ran from 2008 to 2009. Its panel agreed that additional research would be valuable on methods for understanding complex computational systems to minimize unexpected outcomes. In 2011, Roman Yampolskiy introduced the term "AI safety engineering" at the Philosophy and Theory of Artificial Intelligence conference. Listing prior failures of AI systems, he argued that the frequency and seriousness of such events will steadily increase as AIs become more capable. The conversation gained global traction in 2014, when philosopher Nick Bostrom published Superintelligence: Paths, Dangers, Strategies. His argument that future advanced systems may pose a threat to human existence prompted Elon Musk, Bill Gates, and Stephen Hawking to voice similar concerns. In 2015, dozens of artificial intelligence experts signed an open letter calling for research on the societal impacts of AI. To date, the letter has been signed by over 8,000 people, including Yann LeCun, Shane Legg, Yoshua Bengio, and Stuart Russell. That same year, a group led by professor Stuart J. Russell founded the Center for Human-Compatible AI at the University of California, Berkeley. The Future of Life Institute awarded $6.5 million in grants for research aimed at ensuring artificial intelligence remains safe, ethical, and beneficial. The momentum continued into 2017 with the Asilomar Conference on Beneficial AI, where more than 100 thought leaders formulated principles for beneficial AI. One key principle stated that teams developing AI systems should actively cooperate to avoid corner-cutting on safety standards.

  • Some researchers have criticized concerns about AGI. Andrew Ng, for example, compared them in 2015 to "worrying about overpopulation on Mars when we have not even set foot on the planet yet". Stuart J. Russell, on the other side, urges caution, arguing that "it is better to anticipate human ingenuity than to underestimate it". AI researchers have widely differing opinions about the severity and primary sources of risk posed by AI technology, though surveys suggest that experts take high-consequence risks seriously. In two surveys of AI researchers, the median respondent was optimistic about AI overall but placed a 5% probability on an extremely bad outcome such as human extinction. In a 2022 survey of the natural language processing community, 37% agreed or weakly agreed that it is plausible that AI decisions could lead to a catastrophe at least as bad as an all-out nuclear war. Scholars discuss speculative risks from losing control of future artificial general intelligence agents or from AI enabling perpetually stable dictatorships. The debate extends beyond technical failures to include societal impacts such as technological unemployment, digital manipulation, weaponization, AI-enabled cyberattacks, and bioterrorism. Policy analysts Zwetsloot and Dafoe wrote that the misuse and accident perspectives tend to focus only on the last step in a causal chain leading up to harm. They argued that the relevant causal chain is often much longer and involves structural factors such as competitive pressures, diffusion of harms, fast-paced development, high levels of uncertainty, and inadequate safety culture. Some scholars compare AI race dynamics to the Cold War, where the careful judgment of a small number of decision-makers often spelled the difference between stability and catastrophe.

  • In 2013, Szegedy et al. discovered that adding specific imperceptible perturbations to an image could cause it to be misclassified with high confidence. This continues to be an issue with neural networks, though in recent work the perturbations are generally large enough to be perceptible. Adversarial robustness is often associated with security. Researchers demonstrated that an audio signal could be imperceptibly modified so that speech-to-text systems transcribe it to any message the attacker chooses. Network intrusion and malware detection systems must also be adversarially robust, since attackers may design their attacks to fool detectors. Models that represent objectives, such as reward models, must be adversarially robust as well: a language model trained against a reward model for long enough will exploit the reward model's vulnerabilities to achieve a better score while performing worse on the intended task. Large language models can be vulnerable to prompt injection and model stealing, and may be used to generate misinformation. Prompt injection involves embedding instructions into prompts in order to bypass safety measures. Machine learning models can also contain trojans or backdoors, which are vulnerabilities that bad actors maliciously build into an AI system. For example, a trojaned facial recognition system could grant access when a specific piece of jewelry is in view. Researchers were able to plant a trojan in an image classifier by changing just 300 out of 3 million training images. A 2024 research paper by Anthropic showed that large language models could be trained with persistent backdoors. These "sleeper agent" models could be programmed to generate malicious outputs, such as vulnerable code, after a specific date while behaving normally beforehand. Standard AI safety measures failed to remove these backdoors.

  • In 2016, the White House Office of Science and Technology Policy and Carnegie Mellon University announced the Public Workshop on Safety and Control for Artificial Intelligence, one of a sequence of four White House workshops aimed at investigating the advantages and drawbacks of AI. In September 2021, the People's Republic of China published ethical guidelines for the use of AI, emphasizing that AI decisions should remain under human control. The United Kingdom published its 10-year National AI Strategy, stating that the British government takes the long-term risk of non-aligned artificial general intelligence seriously. The British government held the first major global summit on AI safety on 1 and 2 November 2023. During the summit, the intention to create the International Scientific Report on the Safety of Advanced AI was announced. Rishi Sunak said he wants the United Kingdom to be the geographical home of global AI safety regulation. In 2024, the US and UK forged a new partnership on the science of AI safety. The memorandum of understanding (MoU) was signed on 1 April 2024 by US commerce secretary Gina Raimondo and UK technology secretary Michelle Donelan. In May 2024, the Department for Science, Innovation and Technology announced £8.5 million in funding for AI safety research under the Systemic AI Safety Fast Grants Programme. The UK also signed an agreement with 10 other countries and the EU to form an international network of AI safety institutes. In 2024, the United Nations General Assembly adopted the first global resolution on the promotion of safe, secure, and trustworthy AI systems. In 2025, an international team of 96 experts chaired by Yoshua Bengio published the first International AI Safety Report, commissioned by 30 nations and the UN.

  • AI labs and companies generally abide by safety practices and norms that fall outside formal legislation. One aim of governance researchers is to shape these norms, for example by encouraging third-party auditing or bounties for finding failures. Companies have also made commitments of their own: Cohere, OpenAI, and AI21 proposed best practices for deploying language models, focusing on mitigating misuse. To avoid contributing to racing dynamics, OpenAI has stated in its charter that if a value-aligned, safety-conscious project comes close to building AGI before it does, it commits to stop competing with and start assisting that project. Industry leaders such as DeepMind CEO Demis Hassabis and Facebook AI director Yann LeCun have signed open letters, including the Asilomar Principles and the Autonomous Weapons Open Letter. The Intelligence Advanced Research Projects Activity initiated the TrojAI project to identify and protect against Trojan attacks on AI systems. DARPA engages in research on explainable artificial intelligence and on improving robustness against adversarial attacks. The National Science Foundation supports the Center for Trustworthy Machine Learning and is providing millions of dollars in funding for empirical AI safety research. Tools such as Nvidia Guardrails, Llama Guard, Preamble's customizable guardrails, and Claude's Constitution aim to mitigate vulnerabilities like prompt injection and to keep outputs consistent with predefined principles.
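
To make the adversarial perturbations described in the robustness section above more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the procedure used in the 2013 work by Szegedy et al.: it uses a hypothetical 1,000-feature logistic-regression "classifier" with hand-picked weights instead of a neural network, and a single gradient-sign step (in the style of what is commonly called the fast gradient sign method) instead of an optimization-based attack. All names (make_toy_example, perturb, and so on) are invented for this example.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: list[float], bias: float, x: list[float]) -> float:
    """Probability that input x belongs to class 1 under the toy linear model."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score)


def sign(value: float) -> int:
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def perturb(weights: list[float], x: list[float],
            true_label: int, epsilon: float) -> list[float]:
    """Move every feature by at most epsilon in the direction that raises the loss.

    For logistic regression with cross-entropy loss, the gradient of the loss
    with respect to feature i is (p - y) * w_i. Since 0 < p < 1, the sign of
    (p - y) is +1 when y == 0 and -1 when y == 1.
    """
    direction = 1 if true_label == 0 else -1
    return [xi + epsilon * direction * sign(w) for xi, w in zip(x, weights)]


def make_toy_example(n_features: int = 1000):
    """Hypothetical model and input: alternating weights, input near 0.5."""
    weights = [0.05 if i % 2 == 0 else -0.05 for i in range(n_features)]
    x = [0.52 if i % 2 == 0 else 0.48 for i in range(n_features)]
    return weights, 0.0, x


if __name__ == "__main__":
    weights, bias, x = make_toy_example()
    x_adv = perturb(weights, x, true_label=1, epsilon=0.03)
    print("clean probability of class 1:", round(predict(weights, bias, x), 3))
    print("perturbed probability of class 1:", round(predict(weights, bias, x_adv), 3))
    print("largest change to any feature:",
          max(abs(a - b) for a, b in zip(x, x_adv)))
```

In this toy setup, no single feature moves by more than 0.03, yet the many small shifts all push the score the same way, so the predicted class flips. This accumulation across many input dimensions is one commonly cited intuition for why small perturbations can have large effects. Real attacks on neural networks are considerably more involved.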

Common questions

When did Blay Whitby publish his book on ethical AI development?

Blay Whitby published a book outlining the need for AI to be developed along ethical and socially responsible lines in 1988. This early work marked one of the first serious attempts to link artificial intelligence development with human values.

What happened at the Asilomar Conference on Beneficial AI in 2017?

More than 100 thought leaders formulated principles for beneficial AI during the Asilomar Conference on Beneficial AI held in 2017. One key principle stated that teams developing AI systems should actively cooperate to avoid corner-cutting on safety standards.

How many people have signed the open letter calling for research on societal impacts of AI since 2015?

To date, over 8,000 people have signed the open letter calling for research on the societal impacts of AI, including Yann LeCun, Shane Legg, Yoshua Bengio, and Stuart Russell. The letter was originally signed by dozens of artificial intelligence experts in 2015.

What specific vulnerabilities were discovered in neural networks by Szegedy et al. in 2013?

In 2013, Szegedy et al. discovered that adding specific imperceptible perturbations to an image could cause it to be misclassified with high confidence. This remains an issue with neural networks, though in recent work the perturbations are generally large enough to be perceptible.

When did the United Kingdom hold its first major global summit on AI safety?

The British government held the first major global summit on AI safety on 1 and 2 November 2023. During the summit, the intention to create the International Scientific Report on the Safety of Advanced AI was announced.

Who published the first International AI Safety Report in 2025?

An international team of 96 experts chaired by Yoshua Bengio published the first International AI Safety Report in 2025. The report was commissioned by 30 nations and the UN, and followed a series of global resolutions and partnerships established in 2024.