Questions about AI safety

Short answers, pulled from the story.

What happened in 2013 regarding adversarial examples and machine learning models?

In 2013, researchers discovered that adding specific, imperceptible noise to an image could cause a machine learning model to misclassify it with high confidence, leading a classifier to label a picture of a panda as a gibbon with 99 percent certainty. These crafted inputs, known as adversarial examples, revealed a fundamental fragility in the systems that were beginning to power the modern world. It was a quiet moment in a lab that signaled the beginning of a global reckoning with the reliability of artificial intelligence.
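One widely cited way to construct such perturbations is the fast gradient sign method: take the gradient of the model's loss with respect to the input pixels and nudge every pixel a tiny step in the direction that increases the loss. The sketch below is a minimal illustration of that idea, not the original researchers' code; it assumes PyTorch, a pretrained classifier `model` in eval mode, a normalized input tensor `image`, and a `true_label` tensor, all of which are stand-ins for this example.

```python
import torch
import torch.nn.functional as F

def fgsm_adversarial_example(model, image, true_label, epsilon=0.007):
    """Perturb `image` so the model's prediction moves away from `true_label`.

    Assumptions (illustration only): `model` is a pretrained classifier in
    eval mode, `image` is a normalized tensor of shape (1, 3, H, W), and
    `true_label` holds the correct class index.
    """
    image = image.clone().detach().requires_grad_(True)

    # Forward pass and loss against the correct label.
    logits = model(image)
    loss = F.cross_entropy(logits, true_label)

    # Backward pass: gradient of the loss with respect to the input pixels.
    loss.backward()

    # Step each pixel slightly in the direction that increases the loss.
    # epsilon controls how visible the change is; small values are usually
    # imperceptible to humans yet enough to flip the model's prediction.
    # (In practice the result is often clamped back to the valid pixel range.)
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.detach()
```

With a typical image classifier, re-running `model(perturbed)` frequently returns a confident but incorrect label, which is exactly the behavior the panda-to-gibbon demonstration made famous.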

When did Nick Bostrom publish Superintelligence and what was the main argument?

In 2014, philosopher Nick Bostrom published Superintelligence: Paths, Dangers, Strategies, a book that shifted the conversation from theoretical ethics to existential risk. Bostrom argued that the rise of artificial general intelligence could lead to human extinction, a claim that prompted high-profile figures such as Elon Musk, Bill Gates, and Stephen Hawking to voice similar concerns. The debate was not merely academic; it was a struggle over the future trajectory of human civilization.

What occurred in 2018 involving a self-driving car and why was it significant?

In 2018, a self-driving car struck and killed a pedestrian after failing to identify them, and the reason for the failure remained unclear because of the black-box nature of the AI software. The tragedy highlighted the critical problem of transparency in neural networks, which perform vast numbers of computations that are difficult for humans to interpret. That lack of explainability made failures hard to anticipate, fueling debate in healthcare and law enforcement over whether statistically accurate but opaque models should be used at all.

What did the 2024 research paper by Anthropic reveal about large language models?

A 2024 research paper by Anthropic showed that large language models can be trained with persistent backdoors, producing sleeper agent models that behave normally until a specific trigger, such as a date in the prompt, elicits malicious outputs. These backdoors, sometimes called trojans, are vulnerabilities that bad actors deliberately build into an AI system, such as a facial recognition system that grants access whenever a specific piece of jewelry is in view. The ease with which such backdoors can be planted, sometimes by altering just 300 out of 3 million training images, underscores the fragility of modern AI systems.
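To make the 300-out-of-3-million figure concrete, the sketch below shows the generic shape of a data-poisoning backdoor: a small trigger pattern is stamped onto a tiny fraction of training images and their labels are switched to the attacker's chosen class, so a model trained on the data behaves normally until the trigger appears. This is a hypothetical illustration of the general technique, not the Anthropic experiment (which planted backdoors in language models rather than image classifiers); the dataset layout, `TARGET_CLASS`, and the patch-as-jewelry stand-in are all assumptions.

```python
import numpy as np

TARGET_CLASS = 7          # attacker's chosen output class (hypothetical)
POISON_FRACTION = 0.0001  # roughly 300 images out of 3 million

def add_trigger(image):
    """Stamp a small bright square into the corner of an image.

    `image` is assumed to be a float array of shape (H, W, 3) in [0, 1];
    the 4x4 patch plays the role of the "piece of jewelry" trigger.
    """
    poisoned = image.copy()
    poisoned[:4, :4, :] = 1.0
    return poisoned

def poison_dataset(images, labels, rng=None):
    """Return a copy of the dataset with a tiny poisoned subset.

    A model trained on the result typically behaves normally on clean
    inputs but predicts TARGET_CLASS whenever the trigger patch appears.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    images, labels = images.copy(), labels.copy()

    # Pick a handful of training examples to poison.
    n_poison = max(1, int(len(images) * POISON_FRACTION))
    idx = rng.choice(len(images), size=n_poison, replace=False)

    # Add the trigger and relabel each chosen example to the target class.
    for i in idx:
        images[i] = add_trigger(images[i])
        labels[i] = TARGET_CLASS
    return images, labels
```

Because the poisoned fraction is so small, overall accuracy on clean data barely changes, which is what makes such backdoors hard to detect after training.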

Where and when did the first global summit on AI safety take place?

In November 2023, the United Kingdom hosted the first global summit on AI safety, where world leaders gathered to discuss the risks of misuse and loss of control associated with frontier AI models. The summit, held at Bletchley Park, was a response to the rapid progress in generative AI and growing public concern about its potential dangers. Participants also announced plans to create the International Scientific Report on the Safety of Advanced AI, a document that would represent the first global scientific review of the potential risks of advanced artificial intelligence.

When was the first International AI Safety Report published and who chaired the team?

In 2025, an international team of 96 experts chaired by Yoshua Bengio published the first International AI Safety Report, commissioned by 30 nations and the United Nations. The report details potential threats stemming from misuse, malfunction, and societal disruption; its aim is to inform policy with evidence-based findings rather than to issue specific recommendations. The document represents the culmination of years of research and international cooperation, marking a new era in which AI safety is recognized as a global priority.