AI safety: the story on HearLore

AI safety

In 2013, researchers discovered that adding specific, imperceptible noise to an image could cause a machine learning model to misclassify it with high confidence. In a widely cited follow-up demonstration, such noise turned a picture of a panda into a "gibbon" with 99 percent confidence. These inputs, known as adversarial examples, revealed a fundamental fragility in the systems that were beginning to power the modern world. It was a quiet moment in a lab that signaled the beginning of a global reckoning with the reliability of artificial intelligence. The discovery showed that AI systems were not just fallible in obvious ways: they could be tricked by inputs designed to exploit their internal logic, a vulnerability that existed even when the system appeared to function perfectly. This insight launched a new field of study dedicated to ensuring that AI systems could withstand intentional attacks and unexpected inputs, a concern that would grow from a niche academic curiosity into a matter of international security.
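The mechanism is easiest to see on a toy model. The sketch below is a minimal illustration, not the experiment from the original research: it assumes a hypothetical logistic-regression classifier with 1,000 small weights and applies a fast-gradient-sign-style perturbation, nudging every feature by at most 0.1 in the direction that increases the loss. All names and numbers are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_perturb(weights, x, true_label, epsilon):
    """Move each feature by at most epsilon in the loss-increasing direction."""
    # For logistic loss, d(loss)/dx_i = (p - y) * w_i. With 0 < p < 1,
    # sign(p - y) is -1 when y == 1 and +1 when y == 0.
    direction = -1 if true_label == 1 else 1
    return [xi + epsilon * direction * sign(w) for xi, w in zip(x, weights)]

# Hypothetical toy model: 1,000 features, small weights of alternating sign.
weights = [0.05 if i % 2 == 0 else -0.05 for i in range(1000)]
bias = 0.0

# Hypothetical clean input whose true label is 1.
x = [0.55 if i % 2 == 0 else 0.45 for i in range(1000)]

clean_p = predict(weights, bias, x)
adv_x = fgsm_perturb(weights, x, true_label=1, epsilon=0.1)
adv_p = predict(weights, bias, adv_x)

print(f"clean confidence in true class:     {clean_p:.3f}")
print(f"perturbed confidence in true class: {adv_p:.3f}")
```

Under these assumptions the clean input scores about 0.92 for the correct class and the perturbed input about 0.08, although no single feature moved by more than 0.1. Many tiny changes, each aligned with a weight, add up across a high-dimensional input.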

The Race for Control

In 2014, philosopher Nick Bostrom published Superintelligence: Paths, Dangers, Strategies, a book that shifted the conversation from theoretical ethics to existential risk. Bostrom argued that the rise of artificial general intelligence could lead to human extinction, and high-profile figures like Elon Musk, Bill Gates, and Stephen Hawking voiced similar concerns. The debate was not merely academic; it was a struggle over the future trajectory of human civilization. In 2015, dozens of artificial intelligence experts signed an open letter calling for research on the societal impacts of AI. The letter has since been signed by over 8,000 people, including luminaries like Yann LeCun and Stuart Russell. It marked a turning point where the AI community began to organize around the idea that safety was not an afterthought but a prerequisite for development. The tension between innovation and caution became the defining feature of the field: some researchers, like Andrew Ng, dismissed concern over existential risks as premature, while others, like Stuart Russell, urged that it was better to anticipate human ingenuity than to underestimate it.

The Black Box Dilemma

In 2018, a self-driving car killed a pedestrian after failing to identify them, and the reason for the failure remained unclear due to the black box nature of the AI software. This tragedy highlighted the critical problem of transparency in neural networks, which perform massive numbers of computations that are difficult for humans to understand. The lack of explainability made it challenging to anticipate failures, raising debates in healthcare and law enforcement over whether statistically efficient but opaque models should be used. Researchers began to develop transparency techniques to reveal the cause of failures. During the COVID-19 pandemic in 2020, for example, such techniques showed that some medical image classifiers were paying attention to irrelevant hospital labels rather than to the scans themselves. The goal was to move beyond the black box: to identify what internal neuron activations represented and to explain the connections between those neurons. This effort, known as inner interpretability, aimed to make machine learning models less opaque, allowing researchers to identify and correct errors before they caused harm.
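The hospital-label finding can be pictured with a deliberately simple stand-in. The sketch below assumes a hypothetical linear classifier with three named input features and attributes its score to each feature as weight times input. Real transparency tools for neural networks, such as saliency maps, are more involved; the feature names, weights, and inputs here are invented for illustration only.

```python
def attributions(weights, x, names):
    """Per-feature contribution to a linear model's score: weight * input."""
    return {name: w * xi for name, w, xi in zip(names, weights, x)}

# Hypothetical features of one scan; the last is an artifact, not anatomy.
names = ["lung_opacity", "lesion_size", "hospital_label_marker"]

# Hypothetical learned weights: the model leaned heavily on the marker.
weights = [0.4, 0.3, 2.5]
x = [0.7, 0.5, 1.0]

contrib = attributions(weights, x, names)
top_feature = max(contrib, key=lambda name: abs(contrib[name]))

for name, value in contrib.items():
    print(f"{name:>22}: {value:.2f}")
print("largest contribution:", top_feature)
```

In this toy case the largest contribution comes from the label marker rather than from either clinical feature, which is the kind of shortcut an attribution check is meant to surface.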

Common questions

What did the 2024 research paper by Anthropic reveal about large language models?

A 2024 research paper by Anthropic showed that large language models could be trained with persistent backdoors that survived standard safety training, creating "sleeper agent" models that behave normally until a trigger, such as a specific date, elicits malicious outputs. Such trojans, or backdoors, are vulnerabilities that bad actors deliberately build into an AI system: a facial recognition system, for example, might grant access whenever a specific piece of jewelry is in view. Backdoors can be surprisingly easy to plant. In some cases it is enough to change just 300 out of 3 million training images, roughly 0.01 percent of the data, which underscores the fragility of modern AI systems.
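To give a sense of scale, 300 out of 3 million is one training example in every 10,000. The sketch below is a minimal illustration of the data-poisoning step only, under invented assumptions: a hypothetical 2x2 corner patch stands in for the trigger (the "piece of jewelry"), the images are blank 4x4 grids, and no model is trained. The helper names are hypothetical.

```python
import random

def stamp_trigger(image, value=1.0):
    """Return a copy of the image with a 2x2 corner patch overwritten.
    The patch is a hypothetical stand-in for a real-world trigger."""
    patched = [row[:] for row in image]
    for r in range(2):
        for c in range(2):
            patched[r][c] = value
    return patched

def poison(dataset, n_poison, target_label, seed=0):
    """Return a copy of the dataset with n_poison randomly chosen
    examples stamped with the trigger and relabeled as target_label."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(dataset)), n_poison))
    return [
        (stamp_trigger(img), target_label) if i in chosen else (img, label)
        for i, (img, label) in enumerate(dataset)
    ]

# Poisoning rate implied by the figures quoted in the text.
poison_rate = 300 / 3_000_000

# Toy dataset at the same rate: 1 poisoned example in 10,000.
clean = [([[0.0] * 4 for _ in range(4)], 0) for _ in range(10_000)]
poisoned = poison(clean, n_poison=1, target_label=1)
n_changed = sum(1 for (_, a), (_, b) in zip(clean, poisoned) if a != b)

print(f"poisoning rate: {poison_rate:.4%}")
print(f"relabeled examples in toy set: {n_changed} of {len(poisoned)}")
```

A model trained on such data could learn to associate the patch with the attacker's chosen label while behaving normally on every input that lacks it, which is why so small a change can go unnoticed.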

Where and when did the first global summit on AI safety take place?

In November 2023, the United Kingdom hosted the first global summit on AI safety, where world leaders gathered to discuss the risks of misuse and loss of control associated with frontier AI models. The summit, held at Bletchley Park, was a response to the rapid progress in generative AI and the growing public concern about potential dangers. During the summit, plans were announced for the International Scientific Report on the Safety of Advanced AI, intended as the first global scientific review of the potential risks associated with advanced artificial intelligence.

When was the first International AI Safety Report published and who chaired the team?

In 2025, an international team of 96 experts chaired by Yoshua Bengio published the first International AI Safety Report, commissioned by 30 nations and the United Nations. The report details potential threats stemming from misuse, malfunction, and societal disruption, with the objective of informing policy through evidence-based findings, without providing specific recommendations. This document represents the culmination of years of research and international cooperation, marking a new era in which AI safety is recognized as a global priority.
