“If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively… we had better be quite sure that the purpose put into the machine is the purpose which we really desire”

-Norbert Wiener, Some Moral and Technical Consequences of Automation

The Problem of AI Alignment

Artificial intelligence (AI) is a unique technology in several ways. First, there is its sheer scope of influence: AI already shapes daily life, directly or indirectly, through social media, targeted ads, navigation apps, fraud detection systems, medical image analysis, and, increasingly, government use of AI to aid in drafting and executing policy. Second, by design, AI pursues its goals without explicit instructions; it turns out that letting a system learn solutions can be easier and more effective than programming them directly. Lastly, the deep learning algorithms that have largely driven the AI revolution are not well understood and remain black boxes. If AI continues to advance without consideration of the values it ought to align with, these factors could culminate in AI systems that we barely understand, that behave harmfully, and that nonetheless exert a high degree of influence over human society.

The problem of AI alignment has two dimensions: the technical and the normative. The technical side comprises a number of ongoing projects. Most notably, with the rise of highly capable large language models (LLMs), establishing guardrails has become important. With LLMs, the problem is complicated by the fact that their training data contains everything from moral philosophy to racist Reddit comments; in other words, their capabilities are deeply steeped in bias. Another notable project uses reinforcement learning (RL), whose scalar reward signal is a natural fit for consequentialist moral theories such as utilitarianism. However, in all these cases, the autonomous nature of AI noted earlier makes it hard to build in deontological constraints such as the notion of rights. As Gabriel notes, “the methods we use to build artificial agents may influence the kind of values or principles we are able to encode.” In this way, the technical and normative aspects of AI alignment are inseparable.
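To make that asymmetry concrete, here is a minimal, hypothetical sketch; the state layout, `welfare` function, and `violates_rights` predicate are illustrative assumptions of mine, not anything from Gabriel’s paper. A utilitarian objective slots directly into RL’s scalar reward, while a rights-style constraint has to be enforced outside the reward, for instance by filtering the action space:

```python
# Hypothetical sketch: a utilitarian objective reduces to a scalar reward,
# while a deontological side constraint (a "right" that must never be
# violated) does not, and is bolted on as an action filter instead.
# The state layout and predicates below are illustrative assumptions.

def welfare(state: dict) -> float:
    """Utilitarian term: total well-being across everyone affected."""
    return sum(state["utilities"])

def violates_rights(state: dict, action: str) -> bool:
    """Deontological term: a hard predicate, not a quantity to trade off."""
    return action in state["forbidden_actions"]

def reward(state: dict) -> float:
    # Consequentialist theories slot straight into RL's reward signal...
    return welfare(state)

def permitted_actions(state: dict, actions: list[str]) -> list[str]:
    # ...whereas rights are enforced outside the reward, by filtering the
    # action space. A large penalty inside reward() would merely make a
    # violation expensive, not impermissible.
    return [a for a in actions if not violates_rights(state, a)]

state = {"utilities": [3.0, 4.0, 1.5], "forbidden_actions": {"deceive"}}
print(reward(state))                                            # 8.5
print(permitted_actions(state, ["assist", "wait", "deceive"]))  # ['assist', 'wait']
```

The point of the sketch is the asymmetry: a penalty term would only make violating a right costly, whereas a genuine right is meant to be inviolable.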

The Attraction of a Normative Cleanse

Deep learning largely eliminated the need for engineers to hand-craft features or meticulously design models. Instead, deep learning models learn the necessary features on their own, figuring out which parts of the data are most relevant to the task, with little human intervention. I call this hands-off approach to problem-solving a ‘normative cleanse.’ By cleansing the problem-solving process of human beliefs about how the problem ought to be solved, the AI model can invent its own methods. And it turns out that, given enough data, AI models can be very good at this. The attraction of this paradigm is that, in principle, no human bias is involved.
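As a toy illustration of the contrast, here is a sketch assuming scikit-learn and synthetic data of my own invention (nothing here comes from the paper). The first classifier leans on features a human chose; the second learns its own internal representation from the raw inputs:

```python
# Hypothetical contrast between hand-crafted features and learned features.
# The task and feature choices are illustrative assumptions, not a benchmark.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                       # raw inputs
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # nonlinear target

# Classical pipeline: a human decides which features matter.
def handcrafted_features(X: np.ndarray) -> np.ndarray:
    return np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2])  # our prior belief

clf_manual = LogisticRegression().fit(handcrafted_features(X), y)

# Deep learning pipeline: the network learns its own internal features
# directly from the raw inputs, with no feature engineering.
clf_learned = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                            random_state=0).fit(X, y)

print(clf_manual.score(handcrafted_features(X), y))  # good, if our belief holds
print(clf_learned.score(X, y))                       # good, with no belief at all
```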

Perhaps a similar approach could be used for aligning AI. The idea would be to allow the AI to ‘discover’ morality on its own. For example, using methods like inverse reinforcement learning (IRL), an AI could learn a reward function by observing the behavior of human experts or large populations. The AI could then be trained to act based on the values it infers from that data.
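Below is a heavily simplified, hypothetical skeleton in the spirit of feature-matching IRL (along the lines of Abbeel and Ng’s apprenticeship learning); the features, demonstrations, and the random-policy stand-in for a real policy-optimization step are all illustrative assumptions of mine:

```python
# Skeleton of feature-matching IRL: infer reward weights w such that
# R(s) = w @ phi(s) explains the expert's demonstrated behavior.
# All data and the stubbed policy step below are illustrative assumptions.
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.99):
    """Average discounted feature counts over a set of trajectories."""
    mus = [sum(gamma ** t * phi(s) for t, s in enumerate(traj))
           for traj in trajectories]
    return np.mean(mus, axis=0)

def irl_step(w, mu_expert, mu_policy, lr=0.1):
    """Raise reward on features the expert visits more than our policy does."""
    return w + lr * (mu_expert - mu_policy)

phi = lambda s: np.asarray(s, dtype=float)   # toy features: the state itself
expert_trajs = [[(1.0, 0.0), (1.0, 0.2)], [(0.9, 0.1), (1.0, 0.0)]]

w = np.zeros(2)
mu_expert = feature_expectations(expert_trajs, phi)
for _ in range(50):
    # A real loop would solve for the optimal policy under the current
    # reward and roll it out; a random policy stands in here.
    random_trajs = [[tuple(np.random.rand(2)) for _ in range(2)]]
    mu_policy = feature_expectations(random_trajs, phi)
    w = irl_step(w, mu_expert, mu_policy)

print("learned reward weights:", w)  # weights tilt toward expert-favored features
```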

However, there are clear limits to a normative cleansing approach. A normative cleanse for aligning AI could never cleanse all normativity, because we always need baseline judgments about how AI ought to behave. As Gabriel points out, such an approach raises unavoidable questions: “Who is the moral expert from which AI should learn?” and “From what data should AI extract its conception of values, and how should this be decided?” The approach cannot escape the distinction between facts and values; simply observing what people do does not tell us what they ought to do.

A Values-Based Approach

Given the shortcomings of a purely technical or observational approach, Gabriel argues for a more direct engagement with the normative dimension. He evaluates a hierarchy of possible alignment targets, beginning with simple instructions. This is quickly dismissed due to the risk of “excessive literalism,” famously illustrated by the myth of King Midas, who got exactly what he asked for, not what he truly wanted. Moving up the ladder, aligning with a user’s intentions is better, but intentions can be misinformed, irrational, or unethical. Aligning with revealed preferences, or what a person’s behavior shows they prefer, is also deeply flawed. People may have adaptive preferences born from deprivation, or malicious preferences that aim to harm others. These difficulties lead Gabriel to conclude that the most robust goal for alignment is a set of guiding values or moral principles. A values-based approach is superior because it can integrate other goals into a coherent structure. For example, a principle can grant freedom of action (respecting intentions) up to the point where it causes harm to others, a structure seen in Mill’s Harm Principle and Asimov’s Laws of Robotics. This approach can also incorporate complex considerations like justice, rights, and the moral claims of future generations or the environment, which are overlooked by simpler alignment targets.

The central challenge, then, is political rather than metaphysical. As Gabriel puts it:

[A]s the philosopher Rawls notes, human beings hold a variety of reasonable but contrasting beliefs about value. What follows from the ‘fact of reasonable pluralism’ is that even if we strongly believe we have discovered the truth about morality, it remains unlikely that we could persuade other people of this truth using evidence and reason alone (Rawls 1999, 11–16). There would still be principled disagreement about how best to live. Designing AI in accordance with a single moral doctrine would, therefore, involve imposing a set of values and judgments on other people who did not agree with them. For powerful technologies, this quest to encode the true morality could ultimately lead to forms of domination.

The goal is not to identify the one “true” moral theory and encode it into machines, but to identify fair principles for alignment that can receive widespread endorsement despite disagreement. Gabriel proposes three potential methods drawn from political theory:

  1. Global Overlapping Consensus: This approach seeks to identify principles that are already supported by a consensus across different cultures and belief systems. The doctrine of universal human rights is a prime example, as it finds justification in numerous Western, Islamic, African, and Confucian traditions, forming a basis for a global public morality. Aligning AI with human rights could thus avoid the problem of imposing one group’s values on another. However, this approach faces challenges, as the current “global convergence” on AI principles like fairness and transparency may be superficial and geographically biased, largely reflecting the views of actors in economically developed countries.
  2. Hypothetical Agreement: A second method uses the “veil of ignorance” thought experiment. We should ask what principles people would choose to govern AI if they did not know their own social position, wealth, or moral beliefs. Stripped of this knowledge, parties are forced to choose principles that are fair to all. Gabriel suggests that from behind the veil, people would likely endorse principles ensuring safety, opportunities for human control (to protect autonomy), and distributive justice, perhaps requiring that advanced AI work to benefit the least well-off in society (a maximin rule, sketched after this list).
  3. Social Choice Theory: This final approach focuses not on finding agreed-upon principles but on fairly aggregating individual views to make a collective judgment. This could take a “democratic” form, where citizens vote on or deliberate over a high-level “constitution” for AI that defines its fundamental goals and constraints. This process could confer legitimacy on the chosen principles, showing they have been actively endorsed by those who will be affected by the technology; a simple vote-aggregation rule of this kind is also sketched below.
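As a toy illustration of the last two proposals, here is a hypothetical sketch; the policies, welfare numbers, and candidate “constitutions” are invented for illustration. A maximin rule picks the option whose worst-off group fares best (the veil-of-ignorance intuition), and a Borda count aggregates ranked ballots into a collective choice:

```python
# Hypothetical sketch of two decision rules; all data is invented.
from collections import defaultdict

# Hypothetical agreement: a Rawlsian maximin rule. Each candidate policy
# yields a welfare level per social group; behind the veil of ignorance,
# choose the policy whose worst-off group fares best.
welfare = {
    "policy_a": [9, 7, 1],   # high average, bad floor
    "policy_b": [5, 5, 4],   # lower average, better floor
}
maximin_choice = max(welfare, key=lambda p: min(welfare[p]))

# Social choice: a Borda count over ranked "constitutions" for AI.
# A candidate earns (n - 1 - position) points per ballot.
ballots = [
    ["safety_first", "autonomy_first", "welfare_first"],
    ["autonomy_first", "safety_first", "welfare_first"],
    ["safety_first", "welfare_first", "autonomy_first"],
]
scores: dict[str, int] = defaultdict(int)
for ranking in ballots:
    for position, candidate in enumerate(ranking):
        scores[candidate] += len(ranking) - 1 - position

print("maximin choice:", maximin_choice)                     # policy_b
print("Borda winner:", max(scores, key=scores.__getitem__))  # safety_first
```

Neither rule settles Gabriel’s question by itself, but both show how a normative proposal can be made procedurally concrete.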

Ultimately, Gabriel argues that solving the alignment problem requires a combined research agenda where technical and normative questions are addressed together. The path forward is not to find a single, final answer, but to design fair, inclusive, and stable processes through which humanity can collectively decide what values its most powerful creations ought to embody.

References: Gabriel, I. Artificial Intelligence, Values, and Alignment. Minds & Machines 30, 411–437 (2020).