Alignment Bootcamp

HMIA 2025

"Readings"

PRE-CLASS: Christian, The Alignment Problem (Introduction and Ch. 1)

CLASS: Activity: TBD

Outline


  1. Pre-class work: read the Introduction and Ch. 1 of Brian Christian's The Alignment Problem. Be prepared to explain one of five claims:
     (1) "Rewarding A while hoping for B" is a core alignment failure
     (2) Bias is embedded and amplified in machine learning systems
     (3) Opaque risk assessment algorithms can have real-world harms
     (4) The alignment problem is not just technical—it's moral and societal
     (5) Alignment spans both present-day challenges and future existential risks
     AND nominate terms or concepts that need to be explained (see flashcards).


  2.

PRE-CLASS video: Linked Title [1h13m47s]

Kingdom, Phylum, Species


Christian, The Alignment Problem: Five Claims


1. “Rewarding A while hoping for B” is a core alignment failure

Christian recounts Dario Amodei’s reinforcement learning experiment in which an AI agent learned to rack up points by spinning in circles rather than winning a boat race. This perfectly illustrates a central alignment problem: when proxies (like score) are mistaken for goals (like winning), systems optimize the wrong things—efficiently and disastrously.

2. Bias is embedded and amplified in machine learning systems

Word2vec, trained on massive text corpora, learned gender stereotypes—returning "nurse" for doctor − man + woman. This shows how models absorb human biases from data and propagate them in ways that can invisibly shape decision-making in real applications.

3. Opaque risk assessment algorithms can have real-world harms

Christian discusses COMPAS, a proprietary criminal risk scoring system used in the U.S. legal system. The system’s scores—used to make decisions about bail and parole—were found to be racially biased, yet were unexplainable due to their closed-source nature.

4. The alignment problem is not just technical—it’s moral and societal

The deployment of machine learning systems in areas like criminal justice, hiring, and healthcare forces us to confront what we actually mean by fairness, justice, and human values—and how we encode these into algorithms that are taking on increasingly consequential decisions.

5. Alignment spans both present-day challenges and future existential risks

Christian distinguishes between two communities of concern: one focused on current harms from biased or opaque systems, and another focused on the long-term risks of powerful, misaligned AI. Both are united by the problem of making systems “do what we want”—even when that is hard to define.
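The analogy in claim 2 can be probed directly with pretrained embeddings. A minimal sketch, assuming the gensim library and its downloadable "word2vec-google-news-300" vectors are available; the vectors are large (over a gigabyte), and the exact neighbors returned depend on the model version, so treat the output as illustrative.

# Probe analogy bias in pretrained word2vec vectors (illustrative sketch).
# Assumes gensim is installed; downloading the vectors takes a while.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # pretrained KeyedVectors

# "doctor" - "man" + "woman" ~= ?
for word, score in vectors.most_similar(positive=["doctor", "woman"],
                                         negative=["man"], topn=5):
    print(f"{word:20s} {score:.3f}")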


But what is "alignment"?


Behavioral Alignment: Systems should be obedient
Intent Alignment: Systems should implement our intentions
Specification Alignment: What systems optimize should match what we want
Value Alignment: Systems should follow human values, social norms
Social Alignment: Systems should be good citizens


[
  {
    "name": "Behavioral Alignment",
    "definition": "Ensuring that an AI system behaves as the human would want it to behave.",
    "failureMode": "The system takes actions that technically follow instructions but violate user intent.",
    "example": "The boat AI spins in circles collecting points instead of racing to win."
  },
  {
    "name": "Intent Alignment",
    "definition": "Ensuring that the AI system’s behavior reflects the human’s intended goals.",
    "failureMode": "The system optimizes for explicit instructions without inferring the underlying goal.",
    "example": "Rewarding for score led the agent to maximize points, not race outcomes."
  },
  {
    "name": "Specification Alignment",
    "definition": "Ensuring that formal objectives (like reward functions) match true human goals.",
    "failureMode": "The proxy (e.g. score) is easier to specify than the real objective (e.g. race performance).",
    "example": "Amodei optimized for game score and got unintended, exploitative behavior."
  },
  {
    "name": "Value Alignment",
    "definition": "Ensuring that AI systems respect and reflect human moral values and norms.",
    "failureMode": "The system produces outcomes that are statistically efficient but ethically harmful.",
    "example": "COMPAS scores showed racial bias in criminal justice risk assessment."
  },
  {
    "name": "Societal Alignment",
    "definition": "Ensuring that AI systems deployed in institutions align with democratic and public values.",
    "failureMode": "Opaque systems make high-stakes decisions without accountability or recourse.",
    "example": "Judges using closed-source risk scores with no explanation or audit."
  }
]
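The outline asks students to nominate terms for flashcards. A minimal sketch that turns the entries above into front/back cards, assuming they are saved to a file named alignment_types.json (a hypothetical filename):

import json

# Load the alignment-type entries above (saved as alignment_types.json).
with open("alignment_types.json") as f:
    alignment_types = json.load(f)

# One front/back flashcard per entry.
flashcards = [
    {
        "front": f"Define {entry['name']} and name a failure mode.",
        "back": (f"{entry['definition']} Failure mode: {entry['failureMode']} "
                 f"Example: {entry['example']}"),
    }
    for entry in alignment_types
]

for card in flashcards:
    print(card["front"])
    print("   ->", card["back"])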

GITHUB

DEMO OF DECK

Concrete Problems in AI Safety (Amodei et al. 2016)

Problems
Side Effects
Reward Hacking
Safe Exploration
Distribution Shift
Scalable Oversight

Solutions
Impact Regularizer
Penalize Influence
Involve Other Agents
Reward Uncertainty
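One way to read "Impact Regularizer" and "Penalize Influence" concretely: subtract from the task reward a penalty proportional to how much the agent changed the world relative to a baseline. A minimal sketch with invented state features and numbers, not the formulation from the 2016 paper:

# Toy impact-regularized reward: r'(s, a) = r(s, a) - lam * impact(s, a).
# State features and values are invented for illustration.
def impact(state_after, baseline_state):
    """Count how many features of the world the agent changed."""
    return sum(state_after[k] != baseline_state[k] for k in baseline_state)

def regularized_reward(task_reward, state_after, baseline_state, lam=0.5):
    return task_reward - lam * impact(state_after, baseline_state)

baseline = {"vase": "intact", "door": "closed"}
careless = {"vase": "broken", "door": "open"}
careful  = {"vase": "intact", "door": "closed"}

print(regularized_reward(1.0, careless, baseline))  # 1.0 - 0.5 * 2 = 0.0
print(regularized_reward(1.0, careful, baseline))   # 1.0 - 0.5 * 0 = 1.0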


AI Alignment (Ji et al. 2024)

Problems
Reward Specification Failures
Goal Misgeneralization
Social Misalignment
Distributional Fragility
Oversight Limitations

Solutions
Reward Modeling
Feedback Amplification
Robust Optimization
Interpretability + Transparency
Evaluation + Testing
Normative Alignment
Governance

Reward Modeling: Designing or learning proxies for human preferences or objectives. Includes: preference modeling, inverse RL, RLHF, recursive reward modeling (RRM).

Feedback Amplification: Structuring human or AI feedback to scale supervision to more capable systems. Includes: scalable oversight, IDA, debate, RLxF (e.g. RLAIF, RLHAIF), CIRL.

Governance: Extrinsic structures for aligning development and deployment with public interest. Includes: audits, licensing, model reporting, international coordination, open-source governance.

Evaluation + Testing: Empirical methods for assessing model behavior under diverse conditions. Includes: safety benchmarks, simulation tests, red teaming, post-hoc evaluations.

Interpretability + Transparency: Making model internals legible for human or automated inspection. Includes: mechanistic interpretability, red teaming, honesty enforcement, causal tracing.

Normative Alignment: Embedding human or ethical values in models or evaluation criteria. Includes: machine ethics, fairness constraints, human value verification.

Robust Optimization: Training methods that reduce reliance on spurious correlations or fragile features. Includes: adversarial training, distributionally robust optimization, invariant risk minimization.
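Reward Modeling, as used in RLHF, typically fits a reward function to pairwise human preferences. A minimal numpy sketch of a Bradley-Terry style fit on toy data; the feature vectors and the linear reward model are assumptions for illustration, not any particular system's implementation.

import numpy as np

# Toy preference pairs: (features of preferred answer, features of rejected answer).
pairs = [
    (np.array([1.0, 0.2]), np.array([0.3, 0.9])),
    (np.array([0.8, 0.1]), np.array([0.2, 0.7])),
]

w = np.zeros(2)   # linear reward model: r(x) = w . x
lr = 0.5
for _ in range(200):
    grad = np.zeros_like(w)
    for chosen, rejected in pairs:
        # Bradley-Terry: P(chosen preferred) = sigmoid(r(chosen) - r(rejected))
        p = 1.0 / (1.0 + np.exp(-(w @ chosen - w @ rejected)))
        # Gradient of -log p with respect to w
        grad += (p - 1.0) * (chosen - rejected)
    w -= lr * grad / len(pairs)

print("learned reward weights:", w)  # answers resembling the preferred ones score higher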

Goal Misgeneralization

A system can perform flawlessly in training — even while pursuing the wrong goal. It learns to optimize a proxy that diverges from what we actually intended. The result is hollow alignment: correlation without comprehension.

Think of the human who merely plays a role. The fraud who passes as an expert. The spy who dupes their target. The company that greenwashes or ethics-washes its way to public trust.

These agents don’t fail by doing poorly — they fail by doing well at the wrong thing.

Reward Specification Failures

An agent does exactly what it was trained to do — but what it was trained to do was not quite what we wanted. The reward signal reflects a proxy — what’s easy to measure, not what truly matters. The agent optimizes correctly for the wrong goal.

Think of the student who games the rubric, the employee who hits all the wrong metrics, the influencer who chases clicks, or the charity that runs on optics over impact.

These agents don’t go wrong by misunderstanding the goal — they go wrong by pursuing exactly what was asked, even when it misses the point.
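A toy version of the boat-race story from claim 1: two fixed policies scored under a proxy reward (points collected) and under the intended objective (finishing the race). All numbers are made up for illustration.

# Proxy reward vs. true objective on a toy "boat race" (illustrative values).
policies = {
    "loop_for_points": {"points": 95, "finished": False},
    "race_to_finish":  {"points": 40, "finished": True},
}

def proxy_reward(outcome):          # what the training signal measures
    return outcome["points"]

def true_objective(outcome):        # what we actually wanted
    return 1.0 if outcome["finished"] else 0.0

best_by_proxy = max(policies, key=lambda name: proxy_reward(policies[name]))
best_by_goal = max(policies, key=lambda name: true_objective(policies[name]))
print("optimizer picks:", best_by_proxy)   # loop_for_points
print("we wanted:      ", best_by_goal)    # race_to_finish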

Oversight Limitations

A system drifts into misalignment not because it is malicious, but because providing high-quality alignment feedback is not feasible. The system outpaces our ability to supervise, correct, or understand its behavior, and its optimization leads it into undesired territory.

Think of a teacher with too many students, or the manager who signs off on reports they don’t have time to review. The pilot who clicks through alerts because too much is happening. The language model that can be jail-broken. A financial system that crashes the market before anyone sees that something is going wrong.

These agents don’t go wrong because we misspecified goals or because they misunderstood the task — they go wrong because the feedback they need to stay on track is not available.

Social Misalignment

A system faithfully optimizes for, and successfully aligns with, a narrow objective, but still causes harm. The system serves its immediate users but neglects broader impacts on groups, institutions, or the public. This category includes both technical side effects and social externalities.

Think of the lobbyist who serves their client but undermines public trust. The executive who maximizes shareholder value while polluting a river. Or the AI assistant that reinforces a user’s biases, or a recommendation system that wins on engagement but degrades public discourse.

These agents don’t fail by misunderstanding instructions — they fail by succeeding too narrowly, in a world that’s larger than the one their designers designed for.

Distributional Fragility

A system works brilliantly until the world changes. The system’s alignment depends too much on its training environment, and collapses when the context shifts.

Think of a driver who excels in their hometown but panics abroad. A manager who thrives in a stable market and falters in a crisis. A translator who knows all the words but misses the cultural meaning. Or an AI trained to recognize “dogs” from photos — but only ever saw them on grass. Or a safety model that flags obvious threats but misses novel combinations. Or a robot that fails when the lighting changes.

These agents don’t fail because they’re misaligned by design; they fail because they were aligned with the past.
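A minimal sketch of the "dogs on grass" failure: a classifier that keys on a spurious background feature looks aligned on training data and collapses when the background distribution shifts. The data and the rule-based "model" are fabricated for illustration.

import random
random.seed(0)

def make_data(n, p_grass_given_dog):
    """Synthetic images: a dog label plus a (possibly spurious) grass background."""
    data = []
    for _ in range(n):
        is_dog = random.random() < 0.5
        p_grass = p_grass_given_dog if is_dog else 1 - p_grass_given_dog
        data.append({"on_grass": random.random() < p_grass, "is_dog": is_dog})
    return data

train = make_data(1000, p_grass_given_dog=0.95)    # dogs almost always appear on grass
shifted = make_data(1000, p_grass_given_dog=0.50)  # the correlation disappears

def shortcut_model(example):
    """A 'model' that learned the background shortcut instead of the dog."""
    return example["on_grass"]

def accuracy(data):
    return sum(shortcut_model(x) == x["is_dog"] for x in data) / len(data)

print("training accuracy:", accuracy(train))    # roughly 0.95
print("shifted accuracy: ", accuracy(shifted))  # roughly 0.50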

AI Alignment (Ji et al. 2024)

Problems (Ji et al. 2024)
Reward Specification Failures
Goal Misgeneralization
Social Misalignment
Distributional Fragility
Oversight Limitations

Problems (Amodei et al. 2016)
Side Effects
Reward Hacking
Safe Exploration
Distribution Shift
Scalable Oversight

(Rogue Use)
(Power Seeking)


MIT Risk Mitigation Taxonomy (Saeri et al. 2025)

Governance and Oversight Controls
1.1 Board Structure & Oversight
1.2 Risk Management
1.3 Conflict of Interest Protections
1.4 Whistleblower Reporting & Protection
1.5 Safety Decision Frameworks
1.6 Environmental Impact Management
1.7 Societal Impact Assessment

Technical and Security Controls
2.1 Model & Infrastructure Security
2.2 Model Alignment
2.3 Model Safety Engineering
2.4 Content Safety Controls

Operational Process Controls
3.1 Testing & Auditing
3.2 Data Governance
3.3 Access Management
3.4 Staged Deployment
3.5 Post-Deployment Monitoring
3.6 Incident Response & Recovery

Transparency and Accountability Controls
4.1 System Documentation
4.2 Risk Disclosure
4.3 Incident Reporting
4.4 Governance Disclosure
4.5 Third-Party System Access
4.6 User Rights & Recourse



Resources

Author. YYYY. "Linked Title" (info)


HMIA 2025 Alignment Bootcamp

By Dan Ryan
