Alignment Problems

and 
Alignment Solutions

HMIA 2025

Class Title

HMIA 2025

"Readings"

Activity: TBD

PRE-CLASS

CLASS

Outline

HMIA 2025

  1. TBD
  2. TBD

HMIA 2025

PRE-CLASS

Video: Linked Title [3m21s]

HMIA 2025

PRE-CLASS

HMIA 2025

PRE-CLASS

HMIA 2025

PRE-CLASS

Kingdom, Phylum, Species

HMIA 2025

CLASS

HMIA 2025

CLASS

AI Safety (2016)

Side Effects

Reward Hacking

Safe Exploration

Distribution Shift

Scalable Oversight

Problems

Impact Regularizer

Penalize Influence

Involve Other Agents

Reward Uncertainty

Solutions

HMIA 2025

CLASS

AI Alignment (Ji et al. 2024)

Reward Specification Failures

Goal Misgeneralization

Social Misalignment

Distributional Fragility

Oversight Limitations

Problems

Reward Modeling

Feedback Amplification

Robust Optimization

Interpretability + Transparency

Evaluation + Testing

Normative Alignment

Governance

Solutions

Designing or learning proxies for human preferences or objectives. Includes: preference modeling, inverse RL, RLHF, recursive reward modeling (RRM)

Structuring human or AI feedback to scale supervision to more capable systems. Includes: scalable oversight, IDA, debate, RLxF (e.g. RLAIF, RLHAIF), CIRL

Extrinsic structures for aligning development and deployment with public interest. Includes: audits, licensing, model reporting, international coordination, open-source governance

Empirical methods for assessing model behavior under diverse conditions.  Includes: safety benchmarks, simulation tests, red teaming, post-hoc evaluations

Making model internals legible for human or automated inspection.  Includes: mechanistic interpretability, red teaming, honesty enforcement, causal tracing

Embedding human or ethical values in models or evaluation criteria.

Includes: machine ethics, fairness constraints, human value verification

Training methods that reduce reliance on spurious correlations or fragile features. Includes: adversarial training, distributionally robust optimization, invariant risk minimization

A system can perform flawlessly in training — even while pursuing the wrong goal.

It learns to optimize a proxy that diverges from what we actually intended. The result is hollow alignment: correlation without comprehension.

Think of the human who merely plays a role. The fraud who passes as an expert. The spy who dupes their target. The company that greenwashes or ethics-washes its way to public trust.

These agents don’t fail by doing poorly — they fail by doing well at the wrong thing.

An agent does exactly what it was trained to do — but what it was trained to do was not quite what we wanted. The reward signal reflects a proxy — what’s easy to measure, not what truly matters. The agent optimizes correctly for the wrong goal.

Think of the student who games the rubric, the employee who hits all the wrong metrics, the influencer who chases clicks, or the charity that runs on optics over impact.

These agents don’t go wrong by misunderstanding the goal — they go wrong by pursuing exactly what was asked, even when it misses the point.

A system drifts into misalignment not because malicious, but because providing high-quality alignment feedback is not feasible.  The system outpaces our ability to supervise, correct, or understand its behavior and its optimization leads it into undesired territory.

Think of a teacher with too many students or the manager who signs off on reports they don’t have time to review. The pilot who clicks through alerts because too much is happening. The language model that can be jail-broken. A financial system that crashes the market before anyone sees something was going wrong.

These agents don’t go wrong because we misspecified goals or because they misunderstood the task — they go wrong because the feedback they need to stay on track is not available.

A system faithfully optimizes for, and successfully aligns with, a narrow objective, but still causes harm. The system serves its immediate users but neglect broader impacts on groups, institutions, or the public. This category includes both technical side effects and social externalities.

Think of the lobbyist who serves their client but undermines public trust.The executive who maximizes shareholder value while polluting a river. Or the AI assistant that reinforces a user’s biases or a recommendation system that wins on engagement but degrades public discourse.

These agents don’t fail by misunderstanding instructions — they fail by succeeding too narrowly, in a world that’s larger than their designers designed for.

A works brilliantly until the world changes. The system’s alignment depends too much on its training environment, and collapses when the context shifts.

Think of a driver who excels in their hometown but panics abroad. A manager who thrives in a stable market and falters in a crisis. A translator who knows all the words but misses the cultural meaning. Or an AI trained to recognize “dogs” from photos — but only ever saw them on grass. Or a safety model that flags obvious threats but misses novel combinations. Or a robot that fails when the lighting changes.

These agents don’t fail because they’re misaligned by design, they fail because they were aligned with the past.

HMIA 2025

CLASS

AI Alignment (Ji et al. 2024)

Reward Specification Failures

Goal Misgeneralization

Social Misalignment

Distributional Fragility

Oversight Limitations

Problems

Reward Specification Failures

Goal Misgeneralization

Social Misalignment

Distributional Fragility

Oversight Limitations

Side Effects

Reward Hacking

Safe Exploration

Distribution Shift

Scalable Oversight

(Rogue Use)

(Power Seeking)

HMIA 2025

Resources

Author. YYYY. "Linked Title" (info)

HMIA 2025 Alignment Problems and Alignment Solutions

By Dan Ryan

HMIA 2025 Alignment Problems and Alignment Solutions

  • 8