"Readings"
Amodei et al. 2016. "Concrete Problems in AI Safety"
Activity: TBD
PRE-CLASS
CLASS
Ji et al. 2024. "AI Alignment: A Comprehensive Survey"
PRE-CLASS
Video: Linked Title [3m21s]
PRE-CLASS
CLASS
AI Safety (Amodei et al. 2016)
Side Effects
Reward Hacking
Safe Exploration
Distribution Shift
Scalable Oversight
Problems
Impact Regularizer
Penalize Influence
Involve Other Agents
Reward Uncertainty
Solutions
CLASS
AI Alignment (Ji et al. 2024)
Reward Specification Failures
Goal Misgeneralization
Social Misalignment
Distributional Fragility
Oversight Limitations
Problems
Reward Modeling
Feedback Amplification
Robust Optimization
Interpretability + Transparency
Evaluation + Testing
Normative Alignment
Governance
Solutions
Reward Modeling: Designing or learning proxies for human preferences or objectives. Includes: preference modeling, inverse RL, RLHF, recursive reward modeling (RRM)
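A minimal sketch of the pairwise-preference setup behind RLHF-style reward modeling, assuming a fixed-size response embedding and a Bradley-Terry objective. The network shape, dimensions, and random data are illustrative assumptions, not anything specified by Amodei et al. or Ji et al.:

# Sketch: fit a scalar reward model from pairwise preference comparisons.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a fixed-size response/trajectory embedding to a scalar reward."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(rm: RewardModel, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push r(preferred) above r(rejected)."""
    return -torch.nn.functional.logsigmoid(rm(preferred) - rm(rejected)).mean()

# Toy training step on random embeddings standing in for labeled comparisons.
rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
preferred, rejected = torch.randn(32, 128), torch.randn(32, 128)
loss = preference_loss(rm, preferred, rejected)
opt.zero_grad()
loss.backward()
opt.step()

In a full RLHF pipeline, the fitted reward model would then supply the training signal for policy optimization.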
Feedback Amplification: Structuring human or AI feedback to scale supervision to more capable systems. Includes: scalable oversight, IDA, debate, RLxF (e.g. RLAIF, RLHAIF), CIRL
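To make the "AI feedback" idea concrete, here is a toy sketch of RLAIF-style labeling: an automated judge converts response pairs into preference records that could train a reward model like the one above. The heuristic judge and example data are purely hypothetical stand-ins for a real critic model guided by a rubric or constitution:

# Sketch: generate preference labels with an automated judge instead of humans.
import random

def judge(prompt: str, response_a: str, response_b: str) -> int:
    """Stand-in critic: prefer the response that mentions the prompt topic.
    A real system would query a stronger model with an explicit rubric."""
    topic = prompt.split()[-1].strip("?.").lower()
    score_a = topic in response_a.lower()
    score_b = topic in response_b.lower()
    if score_a == score_b:
        return random.choice([0, 1])  # tie-break arbitrarily
    return 0 if score_a else 1

def label_pairs(pairs):
    """Turn (prompt, response_a, response_b) triples into preference records."""
    dataset = []
    for prompt, a, b in pairs:
        winner = judge(prompt, a, b)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset

pairs = [("Explain reward hacking", "Reward hacking is gaming the metric.", "The weather is nice.")]
print(label_pairs(pairs))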
Governance: Extrinsic structures for aligning development and deployment with public interest. Includes: audits, licensing, model reporting, international coordination, open-source governance
Evaluation + Testing: Empirical methods for assessing model behavior under diverse conditions. Includes: safety benchmarks, simulation tests, red teaming, post-hoc evaluations
Interpretability + Transparency: Making model internals legible for human or automated inspection. Includes: mechanistic interpretability, red teaming, honesty enforcement, causal tracing
Normative Alignment: Embedding human or ethical values in models or evaluation criteria. Includes: machine ethics, fairness constraints, human value verification
Robust Optimization: Training methods that reduce reliance on spurious correlations or fragile features. Includes: adversarial training, distributionally robust optimization, invariant risk minimization
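Among the techniques listed for robust optimization, adversarial training is the most compact to sketch. The toy example below augments one training step with FGSM-perturbed inputs so the model is penalized for leaning on fragile input features; the architecture, epsilon, and random data are assumptions for illustration only:

# Sketch: one adversarial-training step using the fast gradient sign method (FGSM).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def fgsm_perturb(x: torch.Tensor, y: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Craft a worst-case perturbation of x inside an L-infinity ball of radius eps."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

# Train on clean + adversarial examples (toy random data).
x, y = torch.randn(64, 20), torch.randint(0, 2, (64,))
x_adv = fgsm_perturb(x, y)
loss = loss_fn(model(x), y) + loss_fn(model(x_adv), y)
opt.zero_grad()
loss.backward()
opt.step()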
Goal Misgeneralization: A system can perform flawlessly in training — even while pursuing the wrong goal.
It learns to optimize a proxy that diverges from what we actually intended. The result is hollow alignment: correlation without comprehension.
Think of the human who merely plays a role. The fraud who passes as an expert. The spy who dupes their target. The company that greenwashes or ethics-washes its way to public trust.
These agents don’t fail by doing poorly — they fail by doing well at the wrong thing.
Reward Specification Failures: An agent does exactly what it was trained to do — but what it was trained to do was not quite what we wanted. The reward signal reflects a proxy — what’s easy to measure, not what truly matters. The agent optimizes correctly for the wrong goal.
Think of the student who games the rubric, the employee who hits all the wrong metrics, the influencer who chases clicks, or the charity that runs on optics over impact.
These agents don’t go wrong by misunderstanding the goal — they go wrong by pursuing exactly what was asked, even when it misses the point.
Oversight Limitations: A system drifts into misalignment not because it is malicious, but because providing high-quality alignment feedback is not feasible. The system outpaces our ability to supervise, correct, or understand its behavior, and its optimization leads it into undesired territory.
Think of a teacher with too many students, or a manager who signs off on reports they don’t have time to review. The pilot who clicks through alerts because too much is happening. The language model that can be jailbroken. A financial system that crashes the market before anyone notices something is going wrong.
These agents don’t go wrong because we misspecified goals or because they misunderstood the task — they go wrong because the feedback they need to stay on track is not available.
Social Misalignment: A system faithfully optimizes for, and successfully aligns with, a narrow objective, but still causes harm. The system serves its immediate users but neglects broader impacts on groups, institutions, or the public. This category includes both technical side effects and social externalities.
Think of the lobbyist who serves their client but undermines public trust. The executive who maximizes shareholder value while polluting a river. Or the AI assistant that reinforces a user’s biases, or a recommendation system that wins on engagement but degrades public discourse.
These agents don’t fail by misunderstanding instructions — they fail by succeeding too narrowly, in a world larger than the one their designers planned for.
Distributional Fragility: A system works brilliantly until the world changes. The system’s alignment depends too much on its training environment and collapses when the context shifts.
Think of a driver who excels in their hometown but panics abroad. A manager who thrives in a stable market and falters in a crisis. A translator who knows all the words but misses the cultural meaning. Or an AI trained to recognize “dogs” in photos that only ever showed them on grass. Or a safety model that flags obvious threats but misses novel combinations. Or a robot that fails when the lighting changes.
These agents don’t fail because they’re misaligned by design; they fail because they were aligned with the past.
CLASS
AI Alignment (Ji et al. 2024)
Reward Specification Failures
Goal Misgeneralization
Social Misalignment
Distributional Fragility
Oversight Limitations
Problems
Side Effects
Reward Hacking
Safe Exploration
Distribution Shift
Scalable Oversight
(Rogue Use)
(Power Seeking)
Resources
Author. YYYY. "Linked Title" (info)