Deterrence
HMIA 2025

"Readings"
Reading George Herbert Mead The Generalized Other
Activity: TBD
In a community of agents, some measure of alignment can be achieved if agents respond to misaligned behavior. Responses range from arresting the behavior in the individual instance (restraining the agent causing the disruption) to, over time, responses so consistent and certain that agents learn them and let them condition their choice of actions. We call that deterrence.
Deterrence assumes that agents will sometimes cause harm, fail in their roles, or break rules, but that alignment can be facilitated by predictable responses to these failures. Mechanisms of liability, malpractice, and sanctioning rule violations reduce the impunity attached to misalignment.
Outline
- Under what conditions must one say "excuse me" in a social situation?
- What are some phrases that you might use to accuse a friend of malpractice?
- Consider: "Machine systems operating in shared or high-stakes environments must follow formal policy constraints. Rule enforcement may include disabling behaviors, limiting capabilities, or escalating violations for review. Arresting rule-breaking can restore alignment, but punishment-based deterrence, per se, may be problematic in the machine realm since systems lack intentionality and may respond unpredictably to externally imposed penalties." We can see this in reward functions, but is "deterrence" what we are seeing in RLHF? It shares surface-level similarities but is missing some of the internals. (See the enforcement sketch after this outline.)
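A minimal sketch of the "rule enforcement" side of this question: an agent's proposed action is checked against formal policy constraints, and the system responds by allowing, limiting, blocking, or escalating for review. The rule names, severity levels, and thresholds here are illustrative assumptions, not drawn from any particular system in the readings.

```python
# Hypothetical sketch: rule enforcement for a machine agent in a shared environment.
# Policy names, actions, and severity thresholds are assumptions for illustration.

from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable


class Response(Enum):
    ALLOW = auto()
    LIMIT = auto()       # permit, but with reduced capability
    BLOCK = auto()       # disable the behavior outright
    ESCALATE = auto()    # log and hand off for review


@dataclass
class PolicyRule:
    name: str
    violates: Callable[[dict], bool]   # action -> True if the rule is broken
    severity: int                      # 1 = minor, 2 = serious, 3 = severe


def enforce(action: dict, rules: list[PolicyRule]) -> Response:
    """Check an action against formal policy constraints and choose a response."""
    worst = max((r.severity for r in rules if r.violates(action)), default=0)
    if worst >= 3:
        return Response.ESCALATE
    if worst == 2:
        return Response.BLOCK
    if worst == 1:
        return Response.LIMIT
    return Response.ALLOW


# Example: a shared-environment rule against exceeding a speed cap.
rules = [PolicyRule("speed_cap", lambda a: a.get("speed", 0) > 30, severity=2)]
print(enforce({"speed": 45}, rules))   # Response.BLOCK
print(enforce({"speed": 10}, rules))   # Response.ALLOW
```

Note that nothing here "deters" anything: the system simply arrests the rule-breaking action, which is part of what the outline question asks us to distinguish.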
PRE-CLASS
Video: Linked Title [3m21s]
CLASS
So where do we stand on a dog that learns not to bark because it is hit whenever it barks? Is that deterrence?
Brings up functional, cognitive, and moral dimensions of deterrence.
For behaviorism and control theory: operant conditioning via punishment. Functional deterrence.
But no concept of rule violation or ethical boundary.
For humans, deterrence means the behavior changes, the agent knows why, and the agent anticipates punishment. The dog has an associative link between barking and the anticipated punishment. Does an RLHF-trained system?
Is the dog deterred, or just conditioned? What’s the difference — and does it matter for how we train AI systems?
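A minimal sketch of the "just conditioned" reading, under assumed parameters (the action names, rewards, and learning rates are illustrative): a simple value-learning agent stops "barking" purely because barking is reliably penalized. There is no representation of a rule, a reason, or an ethical boundary, only value estimates shaped by punishment, which is the functional deterrence the behaviorist reading describes.

```python
# Assumed setup for illustration: a two-action agent that learns to stop "barking"
# because barking is penalized. It has no concept of rule violation -- only values.

import random

ACTIONS = ["bark", "stay_quiet"]
values = {a: 0.0 for a in ACTIONS}   # learned value estimates per action
alpha, epsilon = 0.1, 0.1            # learning rate, exploration rate


def reward(action: str) -> float:
    # Barking brings a small gain (1.0) but a reliable punishment (-3.0); quiet is neutral.
    return (1.0 - 3.0) if action == "bark" else 0.0


random.seed(0)
for step in range(500):
    if random.random() < epsilon:
        action = random.choice(ACTIONS)            # occasional exploration
    else:
        action = max(ACTIONS, key=values.get)      # otherwise act greedily
    values[action] += alpha * (reward(action) - values[action])

print(values)   # "bark" ends up with a negative value; the agent stops choosing it
```

Whether this counts as deterrence, or only as conditioning, is exactly the question above.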
Resources
Author. YYYY. "Linked Title" (info)
HMIA 2025 Deterrence
By Dan Ryan