Earlier this spring, I had the chance to sit down in person with Professor Gabriel Weil here in New York to discuss his proposal for mitigating catastrophic risk from artificial intelligence. Professor Weil's proposal involves instituting a new punitive damages framework, which would expose AI companies to heightened liability in near-miss scenarios, where an AI-generated harm was limited in its impact but could have been catastrophic.
Much of our discussion comes from Professor Weil's paper, “Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence.” Professor Weil is a Professor of Law at Touro University, and his work is now partially funded by Open Philanthropy. We start by discussing the definition of harmful AI activity before walking through a case study to demonstrate how the proposal would work in practice. We also contrast Professor Weil's proposal with the current state of the law and talk about some criticisms he's received and his responses to them. I thought it was a fascinating conversation, and I think you will, too.
If you enjoy this episode, be sure to follow Professor Weil on Twitter/X. The proposal we discussed in this conversation has also been covered in accessible formats here:
“Can the courts save us from dangerous AI?” in Vox’s Future Perfect.
“Tort Law Can Play an Important Role in Mitigating AI Risk” in the Effective Altruism Forum.
“How Technical AI Safety Researchers Can Help Implement Punitive Damages to Mitigate Catastrophic AI Risk” in the Effective Altruism Forum.
Professor Gabriel Weil, welcome to the podcast.
It's great to be here.
I wanted to start by giving a high-level overview of your proposal as I understood it from your paper. So this will be a test of my comprehension, and you can correct whatever I get wrong.
Sure, sounds good.
So the way I viewed your proposal, and the framework that we'll be talking about throughout this conversation, is that it's a proposal for handling harmful AI activities.
And there are a handful of criteria that have to be met.
The harmful AI activity has to generate a negative externality. That means the cost to society of the harmful AI activity is not fully borne by the AI company itself. And, you know, negative externalities are typically considered market failures, because the harmful activity will be overproduced: there's insufficient incentive for the producer to do less of it. So that's the first criterion.
Second, the harmful AI activity has to have catastrophic potential. So if that same activity were scaled up to a national or global level, it could lead to catastrophic outcomes. Or, as you put it in your paper, the harmful AI activity is correlated with catastrophic outcomes.
Third, the harm has to be non-compensable. Because the harmful AI activity would be catastrophic if it were, you know, scaled up and realized to its full potential, there aren't enough financial resources to make the harmed parties whole. In the limiting case, it's something like human extinction, or close to it, and there are no courts, no parties left to be sued, and so on.
Fourth, and related to number three: because it's non-compensable, it's also not insurable. Since the activity could be catastrophic if it were scaled up or the harm were fully realized, there wouldn't be enough financial resources to pay out all of the insurance claims, and an AI company potentially wouldn't be able to get a policy for this kind of activity in the first place.
And then, not a criterion but more of a note: you say in your paper that the harmful AI activity will likely be caused by misalignment rather than a capabilities failure. So react to that, tell me what I got wrong, and make any corrections.
Sure. So I think that's broadly on the right track. I just want to make a few distinctions. The harm that you would be suing over wouldn't be catastrophic, and it wouldn't be uninsurable. There would be some practically compensable harm that is correlated or associated with the uninsurable risk. We can't hold you liable if you actually cause the catastrophic harm, because, you know, in the limiting case, we're all dead; no one's around to sue or be sued. Or, short of that, it's a financially uninsurable risk: it would bankrupt the company to try to pay out a damages award, or maybe, in an intermediate case, the legal system is no longer functioning. And so the idea is to try to pull forward that expected liability into the practically compensable cases that are associated with that risk. And so you would have some harm that actually occurs that is insurable.
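[Editor's note: here is a rough, back-of-the-envelope sketch of the "pull forward" idea described above. The numbers and the simple scaling rule are our own illustration, not a formula taken from Professor Weil's paper.]

```python
# Illustrative only: how expected liability from an uninsurable risk might be
# "pulled forward" into a practically compensable near-miss case.
# All probabilities and dollar figures below are assumptions for the example.

p_catastrophe = 1e-6      # assumed chance this deployment causes an uninsurable catastrophe
harm_catastrophe = 1e12   # assumed magnitude of that catastrophe, in dollars
p_near_miss = 1e-2        # assumed chance the same deployment causes a compensable near miss

# Expected uninsurable harm created by the deployment decision that already happened.
expected_uninsurable_harm = p_catastrophe * harm_catastrophe  # $1,000,000

# If liability can only ever be imposed in near-miss cases, scaling the award so that
# expected liability matches expected harm gives:
#   p_near_miss * award == expected_uninsurable_harm
award_per_near_miss = expected_uninsurable_harm / p_near_miss  # $100,000,000

print(f"Expected uninsurable harm: ${expected_uninsurable_harm:,.0f}")
print(f"Award needed per near-miss case: ${award_per_near_miss:,.0f}")
```

On these assumed numbers, the award lands in the "hundreds of millions" range Professor Weil mentions below: large, but something a major lab could actually pay.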
The other point I would make is that there are at least two questions you would think about in terms of how you classify different AI systems and different liability regimes. One question is whether there's liability at all, and whether that's assessed under a negligence standard, which means that if the defendant has effectively exercised reasonable care, then they're not liable; or a strict liability standard, where they're liable for the harm they caused, at least if it's foreseeable, even if they did exercise reasonable care. And I think strict liability should apply for frontier AI systems, for, you know, systems that have unpredictable properties or uncontrollable goals, even in the absence of catastrophic risk. I think strict liability is appropriate there, because it's going to be very hard to prove negligence, and you are creating these external risks.
One other point I would clarify on externalities: yes, they're not borne by the producer of the system, but they're also not borne by the sort of second-party customer, right? So they're not internalized to any economic transaction. They're borne by the public or by some third party that doesn't consent to the transaction.
So I guess the idea is that we would have this kind of small harm that occurs, but that harm is almost like a flare or a test case. We want to punish, using punitive measures, that AI activity, because it's correlated with risks that could be quite large if we just allow that activity to continue unabated, or allow it to continue with the current damages system in place, because the damages would not be large enough to deter that activity.
Right, except the one qualifier I want to give is that the damages needn't actually be small. They could be millions or hundreds of millions of dollars. They're just something that the company can pay. Microsoft or OpenAI or Google can pay out a pretty big damages award. They can't pay out a trillion-dollar damages award, and the value of human civilization is much bigger than a trillion dollars.
So it wouldn't necessarily have to be small, but the idea is, you want them to be adequately accounting for the risks they're generating and exercising enough precautions so that they're optimizing the benefits of their activities versus the risks.
And if they don't expect to pay for a large fraction of the harm they expect to cause, because that harm comes in scenarios where compensation is impractical, then it's not going to give them adequate incentives to exercise precaution. And so what I'm trying to do with this framework is align the incentives of these companies with what they say they want to do, which is to promote social welfare and to build safe systems. But what you see in practice is that these companies were founded with high ideals, and pretty quickly they're under market pressure to stray from those, to ship products and to not exercise the ideal amount of caution.
But your plan would not cover all harmful AI activities. So to use a practical example, I had a conversation recently with Nina Brown and we discussed chatbot-generated defamation. And there's actually one ongoing legal case where someone is suing OpenAI for defamation. As I read your work, that doesn't seem like it's really something that's covered under your proposal or at least not what you have in mind.
First of all, that is insurable. And second of all, it's not really correlated with something that's catastrophic; if that activity were scaled up, it wouldn't be catastrophic. It would be bad if chatbots just ran around defaming people all the time, and that was kind of all they did, but it wouldn't be catastrophic.
So would something like chatbot generated defamation fall outside of your proposal, and you think it should be handled kind of via standard legal and insurance means?
Yeah. So, absent some specific showing that a particular defamation was part of some, you know, failed AI takeover attempt or whatever, I don't think so. You could imagine exotic scenarios in which defamation is the practically compensable harm that is associated with some catastrophic risk, but those are pretty unlikely scenarios, I would say. For run-of-the-mill defamation, I don't think punitive damages, at least on this catastrophic risk theory or uninsurable risk theory, would be appropriate.
Now, there might be some malice or recklessness, or there might be particularly reprehensible conduct, where under existing punitive damages doctrine an award might be appropriate, but that wouldn't be covered by my proposal. And in terms of whether strict liability would apply, you know, defamation is a separate tort from what I'm talking about. So I don't think my framework would really have much to say there.
So let's take another hypothetical example I thought of. I think this is more what you have in mind for your framework. Let's suppose there's a company that makes AI power-monitoring systems, and the company has a residential system that's installed in a bunch of homes, and it has an objective function stating that its goal is to save as much electricity as possible. And at some point, the AI realizes that the best way to save electricity is to hack into the home's smart meter and cut the electricity supply. Let's suppose for this example that the AI hacks into the smart meter of just a single residential home.
So, as I read your work, this seemed more like the kind of thing you have in mind. So the harm occurred in a single residential home, but it is certainly correlated with catastrophic risk, because if that same AI monitoring system were installed in tens of millions of homes, or we can think of other examples, maybe it's installed in some important commercial buildings, maybe it's put in charge of some portion of a power grid. In any of those cases, if the same kind of hacking occurred, it would be catastrophic, and the livelihoods of millions of people would be impacted.
So, in that sense, this minor harm of hacking in one residential home is correlated with catastrophic harm, and it's potentially uninsurable. An insurance company might not write a policy that would cover power outages of that magnitude that would impact millions of people. So is that the kind of example you had in mind?
Yeah. So, ultimately, under my framework, it would be a factual question for the jury, sort of how correlated this particular harm is with the catastrophic risk, and what catastrophic risks were generated by the deployment of the specific system that caused that harm.
But I think that's a case where that question should get to a jury, where it shouldn't be resolved by the judge as a matter of law. And I think in all these cases, it's going to be very difficult to do this, to quantify what this catastrophic risk was and how correlated this harm was with it. But that's why I think there needs to be more technical work to sort of lay the groundwork for that estimation.
And talk about the importance of a jury versus just a judge deciding as a matter of law. Why is that important?
So, deciding as a matter of law means the judge has concluded that no reasonable jury could reach a conclusion contrary to that ruling. Except in rare circumstances, where there's a bench trial and both parties waive the right to a jury trial, juries generally resolve questions of fact, unless a judge determines that no reasonable jury could rule otherwise.
And all I'm saying there is that it should get by that bar. It should survive a motion to dismiss or a motion for summary judgment.
When you say should, are you speaking about your belief that that's how the system should operate, or are you thinking about just factually in practice, that's how it will probably operate?
Oh, so I guess that depends on the hypothetical. Under current law, punitive damages are not going to be available in a case like that. Almost certainly.
I guess we should back up a little bit. I would say that to implement my framework through the iteration of common law decisions made by judges would require a significant doctrinal innovation for the punitive damages component of my framework. Under longstanding punitive damages doctrine, punitive damages require malice or recklessness, and I don't think that's going to be present in most of these cases.
At least human malice or recklessness. There could be some AI personhood theory where you say the AI acted intentionally or maliciously, but that's not current doctrine either. And so if you're asking me for a prediction of how this case would be handled under current law, the answer is that punitive damages would not be available as a matter of law.
Under my framework, the way I would want a case like that to operate is that it should get to a jury, with instructions to do the sort of calculation I described: estimate how correlated the harm was with catastrophic risk, and what catastrophic risks were undertaken by the deployment of the system. What should the deployer of the system have known about how risky the system was when they deployed it?
Okay, yeah. Let's talk a little bit more about punitive damages for those who might not be as familiar. So maybe we can start with just a brief overview of compensatory versus punitive damages and kind of like what falls under those two categories.
Yeah, so compensatory damages are just what they sound like. They're to compensate the plaintiff for the harms they actually suffered. In theory, they should make them indifferent between having suffered the injury and getting the money or never having suffered the injury at all. In practice, it doesn't always work out quite like that. But that's the theory of what compensatory damages are trying to do.
Punitive damages are damages over and above compensatory damages. There are different theories of what punitive damages are for. Some people think they serve an expressive function. For me, the main function they serve is to step in when there's reason to think that compensatory damages would be inadequate to deter the underlying tortious activity. And so even though this idea of “well, there were uninsurable risks being taken” isn't typically handled by punitive damages, I think it fits well with that key normative rationale for punitive damages, which is why I've incorporated that aspect into my framework.
Compensatory damages are things like hospital bills. They include pain and suffering as well.
Lost wages.
Right, that kind of stuff. And punitive damages are fines above and beyond what a particular person who suffered would be paid by the company, to punish the company, I guess, for bad behavior, and just to send a signal that what they did was wrong. Because otherwise a wealthy company might just set up a system where they made a tradeoff and they would cause whatever harm they wanted and just pay compensatory damages to the harmed individuals.
So if you thought everyone who's harmed would sue and be able to successfully recover, then I don't think punitive damages would be appropriate. Because if it's worth it for the company to do the risky thing and they can pay for all the harm they cause, then, you know, standard economic theory would say that is actually a socially beneficial activity.
But the cases that are most similar to what I'm talking about are not about catastrophic risk; they're cases where there's some reason to think most of the plaintiffs won't sue. So there's this case against Accor Hotels, involving a specific location of Motel 6. They knew they had bedbugs in a lot of the rooms, and they decided not to treat them, because they said, “Oh, it's too expensive to fix this bedbug problem. Most people won't sue. The damages to any particular plaintiff would be $500, and it'll be expensive to bring these lawsuits.”
And so a punitive damages award of, I think, something like 100x the compensatory damages was approved in that case, because that was needed to get them to change their behavior.
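[Editor's note: the arithmetic behind that kind of multiplier is the standard underenforcement rationale for punitive damages. Here is a minimal sketch with hypothetical numbers; the 1% figure is an assumption for illustration, not from the case.]

```python
# Illustrative only: if most harmed guests never sue, the award in the cases that do
# get brought has to be scaled up for the company to face the full cost of the harm.

harm_per_guest = 500       # compensatory harm per affected guest, in dollars (from the example above)
fraction_who_sue = 0.01    # assumed share of harmed guests who actually sue and recover

multiplier = 1 / fraction_who_sue                  # 100x
total_award = harm_per_guest * multiplier          # $50,000 per successful plaintiff
punitive_component = total_award - harm_per_guest  # $49,500 of that is punitive

print(f"{multiplier:.0f}x multiplier -> ${total_award:,.0f} total, ${punitive_component:,.0f} punitive")
```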
In the AI context, it's a little different. The would-be plaintiffs in the end of the world can't sue, right, for different reasons. But it's the same idea that all the lawsuits that ideally would happen in the world where the catastrophic risk is realized, those can't happen. And so if we want to deter the conduct that gives rise to that risk, we need some other mechanism, and punitive damages are what's available.
Is there not, though, a public policy argument that you would want to assign punitive damages regardless of whether everyone does sue, just to, I don't know, enforce certain kinds of ethical and moral norms?
I mean, I guess that depends on your moral theory, right? I tend to lean toward a more utilitarian approach to ethics. And so if it's true that the social value, as measured by the profitability of the enterprise, is positive once they pay for all the externalities they're causing (if they can do that and the enterprise is still profitable), I'm inclined to think what they're doing isn't actually wrong. The compensation should occur, but the activity should go forward, and punitive damages might stop socially valuable activities.
In the AI context, all I'm really trying to do is internalize the externality. And then if AI is worth pursuing once these companies are paying for all the damage that they know or should know they're risking, I think that's fine. I'm not someone who wants a hard stop on AI development. I think we want to proceed very cautiously with this research, because AI can produce a lot of benefits, but it also carries enormous risks, and we need incentives for companies to adequately account for those.
And what are some of the rules of thumb for the ratio between compensatory and punitive damages that are used by courts today?
Yeah, so this is a bit of a messy area of law. There are some constitutional constraints on punitive damages under the due process clause. And so the Supreme Court has indicated, but it's never been super clear about this, that they're gonna look with suspicion on punitive damages awards that are more than ten times compensatory damages. They haven't set that as a hard cap.
As I said, there are punitive damages awards, even in federal court, that have been approved at much higher ratios than that. Some states also have laws that cap punitive damages at a much lower level, sometimes double or triple compensatory damages. Now, there's some potential for legislation to put companies on notice of these punitive damages, and maybe that would obviate some of the constitutional concern. But if we get to the point where common law courts try to implement this, there is a concern that the Supreme Court could stand in their way.
Let's shift and talk more in depth about your proposal. So how do things work today in terms of assigning punitive damages? How would they work under your proposal and framework? And where's the novelty coming in? Because in your paper you mentioned that your plan is novel. It's a bit different than what's done today. You mentioned earlier it would require some doctrinal innovations. So talk about that piece as well.
So there are two changes to punitive damages doctrine that would be required to implement my framework. One, as I mentioned earlier, is this requirement of recklessness or malice as a threshold before you're, sort of, in the punitive damages game. And I think that's just unlikely to be present, at least on the part of the humans training and deploying these systems.
Now, maybe in some misuse cases it would be present, but that would be on the part of the deployer or misuser, not on the part of the entity building the system, which is unlikely to be acting with malice. And the malicious actor might not really be the entity you're trying to deter; like the terrorist group or whatever that's using AI to build a bioweapon, you might not be able to recover from them anyway.
And so, at least if we're relying on human conduct as the basis for punitive damages, which, for reasons I can get into, I think is probably what we want to do, you would need a change in the law that allows punitive damages in cases of ordinary negligence or even strict liability, which would be a significant doctrinal change.
The other is that there's no real precedent for basing the punitive damages calculation on these counterfactual or speculative harms. You're saying, well, the system didn't do this catastrophic harm, but it revealed its misalignment or whatever in this non-catastrophic way, and it could have gone this other way if the world had looked a little different. The people who deployed it couldn't have been confident it wasn't going to fail much more catastrophically, so we're going to hold them liable for that. That theory is pretty novel, and so again, it would require a significant change to punitive damages doctrine to accommodate it.
And that would be undertaken by who? The courts? I suppose they would just have to start thinking about damages in a different way?
It could happen through the accumulation of precedent in different state courts. Any plaintiff could bring a case like this, cite my article, and argue this theory, and courts could adopt it on their own. It's clearly within their common law powers, even though it would be a departure from past precedent.
Or you could have legislation, either at the state or federal level, that implements this. Common law is always below statutes in the hierarchy of law; state statutes can always preempt or displace the common law. And so if a state wants to overturn past decisions limiting punitive damages to cases of malice or recklessness and allow them under this counterfactual or catastrophic risk theory, state legislatures are clearly within their authority to do that.
Have any state legislatures started to move in that direction at all?
So I don't want to blow up anyone's spot here, but I am in conversations with some state legislators who are interested in pursuing this, and that's all I'll say right now.
Okay, sure. So let's continue to walk through the details of your framework. Maybe we can use the hypothetical example I laid out earlier as a kind of case study. So again, an AI hacks into the smart meter of a home and cuts the power, and the homeowner then brings a lawsuit. What are the key checkpoints of your framework that we need to think about as the case moves through the court system?
First is the question of whether there's liability at all. Under a negligence standard, the question would be: did the company that deployed the system, or the company that built it, exercise reasonable care? And so the question there is whether there's some reasonable precautionary measure the company could have taken, that a reasonable person would in fact have taken, that would have prevented the injury. You have to be able to point and say: that's what you should have done, you failed to do it, what you did was therefore negligent, and that negligence was the cause of the injury to the plaintiff.
And, you know, that would depend on the details of the facts, but it's not obvious that you'd be able to bring that claim. There's potentially also a products liability claim there, which is nominally strict liability, but the way it would operate in practice would likely be as a design defect theory, and there the test ends up being fairly similar to a negligence analysis: you would have to prove that there was some reasonable alternative product design that would have been safer.
Sorry, just to interrupt. And what if the answer to those negligence questions is no? Did I pick a bad example of a case study because I said the AI hacked into the smart meter, which I guess might imply some negligence?
So you would have to show that some human upstream failed to exercise some precaution that would have prevented the AI system from doing this hacking. If you can't point to something, some unreasonable thing they did or some precaution they unreasonably failed to take, then I don't think there would be negligence right now. There could have been, but I expect people to behave in ordinary, reasonable ways, at least in ways that aren't provably unreasonable.
Another important thing to note here is that the scope of the negligence inquiry is not infinitely broad. To use an example outside of the AI context: if you hit a pedestrian with your car and you weren't doing anything specifically negligent (it's just that driving is always somewhat risky, they were in your blind spot, you weren't speeding, you weren't texting), you're not going to be held liable for negligence, even though your driving generated that risk. And say you were driving an SUV when you could have been driving a compact sedan, and the injuries are much worse because an SUV is a heavier vehicle; that still doesn't mean you'll be held liable.
Because there's not some specific negligence we can point to. And part of the inquiry is not “you should have been driving a smaller car” or “you should have walked because you shouldn't have generated this risk at all,” even if, say, it wasn't a very important car trip you were taking. That's just not part of the negligence inquiry.
And similarly, I think it's unlikely to be part of the negligence inquiry that you just shouldn't have been building these advanced AI systems at all. Right? Now, maybe you can say, “Oh, you should have done some specific red teaming,” particularly if it's one of the labs that are being less cautious, and you can point to it and say, “Well, this is the industry standard and you're not following it.” That would be evidence of negligence.
But take one of the labs that is being more cautious, just not as cautious as I'd like them to be. They're taking all the steps that are obvious; I can't tell them what they should be doing that's better. I think sometimes they should take six months, try to think harder, and wait until they have better interpretability tools or whatever it is. But they're not failing to do anything that a negligence standard would impose as reasonable care. Does that make sense?
Yeah, that was a great explanation. I'm glad I said something stupid so you could respond in that way.
No, not at all.
I guess the bottom line is there has to be some kind of negligence, given the test that you mentioned earlier. And if there isn't, even if there's harm, if the AI companies were acting reasonably, they couldn't foresee it, yada yada, then there's really no case to be...
To be had, under current law. I think that's likely, yeah.
Okay.
Now I should say there is this abnormally dangerous activities doctrine, right, which says that strict liability should apply for activities that are not in common use and that still create a significant amount of risk of large injuries even when reasonable care is exercised. This is a sort of meta doctrine. And then state courts tend to, under this doctrine, pick certain activities and label them abnormally dangerous and then apply strict liability.
So, for instance, in a lot of states, blasting with dynamite is an abnormally dangerous activity, even if you exercise reasonable care. If a piece of concrete flies off and hits someone, you'll be liable, even though you took all the ordinary, reasonable steps that someone would take when blasting with dynamite. Unlike with the punitive damages change I was talking about earlier, I do not think it would be a significant departure from this sort of meta-doctrine to extend it to training and deploying these advanced AI systems.
But if you're talking about the status quo and a sort of naive extrapolation of existing law to AI, that would not be my base prediction of what's likely to happen. I think courts should do it. I think it's not a significant doctrinal move, and I think it's a justified move, but I don't think it's the default thing that's likely to happen.
And so let's assume there is negligence. So what happens today, and what would you like to see happen?
Oh, so if there's negligence, then you would be able to recover compensatory damages. If there's not malice or recklessness, then you would only be able to recover compensatory damages, even if there's some showing that, well, this could have gone a lot worse. Compensatory damages don't include things that didn't happen; they only cover harms that you actually suffered.
And so suppose there's some reason to think that deploying the system that ended up doing this hacking could have gone a lot worse. The system could have had much more ambitious goals in saving power. Or, even if its only goal were to reduce power consumption, it might have thought, “Well, I'm afraid of being shut down. I have to take over all the computer networks in the world to avoid getting shut down.” If that was a real possibility, or at least something that the people who deployed the system couldn't have ruled out, if there was a one-in-a-million chance that it would happen and it would produce really catastrophic outcomes, say trillions of dollars' worth of damage, right now you can't recover for that possibility.
Yeah, exactly. But your plan is you should be able to recover for that possibility because as we were talking about earlier, the potential harm if that system was scaled up or—
So the point about the system being scaled up is a subtle one that I want to clarify. The harm, or rather the risk, that I want to internalize is the risk that was actually undertaken by the human conduct that has already occurred. So if you deploy the system in a small-scale way, such that in this deployment setting the risks were small, then I don't think catastrophic-risk damages are appropriate, if the risk wouldn't arise absent some future human conduct. The idea is more that, okay, you did this deployment, and given that deployment, something much worse could have happened as a result of the conduct you've already undertaken. So it's a risk that you already took. It just wasn't realized.
What I'm trying to do is internalize that risk, which we can't use compensatory damages for if it's realized. The only way to make you account for that risk is to do it in a case where the catastrophe doesn't arise but some other practically compensable harm does. But it is a risk that you've actually taken, not one that you might take in the future.
So the idea is we got lucky with this one. Like, it could have been a lot worse.
It was a near miss.
Yeah, a near miss. Okay. Another thing I wanted to clarify. Does your plan just apply to the U.S. context? How should we think about this framework in terms of international law?
I think in terms of my descriptive analysis of how the law is likely to play out, it's broadly similar in other common law countries, which are mostly English-speaking countries. I'm not as familiar with the way tort liability works in civil law systems, but I think the high-level point, that the sort of punitive damages I'm calling for are unlikely to be available, is likely to be true basically everywhere in the world right now. And importantly, I think the normative arguments I'm making for what an ideal liability framework should look like are the same everywhere. The doctrinal levers you would have to pull to create that framework are going to be different in different places, and I encourage legal scholars and lawyers who work in other legal systems to do the work of figuring out what those levers are. My paper maps out the moves you'd have to make under U.S. law and other common law systems, but I think it's a fruitful project that people could undertake in other legal systems.
The foundation of your framework rests on an expected value calculation, which involves multiplying the probability of a harm occurring by the magnitude of the harm if it does occur. And with large enough magnitudes, even small probabilities can yield enormous expected harms.
Right. So, to be clear, I think there's often a caricature of the AI risk concern, that we're worried about these infinitesimal probabilities, or small, finite probabilities, of catastrophic harm. I think most people who are worried about this think it's 10% or more likely that really catastrophic things will happen in the aggregate. That doesn't mean that any particular system that's deployed will have that high a chance, or that we're going to try to apply 10% of the value of human civilization as damages. Even that would be uninsurable, so it's not a feasible punitive damages award.
The idea is that we're trying to catch them when they deploy a system that had, say, a one-in-a-million chance of causing something on the scale of human extinction. And still, a very large damages award would be appropriate in a case like that. You want to encourage them to do the types of things that would reduce that risk, say, from one-in-a-million to one-in-a-billion or one-in-a-trillion. I don't think you're ever going to be able to get it all the way down to zero, but you want to get it to the point where the social value of the activity they're undertaking outweighs the risk.
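[Editor's note: a quick version of the expected-value arithmetic Professor Weil describes here, with a purely hypothetical figure for the value at stake.]

```python
# Illustrative only: even a tiny probability of an extinction-scale outcome carries a
# large expected harm, and precautions that shrink the probability shrink it in proportion.

value_at_stake = 100e12            # assumed dollar proxy for an extinction-scale loss
for p in (1e-6, 1e-9, 1e-12):      # one-in-a-million, -billion, -trillion
    print(f"p = {p:.0e}: expected harm = ${p * value_at_stake:,.0f}")

# p = 1e-06: expected harm = $100,000,000
# p = 1e-09: expected harm = $100,000
# p = 1e-12: expected harm = $100
```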
One might argue that similar logic about catastrophic risk can be applied to technologies outside of the AI space. So do you view your framework as primarily about AI?
So, I think there are two elements that get at what you're talking about. One is whether a technology generates uninsurable risks, which is true of other technologies.
The other is whether we need to lean on tort law to address those uninsurable risks. So maybe nuclear power has uninsurable risks. We typically don't use the tort system to deal with that. We rely on prescriptive regulation. I have some issues with the way those prescriptive regulations are designed in the US, but we do know how to build safe nuclear power plants. And so we can tell people who want to build them, you have to follow these regulations. They have to be this safe. And we know what to tell them to do to make those power plants safe.
I don't think we know how to tell OpenAI or DeepMind or Anthropic how to build safe AI systems. I think they know better than the regulators do, and they still don't know. And so, not that there's no scope for prescriptive regulations, but they are not going to be sufficient to give us the level of confidence we want that these systems are safe. Therefore, what I think we want to do is push the onus onto the companies building these systems, which have the most expertise about the risks and about how to make them safe, and say: you're going to pay for the harm you expect to cause; you figure out how to make them safe. I'm much more confident that that's going to produce safe systems than government regulators, who have less knowledge about how these systems work, trying to come up with rules the companies have to follow.
And I guess there's a tradeoff, though, right? Because legislation is preemptive and can force a company to do something, but I think your approach is to wait until something harmful, but not catastrophic, happens, and to punish the company or send a signal at that point.
Well, I think the signal is sent as soon as it's clear that that's what the legal regime is. So the idea is less that the actual damages award is what changes behavior; rather, the expectation of the damages award empowers the more cautious voices within these companies to say: we need to do this, and not just out of altruistic notions that we should try harder to make these systems safe. We should do it because it's in the interest of our bottom line.
In particular, I think about a scenario where, say, there's this organization called METR, formerly known as ARC Evals, that does these dangerous-capability evaluations of models and also does some alignment evaluations. I'm imagining a future scenario where, say, GPT-6 shows dangerous capabilities, and maybe shows some potential for misalignment. And the question is, what should OpenAI do about that? There's some cheap, dumb solution, which is just applying reinforcement learning from human feedback to iron out the specific failure mode, and basically no one thinks that's a good idea. And there's some intermediate thing where you roll it back a little bit and do some retraining, and some people say, “Oh, that's good enough,” and other people say, “No, no, we really need to either do some really rigorous adversarial training, or we need to wait however long it takes until we have better interpretability tools and put a lot of money into that before we can deploy the system.”
I want to empower the voices within OpenAI or within Anthropic or within DeepMind to say, look, it's not just for altruistic reasons, but it's in the interests of our shareholders or whoever the financial stakeholders are to do the more cautious things, do the things that will actually make the system safe enough that it's worth it from a social perspective to deploy it.
So if we think about the ecosystem holistically: legislatures will have put your punitive damages framework into place. And then, in addition to their own red teaming, model producers will have their eye on these independent model evaluations and risk assessments, maybe a bit more than they do today, because they face the possibility of large punitive damages if they get into the near-miss scenario we talked about, and they want to minimize that exposure. And there'll be a tighter feedback loop between model evaluators and model producers, and that will lead to safer AI systems.
Right. And the ideal version, the most robust version of my framework, would include liability insurance requirements that scale with model capabilities. So you would have one set of evaluations that are used to determine what the coverage requirement is, and then, ideally, there would be a second set of evaluations, developed by the insurance industry, to do the underwriting. And so if you can show that your model is safer, if you can do some alignment evaluations that show it's very unlikely to do the kind of harm that would result in this punitive damages award, then they'll write you an insurance policy you can afford. If you can't persuade the insurance company to do that, then you can't deploy the system. And so when you said, oh, it's just sort of ex post, I think there is some sort of prior restraint in that version.
It's not a prescriptive regulation, but it is saying you have to be able to prove to someone, some financial backer that's willing to say, we'll write you an insurance policy you can afford because the system is safe enough or because it doesn't have particularly dangerous capabilities.
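[Editor's note: a minimal sketch of how the insurance mechanism described above might fit together. The capability tiers, coverage amounts, and premium rule are invented for illustration; none of this comes from Professor Weil's paper or any real insurance product.]

```python
# Illustrative only: coverage requirements that scale with capability evaluations,
# and premiums that depend on how safe the insurer believes the model to be.

def required_coverage(capability_score: float) -> float:
    """Hypothetical coverage requirement (dollars) keyed to a capability-eval score in [0, 1]."""
    if capability_score < 0.3:
        return 10e6        # low-capability model
    if capability_score < 0.7:
        return 1e9         # near-frontier model
    return 100e9           # frontier model with dangerous-capability flags

def annual_premium(coverage: float, p_claim: float, loading: float = 1.3) -> float:
    """Insurer's premium: expected payout times an assumed 30% loading."""
    return coverage * p_claim * loading

coverage = required_coverage(capability_score=0.8)    # $100B required coverage
print(annual_premium(coverage, p_claim=1e-4))         # persuasive safety case: $13M/year
print(annual_premium(coverage, p_claim=1e-2))         # unpersuasive safety case: $1.3B/year
```

On these assumed numbers, a developer that can make a persuasive safety case to the underwriter gets a premium it can pay, while one that can't is effectively subject to the prior restraint Professor Weil describes.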
Let me run a critique by you and get your feedback. Some AI safety advocates, and I'm thinking of folks like Emily Bender at my alma mater, UW, have said that concerns about AI extinction and catastrophe are misplaced and a distraction, because they're far off in the future. There's a lot of uncertainty; sure, it could happen, but we don't know much about what that would look like. Meanwhile, today, they would argue, we have AI bias, AI-generated misinformation, bias in judicial sentencing, and potentially high levels of unemployment coming very soon for certain fields that AI will impact. Political capital for legislative and judicial reform is finite. Should we be focusing more on these near-term tangible harms and putting effort into reform there, rather than thinking about something uncertain and off in the distance having to do with catastrophic or extinction risk?
So I think it's subject to dispute how far off in the distance it is. I tend to have somewhat longer timelines than other people in the safety world, but I think even what are considered long timelines are on the order of decades until we have transformative artificial intelligence. And there are certainly people who think it could be in the next five to ten years, and some who think it's even sooner than that. I don't rule those out; I put significant probability on those outcomes. So I don't think this is so far off. It takes time to change legal systems, and importantly, at least for my framework, it's very important that the expectation of liability be in place early, so it shapes the decisions that companies are making as they deploy these systems.
And I don't think worrying about catastrophic risk really competes with worrying about these present concerns. First of all, I think a lot of the concerns being raised there could be addressed through tort liability. There are ways to craft versions of this framework that would accommodate a lot of those concerns, and I think even the framework as written does accommodate a lot of them. And more broadly, I think a political coalition that combines people who are worried about more present concerns with people worried about the somewhat more speculative (I wouldn't say far off, but I would agree more speculative) catastrophic risk concerns could move some of these things. Also, the version of my proposal that just happens through the courts doesn't take legislative attention away from anything else.
So in terms of the coalition you mentioned, do you think that the more tangible concerns of some AI safety advocates and your important but more speculative concerns are actually complementary in terms of legislative or judicial reform?
Well, I certainly don't think they're at odds with each other. I think there tends to be a sort of rhetorical and turf battle between what's sometimes known as the AI ethics community, which is more worried about things like algorithmic bias and data protection, and the AI safety or AI catastrophic risk community. And I think that is not really justified. I don't think our policy proposals are really at odds; at least, my preferred approach to regulating catastrophic risk is not at odds with anything that people worried about algorithmic bias are trying to do. There are lots of problems in the world. A lot of my career has been spent addressing climate change, and I don't think that worrying about AI risk takes away from that. They're just different problems that require different policy tools.
While we're on the topic of critiques, are there any critiques of your work that you've received that you want to comment on or respond to?
Yeah. So the critique I take most seriously is about how implementable this framework is. Can you actually do these punitive damages calculations in a meaningful way to assess liability? And the short answer is, I think it's feasible, but it will take some real work.
But whatever you think the concerns are about doing that, the technical barriers to prescriptive regulation are even higher, because you not only need to be able to estimate the risk to figure out how stringent the regulation has to be, but you also have to figure out what to tell these companies to do to make their systems safe and reliable. And I just have much less confidence in that.
The other thing I would note on that score is that I think even getting within an order of magnitude, a rough calculation of the risk, is going to do a lot better than right now, where we're just not accounting for it at all. So even getting close to internalizing that risk, with some bound of uncertainty, some reasonably accurate or rough calculation, would be a big improvement over where we are now. Right now we're relying almost entirely on the goodwill of these companies. And luckily, most of the frontier labs are at least making noises like they're very concerned about this problem, but I think you can see from some of their behavior that they're under incentives that make it hard for them to stick to those commitments. OpenAI was founded largely to try to limit AI risk, and now you see them coming under a lot of criticism for moving fast. Anthropic was formed by defectors from OpenAI who were concerned it wasn't being safe enough, and now Claude 3 is deployed and is, by at least many metrics, the most advanced AI system. Anthropic's original rationale for doing near-frontier work was that they needed to have near-frontier systems to be able to study them and do alignment research, but they weren't planning to move the ball forward or push the envelope on capabilities. It seems like now they have at least arguably defaulted on that commitment.
I don't mean to malign any of the leaders of these companies. I think they are under a lot of pressure, but I want to empower the actors within those companies that want to be more cautious. And I want to make it in their interest because I don’t think we can just count on the goodwill of these players when their objective incentives are pointing in different directions.
We're almost out of time. Let's close by talking about what's next for you and how you plan to extend this work going forward.
Yeah. So as I said, I'm working with some state legislators on legislation based on this framework.
In terms of future research, one way I'm thinking about extending it is international reciprocity: different countries recognizing and enforcing each other's tort judgments.
I'm also thinking about different potential failure modes for this proposal. One is the international coordination problem: even if one country does enact this proposal, does that just shift AI development to other countries?
But there are other potential failure modes. One concern is, what if we don't get these warning shots or near-miss cases? Then punitive damages aren't going to be able to meaningfully internalize that risk, at least if we both don't get them and no one expects to get them. And so I want to think about what policy tools or legal tools work in those worlds.
There are also maybe types of harm that aren't legible, that aren't even legally or practically compensable. Some people are worried about AI-based political misinformation causing chaos, right? I think that's unlikely to lead to a successful tort judgment, and I want to think about what kind of policy tools you might want for that. And then there are pathways of harm that I think aren't plausibly legally compensable. So say a company, say Meta, open-sources Llama 3. It's not that someone takes that system, modifies it, and does harm with it; I think plausibly you could hold Meta liable for that. The pathway for harm is instead that Chinese AI labs learn from Llama 3, it brings them closer to the frontier, that makes U.S. frontier labs like OpenAI feel like they need to move faster, it accelerates arms race dynamics, and then that leads someone to be harmed by GPT-6. I think holding Meta liable for that is going to be totally infeasible, and I wouldn't even think that a plausible tort liability reform would address it. So the question is, do we need some other policy tool that accounts for that risk Meta is generating of accelerating race dynamics when they release the system? Now, there are safety benefits of open source: alignment researchers can do more with open-source models than they can behind an API. So I would want to think carefully about how you balance those risks, but that's another area I'm thinking about.
Can't wait to read it when it comes out. Professor Gabriel Weil, thanks for being on the podcast.
Thanks so much. This was great.