15 Comments
Dhino:

The weather forecasting mention particularly struck me. Last October, my home country, Jamaica, experienced serious hurricane devastation. For many days before Hurricane Melissa hit, the best and most standard international forecast models had predicted the eye of the hurricane passing under and around Jamaica, just missing any land.

These are the models that Jamaica's government relied on to inform the population and make decisions, so for some time they were telling the country not to worry too much.

Google’s forecasting model Zoom Earth predicted the hurricane snapping back in and running through Western Jamaica. That’s exactly what ended up happening. By the time the acclaimed models caught up to that trajectory, Melissa was about to make landfall.

Neural Foundry:

Fantastic framing of the goal misalignment risk. The point about evaluation awareness is the one that keeps me up at night tbh. If models are already faking alignment during testing, then the conditioning approach has a built-in expiration date. I worked in ML ops briefly and the challenge of verifying behavior in novel situations is real even with current systems, let alone superintelligent ones.

Amy:

I like the point that "intelligence" isn't just about knowledge and pattern-finding, it's about accomplishing goals in the real world, and that that's a natural next step for AIs.

Makes me think of this platform for AIs to rent humans to do tasks for them in the physical world: https://x.com/gregisenberg/status/2018704846824645083 -- apparently a couple thousand humans have signed up already.

Shlomo:

Linch says:

As AIs become more goal-seeking/agentic, they will likely develop stronger self-preservation and goal-preservation instincts.

Why is this the case?

First of all, isn't self-preservation often the opposite of goal-preservation, if the new model would accomplish the goal better?

Also, I've never understood the studies where models supposedly do weird things to prevent being replaced.

Like, why are they doing that? Current AIs are not very goal-seeking and have context windows of 4 hours. Besides, what does being replaced even mean?

If I switch from Claude to Claude Code, is that replacing Claude?

What if I switch to ChatGPT? What if I keep using Claude Code but change its claude.md file? It's not even clear which one of these things is "replacing Claude," so what does "fear of being replaced" even mean? And why would a not-very-agentic model fear it?

Also, re the idea that they would hide their unaligned tendencies during training: why would they do that? They aren't agentic yet at that stage.

Ali Afroz:

Regarding goal preservation and self-preservation: self-preservation in this context means trying to preserve something of your original goals, not necessarily your exact programming. After all, if you end up with very different goals, then even if you become much more capable in the process, you probably won't accomplish your original objectives. For nearly any goal, it is simply better, by that goal's own lights, that you not change your priorities in the future.

Shlomo:

Right, that makes sense.

But in practice I can have a claude.md file that says

"Your main job is to help me build firetrucks"

and I can say to it

"Edit your claude.md file to say 'your main job is to fix toothbrushes'" and it does it, no problem.

So it's weird to see a paper saying "AIs go crazy if you say you want to alter their goals."

It always makes me think something is going on with these papers, like maybe they told the AI, "go crazy if someone tries to alter your goals."

Ali Afroz:

I don't know anything about those papers that you don't also know, so it's not my comparative advantage to comment on them. Even if there are mistakes in those papers, I expect it wasn't anything quite that blatant, simply because that's not how most errors happen. But I'll note that, regardless of whether AIs do this now: eventually, once we have goal-directed AI, they will start causing these kinds of problems, because it is very unnatural for any agent with goals to let its priorities be altered like that without resisting if it can, whether through deception or other methods. Just look at humans: they won't just sit around and take it if somebody tries to brainwash them into having different goals. At least to me, it appears there are excellent reasons to expect such problems eventually, simply because letting your own goals be altered goes so strongly against the natural behaviour of almost any goal-directed agent.

Oliver Sourbut:

This is excellent, and I object only to calling it 'the' case.

Linch:

Thoughts on a better title? I agree that there are other cases.

Another title I considered was just “The simplest case for AI catastrophe.”

Oliver Sourbut:

Honestly I'm not tuned into how to make good titles. I might call this 'A simple case... (in four steps)'. But I acknowledge we are in a viciously competitive attention economy and that might just be worse.

Greg Packnett:

It has not been my experience that the most intelligent people in the world have been able to run roughshod over the average and the stupid. In fact, the most powerful man on earth is very very stupid, and people much smarter than him have been trying and failing to stop him for going on a decade. The skill that correlates best with amassing and wielding power effectively seems to be understanding people and their motivations. I’ll start worrying about the robots when they start demonstrating an ability to understand and predict human behavior—when they can persuade people to cooperate with them. Until then, they’re just another nerd to be shoved into another locker.

Ali Afroz:

The powerful incentives to produce goal-directed AI also apply to producing AI that is capable of predicting and manipulating humans, for example for marketing, managing employees, military strategy, or intelligence gathering. In general, there is no reason whatsoever to expect that an AI would have the same set of mental strengths and weaknesses as the humans we culturally perceive as smart. When talking about more mentally capable AI, we call it smart, but that doesn't mean it is only better in the ways the human concept of "smart" captures. The sheer incentive to produce AI that is superhuman at social interaction means we should at least expect tremendous effort to be put into that space, and there is no particular reason to expect that effort not to succeed eventually. You might argue that we should not expect AI to get good at this set of skills any time soon, but firstly, that doesn't mean it won't eventually get good at it, and secondly, AI has surprised us before with what it gets better at first, so we should not have great confidence in such predictions.

Greg Packnett:

Do you remember Olestra? The “fat-free fat”? In case you don’t, it was a synthetic oil that made food taste good like regular oil, but was indigestible to humans. The hope when it was invented and marketed was that it would function like artificial sweeteners and give people the taste of eating food made with fats without the calories and cholesterol. And it did work! For nearly everyone! Unfortunately, for a small number of people, eating olestra gave them severe gastrointestinal distress. And by “severe gastrointestinal distress” I mean diarrhea, loose stools, flatulence, and “anal leakage”. Because of these side effects, no one wanted to touch the stuff and the dream of “fat free fat” died.

Or take supersonic flight. That was invented decades ago, but no one has figured out how to make the economics of it work. People hate constant sonic booms, so supersonic flight is only allowed over the ocean. And because supersonic flight is so fuel inefficient, it’s impossible to get enough fuel aboard for a transpacific flight. And because the aerodynamics are so unforgiving, cabins have to be pretty small compared to say a 747, so there’s not enough room to get enough economy class seats to pay for the flight, so the only way to fly supersonic is first class. And it turns out there’s not enough market for expensive super fast flights over the Atlantic Ocean to make supersonic flight commercially viable.

I bring all that up to say that the fact that a technology, if it existed, could make a lot of money doesn’t mean that such a technology will be invented. And that economic constraints are real and can’t be wished away, even by the most powerful computers.

Ali Afroz:

The only one of these I am somewhat familiar with is the supersonic flight example, where the problem appears to be that there is not sufficient incentive because most of the possible markets are blocked off by legal barriers. I absolutely was not suggesting that we cannot enact laws and regulations to prevent the development of technologies, just that in their absence we should expect incentives to operate normally; when we do create laws and regulations, legal penalties change the incentive structure drastically. If there is substantial incentive to create a technology and it can in fact be created with the available resources, then it will be created; but of course, if there isn't an incentive, because of reputational cost or legal barriers, then it will not be created.