Why powerful instrumental reasoning is misaligned by default, and what we can do about it

Meta: There are ways in which I intend to update and expand this line of thinking. Given uncertainty about when I will be able to do so, I decided to go ahead and post this version of the essay now.

The AI alignment problem poses an unusual, and maybe unusually difficult, engineering challenge. Instrumental reasoning, while plausibly a core feature of effective and general intelligent behaviour, is tightly linked to the emergence of behavioural dynamics that can lead to potentially existential risks. In this essay, I will argue that one promising line of thinking towards solving AI alignment seeks to find expressions of instrumental reasoning that avoid such risk scenarios and are compatible with, or constitutive of, aligned AI. After taking a closer look at the nature of instrumental rationality, I will formulate a vision for what such an aligned expression of instrumental reasoning may look like.

I will proceed as follows. In the first section, I introduce the problem of AI alignment and argue that instrumental reasoning plays a central role in many AI risk scenarios. I will do so by describing a number of failure scenarios, grouped into failures related to instrumental convergence and failures related to Goodhart’s law, and illustrating the role of instrumental reasoning in each case. In the second section, I will argue that we need to envision what alignable expressions of instrumental rationality may look like. To do so, I will first analyse the nature of instrumental rationality, identifying two design features (cognitive horizons and reasoning about means) that allow us to mediate which expressions of instrumental reasoning are adopted. Building on this analysis, I will sketch a positive proposal for alignable instrumental reasoning, centrally building on the concept of embedded agency.

AI risk and instrumental rationality; or: why AI alignment is hard

Recent years have seen rapid progress in the design and training of ever more impressive-looking applications of AI. In their most general form, state-of-the-art AI applications can be understood as specific instantiations of a class of systems characterised by having goals which they pursue by making, evaluating, and implementing plans. Their goal-directedness can be an explicit feature of their design or an emergent feature of training. Putting things this way naturally brings up the question of what goals we want these systems to pursue. Gabriel (2020) refers to this as the “normative” question of AI alignment. However, at this stage of intellectual progress in AI, we don’t know how to give AI systems any goal in such a way that they end up (i) pursuing the goal we intended to give them and (ii) continuing to pursue the intended goal reliably, e.g. across distributional shifts. We can refer to these two technical aspects of the problem as the problem of goal specification (Krakovna et al., 2020; Hadfield-Menell et al., 2016) and the problem of goal (mis)generalisation (Shah et al., 2022; Di Langosco et al., 2022; Hubinger et al., 2019), respectively.

At first sight, one might think that finding solutions to these problems is ‘just another engineering problem’. Given human ingenuity, it should be only a matter of time before satisfactory solutions are found. However, there are reasons to believe the problem is harder than it may initially appear. In particular, I will argue that much of what makes alignment hard to solve stems from behavioural dynamics that emerge from powerful instrumental reasoning.

By instrumental rationality, I refer to means–end reasoning that selects whichever means are optimal in view of the end being pursued. It is itself a core feature of general intelligence as we understand it. ‘Intelligence’ is a broad and elusive term, and many definitions have been proposed (Chollet, 2019). In the context of this essay, we will take ‘intelligence’ to refer to a (more or less) general problem-solving capacity, where its generality is a function of how well it travels (i.e., generalises) across different problems or environments. As such, AI is intelligent in virtue of searching over a space of action–outcome pairs (or behavioural policies) and selecting whichever actions appear optimal in view of the goal it is pursuing.

The intelligence in AI—i.e., its abilities for learning, adaptive planning, means–end reasoning, etc.—renders safety concerns in AI importantly disanalogous to safety concerns in more ‘classical’ domains of engineering such as aerospace or civil engineering. (We will substantiate this claim in the subsequent discussion.) Furthermore, it implies that we can reasonably expect the challenges of aligning AI to increase as those systems become more capable. That said, observing the difficulty of solving AI alignment is interesting not only because it can motivate more work on the problem; asking why AI alignment is hard may itself be a productive exercise towards progress. It can shine some light on the ‘shape’ of the problem, identify which obstacles need to be overcome, clarify necessary and/or sufficient features of a solution, and illuminate potential avenues for progress. 

Both the philosophical and the technical literature have discussed several distinct ways in which such means–end optimisation can lead to unintended, undesirable, and potentially even catastrophic outcomes. In what follows, we will discuss two main classes of such failure modes – instrumental convergence and Goodhart’s law – and clarify the role of instrumental reasoning in their emergence.

Instrumental convergence describes the phenomenon whereby rational agents pursue instrumental goals, even if those are distinct from their terminal goals (Omohundro, 2008; Bostrom, 2012). Instrumental goals are things that appear useful towards achieving a wide range of other objectives; common examples include the accumulation of resources or power, as well as self-preservation. The more resources or power are available to an agent, the better positioned it will be to pursue its terminal goals. Importantly, this also applies if the agent in question is highly uncertain about what the future will look like. Similarly, an agent cannot hope to achieve its objectives unless it is around to pursue them. This gives rise to the instrumental drive of self-preservation. Due to their near-universal usefulness, instrumental goals act as a sort of strategic attractor towards which the behaviour of intelligent agents will tend to converge.
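To make the convergence intuition concrete, consider the following toy sketch (my own illustration, not drawn from the cited works; all action names and payoff numbers are hypothetical assumptions). For a wide range of randomly sampled terminal goals, the same resource-acquiring action comes out on top, simply because extra resources improve the odds of achieving almost any goal:

```python
import random

random.seed(0)

ACTIONS = ["pursue_goal_directly", "acquire_resources", "preserve_self", "idle"]

def expected_value(action, goal_difficulty, horizon):
    """Crude expected value of an action for a goal of the given difficulty.

    Assumption: extra resources and continued existence raise the chance of
    eventually achieving *any* goal, while idling contributes nothing.
    """
    if action == "pursue_goal_directly":
        return 1.0 - goal_difficulty                     # one-shot attempt now
    if action == "acquire_resources":
        return horizon * 0.2 * (1.0 - goal_difficulty)   # boosts all later attempts
    if action == "preserve_self":
        return horizon * 0.1 * (1.0 - goal_difficulty)   # keeps later attempts possible
    return 0.0                                           # idle

# Sample many unrelated goals (random difficulties) and see which action wins.
wins = {action: 0 for action in ACTIONS}
for _ in range(1000):
    difficulty = random.random()
    best = max(ACTIONS, key=lambda a: expected_value(a, difficulty, horizon=10))
    wins[best] += 1

print(wins)  # resource acquisition wins for (nearly) every sampled goal
```

The specific numbers are irrelevant; what matters is that the winning action is insensitive to which goal was sampled, which is precisely what makes it a strategic attractor.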

How can this lead to risks? Carlsmith (2022) discusses at length whether power-seeking AI may pose an existential risk. He argues that misaligned agents “would plausibly have instrumental incentives to seek power over humans”, which could lead to “the full disempowerment of humanity”, which in turn would “constitute an existential catastrophe”. Similarly, the instrumental drive of self-preservation gives rise to what the literature discusses under the term of safe interruptibility (e.g., Orseau and Armstrong, 2016; Hadfield-Menell et al., 2017): AI systems may resist attempts to be interrupted or modified by humans, as this would jeopardise the successful pursuit of their goals.

The second class of failure modes can be summarised under the banner of Goodhart’s law. Goodhart’s law describes how, when metrics are used to improve the performance of a system, and when sufficient optimisation pressure is applied, optimising for these metrics ends up undermining progress towards the actual goal. In other words, as discussed by Manheim and Garrabrant (2018), the “Goodhart effect [occurs] when optimisation causes a collapse of the statistical relationship between a goal which the optimiser intends and the proxy used for that goal.” This problem is well known far beyond the field of AI and arises in domains such as management or policy-making, where proxies are commonly used to gauge progress towards a different, and often more complex, terminal goal.

How does this manifest in the case of AI? It is relevant to this discussion that, in modern-day AI training, the objective is not directly specified. Instead, what is specified is a reward (or loss) function, based on which the AI learns behavioural policies that are effective at achieving the goal (in the training distribution). Goodhart effects can manifest in AI applications in different forms. For example, the phenomenon of specification gaming occurs when AI systems come up with creative ways to meet the literal specification of the goal without bringing about the intended outcomes (e.g., Krakovna et al., 2020). The optimisation of behaviour via the reward signal can be understood as exerting pressure on the agent to come up with new ways to achieve the reward effectively. However, achieving the reward may not reliably equate to achieving the intended objective.
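To illustrate the dynamic, here is a toy sketch of my own (the quantities are hypothetical and not taken from the cited works). A fixed effort budget can be split between doing real work and gaming the metric; the proxy cannot tell the two apart, so optimising the proxy drives all effort into gaming while the true objective collapses:

```python
# Toy Goodhart sketch (hypothetical quantities). A fixed effort budget can be
# split between real work w and metric gaming g. The proxy cannot tell the
# two apart and rewards gaming more per unit of effort, so optimising the
# proxy pushes all effort into gaming and the true objective goes negative.

EFFORT_BUDGET = 10.0

def proxy_reward(w, g):
    return w + 2.0 * g          # the measurable training signal

def true_value(w, g):
    return w - 0.5 * g          # the intended objective: gaming is actively harmful

# Enumerate possible splits of the budget and pick the proxy-optimal one.
splits = [(0.5 * i, EFFORT_BUDGET - 0.5 * i) for i in range(21)]
proxy_best = max(splits, key=lambda s: proxy_reward(*s))
true_best = max(splits, key=lambda s: true_value(*s))

print("proxy-optimal split (w, g):", proxy_best, "-> true value:", true_value(*proxy_best))
print("true-optimal split  (w, g):", true_best, "-> true value:", true_value(*true_best))
```

In this toy setup, the proxy-optimal allocation puts all effort into gaming the metric and yields a negative true value, while the allocation that is actually best for the intended objective would still have scored reasonably on the proxy.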

Sketching a vision of alignable instrumental reason

In the previous section, I argued that instrumental rationality is linked to behavioural dynamics that can lead to a range of risk scenarios. From here, two high-level avenues towards aligned AI present themselves. 

First, one may attempt to avoid instrumental reasoning in AI applications altogether, thereby eliminating (at least one) important driver of risk. However, I argue that this avenue appears implausible upon closer inspection. To a large extent, what makes AI dangerous is also what makes it effective in the first place. Therefore, we might not be able to get around some form of instrumental rationality in AI, insofar as these systems are meant to be effective at solving problems for us. Solutions to AI alignment must not only be technically valid; they must also be economically competitive enough to be adopted by AI developers facing commercial or military–strategic incentives. Given this tight link between instrumental rationality and the sort of effectiveness and generality we are looking to instantiate in AI, I deem it implausible to try to build AI systems that avoid means–end reasoning entirely.

This leads us to the second avenue. How can we have AI that is effective and useful, while also being reliably safe and beneficial? The idea is to find expressions of instrumental reasoning that do not act as risk drivers and, instead, are compatible with, or constitutive of, aligned AI. Given what we discussed in the first section, this approach may seem counterintuitive. However, I will argue that instrumental reasoning can be implemented in ways that lead to different behavioural dynamics than the ones discussed earlier. In other words, I suggest recasting the problem of AI alignment (at least in part) as the search for alignable expressions of instrumental rationality.

Before we can properly envision what alignable instrumental rationality may look like, we first need to analyse instrumental reason more closely. In particular, we will consider two aspects: cognitive horizons and means. The first aspect concerns restricting the domain of instrumental reason along temporal, spatial, material, or functional dimensions; the second asks how, if at all, instrumental rationality can substantially reason about means.

Thinking about modifications to instrumental rationality that lead to improvements in safety is not an entirely novel idea. One way to modify the expression of instrumental reasoning is by defining, and specifically limiting, the ‘cognitive horizon’ of the reasoner, which can be done in different ways. One approach, usually discussed under the term myopia (Hubinger, 2020), is to restrict the time horizon that the AI takes into account when evaluating its plans. Other approaches of a similar type include limiting the AI’s resource budget (e.g., Drexler (2019), chapter 8) or restricting its domain of action. For example, we may design a coffee-making AI to act only within and reason only about the kitchen, thereby preventing it from generating plans that involve other parts of the world (e.g., the global supply chain). The hope is that, by restricting the cognitive horizon of the AI, we can undercut concerning behavioural dynamics like excessive power-seeking. While each of these approaches comes with its own limitations and no fully satisfying solution to AI alignment has been proposed, they nevertheless serve as a case in point for how modifying the expression of instrumental reasoning is possible and may contribute towards AI alignment solutions.
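The following minimal sketch (my own illustration; the coffee-making plans, payoffs, and domains are hypothetical) shows how both a restricted evaluation horizon and a restricted action domain change which plan the same means–end optimiser selects:

```python
# Minimal sketch of restricting a planner's 'cognitive horizon' (hypothetical
# plans and payoffs). Plans are scored only over the first `horizon` steps,
# and any plan that acts outside the allowed domains is excluded outright.

PLANS = {
    # plan name: list of (payoff, domain) per timestep
    "brew_coffee_now":       [(1.0, "kitchen")] * 3,
    "stockpile_beans_first": [(0.0, "kitchen")] * 5 + [(2.0, "kitchen")] * 10,
    "corner_bean_market":    [(0.0, "world")] * 20 + [(5.0, "world")] * 30,
}

def plan_value(plan, horizon, allowed_domains):
    if any(domain not in allowed_domains for _, domain in PLANS[plan]):
        return float("-inf")                          # plan leaves the allowed domain
    return sum(payoff for payoff, _ in PLANS[plan][:horizon])

def best_plan(horizon, allowed_domains):
    return max(PLANS, key=lambda p: plan_value(p, horizon, allowed_domains))

print(best_plan(horizon=50, allowed_domains={"kitchen", "world"}))  # corner_bean_market
print(best_plan(horizon=50, allowed_domains={"kitchen"}))           # stockpile_beans_first
print(best_plan(horizon=3,  allowed_domains={"kitchen"}))           # brew_coffee_now
```

Unrestricted, the long-horizon, world-spanning plan dominates; restricting the domain to the kitchen and shortening the horizon progressively removes the more ambitious plans from consideration.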

A different avenue for modifying the expression of instrumental reasoning is to look more closely at the role of means in means–end reasoning. At first glance, instrumental reasoners appear to have no substantive preferences over the sorts of means they adopt, beyond wanting them to be effective at promoting the desired ends. However, there are some considerations that should make us re-examine this picture. Consider, for example, that, given the immense complexity of the consequentialist calculus combined with the bounded computational power of real agents, sufficiently ‘sophisticated’ consequentialist reasoners will sometimes fare better by choosing actions on the basis of rule- or virtue-based reasoning rather than explicit calculation of consequences. Another argument appeals to the nature of multi-agent dynamics: positive-sum dynamics emerge when agents are themselves transparent and prosocial, and oftentimes the best way to credibly appear transparent or prosocial is to be transparent and prosocial. Finally, functional decision theory suggests that, instead of asking what action should be taken in a given situation, we should ask what decision algorithm we should adopt for this and any structurally equivalent past or future decision problems (Yudkowsky and Soares, 2017). Generally speaking, all of these examples look a lot like rule- or virtue-based reasoning, rather than what we typically imagine means–end reasoning to look like. Furthermore, they refute at least a strong version of the claim that instrumental reasoners cannot substantially reason about means.

With all of these pieces in place, we can now proceed to sketch a positive proposal of aligned instrumental rationality. In doing so, I will centrally draw on work on embedded agency as discussed by Demski and Garrabrant (2019).

Embeddedness refers to the fact that an agent is itself part of the environment in which it acts. In contrast to the way agents are typically modelled (as ‘Cartesian’ agents, cleanly separated from their environment and able to reason about it completely), embedded agents are understood as living ‘inside’ of, or embedded in, their environment. Adopting the assumption of embedded agency into our thinking on instrumental rationality has the following implications. For a Cartesian agent, the optimality of a means M is determined by M’s effects on the world W. Effective means move the agent closer to its desired world state, where the desirability of a world state is determined by the agent’s ends or preferences. For an embedded agent, however, evaluating the optimality of means involves not only M’s effects on the world (as it does for the Cartesian agent), but also M’s effects on the future instantiation of the agent itself.

For example, imagine Anton meets a generous agent, Beta, who makes the following offer: Beta will give Anton a new laptop, conditional on Anton having sufficient need for it. Beta will consider Anton legitimately needy of a new laptop if his current one is at least 5 years old. However, Anton’s current laptop is 4 years old. Should Anton lie to Beta in order to receive a new laptop nevertheless? If Anton is a ‘Cartesian’ instrumental reasoner, he will approach this question by weighing up the positive and negative effects that lying would have compared to telling the truth. If Anton is an embedded instrumental reasoner, he will additionally consider how taking one action over the other will affect his future self, e.g., whether it will make him more likely to lie or be honest in the future. What is more, beyond taking a predictive stance towards future instantiations of himself, Anton can use the fact that his actions today will affect his future self as an avenue to self-modification (within some constraints). For example, Anton might actively wish he was a more honest person. Given embeddedness, acting honestly today contributes towards Anton being a more honest person in the future (i.e., it increases the likelihood that he will act honestly in the future). As such, embedded agency entails that instrumental reasoning becomes entangled with self-prediction and self-modification.
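To make the contrast explicit, here is a minimal sketch of the Anton example (my own illustration, with hypothetical numbers). The Cartesian evaluation scores an action only by its immediate effect on the world, while the embedded evaluation also prices in how the action shifts Anton’s future disposition towards honesty:

```python
# Minimal sketch of the Anton example (hypothetical numbers). A Cartesian
# evaluation scores an action only by its effect on the world; an embedded
# evaluation also scores how the action shifts the agent's future dispositions.

ACTIONS = {
    # action: (immediate world payoff, shift in future honesty disposition)
    "lie_about_laptop": (1.0, -0.2),   # gets the laptop, becomes a bit less honest
    "tell_the_truth":   (0.0, +0.1),   # no laptop, reinforces honesty
}

VALUE_OF_FUTURE_HONESTY = 10.0   # how much Anton values being honest in the future

def cartesian_value(action):
    world_payoff, _ = ACTIONS[action]
    return world_payoff

def embedded_value(action):
    world_payoff, disposition_shift = ACTIONS[action]
    # Means are evaluated both for their effects on the world and for how
    # they modify the future instantiation of the agent itself.
    return world_payoff + disposition_shift * VALUE_OF_FUTURE_HONESTY

for evaluate in (cartesian_value, embedded_value):
    print(f"{evaluate.__name__}: choose {max(ACTIONS, key=evaluate)}")
# cartesian_value: choose lie_about_laptop
# embedded_value: choose tell_the_truth
```

Once the effect on his future self is priced in, lying stops being the instrumentally optimal means, even though its first-order payoff is higher.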

Bringing things back to AI risk and alignment: insofar as instrumental reasoning in Cartesian agents tends to converge towards power-seeking behaviours, what will instrumental reasoning in embedded agents converge towards? This is, largely, an open question. However, I want to provide one last example to gesture at what the shape of an answer might look like. Let’s assume agent P’s goal is to bring about global peace. Under ‘classical’ instrumental convergence, we may worry that an ‘attractive’ strategy for an instrumental reasoner is to first seek enough power to dominate the world and then, as P’s final ruling so to speak, instantiate global peace. Were a powerful future AI system to pursue such a strategy, that would surely be problematic.

Let us now consider the same setup, but this time with P’ as an embedded instrumental reasoner. P’ would likely not adopt a strategy of ruthless power-seeking, because doing so would make future versions of itself more greedy or violent and less peace-loving. Instead, P’ may be attracted to a strategy that could be described as ‘pursuing peace peacefully’. Importantly, a world of global peace is a world in which the actors adopt only peaceful means. Taking actions today that make it more likely that future versions of P’ will act peacefully thus starts to look like a compelling, potentially even convergent, course of action.

As such, instrumental reasoning in embedded agents, via such mechanisms as self-prediction and self-modification, is able to reason substantially about its means beyond their first-order effects on the world. I believe this constitutes fruitful ground for further alignment research. Key open questions include which means–end combinations are both stable (i.e., convergent) and have desirable alignment properties (e.g., truthfulness, corrigibility).

In summary, I have argued, first, that the AI alignment problem is a big deal. The fact that alignment solutions need to be robust in the face of intelligent behaviour (e.g., adaptive planning, instrumental reasoning) sets it apart from more ‘typical’ safety problems in engineering. Second, with the help of a number of examples and conceptual arguments centred mainly around the notion of embedded agency, I contended that it is possible to find expressions of instrumental reasoning that avoid the failure modes discussed earlier. As such, while I cannot conclude whether AI alignment is ultimately solvable, or will de facto be solved in time, I wish to express reason for hope and a concrete recommendation for further investigation.


Resources

  • Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

  • Brundage, M., Avin, S., Clark, J., Toner, H., Eckersley, P., Garfinkel, B., ... & Amodei, D. (2018). The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. arXiv preprint arXiv:1802.07228.

  • Bostrom, N. (2012). The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. Minds and Machines, 22(2), 71-85.

  • Carlsmith, J. (2022). Is Power-Seeking AI an Existential Risk? arXiv preprint arXiv:2206.13353.

  • Christiano, P. (2019). Current work in AI alignment. [talk audio recording]. https://youtu.be/-vsYtevJ2bc 

  • Chollet, F. (2019). On the measure of intelligence. arXiv preprint arXiv:1911.01547.

  • Demski, A., & Garrabrant, S. (2019). Embedded agency. arXiv preprint arXiv:1902.09469.

  • Di Langosco, L. L., Koch, J., Sharkey, L. D., Pfau, J., & Krueger, D. (2022). Goal misgeneralisation in deep reinforcement learning. In International Conference on Machine Learning (pp. 12004-12019). PMLR.

  • Drexler, K. E. (2019). Reframing superintelligence. The Future of Humanity Institute, The University of Oxford, Oxford, UK.

  • Evans, O., Cotton-Barratt, O., Finnveden, L., Bales, A., Balwit, A., Wills, P., ... & Saunders, W. (2021). Truthful AI: Developing and governing AI that does not lie. arXiv preprint arXiv:2110.06674.

  • Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and machines. 30(3), 411-437.

  • Hadfield-Menell, D., Russell, S. J., Abbeel, P., & Dragan, A. (2016). Cooperative inverse reinforcement learning. Advances in Neural Information Processing Systems, 29, 3909–3917.

  • Hadfield-Menell, D., Dragan, A., Abbeel, P., & Russell, S. (2017). The off-switch game. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence.

  • Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., and Garrabrant, S. (2019). Risks from learned optimisation in advanced machine learning systems. arXiv preprint arXiv:1906.01820.

  • Hubinger, E. (2020). An overview of 11 proposals for building safe advanced AI. arXiv preprint arXiv:2012.07532.

  • Krakovna, V., Uesato, J., Mikulik, V., Rahtz, M., Everitt, T., Kumar, R., Kenton, Z., Leike, J., and Legg, S. (2020). Specification gaming: the flip side of AI ingenuity. DeepMind Blog.

  • Krakovna, V., Orseau, L., Ngo, R., Martic, M., & Legg, S. (2020). Avoiding side effects by considering future tasks. Advances in Neural Information Processing Systems, 33, 19064-19074.

  • Manheim, D., & Garrabrant, S. (2018). Categorizing variants of Goodhart's Law. arXiv preprint arXiv:1803.04585.

  • Omohundro, S. M. (2008). The basic AI drives. In AGI (Vol. 171, pp. 483-492).

  • Orseau, L., & Armstrong, M. S. (2016). Safely interruptible agents.

  • Prunkl, C., & Whittlestone, J. (2020). Beyond near-and long-term: Towards a clearer account of research priorities in AI ethics and society. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (pp. 138-143).

  • Shah, R., Varma, V., Kumar, R., Phuong, M., Krakovna, V., Uesato, J., & Kenton, Z. (2022). Goal Misgeneralisation: Why Correct Specifications Aren't Enough For Correct Goals. arXiv preprint arXiv:2210.01790.

  • Soares, N., Fallenstein, B., Armstrong, S., & Yudkowsky, E. (2015). Corrigibility. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.

  • Wedgwood, R. (2011). Instrumental rationality. Oxford studies in metaethics, 6, 280-309.