Towards a Philosophy of Science of the Artificial

One common way to start an essay on the philosophy of science is to ask: “Should we be scientific realists?” While this isn’t precisely the question I’m interested in here, it is the entry point of this essay. So bear with me.

Scientific realism, in short, is the view that scientific theories are (approximately) true. Different philosophers have proposed different interpretations of “approximately true”, e.g., as meaning that scientific theories “aim to give us literally true stories about what the world is like” (Van Fraassen, 1980, p. 9), or that “what makes them true or false is something external—that is to say, it is not (in general) our sense data, actual or potential, or the structure of our minds, or our language, etc.” (Putnam, 1975, p. 69f), or that their terms refer (e.g. Boyd, 1983), or that they correspond to reality (e.g. Fine, 1986). 

Much has been written about whether or not we should be scientific realists. A lot of this discussion has focused on the history of science as a source of empirical support for or against the conjecture of scientific realism. For example, one of the most commonly raised arguments in support of scientific realism is the so-called no-miracles argument (Boyd 1989): what can better explain the striking success of the scientific enterprise than that scientific theories are (approximately) true—that they are “latching on” to reality in some way or form? Conversely, an influential argument against scientific realism is the argument from pessimistic meta-induction (e.g. Laudan 1981), which suggests that, given that most past scientific theories have turned out to be false, we should expect our current theories to meet the same fate of being proven false (as opposed to approximately true). 

In this essay, I consider a different angle on the discussion. Instead of discussing whether scientific realism can adequately explain what we empirically understand about the history of science, I ask whether scientific realism provides us with a satisfying account of the nature and functioning of the future-oriented scientific enterprise—or, what Herbert Simon (1996) called the sciences of the artificial. What I mean by this, in short, is that an epistemology of science needs to be able to account for the fact that the scientist's theorising affects what will come into existence, as well as for how this happens. 

I proceed as follows. In part 1, I explain what I mean by the sciences of the artificial, and motivate the premise of this essay—namely that we should aspire for our epistemology of science to provide an adequate account not only of the inquiry into the “natural”, but also into the nature and coming-into-being of the “artificial”. In part 2, I analyse whether or not scientific realism provides such an account. I conclude that its application to the sciences of the artificial exposes scientific realism as insufficiently expressive, even if not outright false. In part 3, I briefly sketch how an alternative to scientific realism—pragmatism—might present a more satisfactory account. I conclude with a summary of the arguments raised and the key takeaways.

Part 1: The need for a philosophy of science of the artificial

The Sciences of the Artificial—a term coined by Herbert Simon in the eponymous book—refer to domains of scientific enterprise that deal not only with what is, but also with what might be. Examples include the many domains of engineering, medicine, and architecture, but also fields like psychology, economics, and administration. What characterises all of these domains is their descriptive-normative dual nature: the study of what is is both informed and mediated by some normative ideal(s). Medicine seeks to understand the functioning of the body in order to bring about, and relative to the ideal of, healthy functioning; civil engineering studies materials and applied mechanics in order to build functional and safe infrastructure. In either case, the central subject of study is not a matter of what is—it does not exist (yet); rather, it is the goal of the study to bring it into existence. In Simon’s words, the sciences of the artificial are “concerned not with the necessary but with the contingent[—]not with how things are but with how they might be[—]in short, with design” (p. xii).

Some might doubt that a veritable science of the artificial exists. After all, is science not quintessentially concerned with understanding what is—the laws and regularities of the natural world? However, Simon provides what I think is a convincing case that not only is there a valid and coherent notion of a “science of the artificial”, but also that one of its most interesting dimensions is precisely its epistemology. In the preface to the second edition of the book, he writes: “The contingency of artificial phenomena has always created doubts as to whether they fall properly within the compass of science. Sometimes these doubts refer to the goal-directed character of artificial systems and the consequent difficulty of disentangling prescription from description. This seems to me not to be the real difficulty. The genuine problem is to show how empirical propositions can be made at all about systems that, given different circumstances, might be quite other than they are.” (p. xi).

As such, one of the things we want from our epistemology of science—a theory of the nature of scientific knowledge and the functioning of scientific inquiry—is an adequate treatment of the sciences of the artificial. It ought, for example, to allow us to think not only about what is true but also about what might be true, and about how our own theorising affects what comes into being (i.e., what comes to be true). By means of metaphor: insofar as the natural sciences are typically concerned with the making of “maps”, the sciences of the artificial are interested in what goes into the making of “blueprints”. In the philosophy of science, we then get to ask: What is the relationship between maps and blueprints? On the one hand, the quality of our maps (i.e., our scientific understanding) shapes which blueprints we are able to draw (i.e., what things we are able to build). At the same time, our blueprints also end up affecting our maps. As Simon puts it: “The world we live in today is much more a man-made, or artificial, world than it is a natural world. Almost every element in our environment shows evidence of human artifice.” (p. 2).

One important aspect of the domain of the artificial is that there is usually more than one low-level implementation (and respective theoretical-technological paradigm) through which a desired function can be achieved. For example, we have found several ways to travel long distances (e.g. by bicycle, by car, by train, by plane). What is more, there exist several different types of trains (e.g. steam-powered, diesel-powered, or electric trains; high-speed trains using specialised rolling stock to reduce friction, or so-called maglev trains which use magnetic levitation). Most of the time, we need not concern ourselves with the low-level implementation because of their functional equivalence. Precisely because they are designed artefacts, we can expect them to exhibit largely similar high-level functional properties. If I want to travel from London to Paris, I generally don't have much reason to care what specific type of train I end up finding myself in. 

However, differences in their respective low-level implementation can start to matter under the ‘right’ circumstances, i.e., given relevant variation in external environments. Simon provides us with useful language to talk about this. He writes: “An artifact can be thought of as a meeting point[—]an ‘interface’ in today's terms[—]between an ‘inner’ environment, [i.e.,] the substance and organization of the artifact itself, and an ‘outer’ environment, [i.e.,] the surroundings in which it operates. If the inner environment is appropriate to the outer environment, or vice versa, the artifact will serve its intended purpose.” (p. 6). As such, the (proper) functioning of the artefact is cast in terms of the relationship between the inner environment (i.e., the artefact’s structure or character, its implementation details) and the outer environment (i.e., the conditions in which the artefact operates). He further provides a simple example to clarify the point: “A bridge, under its usual conditions of service, behaves simply as a relatively smooth level surface on which vehicles can move. Only when it has been overloaded do we learn the physical properties of the materials from which it is built.” (p. 13). 

The point is that two artefacts that have been designed with the same purpose in mind, and which in a wide range of environments behave equivalently, will start to show different behaviours if we enter environments to which their design hasn’t been fully adapted. Now their low-level implementation (or ‘inner environments’, in Simon’s terminology) starts to matter. 

To further illustrate the relevance of this point to the current discussion, let us consider the field of artificial intelligence (AI)—surely a prime example of a science of the artificial. The aspiration of the field of AI is to find ways to instantiate advanced intelligent behaviour in artificial substrates. As such, it can be understood in terms of its dual nature: it aims both to (descriptively) understand what intelligent behaviour is and how it functions, and to (normatively) implement it. The dominant technological paradigm for building artificial general intelligence (AGI) at the moment is machine learning. However, nothing in principle precludes that other paradigms could (or will) be used for implementing AGI (e.g., some variation on symbolic AI, some cybernetic paradigm, soft robotics, or any number of paradigms that haven’t been discovered yet). 

Furthermore, different implementation paradigms for AGI imply different safety- or governance-relevant properties. Imagine, for example, an AGI built in the form of a “singleton” (i.e. a single, monolithic system) compared to one built as a multi-agent, distributed system. A singleton AGI seems more likely to lack interpretability (i.e. to behave in ways and for reasons that, by default, remain largely obscure to humans), while a distributed system might be more likely to fall prey to game-theoretic pitfalls such as collusion (e.g. Christiano, 2015). It is not at this point properly understood what the implications of the different paradigms are, but the specifics need not matter for the argument I am trying to make here. The point is that, if the goal is to make sure that future AI systems will be safe and used to the benefit of humanity, it may matter a great deal which of these paradigms is adopted, and we need to understand what different paradigms imply for considerations of safety and governability. 

As such, the problem of paradigm choice—choices over which implementation roadmap to adopt and which theories to use to inform said roadmap—comes into focus. As philosophers of science, we must ask: What determines paradigm choice? As well as: How, if at all, can a scientist or scientific community navigate questions of paradigm choice “from within” the history of science?

This is where our discussion of the appropriate epistemology for the sciences of the artificial properly begins. Next, let us evaluate whether we can find satisfying answers in scientific realism.  

Part 2: Scientific realism of the artificial

Faced with the question of paradigm choice, one answer that a scientific realist might give is that what determines the right paradigm choice comes down entirely to how the world is. In other words, what the AI researcher does when trying to figure out how to build AGI is equivalent to uncovering the truth about what AGI, fundamentally, is. We can, of course, at a given point in time be uncertain about the ‘true nature’ of AGI, and thus be exploring different paradigms; but eventually, we will discover which of those paradigms turns out to be the correct one. In other words, the notion of paradigm choice is replaced with the notion of paradigm change. In essence, the aspiration of building AGI is rendered equivalent to the question of what AGI, fundamentally, is. 

As I will argue in what follows, I consider this answer unsatisfying in that it denies the very premise of the sciences of the artificial discussed in the previous section. Consider the following arguments. 

First, the answer by the scientific realist seems to be fundamentally confused about the type-signature of the concept of “AGI”. AGI, in the sense I’ve proposed here, is best understood as a functional description—a design requirement or aspiration. As discussed earlier, it is entirely plausible that there exist several micro-level implementations which are functionally equivalent in that they exhibit generally intelligent behaviour. As such, by treating “the aspiration of building AGI [as equivalent] to the question of what AGI is”, the scientific realist has already implicitly moved away from—and thus failed to properly engage with—the premise of the question. 

Second, note that there are different vantage points from where we could be asking the question. We could take the vantage point of a “forecaster” and ask what artefacts we should expect to exist 100 years from now. Or, we could take the vantage point of a “designer” and ask which artefacts we want to create (or ought to create, given some set of moral, political, aesthetic, or other commitments). While it naturally assumes the vantage point of the forecaster, scientific realism appears inadequate for taking seriously the vantage point of the designer. 

Third, consider the assumption that what will come into existence is a matter of fact. While plausible-sounding at first, this claim reveals problems upon further inspection. To show how this is the case, let us consider the following tripartite characterisation proposed by Eric Drexler (2018). We want to distinguish between three notions of “possible”, namely: (physically) realistic, (techno-economically) plausible, and (socio-politically) credible. This is to say, beyond such facts as the fundamental laws of physics (i.e., the physically realistic), there are other factors—less totalising and yet endowed with some degree of causal force—which shape what comes to be in the future (e.g., economic and sociological pressures). 

Importantly, the physically realistic does not on its own determine what sorts of artefacts come into existence. For example, paradigm A (by mere chance, or for reasons of historical contingency) may receive differential economic investment compared to paradigm B, resulting in its faster maturation; or, conversely, it might get restrained or banned through political means, resulting in it being blocked, and perhaps eventually forgotten. Examples of political decisions (e.g. regulation, subsidies, taxation, etc.) affecting technological trajectories abound. To name just one, consider how the ban on human cloning has, in fact, halted human cloning activity, as well as any innovation aimed at making human cloning ‘better’ (cheaper, more convenient, etc.) in some way.

The scientific realist might react to this by arguing that, while the physically realistic is not the only factor that determines what sorts of artefacts come into existence, there is still a matter of fact to the nature and force of the economic, political and social factors affecting technological trajectories, all of which could, at least in principle, be understood scientifically. While I am happy to grant this view, I argue that the problem lies elsewhere. As we have seen, the interactions between the physically realistic, the techno-economically plausible, and the socio-politically credible are highly complex and, importantly, self-referential. It is exactly this self-referentiality that makes this a case of paradigm choice, rather than paradigm change, when viewed from the position of the scientist. In other words, an adequate answer to the problem of paradigm choice must necessarily consider a “view from inside of the history of science”, as opposed to a “view from nowhere”. After all, the paradigm is being chosen by the scientific community (and the researchers making up that community), and they are making said choice from their own situated perspective.

In summary, it is less that the answers provided by the scientific realist are outright wrong. Rather, the view provided by scientific realism is not expressive enough to deal with the realities of the sciences of the artificial. It cannot usefully guide the scientific enterprise when it comes to the considerations brought to light by the sciences of the artificial. Philosophy of science needs to do better if it wants to avoid confirming the accusation raised by Richard Feynman—that philosophers of science are to scientists what ornithologists are to birds; namely, irrelevant. 

Next, we will consider whether a different epistemological framework, one that holds onto as much realism as possible, appears more adequate for the needs of the sciences of the artificial. 

Part 3: A pragmatic account of the artificial

So far, we have introduced the notion of the science of the artificial, discussed what it demands from the philosophy of science, and observed how scientific realism fails to appropriately respond to those demands. The question is then: Can we do better? 

An alternative account to scientific realism—and the one we will consider in this last section—is pragmatic realism, chiefly originating from the American pragmatists William James, Charles Sanders Peirce, and John Dewey. For the present discussion, I will largely draw on contemporary work trying to revive a pragmatic philosophy of science that is truly able to guide and support scientific inquiry, such as that by Roberto Torretti, Hasok Chang, and Rein Vihalemm. 

Such a pragmatist philosophy of science emphasises scientific research as a practical activity, and the role of an epistemology of science as helping to successfully conduct this activity. While sharing with the scientific realist a commitment to an external reality, pragmatism suggests that our ways of getting to know the world are necessarily mediated by the ways knowledge is created and used, i.e., by our epistemic aims and means of “perception”—both the mind and scientific tools, as well as our scientific paradigms.

Note that pragmatism, as I have presented it here, does at no point do away with the notion of an external reality. As Giere (2006) clarifies, not all types of realism must subscribe to a “full-blown objective realism” (or what Putnam called “metaphysical realism”)—roughly speaking, the view that “[t]here is exactly one true and complete description of ‘the way the world is.’” (Putnam, 1981, p. 49). As such, pragmatic realism, while rejecting objective or metaphysical realism, remains squarely committed to realism, and understands scientific inquiry as an activity directed at better understanding reality (Chang, 2022, p. 5; p. 208).  

Let us now consider whether pragmatism is better able to deal with the epistemological demands of the sciences of the artificial than scientific realism. Rather than providing a full-fledged account of how the sciences of the artificial can be theorised within the framework of pragmatic realism, what I set out to do here is more humble in its ambition. Namely, I aim to support my claim that scientific realism is insufficiently expressive as an epistemology of the sciences of the artificial by showing that there exist alternative frameworks—in this case pragmatic realism—that do not face the same limitations. In other words, I aim to show that, indeed, we can do better. 

First, as we have seen, scientific realism fails to adopt the “viewpoint of the scientist”. As a result, it collapses the question of paradigm choice into a question of paradigm change. This makes scientific realism incapable of addressing the (very real) challenge faced by the scientist; after all, as I have argued, different paradigms might come with different properties we care about (such as when they concern questions of safety or governance). In contrast to scientific realism, pragmatism explicitly rejects the idea that scientific inquiry can ever adopt a “view from nowhere” (or a “God’s eye view”, as Putnam (1981, p. 49) puts it). Chang (2019) emphasises the “humanistic impulse” in pragmatism: “Humanism in relation to science is a commitment to understand and promote science as something that human agents do, not as a body of knowledge that comes from accessing information about nature that exists completely apart from ourselves and our investigations.” (p. 10). This aligns well with the need of the sciences of the artificial to be able to reason from the point of view of the scientist.

Second, pragmatism, in virtue of focusing on the means through which scientific knowledge is created, recognises the historicity of scientific activity (see also, e.g., Vihalemm, 2012, p. 3; Chang, 2019). This allows pragmatic realism to reflect the historicity that is also present in the sciences of the artificial. Recall that, as we discussed earlier, one central epistemological question of the sciences of the artificial concerns how our theorising affects what comes into existence. Our prior beliefs, scientific frameworks and tools affect, by means of ‘differential investment’ in designing artefacts under a given paradigm, what sort of reality comes to be. Moreover, the nature of technological progress itself affects what we become able to understand, discover and build in the future. Pragmatism suggests that, rather than there being a predetermined answer as to which will be the most successful paradigm, the scientist must understand their own scientific activity as part of an iterative and path-dependent epistemic process.

Lastly, consider how the sciences of the artificial entail a ‘strange inversion’ of ‘functional’ and ‘mechanistic’ explanations. In the domain of the natural, the ‘function’ of a system is understood as a viable post-hoc description of the system, resulting from its continuous adaptation to the environment by external pressures. In contrast, in design, the ‘function’ of an artefact becomes that which is antecedent, while the internal environment of the artefact (its low-level implementation) becomes post-hoc. It appears difficult, through the eyes of a scientific realist, to fully accept this inversion. At the same time, accepting it appears to be useful, if not required, in order to theorise the epistemology of the sciences of the artificial on its own terms. Pragmatic realism, on the other hand, does not face the same trouble. To exemplify this, let us take Chang’s notion of operational coherence, a deeply pragmatist notion of the yardstick of scientific inquiry, which he describes as “a harmonious fitting-together of actions that is conducive to a successful achievement of one’s aims” (Chang, 2019, p. 14). As such, insofar as we are able to argue that a given practice in the sciences of the artificial possesses such operational coherence, it is compatible with pragmatic realism. What I have tried to show hereby is that the sciences of the artificial, including the ‘strange inversion’ of the role of ‘functions’ which they entail, are fully theorisable inside the framework of pragmatic realism. As such, unlike scientific realism, the latter does not fail to engage with the sciences of the artificial on their own terms.

To summarise this section, I have argued, by means of three examples, that pragmatic realism is a promising candidate for a philosophy of science within which it is possible to theorise the sciences of the artificial. In this, pragmatic realism differs from scientific realism. In particular, I have invoked the fact that the sciences of the artificial require us to take the “point of view of the scientist”, to acknowledge the iterative, path-dependent and self-referential nature of scientific inquiry (i.e., its historicity), and, finally, to accept the central role of ‘function’ in understanding designed artefacts. 

Conclusion

In section 1, I have laid out the case for why we need a philosophy of science that can encompass questions arising from the sciences of the artificial. One central such question is the problem of paradigm choice, which requires the scientific practitioner to understand the ways in which their own theorising affects what will come into existence.

In section 2, I have considered whether scientific realism provides a sufficient account, and concluded that it doesn’t. I have listed three examples of ways in which scientific realism seems to be insufficiently expressive as an epistemology of the sciences of the artificial. Finally, in section 3, I explored whether we can do better, and have provided three examples of epistemic puzzles, arising from the sciences of the artificial, that pragmatic realism, in contrast with scientific realism, is able to account for. 

While scientific realism seems attractive on the basis of its ability to explain the success of the natural sciences, it does not, in fact, present a good explanation of the success of the sciences of the artificial. How, before things like planes, computers, or democratic institutions existed, could we have learnt to build them if all that was involved in the scientific enterprise was uncovering that which (already) is? As such, I claim that the sciences of the artificial provide an important reason why we should not be satisfied with the epistemological framework provided by scientific realism when it comes to understanding and—importantly—guiding scientific inquiry. 


References

Boyd, R. N. (1983). On the current status of the issue of scientific realism. Methodology, Epistemology, and Philosophy of Science: Essays in Honour of Wolfgang Stegmüller on the Occasion of His 60th Birthday, June 3rd, 1983, 45-90.

Chang, H. (2019). Pragmatism, perspectivism, and the historicity of science. In Understanding perspectivism. pp. 10-27. Routledge.

Chang, H. (2022). Realism for Realistic People. Cambridge University Press.

Christiano, P. (2015). On heterogeneous objectives. AI Alignment (medium.com). Retrieved from https://ai-alignment.com/on-heterogeneous-objectives-b38d0e003399.

Drexler, E. (2018). Paretotopian goal alignment, Talk at EA Global: London 2018. 

Drexler, E. (2019). Reframing Superintelligence. Future of Humanity Institute.

Fine, A. (1986). Unnatural Attitudes: Realist and Antirealist Attachments to Science. Mind, 95(378): 149–177. 

Fu, W., Qian Q. (2023). Artificial Intelligence and Dual Contract. arXiv preprint arXiv:2303.12350.

Laudan, L. (1981). A confutation of convergent realism. Philosophy of science. 48(1), 19-49.

Normile, D. (2018). CRISPR bombshell: Chinese researcher claims to have created gene-edited twins. Science. doi: 10.1126/science.aaw1839.

Putnam, H. (1975). Mathematics, Matter and Method. Cambridge: Cambridge University Press.

Putnam, H. (1981). Reason, Truth and History. Cambridge: Cambridge University Press.

Simon, H. (1996). The Sciences of the Artificial (3rd ed.). The MIT Press.

Soares, N., Fallenstein, B. (2017). Agent foundations for aligning machine intelligence with human interests: a technical research agenda. The technological singularity: Managing the journey, 103-125.

Torretti, R. (2000). ‘Scientific Realism’ and Scientific Practice. In Evandro Agazzi and Massimo Pauri (eds.), The Reality of the Unobservable. Dordrecht: Kluwer.

Van Fraassen, B. (1980). The scientific image. Oxford University Press.

Vihalemm, R. (2012). Practical Realism: Against Standard Scientific Realism and Anti-realism. Studia Philosophica Estonica. 5/2: 7–22.

Why powerful instrumental reasoning is by default misaligned, and what we can do about it

Meta: There are some ways I intend to update and expand this line of thinking. Due to uncertainty about when I will be able to do so, I decided to go ahead with posting this version of the essay already.

The AI alignment problem poses an unusual, and maybe unusually difficult, engineering challenge. Instrumental reasoning, while plausibly a core feature of effective and general intelligent behaviour, is tightly linked to the emergence of behavioural dynamics that lead to potentially existential risks. In this essay, I will argue that one promising line of thinking towards solving AI alignment seeks to find expressions of instrumental reasoning that avoid such risk scenarios and are compatible with, or constitutive of, aligned AI. After taking a closer look at the nature of instrumental rationality, I will formulate a vision for what such an aligned expression of instrumental reasoning may look like.

I will proceed as follows. In the first section, I introduce the problem of AI alignment and argue that instrumental reasoning plays a central role in many AI risk scenarios. I will do so by describing a number of failure scenarios, grouped into failures related to instrumental convergence and failures related to Goodhart’s law, and illustrating the role of instrumental reasoning in each of these cases. In the second section, I will argue that we need to envision what alignable expressions of instrumental rationality may look like. To do so, I will first analyse the nature of instrumental rationality, thereby identifying two design features—cognitive horizons and reasoning about means—that allow us to mediate which expressions of instrumental reasoning will be adopted. Building on this analysis, I will sketch a positive proposal for alignable instrumental reasoning, centrally building on the concept of embedded agency. 

AI risk and instrumental rationality; or: why AI alignment is hard

Recent years have seen rapid progress in the design and training of ever more impressive-looking applications of AI. In their most general form, state-of-the-art AI applications can be understood as specific instantiations of a class of systems characterised by having goals which they pursue by making, evaluating, and implementing plans. Their goal-directedness can be an explicit feature of their design or an emergent feature of training. Putting things this way naturally brings up the question of what goals we want these systems to pursue. Gabriel (2020) refers to this as the “normative” question of AI alignment. However, at this stage of intellectual progress in AI, we don’t know how to give AI systems any goal in such a way that the AI systems end up (i) pursuing the goal we intended to give them and (ii) continuing to pursue the intended goal reliably across, e.g., distributional shifts. We can refer to these two technical aspects of the problem as the problem of goal specification (Krakovna, 2020; Hadfield-Menell, 2016) and the problem of goal (mis)generalisation (Shah et al., 2022; Di Langosco et al., 2022; Hubinger et al., 2019), respectively. 
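To make the distinction concrete, here is a minimal toy sketch (the state variables and layout are entirely hypothetical, loosely inspired by the coin-collection examples discussed in the goal misgeneralisation literature): the reward we specify and the goal we intend coincide on the training distribution, but a policy that has latched onto the proxy comes apart from the intended goal once the distribution shifts.

```python
# Toy sketch of goal specification vs. goal misgeneralisation (hypothetical
# state variables): during training, the coin always sits on the exit, so a
# policy that learned "go to the coin" looks perfectly aligned -- until the
# coin is moved.

def intended_goal(state):
    # What we actually want: the agent reaches the exit.
    return state["agent_pos"] == state["exit_pos"]

def specified_reward(state):
    # What we wrote down: reward for reaching the coin.
    return 1.0 if state["agent_pos"] == state["coin_pos"] else 0.0

def learned_policy(state):
    # The behaviour the training signal reinforced: head for the coin.
    return state["coin_pos"]

train_state = {"exit_pos": (3, 3), "coin_pos": (3, 3), "agent_pos": None}
test_state = {"exit_pos": (3, 3), "coin_pos": (0, 0), "agent_pos": None}  # distribution shift

for name, state in [("train", train_state), ("test", test_state)]:
    state["agent_pos"] = learned_policy(state)
    print(name, "| reward:", specified_reward(state),
          "| intended goal achieved:", intended_goal(state))
# train -> reward 1.0, goal achieved; test -> reward 1.0, goal not achieved
```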

At first sight, one might think that finding solutions to these problems is ‘just another engineering problem’: given human ingenuity, it should be only a matter of time before satisfactory solutions are presented. However, there are reasons to believe the problem is harder than it may initially appear. In particular, I will argue that much of what makes solving alignment hard stems from behavioural dynamics that emerge from powerful instrumental reasoning. 

By instrumental rationality, I refer to means–end reasoning which selects those means that are optimal in view of the end being pursued. It is itself a core feature of general intelligence as we understand it. ‘Intelligence’ is a broad and elusive term, and many definitions have been proposed (Chollet, 2019). In the context of this essay, we will take ‘intelligence’ to refer to a (more or less) general problem-solving capacity, where its generality is a function of how well it travels (i.e., generalises) across different problems or environments. As such, an AI is intelligent in virtue of searching over a space of action–outcome pairs (or behavioural policies) and selecting whichever actions appear optimal in view of the goal it is pursuing. 
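As a minimal sketch of this picture of means–end reasoning (function and variable names are my own, purely illustrative), an instrumental reasoner can be thought of as a search over candidate actions, scored by how well their predicted outcomes serve the pursued end:

```python
# Minimal sketch of means-end selection (illustrative names only): search
# over candidate actions and pick whichever best promotes the pursued end.

def choose_action(actions, predict_outcome, utility):
    # `predict_outcome` maps an action to its predicted outcome;
    # `utility` scores outcomes relative to the end being pursued.
    return max(actions, key=lambda action: utility(predict_outcome(action)))

# Example: an agent whose end is simply to maximise money in hand.
actions = ["work", "rest", "invest"]
predict_outcome = {"work": 100, "rest": 0, "invest": 150}.get
utility = lambda money: money

print(choose_action(actions, predict_outcome, utility))  # -> "invest"
```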

The intelligence in AI—i.e., its abilities for learning, adaptive planning, means–end reasoning, etc.—renders safety concerns in AI importantly disanalogous to safety concerns in more ‘classical’ domains of engineering such as aerospace or civil engineering. (We will substantiate this claim in the subsequent discussion.) Furthermore, it implies that we can reasonably expect the challenges of aligning AI to increase as those systems become more capable. That said, observing the difficulty of solving AI alignment is interesting not only because it can motivate more work on the problem; asking why AI alignment is hard may itself be a productive exercise towards progress. It can shine some light on the ‘shape’ of the problem, identify which obstacles need to be overcome, clarify necessary and/or sufficient features of a solution, and illuminate potential avenues for progress. 

Both the philosophical and the technical literature have discussed several distinct ways in which such means–end optimisation can lead to unintended, undesirable, and potentially even catastrophic outcomes. In what follows, we will discuss two main classes of such failure modes, instrumental convergence and Goodhart’s law, and clarify the role of instrumental reasoning in their emergence. 

Instrumental convergence describes a phenomenon where rational agents pursue instrumental goals, even if those are distinct from their terminal goals (Omohundro, 2008; Bostrom, 2012). Instrumental goals are things that appear useful towards achieving a wide range of other objectives; common examples include the accumulation of resources or power, as well as self-preservation. The more resources or power are available to an agent, the better positioned it will be to pursue its terminal goals. Importantly, this also applies if the agent in question is highly uncertain about what the future will look like. Similarly, an agent cannot hope to achieve its objectives unless it is around to pursue them. This gives rise to the instrumental drive of self-preservation. Due to their near-universal instrumental usefulness, instrumental goals act as a sort of strategic attractor towards which the behaviour of intelligent agents will tend to converge. 

How can this lead to risks? Carlsmith (2022) discusses at length whether power-seeking AI may pose an existential risk. He argues that misaligned agents “would plausibly have instrumental incentives to seek power over humans”, which could lead to “the full disempowerment of humanity”, which in turn would “constitute an existential catastrophe”. Similarly, the instrumental drive of self-preservation gives rise to what in the literature is discussed under the term of safe interruptibility (e.g., Orseau and Armstrong, 2016; Hadfield-Menell et al., 2017): AI systems may resist attempts to be interrupted or modified by humans, as this would jeopardise the successful pursuit of the AI’s goals.

The second class of failure modes can be summarised under the banner of Goodhart’s law. Goodhart’s law describes how, when metrics are used to improve the performance of a system, and when sufficient optimisation pressure is applied, optimising for these metrics ends up undermining progress towards the actual goal. In other words, as discussed by Manheim and Garrabrant (2018), the “Goodhart effect [occurs] when optimisation causes a collapse of the statistical relationship between a goal which the optimiser intends and the proxy used for that goal.” This problem is well known far beyond the field of AI and arises in domains such as management or policy-making, where proxies are commonly used to gauge progress towards a different, and often more complex, terminal goal. 

How does this manifest in the case of AI? It is relevant to this discussion to understand that, in modern-day AI training, the objective is not directly specified. Instead, what is specified is a reward (or loss) function, based on which the AI learns behavioural policies that are effective at achieving the goal (in the training distribution). Goodhart effects can manifest in AI applications in different forms. For example, the phenomenon of specification gaming occurs when AI systems come up with creative ways to meet the literal specification of the goal without bringing about the intended outcomes (e.g., Krakovna et al., 2020a). The optimisation of behaviour via the reward signal can be understood as exerting pressure on the agents to come up with new ways to achieve the reward effectively. However, achieving the reward may not reliably equate to achieving the intended objective. 
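The following toy simulation (all quantities invented for illustration) gestures at the Goodhart effect described above: a proxy score correlates with a candidate's true value plus a rare exploitable component, and as optimisation pressure (the number of candidates searched) grows, the best-scoring candidate increasingly wins by exploiting the proxy rather than by being genuinely good.

```python
import random

# Toy Goodhart simulation (all quantities invented): each candidate has a
# true value and a proxy score equal to the true value plus a rare
# exploitable bonus. Weak selection tracks true value; strong selection on
# the proxy ends up rewarding the exploit instead.

random.seed(0)
candidates = [{"true": random.gauss(0, 1), "exploit": random.gauss(0, 1)}
              for _ in range(100_000)]

def proxy(c):
    # The proxy mostly tracks true value, plus a bonus in a rare tail.
    return c["true"] + 3 * max(0.0, c["exploit"] - 2.5)

for pressure in (10, 100_000):  # how many candidates the optimiser searches over
    best = max(candidates[:pressure], key=proxy)
    print(f"pressure={pressure:>6}: proxy={proxy(best):.2f}, true={best['true']:.2f}")
# Under weak optimisation the proxy winner typically also has high true
# value; under strong optimisation the winner is one that games the proxy,
# and its true value tends back towards the mean.
```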

Sketching a vision of alignable instrumental reason

In the previous section, I argued that instrumental rationality is linked to behavioural dynamics that can lead to a range of risk scenarios. From here, two high-level avenues towards aligned AI present themselves. 

First, one may attempt to avoid instrumental reasoning in AI applications altogether, thereby eliminating (at least one) important driver of risk. However, I argue that this avenue appears implausible upon closer inspection. At least to a large extent, what makes AI dangerous is also what makes it effective in the first place. Therefore, we might not be able to get around some form of instrumental rationality in AI, insofar as these systems are meant to be effective at solving problems for us. Solutions to AI alignment need not only be technically valid; they also need to be economically competitive enough such that they will be adopted by AI developers in view of commercial or military–strategic incentives. Given this tight link between instrumental rationality and the sort of effectiveness and generality we are looking to instantiate in AI, I deem it implausible to try to build AI systems that avoid means–end reasoning entirely.  

This leads us to the second avenue. How can we have AI that will be effective and useful, while also being reliably safe and beneficial? The idea is to find expressions of instrumental reasoning that do not act as risk drivers and, instead, are compatible with, or constitutive of, aligned AI. Given what we discussed in the first section, this approach may seem counterintuitive. However, I will argue that instrumental reasoning can be implemented in ways that lead to different behavioural dynamics than the ones we discussed earlier. In other words, I suggest recasting the problem of AI alignment (at least in part) as the search for alignable expressions of instrumental rationality. 

Before we can properly envision what alignable instrumental rationality may look like, we first need to analyse instrumental reasoning more closely. In particular, we will consider two aspects: cognitive horizons and means. The first aspect is about restricting the domain of instrumental reasoning along temporal, spatial, material, or functional dimensions; the second concerns how, if at all, instrumental reasoners can reason substantively about means. 

Thinking about modifications to instrumental rationality that lead to improvements in safety is not an entirely novel idea. One way to modify the expression of instrumental reasoning is by defining, and specifically limiting, the ‘cognitive horizon’ of the reasoner, which can be done in different ways. One approach, usually discussed under the term myopia (Hubinger, 2020), is to restrict the time horizon that the AI takes into account when evaluating its plans. Other approaches of a similar type include limiting the AI’s resource budget (e.g., Drexler, 2019, ch. 8) or restricting its domain of action. For example, we may design a coffee-making AI to act only within and reason only about the kitchen, thereby preventing it from generating plans that involve other parts of the world (e.g., the global supply chain). The hope is that, by restricting the cognitive horizon of an AI, we can undercut concerning behavioural dynamics like excessive power-seeking. While each of these approaches comes with its own limitations, and no fully satisfying solution to AI alignment has been proposed, they nevertheless serve as a case in point for how modifying the expression of instrumental reasoning is possible and may contribute towards AI alignment solutions. 
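As a rough illustration of the myopia idea (the plans and payoffs below are made up), limiting the horizon over which a planner counts rewards can remove the incentive for a costly resource-accumulation strategy whose payoff only arrives far in the future:

```python
# Rough illustration of myopia (plans and payoffs are made up): the same
# planner with two different cognitive horizons. A short horizon removes
# the incentive for a costly resource-accumulation plan whose payoff only
# arrives far in the future.

def plan_value(rewards_by_step, horizon):
    # The agent only counts rewards that fall within its horizon.
    return sum(reward for step, reward in rewards_by_step if step <= horizon)

plans = {
    "just make the coffee": [(1, 5), (2, 5), (3, 5)],
    "accumulate resources first": [(1, -10), (2, -10), (50, 1000)],
}

for horizon in (5, 100):
    best = max(plans, key=lambda name: plan_value(plans[name], horizon))
    print(f"horizon={horizon:>3}: chosen plan -> {best}")
# horizon=5 selects the intended behaviour; horizon=100 makes resource
# accumulation the dominant strategy.
```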

A different avenue for modifying the expression of instrumental reasoning is to look more closely at the role of means in means–end reasoning. It appears that instrumental reasoners do not have substantive preferences over the sorts of means they adopt, beyond wanting them to be effective at promoting the desired ends. However, there are some considerations that let us re-examine this picture. Consider, for example, that, given the immense complexity of the consequentialist calculus combined with the bounded computational power of real agents, sufficiently ‘sophisticated’ consequentialist reasoners will fare better by sometimes choosing actions on the basis of rule- or virtue-based reasoning as opposed to explicit calculation of consequences. Another argument takes into account the nature of multi-agent dynamics, arguing that positive-sum dynamics emerge when agents are themselves transparent and prosocial. Oftentimes, the best way to credibly appear transparent or prosocial is to be transparent and prosocial. Finally, functional decision theory suggests that, instead of asking what action should be taken in a given situation, we should ask what decision algorithm we should adopt for this and any structurally equivalent future or past decision problems (Yudkowsky and Soares, 2017). Generally speaking, all of these examples look a lot like rule- or virtue-based reasoning, rather than what we typically imagined means–end reasoning to look like. Furthermore, they refute at least a strong version of the claim that instrumental reasoners cannot reason substantively about means. 

With all of these pieces in place, we can now proceed to sketch out a positive proposal of aligned instrumental rationality. In doing so, I will centrally draw on work on embedded agency as discussed by Demski and Garrabrant (2019). 

Embeddedness refers to the fact that an agent is itself part of the environment in which it acts. In contrast to the way agents are typically modelled—as being cleanly separated from their environment and able to reason about it completely—embedded agents are understood as living ‘inside’ of, or embedded in, their environment. Adopting the assumption of embedded agency into our thinking on instrumental rationality has the following implications. For a Cartesian agent, the optimality of a means M is determined by M’s effects on the world W. Effective means move the agent closer to its desired world state, where the desirability of a world state is determined by the agent’s ends or preferences. In the case of an embedded agent, however, the evaluation of the optimality of means involves not only M’s effects on the world (as it does in the case of the Cartesian agent), but also M’s effects on the future instantiation of the agent itself.

For example, imagine Anton meets a generous agent, Beta, who makes the following offer: Beta will give Anton a new laptop, conditional on Anton having sufficient need for it. Beta will consider Anton legitimately needy of a new laptop if his current one is at least 5 years old. However, Anton’s current laptop is 4 years old. Should Anton lie to Beta in order to receive a new laptop nevertheless? If Anton is a ‘Cartesian’ instrumental reasoner, he will approach this question by weighing up the positive and negative effects that lying would have compared to telling the truth. If Anton is an embedded instrumental reasoner, he will additionally consider how taking one action over the other will affect his future self, e.g., whether it will make him more likely to lie or be honest in the future. What is more, beyond taking a predictive stance towards future instantiations of himself, Anton can use the fact that his actions today will affect his future self as an avenue to self-modification (within some constraints). For example, Anton might actively wish he was a more honest person. Given embeddedness, acting honestly today contributes towards Anton being a more honest person in the future (i.e., would increase the likelihood that he will act honestly in the future). As such, embedded agency entails that instrumental reasoning becomes entangled with self-predictability and self-modification. 
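A minimal sketch of Anton's decision problem (all payoffs are invented, and a real Cartesian evaluation would of course also weigh worldly downsides of lying) can make the structural difference explicit: the embedded evaluation adds a term for the action's effect on the agent's own future dispositions.

```python
# Minimal sketch of Anton's choice (payoffs invented; a fuller Cartesian
# evaluation would also weigh worldly downsides of lying). The embedded
# evaluation adds a term for how the action shapes Anton's future self.

actions = {
    # world_payoff: immediate effect on the world (e.g., getting the laptop)
    # honesty_shift: change in Anton's future propensity to act honestly
    "lie":        {"world_payoff": 10, "honesty_shift": -0.2},
    "tell_truth": {"world_payoff": 0,  "honesty_shift": +0.1},
}

VALUE_OF_BEING_HONEST = 100  # how much Anton values being an honest person

def cartesian_value(action):
    # Only the action's effects on the world count.
    return actions[action]["world_payoff"]

def embedded_value(action):
    # Effects on the world plus effects on the agent's future instantiation.
    return (actions[action]["world_payoff"]
            + actions[action]["honesty_shift"] * VALUE_OF_BEING_HONEST)

for evaluate in (cartesian_value, embedded_value):
    print(evaluate.__name__, "->", max(actions, key=evaluate))
# cartesian_value -> lie; embedded_value -> tell_truth
```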

Bringing things back to AI risk and alignment: insofar as instrumental reasoning in Cartesian agents tends to converge towards power-seeking behaviours, what will instrumental reasoning in embedded agents converge towards? This is, largely, an open question. However, I want to provide one last example to gesture at what the shape of an answer might look like. Let’s assume agent P’s goal is to bring about global peace. Under ‘classical’ instrumental convergence, we may worry that an ‘attractive’ strategy for an instrumental reasoner is to first seek enough power to dominate the world and then, as P’s final ruling so to speak, instantiate global peace. Were a powerful future AI system to pursue such a strategy, that would surely seem problematic. 

Let us now consider the same setup, but this time with P’ as an embedded instrumental reasoner. P’ would likely not adopt a strategy of ruthless power-seeking because doing so would make future versions of itself more greedy or violent and less peace-loving. Instead, P’ may be attracted to a strategy that could be described as ‘pursuing peace peacefully’. Importantly, a world of global peace is a world where only peaceful means are being adopted by the actors. Taking actions today that make it more likely that future versions of P’ will act peacefully thus starts to look like a compelling, potentially even convergent, course of action. 

As such, instrumental reasoning in embedded agents, via such mechanisms as self-prediction and self-modification, is able to reason substantively about its means beyond their first-order effects on the world. I believe this constitutes fruitful ground for further alignment research. Key open questions include what sorts of means–ends combinations are both stable (i.e. convergent) and have desirable alignment properties (e.g. truthfulness, corrigibility, etc.). 

In summary, I have argued, first, that the AI alignment problem is a big deal. The fact that alignment solutions need to be robust in the face of intelligent behaviour (e.g. adaptive planning, instrumental reasoning) sets it apart from more ‘typical’ safety problems in engineering. Second, with the help of a number of examples and conceptual arguments mainly centred around the notion of embedded agency, I contended that it is possible to find expressions of instrumental reasoning that avoid the failure modes discussed earlier. As such, while I cannot conclude whether AI alignment is ultimately solvable, or will de facto be solved in time, I wish to express reason for hope and a concrete recommendation for further investigation. 


Resources

  • Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

  • Brundage, M., Avin, S., Clark, J., Toner, H., Eckersley, P., Garfinkel, B., ... & Amodei, D. (2018). The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. arXiv preprint arXiv:1802.07228.

  • Bostrom, N. (2012). The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. Minds and Machines, 22(2), 71-85.

  • Carlsmith, J. (2022). Is Power-Seeking AI an Existential Risk?. arXiv preprint arXiv:2206.13353.

  • Christiano, P. (2019). Current work in AI alignment. [talk audio recording]. https://youtu.be/-vsYtevJ2bc 

  • Chollet, F. (2019). On the measure of intelligence. arXiv preprint arXiv:1911.01547.

  • Demski, A., & Garrabrant, S. (2019). Embedded agency. arXiv preprint arXiv:1902.09469.

  • Di Langosco, L. L., Koch, J., Sharkey, L. D., Pfau, J., & Krueger, D. (2022). Goal misgeneralisation in deep reinforcement learning. In International Conference on Machine Learning (pp. 12004-12019). PMLR.

  • Drexler, K. E. (2019). Reframing superintelligence. The Future of Humanity Institute, The University of Oxford, Oxford, UK.

  • Evans, O., Cotton-Barratt, O., Finnveden, L., Bales, A., Balwit, A., Wills, P., ... & Saunders, W. (2021). Truthful AI: Developing and governing AI that does not lie. arXiv preprint arXiv:2110.06674.

  • Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and machines. 30(3), 411-437.

  • Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. (2016).Cooperative inverse reinforcement learning. Advances in Neural Information Processing Systems, 29:3909–3917

  • Hadfield-Menell, D., Dragan, A., Abbeel, P., & Russell, S. (2017). The off-switch game. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence.

  • Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., and Garrabrant, S. (2019). Risks from learned optimisation in advanced machine learning systems. arXiv preprint arXiv:1906.01820.

  • Hubinger, E. (2020). An overview of 11 proposals for building safe advanced AI. arXiv preprint arXiv:2012.07532.

  • Krakovna, V., Uesato, J., Mikulik, V., Rahtz, M., Everitt, T., Kumar, R., Kenton, Z., Leike, J., and Legg, S. (2020). Specification gaming: the flip side of AI ingenuity. DeepMind Blog.

  • Krakovna, V., Orseau, L., Ngo, R., Martic, M., & Legg, S. (2020). Avoiding side effects by considering future tasks. Advances in Neural Information Processing Systems, 33, 19064-19074.

  • Manheim, D., & Garrabrant, S. (2018). Categorizing variants of Goodhart's Law. arXiv preprint arXiv:1803.04585.

  • Omohundro, S. M. (2008). The basic AI drives. In AGI (Vol. 171, pp. 483-492).

  • Orseau, L., & Armstrong, M. S. (2016). Safely interruptible agents.

  • Prunkl, C., & Whittlestone, J. (2020). Beyond near-and long-term: Towards a clearer account of research priorities in AI ethics and society. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (pp. 138-143).

  • Shah, R., Varma, V., Kumar, R., Phuong, M., Krakovna, V., Uesato, J., & Kenton, Z. (2022). Goal Misgeneralisation: Why Correct Specifications Aren't Enough For Correct Goals. arXiv preprint arXiv:2210.01790.

  • Soares, N., Fallenstein, B., Armstrong, S., & Yudkowsky, E. (2015). Corrigibility. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.

  • Wedgwood, R. (2011). Instrumental rationality. Oxford studies in metaethics, 6, 280-309.