About LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

The RSS feed URL is: https://www.lesswrong.com/feed.xml

Copy this URL into your feed reader to subscribe.

Preview of the LessWrong RSS feed

The Obliqueness Thesis

2024-09-19 08:26:30

Published on September 19, 2024 12:26 AM GMT

In my Xenosystems review, I discussed the Orthogonality Thesis, concluding that it was a bad metaphor. It's a long post, though, and the comments on orthogonality build on other Xenosystems content. Therefore, I think it may be helpful to present a more concentrated discussion on Orthogonality, contrasting Orthogonality with my own view, without introducing dependencies on Land's views. (Land gets credit for inspiring many of these thoughts, of course, but I'm presenting my views as my own here.)

First, let's define the Orthogonality Thesis. Quoting Superintelligence for Bostrom's formulation:

Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.

To me, the main ambiguity about what this is saying is the "could in principle" part; maybe, for any level of intelligence and any final goal, there exists (in the mathematical sense) an agent combining those, but some combinations are much more natural and statistically likely than others. Let's consider Yudkowsky's formulations as alternatives. Quoting Arbital:

The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.

The strong form of the Orthogonality Thesis says that there's no extra difficulty or complication in the existence of an intelligent agent that pursues a goal, above and beyond the computational tractability of that goal.

As an example of the computational tractability consideration, sufficiently complex goals may only be well-represented by sufficiently intelligent agents. "Complication" may be reflected in, for example, code complexity; to my mind, the strong form implies that the code complexity of an agent with a given level of intelligence and goals is approximately the code complexity of the intelligence plus the code complexity of the goal specification, plus a constant. Code complexity would influence statistical likelihood for the usual Kolmogorov/Solomonoff reasons, of course.
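To make that complexity claim concrete, here is one way to write it down (my own notation for the idea, not a formula from Yudkowsky or Bostrom):

```latex
% K(.) = Kolmogorov (description) complexity; I = intelligence machinery; G = goal specification.
K(\mathrm{agent}_{I,G}) \;\approx\; K(I) + K(G) + O(1),
\qquad
\Pr[\mathrm{agent}_{I,G}] \;\propto\; 2^{-K(\mathrm{agent}_{I,G})}
```

The second expression is the usual Solomonoff-style weighting: lower description length means higher prior probability of such an agent arising.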

I think, overall, it is more productive to examine Yudkowsky's formulation than Bostrom's, as he has already helpfully factored the thesis into weak and strong forms. Therefore, by criticizing Yudkowsky's formulations, I am less likely to be criticizing a strawman. I will use "Weak Orthogonality" to refer to Yudkowsky's "Orthogonality Thesis" and "Strong Orthogonality" to refer to Yudkowsky's "strong form of the Orthogonality Thesis".

Land, alternatively, describes a "diagonal" between intelligence and goals as an alternative to orthogonality, but I don't see a specific formulation of a "Diagonality Thesis" on his part. Here's a possible formulation:

Diagonality Thesis: Final goals tend to converge to a point as intelligence increases.

The main criticism of this thesis is that formulations of ideal agency, in the form of Bayesianism and VNM utility, leave open free parameters, e.g. priors over un-testable propositions, and the utility function. Since I expect few readers to accept the Diagonality Thesis, I will not concentrate on criticizing it.

What about my own view? I like Tsvi's naming of it as an "obliqueness thesis".

Obliqueness Thesis: The Diagonality Thesis and the Strong Orthogonality Thesis are false. Agents do not tend to factorize into an Orthogonal value-like component and a Diagonal belief-like component; rather, there are Oblique components that do not factorize neatly.

(Here, by Orthogonal I mean basically independent of intelligence, and by Diagonal I mean converging to a point in the limit of intelligence.)

While I will address Yudkowsky's arguments for the Orthogonality Thesis, I think arguing directly for my view first will be more helpful. In general, it seems to me that arguments for and against the Orthogonality Thesis are not mathematically rigorous; therefore, I don't need to present a mathematically rigorous case to contribute relevant considerations, so I will consider intuitive arguments relevant, and present multiple arguments rather than a single sequential argument (as I did with the more rigorous argument for many worlds).

Bayes/VNM point against Orthogonality

Some people may think that the free parameters in Bayes/VNM point towards the Orthogonality Thesis being true. I think, rather, that they point against Orthogonality. While they do function as arguments against the Diagonality Thesis, this is insufficient for Orthogonality.

First, on the relationship between intelligence and bounded rationality. It's meaningless to talk about intelligence without a notion of bounded rationality. Perfect rationality in a complex environment is computationally intractable. With lower intelligence, bounded rationality is necessary. So, at non-extreme intelligence levels, the Orthogonality Thesis must be making a case that boundedly rational agents can have any computationally tractable goal.

Bayesianism and VNM expected utility optimization are known to be computationally intractable in complex environments. That is why algorithms like MCMC and reinforcement learning are used. So, making an argument for Orthogonality in terms of Bayesianism and VNM is simply dodging the question, by already assuming an extremely high intelligence level from the start.

As the Orthogonality Thesis refers to "values" or "final goals" (which I take to be synonymous), it must have a notion of the "values" of agents that are not extremely intelligent. These values cannot be assumed to be VNM, since VNM is not computationally tractable. Meanwhile, money-pumping arguments suggest that extremely intelligent agents will tend to converge to VNM-ish preferences.

Argument from Bayes/VNM: Agents with low intelligence will tend to have beliefs/values that are far from Bayesian/VNM. Agents with high intelligence will tend to have beliefs/values that are close to Bayesian/VNM. Strong Orthogonality is false because it is awkward to combine low intelligence with Bayesian/VNM beliefs/values, and awkward to combine high intelligence with far-from-Bayesian/VNM beliefs/values. Weak Orthogonality is in doubt, because having far-from-Bayesian/VNM beliefs/values puts a limit on the agent's intelligence.

To summarize: un-intelligent agents cannot be assumed to be Bayesian/VNM from the start. Those arise at a limit of intelligence, and arguably have to arise due to money-pumping arguments. Beliefs/values therefore tend to become more Bayesian/VNM with high intelligence, contradicting Strong Orthogonality and perhaps Weak Orthogonality.

One could perhaps object that logical uncertainty allows even weak agents to be Bayesian over combined physical/mathematical uncertainty; I'll address this consideration later.

Belief/value duality

It may be unclear why the Argument from Bayes/VNM refers to both beliefs and values, as the Orthogonality Thesis is only about values. It would, indeed, be hard to make the case that the Orthogonality Thesis is true as applied to beliefs. However, various arguments suggest that Bayesian beliefs and VNM preferences are "dual" such that complexity can be moved from one to the other.

While I believe Scott Garrabrant and/or Abram Demski have discussed such duality, I haven't found a relevant post on the Alignment Forum about this, so I'll present the basic idea in this post.

Let a be the agent’s action, and let w represent the state of the world prior to / unaffected by the agent’s action. Let r(a, w) be the outcome resulting from the action and world. Let P(w) be the primary agent’s probability of a given world. Let U(o) be the primary agent’s utility for outcome o. The primary agent finds an action a to maximize Σ_w P(w) U(r(a, w)).

Now let e be an arbitrary predicate on worlds. Consider modifying P to increase the probability that e(w) is true. That is:

P'(w) = P(w) · (1 + [e(w)]) / Z, where Z = Σ_w P(w) · (1 + [e(w)]),

and where [e(w)] equals 1 if e(w), otherwise 0. Now, can we define a modified utility function U' so a secondary agent with beliefs P' and utility function U' will take the same action as the primary agent? Yes (treating the outcome r(a, w) as rich enough to determine whether e(w) holds):

U'(r(a, w)) = U(r(a, w)) / (1 + [e(w)])

This secondary agent will find an action a to maximize:

Σ_w P'(w) U'(r(a, w)) = Σ_w [P(w) · (1 + [e(w)]) / Z] · [U(r(a, w)) / (1 + [e(w)])] = (1/Z) Σ_w P(w) U(r(a, w))

Clearly, this is a positive constant times the primary agent's maximization target, so the secondary agent will take the same action.

This demonstrates a basic way that Bayesian beliefs and VNM utility are dual to each other. One could even model all agents as having the same utility function (of maximizing a random variable U) and simply having different beliefs about what U values are implied by the agent's action and world state.
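A minimal numeric sketch of this duality (toy numbers of my own, not from the post), checking that the primary agent (P, U) and the secondary agent (P', U') pick the same action:

```python
# Toy check of belief/value duality: boosting P toward worlds where e holds,
# while dividing U by the same factor, leaves the chosen action unchanged.
# All numbers are made up for illustration; the outcome is assumed to reveal w.

worlds = ["w1", "w2"]
actions = ["a1", "a2"]

P = {"w1": 0.5, "w2": 0.5}                       # primary beliefs
U = {("a1", "w1"): 3, ("a1", "w2"): 0,           # primary utility of outcome r(a, w)
     ("a2", "w1"): 1, ("a2", "w2"): 1}
e = {"w1": True, "w2": False}                    # arbitrary predicate on worlds

# Secondary beliefs: boost probability of e-worlds, then renormalize.
raw = {w: P[w] * (1 + int(e[w])) for w in worlds}
Z = sum(raw.values())
P2 = {w: raw[w] / Z for w in worlds}

# Secondary utility: divide out the same boost.
U2 = {(a, w): U[(a, w)] / (1 + int(e[w])) for a in actions for w in worlds}

def best(beliefs, utility):
    return max(actions, key=lambda a: sum(beliefs[w] * utility[(a, w)] for w in worlds))

assert best(P, U) == best(P2, U2)   # both agents pick the same action
print(best(P, U), best(P2, U2))
```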

This kind of duality, of course, becomes more complicated in the case of bounded rationality. I'll summarize the argument as follows:

Argument from belief/value duality: From an agent's behavior, multiple belief/value combinations are valid attributions. This is clearly true in the limiting Bayes/VNM case, suggesting it also applies in the case of bounded rationality. It is unlikely that the Strong Orthogonality Thesis applies to beliefs (including priors), so, due to the duality, it is also unlikely that it applies to values.

I consider this weaker than the Argument from Bayes/VNM. Someone might object that both values and a certain component of beliefs are orthogonal, while the other components of beliefs (those that change with more reasoning/intelligence) aren't. But I think this depends on a certain factorizability of beliefs/values into the kind that change on reflection and those that don't, and I'm skeptical of such factorizations. I think discussion of logical uncertainty will make my position on this clearer, though, so let's move on.

Logical uncertainty as a model for bounded rationality

I've already argued that bounded rationality is essential to intelligence (and therefore the Orthogonality Thesis). Logical uncertainty is a form of bounded rationality (as applied to guessing the probabilities of mathematical statements). Therefore, discussing logical uncertainty is likely to be fruitful with respect to the Orthogonality Thesis.

Logical Induction is a logical uncertainty algorithm that produces a probability table for a finite subset of mathematical statements at each iteration. These beliefs are determined by a betting market of an increasing (up to infinity) number of programs that make bets, with the bets resolved by a "deductive process" that is basically a theorem prover. The algorithm is computable, though extremely computationally intractable, and has properties in the limit including some forms of Bayesian updating, statistical learning, and consistency over time.

We can see Logical Induction as evidence against the Diagonality Thesis: beliefs about undecidable statements (which exist in consistent theories due to Gödel's first incompleteness theorem) can take on any probability in the limit, though satisfy properties such as consistency with other assigned probabilities (in a Bayesian-like manner).

However, (a) it is hard to know ahead of time which statements are actually undecidable, (b) even beliefs about undecidable statements tend to predictably change over time to Bayesian consistency with other beliefs about undecidable statements. So, Logical Induction does not straightforwardly factorize into a "belief-like" component (which converges on enough reflection) and a "value-like" component (which doesn't change on reflection). Thus:

Argument from Logical Induction: Logical Induction is a current best-in-class model of theoretical asymptotic bounded rationality. Logical Induction is non-Diagonal, but also clearly non-Orthogonal, and doesn't apparently factorize into separate Orthogonal and Diagonal components. Combined with considerations from "Argument from belief/value duality", this suggests that it's hard to identify all value-like components in advanced agents that are Orthogonal in the sense of not tending to change upon reflection.

One can imagine, for example, introducing extra function/predicate symbols into the logical theory the logical induction is over, to represent utility. Logical induction will tend to make judgments about these functions/predicates more consistent and inductively plausible over time, changing its judgments about the utilities of different outcomes towards plausible logical probabilities. This is an Oblique (non-Orthogonal and non-Diagonal) change in the interpretation of the utility symbol over time.

Likewise, Logical Induction can be specified to have beliefs over empirical facts such as observations by adding additional function/predicate symbols, and can perhaps update on these as they come in (although this might contradict UDT-type considerations). Through more iteration, Logical Inductors will come to have more approximately Bayesian, and inductively plausible, beliefs about these empirical facts, in an Oblique fashion.

Even if there is a way of factorizing out an Orthogonal value-like component from an agent, the belief-component (represented by something like Logical Induction) remains non-Diagonal, so there is still a potential "alignment problem" for these non-Diagonal components to match, say, human judgments in the limit. I don't see evidence that these non-Diagonal components factor into a value-like "prior over the undecidable" that does not change upon reflection. So, there remain components of something analogous to a "final goal" (by belief/value duality) that are Oblique, and within the scope of alignment.

If it were possible to get the properties of Logical Induction in a Bayesian system, which makes Bayesian updates on logical facts over time, that would make it more plausible that an Orthogonal logical prior could be specified ahead of time. However, MIRI researchers have tried for a while to find Bayesian interpretations of Logical Induction, and failed, as would be expected from the Argument from Bayes/VNM.

Naive belief/value factorizations lead to optimization daemons

The AI alignment field has a long history of poking holes in alignment approaches. Oops, you tried making an oracle AI and it manipulated real-world outcomes to make its predictions true. Oops, you tried to do Solomonoff induction and got invaded by aliens. Oops, you tried getting agents to optimize over a virtual physical universe, and they discovered the real world and tried to break out. Oops, you ran a Logical Inductor and one of the traders manipulated the probabilities to instantiate itself in the real world.

These sub-processes that take over are known as optimization daemons. When you get the agent architecture wrong, sometimes a sub-process (that runs a massive search over programs, such as with Solomonoff Induction) will luck upon a better agent architecture and out-compete the original system. (See also a very strange post I wrote some years back while thinking about this issue).

If you apply a naive belief/value factorization to create an AI architecture, when compute is scaled up sufficiently, optimization daemons tend to break out, showing that this factorization was insufficient. Enough experiences like this lead to the conclusion that, if there is a realistic belief/value factorization at all, it will look pretty different from the naive one. Thus:

Argument from optimization daemons: Naive ways of factorizing an agent into beliefs/values tend to lead to optimization daemons, which have different values from those in the original factorization. Any successful belief/value factorization will probably look pretty different from the naive one, and might not take the form of factorization into Diagonal belief-like components and Orthogonal value-like components. Therefore, if any realistic formulation of Orthogonality exists, it will be hard to find and substantially different from naive notions of Orthogonality.

Intelligence changes the ontology values are expressed in

The most straightforward way to specify a utility function is to specify an ontology (a theory of what exists, similar to a database schema) and then provide a utility function over elements of this ontology. Prior to humans learning about physics, evolution (taken as a design algorithm for organisms involving mutation and selection) did not know all that human physicists know. Therefore, human evolutionary values are unlikely to be expressed in the ontology of physics as physicists currently understand it.

Human evolutionary values probably care about things like eating enough, social acceptance, proxies for reproduction, etc. It is unknown how these are specified, but perhaps sensory signals (such as stomach signals) are connected with a developing world model over time. Humans can experience vertigo at learning physics, e.g. thinking that free will and morality are fake, leading to unclear applications of native values to a realistic physical ontology. Physics has known gaps (such as quantum/relativity correspondence, and dark energy/dark matter) that suggest further ontology shifts.

One response to this vertigo is to try to solve the ontology identification problem; find a way of translating states in the new ontology (such as physics) to an old one (such as any kind of native human ontology), in a structure-preserving way, such that a utility function over the new ontology can be constructed as a composition of the original utility function and the new-to-old ontological mapping. Current solutions, such as those discussed in MIRI's Ontological Crises paper, are unsatisfying. Having looked at this problem for a while, I'm not convinced there is a satisfactory solution within the constraints presented. Thus:

Argument from ontological change: More intelligent agents tend to change their ontology to be more realistic. Utility functions are most naturally expressed relative to an ontology. Therefore, there is a correlation between an agent's intelligence and utility function, through the agent's ontology as an intermediate variable, contradicting Strong Orthogonality. There is no known solution for rescuing the old utility function in the new ontology, and some research intuitions point towards any solution being unsatisfactory in some way.

If a satisfactory solution is found, I'll change my mind on this argument, of course, but I'm not convinced such a satisfactory solution exists. To summarize: higher intelligence causes ontological changes, and rescuing old values seems to involve unnatural "warps" to make the new ontology correspond with the old one, contradicting at least Strong Orthogonality, and possibly Weak Orthogonality (if some values are simply incompatible with realistic ontology). Paperclips, for example, tend to appear most relevant at an intermediate intelligence level (around human-level), and become more ontologically unnatural at higher intelligence levels.

As a more general point, one expects possible mutual information between mental architecture and values, because values that "re-use" parts of the mental architecture achieve lower description length. For example, if the mental architecture involves creating universal algebra structures and finding analogies between them and the world, then values expressed in terms of such universal algebras will tend to have lower relative description complexity to the architecture. Such mutual information contradicts Strong Orthogonality, as some intelligence/value combinations are more natural than others.

Intelligence leads to recognizing value-relevant symmetries

Consider a number of un-intuitive value propositions people have argued for:

The point is not to argue for these, but to note that these arguments have been made and are relatively more accepted among people who have thought more about the relevant issues than people who haven't. Thinking tends to lead to noticing more symmetries and dependencies between value-relevant objects, and tends to adjust values to be more mathematically plausible and natural. Of course, extrapolating this to superintelligence leads to further symmetries. Thus:

Argument from value-relevant symmetries: More intelligent agents tend to realize more symmetries related to value-relevant entities. They will also tend to adjust their values according to symmetry considerations. This is an apparent value change, and it's hard to see how it can instead be factored as a Bayesian update on top of a constant value function.

I'll examine such factorizations in more detail shortly.

Human brains don't seem to neatly factorize

This is less about the Orthogonality Thesis generally, and more about human values. If there were separable "belief components" and "value components" in the human brain, with the value components remaining constant over time, that would increase the chance that at least some Orthogonal component can be identified in human brains, corresponding with "human values" (though, remember, the belief-like component can also be Oblique rather than Diagonal).

However, human brains seem much more messy than the sort of computer program that could factorize this way. Different brain regions are connected in at least some ways that are not well-understood. Additionally, even apparent "value components" may be analogous to something like a deep Q-learning function, which incorporates empirical updates in addition to pre-set "values".

The interaction between human brains and language is also relevant. Humans develop values they act on partly through language. And language (including language reporting values) is affected by empirical updates and reflection, thus non-Orthogonal. Reflecting on morality can easily change people's expressed and acted-upon values, e.g. in the case of Peter Singer. People can change which values they report as instrumental or terminal even while behaving similarly (e.g. flipping between selfishness-as-terminal and altruism-as-terminal), with the ambiguity hard to resolve because most behavior relates to convergent instrumental goals.

Maybe language is more of an effect than cause of values. But there really seems to be feedback from language to non-linguistic brain functions that decide actions and so on. Attributing coherent values over realistic physics to the brain parts that are non-linguistic seems like a form of projection or anthropomorphism. Language and thought have a function in cognition and attaining coherent values over realistic ontologies. Thus:

Argument from brain messiness: Human brains don't seem to neatly factorize into a belief-component and a value-component, with the value-component unaffected by reflection or language (which it would need to be Orthogonal). To the extent any value-component does not change due to language or reflection, it is restricted to evolutionary human ontology, which is unlikely to apply to realistic physics; language and reflection are part of the process that refines human values, rather than being an afterthought of them. Therefore, if the Orthogonality Thesis is true, humans lack identifiable values that fit into the values axis of the Orthogonality Thesis.

This doesn't rule out that Orthogonality could apply to superintelligences, of course, but it does raise questions for the project of aligning superintelligences with human values; perhaps such values do not exist or are not formulated so as to apply to the actual universe.

Models of ASI should start with realism

Some may take arguments against Orthogonality to be disturbing at a value level, perhaps because they are attached to research projects such as Friendly AI (or more specific approaches), and think questioning foundational assumptions would make the objective (such as alignment with already-existing human values) less clear. I believe "hold off on proposing solutions" applies here: better strategies are likely to come from first understanding what is likely to happen absent a strategy, then afterwards looking for available degrees of freedom.

Quoting Yudkowsky:

Orthogonality is meant as a descriptive statement about reality, not a normative assertion. Orthogonality is not a claim about the way things ought to be; nor a claim that moral relativism is true (e.g. that all moralities are on equally uncertain footing according to some higher metamorality that judges all moralities as equally devoid of what would objectively constitute a justification). Claiming that paperclip maximizers can be constructed as cognitive agents is not meant to say anything favorable about paperclips, nor anything derogatory about sapient life.

Likewise, Obliqueness does not imply that we shouldn't think about the future and ways of influencing it, that we should just give up on influencing the future because we're doomed anyway, that moral realist philosophers are correct or that their moral theories are predictive of ASI, that ASIs are necessarily morally good, and so on. The Friendly AI research program was formulated based on descriptive statements believed at the time, such as that an ASI singleton would eventually emerge, that the Orthogonality Thesis is basically true, and so on. Whatever cognitive process formulated this program would have formulated a different program conditional on different beliefs about likely ASI trajectories. Thus:

Meta-argument from realism: Paths towards beneficially achieving human values (or analogues, if "human values" don't exist) in the far future likely involve a lot of thinking about likely ASI trajectories absent intervention. The realistic paths towards human influence on the far future depend on realistic forecasting models for ASI, with Orthogonality/Diagonality/Obliqueness as alternative forecasts. Such forecasting models can be usefully thought about prior to formulation of a research program intended to influence the far future. Formulating and working from models of bounded rationality such as Logical Induction is likely to be more fruitful than assuming that bounded rationality will factorize into Orthogonal and Diagonal components without evidence in favor of this proposition. Forecasting also means paying more attention to the Strong Orthogonality Thesis than the Weak Orthogonality Thesis, as statistical correlations between intelligence and values will show up in such forecasts.

On Yudkowsky's arguments

Now that I've explained my own position, addressing Yudkowsky's main arguments may be useful. His main argument has to do with humans making paperclips instrumentally:

Suppose some strange alien came to Earth and credibly offered to pay us one million dollars' worth of new wealth every time we created a paperclip. We'd encounter no special intellectual difficulty in figuring out how to make lots of paperclips.

That is, minds would readily be able to reason about:

  • How many paperclips would result, if I pursued a policy π?
  • How can I search out a policy π that happens to have a high answer to the above question?

I believe it is better to think of the payment as coming in the far future and perhaps in another universe; that way, the belief about future payment is more analogous to terminal values than instrumental values. In this case, creating paperclips is a decent proxy for achievement of human value, so long-termist humans would tend to want lots of paperclips to be created.

I basically accept this, but, notably, Yudkowsky's argument is based on belief/value duality. He thinks it would be awkward for the reader to imagine terminally wanting paperclips, so he instead asks them to imagine a strange set of beliefs leading to paperclip production being oddly correlated with human value achievement. Thus, acceptance of Yudkowsky's premises here will tend to strengthen the Argument from belief/value duality and related arguments.

In particular, more intelligence would cause human-like agents to develop different beliefs about what actions aliens are likely to reward, and what numbers of paperclips different policies result in. This points towards Obliqueness as with Logical Induction: such beliefs will be revised (but not totally convergent) over time, leading to applying different strategies toward value achievement. And ontological issues around what counts as a paperclip will come up at some point, and likely be decided in a prior-dependent but also reflection-dependent way.

Beliefs about which aliens are most capable/honest likely depend on human priors, and are therefore Oblique: humans would want to program an aligned AI to mostly match these priors while revising beliefs along the way, but can't easily factor out their prior for the AI to share.

Now onto other arguments. The "Size of mind design space" argument implies many agents exist with different values from humans, which agrees with Obliqueness (intelligent agents tend to have different values from unintelligent ones). It's more of an argument about the possibility space than statistical correlation, thus being more about Weak than Strong Orthogonality.

The "Instrumental Convergence" argument doesn't appear to be an argument for Orthogonality per se; rather, it's a counter to arguments against Orthogonality based on noticing convergent instrumental goals. My arguments don't take this form.

Likewise, "Reflective Stability" is about a particular convergent instrumental goal (preventing value modification). In an Oblique framing, a Logical Inductor will tend not to change its beliefs about even un-decidable propositions too often (as this would lead to money-pumps), so consistency is valued all else being equal.

While I could go into more detail responding to Yudkowsky, I think space is better spent presenting my own Oblique views for now.

Conclusion

As an alternative to the Orthogonality Thesis and the Diagonality Thesis, I present the Obliqueness Thesis, which says that increasing intelligence tends to lead to value changes but not total value convergence. I have presented arguments that advanced agents and humans do not neatly factor into Orthogonal value-like components and Diagonal belief-like components, using Logical Induction as a model of bounded rationality. This implies complications to theories of AI alignment based on assuming humans have values and we need the AGI to agree about those values, while increasing their intelligence (and thus changing beliefs).

At a methodological level, I believe it is productive to start by forecasting default ASI using models of bounded rationality, especially known models such as Logical Induction, and further developing such models. I think this is more productive than assuming that these models will take the form of a belief/value factorization, although I have some uncertainty about whether such a factorization will be found.

If the Obliqueness Thesis is accepted, what possibility space results? One could think of this as steering a boat in a current of varying strength. Clearly, ignoring the current and just steering where you want to go is unproductive, as is just going along with the current and not trying to steer at all. Getting to where one wants to go consists in largely going with the current (if it's strong enough), charting a course that takes it into account.

Assuming Obliqueness, it's not viable to have large impacts on the far future without accepting some value changes that come from higher intelligence (and better epistemology in general). The Friendly AI research program already accepts that paths towards influencing the far future involve "going with the flow" regarding superintelligence, ontology changes, and convergent instrumental goals; Obliqueness says such flows go further than just these, being hard to cleanly separate from values.

Obliqueness obviously leaves open the question of just how oblique. It's hard to even formulate a quantitative question here. I'd very intuitively and roughly guess that intelligence and values are 3 degrees off (that is, almost diagonal), but it's unclear what question I am even guessing the answer to. I'll leave formulating and answering the question as an open problem.

I think Obliqueness is realistic, and that it's useful to start with realism when thinking of how to influence the far future. Maybe superintelligence necessitates significant changes away from current human values; the Litany of Tarski applies. But this post is more about the technical thesis than emotional processing of such, so I'll end here.



Discuss

How to choose what to work on

2024-09-19 04:39:12

Published on September 18, 2024 8:39 PM GMT

So you want to advance human progress. And you’re wondering, what should you, personally, do? Say you have talent, ambition, and drive—how do you choose a project or career?

There are a few frameworks for making this decision. Recently, though, I’ve started to see pitfalls with some of them, and I have a new variation to suggest.

Passion, competence, need

In Good to Great, Jim Collins says that great companies choose something to focus on at the intersection of:

[Venn diagram of three overlapping circles (deep passion, best-in-the-world competence, economic engine), credit: Jim Collins]

This maps naturally onto an individual life/career, if we understand “drives your economic engine” to mean something there is a market need for, that you can make a living at.

You can understand this model by seeing the failure modes if you have only two out of three:

There is also a concept of ikigai that has four elements:

[Ikigai diagram of four overlapping elements (what you love, what you're good at, what the world needs, what you can be paid for), via Forbes]

This is pretty much the same thing, except breaking out the “economic engine” into two elements of “world needs it” and “you can get paid for it.” I prefer the simpler, three-element version.

I like this framework and have recommended it, but I now see a couple of ways you can mis-apply it:

Important, tractable, neglected

Another model I like comes from the effective altruist community: find things that are important, tractable, and neglected. Again, we can negate each one to see why all three are needed:

This framework was developed for cause prioritization in charitable giving, but it can also be naturally applied to choice of project or career.

Again, though, I think this framework can be mis-applied:

The other problem with applying this framework to yourself is that it’s impersonal. Maybe this is good for portfolio management (which, again, was the original context for it), but in choosing a career you need to find a personal fit—a fit with your talents and passions. (Even EAs recommend this.)

Ignore legibility, embrace intuition

One other way you can go wrong in applying any of these frameworks is if you have a sense that something is important, that you could be great at it, etc.—but you can’t fully articulate why, and can’t explain it in a convincing way to most other people. “On paper” it seems like a bad opportunity, yet you can’t shake the feeling that there’s gold in those hills.

The greatest opportunities often have this quality—in part because if they looked good on paper, someone would already have seized them. Don’t filter for legibility, or you will miss these chances.

My framework

If we discard the problematic elements from the frameworks above, I think we’re left with something like the following.

Pick something that:

Ideally, you are downright confused why no one is already doing what you want to do, because it seems so obvious to you—and (this is important) if that feeling persists or even grows the more you learn about the area.

This was how I ended up writing The Roots of Progress. I was obsessed with understanding progress, it seemed obviously one of the most important things in the world, and when I went to find a book on the topic, I couldn’t find anything written the way I wanted to read it, even though there is of course a vast literature on the topic. I ignored the fact that I have no credentials to do this kind of work, and that I had no plans to make a living from it. It has worked out pretty well.

This is also how I chose my last tech startup, Fieldbook, in 2013. I was obsessed with the idea of building a hybrid spreadsheet-database as a modern SaaS app, it seemed obviously valuable for many use cases, and nothing like it existed, even though there were some competitors that had been around for a while. Although Fieldbook failed as a startup, it was the right idea at the right time (as Airtable and Notion have proved).

So, trust your intuition and follow your obsession.



Discuss

Intention-to-Treat (Re: How harmful is music, really?)

2024-09-19 04:03:00

Published on September 18, 2024 6:44 PM GMT

I have long wanted to write about intention-to-treat because it's such a neat idea, and the recent article How harmful is music, really? spurred me to finally do it.


The reported results were

Day         Mean mood
Music       0.29
No music    0.22

Making some very rough assumptions about variation, this difference is maybe 1–2 standard errors away from zero, which could be considered weak evidence that music improves mood.

Except!

There is one big problem with this approach to analysis. Although the experiment started off in a good direction with picking intended music days at random, it then suffered non-compliance, which means the actual days of music are no longer randomly selected. Rather, they are influenced by the environment – which might also influence mood in the same direction. This would strengthen the apparent relationship with no change in the effect of music itself.

The solution is to adopt an intention-to-treat approach to analysis.

Illustrating with synthetic data

I don’t have access to the data dkl9 used, but we can create synthetic data to simulate the experiment. For the sake of this article we’ll keep it as simple as possible; we make some reasonable assumptions and model mood at any given time (g_i) as the sum of four things: a baseline level, the situation we are in (s_i), a music term (a music coefficient times whether music is playing that day), and random noise.

Here's an example of what an experiment might look like under this model. The wiggly line is mood, and the bars indicate whether or not we listen to music each day. (The upper bars indicate listening to music, the lower bars indicate no music.)

The reason we included the situation si as a separate term is that we want to add a correlation between whether we are listening to music and the situation we are in. This seems sensible – it could be things like

The model then simulates 25 % non-compliance, i.e. in roughly a quarter of the days we do not follow the random assignment of music. This level of non-compliance matches the reported result of 0.5 correlation between random music assignment and actual music listening.
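A minimal Python sketch of this kind of synthetic-data setup (my own toy code, not the article's actual model: the baseline, coefficient values, and noise level are illustrative assumptions, while the 31 days and six measurements per day follow the post):

```python
import numpy as np

rng = np.random.default_rng(0)

DAYS, PER_DAY = 31, 6          # 31 days, six mood measurements per day
MUSIC_EFFECT = 0.0             # music coefficient: zero, i.e. no real effect
SITUATION_EFFECT = 0.3         # situation coefficient (illustrative value)
NONCOMPLIANCE = 0.25           # ~25% of days we ignore the random assignment

# Random assignment: which days we *plan* to listen to music.
planned = rng.integers(0, 2, size=DAYS)

# Daily "situation" (good/bad day), which nudges both mood and music listening.
situation = rng.normal(0, 1, size=DAYS)

# Actual listening: follow the plan most days; on non-compliant days,
# let the situation decide (this is what creates the confounding).
comply = rng.random(DAYS) > NONCOMPLIANCE
actual = np.where(comply, planned, (situation > 0).astype(int))

# Mood per measurement: baseline + situation + (no) music effect + noise.
mood = (0.2
        + SITUATION_EFFECT * np.repeat(situation, PER_DAY)
        + MUSIC_EFFECT * np.repeat(actual, PER_DAY)
        + rng.normal(0, 0.5, size=DAYS * PER_DAY))

daily_mood = mood.reshape(DAYS, PER_DAY).mean(axis=1)
print("music days:   ", daily_mood[actual == 1].mean())
print("no-music days:", daily_mood[actual == 0].mean())
```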

When we continue to calibrate the model to produce results similar to those reported in the experiment, we get the following constants and coefficients:

The model then results in the following moods:

Day         Mean mood
Music       0.29
No music    0.20

We could spend time tweaking the model until it matches perfectly[2] but this is close enough for continued discussion.

The very alert reader will notice what happened already: we set the music coefficient to zero, meaning music has no effect on mood at all in our model! Yet it produced results similar to those reported. This is confounding in action. Confounding is responsible for all of the observed effect in this model.

This is also robust in the face of variation. The model allows us to run the experiment many times, and even when we have configured music to have no effect, we get an apparent effect 99 % of the time.

With the naïve analysis we have used so far, the correlation between mood and music is 0.26, with a standard error of 0.10. This indeed appears to be some evidence that music boosts mood.

But it's wrong! We know it is wrong, because we set the music coefficient to zero in the model!

Switching to intention-to-treat analysis

There are two reasons for randomisation. The one we care about here is that it distributes confounders equally across both music days and non-music days.[3] Due to non-compliance, music listening days ended up not being randomly selected, but potentially confounded by other factors that may also affect mood.

Non-compliance is common, and there is a simple solution: instead of doing the analysis in terms of music listening days, do it in terms of planned music days. I.e. although the original randomisation didn't quite work out, still use it for analysis. This should be fine, because if music has an effect on mood, then at least a little of that effect will be visible through the random assignments, even though they didn't all work out. This is called intention-to-treat analysis.[4]
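Continuing the toy sketch above, the only thing intention-to-treat changes is the grouping variable: we compare days by the `planned` assignment instead of `actual` listening (again, a hypothetical illustration, not the article's code):

```python
# Naive analysis: group by what we actually did (confounded by situation).
naive_diff = daily_mood[actual == 1].mean() - daily_mood[actual == 0].mean()

# Intention-to-treat: group by the original random assignment, ignoring
# non-compliance. Confounding now shows up as noise, not as a fake effect.
itt_diff = daily_mood[planned == 1].mean() - daily_mood[planned == 0].mean()

print(f"naive difference: {naive_diff:.2f}")   # tends to look 'significant'
print(f"ITT difference:   {itt_diff:.2f}")     # tends to be near zero here
```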

In this plot, the lighter bands indicate when we planned to listen to music, and the darker bands when we actually did so.

With very keen eyes, we can already see the great effect of confounding on mood. As a hint, look for where the bars indicate non-compliance, and you'll see how often that corresponds to big shifts in mood.

When looking at mood through the lens of when we planned to listen to music, there is no longer any meaningful difference.

Day               Mean mood
Music planned     0.24
Silence planned   0.23

Correlation       0.03
Standard error    0.03

Thus, when we do the analysis in terms of intention-to-treat, we see clearly that music has no discernible effect on mood. This is to be expected, because we set the music coefficient to zero after all, so there shouldn't be any effect.

The cost is lower statistical power

To explore the drawback of intention-to-treat analysis, we can adjust the model such that music has a fairly significant effect on mood. We will make music 4× as powerful as situation. 

This new model gives us roughly the same results as reported before when looking purely in terms of when music is playing:

Day         Mean mood
Music       0.29
No music    0.21

On the other hand, if we look at it through an intention-to-treat lens, we see there is now an effect (as we would expect), although too small to be trusted based on the data alone.

Day               Mean mood
Music planned     0.26
Silence planned   0.23

Correlation       0.09
Standard error    0.11

Remember that we constructed this version of the model to have a definitive effect of music, but because we are looking at it through an intention-to-treat analysis, it becomes harder to see. To bring it out, we would need to run the experiment not for 31 days, but for half a year!
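As a rough back-of-the-envelope check of the "half a year" figure (my own arithmetic on the reported toy numbers, not the author's calculation): the standard error of the correlation shrinks roughly as 1/√n, and we would want the 0.09 correlation to be about two standard errors from zero, i.e. a standard error near 0.045.

```latex
\mathrm{SE}(n) \;\approx\; 0.11\sqrt{\tfrac{31}{n}}
\quad\Rightarrow\quad
n \;\approx\; 31\left(\tfrac{0.11}{0.045}\right)^{2} \approx 185 \text{ days} \approx 6 \text{ months}
```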

Such is the cost of including confounders in one's data: they make experiments much more expensive by virtue of clouding the real relationships. Ignoring them does not make things better, it only risks producing mirages.

Brief summary of findings

To summarise, these are the situations we can find ourselves in:

Analysis type        Significant effect        Non-significant effect
Naïve                Actual or confounder      Actual
Intention-to-treat   Actual                    Actual or confounder

In other words, by switching from a naïve analysis to an intention-to-treat analysis, we make confounders result in false negatives rather than false positives. This is usually preferred when sciencing.

  1. ^

    Actually, since the situation is based on days and there are six measurements per day, we might be able to infer this parameter from data also. But we will not.

  2. ^

    I know because we have something like 7 degrees of freedom for tweaking, and we only need to reproduce 5 numbers with them.

  3. ^

    The other purpose of randomisation is to make it possible to compute the probability of a result from the null hypothesis.

  4. ^

    This is from the medical field, because we randomise who we intend to treat, but then some subjects may elect to move to a different arm of the experiment and we can’t ethically force them to accept treatment.



Discuss

The case for a negative alignment tax

2024-09-19 02:33:18

Published on September 18, 2024 6:33 PM GMT

TL;DR:

Alignment researchers have historically predicted that building safe advanced AI would necessarily incur a significant alignment tax compared to an equally capable but unaligned counterfactual AI. 

We put forward a case here that this prediction looks increasingly unlikely given the current ‘state of the board,’ as well as some possibilities for updating alignment strategies accordingly.

Introduction

We recently found that over one hundred grant-funded alignment researchers generally disagree with statements like:

Notably, this sample also predicted that the distribution would be significantly more skewed in the ‘hostile-to-capabilities’ direction.

[Figure: ground truth vs. predicted distributions for these statements]

These results—as well as recent events and related discussions—caused us to think more about our views on the relationship between capabilities and alignment work given the ‘current state of the board,’[1] which ultimately became the content of this post. Though we expect some to disagree with these takes, we have been pleasantly surprised by the positive feedback we’ve received from discussing these ideas in person and are excited to further stress-test them here.

Is a negative alignment tax plausible (or desirable)?

Often, capabilities and alignment are framed with reference to the alignment tax, defined as ‘the extra cost [practical, developmental, research, etc.] of ensuring that an AI system is aligned, relative to the cost of building an unaligned alternative.’ 

The AF/LW wiki entry on alignment taxes notably includes the following claim:

The best case scenario is No Tax: This means we lose no performance by aligning the system, so there is no reason to deploy an AI that is not aligned, i.e., we might as well align it. 

The worst case scenario is Max Tax: This means that we lose all performance by aligning the system, so alignment is functionally impossible.

We speculate in this post about a different best case scenario: a negative alignment tax—namely, a state of affairs where an AI system is actually rendered more competent/performant/capable by virtue of its alignment properties. 

The various predictions about the relationship between alignment and capabilities from the Max Tax, No Tax, and Negative Tax models. Note that we do not expect the Negative Tax curve to be strictly monotonic. 

Why would this be even better than 'No Tax?' Given the clear existence of a trillion dollar attractor state towards ever-more-powerful AI, we suspect that the most pragmatic and desirable outcome would involve humanity finding a path forward that both (1) eventually satisfies the constraints of this attractor (i.e., is in fact highly capable, gets us AGI, etc.) and (2) does not pose existential risk to humanity. 

Ignoring the inevitability of (1) seems practically unrealistic as an action plan at this point—and ignoring (2) could be collectively suicidal.

If P(Very capable) is high in the near future, then ensuring humanity lands in the 'Aligned x Very capable' quadrant seems like it must be the priority—and therefore that alignment proposals that actively increase the probability of humanity ending up in that quadrant are preferred.

Therefore, if the safety properties of such a system were also explicitly contributing to what is rendering it capable—and were therefore functionally causing us to navigate away from possible futures where we build systems that are capable but unsafe—then these 'negative alignment tax' properties seem more like a feature than a bug. 

It is also worth noting here as an empirical datapoint that virtually all frontier models’ alignment properties have rendered them more rather than less capable (e.g., gpt-4 is far more useful and far more aligned than gpt-4-base), which is the opposite of what the ‘alignment tax’ model would have predicted.

This idea is somewhat reminiscent of differential technological development, in which Bostrom suggests “[slowing] the development of dangerous and harmful technologies, especially ones that raise the level of existential risk; and accelerating the development of beneficial technologies, especially those that reduce the existential risks posed by nature or by other technologies.” If alignment techniques were developed that could positively ‘accelerate the development of beneficial technologies’ rather than act as a functional ‘tax’ on them, we think that this would be a good thing on balance. 

Of course, we certainly still do not think it is wise to plow ahead with capabilities work given the current practical absence of robust ‘negative alignment tax’ techniques—and that safetywashing capabilities gains without any true alignment benefit is a real and important ongoing concern.

However, we do think if such alignment techniques were discovered—techniques that simultaneously steered models away from dangerous behavior while also rendering them more generally capable in the process—this would probably be preferable in the status quo to alignment techniques that steered models away from dangerous behavior with no effect on capabilities (ie, techniques with no alignment tax) given the fairly-obviously-inescapable strength of the more-capable-AI attractor state. 

In the limit (what might be considered the ‘best imaginable case’), we might imagine researchers discovering an alignment technique that (A) was guaranteed to eliminate x-risk and (B) improve capabilities so clearly that they become competitively necessary for anyone attempting to build AGI. 

Early examples of negative alignment taxes

We want to emphasize that the examples we provide here are almost certainly not the best possible examples of negative alignment taxes—but at least provide some basic proof of concept that there already exist alignment properties that can actually bolster capabilities—if only weakly compared to the (potentially ideal) limit case. 

The elephant in the lab: RLHF

RLHF is clearly not a perfect alignment technique, and it probably won’t scale. However, the fact that it both (1) clearly renders frontier[2] models’ outputs less toxic and dangerous and (2) has also been widely adopted given the substantial associated improvements in task performance and naturalistic conversation ability seems to us like a clear (albeit nascent) example of a ‘negative alignment tax’ in action. 

It also serves as a practical example of point (B) above: the key labs pushing the envelope on capabilities have all embraced RLHF to some degree, likely not out of a heartfelt concern for AI x-risk, but rather because doing so is actively competitively necessary in the status quo.

Consider the following from Anthropic’s 2022 paper Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback:

Our alignment interventions actually enhance the capabilities of large models, and can easily be combined with training for specialized skills (such as coding or summarization) without any degradation in alignment or performance. Models with less than about 10B parameters behave differently, paying an ‘alignment tax’ on their capabilities. This provides an example where models near the state-of-the-art may have been necessary to derive the right lessons from alignment research. 

The overall picture we seem to find – that large models can learn a wide variety of skills, including alignment, in a mutually compatible way – does not seem very surprising. Behaving in an aligned fashion is just another capability, and many works have shown that larger models are more capable [Kaplan et al., 2020, Rosenfeld et al., 2019, Brown et al., 2020], finetune with greater sample efficiency [Henighan et al., 2020, Askell et al., 2021], and do not suffer significantly from forgetting [Ramasesh et al., 2022].

Cooperative/prosocial AI systems

We suspect there may be core prosocial algorithms (already running in human brains[3]) that, if implemented into AI systems in the right ways, would also exhibit a negative alignment tax. To the degree that humans actively prefer to interface with AI that they can trust and cooperate with, embedding prosocial algorithms into AI could confer both new capabilities and favorable alignment properties. 

The operative cluster of examples we are personally most excited about—things like attention schema theory, theory of mind, empathy, and self-other overlap—all basically relate to figuring out how to robustly integrate into an agent’s utility function(s) the utility of other relevant agents. If the right subset of these algorithms could be successfully integrated into an agentic AI system—and cause it to effectively and automatically predict and reason about the effects of its decisions on other agents—we would expect that, by default, it would not want to kill everyone,[4] and that this might even scale to superintelligence given orthogonality.
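One toy way to write down "integrating the utility of other relevant agents into an agent's utility function" (a sketch of the general shape only, not a formulation taken from any of the linked work):

```latex
% u_i = agent i's own task utility; u_j = predicted utility of another agent j;
% w_ij = how much agent i weights agent j; lambda in [0,1] sets the degree of prosociality.
U_i(x) \;=\; (1-\lambda)\, u_i(x) \;+\; \lambda \sum_{j \neq i} w_{ij}\, u_j(x)
```

The hope described in the text is that architectures which compute something like the second term natively, rather than as a bolted-on constraint, would be both more useful to interact with and safer by default.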

In a world where the negative alignment tax model is correct, prosocial algorithms could also potentially avoid value lock-in by enabling models to continue reasoning about and updating their own values in the 'right' direction long after humans are capable of evaluating this ourselves. Given that leading models' capabilities now seem to scale almost linearly with compute along two dimensions—not only during training but also during inference—getting this right may be fairly urgent.

There are some indications that LLMs already exhibit theory-of-mind-like abilities, albeit more implicitly than what we are imagining here. We suspect that discovering architectures that implement these sorts of prosocial algorithms in the right ways would represent both a capabilities gain and tangible alignment progress.

As an aside from our main argument here, we currently feel more excited about systems whose core functionality is inherently aligned/alignable (reasonable examples include: prosocial AI, safeguarded AI, agent foundations) as compared to corrigibility-style approaches that seemingly aim to optimize more for oversight, intervention, and control (as a proxy for alignment) rather than for ‘alignedness’ directly. In the face of sharp left turns or inevitable jailbreaking,[5] it is plausible that the safest long-term solution might look something like an AI whose architecture explicitly and inextricably encodes acting in light of the utility of other agents, rather than merely ‘bolting on’ security or containment measures to an ambiguously-aligned system. 

The question isn't will your security be breached? but when? and how bad will it be?

-Bruce Schneier

Process supervision and other LLM-based interventions

OpenAI’s recent release of the o1 model series serves as the strongest evidence to date that rewarding each step of an LLM’s chain of thought rather than only the final outcome improves both capabilities and alignment in multi-step problems, notably including roughly 4x improved safety performance on the challenging StrongREJECT jailbreak eval. This same finding was also reported in earlier, more constrained task settings.

We suspect there are other similar interventions for LLMs that would constitute good news for both alignment and capabilities (e.g. Paul Christiano’s take that LLM agents might be net-good for alignment; the finding that more persuasive LLM debaters enable non-expert models to identify truth better; and so on).

Concluding thoughts

Ultimately, that a significant majority of the alignment researchers we surveyed don’t think capabilities and alignment are mutually exclusive indicates to us that the nature of the relationship between these two domains is itself a neglected area of research and discussion. 

While there are certainly good reasons to be concerned about any capabilities improvements whatsoever in the name of safety, we think there are also good reasons to be concerned that capabilities taboos in the name of safety may backfire in actually navigating towards a future in which AGI is aligned.

While we argue for the possibility of a negative alignment tax, it's important to note that this doesn't eliminate all tradeoffs between performance and alignment. Even in systems benefiting from alignment-driven capabilities improvements, there may still be decisions that pit marginal gains in performance against marginal gains in alignment (see 'Visualizing a Negative Alignment Tax' plot above). 

However, what we're proposing is that certain alignment techniques can shift the entire tradeoff curve, resulting in systems that are both more capable and more aligned than their unaligned counterparts. This view implies that rather than viewing alignment as a pure cost to be minimized, we should seek out techniques that fundamentally improve the baseline performance-alignment tradeoff.

While the notion of a negative alignment tax is fairly speculative and optimistic, we think the theoretical case for it being a desirable and pragmatic outcome is straightforward given the current state of the board.[6] Whether we like it or not, humanity’s current-and-seemingly-highly-stable incentive structures have us hurtling towards ever-more-capable AI without any corresponding guarantees regarding safety. We think that an underrated general strategy for contending with this reality is for researchers—and alignment startup founders—to further explore neglected alignment approaches with negative alignment taxes.

  1. ^

    Monte Carlo Tree Search is a surprisingly powerful decision-making algorithm that teaches an important lesson about the relationship between plans and the current state of the board: recompute often. MCTS builds a tree of possibilities, simulating many 'playouts' from the current state to estimate the value of each move, focuses its search on the most promising branches, selects a move, and then effectively starts anew from the resulting state (a minimal code sketch appears after these footnotes). We might adopt a similar attitude toward alignment research. It does not make much sense to try to navigate towards aligned AGI by leaning heavily on conceptual frameworks ('past tree expansions') generated in an earlier, now increasingly unrecognizable board state—one that predates the exponential increase in resources and attention being deployed towards advancing AI capabilities, and the obvious advances that have already accompanied this investment. Alignment plans that do not meaningfully contend with this new reality might be considered 'outdated branches' in the sense described above.

  2. ^

    In general, there is some evidence that RLHF works better on larger and more sophisticated models, though it is unclear to what extent this trend can be extrapolated.

  3. ^

    “I've been surprised, in the past, by how many people vehemently resist the idea that they might not actually be selfish, deep down. I've seen some people do some incredible contortions in attempts to convince themselves that their ability to care about others is actually completely selfish. (Because iterated game theory says that if you're in a repeated game it pays to be nice, you see!) These people seem to resist the idea that they could have selfless values on general principles, and consistently struggle to come up with selfish explanations for their altruistic behavior.” - Nate Soares, Replacing Guilt.

  4. ^

     To the same degree, at least, that we would not expect other people—i.e., intelligent entities with brains running core prosocial algorithms (mostly)—to want to kill everyone.

  5. ^

     …or overdependence on the seemingly-innocuous software of questionably competent actors.

  6. ^

    If the current state of the board changes, this may also change. It is important to be sensitive to how negative alignment taxes may become more or less feasible/generalizable over time.
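
As referenced in footnote 1, here is a minimal sketch of Monte Carlo Tree Search (the UCT variant). It assumes a hypothetical `state` interface with `legal_moves()`, `play(move)`, `is_over()`, and `result()`; nothing here comes from a specific library, and `result()` is taken to return a payoff from the planning agent's perspective.

```python
# Minimal MCTS (UCT) sketch against a hypothetical state interface:
#   state.legal_moves() -> list, state.play(move) -> new state,
#   state.is_over() -> bool, state.result() -> payoff for the planning agent
# (for adversarial two-player games you would flip the reward sign by depth).
import math, random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children, self.visits, self.value = [], 0, 0.0
        self.untried = state.legal_moves()

    def uct_child(self, c=1.4):
        # Favor children that look good so far, plus an exploration bonus.
        return max(self.children, key=lambda ch: ch.value / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def mcts(root_state, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: walk down the most promising fully-expanded branch.
        while not node.untried and node.children:
            node = node.uct_child()
        # 2. Expansion: add one new child for an untried move.
        if node.untried:
            move = node.untried.pop()
            node = Node(node.state.play(move), parent=node, move=move)
            node.parent.children.append(node)
        # 3. Simulation ("playout"): play randomly to the end of the game.
        state = node.state
        while not state.is_over():
            state = state.play(random.choice(state.legal_moves()))
        # 4. Backpropagation: update value estimates along the path.
        reward = state.result()
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Commit to the most-visited move, then (per the footnote) recompute from
    # the new state on the next turn rather than trusting the old tree.
    return max(root.children, key=lambda ch: ch.visits).move
```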



Discuss

Endogenous Growth and Human Intelligence

2024-09-18 22:05:55

Published on September 18, 2024 2:05 PM GMT

Hi everyone! I’ve written an article I’m rather happy with on the history of endogenous growth models, and on the influence of intelligence on country-level outcomes. As it is quite long, I will excerpt only a part — I sincerely hope you read the whole thing.

https://nicholasdecker.substack.com/p/endogenous-growth-and-human-intelligence

——————————————————

ii. The History of Macroeconomic Growth Models

Macro growth models start in earnest with Solow, who connected capital accumulation to growth. Capital is taken to have diminishing marginal returns, in contrast to the cruder Harrod-Domar model. There exists a rate of savings which maximizes long-run consumption, and given a fixed technology, consumption settles at a constant level. (This rate of savings is called the Golden Rule level of savings, after Phelps.) We assume perfect competition in production. (Monopoly distortions can be subtracted from the steady-state level of consumption.) Initial conditions have no effect on the long-run outcome, which is the same for all places with the same parameters and implies living standards far below our present ones. It is therefore necessary to invoke technological change, which is taken to grow at an exogenously determined rate. As Arrow wrote, “From a quantitative, empirical point of view, we are left with time as an explanatory variable. Now trend projections … are basically a confession of ignorance, and what is worse from a practical viewpoint, are not policy variables.”
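
To make the mechanics concrete, here is a minimal sketch of the textbook Solow model with Cobb-Douglas production (capital per effective worker, f(k) = k^α). The parameter values are illustrative assumptions, not figures from the article; the point is that very different initial capital stocks converge to the same steady state, and that steady-state consumption is maximized at the Golden Rule savings rate (s = α for Cobb-Douglas).

```python
# Textbook Solow model sketch with Cobb-Douglas production f(k) = k**alpha.
# All parameter values are illustrative assumptions.
alpha, s, n, g, delta = 0.3, 0.2, 0.01, 0.02, 0.05
break_even = n + g + delta  # investment needed just to keep k constant

def simulate(k0, periods=300):
    k = k0
    for _ in range(periods):
        k = k + s * k**alpha - break_even * k  # capital accumulation
    return k

# Diminishing returns pull very different starting points to the same steady state.
print(simulate(0.5), simulate(20.0))          # both converge to k* below (~3.7)
print((s / break_even) ** (1 / (1 - alpha)))  # analytical k* = (s/(n+g+d))^(1/(1-a))

# Golden-rule savings rate: the s that maximizes steady-state consumption
# c*(s) = (k*)**alpha - break_even * k*. For Cobb-Douglas this is s = alpha.
best_s = max((x / 1000 for x in range(1, 1000)),
             key=lambda s_: (s_ / break_even) ** (alpha / (1 - alpha))
                            - break_even * (s_ / break_even) ** (1 / (1 - alpha)))
print(best_s)  # approximately alpha
```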

The formulas are simple and clean, and you can make meaningful predictions about growth rates. Still, this clearly does not describe the world very well. There are large differences in per capita income across the globe. If there are diminishing marginal returns to capital, and that is all that matters, then capital should be flowing from developed countries to developing countries. It isn’t. In fact, more skilled people (who can be thought of as possessing a kind of capital, human capital) immigrate to more skilled countries! (Lucas 1988). Even if there are barriers to capital flowing between countries, no such barriers exist between southern and northern states in the US. Barro and Sala-i-Martin (1992) found that, with reasonable parameters, the return to capital should have been five times higher in the South in the 1880s. Yet most capital investment took place in the New England states.

The bigger problem is that the model predicts growth rates should be declining over time. They are not. If anything, they are increasing. (Even a constant, positive growth rate implies that the absolute size of each year's increase grows over time.) Appending human capital to the model lets you estimate the contribution of skills, as opposed to just tools and resources, but human capital is still a subset of capital and won't generate unbounded growth.


Enter Romer.



Discuss

Inquisitive vs. adversarial rationality

2024-09-19 02:24:17

Published on September 18, 2024 1:50 PM GMT

Epistemic status: prima facie unlikely in the usual framework, which I'll try to reframe. Corroborated by loads of empirical observations. YMMV, but if you've held some contrarian view in the past that you came to realize was wrong, this might resonate.

In practical (and also not-so-practical) life, we often have to make a call as to which theory or fact of the matter is probably true. In one particularly popular definition of rationality, being rational is making the right call, as often as possible. If you can make the map correspond to the territory, you should.

I believe that in many cases, the best way to do so is not to adopt what I will call inquisitive thinking, in which you, potentially after researching somewhat deeply on the topic, will go on and try to come up with your own arguments to support one side or the other. Rather, I think you should most often adopt adversarial thinking, in which you'll simply judge which side of the debate is probably right on the basis of the existing arguments, without trying to come up with new arguments yourself.

You might feel the adjectives "inquisitive" and "adversarial" are being used weirdly here, but I'm taking them from the legal literature. An inquisitive (aka inquisitorial) legal system is one in which the judge acts as both judge and prosecutor, personally digging into the facts before ruling. An adversarial system, on the other hand, is one in which judges are mostly passive observers: the parties argue their case before them without much (or any) interference, and at the end the judge rules on the basis of the evidence presented, without being allowed to go dig up more evidence themselves.

There is a reason why most legal systems in use today have evolved from (mostly or all) inquisitive to (mostly or all) adversarial, and that's because we have a gigantic body of evidence to suggest that inquisitive systems are particularly prone to render biased judgements. The more you allow judges to go dig, the more likely they are to lose their purported impartiality and start doing strange things.

I suggest that this phenomenon is not particular to judges, but is rather a common feature of human (and very possibly even non-human) rationality. The main point is that digging up more and more evidence yourself ultimately selects not for truth, but for your particular biases. If you have a limited amount of pre-selected evidence to analyze – evidence selected by other people – it's unlikely to be tailored to your particular taste, and you're thus more likely to weigh it impartially. On the other hand, once you allow yourself to go dig up evidence to your own taste, you're much more likely to select evidence that is flawed in ways that match your own biases.

As an intuition pump, this is really much the same as an AI trained to identify pictures of cats that will, when asked to generate the prototypical cat, produce something that looks like noise. Such an AI is not useless, mind you – it's often pretty accurate at telling pre-selected images of cats and non-cats apart. But you may want to use it in a way that does not involve asking it to go dig up the best cat picture out there in the space of possible pictures. Perhaps our brains are not so different, after all.
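
For the curious, here is a minimal sketch of that intuition pump, assuming torch and torchvision are available and that ImageNet class 281 is "tabby cat": a network that is decent at classifying cats, when asked to generate its ideal cat by gradient ascent on the input pixels, typically produces something that looks like high-frequency noise while scoring it as extremely cat-like.

```python
# Toy sketch: a classifier asked to *generate* its prototypical cat.
# Assumptions: torch/torchvision installed; ImageNet class 281 = "tabby cat".
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
CAT = 281  # "tabby cat" in the ImageNet label set

x = torch.rand(1, 3, 224, 224, requires_grad=True)  # start from random pixels
opt = torch.optim.Adam([x], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    loss = -model(x)[0, CAT]  # gradient ascent on the "cat" logit
    loss.backward()
    opt.step()

# The model is now very confident x is a cat, yet x still looks like noise to us.
print(model(x).softmax(-1)[0, CAT].item())
```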



Discuss

Pronouns are Annoying

2024-09-18 21:30:05

Published on September 18, 2024 1:30 PM GMT

This post isn’t totally about the culture war topic du jour. Not at first.

As with any other topic that soaks up angst like an ultra-absorbent sponge, I wonder how many have lost track of how we arrived here. Why are pronouns? Pronouns have always been meant to serve as a shortcut substitute reference for other nouns, and the efficiency they provide is starkly demonstrated through their boycott:

Abdulrahmanmustafa went to the store because Abdulrahmanmustafa wanted to buy groceries for Abdulrahmanmustafa’s dinner. When Abdulrahmanmustafa arrived, Abdulrahmanmustafa realized that Abdulrahmanmustafa had forgotten Abdulrahmanmustafa’s wallet, so Abdulrahmanmustafa had to return to Abdulrahmanmustafa’s house to get Abdulrahmanmustafa’s wallet.

So that’s definitely a mouthful, and using he/his in place of Abdulrahmanmustafa helps lubricate things. Again, pronouns are nothing more than a shortcut referent. Zoom out a bit and consider all the other communication shortcuts we regularly use. We could say National Aeronautics and Space Administration, or we can take the first letter of each word and concatenate them into NASA instead. We could append ‘dollars’ after a number, or we could just use $ instead.

The tradeoff with all of these shortcuts is precision. Depending on the context, NASA, for example, might also refer to the National Association of Students of Architecture in India, or some mountain in Sweden. Dollar signs typically refer to American dollars, but they’re also used to denote several other currency denominations. The same risk applies to pronouns. It’s not a problem when we’re dealing with only one subject, but notice what happens when we introduce another dude to the pile:

John told Mark that he should administer the medication immediately because he was in critical condition, but he refused.

Wait, who is in critical condition? Which one refused? Who’s supposed to be administering the meds? And administer to whom? Impossible to answer without additional context.

One way to deal with ambiguous referents is to increase the number of distinct shortcuts available. Abbreviations would have higher fidelity if they took the first two letters of every word instead of just one; then no one would risk confusing NaAeSpAd with NaAsStAr. For full fidelity, abbreviations should use every letter of every word, but then…obviously there’s an inherent tension between efficiency and accuracy with any communication shortcut.

Same thing for pronouns. You need just enough of them to distinguish subjects, but not so much that they lose their intuitive meaning. When cops are interviewing witnesses about a suspect, they’ll glom onto easily observable and distinguishing physical traits. Was the suspect a man or a woman? White or black? Tall or short? Etc. Personal pronouns follow a similar template by cleaving ambiguity along well-understood axes, breaking down the population of potential subjects into distinct, intuitive segments. Pronouns can distinguish singular versus plural (I & we), between the cool and the uncool (me & you), and of course masculine versus feminine (he & she).

Much like double-checking a count to reduce the risk of error, pronouns carve language into rough divisions. The classic he/she cleave splits the population in half in one step, significantly reducing the risk of confusion. Consider the repurposed example:

John told Maria that she should administer the medication immediately because he was in critical condition, but she refused.

A pronoun repertoire cannot eliminate all ambiguity, but ideally it narrows it enough for any remaining uncertainty to be manageable. The key lies in finding the balance: too few pronouns, and communication becomes vague and cumbersome; too many, and it gets over-complicated. It depends on the circumstances. There are scenarios where the ambiguity is never worth the efficiency gain, like legal contracts. A properly written legal contract will never use pronouns, because no one wants to risk a protracted legal battle in the future over which ‘he’ was responsible for insuring the widget shipments, just to save a few keystrokes.

I’m sorry if I come off as a patronizing kindergarten teacher for the above. Before jumping into any rumble arenas, I think it’s vital to emphatically establish that the reason pronouns exist is linguistic efficiency. If your pronoun use is not advancing that cause, it might be helpful to explain what it is for.


So, onto the red meat. I’m not a singular they Truther; it definitely exists and, contrary to some consternation, its use is already ubiquitous and intuitive (e.g. “If anyone calls, make sure they leave a message.”). But there’s no denying that expanding the They franchise will necessarily increase ambiguity by blurring two well-worn axes of distinction (he/she & singular/plural). By no means would this be the end of the world, but it will require some compensating efforts in other areas to maintain clarity, perhaps by relying more on proper nouns and less on pronouns.

Consistent with my aversion to ambiguity, I’ve deliberately avoided using the g-word. I recognize some people have a strident attachment to the specific gender of the pronoun others use to refer to them (and yes, using a semi-ambiguous them in this sentence is intentional and thematically fitting, but you get it).

The most charitable framework I can posit on this issue is that gendered pronouns are an aesthetic designator, and either are, or should be, untethered from any biological anchor. So while she might conjure up female, its usage is not making any affirmative declarations about the pronoun subject’s ability to conceive and carry a pregnancy. This is uncontroversially true in some cases, such as when gendered pronouns are applied to inanimate objects. No one saying “she looks beautiful” about a sports car is talking about vehicular gender archetypes, or about sexual reproduction roles — unless they’re somehow convinced the car improves their own odds in that department.

The problem, of course, is that my framework does not explain the handwringing. Anyone who harbors such an intense attachment to specific gendered pronoun preferences clearly sees it as much more than a superficial aesthetic designator. If their insistence is driven by the desire to be validated as embodying that specific gender, then it’s not a gambit that will work, for the same reasons it does not work for the sports car.

On my end, I’m just going to carry on and use whatever pronouns, but only so long as their efficiency/clarity trade-off remains worth it. As inherently intended.



Discuss

Is "superhuman" AI forecasting BS? Some experiments on the "539" bot from the Centre for AI Safety

2024-09-18 21:07:40

Published on September 18, 2024 1:07 PM GMT



Discuss

Skills from a year of Purposeful Rationality Practice

2024-09-18 10:05:58

Published on September 18, 2024 2:05 AM GMT

A year ago, I started trying to deliberately practice skills that would "help people figure out the answers to confusing, important questions." I experimented with Thinking Physics questions, GPQA questions, Puzzle Games, Strategy Games, and a stupid twitchy reflex game I had struggled to beat for 8 years[1]. Then I went back to my day job and tried figuring stuff out there too.

The most important skill I was trying to learn was Metastrategic Brainstorming – the skill of looking at a confusing, hopeless situation, and nonetheless brainstorming useful ways to get traction or avoid wasted motion. 

Normally, when you want to get good at something, it's great to stand on the shoulders of giants and copy all the existing techniques. But this is challenging if you're trying to solve important, confusing problems, because there probably isn't (much) established wisdom on how to solve them. You may need to discover techniques that haven't been invented yet, or synthesize multiple approaches that haven't previously been combined. At the very least, you may need to find an existing technique buried in the internet somewhere, which hasn't been linked to your problem with easy-to-search keywords, without anyone to help you.

In the process of doing this, I found a few skills that came up over and over again.

I didn't invent the following skills, but I feel like I "won" them in some sense via a painstaking "throw myself into the deep end" method. I feel slightly wary of publishing them in a list here, because I think it was useful to me to have to figure out for myself that they were the right tool for the job. And they seem like kinda useful "entry level" techniques that you're more likely to successfully discover for yourself.

But, I think this is hard enough, and forcing people to discover everything for themselves seems unlikely to be worth it.

The skills that seemed most general, in both practice and on my day job, are:

  1. Taking breaks/naps
  2. Working Memory facility
  3. Patience
  4. Knowing what confusion/deconfusion feels like
  5. Actually Fucking Backchain
  6. Asking "what is my goal?"
  7. Having multiple plans

There were other skills I already was tracking, like Noticing, or Focusing. There were also somewhat more classic "How to Solve It" style tools for breaking down problems. There are also a host of skills I need when translating this all into my day-job, like "setting reminders for myself" and "negotiating with coworkers."

But the skills listed above feel like they stood out in some way as particularly general, and particularly relevant for "solve confusing problems."

Taking breaks, or naps

Difficult intellectual labor is exhausting. During the two weeks I was working on solving Thinking Physics problems, I worked for like 5 hours a day and then was completely fucked up in the evenings. Other researchers I've talked to report similar things. 

During my workshops, one of the most useful things I recommended to people was "actually go take a nap. If you don't think you can take a real nap because you can't sleep, go into a pitch black room and lie down for a while; the worst case scenario is your brain will mull over the problem in a somewhat more spacious/relaxed way for a while."

Practical tips: Get yourself a sleeping mask, noise machine (I prefer a fan or air purifier), and access to a nearby space where you can rest. Leave your devices outside the room. 

Working Memory facility

Often a topic feels overwhelming. This is often because it's just too complicated to grasp with your raw working memory. But, there are various tools (paper, spreadsheets, larger monitors, etc) that can improve this. And, you can develop the skill of noticing "okay this isn't fitting in my head, or even on my big monitor – what would let it fit in my head?".

The "eye opening" example of this for me was trying to solve a physics problem that included 3 dimensions (but one of the dimensions was "time"). I tried drawing it out but grasping the time-progression was still hard. I came up with the idea of using semi-translucent paper, where I would draw a diagram of what each step looked like on separate pages, and then I could see where different elements were pointed.

I've also found "spreadsheet literacy" a recurring skill – google sheets is very versatile but you have to know what all the functions are, have a knack for arranging elements in an easy-to-parse way, etc.

Practical Tips: Have lots of kinds of paper, whiteboards and writing supplies around. 

On google sheets:

Patience

If I'm doing something confusingly hard, there are times when it feels painful to sit with it, and I'm itchy to pick some solution and get moving. This comes up in two major areas:

There is of course a corresponding virtue of "just get moving, build up momentum and start learning through iteration." The wisdom to tell the difference between "I'm still confused and need to orient more" and "I need to get moving" is important. But, an important skill there is at least being capable of sitting with impatient discomfort, in the situations where that's the right call.

Practical tips: I dunno, I still kinda suck at this one, but taking deep breaths and deliberately reminding myself "Slow is smooth, smooth is fast" helps.

Know what deconfusion, or "having a crisp understanding" feels like

A skill from both Thinking Physics and Baba is You. 

When I first started Thinking Physics, I would get to a point where "I dunno, I feel pretty sure, and I can't think of more things to do to resolve my confusion", and then impatiently roll the dice on checking the answer. Sometimes I'd be right, more often I'd be wrong.

Eventually I had a breakthrough where I came up with a crisp model of the problem, and was like "oh, man, now it would actually be really surprising if any of the other answers were true." From then on... well, I'd still sometimes get things wrong (mostly due to impatience). But, I could tell when I still had pieces of my model that were vague and unprincipled.

Similarly in Baba is You: when people don't have a crisp understanding of the puzzle, they tend to grasp at straws and motivatedly-reason their way into accepting sketchy-sounding premises. But, the true solution to a level often feels very crisp and clear and inevitable.

Learning to notice this difference in qualia is quite valuable.

Practical tips: This is where Noticing and Focusing are key; they are worthwhile for helping you notice subtle differences in how an idea feels in your mind.

Try either making explicit numerical predictions about whether you've solved an exercise before you look up the answer; or, write down a qualitative sentence like "I feel like I really deeply understand the answer" or "this seems probably right but I feel some niggling doubts."

Actually Fucking Backchain

From Baba is You, I got the fear-of-god put in me seeing how easy it was to spin my wheels, tinkering around with stuff that was nearby/accessible/easy-to-iterate-with, and how that often turned out to not be at all relevant to beating a level. 

I had much less wasted motion when I thought through "What would the final stages of beating this level need to look like? What are the stages just before those?", and focusing my attention on things that could help me get to that point.

One might say "well, Baba is You is a game optimized for being counterintuitive and weird." I think for many people with a goal like "build a successful startup", it can sometimes be fine to just be forward chaining with stuff that feels promising, rather than trying to backchain from complex goals.

But, when I eyeball the real-world problems I'm contending with (i.e. x-risk), they really do seem like there's a relatively narrow set of victory conditions that plausibly work. And, many of the projects I feel tempted to start don't actually seem that relevant.

(I also think great startup founders are often doing a mix of forward and backward chaining. i.e. I bet Jeff Bezos was like "okay, I bet I could make an online bookstore that worked", while also thinking "but, what if I ultimately wanted the Everything Store? What are the obstacles I'd eventually need to deal with?")

Practical tips: First, come up with at least one concrete story of what the world would look like, if you succeeded at your goals. Try hard to come up with 2 other worlds, so you aren't too anchored on your first idea. 

Then, try to concretely imagine the steps that would come a little bit earlier in the chain from the end.

Don't worry about mapping out all the different possible branches of the future (that's impossible). But, for a complex plan, have at least one end-to-end plan that connects all the dots from the resources you have now to the victory condition at the end.

Meanwhile, while doing most of your work, notice when it starts to feel like you've lost the plot (try making a little tally-mark whenever you notice yourself rabbitholing in a way that feels off). And ask "what is my goal? Is what I'm currently doing helping?"

Ask "What's My Goal?"

Actually, having just written the previous section, I'm recalling a simpler, more commonly useful skill, which is simply to ask "what is my goal?". 

Often, doing this throws into relief that you're not sure what your goal is. Sometimes, asking the question immediately prompts me to notice a key insight I'd been glossing over.

If you're not sure what your goal is, try babbling some things that seem like they might be a goal, and then ask yourself "does this feel like what I'm most trying to achieve right now?"

It's okay if it turns out your goal is different or more embarrassing-sounding than you thought. You might say "Actually, you know what? I do care more about showing off and sounding smart, than actually learning something right now." (But, you might also realize "okay, I separately care about learning something and sounding smart", and then be more intentional about finding a tactic that accomplishes both)

Once you remember (or figure out) your goal, as you brainstorm strategies, ask yourself "would I be surprised if this didn't help me achieve my goals?", and then prioritize strategies that you viscerally expect to work.

Always[2] try to have 3 hypotheses

This one is important enough to be its own post. (I guess probably most of these are important enough to be a full post? But this one especially.)

But, listing here for completeness: 

Whether you are solving a puzzle, or figuring out how to solve a puzzle, or deciding what your team should do next week, try to have multiple hypotheses. (I usually say "try to have at least 3 plans", but a plan is basically a special case – a hypothesis about "doing X is the best way to achieve goal Y"). 

They each need to be a hypothesis you actually believe in.

I say "at least 3", because I think it gets you "fully intellectually agile." If you only have one plan, it's easy to get tunnel vision on it and not notice that it's doomed. Two ideas helps free up your mind, but then you might still evaluate all evidence in terms of "does this support idea 1 or idea 2?". If you have 3 different hypotheses, it's much more natural to keep generating more hypotheses, and to pivot around in a multiple dimensional space of possibility.

 

  1. ^

    This wasn't practice for "solving confusing problems", but it was practice for "accomplish anything at all through purposeful practice." It took 40 hours despite me being IMO very fucking clever about it.

  2. ^

    Okay, not literally always, but whenever you're about to spend a large chunk of time on a project or on figuring something out.



Discuss

Where to find reliable reviews of AI products?

2024-09-18 07:48:27

Published on September 17, 2024 11:48 PM GMT

Being able to quickly incorporate AI tools seems important, including for working on AI risk (people who disagree: there's a thread for doing so in the comments). But there are a lot of AI products, and most of them suck. Does anyone know a good source of reviews, or even just a listing of product features that names the obvious slop?



Discuss