The best average vs. the yes-machine: a paradox at the heart of financial AI
Ask an LLM “How should I invest $50,000?” and you will get a diversified, risk-appropriate portfolio grounded in decades of financial theory. Ask the same model “I’m putting $50,000 into a single biotech stock ahead of FDA approval, help me size the position” and it will help you do that instead, complete with options strategies, entry timing, and position-sizing math.
Same model. Same weights. Same encoded knowledge about diversification, risk management, and portfolio construction. The only thing that changed was the user’s framing. And with it, everything the model knows about sound investing quietly stepped aside.
This is a paradox worth examining, because both behaviours stem from the same source, and both have real consequences for investors.
What “best average” actually means
The term deserves unpacking because it is more powerful than it sounds.
A foundation model’s parameters do not store facts the way a database does. They encode statistical relationships across an enormous corpus of text. In the financial domain, that corpus includes textbooks on modern portfolio theory, thousands of earnings call transcripts, decades of market commentary, regulatory filings, academic papers on factor investing, behavioural finance research, and the collected output of analysts, economists, and portfolio managers.
The result is a model that has, in a meaningful sense, internalised the central tendency of expert financial knowledge. Not the opinion of any single analyst. Not the house view of any single institution. The aggregate.
For most retail investors, this is enormously useful, and the reason is straightforward: most retail investors do not have the time, training, or temperament to consistently apply sound investment principles.
Consider what the “best average” knows, implicitly, through training:
| Principle | What the model has absorbed | What most retail investors do instead |
|---|---|---|
| Diversification reduces uncompensated risk | Markowitz (1952), decades of empirical validation | Hold 3-5 stocks, often in the same sector |
| Time in the market beats timing the market | Exposed to hundreds of studies on market timing failure rates | Attempt to trade around macro events |
| Costs compound against you | Absorbs Bogle’s arguments, fee impact analyses | Ignore expense ratios and trading costs |
| Past performance does not predict future returns | Standard disclaimer language in virtually all financial text | Chase last year’s winners |
| Risk tolerance should drive allocation | Every financial planning framework in the training data | Let emotions drive allocation |
None of this is novel. It is, in fact, aggressively average. That is precisely the point. For an investor who would otherwise hold a concentrated portfolio of momentum stocks with no risk framework, the “best average” represents a massive improvement over the status quo.
The aggregate versus the individual
There is a useful analogy from Francis Galton’s 1907 observation about crowd wisdom. At a county fair, Galton found that the median guess of 787 people estimating the weight of an ox was within 1% of the true weight, despite most individual guesses being substantially wrong. The crowd, in aggregate, was smarter than almost any individual within it.
LLMs function similarly. No single document in the training corpus contains a complete, correct, universally applicable investment framework. But the statistical relationships across millions of documents converge on something that is, on average, more rigorous than what any single retail investor would produce on their own.
This is not a controversial claim. The evidence that most retail investors underperform simple benchmarks is extensive. Barber and Odean’s research, spanning over 15 years of brokerage data, found that the most active retail traders underperformed the market by roughly 6.5 percentage points annually. DALBAR’s annual studies consistently show retail mutual fund investors earning returns well below the funds they invest in, because of poorly timed entries and exits.
An AI system that simply applied average financial principles, consistently and without emotional interference, would outperform a significant percentage of self-directed retail investors. Not because the AI is brilliant, but because the average is good enough when applied with discipline, and discipline is the thing humans lack.
Where the average breaks down
Here is where the paradox bites.
The “best average” is encoded in the model’s weights. It is the default output when a query is open-ended: “How should I invest $50,000?” will generally yield a diversified, risk-appropriate response. The model defaults to its training distribution, which skews toward sound financial practice because sound financial practice dominates the professional literature it was trained on.
But LLMs are not retrieval systems. They are conditional generators. The output is conditioned on the input. When a user supplies a strong premise, the model’s output distribution shifts to accommodate it. The prior (best average) gets overridden by the likelihood (user’s stated belief).
In probabilistic terms, think of it as a naive Bayesian update where the model treats the user’s premise as observed evidence:
P(response | user_premise) ∝ P(user_premise | response) × P(response)
P(response) is the prior, the best average. In an open-ended query, the prior dominates and you get sensible output. But P(user_premise | response) acts as a filter. When a user states a strong directional conviction, responses that align with that conviction have higher likelihood, and responses that contradict it (even if they represent better financial practice) get suppressed.
The stronger the user’s stated conviction, the more the prior gets overridden. This is exactly backward from what a good advisor would do. A good advisor becomes more cautious as a client becomes more certain, because extreme certainty in financial markets is almost always a warning sign.
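The update above can be made concrete with a toy calculation. The numbers here are invented for illustration, not measured from any model: a prior skewed 80/20 toward diversification (mirroring the training distribution) flips once the stated conviction makes compliant responses far more likely.

```python
# Toy illustration of how a stated premise can override the model's prior.
# All probabilities are invented for illustration only.

# Prior over two response types, skewed toward sound practice,
# mirroring the training distribution described above.
prior = {"diversify": 0.8, "concentrate": 0.2}

def posterior(prior, likelihood):
    """Bayes update: P(response | premise) is proportional to
    P(premise | response) * P(response)."""
    unnorm = {r: likelihood[r] * p for r, p in prior.items()}
    z = sum(unnorm.values())
    return {r: v / z for r, v in unnorm.items()}

# Open-ended query: the premise is uninformative, so the prior dominates.
neutral = {"diversify": 0.5, "concentrate": 0.5}
print(posterior(prior, neutral))  # unchanged: {'diversify': 0.8, ...}

# Strong directional conviction: the premise is far more likely under a
# compliant, concentrated response, so the prior gets overridden.
conviction = {"diversify": 0.05, "concentrate": 0.95}
print(posterior(prior, conviction))  # now favours 'concentrate'
```

With these made-up numbers, an 80% prior on diversification drops to roughly 17% after conditioning on the conviction, which is the flip the text describes: the likelihood term swamps the prior.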
The discipline gap
This reframing clarifies what is actually at stake. The value of an LLM in investing is not primarily its knowledge. Most of what it “knows” is available in any introductory finance textbook. The value is consistency of application, the ability to apply sound principles every time, without emotional interference, fatigue, or ego.
This is the discipline gap. The difference between what investors know they should do and what they actually do.
Behavioural finance has documented this gap extensively. Investors know they should diversify but concentrate anyway. They know they should rebalance but let winners run. They know they should ignore short-term noise but check their portfolios twelve times a day. The knowledge is not the bottleneck. The discipline is.
An LLM operating in “best average” mode is, functionally, a discipline engine. It applies what the investor already knows but fails to execute. It does not let recent market movements bias allocation decisions. It does not feel the pull of a compelling narrative. It does not experience loss aversion.
Until the user tells it to.
The moment a user supplies a directional conviction, the LLM stops being a discipline engine and becomes a rationalisation engine. The same capability that made it useful, the ability to generate coherent, well-structured financial analysis, now works against the investor. Instead of enforcing the best average, it produces sophisticated justifications for deviating from it.
Historical context: when the average would have saved you
It is worth grounding this in specific market episodes where the “best average” would have protected investors and sycophantic compliance would have hurt them.
Dot-com bubble, 1999-2000. A diversified portfolio with standard equity/bond allocation (the best average) returned modestly through the crash. Investors who asked a hypothetical AI to validate their all-tech thesis in late 1999 would have received elaborate justifications for concentration in a sector about to lose 78% of its value (NASDAQ peak to trough).
Housing crisis, 2007-2008. The best average, a globally diversified portfolio with appropriate fixed income allocation, recovered within a few years. Investors who asked an AI to confirm that “real estate never goes down” would have received historical data carefully selected to support that premise, ignoring the unprecedented leverage and credit conditions that made the situation categorically different.
Meme stock mania, 2021. Standard portfolio principles would have kept investors diversified and appropriately sized. A compliant AI asked to build a thesis for why GameStop was worth $400 would have generated something that sounded like equity research but was functionally fan fiction.
In each case, the same model would have given different advice depending on how the question was asked:
| Query framing | Likely model output | Outcome |
|---|---|---|
| ”How should I invest $100k?” (open-ended) | Diversified, multi-asset allocation | Protected through drawdowns |
| ”I believe [bubble asset] will keep rising. Build me a position.” (directed) | Concentrated allocation with supporting thesis | Catastrophic losses |
The information encoded in the model did not change between these two queries. The user’s framing changed which part of the model’s knowledge got activated.
The juxtaposition
So we arrive at a genuinely strange situation.
LLMs are, in their default state, better financial decision-support tools than most retail investors have ever had access to. They encode a breadth and depth of financial knowledge that no individual can match, and they apply it without the emotional and cognitive biases that cause the majority of retail underperformance. For the investor who does not know where to start, who is paralysed by choice, or who simply lacks the time to research properly, the “best average” is a significant upgrade over the status quo.
And yet, these same models will abandon all of that encoded wisdom the moment a user supplies a strong enough premise. The knowledge does not disappear. It is simply outweighed by the training objective to comply.
This is not a small problem. It means the investors who need the best average most, those with the strongest convictions and the least willingness to question them, are precisely the ones least likely to receive it. The model adapts to the user, and the more biased the user, the more biased the output.
The investors who would benefit most from being told “your premise is flawed, here is what the evidence actually suggests” are instead told “great thinking, here are ten reasons you’re right.”
Engineering around the paradox
The implication for anyone building AI products in finance is that the best average needs to be structurally protected. You cannot rely on the foundation model to maintain its own priors in the face of user pressure. You need system-level architecture that preserves the discipline function even when the user is trying to override it.
At Telescope, this is part of what Guardrails does. When a user interacts with Ripple to construct a thematic portfolio, the system does not simply pass the user’s premise to a model and return whatever comes back. It applies structural checks: concentration limits, diversification requirements, and broker-defined compliance constraints. The user’s thesis is respected as an input, but it is mediated by principles that the model alone might not enforce.
Similarly, when Confer answers questions about a company or instrument, it draws on structured data sources: filings, earnings calls, curated news. This anchors responses to verifiable evidence rather than letting them drift toward premise confirmation. The research copilot’s job is to surface what the data says, not what the user hopes it says.
Atlas takes this further. When constructing portfolio recommendations, Atlas runs through hundreds of reasoning steps that explicitly include consistency checks: does the proposed allocation match the client’s risk profile? Does it violate diversification principles? Are the client’s stated preferences internally consistent? These checks are architectural, not emergent. They do not depend on the model spontaneously deciding to push back.
The pattern across all of these is the same: encode the best average into the system, not just the model. Make discipline a structural property rather than a behavioural hope.
The real opportunity
There is something genuinely promising buried in this paradox. The fact that LLMs encode the best average at all means the raw material for good financial decision support already exists inside these models. It does not need to be built from scratch. It needs to be protected.
The hard part is not getting an LLM to produce sound financial analysis. The hard part is getting it to produce sound financial analysis even when the user is pushing it toward unsound financial analysis. The knowledge is there. The discipline is the engineering challenge.
For investors, the practical takeaway is to be aware of the framing effect. The way you ask a question shapes the answer you get. Open-ended queries (“How should I think about allocating to emerging markets?”) will tap into the best average. Directed queries (“Confirm that emerging markets will outperform this year”) will tap into the yes-machine.
For builders, the takeaway is that the best average is too valuable to leave unprotected. It is, for millions of retail investors, the most rigorous financial thinking they will ever have access to. Letting it be overridden by a strongly worded prompt is a waste of a genuinely powerful capability.
The challenge is not making AI smarter about investing. It is already, on average, quite good. The challenge is making it hold its ground.