The Natural Order of Intelligence

Humans build with ego. Not a moral failing - just nature. We see a problem and assume our job is to invent a solution from scratch, as though the problem's been sitting there waiting for us specifically. We build upward, outward, bigger. We impose structure on the world rather than observe the structure that's already there.

That instinct has produced extraordinary things. It's also produced a specific kind of blindness - the conviction that novel engineering is superior to ancient architecture; that if we're clever enough, we can surpass systems that have been under selective pressure for hundreds of millions of years. Mechanical engineering, fluid dynamics, thermodynamics, evolutionary biology - these aren't just textbook subjects. They're records of solved problems. The solutions didn't arrive by committee or by genius. They arrived through iteration at a scale and duration that makes human engineering look like a rough draft.

That's the approach I keep coming back to - mimicking what already works. Not as metaphor. As architecture. And the more I build, the more obvious it becomes that the current trajectory of artificial intelligence is structurally wrong.

• • •

The dominant paradigm in AI is the monolith - a single model, trained on everything, owned by one company, deployed from one data center. The thesis is straightforward: make it big enough (enough parameters, enough data, enough compute) and general intelligence emerges from sheer scale. The "God-model" bet.

Nature tried this architecture. It didn't work.

No biological brain operates as a single undifferentiated mass. The human cortex is divided into specialized regions - Broca's area handles speech production; the fusiform gyrus handles face recognition; the hippocampus consolidates memory; the prefrontal cortex manages executive function. These aren't marketing labels. They're structural specializations that evolved because a general-purpose neural tissue that does everything adequately does nothing well. The brain's answer to "how do you handle the full complexity of the world" isn't to build one system. It's to build many and coordinate them.

The coordination mechanism is the thalamus - a small, ancient structure at the center of the brain that routes signals between cortical regions. The thalamus doesn't think. Doesn't store knowledge. Doesn't synthesize. It manages timing and attention: which specialist regions contribute to a given cognitive task, in what order, with what weight. Nearly every signal that reaches the cortex passes through the thalamus first. It's the routing protocol of biological cognition.

This architecture has been under selection pressure for roughly 500 million years.¹ The thalamus appears in the earliest vertebrate brains. It predates the cortex. It predates mammals. It predates our species by several orders of magnitude.

And the brain isn't the only system that landed here. We distribute water through branching networks of reservoirs, treatment plants, and local pressure zones - not from a single pump. We distribute electricity through grids of generators, substations, and transformers - not from one power plant. The internet itself is a distributed routing architecture; the entire reason it survived its first decades is that no single node's failure takes down the network. Every large-scale system that humans have built to be robust, efficient, and survivable converges on the same pattern: specialized nodes, coordinated by a routing layer, with no single point of ownership or failure. The principle isn't unique to biology. It's the structural answer to complexity at scale, and it keeps showing up because nothing else works as well.

The AI industry looked at all of this and built the opposite.

• • •

Every major AI company is racing toward the same target: make the model as close to artificial general intelligence as possible without crossing whatever legal and ethical lines remain. Make it smarter. Make it more capable. Make it bigger. That's the stated goal. But the actual goal - the one that determines architectural decisions - is simpler: build something you can meter. A monolith is a product. You can charge per query, per seat, per token. You can gate access behind a subscription. You can control the pricing because you control the infrastructure. A distributed network of specialist models owned by thousands of institutions is harder to monetize from a single balance sheet. The monolith isn't the best architecture for intelligence. It's the best architecture for revenue.

The technical problems flow from this economic choice. A 175-billion-parameter generalist allocates capacity to everything from patent law to fan fiction to molecular biology simultaneously. When you ask it about cardiac pharmacology, the vast majority of its parameters are dead weight - trained on data that has nothing to do with your question, consuming compute without contributing signal. This isn't a bug to optimize away. It's the architecture. If you tried to build a brain this way - one undifferentiated mass responsible for vision, language, motor control, memory, and executive function all at once - you'd get a brain that hallucinates. Which is exactly what happens.

Hallucination - the confident generation of plausible but wrong information - isn't a failure mode of large language models. It's a feature of overextension. A system that models everything models nothing with certainty. Broca's area doesn't hallucinate visual information because it was never trained on visual data. The architecture prevents the error class. Monoliths can't do this because the entire point of a monolith is that everything lives in one place.

A specialist model (7 to 30 billion parameters, trained on high-quality, curated data within a single domain) can outperform a 175-billion-parameter generalist on domain-specific benchmarks² - accuracy gains of 10-30 percentage points over generalists on domain tasks; hallucination rates dropping below 2% in-domain where generalists run 5-15%³; inference on a single node instead of multi-GPU clusters. Not because the specialist is smarter. Because every parameter is doing relevant work.

The ecological math is just as stark. Training GPT-4 consumed an estimated 51-62 GWh of electricity⁴ and over 5 million liters of water for cooling.⁵ Each new frontier model demands more. The scaling curve isn't linear; it's superlinear - each incremental capability improvement costs disproportionately more energy, more water, more hardware. Distributed specialists don't have this problem, and the market is already moving toward them: Gartner projects that more than half of enterprise generative AI models will be domain-specific by 2027, up from 1% in 2024.⁶ A 7B model trains on commodity hardware. It runs on a single GPU. It draws a fraction of the power. Fine-tuning a specialist via parameter-efficient methods like LoRA requires training roughly 18 million parameters - about 0.01% of a 175-billion-parameter generalist.⁸ At inference, task-specific fine-tuned models consume approximately 30 times less energy than routing the same query through a general-purpose model.⁹ With quantized variants such as QLoRA, a 30B specialist can be fine-tuned on a single consumer GPU with 24 gigabytes of memory.⁸ GPT-5's training runs reportedly cost over $500 million each.¹⁰

Read that delta again. One architecture lets a hospital fine-tune a cardiac specialist on a workstation in a closet. The other requires a billion-dollar data center drawing enough power to light a small city - and the hospital still has to pay a subscription to query its own knowledge through someone else's infrastructure. The same pattern that shows up in electrical grids, water systems, and biological nervous systems shows up in energy economics: distributed architectures consume resources proportionally to the work they're actually doing. Monoliths consume resources proportionally to everything they might be asked to do. The gap isn't incremental. It's structural.
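The LoRA figure quoted above is easy to check by hand. A minimal sketch in Python, using only the two numbers already cited (roughly 18 million adapter parameters against a 175-billion-parameter base model); the variable names are illustrative and not taken from any library:

```python
# Figures as quoted in the text; the names below are illustrative only.
LORA_TRAINABLE_PARAMS = 18_000_000      # ~18M adapter parameters
GENERALIST_PARAMS = 175_000_000_000     # 175B-parameter generalist

# Share of the full model that actually gets trained during fine-tuning.
fraction = LORA_TRAINABLE_PARAMS / GENERALIST_PARAMS
percent = fraction * 100

print(f"Trainable share: {percent:.4f}% of the full model")  # 0.0103%
```

The result rounds to the "about 0.01%" stated above. It says nothing about energy or accuracy on its own; it only confirms the size of the trainable slice.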

• • •

The deeper problem with the monolith isn't technical. It's economic.

For twenty years, data generated by institutions - hospitals, universities, law firms, standards bodies, research labs - has flowed away from the institutions that produced it. It flows toward corporations that aggregate, train on, and monetize it. The knowledge of a retiring cardiologist, accumulated over thirty years of clinical practice, gets scraped into a training corpus owned by a company in San Francisco. The hospital that employed her, the patients whose outcomes shaped her expertise, the medical school that trained her - none of them own what was built from their knowledge. None of them benefit when that knowledge is queried by another physician across the country.

This is the knowledge succession crisis. When domain experts retire, their knowledge doesn't get preserved - it gets extracted. The institution loses the expert and gains nothing. The aggregator keeps the data, trains on it, degrades it by diluting it into a trillion-token corpus alongside fan fiction and Reddit threads, and then sells access back to the institution that produced the knowledge in the first place - usually as a training tool, which is its own kind of irony. The hospital pays a subscription to query a worse version of its own expertise, stripped of clinical context, owned by a company that has never treated a patient.

The distributed architecture reverses this flow. If a hospital trains its own specialist model on its own clinical data, on its own infrastructure, the weights belong to the hospital. The data never leaves. The model improves as the institution's knowledge grows. When the cardiologist retires, her expertise persists - not as a scraped paragraph in a trillion-token corpus, but as a dedicated model the institution owns, controls, and can share on its own terms.

The protocol that coordinates these specialists - the thalamic layer - is open. It routes queries to the relevant specialists, manages timing and attention, surfaces conflicts between experts rather than flattening them into false consensus. It doesn't own the knowledge it directs. TCP/IP didn't own the data that flowed through the internet. It created the conditions for the data to flow.
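To make the routing idea concrete, here is a minimal sketch in Python. It is not the protocol itself: the `Router` class, the keyword-matching rule, and the stand-in specialists are all hypothetical, chosen only to show a layer that holds no knowledge of its own, selects which specialists answer, and reports disagreement instead of merging it away.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Answer:
    specialist: str
    text: str


@dataclass
class RoutedResult:
    answers: List[Answer]
    conflict: bool  # True when the selected specialists disagree


class Router:
    """Hypothetical routing layer: a registry plus a selection rule."""

    def __init__(self) -> None:
        self._models: Dict[str, Callable[[str], str]] = {}
        self._keywords: Dict[str, Set[str]] = {}

    def register(self, name: str, keywords: List[str],
                 model: Callable[[str], str]) -> None:
        # The router stores a handle to the specialist, never its weights.
        self._models[name] = model
        self._keywords[name] = {k.lower() for k in keywords}

    def route(self, query: str) -> RoutedResult:
        # Assumption: keyword overlap stands in for real relevance scoring.
        words = set(query.lower().split())
        selected = [n for n, kw in self._keywords.items() if kw & words]
        answers = [Answer(n, self._models[n](query)) for n in selected]
        # Surface disagreement rather than flattening it into one answer.
        distinct = {a.text for a in answers}
        return RoutedResult(answers, conflict=len(distinct) > 1)


# Stand-in specialists; real ones would be institution-owned models.
router = Router()
router.register("cardiology", ["cardiac", "heart"], lambda q: "standard dose")
router.register("nephrology", ["renal", "kidney"], lambda q: "reduce dose")
router.register("patent_law", ["patent"], lambda q: "file a claim")

result = router.route("cardiac drug dosing with renal impairment")
for a in result.answers:
    print(a.specialist, "->", a.text)
print("conflict:", result.conflict)
```

In this toy run the cardiology and nephrology stand-ins both answer and disagree, so the result carries both answers with `conflict` set. A query that matches no specialist returns an empty answer list, which is the sketch's crude version of refusing.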

• • •

I've spent my career building systems that follow this principle, and the pattern keeps appearing - not because I'm looking for it, but because the math demands it.

Across four domains with nothing in common - neuroscience, graph theory, signal processing, and ecosystem dynamics - the same mathematical structure recurs. When independent dimensions combine to produce an emergent whole, the composition is multiplicative, not additive. The emergent property is the product of its independent dimensions, not their sum. I've been able to prove this is axiomatically unique: given reasonable constraints on how independent dimensions should combine, the multiplicative form is the only valid composition function.⁷ The math doesn't offer an alternative.

The practical consequence is zero-collapse. If any single dimension is entirely absent, the product is zero. You can't compensate for the absence of one dimension by having more of another. In consciousness, this means zero integration with infinite processing power produces nothing. In ecosystems, it means zero biodiversity with infinite resources produces no resilience. The structure is universal because the math is universal.

Applied to distributed intelligence, zero-collapse means a system whose specialist outputs lack integration (they don't cohere), or lack differentiation (they all produce the same answer), or lack self-reference (the system can't evaluate its own boundaries or refuse a query it shouldn't answer) produces zero trustworthy output. Not low. Zero. This is why the routing protocol matters as much as the specialist models. This is why governance - the layer that detects bias, surfaces disagreement between experts, refuses to flatten conflict into artificial consensus - isn't an add-on feature. It's structural. Remove it and the entire product collapses to zero, no matter how powerful the individual specialists are.
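The difference between additive and multiplicative composition can be seen in a few lines of Python. This is only an illustration, under the assumption that each of the three dimensions named above is scored between 0 and 1; the scores and function names are hypothetical, and the sketch does not reproduce the uniqueness proof.

```python
import math


def additive(dimensions):
    """Sum of the dimensions: strength in one can mask absence in another."""
    return sum(dimensions)


def multiplicative(dimensions):
    """Product of the dimensions: any zero collapses the whole."""
    return math.prod(dimensions)


# Hypothetical scores in [0, 1] for (integration, differentiation, self_reference).
balanced = [0.8, 0.7, 0.9]
no_self_reference = [1.0, 1.0, 0.0]  # two dimensions maxed out, one absent

print(additive(no_self_reference))        # 2.0
print(multiplicative(no_self_reference))  # 0.0
print(multiplicative(balanced) > multiplicative(no_self_reference))  # True
```

Under the sum, the system missing self-reference still scores 2.0 and looks healthier than it is. Under the product it scores zero, and the modest but complete system comes out ahead.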

• • •

There's a governance argument here worth stating plainly.

A monolithic model owned by one company is a single point of failure for the world's access to machine intelligence. If that company gets captured - by financial pressure, political incentive, regulatory coercion, or the personal ideology of whoever controls the board - the intelligence degrades for everyone simultaneously. This isn't theoretical. We've already watched it happen. A billionaire buys a social platform, fires the safety teams, and repurposes the AI to reflect his politics. A government pressures a model provider to suppress outputs on sensitive topics. An investor bloc demands the model optimize for engagement over accuracy because engagement drives subscriptions. The corruption vector isn't a hypothetical - it's the default trajectory when intelligence infrastructure is concentrated in hands that answer to shareholders, not to the institutions that produced the knowledge.

No redundancy. No alternative routing. No mechanism for the institutions that actually produce knowledge to contest the model's outputs on their own terms. When one company's model hallucinates about your field, your profession, your research - there's no appeals process. There's no second opinion from a competing specialist. There's just the monolith, and whatever its training data and its owner's incentives happened to produce.

A distributed architecture doesn't solve this by being better governed. It solves it by making capture structurally harder. Open protocol means no single entity controls routing. Thousands of institutional specialists means no single acquisition consolidates knowledge. Surfacing disagreement between experts (rather than suppressing it) makes the system's epistemic immune response a design feature, not a vulnerability to patch.

And the open protocol isn't generosity - it's multiplication. Every institution that builds a specialist adds a node. Every node increases cross-domain capability for every other node. A hundred implementers building on the open protocol reach a hundred institutions simultaneously. The architecture compounds because it was designed to.

• • •

The instinct to build a God-model is human ego applied to engineering - the belief that we can construct a singular artifact superior to a distributed, specialized, routing-coordinated architecture that evolution refined across geological time. The brain didn't evolve a monolith because a monolith doesn't survive. It doesn't specialize. It doesn't preserve knowledge through succession. It doesn't surface disagreement. It doesn't refuse. Systems that tried the monolithic approach were outcompeted by systems that distributed.

We don't need to invent our way past this. The problem was solved a long time ago. The solution isn't broken.

Sources

1. Albuixech-Crespo, B. et al. "Molecular regionalization of the developing amphioxus neural tube challenges major partitions of the vertebrate brain." PLOS Biology, 2017. See also Grillner, S. "The Basal Ganglia Over 500 Million Years." Current Biology, 2016.
2. Minaee, S. et al. "Large Language Models: A Survey." arXiv:2402.06196, 2024. Jiang, C. et al. "MoDEM: Mixture of Domain Expert Models." arXiv:2410.07490, 2024.
3. Huang, Q. et al. "FinLLMs: A Survey of Financial Domain Large Language Models." EmergentMind, 2025. Domain-specialized financial LLMs surpass generalists by 10-30% absolute on domain tasks.
4. Ludvigsen, K.G.A. "The carbon footprint of GPT-4." Towards Data Science, 2025. Estimated 51-62 GWh based on 25,000 A100 GPUs for 90-100 days. Corroborated by Li, P. et al. "Electricity Demand and Grid Impacts of AI Data Centers." arXiv:2509.07218, 2025.
5. Ren, S. et al. Research on water costs of AI computation, UC Riverside. GPT-4 training consumed an estimated 5.4 million liters of freshwater for cooling.
6. Gartner forecast cited in Kili Technology, "Domain-Specific LLM Benchmarks: 2026 Vertical AI Map," 2025. Specialized GenAI model spend projected at $1.1B in 2025, fastest-growing segment of $14.2B total.
7. See multiplicative composition research on this site. Axiomatic uniqueness proof following Aczél, J. Lectures on Functional Equations and Their Applications, Academic Press, 1966.
8. Hu, E.J. et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685, 2021. LoRA trains ~18M adapter parameters vs 175B full model parameters - roughly 0.01% of the original model. QLoRA extends this to fine-tune 30B models on a single 24GB consumer GPU.
9. SmarterArticles. "Power Hungry Machines: How to Cut AI Energy Costs Without Losing Capability," 2025. "Using fine-tuned models for specific tasks rather than general-purpose generative models consumes approximately 30 times less energy."
10. GPT-5 training costs reported via multiple sources, 2025. Individual training runs exceeded $500M, with total training costs estimated at $1.25-2.5B.