Essays, observations, and arguments — on AI architecture, the nature of systems, the road, and whatever else demands more than a paragraph.
Every impressive AI demo conceals the same truth: reliability at scale is a completely different engineering problem from correctness in controlled conditions. Here is what the gap actually looks like from the inside.
The conversation is stuck on chat interfaces. The unit of work has moved from the query to the goal. Here is what that actually means for how we build.
Model Context Protocol is not just a tool integration format. It is the first serious attempt to standardize how AI agents interact with the world. What this means for how we architect systems.
I drove every intact stretch of Route 66 from Chicago to Santa Monica. The road is not a tourist attraction. It is a cross-section of American time, frozen in asphalt.
Every unmaintainable system I have worked on became that way because someone confused complexity with sophistication. The argument for simplicity as a moral position in engineering.
If the universe is vast and old and life is not rare, where is everyone? Every proposed answer implies something extraordinary about our situation. The silence is data.
Teams spend 80% of their time on model selection and 5% on evaluation. This is backwards. The case for building evals first, running them continuously, and treating regression as a critical bug.
Split Rock Lighthouse, Pictured Rocks, the Boundary Waters by canoe. The Great Lake that looks like an ocean and behaves like one. Notes from the northern edge of the continent.
Not because superintelligence is imminent. Because the systems being built today are already consequential. The case for treating alignment not as a research concern but as an engineering discipline.
A demo that works 90% of the time is impressive. A production system that works 90% of the time is broken. This sounds obvious written down. In practice, the distinction collapses in every sales meeting, every board presentation, every proof-of-concept review I have ever attended.
The gap is not a secret. It is just systematically ignored because the incentives on both sides of the table push toward ignoring it. The vendor wants to close the deal. The buyer wants to believe. The demo is optimized for the best case. Production is defined by the worst.
The hard 10% is where real engineering lives. The edge cases, the adversarial inputs, the ambiguous instructions, the cascading errors, the 3am pages. Most teams underinvest in the error handling, observability, and fallback design that separate a compelling prototype from a reliable system — not because they are lazy, but because these things are invisible in demos and visible only in production.
What does the gap actually look like? In healthcare AI — my primary domain — it looks like this: a prior authorization model that performs at 94% accuracy in testing and 76% in production, because the test set was cleaned and the production data is not. It looks like an agent that handles the happy path beautifully and fails silently on the 15% of cases that require a document format the training data never included. It looks like a system that works perfectly until the downstream API it depends on starts returning 429s, and nobody built a retry strategy or a graceful degradation path.
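That missing retry strategy is not exotic engineering; it is a few dozen lines that nobody budgeted for. A minimal sketch, assuming a hypothetical client that raises an exception on HTTP 429 (the function and exception names here are illustrative, not any particular library's API):

```python
import random
import time


class RateLimited(Exception):
    """Raised by the (hypothetical) client when the downstream API returns 429."""


def call_with_backoff(call, max_attempts=5, base_delay=0.5, fallback=None):
    """Retry a rate-limited call with exponential backoff and jitter.

    If every attempt is rejected, return the fallback instead of
    crashing the pipeline: degrade gracefully rather than fail silently.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                break
            # Exponential backoff with jitter: 0.5s, 1s, 2s, ... plus noise
            # so that retrying clients do not stampede in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return fallback
```

The graceful degradation path is the `fallback` argument: a cached result, a queued-for-human marker, anything other than a silent failure.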
Three things separate teams that close the gap from those that do not. First: they build evals before they build features. You cannot know whether you are improving a system if you have no way to measure it. Second: they instrument everything. If you cannot observe it, you cannot debug it. Logs, traces, and structured outputs are not overhead — they are the system's ability to explain itself. Third: they design for the exit condition in every agent loop. What happens when the goal cannot be reached? What happens when the tool fails? What happens when the model returns something unexpected? The teams that answer these questions before they ship are the ones whose systems stay up.
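Those three exit questions translate directly into the shape of the loop. A sketch of the skeleton, where every way out is explicit and observable (the `plan_step`, `run_tool`, and `validate` callables are placeholders, not a real framework):

```python
from enum import Enum


class Outcome(Enum):
    DONE = "done"                        # goal reached
    BUDGET_EXCEEDED = "budget_exceeded"  # goal not reachable within the step limit
    TOOL_FAILED = "tool_failed"          # a tool failed with no fallback
    BAD_OUTPUT = "bad_output"            # model output failed validation


def run_agent(goal, plan_step, run_tool, validate, max_steps=10):
    """Agent loop where every exit condition is an explicit, named outcome."""
    history = []
    for _ in range(max_steps):
        action = plan_step(goal, history)       # what does the model want to do next?
        if action is None:                      # model signals the goal is met
            return Outcome.DONE, history
        if not validate(action):                # model returned something unexpected
            return Outcome.BAD_OUTPUT, history
        try:
            history.append(run_tool(action))    # execute against the real tool
        except Exception:
            return Outcome.TOOL_FAILED, history # surface it; never fail silently
    return Outcome.BUDGET_EXCEEDED, history     # limit hit: hand off to a human
```

Every non-`DONE` outcome is a routing decision someone made before shipping, not a stack trace at 3am.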
The gap is closable. It requires different skills than demo-building — less creativity, more rigor; less novelty, more robustness. The engineers who do this work are rarely the ones in the conference room presentations. They are the ones keeping the system running at 3am. They deserve more credit than they get, and their concerns deserve more weight in the architecture discussions that happen before the demo is ever built.
The dominant mental model for AI in most organizations is still the chat interface. Question in, answer out. The model as an oracle you consult. This is the wrong frame, and holding it will cause organizations to systematically underinvest in the capability that will matter most in the next five years.
Agentic AI systems do not answer questions. They pursue goals. The difference is architectural, operational, and strategic — not cosmetic. The unit of work has moved from the query to the objective. A query-response system generates text. An agentic system observes state, plans a sequence of actions, executes them using real tools, evaluates the results, and iterates until the goal is reached or a limit is hit.
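The architectural difference is visible even at the level of pseudocode shape. Query-response is a pure function call; an agent is a control loop over external state. A minimal sketch, with hypothetical `observe`, `plan`, `execute`, and `goal_met` callables standing in for real components:

```python
def query_response(model, query):
    # The oracle frame: one call, text in, text out, no side effects.
    return model(query)


def pursue_goal(goal, observe, plan, execute, goal_met, max_iters=20):
    """The agent frame: a feedback loop over real-world state.

    Each iteration reads state, plans actions, acts through tools,
    and re-evaluates. The loop, not the model call, is the system.
    """
    for _ in range(max_iters):
        state = observe()                  # read the world as it is now
        if goal_met(state, goal):
            return state                   # exit: objective satisfied
        for action in plan(state, goal):   # model proposes a sequence of actions
            execute(action)                # act through real tools, with side effects
    return observe()                       # limit hit: return best-known state
```

Everything that distinguishes agentic systems operationally, from observability to trust calibration, attaches to that loop rather than to any single model invocation.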
In healthcare revenue cycle management — where I spend most of my time — this means the difference between a model that answers "what is the prior authorization status for this claim?" and a system that identifies claims at risk of denial, retrieves the relevant clinical guidelines, checks the patient's coverage history, flags the discrepancy between the documented diagnosis and the requested procedure, and drafts the appeal letter with supporting evidence before a human ever touches the case.
These are not the same technology deployed differently. They require different architecture, different observability, different failure modes, different trust calibration, and different organizational structures to support them. Organizations that treat agentic systems as chatbots with more steps will build chatbots with more steps. The capability gap between them and the organizations that understand the distinction will compound over time.
Before Model Context Protocol, every AI tool integration was a bespoke contract between a model and a specific system. You wrote a function, described it in the model's preferred schema, handled the authentication, parsed the response, and repeated for every tool you needed. The result was integration code that was brittle, non-portable, and tightly coupled to a specific model's API conventions.
MCP changes the architecture by introducing a standard server-client protocol for tool integration. An MCP server exposes a set of tools. Any MCP-compatible agent can use them. The integration is written once and works everywhere. This is the same insight that made REST APIs transformative — standardize the interface, and composability follows.
The practical implication for how I build systems: I now design MCP servers as first-class architectural components. One server per integration surface — ServiceNow, EHR systems, document stores, internal APIs. Each server handles its own authentication, rate limiting, and error handling. The agent orchestration layer never needs to know the implementation details of any particular integration. When a new tool needs to be added, it is a new MCP server, not a modification to the agent.
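The pattern is easier to see in code than in prose. What follows is not the actual MCP SDK; it is a deliberately simplified stand-in that illustrates the architectural shape — one server per integration surface, owning its own rate limiting and error handling, exposing named tools behind a uniform call interface:

```python
import time


class ToolServer:
    """Hypothetical stand-in for an MCP server: one per integration surface.

    The orchestration layer sees only (tool name, arguments) -> result.
    Rate limiting and error handling live inside the server, invisible
    to the agent that calls it.
    """

    def __init__(self, name, min_interval=0.0):
        self.name = name
        self._tools = {}
        self._min_interval = min_interval   # crude per-server rate limit
        self._last_call = 0.0

    def tool(self, fn):
        """Register a function as a named tool (used as a decorator)."""
        self._tools[fn.__name__] = fn
        return fn

    def list_tools(self):
        return sorted(self._tools)

    def call(self, tool_name, **kwargs):
        if tool_name not in self._tools:
            return {"ok": False, "error": f"unknown tool: {tool_name}"}
        wait = self._min_interval - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)                # respect the server's own rate limit
        self._last_call = time.monotonic()
        try:
            return {"ok": True, "result": self._tools[tool_name](**kwargs)}
        except Exception as exc:            # errors surface as data, not crashes
            return {"ok": False, "error": str(exc)}


# Adding a new integration is a new server, not a modification to the agent:
claims = ToolServer("claims-api")


@claims.tool
def claim_status(claim_id: str) -> str:
    # Placeholder body: a real server would call the upstream system here.
    return f"status for {claim_id}"
```

The real protocol adds transport, discovery, and schema negotiation, but the composability argument lives entirely in that interface boundary.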
This composability is not just a developer convenience. It is a strategic capability. An organization that has built a library of well-designed MCP servers has built a reusable foundation for every AI agent they will ever deploy. The investment compounds. The alternative — bespoke integrations for every agent — scales linearly with the number of agents and tools, which is the wrong direction.
I started at the Grant Park sign in Chicago on a Tuesday morning in October. The weather was what Chicago weather always is in October — aggressively ambivalent. By the time I reached the Gemini Giant in Wilmington, the sky had decided on grey, and it stayed grey all the way through Illinois.
The Route 66 that tourists imagine — the neon signs, the diners, the open Southwest sky — does not begin until Missouri. The Illinois stretch is suburbs bleeding into farmland bleeding into more suburbs. But it is necessary. The road earns its mythology by making you wait for it.
The thing nobody tells you about driving Route 66 is that it is not one road. It is a palimpsest of roads, overlaid on each other across decades. The original alignment through a town is different from the bypass built in the 1940s and different again from the section that survived into the interstate era. Following the historic route means making decisions at every town about which layer of the palimpsest you want to be on.
I took every intact historic alignment I could find. This added two days and was completely correct. The intact sections are where the road still has personality — the two-lane through Galena, Kansas; the ribbon road in the Ozarks; the nine-mile stretch through the Mojave from Amboy to Ludlow where the asphalt shimmers and there is nothing in any direction except the fact of the desert being very large and very indifferent to your presence.
I arrived at the Santa Monica Pier on a Saturday afternoon. The Pacific was exactly as unimpressed as the Atlantic always is when you finally reach it. The road ends at a sign that says "END OF THE TRAIL." Somebody had taken a photo of themselves in front of it five minutes before me. Somebody took a photo of me. The road does not care. It has been doing this since 1926.
Every system I have ever worked on that became unmaintainable did so because someone — often someone talented — confused complexity with capability. The two are not correlated. The most powerful systems I have worked with are also the most understandable ones. This is not a coincidence.
Complexity has a specific failure mode: it concentrates knowledge in the person who built the system. A system that only its creator can understand has a single point of failure — its creator. When they leave, take a vacation, or simply forget, the system becomes a black box. Black boxes cannot be debugged, extended, or trusted.
Elegance is not an aesthetic preference. It is a functional requirement. The right abstraction does more work with less surface area. It handles more cases with fewer rules. It is easier to test, easier to explain, and easier to extend. When you find yourself adding complexity to solve a problem, the more productive question is usually: why does this problem exist? The answer often points to a design decision made earlier that can be revisited.
In AI systems specifically, complexity accumulates in a particular way. Prompt engineering layers build on each other. Tool descriptions multiply. Agent chains grow. Each addition makes sense in isolation. The aggregate becomes something nobody fully understands, which means nobody can reliably predict what it will do under novel inputs. Simplicity in AI systems is not a nice-to-have. It is a precondition for the kind of reliability that production use cases require.
The Fermi Paradox is simple to state: the universe is approximately 13.8 billion years old, contains somewhere between 200 billion and two trillion galaxies, each with hundreds of billions of stars, many of which have planets in the habitable zone — and yet we have found no evidence of any other intelligent life anywhere. The silence is total and, when you sit with it long enough, bewildering.
Enrico Fermi, having lunch with colleagues at Los Alamos in 1950, asked the question that bears his name: where is everybody? It remains unanswered seventy-five years later.
Every proposed explanation for the silence implies something extraordinary about our situation. The Great Filter hypothesis suggests that somewhere on the path from simple chemistry to spacefaring civilization, there is a step that almost nothing survives. If the filter is behind us — if it was the emergence of eukaryotic cells, or sexual reproduction, or language — then we may be the most complex thing in the observable universe. If it is ahead of us, then virtually every civilization that reaches our technological level is about to end.
I find the Fermi Paradox useful not as an astronomy problem but as a perspective tool. The cosmic silence is a fact that keeps the anxieties of daily life properly scaled. Not because nothing matters, but because the question of what matters enough to spend a finite life on becomes clearer when you hold it against the backdrop of thirteen billion years of universal silence. We are, as far as we can tell, the universe knowing itself. That is not nothing. In fact it might be everything.
The pattern is consistent across every AI project I have reviewed in the last three years: teams spend 80% of their time on model selection and prompt engineering and roughly 5% on evaluation. This is the wrong ratio by a large margin, and it explains why so many AI systems that look good in demos fail in production.
Without a robust evaluation framework, you are optimizing in the dark. Every prompt change, every model upgrade, every new tool integration might be an improvement or might be a regression — and you have no reliable way to know which. The teams I have seen move fastest on AI development are, without exception, the ones with the best evals. Not the ones with the most sophisticated prompts. Not the ones using the newest models. The ones who know, quantitatively, whether their changes are improvements.
Build evals before you build features. Start with the cases you know should work — the happy path, the common cases, the examples from your requirements documents. Add the edge cases as you discover them in testing and production. Run the eval suite on every significant change. Treat a regression as a critical bug, not a known limitation to be documented.
The other thing evaluations do that is rarely discussed: they force you to define what good looks like. This is harder than it sounds for AI systems, where the output is often natural language and the quality is genuinely subjective. The act of writing evaluation criteria — what counts as a correct response, what counts as a failure, what counts as acceptable variation — is itself a design exercise that clarifies requirements in ways that no amount of verbal specification can match. Write the eval first. The implementation gets easier.
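In practice this can start as a few dozen lines, not a platform purchase. A sketch of an eval harness that treats any drop against the last known-good score as a failure — the `grade` function, which encodes what counts as correct, is the hypothetical piece you must write yourself:

```python
def run_evals(system, cases, grade, baseline_score=None):
    """Run every eval case, compute the pass rate, and flag regressions.

    `cases` pairs inputs with expectations; `grade` encodes what "good"
    looks like. Writing `grade` is the design exercise described above.
    """
    failures = []
    for case in cases:
        output = system(case["input"])
        if not grade(output, case["expected"]):
            failures.append(case["id"])
    score = (len(cases) - len(failures)) / len(cases)
    # A score below the last known-good baseline is a critical bug,
    # not a known limitation to be documented.
    regressed = baseline_score is not None and score < baseline_score
    return {"score": score, "failures": failures, "regressed": regressed}
```

Wire it into CI so that `regressed` fails the build, and the eval suite stops being a report and becomes a gate.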
Lake Superior does not look like a lake. It looks like an ocean that has been placed, through some cartographic error, in the middle of the North American continent. Standing on the shore at Two Harbors, Minnesota, at 5:30 in the morning in September, you cannot see the Canadian shore. The water is the color of pewter and the horizon is indistinguishable from the sky. The scale is wrong for a lake.
The Ojibwe called it Gitchigumi — the Great Sea. This is more accurate than Lake Superior, which sounds like something that ranks bodies of water by quality. It is not superior in that sense. It is simply enormous: 31,700 square miles of fresh water, deep enough in places to swallow skyscrapers, cold enough to preserve the wooden hulls of ships that sank a century ago.
Split Rock Lighthouse sits on a cliff 130 feet above the water, built in 1910 after a November storm sank or stranded twenty-nine ships in thirty-six hours. Lake Superior storms are not weather events. They are geological events. The lake generates its own climate. It does not care about forecasts.
I paddled into the Boundary Waters from the Ely entry point on a Thursday and did not see another person for four days. The Boundary Waters Canoe Area Wilderness is one million acres of lakes connected by portage trails — you carry your canoe overland between bodies of water, which is the exact right amount of effort to filter out everyone who is not serious about being there. The loons call at night. The stars, away from any light pollution, are what stars were before electricity: overwhelming.
When most engineers hear "AI alignment," they think of a research problem that other people are working on — something involving hypothetical superintelligent systems and long-horizon risks that are not yet relevant to the software being shipped today. This is the wrong frame, and it has practical consequences for the systems being built right now.
Alignment is not a future problem. It is the question of whether the system does what you actually want, in all the situations it will encounter, in a way that serves the people it is supposed to serve. Stated that way, it is obvious that alignment is already your problem — and has been since you started building.
A prior authorization model that denies care incorrectly is an alignment failure. An NLP classifier that encodes the biases present in its training data is an alignment failure. A recommendation system that optimizes for engagement over user wellbeing is an alignment failure. A hiring algorithm that penalizes resume gaps — which correlate with caregiving — is an alignment failure. None of these require superintelligence. They require only that a system be deployed in a consequential context with imperfect specification.
The question is not whether to care about alignment. It is whether you are paying close enough attention to realize you already need to. The engineers and architects building production AI systems have more moral responsibility than the current industry incentive structures acknowledge. This is not comfortable. It is true. Act accordingly — not because a regulator requires it, but because the people on the other end of the system deserve it.