Incoming Signal
18 SIGNALS
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Sparse autoencoders applied to Claude 3 Sonnet reveal millions of interpretable features — including multimodal features, emotion representations, and abstract concepts. The finding that many features recur across model sizes changes how we should think about alignment and interpretability work.
ESSENTIAL
Model Context Protocol (MCP)
MCP standardizes how AI models connect to external tools and data sources. The server-client architecture means integrations become composable. One MCP server for ServiceNow, one for EHR systems, one for document stores — and any Claude-powered agent can use all of them without bespoke tool schemas.
ESSENTIAL
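The composability claim can be made concrete with a toy sketch. This is not the real MCP SDK or wire protocol, just an illustration of the idea: independent tool servers, one client that aggregates them with no per-integration glue. All names here are hypothetical.

```python
# Toy stand-in for MCP-style composability. Not the MCP SDK;
# server names ("servicenow", "docstore") are hypothetical.

class ToolServer:
    """A minimal stand-in for a tool server: a named bundle of tools."""
    def __init__(self, name):
        self.name = name
        self.tools = {}

    def tool(self, fn):
        # Register a callable under its function name.
        self.tools[fn.__name__] = fn
        return fn

class Client:
    """Aggregates tools from many servers behind one flat namespace."""
    def __init__(self, servers):
        self.catalog = {
            f"{s.name}.{tool_name}": fn
            for s in servers
            for tool_name, fn in s.tools.items()
        }

    def call(self, qualified_name, **kwargs):
        return self.catalog[qualified_name](**kwargs)

# Two independent "servers"; the client composes them without bespoke glue.
tickets = ToolServer("servicenow")
docs = ToolServer("docstore")

@tickets.tool
def open_ticket(summary):
    return {"id": 1, "summary": summary}

@docs.tool
def search(query):
    return [f"doc matching {query!r}"]

client = Client([tickets, docs])
```

Adding a third server means adding one object to the client's list; no existing integration changes.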
Agentic Loops Need Explicit Exit Conditions
The most common failure mode in production agentic systems is not hallucination — it is infinite loops and unbounded tool calls. Every agent loop needs an explicit maximum iteration budget, a confidence threshold for early termination, and a graceful degradation path when neither is met. Design for the exit before you design for the goal.
STRONG
PaLM 2 Technical Report
The multilingual and reasoning improvements in PaLM 2 established benchmarks that shaped how the field evaluates large language models. The compute-optimal scaling insights here informed how teams think about compute-efficient training versus inference-time scaling.
NOTABLE
LangGraph: Graph-Based Agent Orchestration
LangGraph brings explicit state machines to multi-agent systems. The graph model means agent transitions are auditable, human-in-the-loop checkpoints are first-class, and cycles are supported by design. For production healthcare AI this is not optional — it is the difference between a system you can explain to a compliance team and one you cannot.
STRONG
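The auditability claim is easiest to see in miniature. The sketch below is plain Python, not the LangGraph API: named nodes, explicit transitions, a supported cycle, and a human checkpoint node, with every hop recorded in a trace.

```python
# Plain-Python illustration of the graph model (NOT the LangGraph API):
# named nodes return the name of the next node; a trace logs every hop.

def draft(state):
    state["attempts"] += 1
    state["doc"] = f"draft v{state['attempts']}"
    return "review"

def review(state):
    # Cycle back to drafting until the document passes, then checkpoint.
    return "human_checkpoint" if state["attempts"] >= 2 else "draft"

def human_checkpoint(state):
    state["approved"] = True             # stand-in for a real human gate
    return None                          # terminal node

NODES = {"draft": draft, "review": review, "human_checkpoint": human_checkpoint}

def run(start, state):
    trace, node = [], start
    while node is not None:
        trace.append(node)               # audit log of every transition
        node = NODES[node](state)
    return state, trace

state, trace = run("draft", {"attempts": 0})
```

The `trace` list is what you show a compliance team: which nodes ran, in what order, and where the human gate sat.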
The Trust Calibration Problem in Clinical AI
Clinicians overtrust AI outputs when presented with high confidence scores and undertrust them when outputs are hedged or qualified. The calibration problem is not technical — it is communicative. The system that says "I am 94% confident" is less useful than the one that says "I found 3 supporting criteria and 1 conflicting note — here is the conflict." Explanations over scores.
STRONG
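"Explanations over scores" implies a concrete output shape: return the evidence, not a number. A minimal sketch, with illustrative field names and toy data rather than any specific system's schema:

```python
# Sketch of an evidence-first output: supporting and conflicting
# findings, surfaced directly. Field names and data are illustrative.

from dataclasses import dataclass, field

@dataclass
class Finding:
    criterion: str
    supports: bool
    source: str

@dataclass
class Assessment:
    findings: list = field(default_factory=list)

    def summary(self):
        support = [f for f in self.findings if f.supports]
        conflict = [f for f in self.findings if not f.supports]
        lines = [f"Found {len(support)} supporting criteria and "
                 f"{len(conflict)} conflicting note(s):"]
        lines += [f"  - conflict: {f.criterion} ({f.source})" for f in conflict]
        return "\n".join(lines)

a = Assessment([
    Finding("age >= 18", True, "demographics"),
    Finding("prior step therapy documented", True, "med history"),
    Finding("diagnosis code matches policy", True, "claim"),
    Finding("recent lab contraindication", False, "lab results"),
])
```

The clinician sees the conflict itself and can resolve it, which is the calibration mechanism a bare "94% confident" cannot provide.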
Attention Is All You Need
The paper that made everything else possible. The transformer architecture eliminated recurrence and convolution entirely in favor of self-attention. Every large language model, every multimodal system, every agent architecture running today descends from this 2017 paper. Re-reading it periodically is useful — the clarity of the original is remarkable.
ESSENTIAL
Inference-Time Compute Scaling is the Next Frontier
The era of simply scaling training compute is maturing. The emerging thesis is that allocating more compute at inference time — chain-of-thought, repeated sampling, verification models — can achieve capability gains that would require orders of magnitude more training compute. Models that reason before they answer are consistently outperforming larger models that do not.
STRONG
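One of the patterns named above, repeated sampling plus a verifier, fits in a few lines. This is a sketch of the shape of the idea; `sample` and `verify` are hypothetical stand-ins for model calls.

```python
# Best-of-n: spend more inference compute (n samples) instead of using a
# bigger model, then let a verifier pick the best candidate.
# `sample` and `verify` are hypothetical stand-ins for model calls.

def best_of_n(sample, verify, n=8):
    candidates = [sample(i) for i in range(n)]
    return max(candidates, key=verify)

# Toy task: "sampling" varied proposals, "verifying" with a score function.
answer = best_of_n(
    sample=lambda seed: (seed * 37) % 11,   # stand-in for a sampled answer
    verify=lambda c: -abs(c - 7),           # closer to 7 scores higher
)
```

The knob is `n`: capability scales with inference budget rather than parameter count, which is the thesis in miniature.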
RAG Quality is a Chunking Problem More Than an Embedding Problem
Most RAG systems underperform not because the retrieval model is weak but because the chunks are wrong. Recursive character splitting destroys document structure. The signal is in the semantic unit — a policy section, a clinical note, a code block — not in the 512-token window. Chunk by meaning, not by length, and retrieval quality improves dramatically.
STRONG
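"Chunk by meaning" can be shown with a minimal splitter that breaks on section boundaries instead of fixed windows. The heading convention (markdown-style `#`) is an assumption for illustration; a real pipeline would key on whatever structure the documents actually carry.

```python
# Sketch: split on section boundaries (markdown-style headings assumed)
# rather than fixed character windows, so each chunk is a semantic unit.

def chunk_by_section(text):
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:   # a new section starts a new chunk
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = """# Eligibility
Members must have active coverage.

# Formulary
Drug X requires step therapy.
"""
chunks = chunk_by_section(doc)
```

Each chunk now carries a whole policy section, so the retriever never returns half a rule with its condition cut off.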
Claude's Extended Context Window as Architecture Feature
A 200K context window is not just a larger input buffer — it is an architectural choice that changes how you build systems. When the entire prior authorization history, policy document, and clinical notes fit in context simultaneously, the retrieval problem changes shape. The tradeoff is cost and latency, but for high-value decisions it is often the right call.
NOTABLE
The Gap Between AI Demo and AI Production is Wider Than Anyone Admits
A demo that works 90% of the time is impressive. A production system that works 90% of the time is unacceptable. The hard 10% — edge cases, adversarial inputs, ambiguous instructions, cascading errors — is where real architecture work lives. Most teams underinvest in the error handling, observability, and fallback design that separate a compelling prototype from a reliable system.
ESSENTIAL
Sovereign AI Infrastructure is Becoming a Strategic Priority
Governments and large enterprises are increasingly unwilling to route sensitive data through third-party AI APIs. The demand for on-premises LLM deployment, private cloud inference, and data-residency-compliant AI infrastructure is accelerating. For healthcare, financial services, and defense, this is not a preference — it is a regulatory requirement shaping procurement decisions now.
STRONG
Constitutional AI: Harmlessness from AI Feedback
RLHF has a scaling problem — you need humans. Constitutional AI replaces human feedback with a set of principles and AI-generated critiques. The model critiques its own outputs against the constitution and revises them. The result is alignment that scales. Much of today's alignment work draws from this lineage.
ESSENTIAL
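The critique-and-revise loop has a simple skeleton. This is a toy sketch of that control flow, not the paper's training procedure; `generate`, `critique`, and `revise` are hypothetical stand-ins for model calls, and the one-line constitution is illustrative.

```python
# Sketch of the critique-and-revise control flow. Model calls are
# replaced by toy lambdas; the constitution entry is illustrative.

CONSTITUTION = ["avoid giving medical dosage advice"]

def constitutional_revise(generate, critique, revise, prompt, rounds=2):
    draft = generate(prompt)
    for principle in CONSTITUTION:
        for _ in range(rounds):
            problem = critique(draft, principle)   # AI feedback, not human
            if problem is None:
                break
            draft = revise(draft, problem)         # rewrite against the critique
    return draft

# Toy stand-ins for the model.
out = constitutional_revise(
    generate=lambda p: "take 500mg of drug X",
    critique=lambda d, pr: "contains dosage" if "mg" in d else None,
    revise=lambda d, pb: "consult a clinician about drug X",
    prompt="what should I take?",
)
```

The point is that the critic is the model itself, so the loop runs without a human in it; that is what makes the approach scale.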
Tool Descriptions Are Prompt Engineering
In a tool-using agent, the quality of the tool description determines the quality of tool selection. A poorly described tool will be ignored, misused, or called with wrong parameters. The effort that goes into writing good function descriptions — naming conventions, parameter descriptions, examples of when to use and when not to use — returns many times over in agent reliability.
STRONG
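What a "good function description" looks like can be shown directly. The tool below is hypothetical, written in the generic JSON-schema style most function-calling APIs share; note the explicit when-to-use and when-not-to-use guidance, which is what steers tool selection.

```python
# A hypothetical tool definition in generic JSON-schema style.
# The description carries selection guidance, not just a summary.

check_formulary = {
    "name": "check_formulary",
    "description": (
        "Check whether a drug is on the plan formulary and what tier it is. "
        "Use when the request names a specific drug and plan. "
        "Do NOT use for eligibility or prior-authorization status checks."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "drug_name": {
                "type": "string",
                "description": "Generic or brand name, e.g. 'atorvastatin'.",
            },
            "plan_id": {
                "type": "string",
                "description": "Plan identifier from the member record.",
            },
        },
        "required": ["drug_name", "plan_id"],
    },
}
```

Each parameter description includes an example value, so the model has a template for well-formed arguments rather than a bare type.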
The Commoditization of Base Models is Accelerating
As frontier model capabilities converge and open-weight models catch up, the competitive moat is shifting from the model to the system around it. Data pipelines, fine-tuning infrastructure, evaluation frameworks, deployment tooling, and domain-specific context are where durable advantage lives. The model is becoming table stakes. What you build on top of it is the differentiator.
STRONG
Evaluation is the Most Underinvested Part of Every AI Project
Teams spend 80% of their time on model selection and prompt engineering and 5% on evaluation. This is backwards. A robust eval framework tells you whether your changes are improvements. Without it, you are optimizing in the dark. Build evals early, run them continuously, and treat regression as a critical bug. The teams that move fastest are the ones with the best evals.
ESSENTIAL
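"Build evals early, treat regression as a critical bug" needs very little machinery to start. A minimal sketch, with hypothetical cases and a toy `system` callable standing in for the real pipeline:

```python
# Minimal eval harness with regression detection.
# Cases and the `system` callable are hypothetical stand-ins.

def run_evals(system, cases):
    results = {name: system(inp) == expected
               for name, (inp, expected) in cases.items()}
    score = sum(results.values()) / len(results)
    return score, results

def check_regression(old_score, new_score, tolerance=0.0):
    # Treat any drop beyond tolerance as a critical bug, not noise.
    return new_score >= old_score - tolerance

CASES = {
    "simple add": ((2, 3), 5),
    "identity": ((0, 7), 7),
    "negative": ((-1, 1), 0),
}

score, results = run_evals(lambda xy: xy[0] + xy[1], CASES)
```

Run it on every change and gate merges on `check_regression`; the per-case `results` dict tells you which behavior a change broke, not just that something did.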
Learning to Reason with LLMs (o1 System Card)
The o1 system card introduced the inference-time compute scaling paradigm to the broader community. Training the model to spend more time thinking before answering — using a hidden chain of thought — produces qualitatively different reasoning capabilities. The implications for agentic systems that need to plan and verify are significant.
STRONG
Parallelism is the Most Underused Lever in Agentic Design
Most agentic systems run sequentially when the tasks are actually independent. Clinical prior authorization involves verifying eligibility, checking formulary, retrieving clinical guidelines, and reviewing prior notes — all of which can happen in parallel. Designing for concurrency from the start, rather than retrofitting it later, changes both the latency profile and the architecture fundamentally.
STRONG
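The fan-out described above maps directly onto `asyncio.gather`. The four check functions are hypothetical stand-ins for real service calls; the structural point is that total latency becomes the slowest call, not the sum of all four.

```python
# Parallel fan-out of independent prior-auth checks via asyncio.gather.
# The check functions are hypothetical stand-ins for real service calls.

import asyncio

async def verify_eligibility():
    return {"eligible": True}

async def check_formulary():
    return {"on_formulary": True}

async def fetch_guidelines():
    return {"guideline": "step therapy"}

async def review_prior_notes():
    return {"notes": 2}

async def prior_auth_checks():
    # All four run concurrently; latency is max(calls), not sum(calls).
    return await asyncio.gather(
        verify_eligibility(),
        check_formulary(),
        fetch_guidelines(),
        review_prior_notes(),
    )

results = asyncio.run(prior_auth_checks())
```

Starting from this shape also forces the checks to stay independent, which is the architectural discipline the sequential version quietly erodes.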