Where AI coding agents go blind on mobile
Three structural blind spots that limit what AI coding agents can do on iOS and Android — and what it takes to fix them.
I've been running Claude Code against native Android and iOS codebases for several months. The capability ceiling isn't the model. It's the environment. There are three specific places where agents go blind on mobile, and they show up in every project in roughly the same form.
1. Build output that isn't structured for a model's context window
Gradle's standard output is written for humans reading a terminal. It assumes you'll scroll, scan, and use pattern recognition built from years of seeing the same error formats. A model reading that output is doing something different: it's consuming a large, flat text block and trying to infer what failed, what's still relevant, and what's noise.
The problem is volume and structure. A failing Android build can produce thousands of lines. The actual failure is often buried after irrelevant warnings, note messages, and deprecation notices that are technically correct but don't matter for fixing the current problem. The model sees all of it. It frequently focuses on the wrong line.
The fix is structured build output — a log formatter that emits failures in a schema the model can parse. Error type. File. Line. Relevant context. Nothing else. I'm building this as a Gradle plugin. The agent gets a compact, structured error report instead of a raw terminal dump. Hit rate on correct first-pass fixes goes up significantly.
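The plugin itself is Gradle-side, but the schema is the important part. Here's a minimal standalone sketch of the idea, a post-processor that scrapes kotlinc-style console diagnostics into compact records (the regex and the field names are illustrative; a real plugin would hook the compiler's diagnostic APIs instead of parsing console text):

```python
import json
import re

# Matches kotlinc-style console diagnostics:
#   "e: /path/File.kt: (12, 5): Unresolved reference: foo"
# The pattern and output schema are illustrative, not the plugin's actual format.
DIAGNOSTIC = re.compile(
    r"^(?P<sev>[ew]): (?P<file>.+?): \((?P<line>\d+), (?P<col>\d+)\): (?P<msg>.*)$"
)

def structure_build_output(raw: str, errors_only: bool = True) -> list[dict]:
    """Collapse a raw build log into compact, model-friendly records."""
    records = []
    for line in raw.splitlines():
        m = DIAGNOSTIC.match(line.strip())
        if not m:
            continue  # drop task names, progress output, unrelated chatter
        sev = "error" if m.group("sev") == "e" else "warning"
        if errors_only and sev != "error":
            continue  # deprecation warnings rarely matter for the current fix
        records.append({
            "severity": sev,
            "file": m.group("file"),
            "line": int(m.group("line")),
            "message": m.group("msg"),
        })
    return records

raw_log = """\
> Task :app:compileDebugKotlin FAILED
w: /app/src/main/Ui.kt: (3, 1): 'Divider' is deprecated
e: /app/src/main/Main.kt: (12, 5): Unresolved reference: vm
"""
print(json.dumps(structure_build_output(raw_log), indent=2))
```

The agent receives one JSON record instead of the full scroll of Gradle output, which is the whole point: the model spends its attention on the failing line, not on triaging noise.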
2. Runtime state the agent can't see
This is the deeper problem. The agent can read the code. It cannot see a running app.
Compose layout bugs are a clean example. An agent can write a LazyColumn with complex item composition and get the code exactly right — it compiles, the logic is correct, the data flows properly. And then on a real device with real content at real scroll depth, items overlap, or a sticky header misbehaves, or a nested scroll interaction breaks. The failure mode is visual and behavioral. It doesn't appear in logs unless you instrument it yourself.
Same thing with navigation state. A backstack that's three levels deep behaves differently from one that's two levels deep. An agent reasoning from code can produce a navigation graph that looks correct and fails at runtime under specific entry sequences. The failure is in the device's runtime memory, not in any file the agent can read.
ExoPlayer is worse. Buffering behavior, surface release timing, surface lifecycle under configuration changes — these are problems that only appear when you're running against a real media server with real latency. The agent can write ExoPlayer configuration code. It cannot observe what happens when a user rotates the device mid-stream.
The fix requires piping runtime telemetry back into the agent's context. Compose's LayoutInfo, navigation state dumps, ExoPlayer's analytics listener output — structured, summarized, and routed into the loop so the agent can reason about what the running app actually did. This is harder to build than the log formatter. It's also more valuable.
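To make the Compose case concrete, here's a sketch of the summarization step, reducing a LayoutInfo-style dump to facts the agent can act on. The item dict fields mirror Compose's LazyListItemInfo (index, offset, size), but the capture and field names here are assumptions; the real pipeline would serialize LazyListState.layoutInfo on-device:

```python
def summarize_layout(visible_items: list[dict], viewport_height: int) -> dict:
    """Reduce a LazyListLayoutInfo-style dump into a compact summary.

    Each item dict carries "index", "offset", and "size" in pixels, modeled
    on Compose's LazyListItemInfo fields (an assumption for this sketch).
    """
    items = sorted(visible_items, key=lambda it: it["offset"])
    # Adjacent items whose bounds intersect: the overlap bug from the text.
    overlaps = [
        (a["index"], b["index"])
        for a, b in zip(items, items[1:])
        if a["offset"] + a["size"] > b["offset"]
    ]
    return {
        "visible_range": [items[0]["index"], items[-1]["index"]],
        "overlapping_pairs": overlaps,
        "offscreen_overflow": any(
            it["offset"] + it["size"] > viewport_height for it in items
        ),
    }

dump = [
    {"index": 7, "offset": 0, "size": 300},
    {"index": 8, "offset": 280, "size": 300},  # overlaps item 7 by 20px
    {"index": 9, "offset": 580, "size": 300},
]
print(summarize_layout(dump, viewport_height=800))
```

The summary, not the raw dump, goes into the agent's context: "items 7 and 8 overlap at scroll depth X" is something a model can reason about; ten thousand lines of layout coordinates are not.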
3. Platform-version fragmentation as a constant tax
Android has 15+ years of SDK versions still active in the field. iOS has its own layered history. The question "which API does this SDK version support" comes up on nearly every task, and the answer isn't stable — it depends on the minSdk, the targetSdk, what the library versions in play support, and whether there are behavioral differences between platform versions that matter for the specific use case.
Agents handle this poorly by default. They know the APIs exist. They're less reliable about knowing which version introduced what, which patterns are deprecated but still functional, and which patterns will fail silently on older SDK versions. The model's training data has API docs, but API doc coverage isn't uniform across versions, and the model's calibration on version-specific behavior is inconsistent.
The practical effect: agents produce code that works on the emulator (which is usually running a current API level) and fails on devices running older Android versions that are still common in production. The failure is often silent — no crash, just wrong behavior on a device you didn't test.
The fix I'm working toward is a version-aware prompt layer: a set of platform-specific prompt fragments that tell the agent which SDK version is the target, which APIs are available, and which patterns to avoid. Paired with a CI run against a matrix of API levels, this gives the agent feedback it currently doesn't get.
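The assembly step is simple enough to sketch. The API levels below are real Android facts (NotificationChannel at 26, PendingIntent mutability flags when targeting 31, POST_NOTIFICATIONS at 33), but the registry structure and function are illustrative, not the actual prompt layer:

```python
# Fragment registry: (API level where the rule starts applying, guidance text).
# The mechanism is a sketch; the API-level facts themselves are accurate.
FRAGMENTS = [
    (26, "Use NotificationChannel: required since API 26 (Android 8.0)."),
    (31, "Set PendingIntent mutability (FLAG_IMMUTABLE/FLAG_MUTABLE): "
         "required when targeting API 31+."),
    (33, "Request POST_NOTIFICATIONS at runtime: required since API 33."),
]

def version_prompt(min_sdk: int, target_sdk: int) -> str:
    """Assemble the version-specific preamble injected into the agent's context."""
    lines = [f"Target device range: minSdk={min_sdk}, targetSdk={target_sdk}."]
    for introduced_in, guidance in FRAGMENTS:
        if target_sdk < introduced_in:
            continue  # rule doesn't apply to this build at all
        note = guidance
        if min_sdk < introduced_in:
            # Older devices in range: remind the agent to gate the call.
            note += f" Gate behind Build.VERSION.SDK_INT >= {introduced_in}."
        lines.append(note)
    return "\n".join(lines)

print(version_prompt(min_sdk=24, target_sdk=34))
```

The point is that the agent stops reasoning from a blurry average of fifteen years of documentation and starts reasoning from the actual version window this build ships to.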
I'm building tooling to fix these. More on that soon.