Building TrailSherpa

April 25, 2026 • AI workflows • 5-minute read

A mountain biking co-pilot app with rally-style voice cues. My first deep-dive AI-powered project with Claude Code, started near the end of summer 2025. An experiment to solve a known problem and understand the limits of AI software production.

Drawing the map before we set off

Ok, so before I start, I need to figure out exactly what I'm building and how I'm going to organize everything. In the summer of '25, ChatGPT was my go-to LLM for processing thoughts and creating plans. But not anymore. Ew.

Defining a "Magic prompt" was the secret back then. For TrailSherpa, that meant a 958-line, 28-section build spec that became Claude Code's first input and its source of truth until MVP.

It locked in the tech stack, a full file tree, cue engine constants, an acceptance scenario, a required self-verification protocol, and explicit out-of-scope guards: no cloud, limited API integrations, no microphone, no Android validation.

The tradeoff is obvious. A lot of time before any visible output. What it bought was a deterministic build, a clear verification target, and guardrails that held for weeks. When I reversed myself later (and I did, often), I was reversing from something specific instead of from vibes.

Tail winds out of port but stopped by

Voice interaction design in a high-intensity environment

TrailSherpa talks to riders while they're moving. That's the whole product. You're on singletrack at 10 m/s, you can't read a screen, and you need to know what's coming. Left 3 in 40. Hairpin right. Short descent into sharp right.

Early on, the engine graded turns by angle alone. A screenshot showed "Left 4" on a 90° corner and I remember asking — why 4? A 90° turn could be a tight hairpin at 5m of radius, or a fast sweeper at 500m+ of radius. Same angle, completely different rider experience.

That's when I leaned on rally pace notes. Rally co-drivers don't grade corners by angle, they grade by severity — how fast you can actually take it. A wide 90° is a fast 5. A tight 90° is a slow 1. So I redirected the grading to include a 3-point radius calculation. A 90° turn at 563m of radius correctly reclassified from Grade 2 to Grade 5. After that, the cues matched what a rider feels through the bars.
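A 3-point radius is the circumradius of the triangle formed by three consecutive track points. Here is a minimal Python sketch of that idea, assuming the points are already projected to local x/y metres. The function names and the collinearity cutoff are illustrative, not TrailSherpa's actual code, and the mapping from radius to grade is left out because the article doesn't give the thresholds.

```python
import math

def circumradius_m(p1, p2, p3):
    """Radius (m) of the circle through three (x, y) points given in metres."""
    a = math.dist(p2, p3)
    b = math.dist(p1, p3)
    c = math.dist(p1, p2)
    # The cross product of the two edge vectors is twice the triangle's area.
    twice_area = abs((p2[0] - p1[0]) * (p3[1] - p1[1])
                     - (p2[1] - p1[1]) * (p3[0] - p1[0]))
    if twice_area < 1e-9:  # illustrative cutoff: collinear points, no curvature
        return math.inf
    # R = abc / (4 * area), and 4 * area == 2 * twice_area
    return (a * b * c) / (2.0 * twice_area)

def corner_points(radius_m):
    """Entry, midpoint and exit of a 90-degree left turn on a circle of this radius."""
    return [(radius_m * math.cos(t), radius_m * math.sin(t))
            for t in (0.0, math.pi / 4, math.pi / 2)]

# Same 90 degrees of heading change, very different radius.
tight = circumradius_m(*corner_points(5.0))
sweeper = circumradius_m(*corner_points(563.0))
print(f"tight 90: {tight:.1f} m, sweeper 90: {sweeper:.1f} m")
```

Both corners turn through the same angle, so only the radius separates the hairpin from the sweeper.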

The voice I named Tenzing, after the Everest sherpa. Non-editable baseline profile, ElevenLabs eleven_turbo_v2_5, with a system TTS fallback for when the API was out. Minimum utterance gap 2.5 seconds. Cues bundled with pauses so "left 1 with rock into right 5 over rock into short descent" came out as "caution rocks, hairpin left into right, short descent into sharp right."
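The 2.5-second minimum utterance gap can be pictured as a small gate in front of the speech queue. This is a sketch under assumptions: the class name is hypothetical, the gap is measured start-to-start, and a cue that arrives too soon is dropped rather than queued. The article specifies none of those details.

```python
class UtteranceGate:
    """Allows a cue only if enough time has passed since the last spoken one."""

    def __init__(self, min_gap_s=2.5):
        self.min_gap_s = min_gap_s
        self._last_spoken_s = None  # timestamp of the last allowed cue

    def allow(self, now_s):
        too_soon = (self._last_spoken_s is not None
                    and now_s - self._last_spoken_s < self.min_gap_s)
        if too_soon:
            return False  # assumption: too-soon cues are dropped, not queued
        self._last_spoken_s = now_s
        return True

gate = UtteranceGate()
# Cues arriving at these timestamps (seconds); only some get spoken.
decisions = [gate.allow(t) for t in (0.0, 1.0, 2.5, 4.0, 5.0)]
print(decisions)
```

Bundling several features into one phrase, as in the example above, is one way to keep information flowing without tripping a gate like this.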

This is product work dressed up as engineering. You can't design cues without first thinking about what a rider actually hears, at speed, under load.

Field data, with LiDAR as ground truth

No amount of simulation tells you if the system works. So I field-tested. Many iterations on real trails, five saved GPX recordings, and a 216-line comparative analysis document I wrote myself to work through what the data was telling me.

During one walk-test I noticed the on-screen elevation was lagging behind the real terrain — dips would show after I was already climbing. I logged the observation and traced it back to a 5-point rolling average I'd approved weeks earlier. On GPS data, it smoothed noise. On barometer data, it was acting as a low-pass filter: a real 0.3m climb was showing as only 0.06m of gain. The fix was removing the rolling average from barometer data only, and rebuilding calibration around stability — standard deviation under 0.5m across 10 readings for 5 consecutive seconds, with a 60-second safety timeout. First field test after that landed: "ok field tested - This is the best version of our mapper yet."
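Both halves of that fix are easy to sketch. The Python below first shows how a 5-point trailing average turns a 0.3 m step into 0.06 m on the first sample, then implements a stability gate with the numbers from the text. The function names, the use of population standard deviation, and returning `None` on timeout are assumptions, not the app's real behavior.

```python
from collections import deque
from statistics import mean, pstdev

def trailing_mean(samples, window=5):
    """Rolling average over the last `window` samples."""
    buf = deque(maxlen=window)
    out = []
    for s in samples:
        buf.append(s)
        out.append(mean(buf))
    return out

def wait_for_stable_baseline(readings, window=10, max_std_m=0.5,
                             hold_s=5.0, timeout_s=60.0):
    """readings: iterable of (t_seconds, altitude_m).

    Returns the mean of the last `window` readings once their standard
    deviation has stayed under `max_std_m` for `hold_s` seconds, or None
    if `timeout_s` passes first (assumed timeout behavior).
    """
    buf = deque(maxlen=window)
    start = None
    stable_since = None
    for t, alt in readings:
        if start is None:
            start = t
        if t - start > timeout_s:
            return None
        buf.append(alt)
        if len(buf) == window and pstdev(buf) < max_std_m:
            if stable_since is None:
                stable_since = t
            if t - stable_since >= hold_s:
                return mean(buf)
        else:
            stable_since = None  # any unstable window resets the hold timer
    return None

# A 0.3 m step seen through the 5-point average.
smoothed = trailing_mean([100.0] * 5 + [100.3])
print(round(smoothed[-1] - smoothed[-2], 3))
```

The attenuation is plain arithmetic: one new sample carries one fifth of the window's weight, so 0.3 m shows up as 0.3 / 5 = 0.06 m until more samples arrive.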

The measurement layer is where I got more serious. I designed a LiDAR ground-truth validation workflow using an iPhone 16 Pro and SiteScape. Scan a real trail, export the point cloud, treat the LiDAR elevations as truth, then compare GPS and barometer recordings against it for mean absolute error, RMSE, maximum error, and turn detection precision and recall. The point was to stop debating accuracy and start measuring it. Once you have ground truth, you're iterating against known answers instead of arguing from screenshots.
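The comparison metrics themselves are small. A sketch in Python, assuming the LiDAR and recorded elevations have already been resampled to the same positions along the trail, and that a detected turn counts as correct if it lands within a tolerance of a true turn. The 5 m tolerance, the greedy matching and the function names are illustrative choices, not the workflow's documented ones.

```python
import math

def elevation_errors(truth_m, measured_m):
    """MAE, RMSE and max absolute error between aligned elevation series."""
    if not truth_m or len(truth_m) != len(measured_m):
        raise ValueError("series must be non-empty and the same length")
    errs = [m - t for t, m in zip(truth_m, measured_m)]
    n = len(errs)
    return {
        "mae": sum(abs(e) for e in errs) / n,
        "rmse": math.sqrt(sum(e * e for e in errs) / n),
        "max": max(abs(e) for e in errs),
    }

def turn_precision_recall(truth_pos_m, detected_pos_m, tolerance_m=5.0):
    """Match each detected turn to the nearest unused true turn within tolerance."""
    unmatched = sorted(truth_pos_m)
    true_positives = 0
    for d in sorted(detected_pos_m):
        best = min(unmatched, key=lambda t: abs(t - d), default=None)
        if best is not None and abs(best - d) <= tolerance_m:
            unmatched.remove(best)  # each true turn can be matched only once
            true_positives += 1
    precision = true_positives / len(detected_pos_m) if detected_pos_m else 0.0
    recall = true_positives / len(truth_pos_m) if truth_pos_m else 0.0
    return precision, recall

# Made-up numbers, purely to exercise the functions.
print(elevation_errors([10.0, 11.0, 12.0, 13.0], [10.5, 11.0, 11.0, 13.0]))
print(turn_precision_recall([50.0, 120.0, 200.0], [52.0, 118.0, 300.0, 400.0]))
```

Precision answers "how many announced turns were real", recall answers "how many real turns got announced". A cue engine needs both to be high.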

Experimenting, directing, and catching

One of the more useful decisions was building two tracking engines side by side and shipping a selector in the UI so I could switch between them mid-ride. Legacy was the production engine. Alt was where I sandboxed alternative approaches after studying how competitors handled the same problem. Having both live meant I wasn't debating approaches in theory, I was A/B testing them on real dirt, from the same handlebars, on the same trail.

The trap was architectural drift. A few sessions later I realized barometer improvements had been applied to Alt while production was still using Legacy. The signal-to-noise of the experiment had dropped below what it was worth, so I deleted Alt entirely — around 1,200 lines — and re-applied the work to the engine that actually shipped.

That pattern repeated at smaller scales. One cue read "baby flow descent for 120". Wrong grammar, since "flow" is supposed to replace grade and direction, not stack as a prefix. I caught it and updated the eight call sites. Early on, the system misdiagnosed a trail as "too flat" based on average grade across 633m. I pushed back — average grade over a full trail is not the grade of individual segments — and unlocked the threshold as a configurable parameter.

Each of those catches took a minute to explain and shipped in the next commit. Cumulatively, they're what kept the product honest.
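The average-grade catch is worth seeing in numbers. In this Python sketch the 633 m length comes from the text, but the elevation profile is invented for illustration: a trail that climbs and then descends looks flat on average while its segments are anything but.

```python
def grade_pct(rise_m, run_m):
    """Grade as a percentage: rise over run."""
    return 100.0 * rise_m / run_m

def overall_grade_pct(profile):
    """Grade from the first point to the last, ignoring everything in between."""
    (d0, e0), (d1, e1) = profile[0], profile[-1]
    return grade_pct(e1 - e0, d1 - d0)

def segment_grades_pct(profile):
    """Grade of each consecutive pair of (distance_m, elevation_m) points."""
    return [grade_pct(e1 - e0, d1 - d0)
            for (d0, e0), (d1, e1) in zip(profile, profile[1:])]

# Invented profile: (distance along trail in m, elevation in m).
profile = [(0.0, 100.0), (200.0, 120.0), (400.0, 100.0), (633.0, 101.0)]
print(round(overall_grade_pct(profile), 2))
print([round(g, 2) for g in segment_grades_pct(profile)])
```

A threshold applied to the first number would call this trail flat. Applied per segment, it would not.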

What I learned

Defining the work up front (product scope, build spec, acceptance criteria) compounds across every week that follows. The hours I spent on MAGIC_PROMPT.md before any code existed were the most valuable hours in the project.

The field is the oracle. Simulators will tell you what they're told. A real trail won't.

Subtraction is a design skill. Deleting the Alt engine, removing a filter, cutting First Aid from the MVP — each of those made the product clearer, not smaller.

And directing an AI-assisted build isn't mostly about the prompting. It's about knowing the domain well enough to catch plausible output that happens to be wrong, setting up measurement before you need it, and having the nerve to push back or delete.

TrailSherpa is pre-launch, production-ready code, awaiting the launch decisions that come next.
