ct smith

docs goblin

simulating readers cause I'm not a user researcher

October 28, 2025

Recently, I built an AI-powered testing tool that simulates different user personas navigating documentation toward a defined goal. It measures success rates, identifies navigation issues, and surfaces actionable insights for doc teams. I managed to do this in 7.5 hours of screen time, and at no point did I abuse Claude so heavily that I hit my session limits.

My entire thesis was that documentation works great for some users, terrible for others, and we never test this systematically. I've never worked on a doc team with unfettered access to real users. When I did work somewhere with a user research team, they didn't care about docs. We do all kinds of testing on APIs but not on the docs that explain them (unless you're using something like Doc Detective to contract test the docs). I decided to use AI personas with realistic behaviors to navigate docs, try to find the information needed for real tasks, and provide specific feedback on what's missing or structured stupidly.

What I ended up with was pretty cool for the amount of work that it took:

It was built entirely using Claude.ai and VS Code. I didn't use tools like Copilot or Cursor (not even autocomplete). Just conversational coding and fixes, because I really wanted to understand what I was building, and I didn't want Claude to do it for me.

Okay, let's talk more about how I built it.

the opening prompt

I didn't know whether my goal was achievable or not, but I started with this prompt:

I want to build a tool that reads through the docs and tells you whether the reader met a goal. It needs to simulate realistic user journeys through the docs:

  • Nervous beginner who rapidly cycles through docs looking for something that will help them understand the basics of something
  • Impatient expert who skips to API reference or ctrl + f everything
  • Troubleshooter searching for error messages
  • Someone who likes to read the docs start to finish before starting a task

The tool should measure whether each persona can meet their stated goal. It should generate a report showing which user types your docs serve well vs. poorly. It should use Claude API to simulate realistic reading patterns and decision making.

Claude responded with its typical "great idea" nonsense, but then laid out a solid technical approach. It said to define personas with specific goals and behaviors, use the LLM to simulate navigation decisions page-by-page, and measure success rates along with path efficiency. It also suggested tracking metrics like time-to-success, dead ends hit, and friction points encountered.

Claude's answer felt like enough of a rough shape to go on, so I started building. I didn't build it exactly like Claude suggested, but its initial suggestion was a solid starting point.
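To make that concrete, here's roughly what the result of one simulated journey could look like, using the metrics Claude suggested (success, path, dead ends, friction points). It's a sketch in TypeScript, and the field names are my own illustration rather than the tool's actual output format.

```typescript
// Illustrative result record for one persona's simulated journey.
interface JourneyResult {
  persona: string;          // which persona ran the journey
  goal: string;             // what they were trying to accomplish
  success: boolean;         // did they meet the goal?
  pagesVisited: string[];   // the path taken through the docs
  deadEnds: number;         // pages that led nowhere useful
  frictionPoints: string[]; // moments where the persona got stuck or confused
  timeToSuccess?: number;   // steps it took to meet the goal, if it was met at all
}
```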

timeline: what got built when and what I learned

Like I said, I built this in small chunks of time between Friday night and Saturday evening. I don't have a ton of unbroken time to focus on side projects. In the next few sections, I'll break down how I spent my time and what came of it.

Friday night

My work on Friday night lasted about 3 hours, from ~7pm to 10pm. I go to bed at 10 so I had a hard stop, ha.

hour 1: foundation

hour 2: AI integration

hour 3: journey system

Saturday

Saturday, I spent about 4.5 hours messing with this project. I woke up, worked on it for about two hours, went to run some errands, and wrapped up in the afternoon. I put some finishing touches on it in the evening.

The first thing I did on Saturday was refactor the whole thing into modular files. I initially hadn't been sure that this project would work, so I'd just crammed everything into a few files. I know, for shame. Once it was refactored, I spun up a quick little CLI using Commander, and from here on out, the CLI evolved with the rest of the system.
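If you haven't used Commander before, the skeleton looks something like this. It's a minimal sketch; the command and option names here are illustrative, not exactly what my CLI exposes.

```typescript
// Minimal Commander skeleton; the command, options, and defaults here are illustrative.
import { Command } from "commander";

const program = new Command();

program
  .name("docs-goblin")
  .description("Simulate persona-driven journeys through a documentation site");

program
  .command("run")
  .argument("<url>", "base URL of the docs site to test")
  .option("-p, --persona <name>", "persona to simulate", "nervous-beginner")
  .option("-g, --goal <goal>", "goal the persona is trying to accomplish")
  .action(async (url, options) => {
    // hand off to the journey runner (not shown here)
    console.log(`Running ${options.persona} against ${url}`);
  });

program.parse();
```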

hour 4: dialing in the persona system

The personas are pretty much the most important part of this tool, but I was having a really hard time making them work the way I wanted them to. This was the darkest hour of the build, not to be dramatic or anything.

hour 5: content strategy innovation plus a breakthrough moment

hour 6: site configuration framework

I had hit a wall with the way different doc sites bury their content in different elements; every site has a different HTML structure.

I knew I couldn't hardcode selectors for every single site on the planet, but I also couldn't rely on universal extraction alone. I decided to go with configuration over code here and built a lightweight, extensible way to tell the tool where to look for the doc content on each site.

The universal content extractor was doing its best with semantic HTML (<article>, <main>), but I wanted my tool to be extensible rather than brittle, so I did more work.
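Here's the rough idea, sketched in TypeScript. The selectors and field names are examples I made up for illustration; the actual configs differ, but the point is that each site gets a small config, with semantic HTML as the universal fallback.

```typescript
// Hypothetical per-site configuration; selectors and field names are illustrative.
interface SiteConfig {
  name: string;
  contentSelectors: string[];  // where the main doc content lives
  navSelectors: string[];      // where to collect candidate links from
  ignoreSelectors?: string[];  // page chrome to strip (banners, footers, etc.)
}

// Fallback when no site-specific config matches: lean on semantic HTML.
const universalConfig: SiteConfig = {
  name: "universal",
  contentSelectors: ["article", "main", '[role="main"]'],
  navSelectors: ["nav a", "aside a", ".sidebar a"],
  ignoreSelectors: ["footer", ".cookie-banner"],
};
```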

hour 7: feedback generation

A critical feature for making the tool useful is the feedback it provides. I needed to know what to fix or improve whether a test failed or succeeded. I deferred to Claude here for the main design decisions; I knew what I wanted but didn't have strong opinions about how it needed to be done.

hour 8: added polish

ship it

And that's that. It was a pretty fun project, and I learned a lot about site crawling and the Anthropic APIs. It's also usable and has already given me some cool insights at work: the kind of stuff you'd normally need a user research team watching real humans struggle through the docs to notice.

what makes this special to me

There's a lot of stuff that makes this project kinda special, I think. But that might be because I'm proud of it? IDK. Here are the parts that make me feel most clever:

1. persona-specific reading modes

Different users don't just prefer different content formats; they consume content in fundamentally different ways. I am proud of the three distinct reading modes I came up with:

  • Progressive disclosure (Beginners): Start with 1500-char preview, load full content (5000 chars) only if Claude indicates uncertainty. Saves tokens while providing context when needed.
  • Keyword search (Experts): Extract 1000-char sections around goal-related keywords, simulating Ctrl+F behavior. Experts scan for specific info rather than reading sequentially.
  • Full-always (Methodical): Always load maximum content (5000 chars). Reflects actual thorough reading behavior.

These reading modes lead to 50% token savings for beginners, more realistic behavior simulation, and more believable success rates.
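Here's a sketch of how content selection could branch on reading mode, using the character limits above. It's a minimal sketch; the function and parameter names are mine, not the tool's actual internals.

```typescript
// Sketch of the three reading modes; helper and parameter names are illustrative.
type ReadingMode = "progressive" | "keyword-search" | "full";

function selectContent(
  mode: ReadingMode,
  pageText: string,
  keywords: string[],
  claudeWantsMore: boolean
): string {
  switch (mode) {
    case "progressive":
      // Beginners: 1500-char preview first, full 5000 chars only if Claude is unsure.
      return claudeWantsMore ? pageText.slice(0, 5000) : pageText.slice(0, 1500);
    case "keyword-search":
      // Experts: ~1000-char windows around goal-related keywords (Ctrl+F behavior).
      return keywords
        .map((kw) => {
          const i = pageText.toLowerCase().indexOf(kw.toLowerCase());
          return i === -1 ? "" : pageText.slice(Math.max(0, i - 500), i + 500);
        })
        .filter(Boolean)
        .join("\n---\n");
    case "full":
      // Methodical readers: always load the maximum (5000 chars).
      return pageText.slice(0, 5000);
  }
}
```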

2. feedback system

Success doesn't mean good UX. Accidentally finding an answer in a format that's difficult for a reader to use is still a documentation problem. Beginners are wired to prefer hand-holdy resources like tutorials, and experts are wired to prefer things like API references and implementation guides with code examples.

Success Feedback: Even when tests succeed, the tool evaluates whether the content type matched persona preferences:

  • "Perfect" - Ideal format for this persona
  • "Acceptable" - Found answer but would prefer different format
  • "Poor" - Answer exists but format is frustrating

I can use this information to fill gaps in content strategy for different kinds of readers.

Failure Feedback: When tests fail, Claude analyzes the complete journey to identify:

  • What navigation problem occurred
  • What specific content is missing
  • Actionable recommendation for doc teams
  • User impact assessment

The tool provides actionable insights regardless of outcome. I get recommendations for specific fixes, not just "this failed, good luck loser".
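Put together, the feedback for a single test could be modeled roughly like this. The field names are my own illustration of the structure described above, not the tool's exact output.

```typescript
// Illustrative shape for the feedback the tool produces; field names are my own.
type FormatFit = "perfect" | "acceptable" | "poor";

interface SuccessFeedback {
  outcome: "success";
  formatFit: FormatFit; // did the content type match the persona's preference?
  note: string;         // e.g. "found in an API reference, but this persona wanted a tutorial"
}

interface FailureFeedback {
  outcome: "failure";
  navigationProblem: string; // what went wrong along the journey
  missingContent: string;    // what specific content doesn't exist
  recommendation: string;    // actionable fix for the doc team
  userImpact: string;        // how badly this hurts this persona
}

type Feedback = SuccessFeedback | FailureFeedback;
```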

3. smart link prioritization by persona

Each persona reorders links based on their preferences:

  • Beginners: Heavily favor tutorials, guides, "getting started" (+50 score), penalize API references (-20)
  • Experts: Heavily favor API references (+100), penalize tutorials (-30), simulate jumping to reference docs
  • Methodical Learners: Don't reorder; they respect document structure and read sequentially
  • Debuggers: Favor troubleshooting, error pages, FAQs (+60), search for error-related content

This reordering is designed to mimic how each type of reader actually navigates docs.
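A sketch of what that scoring could look like, using the weights above. Names and keyword patterns are illustrative rather than the tool's exact implementation.

```typescript
// Persona-based link scoring sketch; personas, patterns, and weights mirror the list above.
interface DocLink { text: string; href: string; }

const personaWeights: Record<string, Array<{ pattern: RegExp; score: number }>> = {
  beginner: [
    { pattern: /tutorial|guide|getting started/i, score: 50 },
    { pattern: /api reference/i, score: -20 },
  ],
  expert: [
    { pattern: /api reference/i, score: 100 },
    { pattern: /tutorial/i, score: -30 },
  ],
  debugger: [
    { pattern: /troubleshoot|error|faq/i, score: 60 },
  ],
  methodical: [], // no reordering: respect the document's own order
};

function prioritizeLinks(links: DocLink[], persona: string): DocLink[] {
  const weights = personaWeights[persona] ?? [];
  if (weights.length === 0) return links; // methodical readers keep source order
  const score = (link: DocLink) =>
    weights.reduce((sum, w) => (w.pattern.test(link.text) ? sum + w.score : sum), 0);
  // Sort is stable, so ties keep their original document order.
  return [...links].sort((a, b) => score(b) - score(a));
}
```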

conclusion

This is the most ambitious/complete single thing I've ever built. It's not perfect, but it works really well for what I need it for. It's also well-documented, and you can extend it for your own use case.

Be sure to check it out on GitHub; the README is comprehensive.