Voice Tutor is a self-hosted voice conversation service that pairs Claude with a personal knowledge wiki or an uploaded document. The voice pipeline is built on Pipecat, with Deepgram for transcription and Cartesia for speech. It lives on a Mac Mini and is reachable from any browser, the testbed for a question that feels increasingly real: what changes when an AI tutor actually knows what you're reading and what you've already said?
What it does
Two modes share the same backbone. The default mode is an open-ended voice conversation grounded in my personal wiki: the wiki index lives in the system prompt, individual pages are fetched on-demand via tool call, and a persistent memory.md plus the most-recent transcript give it continuity across sessions. Study companion mode is the document-grounded variant, drop in a PDF, MD, or TXT file and the conversation is bound to that one document, with a Haiku-generated artifact written to disk after the session ends.
Transcripts, per-session summaries, and cost telemetry write automatically on disconnect.
Why this exists
The Socratic method, learning through question-and-answer dialogue, is one of the most durable ways to build recall and genuine understanding. It's harder to drift through a conversation than a textbook, and the act of answering aloud forces the kind of retrieval that makes a concept stick. Voice is the natural medium for it, and an AI tutor that already knows your material can run that loop on demand. I do my best thinking out loud, often on walks, and having those conversations automatically synthesized into notes afterward means the threads don't disappear when I get home.
Existing solutions struggle with user experience, hallucination, persistence in memory, and the right amount of structured learning. I wanted to find out whether wiring voice to a personal knowledge layer changes the texture of these conversations, and if so, what falls out of that for people who are studying, researching, or thinking through complicated material out loud.
Validation
Two structured Reddit runs, both grounded in real posts and run through the Last 30 Days pipeline. The first tested the “drowning in saves” framing against r/PKMS and r/Obsidian. The pain is real, but more diffuse than I'd originally pitched.
The second tested a study companion hypothesis against med, law, and undergrad subreddits. The framings I'd sketched in the design doc, voice-as-study workaround, hallucination-on-named-doc, casebook into ChatGPT, all show up verbatim in titles. The positioning that holds up is narrow and concrete: ChatGPT voice mode bound to one document, with a recap when you're done.
What's next
Use it daily for a week and see whether the wiki context actually changes the texture of the conversation. In parallel, run more validation conversations with the users surfaced in the Reddit work and let what comes back shape iterations on prompting and UX. From there, three branches: integrate Readwise so the document layer pulls directly from what I'm reading, build a custom frontend that fixes the rough edges of the prebuilt UI, or push toward function calling so the voice agent can save notes and create tasks instead of only answering questions.
Stack