Voice-First in Darija: Bypassing the Translation Layer
On building assistive voice AI natively in a language that most NLP tools do not support.
Nadar is a real-time voice companion for visually impaired users that speaks Moroccan Darija natively. Building voice-first AI in a language with no standardized orthography, limited NLP support, and heavy code-switching between Arabic, French, and Spanish required rethinking assumptions that most assistive technology frameworks take for granted.
Most assistive technology defaults to English and layers other languages on top through translation. The user hears their language, but the system thinks in English. This works for standardized languages with strong NLP support. It fails for Darija.
Moroccan Darija is spoken by over 30 million people. It has no official orthography. It borrows vocabulary from Arabic, French, and Spanish, often within the same sentence. Code-switching between Darija and French is not an exception; it is the default mode of conversation. Most NLP tools do not support it. Most speech-to-text systems misclassify it as Modern Standard Arabic and produce nonsensical transcriptions.
Nadar thinks in Darija. The system prompt is written in Darija. The persona is described in Darija. The safety instructions are in Darija. When the model speaks, it speaks as someone who belongs in the conversation. French and English are supported as equal alternatives.
This decision changed how the model addressed the user, what cultural references it could make, and how natural the interaction felt. Users who tested both the English-prompt and Darija-prompt versions described the Darija version as feeling like a different product entirely. Both versions had the same tools, the same memory system, the same vision pipeline. The difference was in presence.
The multilingual embedding model handles Arabic, French, and English natively, which means a user can store a memory in Darija and retrieve it weeks later by asking in French. Cross-language recall works because the embeddings represent meaning at the semantic level. The question we are still investigating is how retrieval ranking changes when the same fact appears in mixed dialect forms. A memory stored in Darija-French code-switched speech should surface when queried in either language, but the ranking confidence varies with how much of the original was in each language.
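Mechanically, cross-language recall reduces to nearest-neighbour search in a shared embedding space: memories are stored as vectors, and a query in any language retrieves whatever is semantically closest. A minimal sketch, with a toy character-bigram embedder standing in for the real multilingual model (which this post does not name); the toy embedder only matches surface overlap, whereas a real model would align meaning across Darija, French, and English:

```python
import math
import zlib

def toy_embed(text, dim=64):
    # Stand-in for a multilingual embedding model. It hashes character
    # bigrams into a fixed-size vector, so it only captures surface
    # similarity; a real model would place translations near each other.
    v = [0.0] * dim
    t = text.lower()
    for i in range(len(t) - 1):
        v[zlib.crc32(t[i:i + 2].encode("utf-8")) % dim] += 1.0
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class MemoryStore:
    """Stores (text, vector) pairs; retrieval is language-agnostic
    because ranking happens in embedding space, not on the raw text."""

    def __init__(self, embed):
        self.embed = embed        # callable: text -> list[float]
        self.memories = []        # list of (text, vector)

    def remember(self, text):
        self.memories.append((text, self.embed(text)))

    def recall(self, query, k=1):
        qv = self.embed(query)
        ranked = sorted(self.memories,
                        key=lambda m: cosine(qv, m[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]
```

The ranking-confidence question from the paragraph above lives in that `cosine` score: a memory stored in code-switched speech sits between the monolingual query vectors, so its similarity to either language's query is real but attenuated.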
Voice-first interfaces demand a different relationship with silence and timing than screen-first interfaces. A sighted user can glance at a loading indicator. A blind user cannot. If the audio stutters, the system feels broken. If there is a pause between sentences, the user wonders if it crashed. The bar for audio quality in a voice-first interface is absolute. Gapless playback, echo cancellation, interrupt detection, silence thresholds, push-to-talk fallback for noisy environments: all of it exists because the user has no visual channel to compensate for audio failure.
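To make the silence-threshold machinery concrete, here is a minimal energy-based end-of-utterance detector of the kind that machinery implies: the user is considered finished once enough consecutive frames fall below an energy threshold. The threshold and hang time are invented placeholders, not Nadar's actual values:

```python
class Endpointer:
    """Declares end-of-utterance after `hang_frames` consecutive
    quiet frames. Values below are illustrative assumptions."""

    def __init__(self, threshold=0.02, hang_frames=15):
        self.threshold = threshold      # RMS energy below this = silence
        self.hang_frames = hang_frames  # e.g. 15 x 20 ms frames = 300 ms
        self._quiet = 0
        self.speaking = False

    def feed(self, rms):
        """Feed one audio frame's RMS energy.
        Returns True exactly when the utterance just ended."""
        if rms >= self.threshold:
            self.speaking = True
            self._quiet = 0
            return False
        if not self.speaking:
            return False                # silence before any speech
        self._quiet += 1
        if self._quiet >= self.hang_frames:
            self.speaking = False
            self._quiet = 0
            return True
        return False
```

The hang time is the tension the paragraph describes: too short and the system barges in mid-sentence, too long and the user hears a dead pause and wonders if it crashed.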
The system has a priority hierarchy that never changes: hazards first. Before describing a scene, it mentions the stairs ahead. Before reading a sign, it calls out the approaching car. This is embedded in the system prompt as the foundational instruction. Everything else (scene description, text reading, contextual commentary) comes after safety. For someone who cannot see what is in front of them, this ordering is the reason the system can be trusted outdoors.
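The hazards-first ordering can be sketched as a priority queue over pending utterances. The priority levels and class below are an illustration of the rule, not Nadar's actual implementation:

```python
import heapq
import itertools

# Lower number = spoken first. Hazards always outrank everything,
# mirroring the hazards-first rule; the other tiers are assumptions.
HAZARD, SCENE, TEXT, COMMENT = 0, 1, 2, 3

class SpeechQueue:
    """Pending utterances ordered by priority, then arrival order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-break preserves arrival order

    def say(self, priority, utterance):
        heapq.heappush(self._heap, (priority, next(self._seq), utterance))

    def next_utterance(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

A queued scene description waits as long as any hazard is pending, which is exactly the "stairs before scenery" ordering described above.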
Proactive behavior required careful calibration. The system performs better when it offers relevant observations without being asked, especially for hazards and high-signal context. But there is a line where proactive behavior becomes noise. The system prompt instructs the model to participate rather than narrate, a distinction that shapes how observations are delivered: useful context shared naturally, as a companion walking alongside you.
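One way to draw the line between helpful and noisy in code is a gate that requires unprompted observations to clear a relevance score and a per-category cooldown, while hazards always pass. The class, categories, and numbers here are illustrative assumptions, not the actual calibration:

```python
import time

class ProactiveGate:
    """Decides whether an unprompted observation is worth voicing.
    Thresholds and cooldowns are invented for illustration."""

    def __init__(self, min_score=0.7, cooldown_s=30.0, now=time.monotonic):
        self.min_score = min_score    # relevance floor for non-hazards
        self.cooldown_s = cooldown_s  # quiet period per category
        self.now = now                # injectable clock for testing
        self._last = {}               # category -> last spoken timestamp

    def should_speak(self, category, score):
        t = self.now()
        if category == "hazard":
            self._last[category] = t
            return True               # safety information never waits
        if score < self.min_score:
            return False
        if t - self._last.get(category, float("-inf")) < self.cooldown_s:
            return False
        self._last[category] = t
        return True
```

The cooldown is what keeps the companion from narrating; only observations that are both relevant and fresh get voiced.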
The relationship graph changed how users interacted with the system over time. When we added face recognition with persistent identity, users began introducing the system to their social world deliberately. They registered family members, explained connections, added notes about preferences. The system became a repository of social context. Users started saying 'you know my mother' rather than 'identify this face,' which signals a shift from tool interaction to companion interaction.
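A sketch of what such a relationship graph might look like, mapping face-recognition identities to relations and free-form notes so that "who is this?" gets a social answer rather than a bare label. The structure and names are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    relation: str                 # e.g. "mother", "neighbour"
    notes: list = field(default_factory=list)

class RelationshipGraph:
    """Links face-recognition IDs to social context, supporting both
    'who is this face?' and 'you know my mother' style queries."""

    def __init__(self):
        self._by_face = {}        # face_id -> Person
        self._by_relation = {}    # relation -> Person

    def register(self, face_id, person):
        self._by_face[face_id] = person
        self._by_relation[person.relation] = person

    def describe(self, face_id):
        p = self._by_face.get(face_id)
        return f"your {p.relation}, {p.name}" if p else "someone I don't recognize"

    def lookup(self, relation):
        return self._by_relation.get(relation)
```

The `lookup` path is what the shift in phrasing exercises: "you know my mother" resolves through the relation, not through a face in frame.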
Building this system surfaced a gap that technology could have addressed years ago. Over 350,000 Moroccans live with visual impairments. Millions more across North Africa and the Middle East speak Darija or similar dialects with no assistive technology in their language. The cost of running the system is dominated by API calls during active use. A single session costs less than a dollar. The technology to build this has existed for less than two years. The need has existed for decades.