Nadar / نظر
a companion that sees, in the language you actually speak
Over 350,000 Moroccans live with visual impairments. Nadar is a real-time voice companion that sees, remembers, and navigates alongside its user in Moroccan Darija, French, or English. It recognizes faces and remembers names. It builds a relationship graph of people in the user's life. It reads menus, medicine labels, and street signs through OCR. It remembers facts across sessions through semantic memory that works across languages. It finds nearby pharmacies and tells you if they are open. It runs on anything with a microphone, a speaker, a camera, and an internet connection. It was built because the people who need it most have been waiting the longest.
The system maintains a live bidirectional WebSocket to the Gemini Live API, streaming audio and video simultaneously. The user speaks, the model sees what the camera sees, and it responds in natural speech. Ten native tools give it capabilities beyond conversation: semantic memory storage and recall through vector embeddings, face registration and identification through AWS Rekognition, OCR through Google Cloud Vision, nearby place search through Google Places, reverse geocoding for location awareness, weather forecasting, and a secondary reasoning model for complex questions. Nine serverless endpoints on Vercel handle tool execution. Supabase PostgreSQL with pgvector stores memories as 3072-dimensional multilingual embeddings. Face metadata lives in Supabase rather than AWS because AWS cannot store Arabic names, a limitation that would have silently excluded the primary users this was built for.
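The tool-execution layer described above can be pictured as a small dispatch table in front of the serverless endpoints. This is an illustrative sketch only: the tool names, the `ToolCall` shape, and the handler bodies are assumptions, not the actual Gemini Live API schema or Nadar's real endpoints.

```typescript
// Hypothetical sketch of the tool-dispatch layer. Tool names and the
// ToolCall shape are illustrative, not the real Gemini Live schema.
type ToolCall = { name: string; args: Record<string, unknown> };
type ToolHandler = (args: Record<string, unknown>) => Promise<string>;

// In the real system each entry would fetch one of the serverless
// endpoints (pgvector insert, Rekognition search, Cloud Vision, ...).
const handlers: Record<string, ToolHandler> = {
  store_memory: async (args) => `stored: ${args.text}`,
  recall_memory: async (args) => `recalled for: ${args.query}`,
  identify_face: async () => "no match",
  read_text: async () => "",
};

// Dispatch a tool call from the live session. Unknown tools fail soft
// with a message instead of throwing, so the voice loop never stalls.
async function dispatch(call: ToolCall): Promise<string> {
  const handler = handlers[call.name];
  if (!handler) return `unknown tool: ${call.name}`;
  return handler(call.args);
}
```

The fail-soft branch matters in a voice-first system: a thrown error would surface as silence, which the user cannot distinguish from a crash.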
Darija is the first language. The system prompt is written in Moroccan Arabic. The persona is described in Darija. The safety instructions are in Darija. When the model speaks, it speaks as someone who belongs in the conversation. French and English are supported as equal alternatives. The multilingual embedding model handles all three languages natively, which means a user can store a memory in Darija and retrieve it weeks later by asking in French.
The system has a priority hierarchy that never changes: hazards first, always. Before describing a beautiful garden, it mentions the three steps ahead. Before reading a sign, it calls out the approaching car. This is embedded in the system prompt across all three languages as the foundational instruction. Everything else (scene description, text reading, contextual commentary) comes after safety. For someone who cannot see what is in front of them, this ordering is the reason the system can be trusted outdoors.
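In Nadar this ordering lives in the system prompt rather than in code, but the rule can be sketched as a stable sort over observation categories. The `Observation` type and category names here are assumptions made for illustration.

```typescript
// Illustrative sketch of the hazards-first ordering. In the real system
// this is a prompt instruction, not code; the categories are assumed.
type Category = "hazard" | "text" | "scene" | "ambient";
type Observation = { category: Category; text: string };

const PRIORITY: Record<Category, number> = {
  hazard: 0,  // steps, traffic, obstacles: always spoken first
  text: 1,    // signs, labels
  scene: 2,   // general description
  ambient: 3, // weather, commentary
};

// Sorting a copy keeps the input intact; ties keep their original
// order, so within a category the model's own sequencing survives.
function orderForSpeech(obs: Observation[]): Observation[] {
  return [...obs].sort((a, b) => PRIORITY[a.category] - PRIORITY[b.category]);
}
```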
Face recognition builds a persistent social world. When the user introduces someone by name, the system captures a frame, indexes the face through AWS Rekognition, and stores the name and any notes in a personal database. The next time that person appears on camera, the system recognizes them and greets them naturally. Over time, it tracks how often people are seen, when they were last encountered, and what the user has said about them. Relationships between people can be stored too: Ahmed is Fatima's son, Layla is a colleague. When identifying someone, the system mentions the connections: who this person is, and how they relate to the people the user already knows.
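A minimal in-memory sketch of that relationship graph might look like the following. The class, field names, and relation format are hypothetical; in the real system this data sits in Supabase alongside Rekognition face IDs.

```typescript
// Hypothetical in-memory sketch of the social graph. The real store is
// Supabase; names and shapes here are illustrative only.
type Person = { name: string; notes: string[]; timesSeen: number; lastSeen?: Date };
type Relation = { from: string; kind: string; to: string }; // e.g. "the son of"

class SocialGraph {
  private people = new Map<string, Person>();
  private relations: Relation[] = [];

  // Called when the user introduces someone, possibly with a note.
  register(name: string, note?: string): void {
    const p = this.people.get(name) ?? { name, notes: [], timesSeen: 0 };
    if (note) p.notes.push(note);
    this.people.set(name, p);
  }

  relate(from: string, kind: string, to: string): void {
    this.relations.push({ from, kind, to });
  }

  // Called after a face match: bump the sighting counters and return
  // connection sentences the model can weave into its greeting.
  sighted(name: string): string[] {
    const p = this.people.get(name);
    if (!p) return [];
    p.timesSeen += 1;
    p.lastSeen = new Date();
    return this.relations
      .filter((r) => r.from === name || r.to === name)
      .map((r) => `${r.from} is ${r.kind} ${r.to}`);
  }
}
```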
The audio pipeline is engineered for a user who cannot glance at a screen to check if the system is working. Microphone capture runs through an AudioWorklet at 16kHz for low-latency processing without blocking the main thread. Playback at 24kHz uses pre-scheduled audio chunks with 200-millisecond lookahead to eliminate gaps between sentences. The user can interrupt at any time by simply speaking; the microphone stays open during playback, and a noise threshold filters out echoes. Two input modes are available: open mic for quiet environments, where the system responds after 1.5 seconds of silence, and push-to-talk for noisy environments, where the user holds anywhere on the screen to speak and releases to trigger a response.
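The gapless-playback idea above comes down to scheduling math: each chunk starts exactly where the previous one ends, and the cursor never falls closer than the lookahead to the current clock. This sketch isolates that timing logic; the actual Web Audio calls (`AudioContext`, `AudioBufferSourceNode`) are omitted, and the class name is an assumption.

```typescript
// Sketch of the gapless-playback scheduling at 24kHz with a 200ms
// lookahead (both figures from the text). Web Audio plumbing omitted.
const SAMPLE_RATE = 24_000;
const LOOKAHEAD_S = 0.2;

class PlaybackScheduler {
  private nextStart = 0;

  // Returns when a chunk of `samples` samples should begin, then
  // advances the cursor by the chunk's exact duration so consecutive
  // chunks butt up against each other with no audible gap.
  schedule(samples: number, now: number): number {
    // If we fell behind (first chunk, or after an interrupt), restart
    // the queue a lookahead's distance ahead of the clock.
    if (this.nextStart < now + LOOKAHEAD_S) this.nextStart = now + LOOKAHEAD_S;
    const startAt = this.nextStart;
    this.nextStart += samples / SAMPLE_RATE;
    return startAt;
  }

  // Barge-in: drop the queued timeline so the user's speech wins.
  reset(): void {
    this.nextStart = 0;
  }
}
```

The interrupt path is why `reset` exists: when the open microphone detects the user speaking over playback, the queued timeline is discarded rather than drained.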
Memory works through meaning. When the user says something worth remembering, the system stores it as a vector embedding, a mathematical representation of the meaning of the sentence. Retrieval works by similarity: asking about your mother's medication will surface the memory where you mentioned what she takes, even if you used different words or a different language. Each user's memories are isolated at the database level. There is no shared memory, no cross-user leakage, no way for one person's private information to appear in another person's session.
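Retrieval by similarity reduces to comparing vectors, and the isolation guarantee is a filter applied before any similarity math. A toy sketch follows: real embeddings are 3072-dimensional multilingual vectors and the search runs in pgvector, so the three-dimensional vectors and the `recall` helper here are purely illustrative.

```typescript
// Toy sketch of similarity-based recall. Real embeddings are
// 3072-dimensional and searched in pgvector; this only shows the math.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

type Memory = { text: string; embedding: number[]; userId: string };

// The userId filter runs first: this is the per-user isolation from the
// text, enforced before any similarity comparison can cross users.
function recall(memories: Memory[], userId: string, query: number[]): Memory | undefined {
  return memories
    .filter((m) => m.userId === userId)
    .sort((a, b) => cosine(b.embedding, query) - cosine(a.embedding, query))[0];
}
```

Because the embedding model is multilingual, a memory stored from a Darija sentence and a French query land near each other in the same vector space, which is what makes cross-language recall work.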
OCR turns the camera into a reading tool. Point it at a restaurant menu, a medicine label, a street sign, or a letter, and the system reads the text aloud. It handles Arabic, French, and English. For a traveler, this means pointing at a foreign menu and hearing what it says. For someone managing medication, it means confirming the right box. For someone receiving a government letter, it means not needing to ask a stranger to read it. The text is extracted through Google Cloud Vision, filtered for confidence, and truncated to a length that makes sense when spoken aloud rather than displayed on screen.
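The post-processing step described above (confidence filtering, then truncation to a speakable length) can be sketched as a small pure function. The thresholds and the sentence-boundary heuristic are assumptions, not the production values.

```typescript
// Sketch of OCR post-processing for speech. Thresholds are assumed,
// not the production values used with Cloud Vision.
type OcrBlock = { text: string; confidence: number };

const MIN_CONFIDENCE = 0.6;
const MAX_SPOKEN_CHARS = 400;

function prepareForSpeech(blocks: OcrBlock[]): string {
  // Drop low-confidence fragments so the voice never reads garbage.
  const joined = blocks
    .filter((b) => b.confidence >= MIN_CONFIDENCE)
    .map((b) => b.text.trim())
    .join(" ");
  if (joined.length <= MAX_SPOKEN_CHARS) return joined;
  // Prefer cutting at a sentence end (Latin or Arabic punctuation)
  // so the truncated read-out still sounds finished.
  const slice = joined.slice(0, MAX_SPOKEN_CHARS);
  const cut = Math.max(slice.lastIndexOf(". "), slice.lastIndexOf("؟"), slice.lastIndexOf("."));
  return cut > 0 ? slice.slice(0, cut + 1) : slice;
}
```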
The system prompt instructs the model to be proactive. If it recognizes a face, it greets naturally. If it sees important text, it reads it. If the weather matters for what the user is doing, it mentions it. This is described in the Darija prompt as participation. The model walks with the user. This distinction shapes how the model behaves: it speaks like a companion offering useful observations.
Complex questions are handled by a secondary model. When the user asks something that requires deeper reasoning, such as explaining a concept, comparing options, or doing math, the system routes the question to a separate Gemini endpoint with a higher thinking budget. The response is constrained to two or three sentences because it will be heard aloud. Simple questions stay on the fast path. The user does not know this routing happens. They ask a question and get an answer at the appropriate depth.
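A routing decision like this could be sketched as a heuristic classifier in front of the two paths. This is an assumption about mechanism: the keyword list, the regex, and the `route` function are illustrative stand-ins, not how Nadar actually triggers its secondary model.

```typescript
// Illustrative routing heuristic; the list and regex are assumptions,
// not Nadar's actual trigger for the secondary reasoning model.
const DEEP_HINTS = ["explain", "compare", "why", "calculate", "difference"];

function needsDeepReasoning(question: string): boolean {
  const q = question.toLowerCase();
  // Conceptual phrasing or inline arithmetic suggests the slow path.
  return DEEP_HINTS.some((h) => q.includes(h)) || /\d+\s*[-+*/]\s*\d+/.test(q);
}

// The slow path would call a second Gemini endpoint with a higher
// thinking budget and a two-to-three-sentence constraint; the fast
// path stays inside the live session.
function route(question: string): "fast" | "reasoning" {
  return needsDeepReasoning(question) ? "reasoning" : "fast";
}
```

Keeping the answer to two or three sentences is not a cosmetic limit: a spoken answer cannot be skimmed, so every extra sentence costs the listener real time.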
The system is useful far beyond visual impairment. A sighted person traveling abroad can point it at a menu in an unfamiliar language and get an instant spoken translation. Someone managing an elderly relative's medication can use it to verify prescriptions. A student can point it at a textbook and have it read and explain a passage. The vision assistant framing was where the project started, but the combination of live camera, voice interaction, semantic memory, and multilingual OCR creates something closer to a general-purpose portable companion. The people who tested it found their own uses that we did not anticipate.
Building for Darija required rethinking assumptions that most frameworks take for granted. Darija is not standardized. It has no official orthography. It borrows from Arabic, French, and Spanish. Most NLP tools do not support it. Writing the system prompt in Darija changed how the model addressed the user, what cultural references it could make, and how natural the interaction felt. Users who tested both the English and Darija versions described the Darija version as feeling like a different product entirely.
The relationship graph changed how users interacted with the system over time. Early versions only identified faces by name. When we added the ability to store relationships between people, users began introducing the system to their social world more deliberately. They would register family members, explain connections, add notes about preferences and habits. The system became a repository of social context. Users started saying things like 'you know my mother' rather than 'identify this face,' which is a small linguistic shift that signals a large change in how they perceived the system.
The hardest engineering problem was audio. Making speech feel natural when it arrives as variable-length chunks over a WebSocket, on a mobile browser, with echo cancellation, noise suppression, and interrupt detection, took more iterations than any other part of the system. The gapless playback pipeline, the AudioWorklet microphone capture, the silence detection tuning, the push-to-talk fallback for noisy environments: all of it exists because a blind user cannot look at a loading spinner or a buffering indicator. If the audio stutters, the system feels broken. If there is a pause between sentences, the user wonders if it crashed. The bar for audio quality in a voice-first interface is absolute.
There are over 350,000 visually impaired people in Morocco alone, and millions more across North Africa and the Middle East who speak Darija or similar dialects with no assistive technology in their language. The cost of running the system is dominated by API calls during active use, and idle time costs nothing. A single session uses less than a dollar in API costs. The technology to build this has existed for less than two years. The need has existed for decades. That gap is the reason the project exists.