TRAJEKT
a visual engine that draws the motion your eyes skip over
Video contains more information than what appears on screen. Between every two frames there is motion, and that motion has direction, speed, rhythm, and shape. TRAJEKT is a real-time engine that captures pixel-level motion from any video source, stores it as spatial memory, and synthesizes geometry directly from that data. The system divides each frame into a sampling grid, tracks brightness changes across frames, estimates displacement vectors, and feeds that trajectory data into any of its synthesis modes that draw curves, shatter fragments, advect particles, deform meshes, or decompose frequencies. The same input with the same parameters always produces identical output. The aesthetic comes from math applied to real spatial data.
The pipeline runs in five sequential stages at 60 frames per second. A frame source interface abstracts over webcam, video file, and screen capture inputs. Spatial memory divides each frame into an NxN grid, defaulting to 576 sample points (24x24), and runs a block-matching motion estimator at each cell centroid. Detected motion becomes a trajectory point carrying position, velocity, source color, magnitude, angle, and timestamp. Each grid cell stores its trajectory in a pre-allocated ring buffer, a circular array that ages out old points without triggering garbage collection, which is critical for real-time rendering stability. The synthesis engine takes active trajectories and generates render commands: spline paths, line segments, or particle positions. The color engine maps trajectory data to RGB through HSL space using one of five modes. A renderer draws the final composite on either a Canvas 2D or WebGL2 backend.

Two subsystems feed the pipeline from outside. The Rust layer, via Tauri, handles MIDI device enumeration and CC event capture through midir, and OSC message parsing over UDP through rosc. Audio input runs through the Web Audio API: an AnalyserNode provides real-time FFT data, beat detection, and BPM estimation that modulate synthesis parameters per frame.
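The record flowing between stages can be sketched as a TypeScript type. The field names here are assumptions; the fields themselves are the ones the pipeline description lists:

```typescript
// Hypothetical shape of the data record passed between pipeline stages.
interface TrajectoryPoint {
  x: number; y: number;             // position of the grid-cell centroid
  vx: number; vy: number;           // estimated displacement per frame
  color: [number, number, number];  // source pixel color (RGB)
  magnitude: number;                // speed: length of the velocity vector
  angle: number;                    // direction of motion in radians
  timestamp: number;                // frame time the point was captured
}

// Each synthesis mode consumes the same points and emits render commands.
type RenderCommand =
  | { kind: "spline"; points: TrajectoryPoint[] }
  | { kind: "segment"; from: TrajectoryPoint; to: TrajectoryPoint }
  | { kind: "particle"; x: number; y: number; color: [number, number, number] };
```

Because every mode consumes the same `TrajectoryPoint` stream and emits `RenderCommand`s, swapping synthesis modes is a matter of routing, not re-capture.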
Motion detection works by sampling brightness at each grid cell centroid in both the current and previous frame. If the delta exceeds a sensitivity threshold, the estimator searches a neighborhood around the centroid for the minimum brightness difference, accumulates a weighted displacement vector, and normalizes it into a velocity. This is a simplified block-matching optical flow, chosen over dense methods like Lucas-Kanade because it runs at frame rate on a CPU without GPU acceleration. The tradeoff is spatial resolution: a 24x24 grid samples 576 points instead of every pixel. In practice, this is more than enough to capture the structure of motion in a scene, and the grid density is tunable up to 64x64 for finer detail at the cost of computation.
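A minimal sketch of the per-cell step, assuming a flat grayscale frame buffer. Unlike the real estimator, this version returns the single best-match offset rather than the weighted, normalized displacement the text describes; all names are illustrative:

```typescript
type Frame = { data: Float32Array; width: number; height: number };

// Clamped brightness lookup on a flat grayscale buffer.
function luma(f: Frame, x: number, y: number): number {
  const xi = Math.min(Math.max(x | 0, 0), f.width - 1);
  const yi = Math.min(Math.max(y | 0, 0), f.height - 1);
  return f.data[yi * f.width + xi];
}

// If brightness at the centroid changed beyond `sensitivity`, search a
// neighborhood in the current frame for the pixel whose brightness best
// matches the centroid's previous brightness; its offset is the motion.
function estimateMotion(
  prev: Frame, curr: Frame,
  cx: number, cy: number,
  sensitivity = 0.05, radius = 4
): { vx: number; vy: number } | null {
  const before = luma(prev, cx, cy);
  if (Math.abs(luma(curr, cx, cy) - before) < sensitivity) return null;

  let best = Infinity, bx = 0, by = 0;
  for (let dy = -radius; dy <= radius; dy++) {
    for (let dx = -radius; dx <= radius; dx++) {
      const diff = Math.abs(luma(curr, cx + dx, cy + dy) - before);
      if (diff < best) { best = diff; bx = dx; by = dy; }
    }
  }
  return { vx: bx, vy: by };
}
```

The search cost is fixed per cell, (2·radius+1)² lookups, which is what keeps the whole grid tractable at frame rate on a CPU.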
Curve synthesis uses Catmull-Rom spline interpolation through trajectory point sequences. Unlike Bezier curves, Catmull-Rom splines pass through every control point, which means the rendered curve follows the actual measured motion path rather than approximating it. Each segment between two points is subdivided into configurable steps, defaulting to eight, producing smooth arcs from discrete samples. Line width is data-driven: it scales with the magnitude of the newest trajectory point in that cell, so fast motion draws thick and slow motion draws thin. This is a direct encoding of the data into geometry.
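The interpolation step can be sketched in one dimension (in practice it would run once per coordinate). `catmullRom` and `splinePath` are illustrative names; the matrix form is the standard uniform Catmull-Rom basis:

```typescript
// Uniform Catmull-Rom interpolation between p1 and p2, with p0 and p3
// as neighboring control points. At t=0 it returns exactly p1, which is
// why the curve passes through every measured trajectory point.
function catmullRom(p0: number, p1: number, p2: number, p3: number, t: number): number {
  const t2 = t * t, t3 = t2 * t;
  return 0.5 * (
    2 * p1 +
    (p2 - p0) * t +
    (2 * p0 - 5 * p1 + 4 * p2 - p3) * t2 +
    (3 * p1 - 3 * p2 + p0 - p3) * t3
  );
}

// Subdivide each segment into `steps` samples (eight by default, as in
// TRAJEKT), clamping endpoints so the first and last segments still
// have four control points.
function splinePath(xs: number[], steps = 8): number[] {
  const out: number[] = [];
  for (let i = 0; i < xs.length - 1; i++) {
    const p0 = xs[Math.max(i - 1, 0)];
    const p3 = xs[Math.min(i + 2, xs.length - 1)];
    for (let s = 0; s < steps; s++) {
      out.push(catmullRom(p0, xs[i], xs[i + 1], p3, s / steps));
    }
  }
  out.push(xs[xs.length - 1]);
  return out;
}
```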
Five synthesis modes reveal different aspects of the same underlying motion data. Curves draw smooth trails that show where motion has been. Fragments shatter trajectory segments into crystalline geometry with deterministic pseudo-random offsets and velocity-driven spikes, producing a cracked aesthetic. Flow field spawns a particle system that advects through a bilinearly-interpolated vector field built from the grid velocities; particles leave smooth trails and are flagged as stalled after fifteen inactive frames. Mesh deformation overlays a regular grid whose vertices warp according to local motion vectors. Frequency decomposition applies FFT to trajectory time-series and renders oscillating wave overlays that can sync to audio BPM. All five modes read from the same spatial memory. Switching modes does not restart anything; it just changes how the data is interpreted.
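The flow-field step, sketched below under assumptions: the coarse grid velocities live in a flat array, positions are in grid coordinates, and integration is simple Euler. Function names are illustrative:

```typescript
type Vec = { x: number; y: number };

// Bilinearly interpolate a velocity from a coarse n x n grid of cell
// vectors at a continuous position in grid coordinates.
function sampleField(field: Vec[], n: number, gx: number, gy: number): Vec {
  const x0 = Math.min(Math.max(Math.floor(gx), 0), n - 2);
  const y0 = Math.min(Math.max(Math.floor(gy), 0), n - 2);
  const fx = gx - x0, fy = gy - y0;
  const at = (x: number, y: number) => field[y * n + x];
  const lerp = (a: number, b: number, t: number) => a + (b - a) * t;
  const top: Vec = {
    x: lerp(at(x0, y0).x, at(x0 + 1, y0).x, fx),
    y: lerp(at(x0, y0).y, at(x0 + 1, y0).y, fx),
  };
  const bot: Vec = {
    x: lerp(at(x0, y0 + 1).x, at(x0 + 1, y0 + 1).x, fx),
    y: lerp(at(x0, y0 + 1).y, at(x0 + 1, y0 + 1).y, fx),
  };
  return { x: lerp(top.x, bot.x, fy), y: lerp(top.y, bot.y, fy) };
}

// Advect a particle one step through the field (Euler integration).
function advect(p: Vec, field: Vec[], n: number, dt = 1): Vec {
  const v = sampleField(field, n, p.x, p.y);
  return { x: p.x + v.x * dt, y: p.y + v.y * dt };
}
```

Bilinear interpolation is what turns 576 discrete cell vectors into a continuous field; without it, particles would snap between cell velocities and the trails would kink at cell boundaries.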
Color mapping operates in HSL space with five modes that each tell a different story about the motion. Velocity mode maps magnitude to hue on a blue-to-red gradient, so you see speed. Source mode samples the actual pixel color from the video at each trajectory origin, so the curves inherit the scene palette. Temporal mode maps the age of each point to a violet-to-amber gradient, revealing how motion flows through time along a single curve. Direction mode maps the angle of motion to hue, making it immediately visible which parts of the scene move in the same direction. A custom mode accepts user-defined expressions with access to magnitude, angle, velocity, color, age, beat, and energy variables.
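Two of the modes can be sketched directly; the `maxMagnitude` normalization constant and function names are assumptions, not TRAJEKT's actual values:

```typescript
// Velocity mode: map motion magnitude onto a blue-to-red hue ramp.
// Returns an HSL hue in degrees; maxMagnitude is a hypothetical
// normalization constant for the fastest expected motion.
function velocityHue(magnitude: number, maxMagnitude = 20): number {
  const t = Math.min(magnitude / maxMagnitude, 1);
  return 240 - 240 * t; // 240 deg (blue) when still, 0 deg (red) when fast
}

// Direction mode: map the motion angle straight onto the hue wheel,
// so co-moving regions of the scene share a color.
function directionHue(angleRad: number): number {
  const deg = (angleRad * 180) / Math.PI;
  return ((deg % 360) + 360) % 360; // normalize into [0, 360)
}
```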
The ring buffer is a small architectural decision that shaped the entire system. Each grid cell stores its trajectory as a fixed-size circular array. When a new point arrives and the buffer is full, the oldest point is silently overwritten. This means the system never allocates memory during the render loop, which eliminates garbage collection pauses that would cause frame drops. But the deeper consequence is that trajectory history has a natural lifespan governed by a single parameter: history depth. A depth of 30 means each cell remembers 30 frames of motion, roughly half a second at 60fps. Increase it and curves become longer, slower to fade, more ghostly. Decrease it and the visualization becomes twitchy, immediate, nervous. The entire visual character of the output shifts with one number.
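A minimal sketch of the structure, assuming a generic element type; in TRAJEKT the capacity would be the history-depth parameter and the elements trajectory points:

```typescript
// Fixed-capacity ring buffer: pushing onto a full buffer silently
// overwrites the oldest entry, so no allocation happens after
// construction and nothing is handed to the garbage collector.
class RingBuffer<T> {
  private buf: (T | undefined)[];
  private head = 0; // index of the next write
  private count = 0;

  constructor(private capacity: number) {
    this.buf = new Array(capacity);
  }

  push(item: T): void {
    this.buf[this.head] = item;
    this.head = (this.head + 1) % this.capacity;
    if (this.count < this.capacity) this.count++;
  }

  // Oldest-to-newest view, e.g. for drawing a trajectory curve.
  toArray(): T[] {
    const out: T[] = [];
    const start = (this.head - this.count + this.capacity) % this.capacity;
    for (let i = 0; i < this.count; i++) {
      out.push(this.buf[(start + i) % this.capacity] as T);
    }
    return out;
  }

  get length(): number { return this.count; }
}
```

The FIFO eviction in `push` is the single mechanism behind both the stable frame times and the fading-trail aesthetic described later.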
Audio reactivity connects the Web Audio API to synthesis parameters through a modulation layer. A microphone input feeds an AnalyserNode that provides real-time FFT data across frequency bands. Beat detection uses an energy threshold with debounce, and BPM estimation averages intervals between detected beats. These audio signals modulate glow intensity, line width, particle speed, and wave amplitude per frame. The frequency synthesis mode can lock its oscillation to the detected BPM, producing visuals that pulse with the music. The system does not analyze the audio in any deep sense. It reacts to energy and rhythm, which turns out to be enough to make the output feel alive.
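The beat-and-BPM step can be sketched as follows; the threshold, debounce window, and interval-window size are illustrative defaults, not TRAJEKT's tuned values:

```typescript
// Energy-threshold beat detector with debounce, plus BPM estimated by
// averaging the intervals between recent detected beats.
class BeatDetector {
  private lastBeat = -Infinity;
  private intervals: number[] = [];

  constructor(
    private threshold = 0.3,  // minimum energy to count as a beat
    private debounceMs = 250  // ignore beats closer together than this
  ) {}

  // Feed one frame of energy (e.g. averaged FFT magnitude).
  // Returns true when this frame is counted as a beat.
  onFrame(energy: number, nowMs: number): boolean {
    if (energy < this.threshold || nowMs - this.lastBeat < this.debounceMs) return false;
    if (this.lastBeat > -Infinity) {
      this.intervals.push(nowMs - this.lastBeat);
      if (this.intervals.length > 8) this.intervals.shift(); // rolling window
    }
    this.lastBeat = nowMs;
    return true;
  }

  bpm(): number | null {
    if (this.intervals.length === 0) return null;
    const mean = this.intervals.reduce((a, b) => a + b, 0) / this.intervals.length;
    return 60000 / mean; // ms per beat -> beats per minute
  }
}
```

The debounce is what keeps a single sustained loud passage from registering as a burst of beats; without it the BPM estimate collapses toward the frame rate.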
MIDI and OSC support turn the engine into a live performance instrument. MIDI CC messages from any connected controller map to engine parameters, so a physical knob can control grid density, sensitivity, glow, or synthesis mode in real time. OSC messages arrive over UDP, parsed by the Rust backend, enabling integration with TouchDesigner, Ableton, or any creative tool that speaks OSC. The parameter space is large enough that two people working with the same video source and different controller mappings will produce completely different outputs. The engine becomes an instrument whose timbre is motion.
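The mapping layer amounts to scaling 7-bit CC values into each parameter's range. A hypothetical sketch, with invented names and an invented `gridSize` binding for illustration:

```typescript
// Hypothetical MIDI mapping layer: a CC number binds to a parameter
// with a target range, and incoming 0-127 values are scaled into it.
type Mapping = { param: string; min: number; max: number };

class MidiMap {
  private bindings = new Map<number, Mapping>();

  bind(cc: number, param: string, min: number, max: number): void {
    this.bindings.set(cc, { param, min, max });
  }

  // Returns the parameter update for a CC event, or null if unbound.
  onCC(cc: number, value: number): { param: string; value: number } | null {
    const m = this.bindings.get(cc);
    if (!m) return null;
    const t = value / 127; // 7-bit MIDI CC range
    return { param: m.param, value: m.min + (m.max - m.min) * t };
  }
}
```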
Export produces compositable output. Beyond standard WebM and MP4 recording, the system can render curves on a transparent background as a PNG sequence. This means the visualization layer can be imported into Premiere, DaVinci Resolve, or After Effects as an overlay track, separated from the source video entirely. The visual output lives independently of the tool that created it.
The most significant observation was how much visual information the grid resolution parameter encodes. A 16x16 grid produces broad, sweeping curves that capture the general flow of a scene. A 48x48 grid produces dense, fibrous geometry that tracks individual edges and textures. The same video, the same synthesis mode, the same color mapping, but a completely different visual character from one integer. We expected grid resolution to affect detail. We did not expect it to change the emotional register of the output. Low grids feel cinematic. High grids feel biological.
Deterministic pseudo-randomness in the fragment synthesizer turned out to be more important than we anticipated. The fragment mode uses a seeded random function based on each point's timestamp and index, so the same input always shatters into the same geometry. This means you can scrub back through a video file and the fragments will reconstruct identically. When we tested with non-deterministic randomness, scrubbing produced different geometry each time, and the output felt arbitrary rather than authored. Determinism made the fragments feel like they belonged to the video.
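The seeding idea can be sketched with an integer hash; the mixing constants below are illustrative (any stable hash of timestamp and index works), not the ones TRAJEKT uses:

```typescript
// Deterministic pseudo-random value in [0, 1) derived from a point's
// timestamp and index, so the same input always shatters into the same
// fragment geometry, even when scrubbing backward through a file.
function seededRandom(timestamp: number, index: number): number {
  let h = (timestamp * 374761393 + index * 668265263) | 0; // mix inputs
  h = Math.imul(h ^ (h >>> 13), 1274126177);               // avalanche
  h ^= h >>> 16;
  return (h >>> 0) / 4294967296; // map 32-bit hash to [0, 1)
}
```

Because the seed is a pure function of data already in the trajectory point, no random state needs to be stored or replayed: re-deriving a frame re-derives its fragments.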
The ring buffer's fixed depth created an unintended visual phenomenon we started calling the ghost boundary. When a moving object stops, its trajectory points age out from the oldest end of the buffer while the newest points hold still. The curve appears to retract from the point where motion began, pulling back toward where the object stopped, like a contrail dissolving. We did not animate this. It is a direct consequence of FIFO eviction in the ring buffer. The visual system developed its own sense of impermanence from a data structure choice.
Flow field particles revealed large-scale motion structure that no other mode made visible. In a video of a crowded street, the curve and fragment modes showed individual trajectories: this person walked left, that car turned right. The flow field mode, by advecting thousands of particles through an interpolated vector field, showed the aggregate: a river of motion flowing down the sidewalk, an eddy forming around a street vendor, a laminar stream of traffic separating into turbulent merges at an intersection. The same data, read at a different scale, told a completely different story.
Building a system with no machine learning in a lab that primarily works with language models clarified something about when AI is and is not the right tool. TRAJEKT does not need a model because the problem is not ambiguous. Motion estimation is physics. Spline interpolation is geometry. Color mapping is signal processing. Adding a neural network would not improve accuracy; it would add latency, nondeterminism, and a dependency on training data that does not exist for this domain. The constraint forced every visual decision to be traceable to a measurement, which made the output fully debuggable and the creative process more transparent.