Camera Aware Chatbot
An AI proof-of-concept that feeds real-time camera vision context into a local LLM, fronted by an expressive Unity avatar.
Camera Aware Chatbot was built as a proof-of-concept for AffectiLink in April 2025, showcasing how live camera data can enrich on-device LLM conversations. The system continuously feeds what it “sees” through your webcam into a vision model, then injects that descriptive context into every chat prompt, giving your local AI genuine visual awareness.
Under the hood, two Flask microservices power the pipeline. The first server reads your camera stream and runs it through the Florence-2 vision model, outputting detailed text descriptions of the scene at regular intervals. The second server acts as an orchestrator: whenever the Unity client sends a message, it pulls in the latest camera description, wraps it into the prompt for a local LLM (Gemma 3), and returns the combined reply, essentially performing retrieval-augmented generation (RAG) in real time with visual context.
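A minimal sketch of that two-service split, assuming Florence-2 is loaded via Hugging Face transformers and Gemma 3 is served through a local Ollama endpoint; the ports, routes, and model tags below are illustrative stand-ins, not the project's actual names:

```python
# vision_server.py -- service 1: caption the webcam feed at a fixed interval.
# Illustrative sketch; real routes, ports, and model choices may differ.
import threading
import time

import cv2
from flask import Flask, jsonify
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

app = Flask(__name__)
latest_description = "Nothing seen yet."

processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True)

def caption_loop(interval_s: float = 2.0) -> None:
    """Grab a frame, run Florence-2 captioning, cache the text description."""
    global latest_description
    cam = cv2.VideoCapture(0)
    task = "<MORE_DETAILED_CAPTION>"
    while True:
        ok, frame = cam.read()
        if ok:
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = processor(text=task, images=image, return_tensors="pt")
            ids = model.generate(input_ids=inputs["input_ids"],
                                 pixel_values=inputs["pixel_values"],
                                 max_new_tokens=256)
            raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
            parsed = processor.post_process_generation(
                raw, task=task, image_size=image.size)
            latest_description = parsed[task]
        time.sleep(interval_s)

@app.route("/description")
def description():
    return jsonify({"description": latest_description})

if __name__ == "__main__":
    threading.Thread(target=caption_loop, daemon=True).start()
    app.run(port=5001)
```

The orchestrator then only ever reads the cached description, so a slow caption never blocks a chat turn:

```python
# orchestrator.py -- service 2: fuse the latest scene description into each prompt.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
VISION_URL = "http://localhost:5001/description"  # service 1 (assumed port)
OLLAMA_URL = "http://localhost:11434/api/chat"    # assumes Gemma 3 runs under Ollama

@app.route("/chat", methods=["POST"])
def chat():
    user_message = request.json["message"]
    scene = requests.get(VISION_URL, timeout=2).json()["description"]
    # Real-time "RAG": retrieve the freshest camera context, inject it
    # as a system message ahead of the user's question.
    payload = {
        "model": "gemma3",
        "stream": False,
        "messages": [
            {"role": "system",
             "content": f"You can see through a webcam. Current view: {scene}"},
            {"role": "user", "content": user_message},
        ],
    }
    reply = requests.post(OLLAMA_URL, json=payload).json()["message"]["content"]
    return jsonify({"reply": reply, "scene": scene})

if __name__ == "__main__":
    app.run(port=5002)
```

Running captioning on a background thread and caching the latest description is one way to decouple vision latency from chat latency, which is the first challenge listed below.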
On the front end, a Unity application provides a simple text input and an expressive VRM avatar that reacts to the conversation with multi-step dialogue and emotion cues. Built with C# and UniVRM, the avatar can talk, smile, and express emotions as it answers your questions, all while running entirely offline. This POC laid the groundwork for Mira Desktop AI, proving that a truly private, camera-aware assistant could live on your machine without any cloud dependencies.
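The avatar front end itself is C#, but the round trip it performs can be exercised from any HTTP client; for instance, against the hypothetical /chat route from the sketch above:

```python
import requests

# Stand in for the Unity client: one chat turn against the orchestrator.
resp = requests.post("http://localhost:5002/chat",
                     json={"message": "What am I holding right now?"})
print(resp.json()["reply"])
```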
Challenges
Keeping response times low while camera capture, vision inference, and LLM inference all run in real time
Working with multiple large models locally under limited resources
Testing multiple VLMs and LLMs to choose the combination with the best results
Ensuring the LLM's output stays compatible with the dialogue system (see the sketch after this list)
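On that last point, one common approach (a sketch under assumed field names, not the project's actual schema) is to ask the LLM for JSON and validate it before it ever reaches the avatar's dialogue system:

```python
import json

# Emotion cues the dialogue system accepts; names here are illustrative.
ALLOWED_EMOTIONS = {"neutral", "happy", "sad", "angry", "surprised"}

def parse_llm_reply(raw: str) -> list[dict]:
    """Validate the model's JSON into dialogue steps, falling back to a
    single neutral line if the output doesn't parse or uses unknown fields."""
    try:
        steps = json.loads(raw)
        assert isinstance(steps, list)
        for step in steps:
            assert isinstance(step["text"], str)
            assert step["emotion"] in ALLOWED_EMOTIONS
        return steps
    except (AssertionError, KeyError, TypeError, ValueError):
        # ValueError covers json.JSONDecodeError; never crash the avatar.
        return [{"text": raw, "emotion": "neutral"}]

# A well-formed reply drives the avatar one dialogue step at a time.
reply = '[{"text": "Nice plant on your desk!", "emotion": "happy"}]'
for step in parse_llm_reply(reply):
    print(step["emotion"], "->", step["text"])
```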
Front-end
Unity (C#, UniVRM)
Back-end
Flask microservices (Python)
AI
Florence-2 (VLM), Gemma 3 (LLM)
Tools
Date
2025-04
Status