Camera Aware Chatbot

An AI proof-of-concept that feeds real-time camera context into a local LLM, surfaced through an expressive Unity avatar.

Computer Vision
AI
Chatbot
Unity
Avatar
Project Overview

Camera Aware Chatbot was built as a proof-of-concept for AffectiLink in April 2025, showcasing how live camera data can enrich on-device LLM conversations. The system constantly feeds what it “sees” through your webcam into a vision model, then injects that descriptive context into every chat response, giving your local AI genuine visual awareness.
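As a rough sketch of that capture-and-describe loop, the vision server can run Florence 2 (the project's VLM, per the stack below) in a background thread and serve the latest caption over HTTP. The model variant, task prompt, interval, port, and endpoint name here are illustrative assumptions, not the project's exact values.

```python
# Sketch of the vision server: captures webcam frames, captions them with
# Florence 2 at a fixed interval, and serves the latest description over HTTP.
import threading
import time

import cv2
from PIL import Image
from flask import Flask, jsonify
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Florence-2-base"  # assumed variant
TASK = "<MORE_DETAILED_CAPTION>"        # Florence 2 task prompt for rich captions

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)

app = Flask(__name__)
latest_description = "Nothing seen yet."

def caption_loop(interval_s: float = 2.0) -> None:
    """Continuously grab a frame and refresh the scene description."""
    global latest_description
    cam = cv2.VideoCapture(0)
    while True:
        ok, frame = cam.read()
        if ok:
            # OpenCV delivers BGR; Florence 2 expects an RGB PIL image.
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = processor(text=TASK, images=image, return_tensors="pt")
            ids = model.generate(
                input_ids=inputs["input_ids"],
                pixel_values=inputs["pixel_values"],
                max_new_tokens=256,
            )
            raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
            parsed = processor.post_process_generation(
                raw, task=TASK, image_size=(image.width, image.height)
            )
            latest_description = parsed[TASK]
        time.sleep(interval_s)

@app.route("/description")
def description():
    return jsonify({"description": latest_description})

if __name__ == "__main__":
    threading.Thread(target=caption_loop, daemon=True).start()
    app.run(port=5001)
```

Captioning in a daemon thread keeps the HTTP endpoint instantly responsive: chat requests never wait on a fresh inference, they simply read the most recent description.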

Under the hood, two Flask microservices power the pipeline. The first server reads your camera stream and runs it through the Florence 2 vision-language model, outputting detailed text descriptions of the scene at regular intervals. The second server acts as an orchestrator: whenever the Unity client sends a message, it pulls in the latest camera description, wraps it into the prompt for a local LLM (Gemma 3), and returns the combined reply, essentially performing real-time retrieval-augmented generation (RAG) with visual context.
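A minimal sketch of that orchestrator, assuming the vision server above and Ollama serving Gemma 3 locally; the port, endpoint names, model tag, and prompt wording are assumptions for illustration.

```python
# Sketch of the orchestrator: on each chat request, pull the newest camera
# description, inject it into the system prompt, and query Gemma 3 via Ollama.
import ollama
import requests
from flask import Flask, jsonify, request

VISION_URL = "http://localhost:5001/description"  # assumed vision-server endpoint
LLM_MODEL = "gemma3:4b"                           # assumed Ollama model tag

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    user_message = request.json["message"]

    # Real-time RAG step: retrieve the freshest visual context.
    description = requests.get(VISION_URL, timeout=2).json()["description"]

    system_prompt = (
        "You are a helpful assistant with live camera vision. "
        f"Right now you can see: {description}\n"
        "Use this context naturally when it is relevant."
    )
    reply = ollama.chat(
        model=LLM_MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    return jsonify({"reply": reply["message"]["content"]})

if __name__ == "__main__":
    app.run(port=5002)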

On the front end, a Unity application provides a simple text input and an expressive VRM avatar that reacts to the conversation with multi-step dialogue and emotion cues. Built with C# and UniVRM, the avatar can talk, smile, and express emotions as it answers your questions, all while running entirely offline. This POC laid the groundwork for Mira Desktop AI, proving that a truly private, camera-aware assistant could live on your machine without any cloud dependencies.
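For quick testing without the Unity client (from a script, or equivalently from Insomnia), a stand-in request can exercise the assumed /chat endpoint directly; the payload shape matches the orchestrator sketch above.

```python
# Minimal stand-in for the Unity client: send a message, print the reply.
import requests

resp = requests.post(
    "http://localhost:5002/chat",
    json={"message": "What can you see on my desk right now?"},
    timeout=60,
)
print(resp.json()["reply"])
```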

Key Features
Camera-aware chatbot that can interact with the user based on their camera feed
100% local Computer Vision and AI processing
Multi-step dialogue system with an avatar that can express emotions
Florence 2 model for advanced visual understanding
Gemma 3 model for natural language processing
Flask server acting as an orchestrator, ensuring consistent output from the LLM and VLM
Technical Challenges

Keeping response times low while running real-time camera processing and LLM inference simultaneously

Running multiple large models locally with limited hardware resources

Evaluating multiple VLMs and LLMs to choose the best-performing combination

Ensuring the output from the LLM is compatible with the dialogue system (see the sketch below)
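One plausible way to tackle that last challenge: ask the LLM for JSON dialogue steps with one emotion cue per step, then validate before anything reaches the avatar. The schema and emotion set below are illustrative assumptions, not the project's exact format.

```python
# Sketch of output validation for the dialogue system: parse the LLM reply
# into dialogue steps and fall back to a neutral single step if it is malformed.
import json

ALLOWED_EMOTIONS = {"neutral", "happy", "sad", "angry", "surprised"}  # assumed set

def parse_dialogue(raw: str) -> list[dict]:
    """Validate the LLM reply into [{'text': str, 'emotion': str}, ...]."""
    try:
        steps = json.loads(raw)
        assert isinstance(steps, list) and steps
        for step in steps:
            assert isinstance(step["text"], str)
            assert step["emotion"] in ALLOWED_EMOTIONS
        return steps
    except (AssertionError, KeyError, TypeError, json.JSONDecodeError):
        # Malformed output: degrade gracefully instead of breaking the avatar.
        return [{"text": raw, "emotion": "neutral"}]
```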

Technologies Used

Front-end

Unity
VRM
C#

Back-end

Flask
Python

AI

Ollama
PyTorch
Transformers
LLM - Gemma 3
VLM - Florence 2

Tools

Insomnia
Adobe Mixamo
Hugging Face
Project Details

Date

2025-04

Status

Completed