This project aims to develop a cost-effective, modular system that processes video streams from a camera and translates them into audio output in real time. Designed to function both as a handheld or wearable device and as a component mounted on a robotic guide dog, the system offers flexibility across different use cases. It consists of a portable processor with a GPU (Jetson Nano) and an AI-enabled camera (OAK-D Lite). Lightweight models such as YOLOv8n perform visual tasks including object detection and depth estimation, with distinct indoor and outdoor modes tailored to recognize context-specific objects relevant to visually impaired users. The system leverages internet connectivity when available to access online resources, reduce the local computation load, and utilize state-of-the-art models. It also fully supports offline operation, relying on onboard AI vision models and text-to-speech conversion to ensure robust performance without an internet connection. Results are delivered as spoken descriptions, providing timely and meaningful feedback while balancing affordability, portability, reliability, and power efficiency.
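
The sketch below illustrates the offline-mode detect-and-describe loop outlined above: a YOLOv8n model produces object labels from camera frames, which are then spoken aloud with an on-device text-to-speech engine. It is a minimal illustration under stated assumptions, not the project's implementation: it stands in an OpenCV webcam capture for the OAK-D Lite's DepthAI pipeline, omits depth estimation and the indoor/outdoor mode switch, and assumes the `ultralytics` and `pyttsx3` packages and a 0.5 confidence threshold as illustrative choices.

```python
# Minimal offline detect-and-describe sketch (illustrative only).
# Assumptions: cv2.VideoCapture stands in for the OAK-D Lite stream,
# ultralytics YOLOv8n for detection, pyttsx3 for offline text-to-speech.
import cv2
import pyttsx3
from ultralytics import YOLO

model = YOLO("yolov8n.pt")      # lightweight YOLOv8n detector
tts = pyttsx3.init()            # offline text-to-speech engine
cap = cv2.VideoCapture(0)       # placeholder for the OAK-D Lite camera

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Run object detection on the current frame
    detections = model(frame, verbose=False)[0]
    # Collect the names of confidently detected objects
    labels = {model.names[int(box.cls)]
              for box in detections.boxes if float(box.conf) > 0.5}
    if labels:
        # Deliver the result as a short spoken description
        tts.say("I see " + ", ".join(sorted(labels)))
        tts.runAndWait()

cap.release()
```

In the full system, the same loop would pull RGB and stereo-depth frames from the OAK-D Lite, add distance cues to the spoken description, and swap in cloud-hosted models when internet connectivity is available.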

