This technology is a method for generating an AI model that uses visual grounding to extract the category, position, and attributes of objects from images. The extracted information is converted into natural language instructions, which are then used to plan and control a robot's manipulation trajectory.
Existing robot control methods require operators to manually enter object coordinates and task details. As a result, objects must remain in fixed positions, and generating commands for multiple objects is inefficient.
This technology proposes a method for generating a training dataset and then training an AI model on it. The method uses a first framework (GVCCI) comprising: a visual feature extraction module that recognizes objects and extracts their features from images; a module that generates context-appropriate natural language instructions; a visual grounding model that infers the target object and its position; and a manipulation module that plans the trajectory of a robot arm.
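The data flow between these modules can be illustrated with a minimal Python sketch. It is not the GVCCI implementation: every name in it (DetectedObject, TrainingSample, extract_objects, generate_instruction, build_dataset, plan_trajectory) is hypothetical, the detections are hard-coded stand-ins for a real feature extractor, the instruction generator is a simple template, and the trajectory planner is a straight-line interpolation in image coordinates chosen only for illustration. The visual grounding model itself is not implemented here; the sketch only shows the kind of instruction and bounding-box pairs it might be trained on.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class DetectedObject:
    """Hypothetical output of the visual feature extraction module."""
    category: str
    attribute: str
    bbox: BBox


@dataclass
class TrainingSample:
    """Hypothetical (instruction, target box) pair for a grounding model."""
    instruction: str
    target_bbox: BBox


def extract_objects() -> List[DetectedObject]:
    # Assumption: fixed detections stand in for a real object detector.
    return [
        DetectedObject("cup", "red", (10, 20, 110, 120)),
        DetectedObject("bottle", "blue", (500, 40, 600, 240)),
    ]


def horizontal_position(obj: DetectedObject, image_width: int) -> str:
    # Describe the object's location by which third of the image its center is in.
    center_x = (obj.bbox[0] + obj.bbox[2]) / 2
    if center_x < image_width / 3:
        return "on the left"
    if center_x > 2 * image_width / 3:
        return "on the right"
    return "in the center"


def generate_instruction(obj: DetectedObject, image_width: int) -> str:
    # Assumption: a fixed template replaces a learned instruction generator.
    position = horizontal_position(obj, image_width)
    return f"Pick up the {obj.attribute} {obj.category} {position}."


def build_dataset(objects: List[DetectedObject], image_width: int) -> List[TrainingSample]:
    # Pair each generated instruction with the box of the object it refers to.
    return [
        TrainingSample(generate_instruction(obj, image_width), obj.bbox)
        for obj in objects
    ]


def plan_trajectory(
    start: Tuple[float, float], target_bbox: BBox, steps: int = 5
) -> List[Tuple[float, float]]:
    # Assumption: linear interpolation from start to the box center, in pixels.
    target_x = (target_bbox[0] + target_bbox[2]) / 2
    target_y = (target_bbox[1] + target_bbox[3]) / 2
    return [
        (
            start[0] + (target_x - start[0]) * i / steps,
            start[1] + (target_y - start[1]) * i / steps,
        )
        for i in range(steps + 1)
    ]


if __name__ == "__main__":
    dataset = build_dataset(extract_objects(), image_width=640)
    for sample in dataset:
        print(sample.instruction, sample.target_bbox)
    print(plan_trajectory((0.0, 0.0), dataset[0].target_bbox, steps=4))
```

In this sketch the operator never types coordinates: positions come from the detections, and the instruction text is derived from them. A real system would replace the template and the straight-line planner with learned or kinematics-aware components.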
This technology was developed with support from the Institute of Information & Communications Technology Planning & Evaluation (IITP) through a self-directed AI research project focused on solving novel problems.
PCT Application WO2025178174