Understanding Visual Question Answering – VQA


With the development of Deep Learning (DL), Visual Question Answering (VQA) has become possible. VQA has recently become popular in the computer vision research community as researchers move toward multimodal problems. VQA is a challenging yet promising multidisciplinary Artificial Intelligence (AI) task that enables several applications.

In this blog we’ll cover:

  • Overview of Visual Question Answering
  • The fundamental concepts of VQA
  • How a VQA system works
  • VQA datasets
  • Applications of VQA across various industries
  • Recent developments and future challenges

What is Visual Question Answering (VQA)?

The simplest way to define a VQA system is as a system capable of answering questions related to an image. It takes an image and a text-based question as inputs and generates the answer as output. The nature of the problem defines the nature of the input and output of a VQA model.

Inputs may include static images, videos with audio, or even infographics. Questions can be presented within the visual or asked separately about the visual input. A VQA system can answer multiple-choice questions, yes/no (binary) questions, or open-ended questions about the provided input image. It allows a computer program to understand and respond to visual and textual input in a human-like manner.


Visual Question Answering example
Input: What is happening in the image? Output: People eating a meal at a restaurant


  • Are there any phones near the table?
  • Guess the number of burgers on the table.
  • Guess the color of the table.
  • Read the text in the image, if any.

A visual question answering model would be able to answer the above questions about the image.

Due to its complex nature and being a multimodal task (one that requires interpreting and comprehending data from various modalities, including text, pictures, and sometimes audio), VQA is considered AI-complete or AI-hard (the most difficult class of problems in the AI field), as it is equivalent to making computers as intelligent as humans.

Concepts Behind VQA

Visual question answering naturally works with image and text modalities.


flow chart of a vqa model
Flowchart of a visual question answering model – Source


A VQA model has the following parts:

  1. Computer Vision (CV)
    CV is used for image processing and extraction of the relevant features. For image classification and object recognition in an image, CNNs (Convolutional Neural Networks) are used. OpenCV and Viso Suite are suitable platforms for this approach. Such methods operate by capturing the local and global visual features of an image.
  2. Natural Language Processing (NLP)
    NLP works in parallel with CV in any VQA model. NLP processes natural language data in the form of text or voice. Long Short-Term Memory (LSTM) networks or Bag-of-Words (BoW) are mostly used to extract question features. LSTMs capture the sequential nature of the question’s language, while BoW ignores word order; both convert the question to numerical data.
  3. Combining CV and NLP
    This is the conjugation part of a VQA model. The nature of the final answer is derived from this integration of visual and textual features. Different architectures, such as combined CNNs and Recurrent Neural Networks (RNNs), attention mechanisms, or even Multilayer Perceptrons (MLPs), are used in this approach.

How Does a VQA System Work?

A Visual Question Answering model can handle multiple kinds of image input. It can take visual input as images, videos, GIFs, sets of images, diagrams, slides, and 360° images. From a broader perspective, a visual question answering system goes through the following phases:

  • Image Feature Extraction: Transformation of images into a readable feature representation for further processing.
  • Question Feature Extraction: Encoding of the natural language questions to extract relevant entities and concepts.
  • Feature Conjugation: Methods of combining encoded image and question features.
  • Answer Generation: Understanding the integrated features to generate the final answer.



The steps of a common VQA approach
Steps for a typical VQA model – Source

Image Feature Extraction

The majority of VQA models use CNNs to process visual imagery. Deep convolutional neural networks receive images as input and use them to train a classifier. A CNN’s main purpose in VQA is image featurization. It uses the linear mathematical operation of “convolution” rather than simple matrix multiplication.
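To make the convolution step concrete, here is a minimal pure-Python sketch of a single 2D filter pass over a tiny grayscale image. Strictly speaking it computes cross-correlation, which is what most deep learning libraries call convolution. The function name `conv2d`, the 4x4 image, and the 2x2 kernel are illustrative assumptions, not part of any specific VQA model.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch currently under the kernel.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical kernel that responds to left-to-right brightness increases.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only in the column where the vertical edge sits. A real CNN learns many such kernels and stacks them in layers instead of using a hand-picked one.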

Depending on the complexity of the input visual, the number of layers may range from hundreds to thousands. Each layer builds on the outputs of the ones before it to identify complex patterns.

Several Visual Question Answering papers revealed that most models used VGGNet for image feature extraction before ResNets (8x deeper than VGG nets) came into use in 2017.

Question Feature Extraction

The literature on VQA suggests that Long Short-Term Memory networks (LSTMs), a type of Recurrent Neural Network (RNN), are commonly used for question featurization. As the name suggests, RNNs have a looping or recurrent workflow; they work by passing the sequential data they receive to the hidden layers one step at a time.

The short-term memory component in this neural network uses a hidden layer to remember and use past inputs for future predictions. The next item in the sequence is then predicted based on the current input and the stored memory.
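The recurrence can be sketched in a few lines. The example below implements a plain (vanilla) RNN step, h_t = tanh(W_x x_t + W_h h_(t-1)), not a full LSTM with gates. The function name `rnn_step`, the 2-dimensional word vectors, and all weight values are made up for illustration.

```python
import math


def rnn_step(x, h_prev, w_x, w_h):
    """One vanilla RNN step: h_t = tanh(W_x x_t + W_h h_(t-1))."""
    h_new = []
    for i in range(len(h_prev)):
        pre = sum(w_x[i][k] * x[k] for k in range(len(x)))
        pre += sum(w_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(pre))
    return h_new


# Hypothetical 2-dimensional word vectors for a three-word question.
question = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

w_x = [[0.5, -0.3], [0.1, 0.8]]  # input-to-hidden weights (made up)
w_h = [[0.2, 0.0], [0.0, 0.2]]   # hidden-to-hidden weights (made up)

h = [0.0, 0.0]  # the memory starts empty
for word_vec in question:
    # The hidden state carries information from earlier words forward.
    h = rnn_step(word_vec, h, w_x, w_h)

question_features = h
print(question_features)
```

After the last word, the hidden state serves as a fixed-length feature vector for the whole question. An LSTM follows the same loop but adds gates that control what is kept in memory and what is forgotten.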

RNNs suffer from exploding and vanishing gradients when trained on long sequences. LSTMs mitigate this. Several other methods, such as count-based and frequency-based methods like count vectorization and TF-IDF (Term Frequency-Inverse Document Frequency), are also available.
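As a rough illustration of the count-based and frequency-based options, the sketch below builds count vectors and TF-IDF vectors for three example questions using only the standard library. It uses one common unsmoothed TF-IDF variant (term frequency times log(N / document frequency)); libraries often apply smoothing, so their numbers will differ. The questions and helper names are assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical questions about the restaurant image.
questions = [
    "what color is the table",
    "how many burgers are on the table",
    "is there a phone near the table",
]

tokenized = [q.split() for q in questions]
vocab = sorted({word for tokens in tokenized for word in tokens})

n_docs = len(tokenized)
# Number of questions in which each vocabulary word appears.
doc_freq = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocab}


def count_vector(tokens):
    """Raw word counts, one slot per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def tfidf_vector(tokens):
    """Term frequency scaled by inverse document frequency."""
    counts = Counter(tokens)
    return [
        (counts[w] / len(tokens)) * math.log(n_docs / doc_freq[w])
        for w in vocab
    ]


count_vectors = [count_vector(tokens) for tokens in tokenized]
tfidf_vectors = [tfidf_vector(tokens) for tokens in tokenized]

print(vocab)
print(count_vectors[0])
print(tfidf_vectors[0])
```

Words that occur in every question, such as "the" and "table", get a TF-IDF weight of zero, while a distinguishing word like "color" keeps a positive weight.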

For natural language processing, prediction-based methods such as continuous bag of words (CBOW) and skip-gram are used as well. Pre-trained Word2Vec models are also applicable.

A skip-gram model predicts the words around a given word by maximizing the likelihood of correctly guessing context words based on a target word. So, for a sequence of words w1, w2, … wT, the objective of the model is to accurately predict nearby words.


average log probability of a skip gram model
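In the standard Word2Vec formulation, the quantity the skip-gram model maximizes is the average log probability below. The symbol $c$, the size of the context window on each side of the target word, is notation assumed here.

```latex
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\left(w_{t+j} \mid w_t\right)
```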


It achieves this by calculating the probability of each word being the context, given a target word. Using the softmax function, the following calculation compares vector representations of words.


Softmax function in skip gram model
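In the standard Word2Vec formulation, this softmax takes the form below. The notation is assumed here: $v_w$ and $v'_w$ are the "input" and "output" vector representations of word $w$, $w_I$ is the target (input) word, $w_O$ is a candidate context (output) word, and $W$ is the number of words in the vocabulary.

```latex
p\left(w_O \mid w_I\right) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```

The numerator scores how well one candidate context word matches the target word, and the denominator normalizes over the whole vocabulary so the probabilities sum to one.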


Feature Conjugation

The primary difference between the various methodologies for VQA lies in how they combine the image and text features. Some approaches include simple concatenation and linear classification. A Bayesian approach based on probabilistic modeling is preferable for handling different feature vectors.

If the vectors coming from the image and text are of the same length, element-wise multiplication is also applicable for joining the features. You can also try an attention-based approach to guide the algorithm’s focus toward the most important details in the input. The DualNet VQA model uses a hybrid approach that concatenates element-wise addition and multiplication results to achieve greater accuracy.
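The fusion options described above can be sketched with plain Python lists. This is an illustration under stated assumptions: the two 3-dimensional feature vectors are made up, and `hybrid_fusion` only mirrors the idea of concatenating the element-wise sum and product; it is not the actual DualNet implementation.

```python
def concat(a, b):
    """Simple concatenation; the vectors may have different lengths."""
    return a + b


def elementwise_mul(a, b):
    """Element-wise product; requires vectors of the same length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


def elementwise_add(a, b):
    """Element-wise sum; requires vectors of the same length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x + y for x, y in zip(a, b)]


def hybrid_fusion(image_vec, question_vec):
    """Concatenate the element-wise sum and the element-wise product."""
    return elementwise_add(image_vec, question_vec) + elementwise_mul(
        image_vec, question_vec
    )


# Hypothetical encoded features of equal length.
image_features = [0.5, 1.0, -2.0]
question_features = [2.0, 0.0, 1.0]

print(concat(image_features, question_features))
# [0.5, 1.0, -2.0, 2.0, 0.0, 1.0]
print(elementwise_mul(image_features, question_features))
# [1.0, 0.0, -2.0]
print(hybrid_fusion(image_features, question_features))
# [2.5, 1.0, -1.0, 1.0, 0.0, -2.0]
```

Concatenation keeps both vectors intact and leaves their interaction to later layers, while the element-wise product makes each fused dimension depend on both modalities at once.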



Element-wise multiplication and addition VQA model
Concatenation of element-wise multiplication and element-wise summation – Source

Answer Generation

This phase in a VQA model involves taking the encoded image and question features as inputs and generating the final answer. An answer could be in binary form, a counting number, a choice of the correct option, or an open-ended natural language answer in words, phrases, or sentences.

The multiple-choice and binary answers use a classification layer to convert the model’s output into a probability score. LSTMs are appropriate to use when dealing with open-ended questions.
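For the classification case, a minimal sketch looks like this: a linear layer scores each candidate answer from the fused feature vector, and a softmax turns the scores into probabilities. The fused vector, the weights, the bias, and the helper names are all made-up values for illustration; in a trained model they would be learned.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, bias, answers):
    """Linear layer plus softmax over a fixed set of candidate answers."""
    scores = [
        sum(w * f for w, f in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(scores)
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs


answers = ["yes", "no"]            # binary question
fused = [1.0, -0.5, 2.0]           # hypothetical fused image-question features
weights = [
    [0.4, 0.1, 0.3],               # weights for "yes" (made up)
    [-0.2, 0.5, -0.1],             # weights for "no" (made up)
]
bias = [0.0, 0.1]

prediction, probs = classify(fused, weights, bias, answers)
print(prediction, probs)
```

The same layer extends to multiple-choice questions by adding one row of weights per candidate answer.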

VQA Datasets

Several datasets are available for VQA research. Visual Genome is currently the largest available dataset for visual question answering models.


Timeline of popular visual question answering datasets
Timeline of popular VQA datasets – Source


Depending on the question-answer pairs, here are some of the common datasets for VQA.

  • COCO-QA Dataset: Extension of COCO (Common Objects in Context). Questions are of 4 types: number, color, object, and location. Correct answers are all given in a single word.
  • CLEVR: Contains a training set of 70,000 images and 699,989 questions, a validation set of 15,000 images and 149,991 questions, and a test set of 15,000 images and 14,988 questions. Answers are provided for all training and validation questions.
  • DAQUAR: Contains real-world images, with human question-answer pairs about the images.
  • Visual7W: A large-scale visual question answering dataset with object-level ground truth and multimodal answers. Each question starts with one of the seven Ws.


COCO dataset
Samples of annotated images in the MS COCO dataset – Source

Applications of Visual Question Answering Systems

Individually, CV and NLP each have their own sets of applications. Implementing both in the same system can further broaden the application domain for Visual Question Answering.

Real-world applications of VQA are:

Medical – VQA

This subdomain focuses on the questions and answers related to the medical field. VQA models may act as pathologists, radiologists, or accurate medical assistants. VQA in the medical sector can greatly reduce the workload of staff by automating several tasks. For example, it can decrease the chances of disease misdiagnosis.


Working of a medical vqa
Common architecture of a proposed medical VQA model – Source


VQA can be implemented as a medical advisor based on images provided by the patients. It can also be used to check medical records and data accuracy in the database.

Education


The application of VQA in the education sector can support visual learning to a great extent. Imagine having a learning assistant who can guide you and evaluate you on learned concepts. Some of the proposed use cases are Automatic Robot System for Pre-scholars, Visual Chatbots for Education, Gamification of VQA Systems, and Automated Museum Guides. VQA in education has the potential to make learning styles more interactive and creative.


Education robot working
A diagram of an educational robot working for preschool learning – Source

Assistive Technology

The prime motive behind VQA is to help visually impaired individuals. Initiatives like the VizWiz mobile app and Be My Eyes use VQA systems to provide automated assistance to visually impaired individuals by answering questions about real-world images. Assistive VQA models can see the surroundings and help people understand what’s happening around them.

Visually impaired people can engage more meaningfully with their environment with the help of such VQA systems. Envision Glasses is an example of such a model.



AI-powered Envision glasses to aid visually impaired individuals
Envision Glasses for visually impaired people – Source

E-Commerce

VQA is capable of enhancing the online shopping user experience. Stores and platforms for online shopping can integrate VQA to create a streamlined e-commerce environment. For example, you can ask questions about products (Product Question Answering) or even upload images, and it will provide you with all the necessary information like product details, availability, and even recommendations based on what it sees in the images.

Online shopping stores and websites can implement VQA instead of manual customer service to further improve the user experience on their platforms. It can help customers with:

  • Product recommendations
  • Troubleshooting for users
  • Website and shopping tutorials
  • Visual dialogue, with the VQA system acting as a chatbot that converses about images

Content Filtering

One of the most suitable applications of VQA is content moderation. Based on its fundamental feature, it can detect harmful or inappropriate content and filter it out to keep the online environment safe. Any offensive or inappropriate content on social media platforms can be detected using VQA.

Recent Developments & Challenges in Improving VQA

With the constant advancement of CV and DL, VQA models are making huge progress. The number of annotated datasets is rapidly increasing thanks to crowd-sourcing, and the models are becoming intelligent enough to provide an accurate answer using natural language. In the past few years, many VQA algorithms have been proposed. Almost every method involves:

  1. Image featurization
  2. Question featurization
  3. A suitable algorithm that combines these features to generate the answer

However, a significant gap exists between accurate VQA systems and human intelligence. Currently, it is hard to develop any adaptable model because of the diversity of datasets. It is difficult to determine which method is superior as of yet.

Unfortunately, because most large datasets don’t offer specific information about the types of questions asked, it is hard to measure how well systems handle certain types of questions.

The current models can’t improve overall performance scores when handling unique questions. This makes it hard to assess the methods used for VQA. Currently, multiple-choice questions are used to evaluate VQA algorithms because assessing open-ended, multi-word answers is challenging. Moreover, VQA for videos still has a long way to go.


AVQA is an audio-visual question answering model
Mechanism for visual frames and audio waveforms of a VQA model for videos – Source


Existing algorithms are not sufficient to mark VQA as a solved problem. Without larger datasets and more practical work, it is hard to build better-performing VQA models.

What’s Next for Visual Question Answering?

VQA is a state-of-the-art AI capability that goes far beyond task-specific algorithms. As an image-understanding technology, VQA is going to be a major development in AI. It is bridging the gap between visual content and natural language.

Text-based queries are common, but imagine interacting with the computer and asking questions about images or scenes. We are going to see more intuitive and natural interactions with computers.

Some future recommendations to improve VQA are:

  • Datasets need to be larger
  • Datasets need to be less biased
  • Future datasets need more nuanced analysis for benchmarking

More effort is required to create VQA algorithms that can think deeply about what’s in the images.
