Understanding Multimedia Content with Prior Knowledge / Rigoni, Davide. - (2023 Sep 22).

Understanding Multimedia Content with Prior Knowledge

RIGONI, DAVIDE
2023

Abstract

Visual-textual grounding is a challenging task that involves associating language with visual objects or scenes, and it has become a popular research area due to its importance in many applications. Traditionally, visual-textual grounding has been addressed by relying only on the information carried by images and textual phrases. However, incorporating additional prior knowledge, such as a graph, could enhance the performance and accuracy of visual-textual grounding models: a graph is a discrete structure that can represent any kind of information useful for solving the grounding task. This Ph.D. thesis proposes a formal probabilistic framework that considers all three modalities: image, text, and graph. The framework allows for the analysis of existing works and for the development of a novel approach to visual-textual grounding based on an innovative factorization of probabilities. The probabilistic approach is crucial for accounting for the inherent uncertainty involved in solving the task.

In addition, this thesis presents two contributions that improve the traditional visual-textual grounding task. The first is a new loss function for training visual-textual grounding models in a supervised setting. Models in the literature typically consist of two main components: one learns useful multi-modal features for grounding, and the other refines the predicted bounding box of the visual mention. Finding the right learning balance between these two sub-tasks is not easy, and current models are not necessarily optimal in this respect. The second contribution is a model for weakly-supervised visual-textual grounding. The proposed model first predicts a rough alignment between phrases and boxes using a module that requires no training, and then refines those alignments with a learnable neural network. The model is trained to maximize the multimodal similarity between an image and a sentence describing that image, while minimizing the similarity between the same sentence and an unrelated image, carefully selected to be as informative as possible during training.

The object detector plays a fundamental role in solving the visual-textual grounding task: it should identify many different objects and classify them correctly. However, increasing the number of object classes to be recognized usually makes the classification problem harder. Correct classification becomes even more important when the graph is used to solve the task, since the semantic information conveyed by the classes is crucial for identifying the graph nodes that best characterize the objects depicted in the image. In the literature, the most common approach is to use an object detector trained to detect 1600 different classes of objects; however, those classes are noisy and impair the detector's performance. To address this problem, this thesis also proposes a new set of clean labels for training object detectors on the Visual Genome dataset. Finally, this thesis introduces a new object detector that can be conditioned on nodes of the WordNet graph to search for objects in images. In particular, the conditioned detector can be used to estimate a component of the probability factorization derived within the proposed probabilistic framework.

Overall, this Ph.D. thesis contributes to the study of visual-textual grounding and provides tools and insights with the potential to enable advanced approaches and applications in this domain.
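The abstract does not spell out the factorization of probabilities it refers to. Purely as an illustrative sketch, one way the three modalities could enter a probabilistic formulation is to ground a textual phrase q to a bounding box b in image I by marginalizing over the nodes n of a graph G:

```latex
P(b \mid q, I, G) \;=\; \sum_{n \in G} P(b \mid n, I)\, P(n \mid q, G)
```

Under this hypothetical decomposition, P(n | q, G) selects the graph nodes relevant to the phrase, while P(b | n, I) is exactly the kind of term a node-conditioned object detector, like the one described at the end of the abstract, could estimate. This is not necessarily the factorization proposed in the thesis.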
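For the supervised setting, the abstract describes models that balance a multi-modal alignment component against a bounding-box refinement component. The thesis's actual loss is not reported on this page; the snippet below is a minimal, hypothetical PyTorch sketch of such a two-term objective, where the weight `lam` embodies the balancing problem the abstract highlights (all names are illustrative):

```python
import torch.nn.functional as F

def grounding_loss(align_logits, target_idx, pred_boxes, target_boxes, lam=1.0):
    """Hypothetical two-term grounding objective.

    align_logits: (batch, num_boxes) phrase-to-proposal scores
    target_idx:   (batch,) index of the ground-truth proposal
    pred_boxes / target_boxes: (batch, 4) box coordinates
    """
    # Alignment term: pick which candidate box matches the phrase.
    align_loss = F.cross_entropy(align_logits, target_idx)
    # Regression term: refine the coordinates of the selected box.
    reg_loss = F.smooth_l1_loss(pred_boxes, target_boxes)
    # `lam` trades off the two sub-tasks; tuning it is the hard part.
    return align_loss + lam * reg_loss
```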
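For the weakly-supervised contribution, the abstract describes training on image-sentence similarity with carefully selected unrelated images. Below is a minimal sketch of one such objective, assuming precomputed sentence and image embeddings and a hinge-style formulation; the thesis may use a different similarity measure or negative-mining scheme:

```python
import torch
import torch.nn.functional as F

def weakly_supervised_loss(sent_emb, pos_img_emb, neg_img_emb, margin=0.2):
    """Hypothetical hinge loss: push a sentence towards its own image and
    away from an unrelated (negative) image by at least `margin`."""
    pos_sim = F.cosine_similarity(sent_emb, pos_img_emb, dim=-1)
    neg_sim = F.cosine_similarity(sent_emb, neg_img_emb, dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()

def pick_hard_negative(sent_emb, other_img_embs):
    """Select the unrelated image most similar to the sentence, so the
    negative is as informative as possible during training."""
    sims = F.cosine_similarity(sent_emb.unsqueeze(0), other_img_embs, dim=-1)
    return other_img_embs[sims.argmax()]
```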
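Finally, the abstract mentions an object detector that can be conditioned on WordNet nodes. One plausible way to realize such conditioning, sketched below with hypothetical dimensions, is to fuse each region feature with a node embedding before scoring relevance and regressing box offsets; the thesis's actual architecture may differ:

```python
import torch
import torch.nn as nn

class ConditionedDetectionHead(nn.Module):
    """Hypothetical detection head conditioned on a graph-node embedding:
    region features are fused with the node embedding, so the detector
    searches the image for objects matching that WordNet node."""

    def __init__(self, region_dim=2048, node_dim=300, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(region_dim + node_dim, hidden), nn.ReLU())
        self.score = nn.Linear(hidden, 1)   # relevance of region to node
        self.delta = nn.Linear(hidden, 4)   # box refinement offsets

    def forward(self, region_feats, node_emb):
        # region_feats: (num_regions, region_dim); node_emb: (node_dim,)
        node = node_emb.unsqueeze(0).expand(region_feats.size(0), -1)
        h = self.fuse(torch.cat([region_feats, node], dim=-1))
        return self.score(h).squeeze(-1), self.delta(h)
```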
Files in this item:

File: PhD_Dissertation_final.pdf
Description: Doctoral thesis
Type: Doctoral thesis
Access: open access
Size: 33.07 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3498348