Understanding Multimedia Content with Prior Knowledge / Rigoni, Davide. - (2023 Sep 22).

Understanding Multimedia Content with Prior Knowledge

RIGONI, DAVIDE
2023

Abstract

Visual-textual grounding is a challenging task that involves associating language with visual objects or scenes, and it has become a popular research area due to its importance in many applications. Traditionally, visual-textual grounding has been addressed by relying only on the information carried by images and textual phrases. However, incorporating additional prior knowledge, such as a graph, could enhance the performance and accuracy of visual-textual grounding models: a graph is a discrete structure that can represent any kind of information useful for solving the grounding task. This Ph.D. thesis proposes a formal probabilistic framework that considers all three modalities: image, text, and graph. The framework allows for the analysis of existing works and for the development of a novel approach to visual-textual grounding based on an innovative factorization of probabilities. The probabilistic approach is crucial for accounting for the inherent uncertainty involved in solving the task.

In addition, this thesis presents two contributions that improve the traditional visual-textual grounding task. The first is a new loss function for training visual-textual grounding models in a supervised setting. Models in the literature typically consist of two main components: one learns useful multi-modal features for grounding, and the other refines the predicted bounding box of the visual mention. Finding the right learning balance between these two sub-tasks is not easy, and current models are not necessarily optimal in this respect. The second contribution is a model for weakly-supervised visual-textual grounding. The proposed model first predicts a rough alignment between phrases and boxes using a module that requires no training, and then refines those alignments with a learnable neural network. The model is trained to maximize the multimodal similarity between an image and a sentence describing that image, while minimizing the similarity between the same sentence and an unrelated image, carefully selected to be as informative as possible during training.

The object detector plays a fundamental role in solving the visual-textual grounding task: it should identify many different objects and classify them correctly. However, increasing the number of object classes to be recognized usually makes the classification problem harder. Correct classification becomes even more important when the graph is used to solve the task, since the semantic information conveyed by the classes is crucial for identifying the graph nodes that best characterize the objects depicted in the image. In the literature, the most common approach is to use an object detector trained to detect 1600 different classes of objects; however, those classes are noisy and impair the detector's performance. To address this problem, this thesis also proposes a new set of clean labels for training object detectors on the Visual Genome dataset. Finally, this thesis introduces a new object detector that can be conditioned on nodes of the WordNet graph to search for objects in images. In particular, the conditioned detector can be used to estimate a component of the probability factorization derived within the proposed probabilistic framework.

Overall, this Ph.D. thesis contributes to the study of visual-textual grounding and provides tools and insights with the potential to enable advanced approaches and applications in this domain.
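The abstract does not spell out the factorization of probabilities it refers to. Purely as an illustrative sketch, one way the three modalities could enter a probabilistic formulation is to ground a textual phrase q to a bounding box b in image I by marginalizing over the nodes n of a graph G:

```latex
P(b \mid q, I, G) \;=\; \sum_{n \in G} P(b \mid n, I)\, P(n \mid q, G)
```

Under this hypothetical decomposition, P(n | q, G) selects the graph nodes relevant to the phrase, while P(b | n, I) is exactly the kind of term a node-conditioned object detector, like the one described at the end of the abstract, could estimate. This is not necessarily the factorization proposed in the thesis.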
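For the supervised setting, the abstract describes models that balance a multi-modal alignment component against a bounding-box refinement component. The thesis's actual loss is not reported on this page; the snippet below is a minimal, hypothetical PyTorch sketch of such a two-term objective, where the weight `lam` embodies the balancing problem the abstract highlights (all names are illustrative):

```python
import torch.nn.functional as F

def grounding_loss(align_logits, target_idx, pred_boxes, target_boxes, lam=1.0):
    """Hypothetical two-term grounding objective.

    align_logits: (batch, num_boxes) phrase-to-proposal scores
    target_idx:   (batch,) index of the ground-truth proposal
    pred_boxes / target_boxes: (batch, 4) box coordinates
    """
    # Alignment term: pick which candidate box matches the phrase.
    align_loss = F.cross_entropy(align_logits, target_idx)
    # Regression term: refine the coordinates of the selected box.
    reg_loss = F.smooth_l1_loss(pred_boxes, target_boxes)
    # `lam` trades off the two sub-tasks; tuning it is the hard part.
    return align_loss + lam * reg_loss
```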
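For the weakly-supervised contribution, the abstract describes training on image-sentence similarity with carefully selected unrelated images. Below is a minimal sketch of one such objective, assuming precomputed sentence and image embeddings and a hinge-style formulation; the thesis may use a different similarity measure or negative-mining scheme:

```python
import torch
import torch.nn.functional as F

def weakly_supervised_loss(sent_emb, pos_img_emb, neg_img_emb, margin=0.2):
    """Hypothetical hinge loss: push a sentence towards its own image and
    away from an unrelated (negative) image by at least `margin`."""
    pos_sim = F.cosine_similarity(sent_emb, pos_img_emb, dim=-1)
    neg_sim = F.cosine_similarity(sent_emb, neg_img_emb, dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()

def pick_hard_negative(sent_emb, other_img_embs):
    """Select the unrelated image most similar to the sentence, so the
    negative is as informative as possible during training."""
    sims = F.cosine_similarity(sent_emb.unsqueeze(0), other_img_embs, dim=-1)
    return other_img_embs[sims.argmax()]
```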
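Finally, the abstract mentions an object detector that can be conditioned on WordNet nodes. One plausible way to realize such conditioning, sketched below with hypothetical dimensions, is to fuse each region feature with a node embedding before scoring relevance and regressing box offsets; the thesis's actual architecture may differ:

```python
import torch
import torch.nn as nn

class ConditionedDetectionHead(nn.Module):
    """Hypothetical detection head conditioned on a graph-node embedding:
    region features are fused with the node embedding, so the detector
    searches the image for objects matching that WordNet node."""

    def __init__(self, region_dim=2048, node_dim=300, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(region_dim + node_dim, hidden), nn.ReLU())
        self.score = nn.Linear(hidden, 1)   # relevance of region to node
        self.delta = nn.Linear(hidden, 4)   # box refinement offsets

    def forward(self, region_feats, node_emb):
        # region_feats: (num_regions, region_dim); node_emb: (node_dim,)
        node = node_emb.unsqueeze(0).expand(region_feats.size(0), -1)
        h = self.fuse(torch.cat([region_feats, node], dim=-1))
        return self.score(h).squeeze(-1), self.delta(h)
```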
Files in this item:

File: PhD_Dissertation_final.pdf
Description: Doctoral thesis
Type: Doctoral thesis
Access: open access
Size: 33.07 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3498348