Linking Visual and Textual Entity Mentions with Background Knowledge
(Italian title: Collegamento di menzioni visive e testuali di entità con conoscenze di base)

DOST, SHAHI
26 May 2021

Abstract

“A picture is worth a thousand words”, the adage goes. However, pictures cannot replace words when it comes to efficiently conveying clear, (mostly) unambiguous, and concise knowledge. Images and text indeed convey different and complementary information which, if combined, yields more than the sum of what each medium contains on its own. Visual and textual information can be combined by linking the entities mentioned in the text with those shown in the pictures. To further integrate this with an agent's background knowledge, an additional step is necessary: either finding the entities in the agent's knowledge base that correspond to those mentioned in the text or shown in the picture, or extending the knowledge base with the newly discovered entities. We call this complex task Visual-Textual-Knowledge Entity Linking (VTKEL).

In this thesis, after providing a precise definition of the VTKEL task, we present two datasets, VTKEL1k* and VTKEL30k. They consist of images and their captions, in which both the visual and textual mentions are annotated with the corresponding entities, typed according to the YAGO ontology. The datasets can be used for training and evaluating algorithms for the VTKEL task. We then developed an unsupervised baseline algorithm, VT-LinKEr (Visual-Textual-Knowledge-Entity Linker), for the VTKEL task and evaluated its performance on both datasets. We also developed a supervised algorithm called ViTKan (Visual-Textual-Knowledge-Alignment Network). During training, ViTKan takes as input (i) an image, to which it applies an object detector to predict visual objects and their types, and (ii) the corresponding captions, to which it applies the knowledge-graph extraction tool PIKES to recognize textual entity mentions and link them to the YAGO knowledge base for background knowledge extraction. We trained the ViTKan model using the visual, textual, and ontological features of the VTKEL1k* dataset. At prediction time, ViTKan aligns (maps) visual entities in the image with textual entities in the captions with high accuracy. The evaluation of ViTKan on the VTKEL1k* and VTKEL30k datasets shows improved results with respect to state-of-the-art methods on the task of grounding (localizing) textual entities in images.
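
As an informal illustration of the alignment step described above, the following minimal Python sketch pairs detected visual objects with textual entity mentions by checking whether their types are compatible (one equals or subsumes the other) in a toy YAGO-like hierarchy. All data structures, type labels, and the greedy matching rule are hypothetical and chosen for brevity; this is not the VT-LinKEr or ViTKan implementation.

    # Illustrative sketch only: a toy type-compatibility alignment between detected
    # visual objects and textual entity mentions. The data structures, type labels,
    # toy hierarchy, and greedy matching rule below are hypothetical; they are not
    # the actual VT-LinKEr or ViTKan code.
    from dataclasses import dataclass

    @dataclass
    class VisualObject:
        box: tuple        # bounding box (x1, y1, x2, y2) from an object detector
        yago_type: str    # type label assigned to the detected object

    @dataclass
    class TextualMention:
        span: str         # mention text found in the caption, e.g. "a young boy"
        yago_type: str    # type assigned by a text-processing tool such as PIKES

    # Hypothetical fragment of a YAGO-like type hierarchy: child -> parent.
    TYPE_PARENT = {
        "Boy": "Person",
        "Woman": "Person",
        "Person": "LivingThing",
    }

    def ancestors(t: str) -> set:
        """Return t together with all its ancestors in the toy hierarchy."""
        seen = {t}
        while t in TYPE_PARENT:
            t = TYPE_PARENT[t]
            seen.add(t)
        return seen

    def is_compatible(t1: str, t2: str) -> bool:
        """True if one type equals or subsumes the other."""
        return t1 in ancestors(t2) or t2 in ancestors(t1)

    def align(objects: list, mentions: list) -> list:
        """Greedily pair each textual mention with a type-compatible visual object."""
        pairs, used = [], set()
        for m in mentions:
            for i, o in enumerate(objects):
                if i not in used and is_compatible(m.yago_type, o.yago_type):
                    pairs.append((m.span, o.box))
                    used.add(i)
                    break
        return pairs

    if __name__ == "__main__":
        objs = [VisualObject((10, 20, 80, 200), "Boy")]
        ments = [TextualMention("a young boy", "Person")]
        print(align(objs, ments))  # [('a young boy', (10, 20, 80, 200))]

Running the example pairs the mention "a young boy" with the detected bounding box. The actual systems replace the toy hierarchy with the YAGO ontology, and ViTKan additionally learns the alignment from visual, textual, and ontological features rather than relying on a fixed rule.
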
Linking Visual and Textual Entity Mentions with Background Knowledge / Dost, Shahi. - (2021 May 26).

Files in this record:
tesi_definitiva_Shahi_Dost.pdf (doctoral thesis, open access, Adobe PDF, 4.73 MB)

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3500980