Tanmay Khandelwal

Deep Learning for End-of-Video Frame Segmentation

Advisor: Prof. Yann LeCun

In this project, we developed a method for semantic segmentation of predicted video frames. We first trained a U-Net with a custom Intersection over Union (IoU)-based loss function to generate masks for unlabeled video frames, which substantially expanded our training dataset. We then trained a SimVP video frame prediction model on both the ground-truth and generated masks to predict the segmentation mask of the 22nd frame of a video sequence given its first 11 frames. After fine-tuning with an IoU objective, our model achieved an IoU of 0.455 on the validation dataset.
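As an illustration of what an IoU-based loss can look like, here is a minimal NumPy sketch of a soft (differentiable) IoU loss. It is not the project's actual loss: the function name `soft_iou_loss`, the `eps` smoothing term, and the per-class averaging are assumptions made for this example.

```python
import numpy as np

def soft_iou_loss(probs, target, eps=1e-6):
    """Soft IoU (Jaccard) loss, averaged over classes.

    probs:  (C, H, W) array of per-class probabilities for one frame.
    target: (H, W) array of integer class labels.
    """
    num_classes = probs.shape[0]
    # One-hot encode the labels and move the class axis first: (C, H, W).
    one_hot = np.moveaxis(np.eye(num_classes)[target], -1, 0)
    # Soft intersection and union, computed per class.
    intersection = (probs * one_hot).sum(axis=(1, 2))
    union = (probs + one_hot - probs * one_hot).sum(axis=(1, 2))
    iou = (intersection + eps) / (union + eps)
    # Minimising 1 - IoU maximises the overlap with the ground truth.
    return 1.0 - iou.mean()
```

The loss is close to 0 when the predicted probabilities match the mask exactly and close to 1 when they do not overlap with it at all, so it optimizes the same quantity that is reported at evaluation time.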

Visual Question Answering (VQA) System

Advisor: Prof. Rob Fergus

In this project, we developed a Visual Question Answering (VQA) system for our computer vision course (CSCI-GA.2271-001). The goal was to answer questions about images accurately by processing both inputs: Convolutional Neural Networks (CNNs) extract image features, and Long Short-Term Memory (LSTM) networks encode the question. We evaluated the system on the VQA-RAD and VQA-v2 datasets. We first implemented a CNN-LSTM architecture and then added a stacked attention network (SAN), which improved accuracy to 67.82% on VQA-RAD and 54.82% on VQA-v2. The results show that the system handles a variety of visual questions and point to its potential for practical image-based question-answering applications.
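To make the attention step concrete, the sketch below shows one way a stacked attention network can combine image-region features with a question vector, following the general SAN formulation. It is a simplified NumPy illustration, not the project's code: the function names, parameter names, and shapes are all assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def attention_layer(v_img, v_q, W_i, W_q, b_q, w_p, b_p):
    """One attention hop.

    v_img: (m, d) features for m image regions (e.g. from a CNN).
    v_q:   (d,) query vector (e.g. the final LSTM state).
    W_i, W_q: (d, k) projections; b_q, w_p: (k,); b_p: scalar.
    """
    # Score each region against the query in a shared k-dim space.
    h = np.tanh(v_img @ W_i + (v_q @ W_q + b_q))  # (m, k)
    p = softmax(h @ w_p + b_p)                    # (m,) attention weights
    v_att = p @ v_img                             # (d,) weighted region sum
    # The refined query is the attended image vector plus the old query.
    return v_att + v_q, p

def stacked_attention(v_img, v_q, layers):
    """Apply several attention hops, feeding each refined query forward."""
    u, weights = v_q, []
    for params in layers:
        u, p = attention_layer(v_img, u, *params)
        weights.append(p)
    return u, weights
```

In a full model, the final vector `u` would be passed to a classifier over candidate answers. Each hop lets the model narrow its focus to the regions most relevant to the question.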

DCASE 2023 Task 4A: Sound Event Detection

Advisor: Dr. Rohan Kumar Das

This project covers the systems developed by Fortemedia Singapore (FMSG) for DCASE 2023 Task 4A, which focuses on sound event detection with weak labels and synthetic soundscapes. Our approach integrates features from Bidirectional Encoder representation from Audio Transformers (BEATs) and a frequency dynamic (FDY) convolutional recurrent neural network (CRNN) into a single-stage setup. We enhance the system through three main strategies: curating an external dataset from AudioSet by mapping AudioSet categories to the target sound events, using multiple aggregation methods to combine their respective strengths, and employing the asymmetric focal loss (AFL) function to weight training examples according to how difficult they are for the model. Additionally, we use data augmentation to prevent overfitting, adaptive post-processing, and an ensemble of multiple subsystems to improve generalization. Our method achieves top PSDS1 and PSDS2 scores of 0.557 and 0.854 on the development set, and the highest PSDS1 and PSDS2 scores of 0.607 and 0.875 on the public evaluation set.
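For readers unfamiliar with AFL, the following NumPy sketch shows the commonly cited form of the asymmetric focal loss, in which active and inactive frames are down-weighted by separate focusing exponents. The function name and the default values of `gamma` and `zeta` are placeholders for illustration, not the settings used in the submitted systems.

```python
import numpy as np

def asymmetric_focal_loss(probs, targets, gamma=0.0, zeta=1.0, eps=1e-7):
    """Asymmetric focal loss for multi-label, frame-level predictions.

    probs:   predicted event probabilities in [0, 1].
    targets: binary labels of the same shape (1 = event active).
    gamma:   focusing exponent for active frames (hypothetical default).
    zeta:    focusing exponent for inactive frames (hypothetical default).
    """
    p = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    # Active frames: well-predicted ones (p near 1) are down-weighted.
    active = (1.0 - p) ** gamma * targets * np.log(p)
    # Inactive frames: well-predicted ones (p near 0) are down-weighted.
    inactive = p ** zeta * (1.0 - targets) * np.log(1.0 - p)
    return -(active + inactive).mean()
```

Setting `gamma = zeta = 0` recovers ordinary binary cross-entropy. Choosing the two exponents separately lets training treat the many easy inactive frames differently from the comparatively rare active ones.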

Search Tool

Advisor: Keerthi Ram

Developed an NLP-based search tool for the Brain Architecture Portal that uses spaCy, Word2Vec, and cosine similarity to streamline the summarization of neuroscience articles, saving over 30 hours of manual review time each month. The tool improves text comprehension and evidence detection by fine-tuning BERT on neuroscience-specific language, achieving a 73% Normalized Term Contribution (NTC). It makes retrieval of relevant research more efficient and accurate through an intuitive interface that supports complex queries and scales to large datasets, so researchers can focus their analysis on the most relevant material.
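The ranking step of such a tool can be sketched as follows: given an embedding for the query and one per document (for example, averaged Word2Vec vectors), documents are ordered by cosine similarity to the query. This is an illustrative sketch only; the function names are hypothetical, and how the embeddings are produced is left out.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs, most similar first."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]
```

Because cosine similarity ignores vector length, a long article and a short abstract on the same topic can score similarly against a query.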

Projects

Some projects I have worked on...