Learning from Matured Dumb Teacher for Fine Generalization | Arxiv
The flexibility of decision boundaries in neural networks that are unguided by training data is a well-known problem typically resolved with generalization methods. A surprising result from recent knowledge distillation (KD) literature is that random, untrained, and equally structured teacher networks can also vastly improve generalization performance. It raises the possibility of existence of undiscovered assumptions useful for generalization on an uncertain region. In this paper, we shed light on the assumptions by analyzing decision boundaries and confidence distributions of both simple and KD-based generalization methods. Assuming that a decision boundary exists to represent the most general tendency of distinction on an input sample space (i.e., the simplest hypothesis), we show the various limitations of methods when using the hypothesis. To resolve these limitations, we propose matured dumb teacher based KD, conservatively transferring the hypothesis for generalization of the student without massive destruction of trained information. In practical experiments on feed-forward and convolution neural networks for image classification tasks on MNIST, CIFAR-10, and CIFAR-100 datasets, the proposed method shows stable improvement to the best test performance in the grid search of hyperparameters. The analysis and results imply that the proposed method can provide finer generalization than existing methods.
- Idea: Using the simplest hypothese to generalize the model only with given training data.
- Dataset: MNIST, CIFAR-10, CIFAR-100
- Method: matured Dumb Teacher based Knowledge Distillation (mDT-KD)
- Keywords: Generalization, Self-knowledge distillation, Neural networks, Occam’s Razor, Confidence distribution, Decision boundary
What and When to Look?: Temporal Span Proposal Network for Video Visual Relation Detection | Arxiv
Identifying relations between objects is central to understanding the scene. While several works have been proposed for relation modeling in the image domain, there have been many constraints in the video domain due to challenging dynamics of spatio-temporal interactions (e.g., Between which objects are there an interaction? When do relations occur and end?). To date, two representative methods have been proposed to tackle Video Visual Relation Detection (VidVRD) - segment-based and window-based. We first point out the limitations these two methods have and propose Temporal Span Proposal Network (TSPN), a novel method with two advantages in terms of efficiency and effectiveness. 1) TSPN tells what to look - it sparsifies relation search space by scoring relationness (i.e., confidence score for the existence of a relation between pair of objects) of object pair. 2) TSPN tells when to look - it leverages the full video context to simultaneously predict the temporal span and categories of the entire relations. TSPN demonstrates its effectiveness by achieving new state-of-the-art by a significant margin on two VidVRD benchmarks (ImageNet-VidVDR and VidOR) while also showing lower time complexity than existing methods - in particular, twice as efficient as a popular segment-based approach.
- Question: How can we extract long-term relation better from a video?
- Idea: Directly propose a temporal span over object trajectories.
- Dataset: ImageNet-VidVRD, VidOR
- Method: Temporal Span Proposal Network (TSPN)
- Keywords: Video Visual Relation Detection (VidVRD), Spatio-temporal Video Understanding, Temporal Relation Localization
Tackling the Challenges in Scene Graph Generation with Local-to-Global Interactions | Arxiv
In this work, we seek new insights into the underlying challenges of the Scene Graph Generation (SGG) task. Quantitative and qualitative analysis of the Visual Genome dataset implies 1) Ambiguity - even if inter-object relationship contains the same object (or predicate), they may not be visually or semantically similar, 2) Asymmetry - despite the nature of the relationship that embodied the direction, it was not well addressed in previous studies, and 3) Higher-order contexts - leveraging the identities of certain graph elements can help to generate accurate scene graphs. Motivated by the analysis, we design a novel SGG framework, Local-to-Global Interaction Networks (LOGIN). Locally, interactions extract the essence between three instances - subject, object, and background - while baking direction awareness into the network by constraining the input order. Globally, interactions encode the contexts between every graph components -- nodes and edges. Also we introduce Attract & Repel loss which finely adjusts predicate embeddings. Our framework enables predicting the scene graph in a local-to-global manner by design, leveraging the possible complementariness. To quantify how much LOGIN is aware of relational direction, we propose a new diagnostic task called Bidirectional Relationship Classification (BRC). We see that LOGIN can successfully distinguish relational direction than existing methods (in BRC task) while showing state-of-the-art results on the Visual Genome benchmark (in SGG task).
- Question: Which of the issues that reflect the nature of the data itself has not been addressed in depth in previous studies?
- Idea: The characteristics shared by target issues are solved simultaneously using a bottom-up approach.
- Dataset: Visual Genome
- Method: Local-to-Global Interaction Network (LOGIN)
- Keywords: Scene Graph Generation (SGG), Scene Understanding, Relationship Detection, Bidirectional Relationship Classification
Revisiting Dropout: Escaping Pressure for Training Neural Networks with Multiple Costs | Electronics
A common approach to jointly learn multiple tasks with a shared structure is to optimize the model with a combined landscape of multiple sub-costs. However, gradients derived from each sub-cost often conflicts in cost plateaus, resulting in a subpar optimum. In this work, we shed light on such gradient conflict challenges and suggest a solution named Cost-Out, which randomly drops the sub-costs for each iteration. We provide the theoretical and empirical evidence of the existence of escaping pressure induced by the Cost-Out mechanism. While simple, the empirical results indicate that the proposed method can enhance the performance of multi-task learning problems, including two-digit image classification sampled from MNIST dataset and machine translation tasks for English from and to French, Spanish, and German WMT14 datasets.
- Question: What leads to sub-par optimum in the multi-task learning environment?
- Idea: Resolve gradient conflicts among multiple tasks via drop-out-like mechanism.
- Dataset: MNIST, WMT14
- Method: Cost-Out
- Keywords: Multitask Learning, Gradient Conflict, Escaping Pressure, Dropout