Research

Research Publication Summary

Label-Focused Inductive Bias over Latent Object Features in Visual Classification

The Twelfth International Conference on Learning Representations (ICLR 2024)

  • Abstract

    Most neural networks for classification primarily learn features differentiated by input-domain related information such as visual similarity of objects in an im- age. This input-domain focused inductive bias, while natural, can unintentionally conflict with unexpressed yet implicitly utilized relations over latent objects in human labeling, referred to Undescribed world knowledge (UWK). Such conflicts can limit generalization of models by potential dominance of the input-domain focused bias in inference. To overcome this limitation without external resources, we introduce Label-focused Latent-object Biasing (LLB) training method that con- structs label-focused inductive bias over latent objects determined by only labels as UWK. It has four steps: 1) it learns intermediate latent object features in an unsupervised manner; 2) it decouples their visual dependencies by assigning new independent embedding parameters; 3) it captures structured features optimized for the original classification task; and 4) it integrates the structured features with the original visual features for the final prediction. We implement the LLB on a vision transformer architecture, and achieved significant improvements on im- age classification benchmarks. This paper offers a straightforward and effective method to obtain and utilize undescribed world knowledge in classification tasks.

  • Question: Unintentional conflicts between input image(doamin) focused inductive bias and relations over latent objects in human labeling(Undescribed world knowledge) exits, and how can we handle it?
  • Idea: Constructs label-focused inductive bias over latent objects that decouples visual dependencies
  • Dataset: ImageNet1K, Places356, iNaturalist2018
  • Method: Label-focused Latent-object Biasing (LLB)
  • Keywords: Representation Learning, Output-Domain Focused Inductive Bias, Classification
    Links:
Fixed Non-negative Orthogonal Classifier: Inducing Zero-mean Neural Collapse with Feature Dimension Separation

The Twelfth International Conference on Learning Representations (ICLR 2024)

  • Abstract

    Fixed classifiers in neural networks for classification problems have demonstrated cost efficiency and even outperformed learnable classifiers in some popular benchmarks when incorporating orthogonality (Pernici et al., 2021a). Despite these advantages, prior research has yet to investigate the training dynamics of fixed classifiers on neural collapse. Ensuring this phenomenon is critical for obtaining global optimality in a layer-peeled model, potentially leading to enhanced performance in practice. However, the fixed classifiers cannot invoke neural collapse due to their geometric limitations. To overcome the limits, we analyze a zero-mean neural collapse considering the orthogonality in non-negative Euclidean space. Then, we propose a fixed non-negative orthogonal classifier that induces the optimal solution and maximizes the margin of an orthogonal layer-peeled model by satisfying the properties of zero-mean neural collapse. Building on this foundation, we exploit a feature dimension separation effect inherent in our classifier for further purposes; (1) enhances softmax masking by mitigating feature interference in continual learning and (2) tackles the limitations of mixup on the hypersphere in imbalanced learning. We conducted comprehensive experiments on various datasets and architectures and demonstrated significant performance improvements.

  • Question: How the collapse between class means and class weight vectros occurs in a fixed classifier when its shape is not a simplex?
  • Idea: When forcing non-negativity and orthogonality to the fixed classifier, zero-mean neural collapse occurs
  • Dataset: Split-{MNIST,CIFAR10/100,TinyImageNet}, {CIFAR10/100,ImageNet,Places}-LT
  • Method: Fixed Non-negative Orthogonal Classifier
  • Keywords: neural collapse, orthogonal, fixed classifier, imbalanced, continual
    Links:
Revisiting Softmax Masking: Stop Gradient for Enhancing Stability in Replay-based Continual Learning

Arxiv 2024

  • Abstract

    In replay-based methods for continual learning, replaying input samples in episodic memory has shown its effectiveness in alleviating catastrophic forgetting. However, the potential key factor of cross-entropy loss with softmax in causing catastrophic forgetting has been underexplored. In this paper, we analyze the effect of softmax and revisit softmax masking with negative infinity to shed light on its ability to mitigate catastrophic forgetting. Based on the analyses, it is found that negative infinity masked softmax is not always compatible with dark knowledge. To improve the compatibility, we propose a general masked softmax that controls the stability by adjusting the gradient scale to old and new classes. We demonstrate that utilizing our method on other replay-based methods results in better performance, primarily by enhancing model stability in continual learning benchmarks, even when the buffer size is set to an extremely small value.

  • Question: How can we mitigate catastrophic forgetting caused by the push effect of cross-entropy loss with softmax in continual learning?
  • Idea: Enhancing model stability by preventing the gradient flow from the target class to old and new classes for alleviating catastrophic forgetting
  • Dataset: Split-{MNIST,CIFAR10/100,TinyImageNet}
  • Method: General Masked Softmax
  • Keywords: continual learning, softmax, mask, stop gradient
    Links:
Asymptotic Midpoint Mixup for Margin Balancing and Moderate Broadening

Arxiv 2024

  • Abstract

    In the feature space, the collapse between features invokes critical problems in representation learning by remaining the features undistinguished. Interpolation-based augmentation methods such as mixup have shown their effectiveness in relieving the collapse problem between different classes, called inter-class collapse. However, intra-class collapse raised in coarse-to-fine transfer learning has not been discussed in the augmentation approach. To address them, we propose a better feature augmentation method, asymptotic midpoint mixup. The method generates augmented features by interpolation but gradually moves them toward the midpoint of inter-class feature pairs. As a result, the method induces two effects; 1) balancing the margin for all classes and 2) only moderately broadening the margin until it holds maximal confidence. We empirically analyze the collapse effects by measuring alignment and uniformity with visualizing representations. Then, we validate the intra-class collapse effects in coarse-to-fine transfer learning and the inter-class collapse effects in imbalanced learning on long-tailed datasets. In both tasks, our method shows better performance than other augmentation methods.

  • Question: How can we learn balanced margin in decision?
  • Idea: Asymptotically mixup hidden vectors in the latent space
  • Dataset: {CIFAR10/100,ImageNet,Places}-LT, {MNIST,CIFAR10/100}-CoarseToFine
  • Method: Asymptotic Midpoint Mixup
  • Keywords: augmentation, mixup, manifold mixup, imbalanced learning
    Links:
Enhancing Accuracy and Robustness through Adversarial Training in Class Incremental Continual Learning

arxiv

  • Abstract

    In real life, adversarial attack to deep learning models is a fatal security issue. However, the issue has been rarely discussed in a widely used class-incremental continual learning (CICL). In this paper, we address problems of applying adversarial training to CICL, which is well-known defense method against adversarial attack. A well-known problem of CICL is class-imbalance that biases a model to the current task by a few samples of previous tasks. Meeting with the adversarial training, the imbalance causes another imbalance of attack trials over tasks. Lacking clean data of a minority class by the class-imbalance and increasing of attack trials from a majority class by the secondary imbalance, adversarial training distorts optimal decision boundaries. The distortion eventually decreases both accuracy and robustness than adversarial training. To exclude the effects, we propose a straightforward but significantly effective method, External Adversarial Training (EAT) which can be applied to methods using experience replay. This method conduct adversarial training to an auxiliary external model for the current task data at each time step, and applies generated adversarial examples to train the target model. We verify the effects on a toy problem and show significance on CICL benchmarks of image classification. We expect that the results will be used as the first baseline for robustness research of CICL.

  • Question: How to preserve representations via adversarial training in continual learning.
  • Idea: External adversarial sample generation
  • Dataset: Cifar-10, Tiny image-net
  • Method: Experience Replay, External Adversarial Training
  • Keywords: Class-Incremental Continual Learning, Robustness, External Adversarial Training
    Links:
Multiple Invertible and Partial-Equivariant Function for Latent Vector Transformation to Enhance Disentanglement in VAEs

Submmitted Article (IEEE) "This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible" (2023. 05)

  • Abstract

    Disentanglement learning is a core issue for understanding and re-using trained information in Variational AutoEncoder (VAE), and effective inductive bias has been reported as a key factor. However, the actual implementation of such bias is still vague. In this paper, we propose a novel method, called Multiple Invertible and partial-equivariant transformation (MIPE-transformation), to inject inductive bias by 1) guaranteeing the invertibility of latent-to-latent vector transformation while preserving a certain portion of equivariance of input-to-latent vector transformation, called Invertible and partial-equivariant transformation (IPE-transformation), 2) extending the form of prior and posterior in VAE frameworks to an unrestricted form through a learnable conversion to an approximated exponential family, called Exponential Family conversion (EF-conversion), and 3) integrating multiple units of IPE-transformation and EF-conversion, and their training. In experiments on 3D Cars, 3D Shapes, and dSprites datasets, MIPE-transformation improves the disentanglement performance of state-of-the-art VAEs.

  • Question: How to inject inductive bias for learning unsupervised disentangled representations.
  • Idea: Invertible and Partial-Equivariant function for disentanglement learning.
  • Dataset: 3D Cars, dSprites, 3D Shapes
  • Method: Multiple Invertible and Partial-Equivariant Transformation (MIPET)
  • Keywords: Invertible, Partial-Equivariant, Disentanglement learning, Variational Auto-Encoder (VAE)
    Links:
Spherization Layer: Representation Using Only Angles

36th Conference on Neural Information Processing Systems (NeurIPS 2022)

  • Abstract

    In neural network literature, angular similarity between feature vectors is frequently used for interpreting or re-using learned representations. However, the inner product in neural networks partially disperses information over the scales and angles of the involved input vectors and weight vectors. Therefore, when using only angular similarity on representations trained with the inner product, information loss occurs in downstream methods, which limits their performance. In this paper, we proposed the spherization layer to represent all information on angular similarity. The layer 1) maps the pre-activations of input vectors into the specific range of angles, 2) converts the angular coordinates of the vectors to Cartesian coordinates with an additional dimension, and 3) trains decision boundaries from hyperplanes, without bias parameters, passing through the origin. This approach guarantees that representation learning always occurs on the hyperspherical surface without the loss of any information unlike other projection-based methods. Furthermore, this method can be applied to any network by replacing an existing layer. We validate the functional correctness of the proposed method in a toy task, retention ability in well-known image classification tasks, and effectiveness in word analogy test and few-shot learning. Code is publicly available at https://github.com/GIST-IRR/spherization_layer

  • Question: How about learning representations using only the angles and using them on angular similarity?
  • Idea: Learn feature representations by using only angles.
  • Dataset: MNIST, Fashion-MNIST, CIFAR10/100, WikiText, Mini-ImageNet
  • Method: Spherization Layer
  • Keywords: representation learning, spherization, angular similarity
    Links:
Tackling the Challenges in Scene Graph Generation with Local-to-Global Interactions

IEEE Transactions on Neural Networks and Learning Systems (TNNLS 2022)

  • Abstract

    In this work, we seek new insights into the underlying challenges of the Scene Graph Generation (SGG) task. Quantitative and qualitative analysis of the Visual Genome dataset implies 1) Ambiguity - even if inter-object relationship contains the same object (or predicate), they may not be visually or semantically similar, 2) Asymmetry - despite the nature of the relationship that embodied the direction, it was not well addressed in previous studies, and 3) Higher-order contexts - leveraging the identities of certain graph elements can help to generate accurate scene graphs. Motivated by the analysis, we design a novel SGG framework, Local-to-Global Interaction Networks (LOGIN). Locally, interactions extract the essence between three instances - subject, object, and background - while baking direction awareness into the network by constraining the input order. Globally, interactions encode the contexts between every graph components -- nodes and edges. Also we introduce Attract & Repel loss which finely adjusts predicate embeddings. Our framework enables predicting the scene graph in a local-to-global manner by design, leveraging the possible complementariness. To quantify how much LOGIN is aware of relational direction, we propose a new diagnostic task called Bidirectional Relationship Classification (BRC). We see that LOGIN can successfully distinguish relational direction than existing methods (in BRC task) while showing state-of-the-art results on the Visual Genome benchmark (in SGG task).

  • Question: Which of the issues that reflect the nature of the data itself has not been addressed in depth in previous studies?
  • Idea: The characteristics shared by target issues are solved simultaneously using a bottom-up approach.
  • Dataset: Visual Genome
  • Method: Local-to-Global Interaction Network (LOGIN)
  • Keywords: Scene Graph Generation (SGG), Scene Understanding, Relationship Detection, Bidirectional Relationship Classification
    Links:
Feature Structure Distillation with Centered Kernel Alignment in BERT Transferring

ELSEVIER Expert Systems with Applications (ESWA 2023)

  • Abstract

    Knowledge distillation is an approach to transfer information on representations from a teacher to a student by reducing their difference. A challenge of this approach is to reduce the flexibility of the student's representations inducing inaccurate learning of the teacher's knowledge. To resolve it in transferring, we investigate distillation of structures of representations specified to three types: intra-feature, local inter-feature, global inter-feature structures, instead representations. To transfer them, we introduce feature structure distillation methods based on the Centered Kernel Alignment (CKA), which assigns a consistent value to similar feature structures and reveals more informative relations. We implement intra-feature structure with the relation between tokens in single sentence, and inter feature structure with the relation between sentences in mini-batch. In particular, a memory-augmented transfer method with clustering is implemented for the global structures. Then we utilize the CKA metric to transfer the teacher's structure to student's through layer-wise distillation. The methods are empirically analyzed on the nine tasks for language understanding of the GLUE dataset with Bidirectional Encoder Representations from Transformers (BERT), which is a representative neural language model. In the results, the proposed methods effectively transfer the three types of structures and improve performance compared to state-of-the-art distillation methods. Indeed, the code for the methods is available at https://github.com/maroo-sky/FSD.

  • Question: How to transfer the feature structure from teacher to student model?
  • Idea: Transfer relation of teacher's knowledge to student.
  • Dataset: GLUE
  • Method: Feature Structure Distillation (FSD)
  • Keywords: Knowledge Distillation, Centered Kernel Alignment, Feature Sturcture, BERT, Natural Language Processing
    Links:
Learning from Matured Dumb Teacher for Fine Generalization

Arxiv 2021

  • Abstract

    The flexibility of decision boundaries in neural networks that are unguided by training data is a well-known problem typically resolved with generalization methods. A surprising result from recent knowledge distillation (KD) literature is that random, untrained, and equally structured teacher networks can also vastly improve generalization performance. It raises the possibility of existence of undiscovered assumptions useful for generalization on an uncertain region. In this paper, we shed light on the assumptions by analyzing decision boundaries and confidence distributions of both simple and KD-based generalization methods. Assuming that a decision boundary exists to represent the most general tendency of distinction on an input sample space (i.e., the simplest hypothesis), we show the various limitations of methods when using the hypothesis. To resolve these limitations, we propose matured dumb teacher based KD, conservatively transferring the hypothesis for generalization of the student without massive destruction of trained information. In practical experiments on feed-forward and convolution neural networks for image classification tasks on MNIST, CIFAR-10, and CIFAR-100 datasets, the proposed method shows stable improvement to the best test performance in the grid search of hyperparameters. The analysis and results imply that the proposed method can provide finer generalization than existing methods.

  • Idea: Using the simplest hypothese to generalize the model only with given training data.
  • Dataset: MNIST, CIFAR-10, CIFAR-100
  • Method: matured Dumb Teacher based Knowledge Distillation (mDT-KD)
  • Keywords: Generalization, Self-knowledge distillation, Neural networks, Occam’s Razor, Confidence distribution, Decision boundary
    Links:
What and When to Look?: Temporal Span Proposal Network for Video Visual Relation Detection

Arxiv 2021

  • Abstract

    Identifying relations between objects is central to understanding the scene. While several works have been proposed for relation modeling in the image domain, there have been many constraints in the video domain due to challenging dynamics of spatio-temporal interactions (e.g., Between which objects are there an interaction? When do relations occur and end?). To date, two representative methods have been proposed to tackle Video Visual Relation Detection (VidVRD) - segment-based and window-based. We first point out the limitations these two methods have and propose Temporal Span Proposal Network (TSPN), a novel method with two advantages in terms of efficiency and effectiveness. 1) TSPN tells what to look - it sparsifies relation search space by scoring relationness (i.e., confidence score for the existence of a relation between pair of objects) of object pair. 2) TSPN tells when to look - it leverages the full video context to simultaneously predict the temporal span and categories of the entire relations. TSPN demonstrates its effectiveness by achieving new state-of-the-art by a significant margin on two VidVRD benchmarks (ImageNet-VidVDR and VidOR) while also showing lower time complexity than existing methods - in particular, twice as efficient as a popular segment-based approach.

  • Question: How can we extract long-term relation better from a video?
  • Idea: Directly propose a temporal span over object trajectories.
  • Dataset: ImageNet-VidVRD, VidOR
  • Method: Temporal Span Proposal Network (TSPN)
  • Keywords: Video Visual Relation Detection (VidVRD), Spatio-temporal Video Understanding, Temporal Relation Localization
    Links:
Revisiting Dropout: Escaping Pressure for Training Neural Networks with Multiple Costs

Electronics 2021

  • Abstract

    A common approach to jointly learn multiple tasks with a shared structure is to optimize the model with a combined landscape of multiple sub-costs. However, gradients derived from each sub-cost often conflicts in cost plateaus, resulting in a subpar optimum. In this work, we shed light on such gradient conflict challenges and suggest a solution named Cost-Out, which randomly drops the sub-costs for each iteration. We provide the theoretical and empirical evidence of the existence of escaping pressure induced by the Cost-Out mechanism. While simple, the empirical results indicate that the proposed method can enhance the performance of multi-task learning problems, including two-digit image classification sampled from MNIST dataset and machine translation tasks for English from and to French, Spanish, and German WMT14 datasets.

  • Question: What leads to sub-par optimum in the multi-task learning environment?
  • Idea: Resolve gradient conflicts among multiple tasks via drop-out-like mechanism.
  • Dataset: MNIST, WMT14
  • Method: Cost-Out
  • Keywords: Multitask Learning, Gradient Conflict, Escaping Pressure, Dropout
    Links: