ACL 2024 Findings, "Structural Optimization Ambiguity and Simplicity Bias in Unsupervised Neural Grammar Induction"

The video will be uploaded after the conference!

Summary & Contribution

  • We reveal two problems, “Structural Optimization Ambiguity” (SOA) and “Structural Simplicity Bias” (SSB), which cause high variance and overly simplistic tree structures in Unsupervised Neural Grammar Induction (UNGI)

  • We introduce a simple method, “Sentence-wise Parse-Focusing”, which uses only a selected set of parse trees rather than all possible parse trees, together with “Focusing-Bias Generation”, which produces promising tree structures from pre-trained unsupervised parsers

  • We demonstrate the effectiveness of our approach by evaluating performance with various focusing-biases and with small grammars.

Introduction

Grammar induction has been actively studied because of its unique ability to capture and exploit the structural information of language, and neural parameterization has brought substantial performance gains. However, the grammars induced by these approaches still have unaddressed issues: they exhibit high variance in parsing F1 scores and generate overly simplistic parse trees that differ from the gold parse trees. In this study, we identify two underlying causes: Structural Optimization Ambiguity (SOA) and Structural Simplicity Bias (SSB).

Structural Optimization Ambiguity

We define two grammars as structurally optimization-ambiguous if, for the same set of sentences, they assign the same sentence probabilities while deriving different parse trees, despite having different rule probability distributions. Consider the search space of grammars: two different optima may prefer different tree structures while incurring the same loss. Since they are indistinguishable in the loss landscape, which optimum the model converges to depends on random components, resulting in high variance. Under the single-pre-terminal constraint, we mathematically prove that SOA can always exist. We also empirically demonstrate SOA under general conditions: for the existing model FGG-TNPCFG, we found a very low correlation between negative log-likelihood and F1 score, which indirectly shows that the same loss can correspond to very different tree structures. We also observed that the variance in F1 scores increases as the grammar size grows.
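As a rough formalization (the notation here is ours, not quoted from the paper): two grammars G_1 and G_2 are structurally optimization-ambiguous over a sentence set D when

```math
p_{G_1}(s) = p_{G_2}(s) \;\; \forall s \in D,
\qquad
\exists s \in D:\; \arg\max_{t \in T(s)} p_{G_1}(t \mid s) \neq \arg\max_{t \in T(s)} p_{G_2}(t \mid s),
```

where T(s) is the set of candidate parse trees for s. Both grammars yield exactly the same likelihood-based loss, so which one training converges to is decided by random components such as initialization.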

Structural Simplicity Bias

Structural simplicity bias refers to the tendency of induced grammars to use as few rules as possible. In traditional learning, this is a natural and even desirable outcome. In UNGI, however, SSB is extremely strong: it forces the grammar to generate left- or right-binarized parse trees, and its influence outweighs the signal from the data. As a result, grammars use only a fraction of their potential expressiveness, leading to performance below their capacity. To measure this, we averaged the number of unique rule types used per sentence length when generating parse trees. This analysis confirms that grammars from existing models use fewer rules than the gold parse trees, and that the bias becomes stronger as grammar size decreases. This over-simplified rule utilization is relaxed when our method is applied.
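Below is a minimal sketch of that per-length statistic (the tree/rule representation and function name are our own assumptions, not the paper's code):

```python
from collections import defaultdict

def avg_unique_rules_per_length(parses):
    """parses: iterable of (sentence_length, rules_used) pairs, where
    rules_used lists the grammar rules applied in that sentence's parse tree.
    Returns {sentence_length: average number of unique rule types}."""
    totals, counts = defaultdict(int), defaultdict(int)
    for length, rules in parses:
        totals[length] += len(set(rules))   # unique rule types in this tree
        counts[length] += 1
    return {l: totals[l] / counts[l] for l in totals}

# A strongly biased grammar reuses the same few rules, so its per-length
# averages stay well below those computed from the gold trees.
biased = [(5, ["S->A B", "B->A B", "B->A B", "B->A A"])]
gold   = [(5, ["S->NP VP", "NP->DT NN", "VP->VBD NP", "NP->NN NN"])]
print(avg_unique_rules_per_length(biased))  # {5: 3.0}
print(avg_unique_rules_per_length(gold))    # {5: 4.0}
```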

Method

These issues arise because the sentence probability used as the loss marginalizes over all possible parse trees. The variance issue was not critical in SGI or in EM-based algorithms. Inspired by this observation, we introduce sentence-wise parse-focusing and focusing-bias generation.
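Schematically (our notation; the paper applies the bias at the span level inside the chart rather than over whole trees), the contrast between the standard objective and the parse-focused one is:

```math
\mathcal{L}_{\text{standard}} = -\sum_{s \in D} \log \sum_{t \in T(s)} p_\theta(t, s),
\qquad
\mathcal{L}_{\text{focused}} = -\sum_{s \in D} \log \sum_{t \in T(s)} b_s(t)\, p_\theta(t, s),
```

where b_s is the sentence-wise focusing bias built from the selected parses, concentrating the loss on their structures instead of marginalizing uniformly over all trees.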

Sentence-wise Parse-Focusing

We modify the inside algorithm to aggregate and use only the structures included in the selected parse trees. For example, given two parse trees, as shown in the figure, we count the frequency of every span and weight the scores computed by the inside algorithm according to each span's proportion. A strict bias, however, assigns probability only to the observed spans, which excludes the possibility of unobserved structures. To preserve these possibilities, we instead use a soft-weighted bias derived from a softmax.
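A minimal sketch of such a soft span-level bias, assuming spans are represented as (start, end) pairs and the softmax is taken over raw span frequencies (how exactly the weights enter the inside recursion is simplified away here):

```python
import math
from collections import Counter

def soft_span_bias(sentence_len, selected_trees, temperature=1.0):
    """selected_trees: parse trees for ONE sentence, each as a set of
    (start, end) spans.  Returns {span: weight} over ALL spans of width >= 2:
    a softmax of span frequencies, so observed spans get larger weights while
    unobserved spans keep a small non-zero weight (the 'soft' bias)."""
    freq = Counter(span for tree in selected_trees for span in tree)
    spans = [(i, j) for i in range(sentence_len)
                    for j in range(i + 2, sentence_len + 1)]
    logits = {s: freq[s] / temperature for s in spans}
    z = sum(math.exp(v) for v in logits.values())
    return {s: math.exp(v) / z for s, v in logits.items()}

# Two selected parses of a 4-word sentence.
tree_a = {(0, 4), (0, 2), (2, 4)}   # balanced
tree_b = {(0, 4), (1, 4), (2, 4)}   # mostly right-branching
bias = soft_span_bias(4, [tree_a, tree_b])
# (0, 4) and (2, 4) appear in both trees and get the largest weights, while a
# never-observed span such as (1, 3) still receives a small non-zero weight.
```

In the inside pass, these weights would then scale the chart score of each span, so structures that agree with the selected parses dominate the loss without strictly forbidding the rest.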

Focusing-Bias Generation

In sentence-wise parse-focusing, performance depends on the selected parse trees. To obtain a good bias without gold parse trees, we use parsers pre-trained with unsupervised learning on the same data. Specifically, we draw parse trees from parsers based on different models and mix them, which offsets the model-specific biases and reinforces the data-specific ones. This helps the model make more accurate predictions for the given data.
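For illustration, trees from the heterogeneous parsers could feed the bias computation sketched above; `parsers` below stands for hypothetical wrappers around the pre-trained checkpoints, not real APIs:

```python
def focusing_bias_for_sentence(tokens, parsers, temperature=1.0):
    """tokens: the sentence as a list of words.
    parsers: callables mapping tokens to a set of (start, end) spans, e.g.
    placeholder wrappers around pre-trained StructFormer / NBL-PCFG models.
    Mixing trees from different models cancels their model-specific branching
    biases while reinforcing the spans they agree on (the data-specific bias)."""
    selected_trees = [parse(tokens) for parse in parsers]
    return soft_span_bias(len(tokens), selected_trees, temperature)  # helper from the sketch above
```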

Experiments

We used the PTB for English, the CTB for Chinese, and SPMRL for other languages. The base model is FGG-TNPCFG, and the pre-trained parsers are StructFormer, NBL-PCFG, and FGG-TNPCFG. All training hyperparameters, including those of the pre-trained parsers, follow the original papers. We ran 32 experiments for both the reproduced baselines and our model to clearly measure variance. Our method achieves high performance while significantly reducing variance. For our method, the same set of randomly selected parsers was used across all random seeds, so that we evaluate the robustness of training under an identical focusing bias. We also compared performance across languages, and our method achieves the highest performance in most languages with low variance.

Analysis

We show that parse-focusing addresses the high-variance problem regardless of the focusing-bias: every focusing-bias in the table yields significantly lower variance than the base model. Applying our method to small-scale grammars also achieves high performance, reducing structural simplicity bias and showing that small grammars have more potential expressiveness than previously known. In addition, we examined rule frequencies during parsing and found that our method mitigates over-reliance on specific rules. Finally, heterogeneous parsers achieve higher performance on average, supporting the idea that offsetting model-specific biases is an effective strategy.