Stephanie Chan

In-context Few-shot Learning, and the Importance of Data Distributions

Intriguingly, large transformer-based models like GPT-3 are able to perform in-context learning – learning rapidly, and without gradient updates, from a very small number of examples – without being explicitly trained for it. We show that this behaviour is driven by the distributions of the training data itself, and that the relevant distributional properties are inherent to naturalistic data in a wide range of domains, including language. We also find that naturalistic data distributions elicit in-context learning only in transformers, and not in recurrent models. Thus, our findings indicate how the transformer architecture works together with particular properties of the training data to drive the emergent in-context learning behaviour of large language models, and how future work might encourage both in-context and in-weights learning in domains beyond language.
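As an illustrative sketch (not part of the talk itself): in-context few-shot learning means conditioning a frozen model on a handful of example pairs placed directly in its input, with no parameter updates. The helper below simply formats such a few-shot prompt; the example pairs and the `Input:`/`Output:` template are hypothetical choices for illustration.

```python
def build_few_shot_prompt(examples, query):
    """Format (input, label) example pairs plus a query into one prompt.

    The model never sees a gradient update; the "learning" happens
    entirely through conditioning on these in-context examples.
    """
    lines = [f"Input: {x} -> Output: {y}" for x, y in examples]
    lines.append(f"Input: {query} -> Output:")
    return "\n".join(lines)

# Hypothetical toy task: classify a word as "animal" or "plant".
examples = [("cat", "animal"), ("rose", "plant"), ("dog", "animal")]
prompt = build_few_shot_prompt(examples, "tulip")
print(prompt)
```

Passing `prompt` to a pretrained language model and reading its continuation is the entire "training" procedure; the weights stay fixed throughout.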
