Mathematics of Machine Learning

Mathematics of Transformers Workshop in Würzburg 2026

Date: 17.06.2026, 10:15 - 17:15
Category: Event
Location: Hubland Nord, Building 41, 00.006
Speakers: Tim Roith, Nicolás García Trillos, Leon Bungert

Mean-field models for self-attention dynamics in transformers (Prof. Dr. Tim Roith)

Transformer architectures have driven recent breakthroughs in natural language processing, computer vision and generative modeling. Their growing empirical success calls for a thorough understanding of their inner mechanisms, so that they can be deployed reliably and securely. Recent work by Geshkovski, Letrouit, Polyanskiy and Rigollet systematized the mathematical framework for analyzing the forward pass of a transformer by interpreting it as a system of interacting particles, namely the tokens of the architecture. This perspective opens up many interesting connections, such as the study of the associated partial differential equation, which can (at least partially) describe the long-time behavior of the system. In this talk, we give an introduction to the topic and highlight the driving questions, such as clustering and metastability. Moreover, we will present results from Burger, Kabri, Korolev, Roith and Weigand, which consider self-attention on a sphere, motivated by normalization layers in a transformer architecture. Under some assumptions on the weight matrices, the long-time behavior is given as the solution of an energy minimization problem over measures on the sphere. We show how minimizers can be characterized depending on the eigenvalues of the weight matrices.
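To make the interacting-particle picture concrete, the following is a minimal sketch of softmax self-attention dynamics on the unit sphere, integrated with an explicit Euler scheme. It is an illustration only and not code from the cited works: the function names (`attention_step`, `simulate`, `mean_cosine`), the choice Q = K = V = identity, and the values of the inverse temperature `beta`, the step size and the number of tokens are all assumptions made for this example.

```python
import numpy as np


def attention_step(X, Q, K, V, beta, dt):
    """One explicit Euler step of softmax self-attention on the unit sphere.

    X has shape (n, d); row i is token x_i with unit norm.
    """
    # scores[i, j] = beta * <Q x_i, K x_j>
    scores = beta * (X @ Q.T) @ (X @ K.T).T
    # Subtract the row-wise maximum for a numerically stable softmax.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # drift[i] = sum_j weights[i, j] * V x_j
    drift = weights @ (X @ V.T)
    # Project each drift vector onto the tangent space of the sphere at x_i.
    drift -= np.sum(drift * X, axis=1, keepdims=True) * X
    X_new = X + dt * drift
    # Renormalize so the tokens stay on the sphere (stand-in for layer normalization).
    return X_new / np.linalg.norm(X_new, axis=1, keepdims=True)


def simulate(n=32, d=3, beta=1.0, dt=0.05, steps=2000, seed=0):
    """Evolve n random tokens; Q = K = V = identity is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    X0 = rng.normal(size=(n, d))
    X0 /= np.linalg.norm(X0, axis=1, keepdims=True)
    Q = K = V = np.eye(d)
    X = X0.copy()
    for _ in range(steps):
        X = attention_step(X, Q, K, V, beta, dt)
    return X0, X


def mean_cosine(X):
    """Average inner product over all pairs of distinct tokens."""
    G = X @ X.T
    n = len(X)
    return (G.sum() - np.trace(G)) / (n * (n - 1))


if __name__ == "__main__":
    X0, XT = simulate()
    print("mean pairwise cosine at start:", mean_cosine(X0))
    print("mean pairwise cosine at end:  ", mean_cosine(XT))
```

A mean pairwise cosine close to 1 indicates that the tokens have collapsed into a single cluster. Replacing the identity matrices by other weight matrices, or increasing `beta`, is one way to experiment with the clustering and metastability questions mentioned in the abstract.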

 

A collective dynamics perspective on transformer dynamics beyond gradient flows (Prof. Dr. Nicolás García Trillos)

In this talk, I will discuss a collective dynamics perspective on transformers, the architecture at the heart of modern large language models. In particular, we will discuss how dimensionality reduction techniques akin to those used in the study of the Kuramoto model can be employed to explore the rich structure that the evolution of the token (particle) distribution can exhibit for different choices of the key, query, and value matrices parameterizing a transformer model with multiple self-attention layers. This perspective will allow us to explore the structure of token dynamics beyond the gradient flow setting, which arises only for very specific choices of model parameters. In particular, we will discuss how certain parameter choices induce cyclical behavior, consensus formation without stability, and Hamiltonian dynamics. While our theoretical discussion will focus exclusively on 2-dimensional token embeddings, I will also discuss numerical experiments that suggest that our theoretical findings can be extrapolated to general multi-dimensional settings.

This is joint work with Sixu Li (UW-Madison), Jan Peszek (Warsaw), Trevor Teolis (Rice), Konstantin Riedl (Oxford), Jake Maranzatto (Maryland), Semih Akkoc (Maryland), and Sennur Ulukus (Maryland).

 

Quantifying concentration phenomena of self-attention dynamics (Prof. Dr. Leon Bungert)

In this talk, I will study the evolution of tokens in transformers at inference time, which is described by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. We quantify this behavior by bounding the Wasserstein distance between the token distribution and the metastable one in terms of an inverse temperature parameter β>0 and the inference time. For the proof, we use Lyapunov techniques and stability estimates in Wasserstein space together with a quantitative Laplace principle. Our result implies that on time scales of order log β the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that for positive temperature and large time the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.
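For orientation, a mean-field continuity equation for softmax self-attention is commonly written in the following generic form. This is a sketch under assumptions: the symbols $\mu_t$ (token distribution at inference time $t$), $\mathcal{X}_\beta$ (velocity field) and the matrices $Q$, $K$, $V$ follow the usual convention in the literature, and the exact setting of the talk (for instance normalization or the underlying space) may differ.

```latex
\partial_t \mu_t + \nabla \cdot \bigl( \mu_t \, \mathcal{X}_\beta[\mu_t] \bigr) = 0,
\qquad
\mathcal{X}_\beta[\mu](x)
  = \frac{\int e^{\beta \langle Qx, Ky \rangle} \, V y \, \mathrm{d}\mu(y)}
         {\int e^{\beta \langle Qx, Ky \rangle} \, \mathrm{d}\mu(y)} .
```

For large $\beta$ the softmax weights in $\mathcal{X}_\beta[\mu](x)$ concentrate on those $y$ in the support of $\mu$ that maximize $\langle Qx, Ky \rangle$, which is where a Laplace principle enters.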
