We first learn a high-dimensional sparse code for each frame by a universal background MBN, then accumulate the sparse codes of the frames in a session (a.k.a. utterance) into a single high-dimensional sparse supervector, and finally reduce the session-level sparse supervectors to a low-dimensional subspace by MBN where clustering is conducted.
Source: wiktionary