In the code, all this ignoring of future words is done with attention masks: tensors of shape (batch_size, num_heads, seq_len, seq_len). Considering just the last two dimensions, entry (i, j) indicates whether the output for token i should attend to the layer output from token j.
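As a rough illustration, here is a minimal sketch (assuming PyTorch; the variable names are just for illustration) of how such a causal mask can be built and applied to raw attention scores:

```python
import torch

batch_size, num_heads, seq_len = 2, 4, 5

# Lower-triangular matrix: entry (i, j) is True when token i may attend to
# token j, i.e. only when j <= i (no peeking at future positions).
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Broadcast to the full (batch_size, num_heads, seq_len, seq_len) shape.
mask = causal.expand(batch_size, num_heads, seq_len, seq_len)

# Blocked positions get -inf so that softmax assigns them zero weight.
scores = torch.randn(batch_size, num_heads, seq_len, seq_len)  # raw attention scores
scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)  # row i only puts weight on tokens j <= i
```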