At training time, we embed each pixel of the ground truth image SI as the mean of the predefined guide functions f over the pixels of the instance it belongs to, resulting in embeddings e(S, Ψ). We then train the ...
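
The per-instance averaging described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function name `instance_mean_embeddings`, the integer label map, and the two sine guide functions are hypothetical, and the source does not say how background pixels are handled, so here every label is treated as an instance.

```python
import math

def instance_mean_embeddings(labels, guides):
    """Embed each pixel as the mean guide-function vector of its instance.

    labels: 2D list of instance ids (hypothetical ground-truth format).
    guides: list of callables f(x, y) -> float (hypothetical guide functions).
    Returns a 2D list with one embedding (list of floats) per pixel.
    """
    sums, counts = {}, {}
    # First pass: accumulate guide-function values per instance.
    for y, row in enumerate(labels):
        for x, inst in enumerate(row):
            acc = sums.setdefault(inst, [0.0] * len(guides))
            for k, f in enumerate(guides):
                acc[k] += f(x, y)
            counts[inst] = counts.get(inst, 0) + 1
    # Per-instance mean of every guide function.
    means = {i: [s / counts[i] for s in sums[i]] for i in sums}
    # Second pass: every pixel receives the mean vector of its instance.
    return [[means[inst] for inst in row] for row in labels]

# Assumed guide functions: sines of the pixel coordinates (illustrative only).
guides = [lambda x, y: math.sin(0.5 * x), lambda x, y: math.sin(0.5 * y)]
labels = [[1, 1, 2],
          [1, 1, 2]]
emb = instance_mean_embeddings(labels, guides)
```

All pixels of one instance share the same target vector, so in this sketch `emb[0][0]` equals `emb[1][1]` while `emb[0][2]` differs from both.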