The dimension of invariant and equivariant linear layers

July 2021

Neural networks are compositions of linear operatiors and pointwise non-linearities. If the input is x{\bf x}, then a 2-layers network will look like

f(x)=mρLρA (x) f({\bf x}) = m \circ \rho \circ L \circ \rho \circ A ~ (\bf x)

where ρ\rho is a pointwise nonlinearity (for example, ReLu), A,LA, L are linear operators and mm is a multi-layer perceptron.

In many cases, one needs to impose some constraints on how ff behaves when x\bf x undergoes some symmetry transformation. Typically, a neural network trained to classify images of cats and dogs should produce the same output if tested on an image of a cat, and on the same image but mirrored.

In this note, we give some theoretical results on symmetry-invariance of tensorized neural networks.



Permutations invariance and equivariance

Most neural architectures do not take as inputs one-dimensional vectors, but rather tensors, which are just generalized vectors. Let us fix the notations and explain how we define the symmetries of tensors.

Tensor notations. We will only deal with « square tensors », possibly with features. A tensor with order kk and dimension nn is simply an element of Rnk\mathbb{R}^{n^k}, seen as an array with kk axes each of which has nn indices: we will not index its elements with one index from 11 to nkn^k, but with a multi-index (i1,,ik)(i_1, \dotsc, i_k), each of them in [n][n]. Vectors are order-1 tensors, matrices are order-2 tensors, and so on. In general, the entries of a tensor are complex numbers, but sometimes they can themselves have a higher dimension, say aa. The space of tensors with kk axes, nn dimensions and features of dimension aa can thus be identified with Rnk×a\mathbb{R}^{n^k \times a}.

Permutation action. For any permutation σ\sigma of [n][n], one can permute the elements of a tensor XRnk×aX \in \mathbb{R}^{n^k \times a} by simply setting

(σX)i1,,ik,j=Xσ(i1),,σ(ik),j. (\sigma * X )_{i_1, \dotsc, i_k, j} = X_{\sigma(i_1), \dotsc, \sigma(i_k), j}.

We say that a function ff defined on tensors has symmetries with respect to permutations if it behaves well when the permutation acts on the tensor.

  • Invariance. A function f: Rnk×aRf : \mathbb{R}^{n^k \times a} \to \mathbb{R} is called permutation-invariant if

f(σX)=f(X).f(\sigma * X) = f(X).
  • Equivariance. A function f:Rnk×aRnh×bf: \mathbb{R}^{n^k \times a } \to \mathbb{R}^{n^h \times b} is called permutation-equivariant if

f(σX)=σf(X). f(\sigma * X) = \sigma * f(X).

Note that for invariance, we restricted to functions which end in R\mathbb{R}. This is sufficient: invariant functions from Rnk×a\mathbb{R}^{n^k \times a } to Rm\mathbb{R}^m are just composed of mm invariant functions. Also, note that for equivariance, one can have an invariant function between tensors of different orders khk \neq h. One of the crucial goals of modern (deep) learning is to craft neural architectures which can approximate as well as possible any function with this kind of invariance. An intuitive way of doing so is to concatenate linear layers and non-linearities, as in (1), but in enforcing the linear layers to be themselves invariant or equivariant. It is easily seen that the resulting network will itself be invariant or equivariant. That being said, not all linear operations are permutation-invariant: for instance, the following example shows that there is essentially one linear form which is permutation-invariant.

The simplest of examples: invariant linear forms

Let us try to find all the linear operations L:RnRL:\mathbb{R}^n \to \mathbb{R} that are permutation-invariant. Such operations are simply multiplications by a n×1n \times 1 matrix, ie

L(x)=i=1naixi L (x) = \sum_{i=1}^n a_i x_i

for some numbers aia_i that we have to find in such a way that L(σx)=L(x)L(\sigma * x) = L(x). Equivalently, we must find the aia_i such that for every xx and every permutation σ\sigma, the numbers aixi\sum a_i x_i and aixσ(i)\sum a_i x_{\sigma(i)} are equal. It is quite easy to see that this can happen only if all the aia_i are equal. Indeed, by taking xx to be the first elementary vector e1=(1,0,,0)e_1 = (1, 0, \dotsc, 0) and σ\sigma to be the transposition (1,2)(1,2), one sees that a1=L(e1)=L(σe1)=L(e2)=a2a_1 = L(e_1) = L(\sigma *e_1) = L(e_2) = a_2 and so on. Consequently, there is essentially only one permutation-equivariant linear form, and it is L(x)=xiL(x) = \sum x_i and its multiples.

Our goal is to push further this example, and identify the invariant/equivariant linear operations in all orders and dimensions.

Dimension of invariant layers

We begin with the dimension of invariant operations, which can be computed quite easily. The solution involves the Bell numbers.

Definition. The kk-th Bell number Bk\mathrm{B}_k is the number of partitions of [k]={1,,k}[k]=\{ 1, \dotsc, k\}.

To be clear, a partition of [k][k] is simply a collection of nonempty disjoint sets, say π1,,πr\pi_1, \dotsc, \pi_r, such that their union is [k][k]. It is worth mentioning that these numbers are growing extremely fast. With some Julia code and the Combinatorics package, one can list all the partitions of [n][n] with collect(partitions(1:n)):

using Combinatorics
BellPart(n) = collect(partitions(1:n))

Here is a small picture of all the B4=15\mathrm{B}_4=15 partitions of {1,2,3,4}\{1, 2, 3, 4 \}:

using Plots

function plot_partition(p)
    order = maximum(maximum, p)
    pl = plot(legend=:none, border=:none, axis=nothing, aspect_ratio=:equal,  
    xlim=(-2, 2), ylim=(-2,2))
    for block in p
        circled = [exp(2im*pi*i/order) for i in block]
        scatter!(pl, real(circled), imag(circled), markersize=10)
    end
    return pl
end

parplots = [plot_partition(p) for p in BellPart(4)]
plot(parplots..., layout=(3, 5))

We now state the main result of this note.

Theorem. The dimension of permutation-invariant linear layers Rnk×aR\mathbb{R}^{n^k \times a} \to \mathbb{R} is a×Bka \times \mathrm{B}_k.

Proof. Let V\mathscr{V} be the vector space of all linear operators from Rnk×a\mathbb{R}^{n^k \times a} to R\mathbb{R}. For any permutation σ\sigma, let Pσ:Rnk×aRnk×aP_\sigma : \mathbb{R}^{n^k \times a} \to \mathbb{R}^{n^k \times a} be the matrix of the action of σ\sigma. By the definition, a linear operator LVL \in \mathscr{V} is invariant iff LPσx=LxLP_\sigma x = Lx for every xx, in other words iff LPσ=LLP_\sigma = L. Consequently, the subspace of invariant operators is Vfix={L:LPσ=L for all σ}\mathscr{V}_{\mathrm{fix}} = \{ L : LP_\sigma = L \text{ for all } \sigma \} and we want to compute its dimension. Thanks to a well-known magical trick from group representation theory, it turns out that the projection from V\mathscr{V} to Vfix\mathscr{V}_{\mathrm{fix}} is given by the group action average Φ:VV\Phi : \mathscr{V} \to \mathscr{V} defined by[1]

Φ:L1n!σLPσ. \Phi:L \mapsto \frac{1}{n!}\sum_\sigma LP_\sigma.

This is easy to prove: first, check that ΦΦ=Φ\Phi \circ \Phi = \Phi, so Φ\Phi is a projector, then check that its image is Vfix\mathscr{V}_{\mathrm{fix}}, which is also not very difficult by double inclusion. Now, the dimension of a subspace is nothing but the trace of its projector, so we want to compute trace(Φ)\mathrm{trace}(\Phi), which by linearity is given by

1n!σtrace(LLPσ). \frac{1}{n!}\sum_\sigma \mathrm{trace}(L \mapsto LP_\sigma).

By trace(LLPσ) \mathrm{trace}(L \mapsto LP_\sigma), we mean the trace of the linear operator from V\mathscr{V} to V\mathscr{V} defined by LLPσL \mapsto LP_\sigma. Fortunately, it is a classical exercise to check that for any linear operator A:Rnk×aRnk×aA : \mathbb{R}^{n^k \times a} \to \mathbb{R}^{n^k \times a}, one has trace(LLA)=trace(A)\mathrm{trace}(L \mapsto LA) = \mathrm{trace}(A).

The dimension dim(Vfix)=a×Bk\dim(\mathscr{V}_{\mathrm{fix}}) = a \times \mathrm{B}_k will follow from two computations, which we state as lemmas because we will use them later for equivariant layers. We recall that the fixed points of a permutation σ\sigma of [n][n] are the i[n]i\in [n] such that σ(i)=i\sigma(i)=i.

Lemma 1.

trace(Pσ)=a×(number of fixed points of σ)k\mathrm{trace}(P_\sigma) = a \times (\text{number of fixed points of } \sigma)^k

Lemma 2.

1n!σ(number of fixed points of σ)k=Bk\frac{1}{n!}\sum_\sigma (\text{number of fixed points of } \sigma)^k = \mathrm{B}_k

Given these, the proof of the dimension formula follows directly from (7), Lemma 1 and Lemma 2.


Proof of Lemma 1 (tensorized version). Check that PσP_\sigma can actually be written as MσMσIaM_\sigma \otimes \dots \otimes M_\sigma \otimes I_a where MσM_\sigma is the n×nn \times n permutation matrix of σ\sigma, \otimes is the Kronecker tensor product, then use that the trace is multiplicative with respect to \otimes and that the trace of a permutation matrix is the number of its fixed points.

Proof of Lemma 1 (vectorized version). By definition of the trace, trace(Pσ)=i1,,ik,j(Pσ)(i1,,ik,j),(i1,,ik,j)\mathrm{trace}(P_\sigma) = \sum_{i_1, \dotsc, i_k, j}(P_\sigma)_{(i_1, \dotsc, i_k, j), (i_1, \dotsc, i_k, j)}. By the definition of PσP_\sigma,

(Pσ)(i1,,ik,j),(i1,,ik,j)=1σ(i1)=i1,,σ(ik)=ik(P_\sigma)_{(i_1, \dotsc, i_k, j), (i'_1, \dotsc, i'_k, j')} = \mathbf{1}_{ \sigma(i_1) = i'_1, \dotsc, \sigma(i_k) = i'_k }

and thus we obtain

trace(Pσ)=j=1a(i1,,ik1σ(i1)=i1××1σ(ik)=ik)=j=1a(i=1n1σ(i)=i)k=a×(number of fixed points of σ)k.\begin{aligned}\mathrm{trace}(P_\sigma) &= \sum_{j=1}^a \left( \sum_{i_1, \dotsc, i_k}\mathbf{1}_{\sigma(i_1) = i_1}\times \dotsc \times \mathbf{1}_{\sigma(i_k) = i_k} \right)\\ &= \sum_{j=1}^a \left(\sum_{i=1}^n \mathbf{1}_{\sigma(i) =i}\right)^k \\ &= a \times (\text{number of fixed points of } \sigma)^k. \end{aligned}

Proof of Lemma 2. The group of permutations of [n][n] acts on the set of functions uu from [k][n][k] \to [n] by σu(k)=σ(u(k))\sigma \star u (k) = \sigma(u(k)). If σ\sigma has, say, zz fixed points, then there are zkz^k functions uu such that σu=u\sigma \star u = u: all the functions who only have values in the zz fixed points of σ\sigma. Burnside's formula says that

1n!σ(number of u such that σu=u)= number of orbits of the action .\frac{1}{n!}\sum_{\sigma} (\text{number of }u\text{ such that } \sigma \star u = u )= \text{ number of orbits of the action } \star.

But the orbits of this action are parametrized by partitions of [n][n]. To see this, just consider the sets u1(i)u^{-1}(i) for i[n]i \in [n]. The nonempty ones form a partition of [k][k], and if two u,vu,v have the same partition, then one will certainly find a σ\sigma such that σu=v\sigma \star u = v...


We thus have the dimension of the subspace of invariant linear layers, and the result has something quite stunning... Our theorems say that invariant layers from Rnk\mathbb{R}^{n^k} to R\mathbb{R} are parametrized by Bk\mathrm{B}_k dimensions... independently of nn.

The dimension of invariant linear layers between tensors do not depend on the size of the tensors, but only on the order of the tensors.

Among other things, a nice consequence is that the learnable parameters are transferable to input tensors with the same order but different sizes!

Basis for invariant layers

We also have at our disposal a basis for this subspace. A basis of the linear space V\mathscr{V} is given by the set of all the nk×an^k\times a elementary « matrices », which can be represented as Ei1,,ik,jE_{i_1, \dotsc, i_k, j}, the linear operator sending the (i1,,ik,j)(i_1, \dotsc, i_k, j) element of the elementary basis of Rnk×a\mathbb{R}^{n^k\times a } onto 1.

Let π\pi be a partition of [k][k] and consider all the multi-indices (i1,,ik)(i_1, \dotsc, i_k) which are constant on the blocks of π\pi; more formally, if s,ts,t are in the same block of π\pi, then is=iti_s = i_t. We note S(π)\mathscr{S}(\pi) all those indices (the equivalence class induced by π\pi).

The family

Fπ,j=(i1,,ik)S(π)Ei1,,ik,jF_{\pi, j} = \sum_{(i_1, \dotsc, i_k) \in \mathscr{S}(\pi) } E_{ i_1, \dotsc, i_k, j}

where π\pi is a partition of [k][k] and j[d]j \in [d] is a basis of the subspace of permutation-invariant linear operators from Rnk×a\mathbb{R}^{n^k \times a } to R\mathbb{R}.

Well, given the previous result, the proof is almost trivial since it is enough to check that the Fπ,jF_{\pi, j} are linearly independent. In fact, one can also prove the theorem by also checking that this family actually generates the subspace of permutation-invariant linear operators!

One of the main features of this description is that when training neural networks, each linear layer is caracterized by only aBka\mathrm{B}_k parameters, provided we have at our disposal the hard-coded basis of the Fπ,jF_{\pi, j}.

Example: Graph-invariant layers. Let us focus on invariant layers from Rn2\mathbb{R}^{n^2} to R\mathbb{R}. Here, we only seek linear operators taking as input square matrices, and invariant with respect to permutations, ie L((Ai,j))=L((Aσ(i),σ(j)))L((A_{i,j})) = L((A_{\sigma(i), \sigma(j)})) for any permutation σ\sigma. The theorem says that this subspace has only B2=2\mathrm{B}_2 = 2 dimensions. The two partitions of [2][2] are {{1,2}}\{ \{1,2\} \} and {{1},{2}}\{ \{1\}, \{2\} \}. The orbit of the first one is simply the set of diagonal couples, (i,i)(i,i), so the corresponding basis vector will be

F1=i=1nEi,iF_1 = \sum_{i=1}^n E_{i,i}

This is nothing but the trace operator: F1(A)=iEi,i(A)=iAi,iF_1(A) = \sum_i E_{i,i}(A) = \sum_i A_{i,i}. The orbit of the second partition is the set of non-diagonal couples, ie (i,j)(i,j) with iji\neq j. The corresponding basis vector will be

F2=ijEi,j.F_2 = \sum_{i \neq j} E_{i,j}.

This is the operator Identitytrace\mathrm{Identity - trace}. Consequently, all the invariant linear forms on Rn2\mathbb{R}^{n^2} have the form

Aλ×i=1nAi,i+μ×ijAi,j. A \mapsto \lambda \times \sum_{i=1}^n A_{i,i} + \mu \times \sum_{i \neq j}A_{i,j}.

Equivariant layers

Let us now have a look at the same problem, but for equivariance. We seek linear layers from Rnk,a\mathbb{R}^{n^k, a} to Rnh,b\mathbb{R}^{n^h, b} which are equivariant with respect to permutations. Note that now, we do not restrict to layers between Rnk×d\mathbb{R}^{n^k\times d} and R\mathbb{R}. We can even consider layers between tensor spaces of different orders khk \neq h and different feature dimensions aba \neq b.

The dimension of permutation-equivariant linear layers Rnk×aRnh×b\mathbb{R}^{n^k\times a} \to \mathbb{R}^{n^h \times b} is a×b×Bk+ha\times b \times \mathrm{B}_{k+h}.

Proof. The proof follows the same lines as for invariant layers, with the difference that the action is not exactly the same. Let us note V\mathscr{V} the set of linear operators from Rnk×aRnh×b\mathbb{R}^{n^k\times a} \to \mathbb{R}^{n^h \times b}. A linear operator LL is invariant iff LPσ=PσLLP_\sigma = P_\sigma L, so in fact we are interested in the dimension of the subspace Veq={L:Pσ1LPσ=L}\mathscr{V}_{\mathrm{eq}} = \{ L : P_\sigma^{-1}LP_{\sigma} = L\}. Here again, this dimension is the trace of the projection operator:

dim(Veq)=1n!σtrace(LPσ1LPσ).\dim (\mathscr{V}_{\mathrm{eq}}) = \frac{1}{n!}\sum_\sigma \mathrm{trace}(L \mapsto P^{-1}_\sigma L P_\sigma).

The main difference is that this time, one has

trace(LPσ1LPσ)=tracenk×a(Pσ)×tracenh×b(Pσ),\mathrm{trace}(L \mapsto P^{-1}_\sigma L P_\sigma) = \mathrm{trace}_{n^k \times a}(P_\sigma)\times \mathrm{trace}_{n^h \times b}(P_\sigma),

where the subscripts indicate in which underlying space we're taking the traces. But then, as before, this is equal to

a×(number of fixed points of σ)k×b×(number of fixed points of σ)h a\times (\text{number of fixed points of }\sigma)^k \times b\times (\text{number of fixed points of }\sigma)^h

and we can now finish the proof exactly as in the invariant case.


Here again, our remark on dimensions of invariants networks is still valid: the dimension of equivariant networks does not depend on nn!

For instance, suppose that one has a graph represented by an adjacency matrix (an order-2 tensor), with edge features of dimension 3, and we would like to represent this graph using only node features of dimension, say, 5; additionnally, we want this to be relabelling-equivariant, ie if we chose to label the nodes in a different way, the node features should be permuted accordingly. We thus seek equivariant layers between Rn2,3\mathbb{R}^{n^2, 3} and Rn,5\mathbb{R}^{n,5} and their dimension is 3×5×B2+1=2253 \times 5 \times \mathrm{B}_{2 + 1} = 225.

We also have a basis for equivariant layers. This time, the elementary basis of the space of linear operators Rnk×aRnh×b\mathbb{R}^{n^k \times a} \to \mathbb{R}^{n^h \times b} will be noted

Ei1,,ik,j,i1,,ih,jE_{i_1, \dotsc, i_k, j, i'_1, \dotsc, i'_h, j'}

meaning the operator sending the elementary basic vector of Rnk×a\mathbb{R}^{n^k \times a} of index (i1,,ik,j)(i_1, \dotsc, i_k, j) to the elementary basic vector of Rnh×b\mathbb{R}^{n^h \times b} of index (i1,,ih,j)(i'_1, \dotsc, i'_h, j'). If π\pi is a partition of the set [k+h][k+h], its orbit is the set of all multi-indices (i1,,ik,i1,,ih)(i_1, \dotsc, i_k, i'_1, \dotsc, i'_h) which are constant of its blocks and we note this orbit S(π)\mathscr{S}(\pi). Then, just as for the invariant case, summing on the orbits yields a basis of Veq\mathscr{V}_{\mathrm{eq} }.

The family

i1,,ik,i1,,ihS(π)Ei1,,ik,j,i1,,ih,j \sum_{i_1, \dotsc, i_k, i'_1, \dotsc, i'_h \in \mathscr{S}(\pi) } E_{i_1, \dotsc, i_k, j, i'_1, \dotsc, i'_h, j'}

where π\pi is a partition of [k+h][k+h], j[a],j[b]j \in [a], j'\in [b] is a basis of the space of permutation-equivariant linear operators from Rnk×a\mathbb{R}^{n^k \times a} to Rnh×b\mathbb{R}^{n^h \times b}.

Example: basis for graph equivariant layers. Let us find all linear operators Rn2Rn2\mathbb{R}^{n^2} \to \mathbb{R}^{n^2} which are permutation-equivariant. By the theorem above, the dimension is B2+2=15\mathrm{B}_{2+2} = 15 and we already saw those partitions in the plot above. Using the preceding description, we can try to plot the basis of the FπF_\pi, which requires a bit of reflection.

isequal(x) = all(y -> y==x[1], x) #checks if all the elements of array x are equal
is_constant_on_blocks(x, P) = prod([isequal(x[block]) for block in P]) 
zeros_t(n,k) = zeros(tuple(n*ones(Int, k)...))# creates a tensor with k axes and n dims

function lifting(P, c, tensor_order)
    out = zeros(Int, tensor_order)
    for i in 1:length(P) 
        a = c[i] * ones(length(P[i]))
        out[P[i]] = a #c[i] * ones(length(P[i]))
    end
    return CartesianIndex(Tuple(out))
end

function create_basic_element(partition,n)
    tensor_order = maximum(maximum, partition)
    x = zeros_t(n,tensor_order)
    for c in CartesianIndices(zeros_t(n,length(partition)))
        cc = lifting(partition, c, tensor_order)
        x[cc]=1
    end
    return x
end

function custom_heatmap(mat)
    #some code for a beautiful heatmap
end

k=2 ; n=4
basis = [create_basic_element(p, n) for p in BellPart(n)]
hmaps = [custom_heatmap(reverse(reshape(x, n^k, n^k), dims=1)) for x in basis]
plot(hmaps..., layout=(3,5))

I took n=4n=4, so the dimension of Rn2\mathbb{R}^{n^2} is 1616, and then I represented each operator Rn2Rn2\mathbb{R}^{n^2} \to \mathbb{R}^{n^2} as a 16×1616 \times 16 matrix. What you see above are the heatmap of the 15 basis elements of the space of equivariant operators; in the heatmap, yellow = 1 and black = 0.

But why ?

One of the goals of deep learning is to craft invariant or equivariant neural architectures for approximating invariant/equivariant functions, and the idea presented in this note was to concatenate several layers of linear invariant/equivariant operators and non-linearities. This will surely yield an invariant/equivariant network... But can all continuous invariant/equivariant functions be approximated by these kind of layers? This problem, known as the expressivity or universality problem, is not an easy one. Here is a small summary of what seems to be the current state of the art (I'll probably write a more detailed note on these topics):

References

I guess that the dimension of permutation invariant or equivariant linear operators is some kind of folklore problem, but it got digged out of oblivion pretty recently in the context of deep learning; a modern-language proof arose in the Invariant and equivariant graph networks wonderful paper (and those which followed by the same group of authors), which seemed to be the first to introduce these invariant networks with concatenations of invariant layers. Most results in my note are inspired by this paper.

[1] Φ\Phi is a linear operator, itself defined on the space of linear operators from Rnk×a\mathbb{R}^{n^k \times a} to R\mathbb{R}. Sometimes it's called a functor.