Neural networks are compositions of linear operatiors and pointwise non-linearities. If the input is , then a 2-layers network will look like
where is a pointwise nonlinearity (for example, ReLu), are linear operators and is a multi-layer perceptron.
In many cases, one needs to impose some constraints on how behaves when undergoes some symmetry transformation. Typically, a neural network trained to classify images of cats and dogs should produce the same output if tested on an image of a cat, and on the same image but mirrored.
In this note, we give some theoretical results on symmetry-invariance of tensorized neural networks.
Most neural architectures do not take as inputs one-dimensional vectors, but rather tensors, which are just generalized vectors. Let us fix the notations and explain how we define the symmetries of tensors.
Tensor notations. We will only deal with « square tensors », possibly with features. A tensor with order and dimension is simply an element of , seen as an array with axes each of which has indices: we will not index its elements with one index from to , but with a multi-index , each of them in . Vectors are order-1 tensors, matrices are order-2 tensors, and so on. In general, the entries of a tensor are complex numbers, but sometimes they can themselves have a higher dimension, say . The space of tensors with axes, dimensions and features of dimension can thus be identified with .
Permutation action. For any permutation of , one can permute the elements of a tensor by simply setting
We say that a function defined on tensors has symmetries with respect to permutations if it behaves well when the permutation acts on the tensor.
Invariance. A function is called permutation-invariant if
Equivariance. A function is called permutation-equivariant if
Note that for invariance, we restricted to functions which end in . This is sufficient: invariant functions from to are just composed of invariant functions. Also, note that for equivariance, one can have an invariant function between tensors of different orders . One of the crucial goals of modern (deep) learning is to craft neural architectures which can approximate as well as possible any function with this kind of invariance. An intuitive way of doing so is to concatenate linear layers and non-linearities, as in (1), but in enforcing the linear layers to be themselves invariant or equivariant. It is easily seen that the resulting network will itself be invariant or equivariant. That being said, not all linear operations are permutation-invariant: for instance, the following example shows that there is essentially one linear form which is permutation-invariant.
Let us try to find all the linear operations that are permutation-invariant. Such operations are simply multiplications by a matrix, ie
for some numbers that we have to find in such a way that . Equivalently, we must find the such that for every and every permutation , the numbers and are equal. It is quite easy to see that this can happen only if all the are equal. Indeed, by taking to be the first elementary vector and to be the transposition , one sees that and so on. Consequently, there is essentially only one permutation-equivariant linear form, and it is and its multiples.
Our goal is to push further this example, and identify the invariant/equivariant linear operations in all orders and dimensions.
We begin with the dimension of invariant operations, which can be computed quite easily. The solution involves the Bell numbers.
Definition. The -th Bell number is the number of partitions of .
To be clear, a partition of is simply a collection of nonempty disjoint sets, say , such that their union is . It is worth mentioning that these numbers are growing extremely fast. With some Julia code and the
Combinatorics package, one can list all the partitions of with
using Combinatorics BellPart(n) = collect(partitions(1:n))
Here is a small picture of all the partitions of :
using Plots function plot_partition(p) order = maximum(maximum, p) pl = plot(legend=:none, border=:none, axis=nothing, aspect_ratio=:equal, xlim=(-2, 2), ylim=(-2,2)) for block in p circled = [exp(2im*pi*i/order) for i in block] scatter!(pl, real(circled), imag(circled), markersize=10) end return pl end parplots = [plot_partition(p) for p in BellPart(4)] plot(parplots..., layout=(3, 5))
We now state the main result of this note.
Proof. Let be the vector space of all linear operators from to . For any permutation , let be the matrix of the action of . By the definition, a linear operator is invariant iff for every , in other words iff . Consequently, the subspace of invariant operators is and we want to compute its dimension. Thanks to a well-known magical trick from group representation theory, it turns out that the projection from to is given by the group action average defined by
This is easy to prove: first, check that , so is a projector, then check that its image is , which is also not very difficult by double inclusion. Now, the dimension of a subspace is nothing but the trace of its projector, so we want to compute , which by linearity is given by
By , we mean the trace of the linear operator from to defined by . Fortunately, it is a classical exercise to check that for any linear operator , one has .
The dimension will follow from two computations, which we state as lemmas because we will use them later for equivariant layers. We recall that the fixed points of a permutation of are the such that .
Given these, the proof of the dimension formula follows directly from (7), Lemma 1 and Lemma 2.
Proof of Lemma 1 (tensorized version). Check that can actually be written as where is the permutation matrix of , is the Kronecker tensor product, then use that the trace is multiplicative with respect to and that the trace of a permutation matrix is the number of its fixed points.
Proof of Lemma 1 (vectorized version). By definition of the trace, . By the definition of ,
and thus we obtain
Proof of Lemma 2. The group of permutations of acts on the set of functions from by . If has, say, fixed points, then there are functions such that : all the functions who only have values in the fixed points of . Burnside's formula says that
But the orbits of this action are parametrized by partitions of . To see this, just consider the sets for . The nonempty ones form a partition of , and if two have the same partition, then one will certainly find a such that ...
We thus have the dimension of the subspace of invariant linear layers, and the result has something quite stunning... Our theorems say that invariant layers from to are parametrized by dimensions... independently of .
Among other things, a nice consequence is that the learnable parameters are transferable to input tensors with the same order but different sizes!
We also have at our disposal a basis for this subspace. A basis of the linear space is given by the set of all the elementary « matrices », which can be represented as , the linear operator sending the element of the elementary basis of onto 1.
Let be a partition of and consider all the multi-indices which are constant on the blocks of ; more formally, if are in the same block of , then . We note all those indices (the equivalence class induced by ).
where is a partition of and is a basis of the subspace of permutation-invariant linear operators from to .
Well, given the previous result, the proof is almost trivial since it is enough to check that the are linearly independent. In fact, one can also prove the theorem by also checking that this family actually generates the subspace of permutation-invariant linear operators!
One of the main features of this description is that when training neural networks, each linear layer is caracterized by only parameters, provided we have at our disposal the hard-coded basis of the .
Example: Graph-invariant layers. Let us focus on invariant layers from to . Here, we only seek linear operators taking as input square matrices, and invariant with respect to permutations, ie for any permutation . The theorem says that this subspace has only dimensions. The two partitions of are and . The orbit of the first one is simply the set of diagonal couples, , so the corresponding basis vector will be
This is nothing but the trace operator: . The orbit of the second partition is the set of non-diagonal couples, ie with . The corresponding basis vector will be
This is the operator . Consequently, all the invariant linear forms on have the form
Let us now have a look at the same problem, but for equivariance. We seek linear layers from to which are equivariant with respect to permutations. Note that now, we do not restrict to layers between and . We can even consider layers between tensor spaces of different orders and different feature dimensions .
Proof. The proof follows the same lines as for invariant layers, with the difference that the action is not exactly the same. Let us note the set of linear operators from . A linear operator is invariant iff , so in fact we are interested in the dimension of the subspace . Here again, this dimension is the trace of the projection operator:
The main difference is that this time, one has
where the subscripts indicate in which underlying space we're taking the traces. But then, as before, this is equal to
and we can now finish the proof exactly as in the invariant case.
Here again, our remark on dimensions of invariants networks is still valid: the dimension of equivariant networks does not depend on !
For instance, suppose that one has a graph represented by an adjacency matrix (an order-2 tensor), with edge features of dimension 3, and we would like to represent this graph using only node features of dimension, say, 5; additionnally, we want this to be relabelling-equivariant, ie if we chose to label the nodes in a different way, the node features should be permuted accordingly. We thus seek equivariant layers between and and their dimension is .
We also have a basis for equivariant layers. This time, the elementary basis of the space of linear operators will be noted
meaning the operator sending the elementary basic vector of of index to the elementary basic vector of of index . If is a partition of the set , its orbit is the set of all multi-indices which are constant of its blocks and we note this orbit . Then, just as for the invariant case, summing on the orbits yields a basis of .
where is a partition of , is a basis of the space of permutation-equivariant linear operators from to .
Example: basis for graph equivariant layers. Let us find all linear operators which are permutation-equivariant. By the theorem above, the dimension is and we already saw those partitions in the plot above. Using the preceding description, we can try to plot the basis of the , which requires a bit of reflection.
isequal(x) = all(y -> y==x, x) #checks if all the elements of array x are equal is_constant_on_blocks(x, P) = prod([isequal(x[block]) for block in P]) zeros_t(n,k) = zeros(tuple(n*ones(Int, k)...))# creates a tensor with k axes and n dims function lifting(P, c, tensor_order) out = zeros(Int, tensor_order) for i in 1:length(P) a = c[i] * ones(length(P[i])) out[P[i]] = a #c[i] * ones(length(P[i])) end return CartesianIndex(Tuple(out)) end function create_basic_element(partition,n) tensor_order = maximum(maximum, partition) x = zeros_t(n,tensor_order) for c in CartesianIndices(zeros_t(n,length(partition))) cc = lifting(partition, c, tensor_order) x[cc]=1 end return x end function custom_heatmap(mat) #some code for a beautiful heatmap end k=2 ; n=4 basis = [create_basic_element(p, n) for p in BellPart(n)] hmaps = [custom_heatmap(reverse(reshape(x, n^k, n^k), dims=1)) for x in basis] plot(hmaps..., layout=(3,5))
I took , so the dimension of is , and then I represented each operator as a matrix. What you see above are the heatmap of the 15 basis elements of the space of equivariant operators; in the heatmap, yellow = 1 and black = 0.
One of the goals of deep learning is to craft invariant or equivariant neural architectures for approximating invariant/equivariant functions, and the idea presented in this note was to concatenate several layers of linear invariant/equivariant operators and non-linearities. This will surely yield an invariant/equivariant network... But can all continuous invariant/equivariant functions be approximated by these kind of layers? This problem, known as the expressivity or universality problem, is not an easy one. Here is a small summary of what seems to be the current state of the art (I'll probably write a more detailed note on these topics):
Yes, every continuous invariant function can be approximated by invariants networks as the ones we studied. But this might require hidden layers with prohibitive orders: if the original input is -dimensional, one might need tensors of order up to , which means vectors from a -dimensional space... This is obviously not usable for practical purposes. Same thing for equivariant networks.
For graphs (order-2 tensors), the most expressive architectures with low-order tensors seem to be the so-called Folklore Graph Neural Networks (FGNN), who incorporate message-passing-like layers (but they are not message-passing networks).
There are alternative architectures that are easy to implement, usable, and universal, such as PointNetST or DeepSets. They are often used for ML tasks on 3d point clouds.
I guess that the dimension of permutation invariant or equivariant linear operators is some kind of folklore problem, but it got digged out of oblivion pretty recently in the context of deep learning; a modern-language proof arose in the Invariant and equivariant graph networks wonderful paper (and those which followed by the same group of authors), which seemed to be the first to introduce these invariant networks with concatenations of invariant layers. Most results in my note are inspired by this paper.
Invariant and equivariant graph networks by Maron, Ben-Hamu, Shamir and Lipman.
On learning sets of symmetric elements by Maron, Litany, Chechik and Fetaya.
On the universality of invariant networks by Maron, Fetaya, Segol and Lipman.
On universal equivariant set networks by Segol and Lipman.
Universal Invariant and Equivariant GNN by Keriven and Peyré.
Expressive power of invariant and equivariant GNN, by Azizian and Lelarge.
|||is a linear operator, itself defined on the space of linear operators from to . Sometimes it's called a functor.|