Machine Learning.


Introduction

(Notes from a reading of Cohen and Welling: Steerable CNNs.)

Consider $\mathbb{R}^2$ as an affine hyperplane in $\mathbb{R}^3$, embedded via the map $x \mapsto (x, 1)$. Then the Euclidean motion group $\tilde{G} = \mathbb{R}^2 \rtimes O(2)$ has a convenient matrix representation. Let $r$ be a rotation and $t$ a translation. Then

$$r = \begin{pmatrix} R & 0 \\ 0 & 1 \end{pmatrix}, \qquad t = \begin{pmatrix} I & T \\ 0 & 1 \end{pmatrix},$$

where $R \in O(2)$ and $T \in \mathbb{R}^2$. Given $x \in \mathbb{R}^2$, we may identify it with a translation in $\tilde{G}$ via

$$\bar{x} = \begin{pmatrix} I & x \\ 0 & 1 \end{pmatrix}.$$
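As a quick sanity check, this matrix representation can be sketched in a few lines of NumPy (the helper names `rot`, `trans`, and `bar` are my own, not from the paper):

```python
import numpy as np

def rot(theta):
    """Homogeneous 3x3 matrix for a rotation r, with block R in O(2)."""
    c, s = np.cos(theta), np.sin(theta)
    M = np.eye(3)
    M[:2, :2] = [[c, -s], [s, c]]
    return M

def trans(T):
    """Homogeneous 3x3 matrix for a translation t by T in R^2."""
    M = np.eye(3)
    M[:2, 2] = T
    return M

def bar(x):
    """The translation x-bar identifying a point x in R^2 with a group element."""
    return trans(x)

# A point x in R^2 embeds as (x, 1); the motion tr acts affinely: x |-> Rx + T.
x = np.array([1.0, 2.0, 1.0])            # embedded point (x, 1)
g = trans([3.0, -1.0]) @ rot(np.pi / 2)  # the motion tr
print(np.round(g @ x, 6))                # [1. 0. 1.], i.e. Rx + T = (1, 0)
```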

In image recognition, $\mathbb{Z}^2$ parametrizes the pixels of an image of infinite width and height, and a discrete subgroup $G$ of $\tilde{G}$ acts on this parametrization. In particular, $G = \mathbb{Z}^2 \rtimes D_4$, so that our parametrization is the homogeneous space $\mathbb{Z}^2 = G/D_4$. Let’s call this parametrization the pixel space.

Images are described by feature maps, which are functions $f: G/D_4 \to \mathbb{R}^K$, with each dimension of the target interpreted as a color channel. For example, $K = 3$ may correspond to an RGB image, and $K = 4$ to a CMYK image. A feature map typically takes non-negative rational values, which correspond to pixel intensities. A representation of a real-world image would thus be a compactly supported feature map.

Over the pixel space, we have a homogeneous vector bundle $G \times_{D_4} \mathbb{R}^K$ with an action of the discrete motion group given by $g' \cdot (g, v) = (g'g, v)$. Let $F = \Gamma(G \times_{D_4} \mathbb{R}^K)$ be the space of all feature maps. $G$ acts on it by left translation: $[\pi(tr)f](xD_4) = f((tr)^{-1}xD_4)$.

Let $\Psi: F \to \mathbb{R}^K$ be a filter bank. This can be thought of as a collection of $K$ linear functionals on the space of feature maps. Each linear functional is an operation performed on the value of a fixed pixel (which we may as well assume is the origin): it outputs a weighted sum of the values associated to the fixed pixel and its neighbors. By translating pixels to the origin, we can construct a feature map $\Psi f \in F'$, where $F' = \Gamma(G \times_{D_4} \mathbb{R}^K)$ is another space of feature maps. This is defined as follows:

$$(\Psi f)(x) = \Psi(\pi(\bar{x})^{-1} f).$$

Thus we get a map $\Phi: F \to F'$ via $\Phi(f) = \Psi f$. This map is called a convolutional neural network.
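Concretely, for $K = 1$ and a filter supported on a $3 \times 3$ neighborhood of the origin, the definition unwinds to an ordinary cross-correlation: translate pixel $x$ to the origin, then apply the fixed linear functional. A minimal sketch (the zero padding and the $3 \times 3$ support are my own choices, not dictated by the text):

```python
import numpy as np

def apply_filter(f, w):
    """(Psi f)(x) = Psi(pi(bar x)^{-1} f): shift pixel x to the origin,
    then take the weighted sum of its neighborhood with weights w."""
    k = w.shape[0] // 2
    fp = np.pad(f, k)                   # zero values off the image
    out = np.zeros_like(f)
    for i in range(f.shape[0]):
        for j in range(f.shape[1]):
            out[i, j] = np.sum(w * fp[i:i + 2*k + 1, j:j + 2*k + 1])
    return out

f = np.zeros((5, 5))
f[2, 2] = 1.0                           # a single lit pixel
w = np.ones((3, 3))                     # an arbitrary filter
print(apply_filter(f, w).sum())         # 9.0: each of the 9 neighboring positions sees the pixel
```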

Let $(\mathbb{R}^K, \rho)$ be a representation of $D_4$ on the model fiber of $F'$. Our objective is to construct a representation $(F', \pi')$ such that if the filter bank $\Psi$ intertwines the dihedral group representations $(F, \pi)$ and $(\mathbb{R}^K, \rho)$, then the convolutional neural network $\Phi$ intertwines the discrete motion group representations $(F, \pi)$ and $(F', \pi')$. First we need some algebra:

Lemma. Let $r$ be a rotation, $t$ a translation, and $\bar{x}$ the translation corresponding to the position $x \in \mathbb{R}^2$. Then

$$(tr)^{-1}\bar{x}\,r = \overline{(tr)^{-1}x}.$$

The proof is a straightforward application of the given matrix representation. This result leads to the following equivariance rule.
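The lemma is also easy to confirm numerically in the matrix picture (a throwaway check; the `rot`/`trans` helpers are hypothetical names, not from the paper):

```python
import numpy as np

def rot(theta):
    """Homogeneous 3x3 matrix of a rotation."""
    c, s = np.cos(theta), np.sin(theta)
    M = np.eye(3)
    M[:2, :2] = [[c, -s], [s, c]]
    return M

def trans(T):
    """Homogeneous 3x3 matrix of a translation, also used for x-bar."""
    M = np.eye(3)
    M[:2, 2] = T
    return M

r, t = rot(0.7), trans([0.4, -1.3])
x = np.array([2.0, 5.0])

g_inv = np.linalg.inv(t @ r)                   # (tr)^{-1}
lhs = g_inv @ trans(x) @ r                     # (tr)^{-1} x-bar r
rhs = trans((g_inv @ np.append(x, 1.0))[:2])   # bar of the point (tr)^{-1} x
print(np.allclose(lhs, rhs))                   # True
```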

Proposition. Let $r$ be a rotation, $t$ a translation, and $\bar{x}$ the translation corresponding to the position $x \in \mathbb{Z}^2$. If $\Psi\pi(r) = \rho(r)\Psi$ for all $r \in D_4$, then

$$(\Psi\,\pi(tr)f)(x) = \rho(r)(\Psi f)((tr)^{-1}x).$$

Proof. By using an identity trick, we can exploit our two versions of the same map to produce the equivariance law:

$$\begin{aligned}
(\Psi\,\pi(tr)f)(x) &= \Psi(\pi(\bar{x})^{-1}\pi(tr)f) \\
&= \Psi(\pi(r)\,\pi(r)^{-1}\pi(\bar{x})^{-1}\pi(tr)f) \\
&= \rho(r)\Psi(\pi(r)^{-1}\pi(\bar{x})^{-1}\pi(tr)f) \\
&= \rho(r)\Psi(\pi(r^{-1}\bar{x}^{-1}tr)f) \\
&= \rho(r)\Psi\big(\pi\big(((tr)^{-1}\bar{x}\,r)^{-1}\big)f\big) \\
&= \rho(r)\Psi\big(\pi(\overline{(tr)^{-1}x})^{-1}f\big) \\
&= \rho(r)(\Psi f)((tr)^{-1}x).
\end{aligned}$$

With this calculation in mind, we are now in a position to define a representation of $G$ on $F'$:

$$[\pi'(tr)(\Psi f)](x) = \rho(r)(\Psi f)((tr)^{-1}x).$$

To verify that $\pi'$ is indeed a representation, one checks the relation $\pi'(t_1r_1t_2r_2) = \pi'(t_1r_1)\pi'(t_2r_2)$ by using the fact that the conjugate of a translation by a rotation is again a translation, and that

$$t_1r_1t_2r_2 = t_1(r_1t_2r_1^{-1})\,r_1r_2.$$
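Both facts can be checked directly with the matrix representation (again using the throwaway `rot`/`trans` helpers, names of my own choosing):

```python
import numpy as np

def rot(theta):
    """Homogeneous 3x3 matrix of a rotation."""
    c, s = np.cos(theta), np.sin(theta)
    M = np.eye(3)
    M[:2, :2] = [[c, -s], [s, c]]
    return M

def trans(T):
    """Homogeneous 3x3 matrix of a translation."""
    M = np.eye(3)
    M[:2, 2] = T
    return M

t1, r1 = trans([1.0, 2.0]), rot(0.3)
t2, r2 = trans([-0.5, 4.0]), rot(1.1)

# The conjugate of the translation t2 by the rotation r1 is again a translation:
conj = r1 @ t2 @ np.linalg.inv(r1)
print(np.allclose(conj[:2, :2], np.eye(2)))                 # True: rotation block is trivial

# The factorization t1 r1 t2 r2 = t1 (r1 t2 r1^{-1}) r1 r2:
print(np.allclose(t1 @ r1 @ t2 @ r2, t1 @ conj @ r1 @ r2))  # True
```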

As a consequence, we get an intertwining property:

Corollary. If $\Psi\pi(r) = \rho(r)\Psi$ for all $r \in D_4$, then $\Phi\pi(g) = \pi'(g)\Phi$ for all $g \in G$.

Whenever we can find representations $(F, \pi)$ and $(F', \pi')$ for which $\Phi$ is an intertwiner, we say that $\Phi$ is a steerable convolutional neural network.
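For a toy instance, take $\rho$ trivial and a $D_4$-symmetric $3 \times 3$ filter; the corollary then predicts that $\Phi$ commutes with 90° rotations of the image. A quick numerical check (my own construction, with zero padding; the filter weights are arbitrary subject to the symmetry):

```python
import numpy as np

def apply_filter(f, w):
    """Cross-correlation of f with w, zero-padded: the map Phi for K = 1."""
    k = w.shape[0] // 2
    fp = np.pad(f, k)
    out = np.zeros_like(f)
    for i in range(f.shape[0]):
        for j in range(f.shape[1]):
            out[i, j] = np.sum(w * fp[i:i + 2*k + 1, j:j + 2*k + 1])
    return out

# A filter bank fixed by all of D4 (90-degree rotations and reflections).
w = np.array([[0.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 0.0]])
f = np.random.default_rng(0).random((8, 8))

# Phi(pi(r) f) = pi'(r) Phi(f) with rho trivial:
# rotating then filtering agrees with filtering then rotating.
print(np.allclose(apply_filter(np.rot90(f), w), np.rot90(apply_filter(f, w))))  # True
```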

To determine the homogeneous vector bundle of which $\Psi f$ is a section, it is enough to calculate the action of $D_4$ on $(\Psi f)(0)$:

$$[\pi'(r)(\Psi f)](0) = \rho(r)(\Psi f)(r^{-1} \cdot 0) = \rho(r)(\Psi f)(0).$$

This means that $\Psi f \in \Gamma(G \times_\rho \mathbb{R}^K)$. From representation theory, we know that

$$\Gamma(G \times_\rho \mathbb{R}^K) \cong \operatorname{Ind}_{D_4}^G(\rho),$$

thus we may interpret $\Phi$ as a map into an induced representation of the discrete motion group. Moreover, we can treat $\Gamma(G \times_{D_4} \mathbb{R}^K)$ as an induced representation as well.

On $F$, the action of $D_4$ should only move the pixels of the image; there should be no linear transformations within fibers (i.e. no transformations of color channels). This means that $D_4$ has a trivial action on the value of $f$ at the origin. Hence we are regarding $F$ as $\Gamma(G \times_{\rho_0} \mathbb{R}^K)$, where $\rho_0$ denotes the trivial representation of $D_4$. To summarize, $\Gamma(G \times_{\rho_0} \mathbb{R}^K) \cong \operatorname{Ind}_{D_4}^G(\rho_0) \cong C(\mathbb{Z}^2, \mathbb{R}^K)$. We can now restate this picture in a representation-theoretic context.

Theorem. Let $(\mathbb{R}^K, \rho_0)$ be the trivial representation of $D_4$ and $\operatorname{Ind}_{D_4}^G(\rho_0)$ the space of feature maps. Let $\Psi: \operatorname{Ind}_{D_4}^G(\rho_0) \to \mathbb{R}^K$ be a $D_4$-equivariant filter bank with respect to $(\mathbb{R}^K, \rho)$. Then $\Phi: \operatorname{Ind}_{D_4}^G(\rho_0) \to \operatorname{Ind}_{D_4}^G(\rho)$ is a steerable convolutional neural network.
