Introduction
(Notes from a reading of Cohen and Welling: Steerable CNNs.)
Consider $\mathbb{R}^2$ as an affine hyperplane in $\mathbb{R}^3$, embedded via the map $x \mapsto (x, 1)$. Then the Euclidean motion group $\tilde{G} = \mathbb{R}^2 \rtimes O(2)$ has a convenient matrix representation. Let $r$ be a rotation and $t$ a translation. Then

$$r = \begin{pmatrix} R & 0 \\ 0 & 1 \end{pmatrix}, \qquad t = \begin{pmatrix} I & T \\ 0 & 1 \end{pmatrix},$$
where $R \in O(2)$ and $T \in \mathbb{R}^2$. Given $x \in \mathbb{R}^2$, we may identify it with a translation in $\tilde{G}$ via

$$\bar{x} = \begin{pmatrix} I & x \\ 0 & 1 \end{pmatrix}.$$

In image recognition, $\mathbb{Z}^2$ parametrizes the pixels of an image of infinite width and height, and a discrete subgroup $G$ of $\tilde{G}$ acts on this parametrization. In particular, $G = \mathbb{Z}^2 \rtimes D_4$, so that our parametrization is the homogeneous space $\mathbb{Z}^2 = G/D_4$. Let's call this parametrization the pixel space.
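As a concrete sanity check, here is a short NumPy sketch (our own illustration; helper names like `rot` and `trans` are not from the paper) of the matrix representation: it builds $r$ and $t$ as $3 \times 3$ matrices and lets $tr$ act on an embedded point $(x, 1)$.

```python
import numpy as np

def rot(theta):
    """Homogeneous 3x3 matrix for a rotation R in O(2)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def trans(T):
    """Homogeneous 3x3 matrix for a translation by T in R^2."""
    m = np.eye(3)
    m[:2, 2] = T
    return m

# x-bar is just the translation by x; the embedded point (x, 1) is its
# last column.
x = np.array([1.0, 0.0, 1.0])            # the pixel (1, 0) embedded as (x, 1)
g = trans([2.0, 3.0]) @ rot(np.pi / 2)   # tr: rotate a quarter turn, then shift
print(g @ x)                             # the pixel (0, 1) shifted to (2, 4)
```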
Images are described by feature maps, which are functions $f : G/D_4 \to \mathbb{R}^K$, with each dimension of the target interpreted as a color channel. For example, $K = 3$ may correspond to an RGB image, and $K = 4$ to a CMYK image. A feature map typically takes non-negative rational values, and these values correspond to pixel intensities. A representation of a real-world image would thus be a compactly supported feature map.
Over the pixel space we have a homogeneous vector bundle $G \times_{D_4} \mathbb{R}^K$, with an action of the discrete motion group given by $g' \cdot (g, v) = (g'g, v)$. Let $F = \Gamma(G \times_{D_4} \mathbb{R}^K)$ be the space of all feature maps. $G$ acts on it by left translation:

$$[\pi(tr) f](x D_4) = (tr) \cdot f\big((tr)^{-1} \cdot x D_4\big).$$
Let $\Psi : F \to \mathbb{R}^{K'}$ be a filter bank. This can be thought of as a collection of $K'$ linear functionals on the space of feature maps, each evaluated at a fixed pixel (which we may as well assume is the origin): each functional outputs a weighted sum of the values at that pixel and its neighbors. By translating pixels to the origin, we can construct a feature map $\Psi * f \in F'$, where $F' = \Gamma(G \times_{D_4} \mathbb{R}^{K'})$ is another space of feature maps. It is defined as follows:
$$(\Psi * f)(x) = \Psi\big(\pi(\bar{x})^{-1} f\big).$$

Thus we get a map $\Phi : F \to F'$ via $\Phi(f) = \Psi * f$. This map is called a convolutional neural network.
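When each of the $K'$ functionals has finite support around the origin, the formula above is ordinary cross-correlation. A minimal single-channel NumPy sketch (our own names, with "valid" boundary handling as an assumption):

```python
import numpy as np

def correlate(f, psi):
    """(Psi * f)(x): evaluate the functional psi on f translated so x sits at 0.

    f   : (H, W) single-channel feature map
    psi : (k, k) weights of one linear functional supported near the origin
    """
    k = psi.shape[0]
    H, W = f.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            # pi(x-bar)^{-1} f centers the k x k patch at x = (i, j)
            out[i, j] = np.sum(psi * f[i:i + k, j:j + k])
    return out

f = np.arange(16.0).reshape(4, 4)
psi = np.ones((3, 3)) / 9.0     # an averaging functional
print(correlate(f, psi))        # 2x2 map of local means: [[5, 6], [9, 10]]
```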
Let $(\mathbb{R}^{K'}, \rho)$ be a representation of $D_4$ on the model fiber of $F'$. Our objective is to construct a representation $(F', \pi')$ such that if the filter bank $\Psi$ intertwines the dihedral group representations $(F, \pi)$ and $(\mathbb{R}^{K'}, \rho)$, then the convolutional neural network $\Phi$ intertwines the discrete motion group representations $(F, \pi)$ and $(F', \pi')$. First we need some algebra:
Lemma. Let $r$ be a rotation, $t$ a translation, and $\bar{x}$ the translation corresponding to the position $x \in \mathbb{R}^2$. Then

$$(tr)^{-1}\, \bar{x}\, r = \overline{(tr)^{-1} \cdot x}.$$

The proof is a straightforward application of the given matrix representation. This result leads to the following equivariance rule.
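The lemma is also easy to confirm numerically with the $3 \times 3$ matrix representation (a sketch with our own helper `hom`; recall that the affine action is $(tr)^{-1} \cdot x = R^{-1}(x - T)$):

```python
import numpy as np

def hom(R, T):
    """Homogeneous 3x3 matrix with rotation block R and translation T."""
    m = np.eye(3)
    m[:2, :2], m[:2, 2] = R, T
    return m

theta = np.pi / 2
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s], [s, c]])
T = np.array([2.0, 3.0])
x = np.array([1.0, 4.0])

r = hom(R, [0.0, 0.0])
t = hom(np.eye(2), T)
xbar = hom(np.eye(2), x)

lhs = np.linalg.inv(t @ r) @ xbar @ r
rhs = hom(np.eye(2), R.T @ (x - T))   # bar of (tr)^{-1}.x, using R^{-1} = R^T
print(np.allclose(lhs, rhs))          # True
```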
Proposition. Let $r$ be a rotation, $t$ a translation, and $\bar{x}$ the translation corresponding to the position $x \in \mathbb{Z}^2$. If $\Psi \pi(r) = \rho(r) \Psi$ for all $r \in D_4$, then

$$(\Psi * \pi(tr) f)(x) = \rho(r)\, (\Psi * f)\big((tr)^{-1} \cdot x\big).$$

Proof. By an identity trick, we can exploit our two descriptions of the same map to produce the equivariance law:
$$\begin{aligned}
(\Psi * \pi(tr) f)(x) &= \Psi\big(\pi(\bar{x})^{-1} \pi(tr) f\big) \\
&= \Psi\big(\pi(r)\, \pi(r)^{-1} \pi(\bar{x})^{-1} \pi(tr) f\big) \\
&= \rho(r)\, \Psi\big(\pi(r)^{-1} \pi(\bar{x})^{-1} \pi(tr) f\big) \\
&= \rho(r)\, \Psi\big(\pi(r^{-1} \bar{x}^{-1} t r) f\big) \\
&= \rho(r)\, \Psi\big(\pi\big((tr)^{-1} \bar{x}\, r\big)^{-1} f\big) \\
&= \rho(r)\, \Psi\big(\pi\big(\overline{(tr)^{-1} \cdot x}\big)^{-1} f\big) \\
&= \rho(r)\, (\Psi * f)\big((tr)^{-1} \cdot x\big).
\end{aligned}$$
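To see the proposition in action for $r$ a quarter-turn rotation: take a filter bank whose $K' = 4$ channels are the four rotations of one base filter, so that $\rho(r)$ cyclically permutes the channels. A NumPy sketch (our own construction, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 8))      # a single-channel feature map
base = rng.standard_normal((3, 3))   # one base filter

# Psi: K' = 4 channels, the four 90-degree rotations of the base filter.
# Then Psi pi(r) = rho(r) Psi, where rho(r) cyclically shifts channels.
psi = np.stack([np.rot90(base, k) for k in range(4)])

def correlate(f, psi):
    """Valid cross-correlation of f with each channel of psi."""
    K, k, _ = psi.shape
    H, W = f.shape
    out = np.zeros((K, H - k + 1, W - k + 1))
    for c in range(K):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[c, i, j] = np.sum(psi[c] * f[i:i + k, j:j + k])
    return out

lhs = correlate(np.rot90(f), psi)               # Psi * pi(r) f
rhs = np.rot90(correlate(f, psi), axes=(1, 2))  # rotate the output plane ...
rhs = np.roll(rhs, 1, axis=0)                   # ... then apply rho(r)
print(np.allclose(lhs, rhs))                    # True
```

Rotating the input and then filtering gives the same result as filtering first, then rotating the output plane and permuting the channels, which is exactly the equivariance law.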
With this calculation in mind, we are now in a position to define a representation of $G$ on $F'$:

$$\pi'(tr)(\Psi * f)(x) = \rho(r)\, (\Psi * f)\big((tr)^{-1} \cdot x\big).$$

To verify that this is a representation, one checks the relation $\pi'(t_1 r_1 t_2 r_2) = \pi'(t_1 r_1)\, \pi'(t_2 r_2)$ using the fact that the conjugate of a translation by a rotation is again a translation, together with the factorization
$$t_1 r_1 t_2 r_2 = t_1\, (r_1 t_2 r_1^{-1})\, r_1 r_2.$$

As a consequence, we get an intertwining property:
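The key fact, that conjugating a translation by a rotation gives another translation (shifted by the rotated vector), is easy to confirm with the matrix representation (again with our own helper names):

```python
import numpy as np

def hom(R, T):
    """Homogeneous 3x3 matrix with rotation block R and translation T."""
    m = np.eye(3)
    m[:2, :2], m[:2, 2] = R, T
    return m

c, s = np.cos(0.7), np.sin(0.7)
R1 = np.array([[c, -s], [s, c]])
r1 = hom(R1, [0.0, 0.0])
t2 = hom(np.eye(2), [5.0, -1.0])

conj = r1 @ t2 @ np.linalg.inv(r1)   # r1 t2 r1^{-1}
print(np.allclose(conj[:2, :2], np.eye(2)))        # True: a pure translation
print(np.allclose(conj[:2, 2], R1 @ [5.0, -1.0]))  # True: shifted by R1 T2
```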
Corollary. If $\Psi \pi(r) = \rho(r) \Psi$ for all $r \in D_4$, then $\Phi \pi(g) = \pi'(g) \Phi$ for all $g \in G$.
Whenever we can find representations $(F, \pi)$ and $(F', \pi')$ for which $\Phi$ is an intertwiner, we say that $\Phi$ is a steerable convolutional neural network.
To determine the homogeneous vector bundle of which $\Psi * f$ is a section, it is enough to calculate the action of $D_4$ on $(\Psi * f)(0)$:

$$\pi'(r)(\Psi * f)(0) = \rho(r)\, (\Psi * f)\big(r^{-1} \cdot 0\big) = \rho(r)\, (\Psi * f)(0).$$
This means that $\Psi * f \in \Gamma(G \times_{\rho} \mathbb{R}^{K'})$. From representation theory, we know that

$$\Gamma(G \times_{\rho} \mathbb{R}^{K'}) \cong \operatorname{Ind}_{D_4}^{G}(\rho),$$

so we may interpret $\Phi$ as a map into an induced representation of the discrete motion group. Moreover, we can treat $\Gamma(G \times_{D_4} \mathbb{R}^K)$ as an induced representation as well.
On $F$, the action of $D_4$ should only rotate the pixels of the image; there should be no linear transformations within fibers (i.e., no transformations of color channels). This means that $D_4$ acts trivially on the value of $f$ at the origin. Hence we are regarding $F$ as $\Gamma(G \times_{\rho_0} \mathbb{R}^K)$, where $\rho_0$ denotes the trivial representation of $D_4$ on $\mathbb{R}^K$. To summarize, $\Gamma(G \times_{\rho_0} \mathbb{R}^K) \cong \operatorname{Ind}_{D_4}^{G}(\rho_0) \cong C(\mathbb{Z}^2, \mathbb{R}^K)$. We can now restate this picture in a representation-theoretic context.
Theorem. Let $(\mathbb{R}^K, \rho_0)$ be the trivial representation of $D_4$ and $\operatorname{Ind}_{D_4}^{G}(\rho_0)$ the corresponding space of feature maps. Let $\Psi : \operatorname{Ind}_{D_4}^{G}(\rho_0) \to \mathbb{R}^{K'}$ be a $D_4$-equivariant filter bank with respect to $(\mathbb{R}^{K'}, \rho)$. Then $\Phi : \operatorname{Ind}_{D_4}^{G}(\rho_0) \to \operatorname{Ind}_{D_4}^{G}(\rho)$ is a steerable convolutional neural network.