# Linear AlgebraOrthogonality

The **orthogonal complement** of a vector space is the set of vectors in which are orthogonal to every vector in . For example, the orthogonal complement a two-dimensional subspace of is the

**Exercise**

The orthogonal complement of the span of the columns of a matrix is equal to

*Solution.* The correct answer is the null space of

Furthermore, if is in the null space of , then it is orthogonal to any linear combination of the columns of , since

Therefore, orthogonal complement of the span of the columns of a matrix is equal to the null space of .

## Orthogonal decomposition

For any vectors and in , it is always possible to write as the sum of a multiple of and a vector which is perpendicular to :

**Exercise** (Orthogonal decomposition)

Suppose that and are nonzero vectors in . Solve the equation

for to find the multiple of which is perpendicular to its difference with .

*Solution.* The equation gives , which implies that

If is equal to where is perpendicular to , then we call the **projection** of onto . As the geometric intuition suggests, the projection of onto is the closest vector to among all vectors parallel to .

Orthogonal bases can be much more *orthogonalize* a basis without changing its span:

**Theorem** (Gram-Schmidt)

Every vector space has an orthogonal basis.

*Proof.* Suppose that is a basis of . We will build our orthogonal basis by orthogonalizing one vector at a time.

Define , and then define to be the part of which is orthogonal to :

Similarly, we project onto and onto and subtract off both of these projections:

Then has the same span as and is pairwise orthogonal. We may continue this procedure (project each new onto each of the preceding 's and subtract off all of these projections) until we reach the end of the list, thereby obtaining an orthogonal basis of .

**Theorem**

Suppose is a vector space. Then every vector can be written as the sum of a vector in and a vector in .

*Proof.* Consider an orthogonal basis of , and define

Then is in and is

**Exercise**

Suppose that is a -dimensional vector space in . Show that there is a basis of whose first vectors form a basis for and whose last vectors form a basis for .

*Solution.* Suppose that is a basis for and is a basis for . We claim that

is a basis for .

First, it's linearly independent because no vector is in the span of the preceding vectors: (1) the 's are linearly

by , we find that , which implies that . Thus , which is not compatible with the fact that the 's form a

Therefore, we conclude that is linearly independent.

Finally, the list spans since every vector in can be written as a sum of a vector in and a vector in .

## Orthogonal matrices

Suppose we can write a given transformation as a composition involving (i) a single transformation which scales space along the coordinate axes, and (ii) some other transformations which preserve distances and angles—like rotations and reflections in or . Such a decomposition of would be useful because it isolates the space-distorting behavior of in the simple transformation . In preparation for developing such a decomposition, let's study transformations which are distance-preserving and angle-preserving.

A transformation from to is distance-preserving if the norm of is the same as the norm of for all . Using dot products, we can write the distance-preserving condition as

If the transformation preserves angles as well as distances, then must also be equal to for all and in . Rewriting this equation using transposes, we see that we want

for all and in . This identity only holds if is equal to the identity matrix. This leads us to the following definition.

**Definition** (Orthogonal matrix)

A square matrix is **orthogonal** if is equal to the identity matrix.

Equivalently, we can say that a square matrix is orthogonal if and only if its columns are *orthonormal*, which means that they are orthogonal and have unit norm. If a non-square matrix satisfies , then we refer to as a *matrix with orthonormal columns*.

**Exercise**

(i) Explain why, for an matrix with orthonormal columns, we must have . (ii) Explain why the rank of is .

*Solution.*

We first recall that the number of linearly independent columns in cannot be greater than because the range of is a subspace of Now, by definition, the columns of are orthogonal and non-zero. These columns must be linearly independent, so

The rank of is equal to the number of linearly independent columns in which is in this case.

If is an matrix with orthonormal columns and if , then is an matrix of rank and therefore cannot be the identity matrix. In fact, is a projection matrix:

**Exercise**

Show that if is an matrix with orthonormal columns, then is the matrix of the transformation which projects each vector in onto the -dimensional subspace of spanned by the columns of .

*Solution.* The transformation which maps a vector onto the span of the columns of is given by

All of the denominators in this formula are equal to 1 because the columns of are unit vectors. So

The vector whose components are the expressions in parentheses, namely , is equal to , by the definition of the matrix-vector product. Applying that definition a second time (interpreting as a linear combination of the 's with weights given by the parenthetical dot products), we find that .

**Exercise**

Let be a vector in , and consider the linear transformation defined as . What is the rank of ? Geometrically describe the null space of .

*Solution.* The rank of is if otherwise the rank is Geometrically, the null space of is the set of vectors in that are orthogonal to

## An application: linear regression

One of the most common methods in statistics is *linear regression*. Given columns of numerical data, we seek a linear combination of the first columns which gets as close as possible to the last column. This can be helpful if the last column contains values you want to predict and the other columns contain data which is accessible at the time of prediction. For example, the last column might contain points scored by a given player in a Game 7 of a playoff series, while the previous columns contain the number of points scored by that player in the first 6 games of that series.

Suppose that the first columns are arranged into a matrix and the last column is called . Since

which implies that , assuming is invertible. This intuition is accurate, and the equation we derived is called the **normal equation**.

**Exercise**

Use the code below to build a random 100 × 6 matrix whose first five columns are linearly dependent and whose sixth column is not in the span of the first five. Use the normal equation to try to solve for the weights of the linear combination of the first five columns which gets closest to the sixth column. What goes wrong?

Note: `np.linalg.solve(A,b)`

`A \ b`

solves the equation for .

import numpy as np A = np.random.randint(0,2,(100,6)) A[:,4] = A[:,3] + A[:,2]

A = rand(0:1, 100, 6) A[:, 5] = A[:, 4] + A[:, 3]

*Solution* We try

b = A[:,5] A = A[:,:-1] np.linalg.solve(A.T @ A, A.T @ b)

b = A[:,5] A = A[:,1:end-1] (A' * A) \ (A' * b)

and we get an error telling us that is not invertible. This makes sense, because has the same rank as , and we know is

**Exercise**

Try the previous exercise again, but this time with the linear dependence relation holding only approximately. What goes wrong this time?

import numpy as np A = 1.0*np.random.randint(0,2,(100,5)) b = np.random.randint(0,2,(100,)) A[:,4] = A[:,3] + A[:,2] + 1e-3*np.random.standard_normal(100)

A = 1.0*rand(0:1, 100, 5) b = 1.0*rand(0:1, 100) A[:,4] = A[:,3] + A[:,2] + 1e-3*randn(100)

*Solution.* We take a look at the solution:

plt.bar(range(5),np.linalg.solve(A.T @ A, A.T @ b))

using Plots bar(1:5,(A' * A) \ (A' * b), label = "solution components")

We see that it gives large and oppositely-signed coefficients for the last three vectors. We can tell that the optimization process is leveraging the tiny difference between the last vector and the sum of the two before it to "reach" in the direction of . Although we did not get a singularity error this time, the result is no less undesirable, because predictions which depend on tiny differences between measured values are clearly not going to be useful. We will see what we can do about this problem when we develop the *singular value decomposition*.