2024-12-22

Math 51: Methods in Several Variables

Intro

Stanford is a bit strange in its coverage of linear algebra and multivariable calculus. Math 51 covers what you would find in a typical introductory multivariable calculus class, but only the derivative/gradient side of things, along with a brief introduction to the usefulness of matrix decompositions and eigenvalues. The course is largely based on memorizing methods and performing computational routines. I didn't find the course that enlightening and honestly wished it followed a more traditional route, like that of Gilbert Strang's 18.06.

Overall Stats

Grade breakdown:

  • 20%: Midterm 1
  • 20%: Midterm 2
  • 30%: Final exam
  • 10%: PCRQs (Pre-class Reading Questionnaires)
  • 20%: Problem sets

Here the PCRQ and problem set marks are divided by 0.8: e.g. if I get a total of 75% on the HW section I'll actually end up with $\frac{0.75}{0.8} = 0.9375$.

Stats for Fall Quarter 2024:

  • Midterm 1: Median 45.5/60, SD 11, Mean 44/60
  • Midterm 2: Median 50.5/60, SD 8.41, Mean 48.5/60
  • Final exam: Median 67/90, SD 15, Mean 64/90

The curve in my quarter was only about 2.5-3%, since fall quarter tends to be quite brutal. Most kids taking Math 51 in fall already took linear algebra and multivariable calculus in high school, so exam scores tend to be higher on average. I've heard that the curve in other quarters is more like 8-10%, enough to move up a full letter grade, and I'll add stats from friends who are taking Math 51 in the upcoming quarters when they're available.

Meta-Level Stuff

The biggest thing I learned is that I should have shopped around when it came to TAs and sections. On the very last discussion section day before the final exam I dropped in on the discussion section of a dude named Pino, which my friend recommended, and it was brilliant and amazingly well taught. This is in contrast to my own section leader, who sort of held a quiet session for an hour where we'd just do homework (which is a bit of a waste of time, in my opinion).

I made it out of Math 51 with a decent grade after skipping pretty much every single lecture and section. But my study schedule for math was pretty bad and I tended to leave things to the last minute, which was cognitively draining and required a lot of work.

In an attempt to iterate and learn from mistakes, I'm going to try this study regime:

  1. Two days before lecture, read textbook in depth
  2. Actually go to lecture, and do a couple simple example problems the same day as lecture
  3. Do associated pset problems 2 days later

This way I naturally bake in a spaced repetition of about 2 days or so, 3 times, so hopefully the material sticks better. I'll see if this works better for me in Math 52.

Approximation

Let's define the derivative matrix of a function $\textbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m$ at a point $\textbf{a}$, which we'll denote with $(D\textbf{f})(\textbf{a})$, to be the $m \times n$ matrix:

$$(D\textbf{f})(\textbf{a}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(\mathbf{a}) & \frac{\partial f_1}{\partial x_2}(\mathbf{a}) & \cdots & \frac{\partial f_1}{\partial x_n}(\mathbf{a}) \\ \frac{\partial f_2}{\partial x_1}(\mathbf{a}) & \frac{\partial f_2}{\partial x_2}(\mathbf{a}) & \cdots & \frac{\partial f_2}{\partial x_n}(\mathbf{a}) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(\mathbf{a}) & \frac{\partial f_m}{\partial x_2}(\mathbf{a}) & \cdots & \frac{\partial f_m}{\partial x_n}(\mathbf{a}) \end{bmatrix}$$

Using this, we can define the linear approximation for a point near $\textbf{a}$ to be:

$$\textbf{f}(\textbf{a} + \textbf{h}) \approx \textbf{f}(\textbf{a}) + ((D\textbf{f})(\textbf{a}))\textbf{h}$$

The mathematical intuition here is that $\textbf{h}$ is an $n$-vector that gives our multivariable "steps", so multiplying the derivative matrix by $\textbf{h}$ gives the approximate change in $\textbf{f}$ produced by taking those steps.
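To make this concrete, here's a small numpy sketch (the function and the point $\textbf{a}$ are my own made-up example, not from the course): build the derivative matrix from the partials by hand, then compare the linear approximation against the true value for a small step $\textbf{h}$.

```python
import numpy as np

# Example map f: R^2 -> R^2, f(x, y) = (x^2 * y, sin(x) + y)
def f(v):
    x, y = v
    return np.array([x**2 * y, np.sin(x) + y])

# Its derivative (Jacobian) matrix, written out from the partial derivatives
def Df(v):
    x, y = v
    return np.array([[2 * x * y, x**2],
                     [np.cos(x), 1.0]])

a = np.array([1.0, 2.0])
h = np.array([0.01, -0.02])   # a small step

exact  = f(a + h)
approx = f(a) + Df(a) @ h     # f(a + h) ≈ f(a) + (Df)(a) h

print(exact, approx)          # the two agree to roughly 1e-4
```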

In multiple variables we define differentiability of $\textbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m$ at $\textbf{a} \in \mathbb{R}^n$ to be the existence of an $m \times n$ matrix $L$ so that as $\textbf{h}$ approaches $\textbf{0}$ we have:

$$\frac{\|\textbf{f}(\textbf{a}+\textbf{h}) - (\textbf{f}(\textbf{a}) + L\textbf{h})\|}{\|\textbf{h}\|} \rightarrow 0$$

Here, if the partial derivatives exist, then so does the derivative matrix, and for a differentiable function $L$ is just $(D\textbf{f})(\textbf{a})$.

Again, the intuition for this is similar to the single variable definition of differentiability - that is, if you zoom in on the function's surface enough, the function will exhibit some quality of "smoothness" that allows it to be differentiable.

While these seem like very basic initial applications, I've chosen to include them because they tie together how everything in single variable calculus can easily be extended into multiple variables just using the language of matrices and vectors. Take the definition of differentiability: if one sets $m, n = 1$, then the definition just becomes the single variable definition of differentiability.

Matrices as Transformation

A nice mathematical intuition for what matrices do is encode some linear transformation of a vector. This also means that matrices encode functions, since they take an input vector and give some output vector. Take some matrix $M$:

$$M = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

When we multiply $M$ by $\hat{i}$ and $\hat{j}$ we get:

$$M\hat{i} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \text{ and } M\hat{j} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

This matrix is literally transforming $\hat{i}$ and $\hat{j}$ to swap, in a sense, resulting in a reflection across the line $y = x$.

In general, for an $n \times n$ square matrix $M$, one can think of each column of the matrix as encoding where each respective basis vector $\textbf{e}_1, \textbf{e}_2, ..., \textbf{e}_n$ ends up after applying the transformation encoded by $M$.

We see this with more clarity when we consider how matrix multiplication works for some generic vector $\textbf{v} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \in \mathbb{R}^3$:

$$A\textbf{v} = \begin{bmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & y_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} x_1y_{11} + x_2y_{12} + x_3y_{13} \\ x_1y_{21} + x_2y_{22} + x_3y_{23} \\ x_1y_{31} + x_2y_{32} + x_3y_{33} \end{bmatrix} = x_1\begin{bmatrix} y_{11} \\ y_{21} \\ y_{31} \end{bmatrix} + x_2\begin{bmatrix} y_{12} \\ y_{22} \\ y_{32} \end{bmatrix} + x_3\begin{bmatrix} y_{13} \\ y_{23} \\ y_{33} \end{bmatrix}$$

The matrix $A$ is literally encoding a linear combination of its columns, with scalar coefficients given by the entries of $\textbf{v}$. This extends to $\mathbb{R}^n$. Consider what happens when we have the basis vectors:

$$\textbf{e}_{1}= \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \textbf{e}_{2}=\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \textbf{e}_{3}=\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

Setting $\textbf{v}$ to any of the above $\textbf{e}_{j}$ causes every column vector in the linear combination to disappear except for the $j$th one - if we plug in the $j$th basis vector we obtain the $j$th column of $A$, so that column records where the $j$th basis vector ends up after applying the function represented by $A$.
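Here's a quick numpy check of this column picture (the matrix is just an arbitrary example I made up): multiplying by $\textbf{e}_j$ picks out the $j$th column, and $A\textbf{v}$ is the matching linear combination of columns.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

e2 = np.array([0.0, 1.0, 0.0])               # second standard basis vector
print(A @ e2)                                 # -> [2. 5. 8.], the second column of A

v = np.array([2.0, -1.0, 3.0])
as_combination = 2.0 * A[:, 0] - 1.0 * A[:, 1] + 3.0 * A[:, 2]
print(np.allclose(A @ v, as_combination))     # True: Av is a combination of the columns
```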

We can also multiply matrices with each other to encode "compound" transformations, analogous to function compositions.

To illustrate compound matrix transformations, consider two rotation matrices:

  1. 90° Clockwise Rotation Matrix:
$$R_{\text{CW}} = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}$$
  2. 90° Counterclockwise Rotation Matrix:
$$R_{\text{CCW}} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$$

When we multiply these matrices, we get:

$$R_{\text{CW}} \cdot R_{\text{CCW}} = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \cdot \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$$

Performing the multiplication:

$$R_{\text{CW}} \cdot R_{\text{CCW}} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

Thus, the product of a 90° clockwise rotation and a 90° counterclockwise rotation results in the identity matrix, which makes sense (rotating a vector clockwise and then counterclockwise by 90 degrees should do nothing to that vector):

$$R_{\text{CW}} \cdot R_{\text{CCW}} = I$$

This shows that the two transformations cancel out, resulting in the identity matrix.
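The same cancellation is easy to verify numerically (a minimal sketch):

```python
import numpy as np

R_cw  = np.array([[0.0, 1.0], [-1.0, 0.0]])   # 90° clockwise rotation
R_ccw = np.array([[0.0, -1.0], [1.0, 0.0]])   # 90° counterclockwise rotation

print(R_cw @ R_ccw)                            # the 2x2 identity matrix
print(np.allclose(R_cw @ R_ccw, np.eye(2)))    # True
```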

Friendship Networks

We can consider a matrix that encodes friendships, call it $F$ - specifically, we say that the $ij$th entry is $1$ if $i$ and $j$ are friends, and $0$ otherwise. We make two assumptions:

  1. Friendship is reciprocal, so $F$ must be symmetric
  2. You cannot be friends with yourself, so the diagonal of $F$ is all zeros

It turns out that the $ij$th entry of $FF = F^2$ is the number of common friends between $i$ and $j$. Why? We can give a formula for the $ij$th entry of $F^2$:

$$ij\text{th entry of } F^2 = \sum_{k = 1}^n F_{ik} F_{kj}$$

Notice that $F_{ik}F_{kj}$ only equals $1$ if both $F_{ik}$ and $F_{kj}$ are $1$, or in other words person $k$ is a friend of both person $i$ and person $j$. So the total summation gives the number of people who are friends with $i$ and friends with $j$!
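A small sketch with a made-up friendship matrix (people 0 through 3), checking that the entries of $F^2$ count common friends:

```python
import numpy as np

# F[i, j] = 1 if i and j are friends; symmetric with a zero diagonal
F = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])

F2 = F @ F
# Person 0 and person 3 share exactly one friend (person 1):
print(F2[0, 3])        # 1
# As a bonus, the diagonal of F^2 counts each person's friends:
print(np.diag(F2))     # [2 3 2 1]
```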

I thought this result was interesting, if not really all that practical. I doubt that in real life any serious social media company is storing its user graph in a dense matrix, since 99.999% of that matrix would be empty and it would have several billion rows and columns. Instead, similar functionality can be obtained just with intersection operations and SQL lookups.

Span and linear dependence

Span is a simple concept but it's so important to intuition in linear algebra that I include it here to help ensure I never forget it. We define the span of some collection of vectors $\textbf{v}_1, \textbf{v}_2, ... \textbf{v}_n \in \mathbb{R}^n$ to be the collection of all vectors in $\mathbb{R}^n$ that we can make through some linear combination of the collection of vectors. In formal logic, the span is something like:

$$\textbf{x} \in \text{span}(\textbf{v}_{1},...,\textbf{v}_{n}) \iff \exists c_{1},..., c_{n} \in \mathbb{R}.\ \textbf{x} = c_{1}\textbf{v}_{1} + ... + c_n\textbf{v}_n$$

So intuitively, the vectors

$$\textbf{v} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \text{ and } \textbf{w} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$$

span the $xy$ plane in $\mathbb{R}^3$, since we can make any vector in the $xy$ plane by linearly combining $\textbf{v}$ and $\textbf{w}$. Similarly, the vectors

$$\textbf{v} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \text{ and } \textbf{w} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \text{ and } \textbf{u} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

span all of $\mathbb{R}^3$, since we can linearly combine $\textbf{v}$, $\textbf{w}$, and $\textbf{u}$ to make any vector in $\mathbb{R}^3$.
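One way to check span membership numerically (a sketch I wrote for myself, not something the course asked for): a vector lies in the span of a collection exactly when stacking it alongside them doesn't increase the rank.

```python
import numpy as np

v = np.array([1.0, 0.0, 0.0])
w = np.array([0.0, 1.0, 0.0])

def in_span(x, vectors):
    """x is in span(vectors) iff adding x as a column doesn't raise the rank."""
    base_rank = np.linalg.matrix_rank(np.column_stack(vectors))
    new_rank = np.linalg.matrix_rank(np.column_stack(vectors + [x]))
    return new_rank == base_rank

print(in_span(np.array([3.0, -2.0, 0.0]), [v, w]))   # True: lies in the xy plane
print(in_span(np.array([0.0, 0.0, 1.0]), [v, w]))    # False: has a z-component
```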

Matrix Spaces

Again, simple concepts, but so important to reasoning intuitively about matrices that I include them here.

Column Space:

We define the column space to be the span of the column vectors of some matrix $M$. So if we define $M$ as:

$$M = \begin{bmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & y_{33} \end{bmatrix}$$

then the column space, which we denote with $C(M)$, is just the span of the vectors:

$$\begin{bmatrix} y_{11} \\ y_{21} \\ y_{31} \end{bmatrix}, \begin{bmatrix} y_{12} \\ y_{22} \\ y_{32} \end{bmatrix}, \begin{bmatrix} y_{13} \\ y_{23} \\ y_{33} \end{bmatrix}$$

For any square matrix $M$, since we know from earlier that $M$ really just encodes a linear combination of its columns, the system $M\textbf{x} = \textbf{b}$ has a solution only when $\textbf{b}$ lies in the column space of $M$. If $\textbf{b}$ isn't in the column space of $M$, then there's no way to linearly combine the columns of $M$ to create $\textbf{b}$. This is the intuitive reason why some linear systems lack a solution.

Null Space:

Null spaces are useful since if the null space of some square matrix contains any vector other than $\textbf{0}$ (the zero vector), then it turns out that if $A\textbf{x} = \textbf{b}$ has a solution, it has infinitely many solutions.

We denote the null space of some matrix $A$ as $N(A)$, and we define it to be the set of all solutions in $\mathbb{R}^n$ to the system $A\textbf{x} = \textbf{0}$.

Say we find some solution $\textbf{x}$ to the system $A\textbf{x} = \textbf{b}$, where $\textbf{b} \neq \textbf{0}$, and we know that the system $A\textbf{x} = \textbf{0}$ has some solution $\textbf{v} \neq \textbf{0}$. Then $A\textbf{x} = \textbf{b}$ actually has infinitely many solutions, since $A(\textbf{x} + \textbf{v}) = A\textbf{x} + A\textbf{v} = A\textbf{x} + \textbf{0} = \textbf{b}$. By the properties of a linear subspace, the null space contains infinitely many vectors (in this case, any scalar multiple of $\textbf{v}$), so to obtain another solution we simply take $\textbf{x}$ and add any vector in the null space of $A$.
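Here's a numpy sketch of that argument (the matrix and right-hand side are made up): grab a nonzero null space vector from the SVD and check that adding any multiple of it to a particular solution still solves the system.

```python
import numpy as np

# A singular matrix: the third column is the sum of the first two
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])   # b is in C(A), so a solution exists

# A particular solution (least squares, exact here) and a null-space vector from the SVD
x_particular, *_ = np.linalg.lstsq(A, b, rcond=None)
_, _, Vt = np.linalg.svd(A)
v_null = Vt[-1]                  # right singular vector for the zero singular value

print(np.allclose(A @ v_null, 0))                            # True
for t in (0.0, 1.0, -3.5):
    print(np.allclose(A @ (x_particular + t * v_null), b))   # True each time
```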

Over and Underdetermined Systems

Consider some system $A\textbf{x} = \textbf{b}$. We call the system overdetermined if there are more associated equations than unknowns (so if $A$ is an $m \times n$ matrix, $m > n$). Intuitively it is often (but not always) the case that there are no solutions, since there are fewer columns than rows, so the column space doesn't cover all of $\mathbb{R}^m$ and $\textbf{b}$ may well fall outside it.

On the other hand, we say a system is underdetermined if there are more unknowns than equations (so $m < n$). Intuitively these systems often (but not always) have infinitely many solutions, since the null space contains some nonzero vector (there are more columns than rows, so some column must be "redundant", i.e. not linearly independent of the others).

These are a nice "dipstick" test for some given linear system, even if they aren't definitive truths.
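A quick sketch of both situations with made-up numbers:

```python
import numpy as np

# Overdetermined: 3 equations, 2 unknowns -- usually no exact solution
A_over = np.array([[1.0, 1.0],
                   [1.0, 2.0],
                   [1.0, 3.0]])
b_over = np.array([1.0, 2.0, 5.0])
x, residual, rank, _ = np.linalg.lstsq(A_over, b_over, rcond=None)
print(residual)   # nonzero: b_over is not in the column space of A_over

# Underdetermined: 2 equations, 3 unknowns -- a whole family of solutions
A_under = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])
b_under = np.array([1.0, 2.0])
x0 = np.linalg.lstsq(A_under, b_under, rcond=None)[0]
null_vec = np.linalg.svd(A_under)[2][-1]     # a nonzero null-space direction
print(np.allclose(A_under @ (x0 + 7.0 * null_vec), b_under))   # True
```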

Orthogonal Matrices

An orthogonal matrix $A$ is defined as a square matrix whose columns form an orthonormal basis. For orthogonal matrices, $A^{T} = A^{-1}$.

QR and LU Decompositions

I skip over the details of QR and LU decompositions and finding them since it's pretty tedious, mechanical, and easy to pick up.

To be brief, "most" (but not all) $n \times n$ matrices $A$ have the form $A = LU$ for an $n \times n$ lower triangular $L$ and an $n \times n$ upper triangular $U$. This is helpful since systems that can be decomposed into this form can be easily solved:

$$A\textbf{x} = \textbf{b} \implies LU\textbf{x} = \textbf{b}$$

Then all we have to do is 1) solve $L\textbf{y} = \textbf{b}$ and then 2) solve $U\textbf{x} = \textbf{y}$, both of which are quick because the matrices are triangular.
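A short sketch of this two-step solve using scipy (with an arbitrary example matrix; note that scipy's `lu` actually returns a permuted factorization $A = PLU$ for numerical stability, which the "most matrices have $A = LU$" statement glosses over):

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.array([[2.0, 1.0, 1.0],
              [4.0, -6.0, 0.0],
              [-2.0, 7.0, 2.0]])
b = np.array([5.0, -2.0, 9.0])

P, L, U = lu(A)                                 # A = P L U
y = solve_triangular(L, P.T @ b, lower=True)    # 1) solve L y = P^T b
x = solve_triangular(U, y)                      # 2) solve U x = y

print(np.allclose(A @ x, b))   # True
```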

Invertible matrices can also be decomposed in the form $A = QR$, where $Q$ is an $n \times n$ orthogonal matrix and $R$ is an $n \times n$ upper triangular matrix. Since $Q$ is orthogonal, $Q^{-1}$ is just $Q^T$, so we solve $A\textbf{x} = \textbf{b}$ by simply noting that $A^{-1} = R^{-1}Q^T$ ($R^{-1}$ is easy to find mechanically because $R$ is upper triangular).

What's a lot more interesting than computing these decompositions is what we can do with them, or more generally what solving linear algebraic systems lets us do. The most interesting application to me was using these decompositions to find approximate solutions that model some function or phenomenon.

This applies to me with regard to electrical engineering, where I might need to find a small number of constants in a model that fits a lot of data. Another way to say this is that I may need to approximate an unknown function by a linear combination of a small number of known functions, and I want to find the coefficients of that linear combination. This also, as one might imagine, comes up all the time in data science, physics, or any discipline where decent mathematical models are necessary.

To give a better intuition for what we're actually doing, pretend we're conducting some signal processing experiment where we're measuring an unknown signal, $f : [0, T] \rightarrow \mathbb{R}$, over a time interval of length $T$. If we have some known basic signals, like simple sine, cosine, saw, etc., $f_1, ... , f_s : [0, T] \rightarrow \mathbb{R}$, then we want the linear combination of these known signals that best approximates $f$ (which we can only measure at finitely many times) over the entire interval $[0, T]$.

Mathematically, we might want to approximate some unknown function $f: R \rightarrow \mathbb{R}$ on some region $R$ (which could be a surface in space or a time interval). The approximation will be a linear combination of some known functions $f_1, ..., f_n: R \rightarrow \mathbb{R}$, and have the form $c_1f_1 + ... + c_nf_n$. All our data comes from "experiments", where we measure the values of $f$ at specific points or times $r_1, ..., r_m \in R$.

Let $A$ be the $m \times n$ matrix of values of the $f_j$'s at the points $r_1, ..., r_m \in R$. Let $\textbf{b}$ be the vector whose $i$th entry $b_i$ is the measured value of $f$ at the point $r_i$ in the experiment. Then the "best approximate solution" to the system $A\textbf{x} = \textbf{b}$ will give us the values $c_1, ..., c_n$ we need for our function approximation.

The best approximate (least-squares) solution is $\textbf{x} = (A^TA)^{-1} A^T\textbf{b}$, provided $A^TA$ is invertible. With a QR decomposition $A = QR$ (where $Q$ has orthonormal columns and $R$ is square upper triangular), this simplifies to $\textbf{x} = R^{-1}Q^T\textbf{b}$, and the entries of $\textbf{x}$ give the scalar coefficients we need.
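Here's a sketch of that fitting recipe with made-up data: approximate noisy samples of an "unknown" signal by a combination of two known basis functions, solving the least-squares system via QR.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Experiment": sample an unknown signal at m times. Here I fake it as
# 2*sin(t) + 0.5*cos(t) plus noise -- in practice you'd only have the samples.
t = np.linspace(0.0, 10.0, 50)
b = 2.0 * np.sin(t) + 0.5 * np.cos(t) + 0.05 * rng.standard_normal(t.size)

# Columns of A are the known basis signals evaluated at the sample times
A = np.column_stack([np.sin(t), np.cos(t)])

# Least-squares coefficients via QR: x = R^{-1} Q^T b
Q, R = np.linalg.qr(A)                 # reduced QR: Q is 50x2, R is 2x2
x = np.linalg.solve(R, Q.T @ b)

print(x)                                                       # close to [2.0, 0.5]
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))    # True
```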

Spectral Theorem and Applications

The spectral theorem is as follows:

Theorem: Let $A$ be a symmetric $n \times n$ matrix. Then there is an orthogonal basis $\textbf{w}_1, ..., \textbf{w}_n$ of $\mathbb{R}^n$ consisting of eigenvectors of $A$.

There are a few applications of this that I thought were kinda neat.

Finding Quadratic Forms

For any symmetric matrix $A$, we can now write its quadratic form in terms of eigenvalues. We call this the diagonalization formula: for $\textbf{v} = t_1\textbf{w}_1 + ... + t_n\textbf{w}_n$,

$$q_A(\textbf{v}) = \sum_{i=1}^n\lambda_i( \textbf{w}_i\cdot \textbf{w}_i)t_i^2$$

This gives a quadratic form with no cross-terms (i.e. no terms with $xy$ or $yz$), which becomes extremely useful since it lets us determine the geometry of the function encoded by $A$.
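A quick numpy check of the diagonalization formula (the symmetric matrix here is a made-up example): in the eigenvector coordinates $t_i$, the quadratic form really is just a weighted sum of squares.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # symmetric, so the spectral theorem applies

lam, W = np.linalg.eigh(A)            # eigenvalues and orthonormal eigenvectors (columns)

v = np.array([3.0, -1.0])
t = W.T @ v                           # coordinates of v in the eigenvector basis

q_direct   = v @ A @ v                # q_A(v) computed directly
q_diagonal = np.sum(lam * t**2)       # sum of lambda_i * t_i^2; the (w_i . w_i) factors
                                      # are 1 because eigh returns unit eigenvectors
print(np.isclose(q_direct, q_diagonal))   # True
```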

Definite-ness of Matrices

Let $q$ be the quadratic form of a symmetric matrix $A$. We say $A$ represents a positive-definite, negative-definite, or indefinite quadratic form if:

  • Positive-definite: $q(\mathbf{x}) > 0$ for all nonzero $\mathbf{x}$.
  • Negative-definite: $q(\mathbf{x}) < 0$ for all nonzero $\mathbf{x}$.
  • Indefinite: $q(\mathbf{x})$ takes both positive and negative values depending on $\mathbf{x}$.

Quadratic forms often include cross terms like $xy$ or $yz$, which introduce ambiguity in interpreting the geometry of $q(\mathbf{x})$. For example:

$$q(x, y) = ax^2 + by^2 + cxy$$

The term $cxy$ complicates the identification of the axes of symmetry, since it's not readily apparent whether $q$ is always positive, always negative, or sometimes each.

By using diagonalization we can express the above quadratic form in terms of the eigenvalues of $A$:

$$q(x, y) = q_A(\textbf{v}) = \sum_{i=1}^n\lambda_i( \textbf{w}_i\cdot \textbf{w}_i)t_i^2$$

We see pretty easily that if all $\lambda_i$ are positive, then the quadratic form is positive for every input vector $\textbf{x} \neq \textbf{0}$, so $A$ is positive-definite. Similarly, if all $\lambda_i$ are negative, then the quadratic form is always negative and $A$ is negative-definite. And if some eigenvalues are positive while others are negative, the form takes both signs, so $A$ is indefinite.
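In code that test is just a sign check on the eigenvalues (a small sketch; the matrices are arbitrary examples):

```python
import numpy as np

def classify(A):
    """Classify the quadratic form of a symmetric matrix by its eigenvalue signs."""
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > 0):
        return "positive-definite"
    if np.all(lam < 0):
        return "negative-definite"
    if np.any(lam > 0) and np.any(lam < 0):
        return "indefinite"
    return "semi-definite (some eigenvalue is zero)"

print(classify(np.array([[2.0, 1.0], [1.0, 2.0]])))    # positive-definite
print(classify(np.array([[1.0, 3.0], [3.0, 1.0]])))    # indefinite (eigenvalues 4 and -2)
print(classify(np.array([[-2.0, 0.0], [0.0, -1.0]])))  # negative-definite
```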

Spectral Theorem Applications to Exponentiation

Let $A$ be a symmetric $n \times n$ matrix with orthogonal eigenvectors $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_n$ and corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$. Let $W$ be the $n \times n$ matrix whose columns are the respective unit eigenvectors

$$\frac{\mathbf{w}_1}{\|\mathbf{w}_1\|}, \frac{\mathbf{w}_2}{\|\mathbf{w}_2\|}, \ldots, \frac{\mathbf{w}_n}{\|\mathbf{w}_n\|}$$

Then $W^\top = W^{-1}$ (i.e., $W$ is an orthogonal matrix, as discussed above), and $A = W D W^\top = W D W^{-1}$ for the diagonal matrix

$$D = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}$$

whose entries are the corresponding eigenvalues.

We can see how this is useful for matrix powers through the following:

$$A^2 = WDW^TWDW^T = WD^2W^T = WD^2W^{-1}$$

(Because $W$ is orthogonal, $W^{-1} = W^T$.)

$$A^3 = (WDW^{-1}) (WDW^{-1}) (WDW^{-1}) = WD^3W^{-1}$$

In general, $A^m = WD^mW^{-1}$, and since $D$ is diagonal it's easy to compute $D^m$. It's a fun little trick to work out the end behaviour of things like Markov matrices by hand.
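A numpy sketch of the trick (arbitrary symmetric matrix): build $W$ and $D$ with `eigh`, then compare $WD^mW^T$ against repeated multiplication.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])              # symmetric

lam, W = np.linalg.eigh(A)              # A = W diag(lam) W^T with orthonormal W
m = 10

A_power_fast = W @ np.diag(lam**m) @ W.T        # W D^m W^T: just exponentiate the diagonal
A_power_slow = np.linalg.matrix_power(A, m)     # repeated multiplication, for comparison

print(np.allclose(A_power_fast, A_power_slow))  # True
```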

Hessians And Local Extrema

Remember the Taylor series from single variable calculus? Or maybe you don't want to remember. Whatever the case, it serves as a refinement of a simple linear approximation, "evolving" it into the quadratic approximation, and it works in multiple variables too! Given some $f: \mathbb{R}^n \rightarrow \mathbb{R}$, we can approximate $f$ near some point $\textbf{a} \in \mathbb{R}^n$ with:

$$f(\textbf{a}+\textbf{h}) \approx f(\textbf{a}) + (\nabla f)(\textbf{a}) \cdot \textbf{h} + \frac{1}{2}\textbf{h}^T((Hf)(\textbf{a}))\textbf{h}$$

for small $\textbf{h}$. This is the "quadratic approximation" for $f$.

When we think about local extrema, they always occur at critical points where the gradient vanishes, leaving us with just:

$$f(\textbf{a}) + \frac{1}{2}\textbf{h}^T((Hf)(\textbf{a}))\textbf{h}$$

In this sense the quadratic form of the Hessian is extremely useful, since it basically gives away the geometry of $f$ around $\textbf{a}$. The idea is that for $\textbf{h}$ near $\textbf{0}$, $f(\textbf{a} + \textbf{h}) \approx f(\textbf{a}) + \frac{1}{2}q_{(Hf)(\textbf{a})}(\textbf{h})$, so we're approximating how $f$ behaves around $\textbf{a}$. If the quadratic form of the Hessian is positive or negative definite, then we know for sure that near $\textbf{a}$ all values of $f$ are either greater than or less than $f(\textbf{a})$, so $\textbf{a}$ is a local min or a local max.

Then we can use the diagonalization formula from earlier, since the Hessian matrix is guaranteed to be symmetric! So the quadratic form will be:

$$q_H(t_1\textbf{w}_1 + ... + t_n\textbf{w}_n) = \lambda_1t_1^2 + ... + \lambda_nt_n^2$$

where $\textbf{w}_1, ... , \textbf{w}_n$ are orthonormal eigenvectors with corresponding eigenvalues $\lambda_1, ..., \lambda_n$ (the $\|\textbf{w}_i\|^2$ factors from earlier disappear because the eigenvectors here are unit length).

We can take this at face value and just notice that if the Hessian $(Hf)(\textbf{a})$ is positive-definite then $\textbf{a}$ is a local min, if it's negative-definite then $\textbf{a}$ is a local max, and if it's indefinite then $\textbf{a}$ is a saddle point. Why? The quadratic form describes the geometry of $f$ around $\textbf{a}$, so if the quadratic form is always positive then $f$ must be greater than $f(\textbf{a})$ in every direction away from $\textbf{a}$ (and therefore $\textbf{a}$ is a local min). Symmetric reasoning gets us to why $\textbf{a}$ must be a local max if the quadratic form is always negative (negative-definite).

A minor restatement is that if the Hessian has all positive eigenvalues then $\textbf{a}$ is a local min, if all negative eigenvalues then $\textbf{a}$ is a local max, and if there are both positive and negative eigenvalues then $\textbf{a}$ is a saddle point.
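To close the loop, here's a small sketch of the second derivative test on a concrete (made-up) function, $f(x, y) = x^2 - y^2$, near its critical point at the origin:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a critical point at (0, 0): the gradient (2x, -2y) vanishes there.
# Its Hessian is constant:
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

lam = np.linalg.eigvalsh(H)
if np.all(lam > 0):
    verdict = "local minimum"
elif np.all(lam < 0):
    verdict = "local maximum"
elif np.any(lam > 0) and np.any(lam < 0):
    verdict = "saddle point"
else:
    verdict = "inconclusive (zero eigenvalue)"

print(lam, verdict)   # [-2.  2.] saddle point
```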

This is a little bit simpler to understand, but I still found it to be one of the most interesting results in the course (aside from stuff related to chemistry and SVMs, which I'll cover at some point). It ties together everything to motivate why the second derivative test works in multiple variables and makes you feel like each step is from a different part of your Math 51 journey - from quadratic forms to eigenvalues to level sets to gradients.