Math 51: Methods in Several Variables
Intro
Stanford is a bit strange in its coverage of linear algebra and multivariable calculus. Math 51 covers what you would find in a typical introductory multivariable calculus class, but only the derivative/gradient side of things, along with a brief introduction to the usefulness of matrix decompositions and eigenvalues. The course is largely based on memorizing methods and performing routine computation. I didn't find the course that enlightening and honestly wished it followed a more traditional curriculum, like that of Gilbert Strang's 18.06.
Overall Stats
Grade breakdown:
- 20%: Midterm 1
- 20%: Midterm 2
- 30%: Final exam
- 10%: PCRQs (Pre-Class Reading Questionnaires)
- 20%: Problem sets
Here the PCRQ and problem set marks are divided by 0.8: e.g. if I get a total of 75% on the HW section I'll actually end up with a 93.75%.
Stats for Fall Quarter 2024:
- Midterm 1: Median 45.5/60, SD 11, Mean 44/60
- Midterm 2: Median 50.5/60, SD 8.41, Mean 48.5/60
- Final exam: Median 67/90, SD 15, Mean 64/90
The curve in my quarter was only about 2.5-3%, since fall quarter tends to be quite brutal. Most kids taking Math 51 in fall already took linear algebra and multivariable calculus in high school, so exam scores tend to be higher on average. I've heard that the curve in other quarters is more like 8-10%, enough to move up a full letter grade, and I'll add those stats from friends that are taking Math 51 in these upcoming quarters when they're available.
Meta-Level Stuff
The biggest thing I learned is that I should have shopped around when it came to TAs and sections. On the very last discussion section day before the final exam I dropped in on the section of a TA named Pino, which my friend recommended, and it was brilliant and amazingly well taught. This is in contrast to my own section leader, who sort of held a quiet session for an hour where we'd just do homework (which is a bit of a waste of time, in my opinion).
I made it out of Math 51 with a decent grade after skipping pretty much every single lecture and section. But my study schedule for math was pretty bad and I tended to leave things to the last minute, which was cognitively draining and required a lot of work.
In an attempt to iterate and learn from mistakes, I'm going to try this study regime:
- Two days before lecture, read textbook in depth
- Actually go to lecture, and do a couple simple example problems the same day as lecture
- Do associated pset problems 2 days later
This way I naturally bake in a spaced repetition of about 2 days or so, 3 times, so hopefully the material sticks better. I'll see if this works better for me in Math 52.
Approximation
Let's define the derivative matrix of a function $f: \mathbb{R}^n \to \mathbb{R}^m$ at a point $\mathbf{a}$, which we'll denote with $(Df)(\mathbf{a})$, to be the $m \times n$ matrix of partial derivatives:

$$(Df)(\mathbf{a}) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1}(\mathbf{a}) & \cdots & \frac{\partial f_1}{\partial x_n}(\mathbf{a}) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(\mathbf{a}) & \cdots & \frac{\partial f_m}{\partial x_n}(\mathbf{a}) \end{pmatrix}$$

Using this, we can define the linear approximation for a point $\mathbf{x}$ near $\mathbf{a}$ to be:

$$f(\mathbf{x}) \approx f(\mathbf{a}) + (Df)(\mathbf{a})(\mathbf{x} - \mathbf{a})$$

The mathematical intuition here is that $\mathbf{x} - \mathbf{a}$ is a vector that gives our multivariable "steps", so multiplying the derivative matrix with $\mathbf{x} - \mathbf{a}$ gives us the change in the function's output predicted by the associated derivative.

In multiple variables we define differentiability of $f$ at $\mathbf{a}$ to be the existence of an $m \times n$ matrix $A$ so that as $\mathbf{x}$ approaches $\mathbf{a}$ we have:

$$\frac{\|f(\mathbf{x}) - f(\mathbf{a}) - A(\mathbf{x} - \mathbf{a})\|}{\|\mathbf{x} - \mathbf{a}\|} \to 0$$

Here, if the partial derivatives exist, then so does the derivative matrix, and for a differentiable function $A$ is just $(Df)(\mathbf{a})$.

Again, the intuition for this is similar to the single variable definition of differentiability - that is, if you zoom in on the function's surface enough, the function will exhibit some quality of "smoothness" that allows it to be differentiable.

While these seem like very basic initial results, I've chosen to include them because they tie together how everything in single variable calculus can easily be extended into multiple variables just using the language of matrices and vectors. Take the definition of differentiability: if one sets $m = n = 1$, then it can be seen that the definition just becomes the single variable definition of differentiability.
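To make this concrete, here's a minimal NumPy sketch using a made-up example function (the function and all names are my own, not from the course), comparing actual values of $f$ near a point with the linear approximation built from the derivative matrix:

```python
import numpy as np

# Hypothetical example function f: R^2 -> R^2, f(x, y) = (x^2 * y, sin(x) + y).
def f(v):
    x, y = v
    return np.array([x**2 * y, np.sin(x) + y])

# Its derivative matrix (matrix of partial derivatives), worked out by hand.
def Df(v):
    x, y = v
    return np.array([[2 * x * y, x**2],
                     [np.cos(x), 1.0]])

a = np.array([1.0, 2.0])       # base point
h = np.array([0.01, -0.02])    # a small multivariable "step" away from a

exact = f(a + h)               # actual value of f near a
approx = f(a) + Df(a) @ h      # linear approximation f(a) + (Df)(a) * (x - a)

print(np.max(np.abs(exact - approx)))   # tiny error (on the order of 1e-4) for this small step
```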
Matrices as Transformations
A nice mathematical intuition for what matrices do is encode some linear transformation of a vector. This also means that matrices encode functions, since they take inputs and give some output for a given vector. Take some matrix $A$:

$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

When we multiply $A$ by $\begin{pmatrix} x \\ y \end{pmatrix}$ we get:

$$A\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} y \\ x \end{pmatrix}$$

This matrix is literally transforming $x$ and $y$ to swap, in a sense, resulting in a reflection across the line $y = x$.

In general, for an $n \times n$ square matrix $A$, one can think of each column of the matrix as encoding where each respective basis vector ends up after applying the transformation encoded by $A$.

We see this more clearly when we consider how matrix multiplication works for some generic vector $\mathbf{x} \in \mathbb{R}^n$:

$$A\mathbf{x} = x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_n\mathbf{a}_n$$

where $\mathbf{a}_1, \ldots, \mathbf{a}_n$ are the columns of $A$. The matrix is literally encoding a linear combination of its columns, with scalar coefficients given by the elements of $\mathbf{x}$. Consider what happens when we have the standard basis vectors:

$$\mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad \mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}, \quad \ldots, \quad \mathbf{e}_n = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}$$

Setting $\mathbf{x}$ to any of the above causes every column vector in the linear combination to disappear except for the associated column vector - if we plug in the $i$-th basis vector we obtain the $i$-th column vector of $A$, so $A$ is encoding where each basis vector ends up after applying the function represented by $A$.
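As a quick sanity check, here's a small NumPy sketch (the matrix is an arbitrary example of my own) showing that $A\mathbf{x}$ really is a linear combination of the columns of $A$, and that plugging in a standard basis vector picks out a column:

```python
import numpy as np

# An arbitrary 3x3 example matrix.
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

x = np.array([2, -1, 3])

# A @ x is exactly the linear combination 2*col1 - 1*col2 + 3*col3.
combo = 2 * A[:, 0] - 1 * A[:, 1] + 3 * A[:, 2]
print(np.array_equal(A @ x, combo))   # True

# Feeding in a standard basis vector just picks out the matching column.
e2 = np.array([0, 1, 0])
print(A @ e2)                         # [2 5 8], the second column of A
```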
We can also multiply matrices with each other to encode "compound" transformations, analogous to function compositions.
To illustrate compound matrix transformations, consider two rotation matrices:
- 90° Clockwise Rotation Matrix:

$$R_{\text{cw}} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$$

- 90° Counterclockwise Rotation Matrix:

$$R_{\text{ccw}} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$$

When we multiply these matrices, we get:

$$R_{\text{cw}} R_{\text{ccw}} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$$

Performing the multiplication:

$$R_{\text{cw}} R_{\text{ccw}} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I$$

Thus, the product of a 90° clockwise rotation and a 90° counterclockwise rotation is the identity matrix. This makes sense: the two transformations cancel out, since rotating a vector clockwise and then counterclockwise by 90 degrees should do nothing to that vector.
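The same computation in NumPy, just to see the cancellation numerically (the matrices are the ones written out above):

```python
import numpy as np

# 90-degree clockwise and counterclockwise rotation matrices.
R_cw  = np.array([[0,  1],
                  [-1, 0]])
R_ccw = np.array([[0, -1],
                  [1,  0]])

# Composing the two rotations gives the identity: they undo each other.
print(R_cw @ R_ccw)
# [[1 0]
#  [0 1]]

v = np.array([3, 4])
print(R_cw @ (R_ccw @ v))   # [3 4] -- the vector comes back unchanged
```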
Friendship Networks
We can consider a matrix that encodes friends, call it $A$ - specifically, we say that the entry $A_{ij}$ is $1$ if person $i$ and person $j$ are friends, and $0$ otherwise. We make two assumptions:
- Friendship is reciprocal, so $A$ must be symmetric
- You cannot be friends with yourself, so the diagonal of $A$ is all zeros
It turns out that the $(i, j)$ entry of $A^2$ is the number of common friends between person $i$ and person $j$. Why? We can give a formula for the $(i, j)$ entry of $A^2$:

$$(A^2)_{ij} = \sum_{k=1}^{n} A_{ik} A_{kj}$$

Notice that $A_{ik}A_{kj}$ only equals $1$ if both $A_{ik}$ and $A_{kj}$ are $1$, or in other words person $k$ is a friend of both person $i$ and person $j$. So the total summation gives the number of people who are friends with $i$ and friends with $j$!
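Here's a tiny NumPy illustration on a made-up four-person network (the matrix is just an example I picked):

```python
import numpy as np

# A hypothetical friendship matrix for 4 people:
# person 0 is friends with 1 and 2, and person 3 is friends with 1 and 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]])

A2 = A @ A
# (A^2)[0, 3] counts the mutual friends of person 0 and person 3.
print(A2[0, 3])   # 2  (people 1 and 2 are friends with both)
```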
I thought this result was interesting, if not really all that practical. I doubt that in real life any serious social media company is storing its users' friendship graph as a dense matrix, since 99.999% of that matrix would be empty and the matrix would have several billion rows and columns. Instead, similar functionality can be obtained just with intersection operations and SQL lookups.
Span and Linear Dependence
Span is a simple concept but it's so important to intuition in linear algebra that I include it here to help ensure I never forget it. We define the span of some collection of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ to be the collection of all vectors in $\mathbb{R}^n$ that we can make through some linear combination of the collection of vectors. In formal logic, the span is something like:

$$\operatorname{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k) = \{c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k : c_1, \ldots, c_k \in \mathbb{R}\}$$

So intuitively, the vectors

$$\mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad \mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}$$

span the $xy$-plane in $\mathbb{R}^3$, since we can make any vector in the plane by linearly combining $\mathbf{e}_1$ and $\mathbf{e}_2$. Similarly, the vectors

$$\mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad \mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad \mathbf{e}_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$$

span all of $\mathbb{R}^3$, since we can linearly combine $\mathbf{e}_1$, $\mathbf{e}_2$, and $\mathbf{e}_3$ to make any vector in $\mathbb{R}^3$.
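A quick way to check span membership numerically is to ask whether a least-squares solve leaves zero residual; here's a small NumPy sketch (the `in_span` helper is my own, not something from the course):

```python
import numpy as np

# Columns are e1 and e2 in R^3; their span is the xy-plane.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

def in_span(V, b, tol=1e-10):
    """Check whether b is a linear combination of the columns of V."""
    coeffs, *_ = np.linalg.lstsq(V, b, rcond=None)
    return np.linalg.norm(V @ coeffs - b) < tol

print(in_span(V, np.array([3.0, -2.0, 0.0])))   # True: lies in the xy-plane
print(in_span(V, np.array([3.0, -2.0, 1.0])))   # False: has a z-component
```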
Matrix Spaces
Again, these are simple concepts, but they're so important to reasoning intuitively about matrices that I include them here.
Column Space:
We define the column space to be the span of the column vectors of some matrix $A$. So if we write $A$ in terms of its columns as:

$$A = \begin{pmatrix} | & & | \\ \mathbf{a}_1 & \cdots & \mathbf{a}_n \\ | & & | \end{pmatrix}$$

then the column space, which we denote with $C(A)$, is just the span of the vectors $\mathbf{a}_1, \ldots, \mathbf{a}_n$.
For any square matrix $A$, since we know from earlier that $A\mathbf{x}$ really just encodes some linear combination of the columns of $A$, the system $A\mathbf{x} = \mathbf{b}$ has a solution only when $\mathbf{b}$ lies in the column space of $A$. If $\mathbf{b}$ isn't in the column space of $A$, then there's no way to linearly combine the columns of $A$ to create $\mathbf{b}$. This is the intuitive reason why some linear systems lack a solution.
Null Space:
Null spaces are useful since if the null space of some square matrix $A$ contains any vector other than just $\mathbf{0}$ (the zero vector), then it turns out that if $A\mathbf{x} = \mathbf{b}$ has a solution then there are infinitely many solutions.
We denote the null space of some matrix $A$ as $N(A)$, and we define it to be the set of all solutions $\mathbf{x}$ to the system $A\mathbf{x} = \mathbf{0}$.
Say we find some nonzero solution $\mathbf{v}$ to the system $A\mathbf{x} = \mathbf{0}$, and we know that the system $A\mathbf{x} = \mathbf{b}$ has some solution $\mathbf{x}_0$. Then we know that $A\mathbf{x} = \mathbf{b}$ actually has infinitely many solutions, since $A(\mathbf{x}_0 + c\mathbf{v}) = A\mathbf{x}_0 + cA\mathbf{v} = \mathbf{b} + \mathbf{0} = \mathbf{b}$ for any scalar $c$. So we see that, by the properties of a linear subspace, the null space contains infinitely many vectors (in this case, any scalar multiple of $\mathbf{v}$), so to obtain another solution we simply take $\mathbf{x}_0$ and add any vector in the null space of $A$.
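Here's a small NumPy sketch of that argument, using a made-up singular matrix whose third column is the sum of the first two:

```python
import numpy as np

# A hypothetical singular matrix: its third column is the sum of the first two.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0]])

v = np.array([1.0, 1.0, -1.0])     # A @ v = 0, so v is in the null space
print(A @ v)                       # [0. 0. 0.]

x0 = np.array([1.0, 1.0, 1.0])     # one particular solution of A x = b ...
b = A @ x0                         # ... for this right-hand side

# x0 plus any multiple of v solves the same system, giving infinitely many solutions.
for c in [0.0, 1.0, -3.5]:
    print(np.allclose(A @ (x0 + c * v), b))   # True each time
```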
Over and Underdetermined Systems
Consider some system $A\mathbf{x} = \mathbf{b}$. We call the system overdetermined if there are more associated equations than unknowns (so if $A$ is an $m \times n$ matrix, $m > n$). Intuitively it is often (but not always) the case that there are no solutions, since there are fewer columns than there are rows, so the column space doesn't cover all of $\mathbb{R}^m$.
On the other hand, we say a system is underdetermined if there are more unknowns than equations (so $m < n$). Intuitively these systems often (but not always) have infinitely many solutions, since the null space contains some nonzero vector (there are more columns than rows, so some column must be "redundant" and not linearly independent of the others).
These are a nice "dipstick" test for some given linear system, even if they aren't definitive truths.
Orthogonal Matrices
An orthogonal matrix is defined as a square matrix $Q$ whose columns form an orthonormal basis. For orthogonal matrices, $Q^TQ = QQ^T = I$, so $Q^{-1} = Q^T$.
QR and LU Decompositions
I skip over the details of QR and LU decompositions and finding them since it's pretty tedious, mechanical, and easy to pick up.
To be brief, "most" (but not all) matrices have form for lower triangular and upper triangular . This is helpful since systems that can be decomposed into this form can be easily solved.
Then all we have to do is solve 1) and then solve 2) .
Invertible matrices can also be decomposed in the form , where is an orthogonal matrix and is an upper triangular matrix. Since is an orthogonal matrix, is just , so we solve by simply noting that ( is easy to find mechanically because it is upper triangular).
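A minimal NumPy sketch of the QR route (the example matrix is arbitrary; in practice you'd back-substitute against the triangular $R$ rather than call a general solver):

```python
import numpy as np

# An arbitrary invertible matrix and right-hand side.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

Q, R = np.linalg.qr(A)            # A = QR, Q orthogonal, R upper triangular
x = np.linalg.solve(R, Q.T @ b)   # solve R x = Q^T b (back-substitution in practice)

print(np.allclose(A @ x, b))      # True
```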
What's a lot more interesting than finding these decompositions is what we can do with them, or more generally what solving linear algebraic systems lets us do. The most interesting application to me was using these decompositions to find approximate solutions that model some function or phenomenon.
This applies to me with regards to electrical engineering, where I might need to find a small number of constants in a model that fits a lot of data. Another way to say this is that I may need to approximate an unknown function by a linear combination of a small number of known functions, and we want to find the coefficients of that linear combination. This also, as one might imagine, comes up all the time in data science, physics, or any discipline where decent mathematical models for something are necessary.
To give a better intuition for what we're actually doing, we can pretend we're conducting some signal processing experiment where we're measuring some unknown signal, $f(t)$, on a time interval of length $T$. If we have some known basic signals, like simple sine, cosine, saw, etc., then we want the linear combination of these known functions that approximates the one unknown signal (which we only measured at some points of the interval) over the entire interval $[0, T]$.
Mathematically, we might want to approximate some unknown function $f$ on some region $R$ (which could be a surface in space or a time interval). The approximation will be a linear combination of some known functions $g_1, g_2, \ldots, g_n$, and be of the form $c_1 g_1 + c_2 g_2 + \cdots + c_n g_n$. All this data comes from "experiments", where we might measure the value of a function at some specific point or time $t_i$.
Let $A$ be the matrix of values of the $g_j$ at many points $t_i$ (so $A_{ij} = g_j(t_i)$). Let $\mathbf{b}$ be the vector whose $i$-th entry is the measured value of $f$ at the point $t_i$ in the experiment. So the "best approximate solution" to the system $A\mathbf{c} = \mathbf{b}$ will give us the coefficient values $c_1, \ldots, c_n$ we need for our function approximation.
For this system all we need to do is find a $QR$ decomposition of $A$, so we see that $R\mathbf{c} = Q^T\mathbf{b}$. If a solution exists for this system, then $\mathbf{c}$ is our solution, and its entries will give the scalar coefficients we need.
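Here's a hedged NumPy sketch of that workflow, with a simulated "unknown" signal and a sine/cosine/constant basis of my own choosing (np.linalg.lstsq computes the same best approximate solution via an orthogonal decomposition, the SVD):

```python
import numpy as np

# Hypothetical setup: approximate an unknown signal f(t) on [0, 2*pi] as
# c1*sin(t) + c2*cos(t) + c3, from noisy samples of f at a handful of times.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 25)
f_samples = 2.0 * np.sin(t) - 0.5 * np.cos(t) + 1.0 + rng.normal(0, 0.05, t.size)

# Each column of A holds one known basis function evaluated at the sample times.
A = np.column_stack([np.sin(t), np.cos(t), np.ones_like(t)])

# Best approximate solution of A c = b in the least-squares sense.
c, *_ = np.linalg.lstsq(A, f_samples, rcond=None)
print(c)   # close to [2.0, -0.5, 1.0]
```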
Spectral Theorem and Applications
The spectral theorem is as follows:
Theorem: Let $A$ be an $n \times n$ symmetric matrix. Then there is an orthogonal basis of $\mathbb{R}^n$ consisting of eigenvectors for $A$. There are a few applications of this that I thought were kinda neat.
Finding Quadratic Forms
For any symmetric matrix $A$, we can now write its quadratic form $q_A(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ in terms of eigenvalues. We call this the diagonalization formula, given by

$$q_A(\mathbf{x}) = \lambda_1 c_1^2 + \lambda_2 c_2^2 + \cdots + \lambda_n c_n^2$$

where the $\lambda_i$ are the eigenvalues of $A$ and the $c_i$ are the coordinates of $\mathbf{x}$ in the corresponding orthonormal eigenbasis.
This gives a quadratic form where there are no cross-terms (i.e. no terms with $xy$, $xz$, or $yz$), which becomes extremely useful since it allows us to determine the geometry of the function encoded by $A$.
Definite-ness of Matrices
Let $q_A(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ be the quadratic form of a matrix $A$. We say the matrix represents a positive-definite, negative-definite, or indefinite quadratic form if:
- Positive-definite: $q_A(\mathbf{x}) > 0$ for all nonzero $\mathbf{x}$.
- Negative-definite: $q_A(\mathbf{x}) < 0$ for all nonzero $\mathbf{x}$.
- Indefinite: $q_A(\mathbf{x})$ takes both positive and negative values depending on $\mathbf{x}$.
Quadratic forms often include cross terms like $xy$ or $yz$, which introduce ambiguity in interpreting the geometry of $q_A$. For example:

$$q(x, y) = x^2 + 4xy + y^2$$

The $4xy$ term complicates the identification of the axes of symmetry, since it's not readily apparent whether $q$ will always be positive or negative, or only sometimes.
By using diagonalization we can express the quadratic form in terms of the eigenvalues of $A$:

$$q_A(\mathbf{x}) = \lambda_1 c_1^2 + \lambda_2 c_2^2 + \cdots + \lambda_n c_n^2$$

We see pretty easily that if all $\lambda_i$ are positive, then the quadratic form of $A$ is always positive, so the function is positive for any nonzero input vector $\mathbf{x}$. Similarly, if all $\lambda_i$ are negative, then the quadratic form is always negative and the function encoded by $A$ is always negative. And if some eigenvalue is positive while another is negative, the form takes both signs, so it is indefinite.
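In code this check is just a matter of looking at eigenvalue signs; here's a short NumPy sketch (the classify helper and example matrices are my own):

```python
import numpy as np

def classify(A):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eigs = np.linalg.eigvalsh(A)        # eigvalsh is meant for symmetric matrices
    if np.all(eigs > 0):
        return "positive-definite"
    if np.all(eigs < 0):
        return "negative-definite"
    if np.any(eigs > 0) and np.any(eigs < 0):
        return "indefinite"
    return "semi-definite"              # some eigenvalue is exactly zero

print(classify(np.array([[2.0, 0.0], [0.0, 3.0]])))   # positive-definite
print(classify(np.array([[1.0, 2.0], [2.0, 1.0]])))   # indefinite (eigenvalues 3 and -1)
```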
Spectral Theorem Applications to Exponentiation
Let $A$ be an $n \times n$ symmetric matrix with orthogonal eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$, having corresponding eigenvalues $\lambda_1, \ldots, \lambda_n$. Let $Q$ be the matrix whose columns are the respective unit eigenvectors.
Then $Q^T Q = I_n$ (i.e., $Q$ is an orthogonal matrix as discussed in Section 20.4), and $A = QDQ^T$ for the diagonal matrix

$$D = \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix}$$

whose diagonal entries are the corresponding eigenvalues.
We can see how this is useful for matrix powers through the following:

$$A^2 = (QDQ^T)(QDQ^T) = QD(Q^TQ)DQ^T = QD^2Q^T$$

(Because $Q$ is orthogonal, $Q^TQ = I$.)
In general, $A^k = QD^kQ^T$, and since $D$ is diagonal it's easy to compute $D^k$ - just raise each diagonal entry to the $k$-th power. It's a fun little trick to work out end behaviours of things like Markov matrices by hand.
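A quick NumPy check of the trick (the matrix is an arbitrary symmetric example; eigh returns orthonormal eigenvectors for symmetric matrices):

```python
import numpy as np

# An arbitrary symmetric matrix to demonstrate with.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Spectral decomposition: the columns of Q are orthonormal eigenvectors, so A = Q D Q^T.
eigvals, Q = np.linalg.eigh(A)

k = 10
D_k = np.diag(eigvals ** k)   # D^k: just raise each eigenvalue to the k-th power
A_k = Q @ D_k @ Q.T           # A^k = Q D^k Q^T

print(np.allclose(A_k, np.linalg.matrix_power(A, k)))   # True
```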
Hessians And Local Extrema
Remember the Taylor Series from single variable calculus? Or maybe you don't want to remember. Whatever the case, it serves as a refinement of a simple linear approximation, "evolving" it into the quadratic approximation, and it works in multiple variables too! Given some $f: \mathbb{R}^n \to \mathbb{R}$, we can approximate $f$ near some point $\mathbf{a}$ with:

$$f(\mathbf{x}) \approx f(\mathbf{a}) + (\nabla f)(\mathbf{a}) \cdot (\mathbf{x} - \mathbf{a}) + \frac{1}{2}(\mathbf{x} - \mathbf{a})^T (Hf)(\mathbf{a}) (\mathbf{x} - \mathbf{a})$$

for small $\mathbf{x} - \mathbf{a}$, where $(Hf)(\mathbf{a})$ is the Hessian matrix of second partial derivatives. This is the "quadratic approximation" for $f$ near $\mathbf{a}$.
When we think about extrema, they always occur at critical points where the gradient vanishes, leaving us with just:

$$f(\mathbf{x}) \approx f(\mathbf{a}) + \frac{1}{2}(\mathbf{x} - \mathbf{a})^T (Hf)(\mathbf{a}) (\mathbf{x} - \mathbf{a})$$

In this sense the quadratic form of the Hessian can be extremely useful, since it's basically giving away the geometry of the contour plot around $\mathbf{a}$. The idea is that for $\mathbf{x}$ near $\mathbf{a}$, $f(\mathbf{x}) - f(\mathbf{a}) \approx \frac{1}{2}(\mathbf{x} - \mathbf{a})^T (Hf)(\mathbf{a}) (\mathbf{x} - \mathbf{a})$, so we're approximating the level curve of $f$ around $\mathbf{a}$. If the quadratic form of the Hessian is positive or negative definite, then we know for sure that around $\mathbf{a}$ all values of $f$ are either greater than or less than $f(\mathbf{a})$, so $\mathbf{a}$ is a local min or a max.
Then we can use the diagonalization formula from earlier, since the Hessian matrix is guaranteed to be a symmetric matrix! So the quadratic form will be:

$$\frac{1}{2}(\mathbf{x} - \mathbf{a})^T (Hf)(\mathbf{a}) (\mathbf{x} - \mathbf{a}) = \frac{1}{2}\left(\lambda_1 c_1^2 + \lambda_2 c_2^2 + \cdots + \lambda_n c_n^2\right)$$

for the coordinates $c_i$ of $\mathbf{x} - \mathbf{a}$ in the basis of unit eigenvectors, with $\lambda_i$ the corresponding eigenvalues.
We can take this at face value and just notice that if the Hessian is positive-definite then $\mathbf{a}$ is a local min, if it's negative-definite then $\mathbf{a}$ is a local max, and if it's indefinite then $\mathbf{a}$ is a saddle point. Why? The quadratic form describes the geometry of the contour plot around $\mathbf{a}$, so if the quadratic form is always positive around $\mathbf{a}$ then the function must always be greater than $f(\mathbf{a})$ in any direction from $\mathbf{a}$ (and therefore $\mathbf{a}$ is a local min). Symmetric reasoning gets us to why $\mathbf{a}$ must be a local max if the quadratic form is always negative (negative-definite).
A minor extension is just that if the Hessian has all positive eigenvalues then $\mathbf{a}$ is a local min, if all negative eigenvalues then $\mathbf{a}$ is a local max, and if there are both positive and negative eigenvalues then $\mathbf{a}$ is a saddle point.
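As a final sketch, here's the second derivative test as a few lines of NumPy, applied to the hypothetical function $f(x, y) = x^2 - y^2$ at its critical point at the origin:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a critical point at the origin (the gradient
# (2x, -2y) vanishes there). Its Hessian is constant:
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigs = np.linalg.eigvalsh(H)
if np.all(eigs > 0):
    print("local min")
elif np.all(eigs < 0):
    print("local max")
elif np.any(eigs > 0) and np.any(eigs < 0):
    print("saddle point")      # this is what prints here: eigenvalues 2 and -2
else:
    print("test is inconclusive (a zero eigenvalue)")
```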
This is a little bit simpler to understand, but I still found it to be one of the most interesting results in the course (aside from stuff related to chemistry and SVMs, which I'll cover at some point). It ties together everything to motivate why the second derivative test works in multiple variables and makes you feel like each step is from a different part of your Math 51 journey - from quadratic forms to eigenvalues to level sets to gradients.