alok-g 11 hours ago [-]
I find several things confusing in this article.
>> ... A GP works by constructing an infinite amount of guesses or functions of the true process you want to approximate. As you accumulate more observations, it changes the shape of these functions to match the data, and hence the true process (just like the way you change your mind after getting new information)
Why is the 'true process' changing here? I understand our best guess or model is changing with new observations, but the true process should not be changing. If it actually is, then the formulation should be changed to isolate the parameters that are feeding back into it.
>> ... A GP works by constructing an infinite amount of guesses or functions of the true process you want to approximate. As you accumulate more observations, it changes the shape of these functions to match the data, ... A GP is simply a distribution over functions (or guesses). Because we have an infinite amount of guesses, the expected true guess (or best model) is the mean of all plausible guesses.
So is the shape of each function changing? OK. What is the 'distribution' over the functions doing? Is that also changing? Is the said 'distribution' just a flat mean of these functions?
>> GP(m(x), k(x, x'))
What is 'x' here? (Sigh! We need to learn to define variables before using them.) I can infer that x' is not the derivative of x.
>> In the context of GPs, a kernel or covariance function k(x, x') = Cov(f(x), f(x')), encodes which function values should vary together.
It does not seem the 'f' here is intended to be the specific 'f' introduced at the beginning of the article.
>> I will use the rest of this post to go over different kernel representations and their visualizations.
The plots now have y and x, and x1 and x2. How are these related?
And with k(x, x') = Cov(f(x), f(x')), what is 'f' for the various kernel functions being plotted?
The rest of the post looks fine as plots of the various functions given. But given the above, I have not understood their importance as kernel functions or use for GP.
llamaz 6 hours ago [-]
> But given the above, I have not understood their importance as kernel functions or use for GP.
If the author has a CV attached to their blog, the purpose is to signal competence and the target audience is future employers.
magicalhippo 5 hours ago [-]
I was similarly confused, but after a few rounds with Gemini 3.5 Flash (extended) it cleared things up some, for me anyway.
> What is 'x' here?
So as I understand it, a Gaussian Process is defined in terms of a set of random variables which are indexed, typically by either time (t) or space (x). So in the concrete example, x here would be the amount of cheese inserted into the magical machine. In general the "index" can be a vector. Say the magical machine instead required inserting both cheese and milk to produce some amount of gold: the index x would then be two-dimensional, to represent the various amounts of cheese and milk you inserted.
> It does not seem the 'f' here is intended to be the specific 'f' introduced at the beginning of the article.
Right, it's general, and it's kinda confusing to use f when everything else seems to use X_t or similar. Here f is actually a random variable indexed by x, so one example could be
f(x) = r_1 + x * r_2
where r_1 and r_2 are two independent random variables with the standard normal distribution. In this case f(x) represents all possible lines, and f(3) gives you a random variable for index 3, so r_1 + 3 * r_2, that also follows a normal distribution thanks to how normal random variables behave when added and scaled.
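To make the random-line example concrete, here is a minimal sketch (standard library only; the function names are made up for illustration). It draws many realizations of f(x) = r_1 + x * r_2 and estimates Cov(f(x), f(x')) empirically. For this particular f, the covariance works out to 1 + x * x', because r_1 and r_2 are independent with unit variance.

```python
import random

def sample_line(rng):
    """Draw one realization of f(x) = r_1 + x * r_2 with r_1, r_2 ~ N(0, 1)."""
    r1 = rng.gauss(0.0, 1.0)
    r2 = rng.gauss(0.0, 1.0)
    return lambda x: r1 + x * r2

def empirical_cov(x, x_prime, n_samples=100_000, seed=0):
    """Estimate Cov(f(x), f(x')) by sampling many random lines."""
    rng = random.Random(seed)
    fx, fxp = [], []
    for _ in range(n_samples):
        f = sample_line(rng)
        fx.append(f(x))
        fxp.append(f(x_prime))
    mean_x = sum(fx) / n_samples
    mean_xp = sum(fxp) / n_samples
    total = sum((a - mean_x) * (b - mean_xp) for a, b in zip(fx, fxp))
    return total / (n_samples - 1)

def exact_cov(x, x_prime):
    """Closed form for this f: Var(r_1) + x * x' * Var(r_2) = 1 + x * x'."""
    return 1.0 + x * x_prime

if __name__ == "__main__":
    print(empirical_cov(1.0, 3.0), exact_cov(1.0, 3.0))
```

So f(3) on its own has variance 1 + 3 * 3 = 10, and the kernel is just the function that tells you this covariance for any pair of indices.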
> The plots now have y and x, and x1 and x2. How are these related?
The left plot shows three realizations of y = f(x), i.e. three different choices (samples) of the random variables that go into f(x). The right-hand plot shows the output of the kernel function for two indices x and x'. In the first example, the kernel function was the dot product between the two inputs, but given the indices are 1-dimensional that reduces to just k(x, x') = x * x'.
Back to the example, you can feed the machine various amounts of cheese and record the various amounts of gold you get back. The amounts of cheese are the indices, which you use with the kernel function you picked and run through the Gaussian Process regression math. You get a new function which takes an index (amount of cheese) and returns a normal distribution that predicts the amount of gold for that index (amount of cheese).
The process spits out the mean and the variance of the normal distribution, so you can look at the variance to determine how certain you can be about the prediction which will be centered around the mean.
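Roughly, the "regression math" could look like the sketch below. It assumes a zero mean function, a squared-exponential (RBF) kernel, and a tiny noise term for numerical stability; none of these choices come from the article, and the cheese/gold numbers are invented purely for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2)), 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_predict(x_train, y_train, x_test, kernel=rbf_kernel, noise_var=1e-6):
    """Posterior mean and variance of a zero-mean GP at each test index."""
    K = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = kernel(x_train, x_test)        # train/test covariances
    K_ss = kernel(x_test, x_test)        # test/test covariances
    L = np.linalg.cholesky(K)            # K = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)    # clip tiny negative round-off

# Hypothetical measurements: amount of cheese in, amount of gold out.
cheese = np.array([1.0, 2.0, 4.0])
gold = np.array([0.5, 1.5, 1.0])

# Predict at an amount we already measured and at one far from all data.
mean, var = gp_predict(cheese, gold, np.array([2.0, 10.0]))
```

At an index that was already measured (2.0) the variance collapses towards the noise level and the mean matches the observation. Far from the data (10.0) the prediction falls back to the prior: mean near 0 and variance near k(x, x) = 1.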
As I understand it, the point of the left plot is that you can use it to get an idea for which kernel function to use for your measured data. And as mentioned you can easily make new kernel functions by adding (OR-like) and multiplying (AND-like) other kernel functions.
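A small sketch of the adding/multiplying idea, using a linear kernel and one common form of the periodic kernel. The helper names `add` and `multiply` are hypothetical, and the default hyperparameter values are arbitrary.

```python
import math

def linear(x, x_prime):
    """Linear (dot-product) kernel for 1-D indices."""
    return x * x_prime

def periodic(x, x_prime, length_scale=1.0, period=2.0):
    """Periodic kernel: exp(-2 * sin^2(pi * |x - x'| / p) / l^2)."""
    s = math.sin(math.pi * abs(x - x_prime) / period)
    return math.exp(-2.0 * s ** 2 / length_scale ** 2)

def add(k1, k2):
    """Sum of two kernels: large if either kernel is large (OR-like)."""
    return lambda x, x_prime: k1(x, x_prime) + k2(x, x_prime)

def multiply(k1, k2):
    """Product of two kernels: large only if both are large (AND-like)."""
    return lambda x, x_prime: k1(x, x_prime) * k2(x, x_prime)

linear_plus_periodic = add(linear, periodic)
linear_times_periodic = multiply(linear, periodic)
```

For example, at x = 0 the linear kernel is 0, so the product is 0 no matter what the periodic part says, while the sum still reports the periodic similarity.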
Also the author made a mistake: he said kernel functions are parameterless, but he meant non-parametric. The kernel functions he shows do have hyperparameters; the periodic kernel has l and p, for example.
At least that's my current understanding.
zahirbmirza 21 hours ago [-]
I can't wait to read/view this in detail. Super exciting, thank you.
RickJWagner 19 hours ago [-]
To the author: Thank you, quite readable. I like the thumbnail explanations.
ranger_danger 19 hours ago [-]
Does not appear to have anything to do with operating systems... looks AI related
dullcrisp 17 hours ago [-]
Sadly those of us hoping to learn about making popcorn are forced to look elsewhere.
thephyber 18 hours ago [-]
Yes, kernel is an overloaded term. This is about functions running on GPUs, not Operating System core functionality.
srean 14 hours ago [-]
Not specifically those either.
This is about inner product functions in a specific kind of Hilbert space, a notion that is very useful in many branches of applied mathematics, machine learning and functional analysis included.
The name collision is unfortunate.
One unfortunate difficulty is that these kernels don't map so well to GPU kernels unless explicitly embedded in high dimensional spaces. This is one of the reasons why kernel methods have recently fallen out of favour in machine learning - lack of mechanical sympathy. Note this is just one reason; there are others.
trumpdong 18 hours ago [-]
"AI" now refers to things like ChatGPT - this is ML related (machine learning) which is the thing that used to be called AI six years ago
ranger_danger 17 hours ago [-]
I don't think there is a single universally applicable definition of "AI" that even most people would agree with.