Published on 10th of March, 2022.
In my new paper together with my supervisor, we explain how to use discretized functions and tensors to do supervised machine learning. A discretized function is just a function defined on some grid, taking a constant value on each grid cell. We can describe such a function using a multi-dimensional array (i.e. a tensor), and we can learn this tensor using data. This results in a new and interesting type of machine learning model.
Before we dive into the details of our new type of machine learning model, let’s sit back for a moment and think: what is machine learning in the first place? Machine learning is all about learning from data. More specifically, in supervised machine learning we are given some data points $x_1, \dots, x_N$, all lying in $\mathbb{R}^d$, together with labels $y_1, \dots, y_N$ which are just numbers. We then want to find some function $f$ such that $f(x_i) \approx y_i$ for all $i$, and such that $f$ generalizes well to new data. Or rather, we want to minimize a loss function, for example the least-squares loss
$$\mathcal{L}(f) = \frac{1}{N} \sum_{i=1}^{N} \bigl(f(x_i) - y_i\bigr)^2.$$
This is obviously an ill-posed problem, and there are two main issues with it:
1. There are infinitely many functions $f$ that fit the data perfectly, so we have to decide which class of functions to search over.
2. Even if we find a function with a small loss on the given data, there is no guarantee that it generalizes well to new data.
The first issue has no general solution. We choose some class of functions, usually functions $f_\theta$ that depend on some set of parameters $\theta$. For example, if we want to fit a quadratic function to our data we only look at quadratic functions
$$f_\theta(x) = a x^2 + b x + c,$$
and our set of parameters is $\theta = (a, b, c)$. Then we minimize the loss over this set of parameters, i.e. we solve the minimization problem
$$\min_\theta\; \frac{1}{N} \sum_{i=1}^{N} \bigl(f_\theta(x_i) - y_i\bigr)^2.$$
There are many parametric families of functions we can choose from, and many different ways to solve the corresponding minimization problem. For example, we can choose $f_\theta$ to be a neural network with specified layer sizes, or a random forest with a fixed number of trees and a fixed maximum tree depth. Note that, strictly speaking, we should always specify hyperparameters like the layer sizes of a neural network, since those hyperparameters determine what kind of parameters we are going to optimize. That is, hyperparameters determine the parametric family of functions we optimize over.
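As a concrete illustration of the quadratic example above (a minimal sketch with made-up toy data, solving the least-squares problem directly with NumPy):
import numpy as np

# Toy 1D data, made up purely for illustration
rng = np.random.default_rng(0)
x = rng.uniform(size=50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=50)

# Fit f_theta(x) = a * x**2 + b * x + c by minimizing the least-squares loss
a, b, c = np.polyfit(x, y, deg=2)
print(f"Learned parameters: a={a:.2f}, b={b:.2f}, c={c:.2f}")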
The second issue, generalization, is typically addressed through cross-validation. If we want to know whether the function we learned generalizes well to new data points, we keep part of the data “hidden” during training (the test data). After training we evaluate our trained function on this hidden data, and we record the loss on this test data to obtain the test loss. The test loss is a good measure of how well the function generalizes to new data, and it is very useful if we want to compare several different functions trained on the same data. Typically we also use a third set of data, the validation dataset, for optimizing hyperparameters; see my blog post on the topic.
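For example, here is a minimal sketch (again with made-up toy data) of a three-way train/validation/test split using scikit-learn, obtained by splitting twice:
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, just for illustration
X = np.random.uniform(size=(100, 2))
y = np.random.uniform(size=100)

# First split off the test set, then split the remainder into train and validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25)
# Result: roughly 60% train, 20% validation, 20% test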
Keeping the general problem of machine learning in mind, let’s consider a particular class of parametric functions: discretized functions on a grid. To understand this class of functions, we first look at the 1D case. Let’s take the interval $[0, 1]$ and chop it up into $m$ equal pieces:
A discretized function is then one that takes a constant value on each subinterval. For example, below is a discretized version of a sine function:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
DEFAULT_FIGSIZE = (10, 6)
plt.figure(figsize=(DEFAULT_FIGSIZE))
num_intervals = 10
num_plotpoints = 1000
x = np.linspace(0, 1 - 1 / num_plotpoints, num_plotpoints)
def f(x):
    return np.sin(x * 2 * np.pi)

plt.plot(x, f(x), label="original function")
plt.plot(
    x,
    f((np.floor(x * num_intervals) + 0.5) / num_intervals),
    label="discretized function",
)
plt.legend();
Note that if we divide the interval into $m$ pieces, then we need $m$ parameters to describe the discretized function $f$.
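Concretely, we can store these $m$ values in an array and evaluate the discretized function by looking up which subinterval a point falls into. A minimal sketch (the helper name discretized_f is mine):
m = 10  # number of subintervals

# One parameter per subinterval; here we take the values of the sine at the midpoints
theta = np.sin((np.arange(m) + 0.5) / m * 2 * np.pi)

def discretized_f(x, theta):
    # Index of the subinterval [i/m, (i+1)/m) that each point falls into
    i = np.minimum((x * len(theta)).astype(int), len(theta) - 1)
    return theta[i]

print(discretized_f(np.array([0.03, 0.52, 0.99]), theta))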
In the 2D case we instead divide the square $[0, 1]^2$ into an $m \times m$ grid, and demand that a discretized function is constant on each grid cell. If we use $m$ grid cells for each axis, this gives us $m^2$ parameters. Let’s see what a discretized function looks like in a 3D plot:
fig = plt.figure(figsize=(DEFAULT_FIGSIZE))
num_plotpoints = 200
num_intervals = 5
def f(X, Y):
    return X + 2 * Y + 1.5 * ((X - 0.5) ** 2 + (Y - 0.5) ** 2)

X_plotpoints, Y_plotpoints = np.meshgrid(
    np.linspace(0, 1 - 1 / num_plotpoints, num_plotpoints),
    np.linspace(0, 1 - 1 / num_plotpoints, num_plotpoints),
)
# Smooth plot
Z_smooth = f(X_plotpoints, Y_plotpoints)
ax = fig.add_subplot(121, projection="3d")
ax.plot_surface(X_plotpoints, Y_plotpoints, Z_smooth, cmap="inferno")
plt.title("original function")
# Discrete plot
X_discrete = (np.floor(X_plotpoints * num_intervals) + 0.5) / num_intervals
Y_discrete = (np.floor(Y_plotpoints * num_intervals) + 0.5) / num_intervals
Z_discrete = f(X_discrete, Y_discrete)
ax = fig.add_subplot(122, projection="3d")
ax.plot_surface(X_plotpoints, Y_plotpoints, Z_discrete, cmap="inferno")
plt.title("discretized function");
Before diving into higher-dimensional versions of discretized functions, let’s think about how we would solve the learning problem. As mentioned, we have $m^2$ parameters, and we can encode them using an $m \times m$ matrix $\Theta$. We are doing supervised machine learning, so we have data points $x_1, \dots, x_N$ and corresponding labels $y_1, \dots, y_N$. Each data point $x_k$ corresponds to some entry $\Theta_{i,j}$ of the matrix; which entry is simply determined by the grid cell the data point happens to fall in.
If the points $x_{k_1}, \dots, x_{k_n}$ all fall into the grid cell $(i, j)$, then we can define $\Theta_{i,j}$ simply as the mean value of the labels of these points:
$$\Theta_{i,j} = \frac{1}{n} \sum_{\ell=1}^{n} y_{k_\ell}.$$
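In code, filling in the observed entries of $\Theta$ with cell means might look like the following sketch (hypothetical helper; entries of cells without data are left as NaN):
def cell_means(X, Y, Z, num_intervals):
    # Grid cell indices of each data point
    i = np.minimum((X * num_intervals).astype(int), num_intervals - 1)
    j = np.minimum((Y * num_intervals).astype(int), num_intervals - 1)
    sums = np.zeros((num_intervals, num_intervals))
    counts = np.zeros((num_intervals, num_intervals))
    np.add.at(sums, (i, j), Z)
    np.add.at(counts, (i, j), 1)
    with np.errstate(invalid="ignore"):
        theta = sums / counts  # NaN wherever a cell contains no data
    return theta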
But what do we do if we have no training data corresponding to some entry $\Theta_{i,j}$? Then the only thing we can do is make an educated guess based on the entries of the matrix we do know. This is the matrix completion problem: we are presented with a matrix with some known entries, and we are tasked with finding good values for the unknown entries. We described this problem in some detail in the previous blog post.
The main takeaway is this: to solve the matrix completion problem, we need to assume that the matrix has some extra structure. We typically assume that the matrix is of low rank $r$, that is, we can write $\Theta = AB$, where $A$ and $B$ are of size $m \times r$ and $r \times m$ respectively. Intuitively, this is a useful assumption because now we only have to learn $2mr$ parameters instead of $m^2$. If $r$ is much smaller than $m$, then this is a clear gain.
From the perspective of machine learning, this changes the class of functions we are considering. Instead of all discretized functions on our grid inside $[0, 1]^2$, we now consider only those functions described by a matrix $\Theta$ of rank at most $r$. This also changes the parameters; instead of the $m^2$ entries of $\Theta$, we now have the $2mr$ entries of the two matrices $A$ and $B$.
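As a small sketch of what this buys us (illustrative numbers, not taken from the experiments below):
m, r = 100, 5

# Full matrix: m * m parameters
full_params = m * m
# Low-rank factorization Theta = A @ B, with A of shape (m, r) and B of shape (r, m)
A = np.random.normal(size=(m, r))
B = np.random.normal(size=(r, m))
low_rank_params = A.size + B.size

Theta = A @ B  # a matrix of rank at most r
print(f"{full_params=}, {low_rank_params=}")  # 10000 vs 1000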
Real data is often not uniform, so unless we use a very coarse grid, some entries of $\Theta$ are always going to be unknown. For example, below we show some more realistic data, generated from the same function as before plus some noise. The color indicates the value of the function we’re trying to learn.
num_intervals = 8
N = 50
# A function to make somewhat realistic looking 2D data
def non_uniform_data(N):
    np.random.seed(179)
    X = np.random.uniform(size=N)
    X = (X + 0.5) ** 2
    X = np.mod(X ** 5 + 0.2, 1)
    Y = np.random.uniform(size=N)
    Y = (Y + 0.5) ** 3
    Y = np.sin(Y * 0.2 * np.pi + 1) + 1
    Y = np.mod(Y + 0.6, 1)
    X = np.mod(X + 3 * Y + 0.5, 1)
    Y = np.mod(0.3 * X + 1.3 * Y + 0.5, 1)
    X = X ** 2 + 0.4
    X = np.mod(X, 1)
    Y = Y ** 2 + 0.5
    Y = np.mod(Y + X + 0.4, 1)
    return X, Y
# The function we want to model
def f(X, Y):
    return X + 2 * Y + 1.5 * ((X - 0.5) ** 2 + (Y - 0.5) ** 2)
X_train, Y_train = non_uniform_data(N)
X_test, Y_test = non_uniform_data(N)
Z_train = f(X_train, Y_train) + np.random.normal(size=X_train.shape) * 0.2
Z_test = f(X_test, Y_test) + np.random.normal(size=X_test.shape) * 0.2
plt.figure(figsize=(7, 6))
plt.scatter(X_train, Y_train, c=Z_train, s=50, cmap="inferno", zorder=3)
plt.colorbar()
# Plot a grid
X_grid = np.linspace(1 / num_intervals, 1, num_intervals)
Y_grid = np.linspace(1 / num_intervals, 1, num_intervals)
plt.xlim(0, 1)
plt.ylim(0, 1)
for perc in X_grid:
    plt.axvline(perc, c="gray")
for perc in Y_grid:
    plt.axhline(perc, c="gray")
We plotted an 8x8 grid on top of the data. We can see that in some grid squares we have a lot of data points, whereas in other squares there’s no data at all. Let’s try to fit a discretized function described by an 8x8 matrix of rank 3 to this data. We can do this using the ttml package I developed.
from ttml.tensor_train import TensorTrain
from ttml.tt_rlinesearch import TTLS
rank = 3
# Indices of the matrix Theta for each data point
idx_train = np.stack(
    [np.searchsorted(X_grid, X_train), np.searchsorted(Y_grid, Y_train)], axis=1
)
idx_test = np.stack(
    [np.searchsorted(X_grid, X_test), np.searchsorted(Y_grid, Y_test)], axis=1
)
# Initialize random rank 3 matrix
np.random.seed(179)
low_rank_matrix = TensorTrain.random((num_intervals, num_intervals), rank)
# Optimize the matrix using iterative method
optimizer = TTLS(low_rank_matrix, Z_train, idx_train)
train_losses = []
test_losses = []
for i in range(50):
    train_loss, _, _ = optimizer.step()
    train_losses.append(train_loss)
    test_loss = optimizer.loss(y=Z_test, idx=idx_test)
    test_losses.append(test_loss)
plt.figure(figsize=(DEFAULT_FIGSIZE))
plt.plot(train_losses, label="Training loss")
plt.plot(test_losses, label="Test loss")
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.legend()
plt.yscale("log")
print(f"Final training loss: {train_loss:.4f}")
print(f"Final test loss: {test_loss:.4f}")
Output
Final training loss: 0.0252
Final test loss: 0.0424
Above we see how the train and test loss develop during training. At first both losses decrease rapidly. Then both start to decrease much more slowly, with the training loss below the test loss. This means that the model overfits on the training data, but that is not necessarily a problem; the question is how much it overfits compared to other models. To see how good this model is, let’s compare it to a random forest.
from sklearn.ensemble import RandomForestRegressor
np.random.seed(179)
forest = RandomForestRegressor()
forest.fit(np.stack([X_train, Y_train], axis=1), Z_train)
Z_pred = forest.predict(np.stack([X_test, Y_test], axis=1))
test_loss = np.mean((Z_pred - Z_test) ** 2)
print(f"Random forest test loss: {test_loss:.4f}")
Output
Random forest test loss: 0.0369
We see that the random forest is a little better than the discretized function. And in fact, most standard machine learning estimators will beat a discretized function like this. This is essentially because the discretized function is very simple, and more complicated estimators can do a better job describing the data.
Does this mean that we should stop caring about the discretized function? Test loss is not the only criterion we should use to compare different estimators. Discretized functions like these have two big advantages:
1. They use very few parameters, and therefore very little memory.
2. Making a prediction is extremely fast, since it only requires looking up (or computing) a single entry of the matrix.
This makes them excellent candidates for low-memory applications. For example, we may want to implement a machine learning model for a very cheap consumer device. If we don’t need extreme accuracy, and we pre-train the model on a more powerful device, discretized functions can be a very attractive option.
The generalization to $d$ dimensions is now straightforward: we take a $d$-dimensional grid on $[0, 1]^d$, with $m$ subdivisions along each axis. Then we specify the value of our function on each of the $m^d$ grid cells. These values form a tensor $\Theta$, i.e. a $d$-dimensional array. We access the entries of $\Theta$ with a $d$-tuple of indices $(i_1, \dots, i_d)$.
This suffers from the same problems as in the 2D case: the tensor is really big, and during training we would need at least one data point for each entry of the tensor. But the situation is even worse; even storing $\Theta$ can be prohibitively expensive. For example, if $d = 10$ and $m = 20$, then we would need about 82 TB just to store the tensor! In fact, 20 grid points in each direction is not even that much, so in practice we might need a much bigger tensor still.
In the 2D case we solved this problem by storing the matrix as the product of two smaller matrices. There this doesn’t actually save that much memory; we mainly did it so that we could solve the matrix completion problem, that is, so that we could actually fit the discretized function to data. In higher dimensions, however, storing the tensor in the right way can save immense amounts of space.
In the 2D case we store matrices in low-rank form, as a product of two smaller matrices. But what is the correct analogue of ‘low rank’ for tensors? Unfortunately (or fortunately), there are many answers to this question. There are many ‘low-rank tensor formats’, all with very different properties. We will be focusing on tensor trains. A tensor train decomposition of an $m_1 \times m_2 \times \dots \times m_d$ tensor $\Theta$ consists of a set of cores $C_1, \dots, C_d$, where core $C_k$ has shape $r_{k-1} \times m_k \times r_k$ and $r_0, r_1, \dots, r_d$ are the ranks of the tensor train. Using these cores we can then express the entries of $\Theta$ using the following formula:
$$\Theta_{i_1, i_2, \dots, i_d} = \sum_{\alpha_1 = 1}^{r_1} \cdots \sum_{\alpha_{d-1} = 1}^{r_{d-1}} C_1[1, i_1, \alpha_1]\, C_2[\alpha_1, i_2, \alpha_2] \cdots C_d[\alpha_{d-1}, i_d, 1].$$
This may look intimidating, but the idea is actually quite simple. We should think of the core $C_k$ as a collection of $m_k$ matrices $C_k[i_k]$, each of shape $r_{k-1} \times r_k$. The index $i_k$ then selects which of these matrices to use. The first and last cores are special: by convention $r_0 = r_d = 1$, which means that $C_1$ is a collection of $1 \times r_1$ matrices, i.e. (row) vectors. Similarly, $C_d$ is a collection of $r_{d-1} \times 1$ matrices, i.e. (column) vectors. Thus each entry of $\Theta$ is determined by a product like this:
row vector × matrix × matrix × ⋯ × matrix × column vector
The result is a number, since a row/column vector times a matrix is a row/column vector, and the product of a row and column vector is just a number. In fact, if we think about it, this is exactly how a low-rank matrix decomposition works as well. If we write a matrix $\Theta = AB$, then
$$\Theta_{i,j} = \sum_{k=1}^{r} A_{i,k} B_{k,j} = A_{i,:}\, B_{:,j}.$$
Here $A_{i,:}$ is a row of $A$, and $B_{:,j}$ is a column of $B$. In other words, $A$ is just a collection of row vectors, and $B$ is just a collection of column vectors. Then to obtain an entry $\Theta_{i,j}$, we select the $i$-th row of $A$ and the $j$-th column of $B$ and take their product.
In summary, a tensor train is a way to cheaply store large tensors. Assuming all ranks are equal to $r$, a tensor train requires roughly $d m r^2$ numbers to store a tensor with $m^d$ entries; a huge gain if $d$ and $m$ are big. For context, if $d = 10$, $m = 20$, and $r = 10$, then instead of 82 TB we just need 131 KB to store the tensor; that’s about 9 orders of magnitude cheaper! Furthermore, computing entries of this tensor is cheap; it’s just a couple of matrix-vector products.
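To make both points concrete, here is a minimal sketch (with made-up helper names, independent of how ttml stores its cores) of the storage count and of evaluating a single entry as a chain of vector-matrix products:
d, m, r = 10, 20, 10

# Dense storage vs tensor train storage, counted in float64 entries
dense_entries = m**d
tt_entries = 2 * m * r + (d - 2) * m * r * r
print(f"dense: {dense_entries * 8 / 1e12:.0f} TB, TT: {tt_entries * 8 / 1e3:.0f} KB")

# Cores: the first has shape (1, m, r), the last (r, m, 1), the rest (r, m, r)
cores = (
    [np.random.normal(size=(1, m, r))]
    + [np.random.normal(size=(r, m, r)) for _ in range(d - 2)]
    + [np.random.normal(size=(r, m, 1))]
)

def tt_entry(cores, idx):
    # Multiply together the matrices C_k[i_k] selected by the multi-index
    vec = cores[0][:, idx[0], :]  # row vector of shape (1, r)
    for core, i in zip(cores[1:], idx[1:]):
        vec = vec @ core[:, i, :]
    return vec[0, 0]

print(tt_entry(cores, [3] * d))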
There is obviously a catch to this. Just like not every matrix is low-rank, not every tensor can be represented by a low-rank tensor train. The point, however, is that tensor trains can efficiently represent many tensors that we do care about. In particular, they are good at representing the tensors required for discretized functions.
How can we learn a discretized function represented by a tensor train? Like in the matrix case, many entries of the tensor $\Theta$ are unobserved, and we have to complete these entries based on the entries that we can estimate. In my post on matrix completion we saw that even the matrix case is tricky, and there are many algorithms to solve the problem. One thing these algorithms have in common is that they are iterative algorithms minimizing some loss function. Let’s derive such an algorithm for tensor train completion.
First of all, what is the loss function we want to minimize during training? It’s simply the least-squares loss:
$$\mathcal{L} = \frac{1}{N} \sum_{k=1}^{N} \bigl(f_\Theta(x_k) - y_k\bigr)^2.$$
Each data point $x_k$ falls into some grid cell given by a multi-index $\operatorname{idx}(x_k) = (i_1(x_k), \dots, i_d(x_k))$, so using the definition of the tensor train the loss becomes
$$\mathcal{L}(C_1, \dots, C_d) = \frac{1}{N} \sum_{k=1}^{N} \bigl(C_1[i_1(x_k)]\, C_2[i_2(x_k)] \cdots C_d[i_d(x_k)] - y_k\bigr)^2.$$
A straightforward approach to minimizing $\mathcal{L}$ is to just use gradient descent. We could compute the derivatives with respect to each of the cores and update the cores using these derivatives. This is, however, very slow, for a couple of somewhat subtle reasons.
To efficiently optimize $\mathcal{L}$, we can’t just use gradient descent as-is, and we are forced to walk a different route. While $\mathcal{L}$ is very non-linear as a function of the tensor train cores $C_1, \dots, C_d$, it is only quadratic in the entries of $\Theta$, and we can easily compute its derivative:
$$\nabla_\Theta \mathcal{L} = \frac{2}{N} \sum_{k=1}^{N} \bigl(\Theta_{\operatorname{idx}(x_k)} - y_k\bigr)\, E_{\operatorname{idx}(x_k)},$$
where $E_{i_1, \dots, i_d}$ denotes a sparse tensor that’s zero in all entries except $(i_1, \dots, i_d)$, where it takes value 1. In other words, $\nabla_\Theta \mathcal{L}$ is a sparse tensor that is both simple and cheap to compute; it just requires sampling at most $N$ entries of $\Theta$. For gradient descent we would then update $\Theta$ by $\Theta - \alpha \nabla_\Theta \mathcal{L}$, with $\alpha$ the stepsize. Unfortunately, this expression is not a tensor train. However, we can try to approximate it by a tensor train of the same rank as $\Theta$.
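In code, computing this sparse gradient might look like the following sketch (hypothetical helper; tt_entry as in the earlier sketch, idx an array with one row of grid indices per data point):
def sparse_gradient(cores, idx, y):
    # Accumulate 2/N * (Theta[idx_k] - y_k) for each occupied grid cell
    grad = {}
    N = len(y)
    for idx_k, y_k in zip(idx, y):
        key = tuple(idx_k)
        residual = tt_entry(cores, idx_k) - y_k
        grad[key] = grad.get(key, 0.0) + 2.0 / N * residual
    return grad  # maps multi-index -> gradient entry; all other entries are zero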
Recall that we can approximate a matrix $A$ by a rank $r$ matrix using the truncated SVD of $A$; in fact, this is the best possible approximation of $A$ by a rank $r$ matrix. There is a similar procedure for tensor trains: we can approximate a tensor by a rank $r$ tensor train using the TT-SVD procedure. While this is not the best approximation by such a tensor train, it is ‘quasi-optimal’ and pretty good in practice. The details of the TT-SVD procedure are a little involved, so let’s leave it as a black box. We now have the following iterative procedure for optimizing $\mathcal{L}$:
1. Compute the sparse gradient $\nabla_\Theta \mathcal{L}$ from the current tensor train and the training data.
2. Take a gradient step $\Theta - \alpha \nabla_\Theta \mathcal{L}$.
3. Use TT-SVD to truncate the result back to a tensor train of the original rank, and repeat.
If you’re familiar with optimizing neural networks, you might notice that this procedure looks like a good fit for stochastic gradient descent. Indeed, $\nabla_\Theta \mathcal{L}$ is a sum over all the data points, so we can just pick a subset of data points (a minibatch) to obtain a stochastic gradient. The reason we would normally want to do this is that there are so many data points that the cost of each step is dominated by computing the gradient. In our situation this is not true; the cost is dominated by the TT-SVD procedure. We therefore stick to more classical gradient descent methods. In particular, the function $\mathcal{L}$ can be optimized well with conjugate gradient descent using Armijo backtracking line search.
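For reference, backtracking line search with the Armijo condition boils down to the following generic sketch (not the actual ttml implementation):
def armijo_stepsize(f, x, grad, direction, alpha0=1.0, c=1e-4, shrink=0.5, max_iter=30):
    # Shrink the stepsize until the loss decreases enough along the search direction
    fx = f(x)
    slope = np.dot(grad.ravel(), direction.ravel())  # negative for a descent direction
    alpha = alpha0
    for _ in range(max_iter):
        if f(x + alpha * direction) <= fx + c * alpha * slope:
            break
        alpha *= shrink
    return alpha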
Let’s now see all of this in practice by training a discretized function represented by a tensor train on real data: the airfoil self-noise dataset. This NASA dataset contains experimental data about the self-noise of airfoils in a wind tunnel, originally used to optimize wing shapes. We can do the fitting and optimization using my ttml package.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# Load the data
airfoil_data = pd.read_csv(
    "airfoil_self_noise.dat", sep="\t", header=None
).to_numpy()
y = airfoil_data[:, 5]
X = airfoil_data[:, :5]
N, d = X.shape
print(f"Dataset has {N=} samples and {d=} features.")
# Do train-test split, and scale data to interval [0,1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=179
)
scaler = MinMaxScaler(clip=True)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Define grid, and find associated indices for each data point
num_intervals = 10
grids = [np.linspace(1 / num_intervals, 1, num_intervals) for _ in range(d)]
tensor_shape = tuple(len(grid) for grid in grids)
idx_train = np.stack(
    [np.searchsorted(grid, X_train[:, i]) for i, grid in enumerate(grids)],
    axis=1,
)
idx_test = np.stack(
    [np.searchsorted(grid, X_test[:, i]) for i, grid in enumerate(grids)],
    axis=1,
)
# Initialize the tensor train
np.random.seed(179)
rank = 5
tensor_train = TensorTrain.random(tensor_shape, rank)
# Optimize the tensor train using iterative method
optimizer = TTLS(tensor_train, y_train, idx_train)
train_losses = []
test_losses = []
for i in range(100):
    train_loss, _, _ = optimizer.step()
    train_losses.append(train_loss)
    test_loss = optimizer.loss(y=y_test, idx=idx_test)
    test_losses.append(test_loss)
plt.figure(figsize=(DEFAULT_FIGSIZE))
plt.plot(train_losses, label="Training loss")
plt.plot(test_losses, label="Test loss")
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.legend()
plt.yscale("log")
print(f"Final training loss: {train_loss:.4f}")
print(f"Final test loss: {test_loss:.4f}")
Output
Dataset has N=1503 samples and d=5 features.
Final training loss: 15.3521
Final test loss: 54.4698
We see a similar training profile to the matrix completion case. Let’s see now how this estimator compares to a random forest trained on the same data:
np.random.seed(179)
forest = RandomForestRegressor()
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
test_loss = np.mean((y_pred - y_test) ** 2)
print(f"Random forest test loss: {test_loss:.4f}")
Output
Random forest test loss: 3.2568
The random forest has a test loss of around 3.3, more than ten times better than the tensor train model’s 54.5. Why is our model so bad? Bad initialization!
Recall that a gradient descent method converges to a local minimum of the function. Usually we hope that whatever local minimum we converge to is ‘good’. Indeed for neural networks we see that, especially if we use a lot of parameters, most local minima found by stochastic gradient descent are quite good, and give a low train and test error. This is not true for our discretized function. We converge to local minima that have both bad train and test error.
The solution? Better initialization!
Instead of initializing the tensor train randomly, we can learn from other machine learning estimators. We fit our favorite machine learning estimator (e.g. a neural network) to the training data. This gives a function $g$. This function is defined for any input, not just for the training/test data points. Therefore we can first fit our discretized function $f_\Theta$ to match $g$, i.e. we solve the following minimization problem:
$$\min_\Theta \int_{[0,1]^d} \bigl(f_\Theta(x) - g(x)\bigr)^2 \, \mathrm{d}x.$$
One way to solve this minimization problem is by first (randomly) sampling a lot of new data points $x'_1, \dots, x'_M$ and then fitting $f_\Theta$ to these points with labels $g(x'_1), \dots, g(x'_M)$. This is essentially data augmentation, and it can drastically increase the number of data points available for training. With more training data, the function $f_\Theta$ will indeed converge to a better local minimum.
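In code, this kind of data augmentation could look like the following sketch (hypothetical helper; base_model is any fitted estimator with a predict method, such as the random forest above):
def augment_data(base_model, num_samples, d):
    # Sample new points uniformly in [0, 1]^d and label them with the base model
    X_aug = np.random.uniform(size=(num_samples, d))
    y_aug = base_model.predict(X_aug)
    return X_aug, y_aug

# For example, generate 100x more training data from a fitted random forest:
# X_aug, y_aug = augment_data(forest, 100 * len(X_train), X_train.shape[1])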
While data augmentation does improve performance, we can do better. We don’t need to sample the data points randomly. Instead, we can choose good points to sample: points that give us the most information on how to efficiently update the tensor train. This is essentially the idea behind the tensor train cross approximation algorithm, or TT-Cross for short. Using TT-Cross we can quickly and efficiently find a good approximate solution to the minimization problem above.
We could stop here. If $g$ models our data really well, and $f_\Theta$ approximates $g$ really well, then we should be happy. Like the matrix completion model, discretized functions based on tensor trains are fast and memory efficient, so we obtain an approximation of $g$ that uses less memory and makes faster predictions! However, the model $g$ really should be used for initialization only. Usually $f_\Theta$ doesn’t actually do a great job of approximating $g$, but if we first approximate $g$ and then use a gradient descent algorithm to improve $f_\Theta$ further on the training data, we end up with something much more competitive.
Let’s see this in action. This is actually much easier than what we did before, because the TTMLRegressor estimator in my ttml package takes care of all these steps for us.
from ttml.ttml import TTMLRegressor
# Use random forest as base estimator
forest = RandomForestRegressor()
# Fit tt on random forest, and then optimize further on training data
np.random.seed(179)
tt = TTMLRegressor(forest, max_rank=5, opt_tol=None)
tt.fit(X_train, y_train, X_val=X_test, y_val=y_test)
y_pred = tt.predict(X_test)
test_loss = np.mean((y_pred - y_test) ** 2)
print(f"TTML test loss: {test_loss:.4f}")
# Forest is fit on same data during fitting of tt
# Let's also report how good the forest does
y_pred_forest = forest.predict(X_test)
test_loss_forest = np.mean((y_pred_forest - y_test) ** 2)
print(f"Random forest test loss: {test_loss_forest:.4f}")
# Training and test loss are also recorded during optimization, let's plot them
plt.figure(figsize=(DEFAULT_FIGSIZE))
plt.plot(tt.history_["train_loss"], label="Training loss")
plt.plot(tt.history_["val_loss"], label="Test loss")
plt.axhline(test_loss_forest, c="g", ls="--", label="Random forest test loss")
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.legend()
plt.yscale("log")
Output
TTML test loss: 2.8970
Random forest test loss: 3.2568
We see that using a random forest for initialization gives a huge improvement in both training and test loss. In fact, the final test loss is better than that of the random forest itself! On top of that, this estimator doesn’t use many parameters:
print(f"TT uses {tt.ttml_.num_params} parameters")
Output
TT uses 1356 parameters
Let’s compare that to the random forest. If we look under the hood, the scikit-learn implementation of random forests stores 8 parameters per node in each tree in the forest. This is inefficient, and you really only need 2 parameters per node, so let’s use that.
num_params_forest = sum(
    [len(tree.tree_.__getstate__()["nodes"]) * 2 for tree in forest.estimators_]
)
print(f"Forest uses {num_params_forest} parameters")
Output
Forest uses 303180 parameters
That’s 1356 parameters vs. more than 300,000 parameters! What about my claim of prediction speed? Let’s compare the amount of time it takes both estimators to predict 1 million samples. We do this by just concatenating the training data until we get 1 million samples.
from time import perf_counter_ns
target_num = int(1e6)
n_copies = target_num // len(X_train) + 1
X_one_million = np.repeat(X_train, n_copies, axis=0)[:target_num]
print(f"{X_one_million.shape=}")
time_before = perf_counter_ns()
tt.predict(X_one_million)
time_taken = (perf_counter_ns() - time_before) / 1e6
print(f"Time taken by TT: {time_taken:.0f}ms")
time_before = perf_counter_ns()
forest.predict(X_one_million)
time_taken = (perf_counter_ns() - time_before) / 1e6
print(f"Time taken by Forest: {time_taken:.0f}ms")
Output
X_one_million.shape=(1000000, 5)
Time taken by TT: 430ms
Time taken by Forest: 2328ms
While not by orders of magnitude, we see that the tensor train model is faster. You might be thinking that this is just because the tensor train has fewer parameters, but this is not the case. Even if we use a very high-rank tensor train with high-dimensional data, it is still going to be fast. The speed scales really well, and will beat most conventional machine learning estimators.
With good initialization the model based on discretized functions performs really well. On our test dataset the model is fast, uses few parameters, and beats a random forest in test loss (in fact, it is the best estimator I have found so far for this problem). This is great! I should publish a paper at NeurIPS and get a job at Google! Well… let’s not get ahead of ourselves. It performs well on this particular dataset, yes, but how does it fare on other data?
As we shall see, it doesn’t do all that well actually. The airfoil self-noise dataset is a very particular dataset on which this algorithm excels. The model seems to perform well on data that can be described by a somewhat smooth function, and doesn’t deal well with the noisy and stochastic nature of most data we encounter in the real world. As an example let’s repeat the experiment, but let’s first add some noise:
from ttml.ttml import TTMLRegressor
X_noise_std = 1e-6
X_train_noisy = X_train + np.random.normal(0, X_noise_std, size=X_train.shape)
X_test_noisy = X_test + np.random.normal(scale=X_noise_std, size=X_test.shape)
# Use random forest as base estimator
forest = RandomForestRegressor()
# Fit tt on random forest, and then optimize further on training data
np.random.seed(179)
tt = TTMLRegressor(forest, max_rank=5, opt_tol=None, opt_steps=50)
tt.fit(X_train_noisy, y_train, X_val=X_test_noisy, y_val=y_test)
y_pred = tt.predict(X_test_noisy)
test_loss = np.mean((y_pred - y_test) ** 2)
print(f"TTML test loss (noisy): {test_loss:.4f}")
# Forest is fit on same data during fitting of tt
# Let's also report how good the forest does
y_pred_forest = forest.predict(X_test_noisy)
test_loss_forest = np.mean((y_pred_forest - y_test) ** 2)
print(f"Random forest test loss (noisy): {test_loss_forest:.4f}")
# Training and test loss are also recorded during optimization, let's plot them
plt.figure(figsize=(DEFAULT_FIGSIZE))
plt.plot(tt.history_["train_loss"], label="Training loss")
plt.plot(tt.history_["val_loss"], label="Test loss")
plt.axhline(test_loss_forest, c="g", ls="--", label="Random forest test loss")
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.legend();
Output
TTML test loss (noisy): 7.1980
Random forest test loss (noisy): 5.1036
Even a tiny bit of noise in the training data can severely degrade the model. We see that it starts to overfit a lot. This is because my algorithm tries to automatically find a ‘good’ discretization of the data, not just a uniform discretization as we have discussed in our 2D example (i.e. equally spacing all the grid cells). Some of the variables in this dataset are however categorical, and a small amount of noise makes it much more difficult to automatically detect a good way to discretize them.
The model has a lot of hyperparameters we won’t go into now, and playing with them does help with overfitting. Furthermore, the noisy data we show here is perhaps not very realistic. However, the fact remains that the model (at least the way it’s currently implemented) is not very robust to noise. In particular, the model is very sensitive to the discretization of the feature space.
Right now we don’t have anything better than simple heuristics for finding discretizations of the feature space. Since the loss function depends in a really discontinuous way on the discretization, optimizing the discretization is difficult. Perhaps we could use an algorithm that adaptively splits and merges the thresholds used in the discretization, or some kind of clustering algorithm for discretization. I have tried things along those lines, but getting them to work well is difficult. I think that with more study the problem of finding a good discretization can be solved, but it’s not easy.
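As one example of such a heuristic, here is a sketch of equal-frequency binning (not necessarily what ttml does): place the thresholds at empirical quantiles of each feature, so that every interval contains roughly the same number of training points.
def quantile_thresholds(X, num_intervals):
    # One array of thresholds per feature, at equally spaced empirical quantiles
    quantiles = np.linspace(0, 1, num_intervals + 1)[1:-1]
    return [np.quantile(X[:, i], quantiles) for i in range(X.shape[1])]

# thresholds = quantile_thresholds(X_train, 10)
# idx = np.stack(
#     [np.searchsorted(t, X_train[:, i]) for i, t in enumerate(thresholds)], axis=1
# )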
We looked at discretized functions and their use in supervised machine learning. In higher dimensions discretized functions are parametrized by tensors, which we can represent efficiently using tensor trains. The tensor train can be optimized directly on the data to produce a potentially useful machine learning model that is both very fast and uses few parameters. To initialize it well, we can first fit an auxiliary machine learning model on the same data, and then sample predictions from that model to effectively increase the amount of training data. The resulting model performs really well on some datasets, but in general it is not very robust to noise. As a result, without further improvements, it will only be useful in a select number of cases. On the other hand, I really think the model has a lot of potential once some of its drawbacks are fixed.