Part 1 of an exploratory series on the TimeGPT model from Nixtla. We'll take a look at how the model works and how it changes the way we think about common forecasting problems.
Recently, I've been fascinated by a new paradigm in forecasting started by Nixtla and their TimeGPT model.
As you can tell from the name, they are continuing with the popular 'GPT' ending made famous by ChatGPT.
It's a fitting name: GPT is a general term for a generative pre-trained transformer (the neural network architecture the model uses), and time is the domain they're working with.
Let's take a little time together to learn more about the model and how it enables a new way of thinking in forecasting.
To inform this post I'll be referencing their paper with the latest revisions as of May 2024.
https://arxiv.org/abs/2310.03589
Before we get into it let me give a little background on where I'm coming from with this post.
Before Ouro, I was a data scientist with a specialty in forecasting. I've spent the last four years working on forecasting problems at a big tech & social media company, so this topic feels right in my wheelhouse. With access to large amounts of data there, I've trained similar transformer-based models and have seen how they perform in the real world.
I'm also familiar with Nixtla's open source packages, primarily neuralforecast and statsforecast.
Nixtla isn't paying me to say any of this. I'm simply here to share something I'm excited about and I hope others will use this platform to do the same.
That said, it's hard to express just how cool I think TimeGPT is. Before Nixtla came out with their model, I had plans to implement the same idea. It really is a simple idea: collect large amounts of time series data and train one large model on it.
The efficacy of pre-trained models had been demonstrated by LLMs, so applying the concept to the time domain should have been on every forecaster's mind. The next-word prediction done by LLMs is a strikingly similar task to next-value prediction in time series forecasting.
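If that parallel isn't obvious, here's a toy sketch of the shared idea. Everything in it is made up for illustration (the naive predict_next below is a stand-in for a real model); the point is that forecasting, like text generation, is autoregressive: predict the next element of the sequence, append it, and repeat.

```python
# Toy illustration only: a deliberately naive "model" that predicts the next
# value as the last value plus the average step seen so far.
def predict_next(values: list[float]) -> float:
    steps = [b - a for a, b in zip(values, values[1:])]
    return values[-1] + sum(steps) / len(steps)

history = [112.0, 118.5, 132.1, 129.4, 141.0]

# Generate a 6-step forecast the same way an LLM generates text:
# predict one element, append it to the sequence, and predict again.
sequence = list(history)
for _ in range(6):
    sequence.append(predict_next(sequence))

print(sequence[len(history):])  # the forecasted values
```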
At my previous company Nousot, we sold custom forecasting services to our clients. Primary among these services was demand forecasting. Clients would come to us to forecast sales volume (demand) so they could better plan things like resource allocation, inventory, and staffing. At scale, even a 1% accuracy improvement over existing forecasts can result in millions of dollars saved.
With proximity to all this client data, my hope was to use it to train a global model we could use to serve our clients with more accurate forecasts. This should have been possible because we had multiple clients doing demand forecasting, which meant the data we were working with was all somewhat similar.
The idea is that a model trained on all of the available data (across multiple clients) is going to give better predictions than the one that is trained on just an individual client's data.
Unfortunately, clients like the ones we worked with didn't want to share their data. This disappointment was actually one of the core reasons I started Ouro as a place that could enable this kind of thing. More on that here:
So let's get into it. I'm going to do my best to keep this high-level; where applicable, I'll provide references to source material if you want to go deeper.
Before we can talk about the model and its architecture, we need to understand that training data is P0 (the top priority) for a good model.
From what I've heard from recent LLM research, there is a certain model scale beyond which the architecture matters less and less, and improvements in performance come mostly from better data. Model scale here refers to the number of parameters used to represent the neural network.
We'll explore how this lesson might apply in the next section, but for now, understand that the collection of data is likely the primary driver of the model's success.
From section 5.2 Training dataset of the paper,
TimeGPT was trained on, to our knowledge, the largest collection of publicly available time series, collectively encompassing over 100 billion data points. This training set incorporates time series from a broad array of domains, including finance, economics, demographics, healthcare, weather, IoT sensor data, energy, web traffic, sales, transport, and banking.
we learn the scale and the range of domains of the data the model was trained on. Let's first clarify why this matters.
Training data is THE determinant of predictions for a global model like TimeGPT.
What does this mean, and why does it matter?
When the model is being trained, it's seeing various distributions of data and time series dynamics. Think things like seasonalities, shape, trend, spikiness, energy, etc.
Most of the weights in the model are there to "remember" these dynamics, so that when you ask for a forecast, the model can call on those "memories" and predict what it has seen usually follow in series like yours.
The sketch below shows what that might look like. We have two series used to train the model, then a portion of a third out-of-sample series that we want to predict.
From the prediction in blue you can see how the model has learned the dynamics of this kind of series. When there is a small hump, there is always a larger hump that immediately follows. Values then return to a baseline after the second hump.
With only three time series to look at, there's no saying whether this is a good model, and it's very possible that the series we're predicting on does not follow this same pattern.
However, what we've seen illustrated here can give us a deeper intuition about how this kind of model works.
When you're asking for a prediction from the model, it will give you values that are influenced by all of the data it has been trained on.
Looking at just the third, out-of-sample series, there was nothing in its own history that could have told us another hump was coming. Nonetheless, the model could predict it anyway because it had seen other series just like it. This is the power of a global model like TimeGPT.
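To make that concrete, here's a minimal sketch of what a zero-shot forecast looks like in code. It assumes Nixtla's Python client (the nixtla package) and its NixtlaClient.forecast method; the API key is a placeholder and the series is synthetic, so treat this as an illustration rather than a copy-paste recipe.

```python
import numpy as np
import pandas as pd
from nixtla import NixtlaClient  # Nixtla's hosted TimeGPT client (assumed installed)

# Placeholder key; you'd use your own from Nixtla.
client = NixtlaClient(api_key="YOUR_API_KEY")

# A small synthetic series standing in for data the model has never seen:
# a weekly cycle plus a gentle upward trend and some noise.
rng = np.random.default_rng(0)
n = 90
df = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=n, freq="D"),
    "y": 100 + 0.5 * np.arange(n)
         + 10 * np.sin(2 * np.pi * np.arange(n) / 7)
         + rng.normal(0, 2, n),
})

# No fitting step on our own data: the pretrained model forecasts it directly.
forecast = client.forecast(df=df, h=14, time_col="ds", target_col="y")
print(forecast.head())
```

Notice there's no .fit() on our series anywhere; that's the zero-shot behavior described above.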
To really nail this point, let's look at an example where the historic dynamics of a time series actually do tell us about how to predict future values.
Most forecasting models need to be trained on the time series you want to predict. This is because they work somewhat differently than a model like TimeGPT. When you train a local model like Prophet, you are fitting a model that learns the patterns of the one series you want to predict. All of your predictive power comes from the patterns you can extract from that series and replicate into the future as a prediction.
In some cases, that's all you need.
Look at a series like the one below. There is a very clear pattern in the past that can be replicated in the future. This is often called seasonality, although it can also take the form of holidays.
Time series often shrink or grow in a predictable way at predictable times, like each week, month, or year. When you've seen it happen enough times in the past, you can usually say it's going to happen again.
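For contrast with the zero-shot sketch earlier, here's what the local-model workflow looks like with Prophet. The synthetic weekly series below is just a stand-in for a series with clear seasonality; the key difference to notice is the explicit fit on the one series being predicted.

```python
import numpy as np
import pandas as pd
from prophet import Prophet  # local model: fit on the single series you want to predict

# Synthetic daily series with a clear weekly pattern plus noise.
rng = np.random.default_rng(1)
n = 365
df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=n, freq="D"),
    "y": 50 + 15 * np.sin(2 * np.pi * np.arange(n) / 7) + rng.normal(0, 3, n),
})

# Unlike TimeGPT, this model must be trained on the series before predicting it.
m = Prophet(weekly_seasonality=True)
m.fit(df)

future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)
print(forecast[["ds", "yhat"]].tail())
```

Every pattern in that forecast comes from this one series; if the weekly cycle weren't visible in the history, the model would have no way to produce it.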
Unfortunately, most real-world time series have much more going on. Seasonality is just one kind of dynamic that could be present. Often there's going to be a trend, sometimes more than one seasonality, and an unknown set of other factors that influence the value of the series.
This brings us back to the power of a global model. Unlike a local model, models like TimeGPT have the capacity to learn many of these dynamics simultaneously.
You get to leverage the world's data to learn the patterns behind the complex dynamics present in your own data.
That's all for this one. Stay tuned for part 2! If you want to give the model a try yourself, check it out here: