Forecasting National Economic Sentiment from Beige Book Publications with VADER, CHRONOS, and ARIMA
The plural of anecdote is not “data”…
— Irwin S. Bernstein
TL;DR
- Here is the code
- Sentiment scores calculated from 2011–2023 Beige Book publications were used to fit the ARIMA model and supplied as historical context to CHRONOS, for the sake of comparison
- Both models seem to suggest that national economic sentiment will increase by ~50% by the May 29, 2024 Beige Book release and then remain largely unchanged through the back half of the year
- Neither model seems to have predicted the nearly 50% drop in calculated sentiment between the March 6th and April 17th Beige Book publications
- CHRONOS outperforms ARIMA on our forecasting task, posting substantially lower values on the selected error metrics
Intro & Rationale
By definition, the Federal Reserve’s Beige Book publication is a collection of survey responses woven together into a narrative about the condition of the U.S. economy, as well as that of each of its 12 bank districts. Despite the myriad issues surrounding the statistical analysis of survey data in general (let alone generalizing meaningfully from it), scoring each report sentence by sentence across time gives us a sentiment series we can forecast, so let’s do that!
Rather than putting you through the rigors of scraping all of the text data for the period being examined (2011 to present) or cobbling together the workflow end-to-end, I am going to give all that to you straightaway (link to Colab notebook) so that we can chat more generally about the approach, findings, and ramifications. While we won’t be diving too deeply into anything in particular in this write-up, you will walk away with some additional intuition around web scraping, sentiment analysis, and time series forecasting.
We take as our focus the task of forecasting the national economic sentiment for each scheduled Beige Book publication date in 2024. Knowing that a few Beige Books have already been published this year, we are able to calculate sentiment scores from them and use those as “ground truth” to evaluate the performance of Amazon’s smallest CHRONOS forecasting model (chronos-t5-small) relative to that of a simple (1,1,1) ARIMA model (in other words, an autoregressive integrated moving average model with one autoregressive term, one differencing term, and one moving average term).
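As a rough sketch of how that evaluation split might be set up (the `sentiment` Series below is a hypothetical placeholder for the aggregated scores we build in the next section):

```python
import pandas as pd

# Hypothetical: `sentiment` is a pandas Series of mean national sentiment
# scores indexed by Beige Book publication date, covering 2011 through the
# 2024 releases published so far.
cutoff = pd.Timestamp("2024-01-01")

context = sentiment[sentiment.index < cutoff]        # 2011–2023 history fed to both models
ground_truth = sentiment[sentiment.index >= cutoff]  # 2024 releases published so far

# The Beige Book comes out eight times a year, so we forecast all eight
# 2024 dates and score the models only on those already published.
prediction_length = 8
```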
Without going too far into the details, CHRONOS is the byproduct of Amazon’s efforts to repurpose LLM architecture for the task of forecasting. What’s interesting is that the authors even hypothesize that “Chronos learns general representations that can potentially be deployed for tasks beyond forecasting.” Elaboration on such tasks was beyond the scope of their paper, unfortunately.
Methodology
Some helper functions were created in order to read in and parse text data from Beige Book PDFs, but web scraping wound up being more efficient.
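A minimal sketch of what that scraping step might look like is below; the URL pattern and the decision to grab every `<p>` tag are assumptions on my part, so defer to the linked notebook for the actual workflow:

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL pattern for the HTML version of a Beige Book release;
# the real pattern may differ slightly by year.
URL_TEMPLATE = "https://www.federalreserve.gov/monetarypolicy/beigebook{date}.htm"

def fetch_report_text(date_token: str) -> str:
    """Fetch a Beige Book page and return its visible paragraph text."""
    resp = requests.get(URL_TEMPLATE.format(date=date_token), timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

# e.g., fetch_report_text("202401") for the January 2024 release (hypothetical date token)
```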
Nevertheless, we wind up with the national summary of economic conditions as well as that for each of the 12 Federal Reserve bank districts, broken out sentence-by-sentence as extracted from the Beige Books for the period being examined:
VADER (Valence Aware Dictionary and sEntiment Reasoner), “a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media”, was used to calculate a sentiment score for each sentence, from which mean sentiment scores were aggregated by report date. This allows for a model to be trained with the target of forecasting the national economic sentiment score or that of a given district; but the ability to compare how, say, Atlanta and San Francisco have recovered from the pandemic relative to the nation is also unlocked, which is just plain cool.
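A sketch of that scoring step, assuming the sentences live in a DataFrame with `report_date`, `region`, and `sentence` columns (hypothetical names; the toy rows below are just for illustration):

```python
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Toy stand-in for the real sentence-level DataFrame.
df = pd.DataFrame({
    "report_date": ["2024-01-17", "2024-01-17"],
    "region": ["National Summary", "National Summary"],
    "sentence": [
        "Economic activity increased slightly in most Districts.",
        "Contacts reported weaker demand and declining sales.",
    ],
})

# VADER's compound score is a normalized sentiment value in [-1, 1] per sentence.
df["sentiment"] = df["sentence"].apply(lambda s: analyzer.polarity_scores(s)["compound"])

# Mean sentiment per publication date for the national summary.
national = (
    df[df["region"] == "National Summary"]
    .groupby("report_date")["sentiment"]
    .mean()
    .sort_index()
)
```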
It should be noted that a model could have just as easily been trained to predict the magnitude of the change in national economic sentiment by simply calculating the percent change in sentiment across time:
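With the aggregated scores in a pandas Series like the hypothetical `national` above, that alternative target is a one-liner:

```python
# `national` is the hypothetical mean-sentiment Series from the sketch above.
# Period-over-period percent change in national sentiment, which could serve
# as the forecasting target instead of the level itself.
sentiment_change = national.pct_change().dropna()
```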
Regardless, another cool thing about forecasting with CHRONOS is that the forecasts are probabilistic, meaning that we can visualize the confidence interval for each prediction:
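Here’s a sketch of what that looks like with the chronos-forecasting package, assuming the 2011–2023 national scores sit in a `context` Series (hypothetical name; this mirrors the package’s README usage, so double-check against the version pinned in the notebook):

```python
import numpy as np
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# `context` is the hypothetical 2011–2023 national sentiment Series.
# Sample-based (probabilistic) forecasts for the eight 2024 release dates.
forecast = pipeline.predict(
    context=torch.tensor(context.values, dtype=torch.float32),
    prediction_length=8,
    num_samples=100,
)  # shape: [num_series, num_samples, prediction_length]

# Quantiles across samples give us the prediction interval to plot.
low, median, high = np.quantile(forecast[0].numpy(), [0.1, 0.5, 0.9], axis=0)
```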
On the other hand, while ARIMA is not a probabilistic forecasting model, we are able to gauge its performance with its accompanying report:
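The corresponding statsmodels fit is only a few lines, again assuming the hypothetical `context` Series of 2011–2023 scores:

```python
from statsmodels.tsa.arima.model import ARIMA

# `context`: hypothetical 2011–2023 sentiment Series from the earlier sketches.
# ARIMA(1,1,1): one autoregressive term, one difference, one moving average term.
arima_fit = ARIMA(context, order=(1, 1, 1)).fit()

print(arima_fit.summary())                    # log likelihood, AIC, BIC, HQIC, coefficients
arima_forecast = arima_fit.forecast(steps=8)  # point forecasts for the 2024 release dates
```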
We won’t dwell on it here, but higher log likelihood values indicate a better fit of the model to the data, while for AIC, BIC, and HQIC, lower values indicate the better fit.
The important thing with respect to our task is quantifying the performance of CHRONOS in comparison to the ARIMA model — a fair comparison given that the authors of the CHRONOS paper themselves point out that “[t]raditionally, forecasting has been dominated by statistical models such as ARIMA and ETS [Error Trend and Seasonality, or exponential smoothing].”
Results
Not only are we able to compare forecasts…
…but also CHRONOS’ performance in terms of traditional error metrics:
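Here’s a sketch of how such a comparison might be computed; MAE, RMSE, and MAPE are illustrative choices on my part, and the variable names carry over from the earlier sketches:

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """MAE, RMSE, and MAPE for a point forecast against the observed scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return {"MAE": mae, "RMSE": rmse, "MAPE (%)": mape}

# Illustrative metric selection; compare each model only on the 2024
# releases that have actually been published.
n = len(ground_truth)
print("CHRONOS:", error_metrics(ground_truth, median[:n]))
print("ARIMA:  ", error_metrics(ground_truth, np.asarray(arima_forecast)[:n]))
```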
Conclusion
Zooming back out, we just used a repurposed LLM to forecast national economic sentiment from a collection of survey responses! And it outperformed what is considered to be a standard forecasting model!! Very interesting times, indeed. The implication, of course, being that we are one step closer to “a unified, general-purpose forecasting model, a goal that remains a beacon for time series researchers.”
Thank you for taking the time to read this. If you want to learn more about web scraping, I’ve put together a very brief workshop on the subject that will likely be of interest. If you have an interest in using Python to work with time series data, I wrote an article about that too … I think that’s it for now — please feel free to reach out if ya wanna chat or collaborate!
https://discord.com/invite/NbQ9ucMjzd
References
Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, Yuyang Wang. Chronos: Learning the Language of Time Series. arXiv:2403.07815, 2024.
C.J. Hutto and Eric Gilbert. VADER (Valence Aware Dictionary and sEntiment Reasoner). https://github.com/cjhutto/vaderSentiment/tree/master, 2014.
Josef Perktold, Skipper Seabold, and Jonathan Taylor. statsmodels.tsa.arima.model.ARIMA. https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html, 2009.
Board of Governors of the Federal Reserve System. Beige Book. https://www.federalreserve.gov/monetarypolicy/publications/beige-book-default.htm, 2011–2024.