Paper: Machine Learning from a "Universe" of Signals: The Role of Feature Engineering
The case for LightGBM in Portfolio123.com + Feature Engineering
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5248179
“In summary, our results suggest that employing machine learning methods is beneficial for real-time investors, and that feature engineering plays a vital role in substantially elevating these benefits.”
“In this paper, we examine machine learning strategies based on a “universe” of over 18,000 fundamental signals that are accessible to investors in real time.”
—> So: they first use no feature engineering/selection, and only later test their ML implementation with curated feature sets (which shows better results). I.e., Feature Selection is key! Brute force is not optimal! Use your intuition to select features. This gives experienced “traditional” factor investors a huge head start!
Also use Deepseek to come up with feature sets!
Extra Trees —> BRT
BRT stands for Boosted Regression Trees, which is a type of ensemble learning where many weak learners (typically shallow decision trees) are combined using gradient boosting to create a strong predictive model.
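For the mechanics, here is a minimal sketch in Python with LightGBM, one widely used BRT/GBDT implementation (toy data; column names and parameter values are my own illustration, not from the paper):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 5)),
                 columns=[f"signal_{i}" for i in range(5)])  # hypothetical signals
y = 0.5 * X["signal_0"] + rng.normal(scale=0.1, size=1000)   # toy target

# Many weak learners (shallow trees), combined with shrinkage and subsampling.
model = lgb.LGBMRegressor(
    n_estimators=200,     # number of boosted trees
    max_depth=3,          # each one shallow (a "weak" learner)
    learning_rate=0.05,   # shrinkage
    subsample=0.8,        # row subsampling
    subsample_freq=1,     # apply row subsampling every iteration
    colsample_bytree=0.8, # feature subsampling per tree
)
model.fit(X, y)
```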
The Case for ExtraTrees (on large caps!)
“We choose BRT as our primary machine learning method for several reasons. First, BRT routinely rank among the very best machine learning algorithms in both finance and non-finance applications.
Second, BRT can handle large data sets with high dimensionality without overfitting because they simultaneously perform subsampling, model combination, and shrinkage.
Third, BRT are robust to missing values and outliers (Hastie, Tibshirani, and Friedman, 2009).”
“In particular, BRT are invariant under all monotone transformations of the individual input variables, making the forecasts generated robust to extreme values.”
E.g. you can skip normalizing features when using LightGBM (a BRT implementation) or ExtraTrees (a related tree ensemble, bagged rather than boosted): tree splits depend only on the ordering of values, not their scale.
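A toy check of that invariance (my illustration, not the paper's code): because tree splits depend only on the ordering of feature values, a strictly monotone transform (here exp) should leave the fitted predictions unchanged.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=500)

# Same model, once on raw features and once on a monotone transform of them.
pred_raw = lgb.LGBMRegressor(n_estimators=100, random_state=0).fit(X, y).predict(X)
pred_trf = lgb.LGBMRegressor(n_estimators=100, random_state=0).fit(np.exp(X), y).predict(np.exp(X))

print(np.allclose(pred_raw, pred_trf))  # expected: True -- no z-scoring/min-max needed
```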
Feature importance (BRT delivers that!)
“Finally, because BRT are rooted in the CART framework, they possess good interpretability. For example, BRT return the rank and relative importance of all the potential regressors available, known as relative influence measures. This feature distinguishes BRT from harder-to-interpret methods such as neural networks.”
Stuff like that: https://www.portfolio123.com/sv/aiFactor/12254/prediction/importance
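Reading that “relative influence” ranking out of a fitted BRT is a few lines (a sketch; data and column names are invented for illustration, and `importance_type` is LightGBM's own knob, not the paper's terminology):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 4)),
                 columns=["value", "momentum", "quality", "noise"])  # hypothetical
y = 0.6 * X["value"] + 0.3 * X["momentum"] + rng.normal(scale=0.1, size=1000)

# importance_type="gain" ranks features by the total split gain they contributed.
model = lgb.LGBMRegressor(n_estimators=200, importance_type="gain").fit(X, y)
print(pd.Series(model.feature_importances_, index=X.columns)
        .sort_values(ascending=False))
```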
Feature (Factor) Choice does matter
”To investigate whether the weaker performance is due to limitations in our machine learning implementation, we replicate our analyses on datasets of published anomalies. Using the Green, Hand, and Zhang (2017, GHZ) sample, our BRT and neural network models achieve equal-weighted long-short returns exceeding 3.5% per month with Sharpe ratios of 2.21-2.81, matching the performance reported by Gu, Kelly, and Xiu (2020). The results are even stronger with the Chen and Zimmermann (2022, CZ) sample: BRT models achieve equal-weighted returns of 5.14% per month with an annualized Sharpe ratio of 3.64. These results indicate that our implementation can deliver performance on par with prior studies when using curated predictors. This suggests that the choice of input predictors, rather than the ML implementation itself, is the key driver of the performance gap between our strategies and those based on published signals.”
Interesting implementation!
“We implement our BRT model by following Gu, Kelly, and Xiu (2020). We divide our sample period (1963-2019) into 12 years of training sample (1963-1974), 12 years of validation sample (1975-1986), and the remaining 33 years (1987-2019) for out-of-sample testing.”
“We refit our model every year because our fundamental signals are updated annually. Each time we refit the model, we increase the training sample by one year while maintaining the length of the validation period at 12 years.”
“Specifically, we first use the training sample to estimate the model under each set of hyperparameter values. We then use the hyper-parameters that show the best performance during the validation period to re-estimate the final model. For example, suppose we want to forecast the cross-section of stock returns for 1987. We fit models under different hyper-parameter values during the training period 1963-1974 and then use the validation period 1975-1986 to gauge the performance of these trained models. We choose the hyper-parameters that deliver the best performance during the validation period and then use these hyper-parameters to re-estimate the final model for the combined training and validation period 1963-1986.”
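Here is a minimal sketch of that scheme: expanding training window, fixed 12-year validation window, annual refit, hyperparameters picked on validation, final model re-estimated on train + validation combined. The `panel` DataFrame, its column names, and the small grid are my own illustrative assumptions, not the paper's code.

```python
import lightgbm as lgb

PARAM_GRID = [{"max_depth": d, "learning_rate": lr, "n_estimators": 300}
              for d in (2, 3, 4) for lr in (0.01, 0.05, 0.1)]

def fit_for_year(panel, test_year, first_year=1963, val_len=12):
    """panel: DataFrame with a 'year' column, feature columns, and target 'ret'."""
    val_start = test_year - val_len  # e.g. 1975 when forecasting 1987
    is_train = (panel["year"] >= first_year) & (panel["year"] < val_start)
    is_val = (panel["year"] >= val_start) & (panel["year"] < test_year)
    Xcols = [c for c in panel.columns if c not in ("year", "ret")]

    def val_mse(params):  # score one hyperparameter set on the validation window
        m = lgb.LGBMRegressor(**params).fit(panel.loc[is_train, Xcols],
                                            panel.loc[is_train, "ret"])
        err = m.predict(panel.loc[is_val, Xcols]) - panel.loc[is_val, "ret"]
        return (err ** 2).mean()

    best = min(PARAM_GRID, key=val_mse)
    # Re-estimate on the combined sample up to (but excluding) the test year.
    full = panel["year"] < test_year
    return lgb.LGBMRegressor(**best).fit(panel.loc[full, Xcols],
                                         panel.loc[full, "ret"])
```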
“We, therefore, assign all missing values as zero.”
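In pandas that is essentially one line (a sketch, reusing the hypothetical `panel` and column names from above):

```python
# Replace missing fundamental signals with zero, as the paper does.
feature_cols = [c for c in panel.columns if c not in ("year", "ret")]
panel[feature_cols] = panel[feature_cols].fillna(0)
```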
I am doing pretty much the same (with less data!): train —> then holdout —> see whether all MLs are doing fine (robustness test!) —> choose the top ML (or one or several of the top 5) as the predictor for a ranking system (see the sketch below).
Though I am not using a grid search for the best hyperparameters (I tune them via Deepseek instead, which is much more cost-effective; grid search = high server costs!).
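A sketch of that train —> holdout robustness check (toy data; models and metric chosen for illustration): if several quite different learners score similarly on the holdout, the features are doing the work, not one lucky model.

```python
import numpy as np
import lightgbm as lgb
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=1000)
X_train, X_hold = X[:800], X[800:]  # chronological split stand-in
y_train, y_hold = y[:800], y[800:]

candidates = {
    "lightgbm": lgb.LGBMRegressor(n_estimators=200, max_depth=3),
    "extra_trees": ExtraTreesRegressor(n_estimators=200),
    "ridge": Ridge(alpha=1.0),
}
scores = {name: mean_squared_error(y_hold, m.fit(X_train, y_train).predict(X_hold))
          for name, m in candidates.items()}
best = min(scores, key=scores.get)  # promote this one to the ranking system
```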
Agree —> AI Factor Models show (in general) higher risk-adjusted returns!
“…we find that portfolios with higher BRT predicted returns have higher average realized risk-adjusted returns.”
Use Base Design Choices and then concentrate on Features!
In general —> concentrate on Feature Engineering. For this you need good (maybe not perfect) base design choices for the Scaling, Target, Universe, and Target Normalization.
Otherwise, there is just too much to test to get early results!
Public AI Stock Models for Large-Cap Investing: Strategies & Insights
Factor Design, Target Scaling, and Capital Curves — All Open-Source:
Best Regards
Andreas