Why Regression Models Are Inappropriate for Heterogeneous Populations, Why Traditional Thinking in Modeling Cannot Get Us Where We Need to Be, and The Type of Thinking That Can
We cannot have accurate predictions when even our best predictors are irrelevant for some patients. Even testing for interaction terms or running subgroup analyses falls short of the potential I discovered.
A regression model, whether it's simple linear regression or a more complex multiple regression, is a powerful tool for exploring relationships between a dependent (outcome) variable and one or more independent (predictor) variables. However, regression models have certain assumptions that limit their applicability and accuracy.
One major limitation is the assumption that every independent variable is relevant for every individual in the study. In other words, the effects of the predictors are treated as constant across all subjects, which is often not the case in real-world scenarios. This limitation is closely tied to the "homogeneity of regression slopes" assumption and can lead to biased or inefficient estimates if violated. The impact of this assumption varies depending on the nature of the regression model used.
In traditional regression models, such as Ordinary Least Squares (OLS), the effect of an independent variable on the dependent variable is fixed and does not vary between individuals or groups. This may not be a valid assumption in many cases. For instance, if you are studying the impact of educational level on income, that impact may well vary across demographic groups; the effect might differ by age, sex, race, and so on. It is difficult to know whether one should adjust for these covariates as confounders when the functional relationships are not well established, and it is risky to "adjust" incessantly instead of testing model accuracy with model selection procedures. Importantly, and counter to basically all of CDC/HHS's prescribed methods, the effect that an adjustment has on the implications of a key variable is not, by itself, justification for making that adjustment.
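To make the fixed-slope assumption concrete, here is a minimal sketch in Python (statsmodels), using synthetic data and hypothetical variable names (income, education, group), in which the pooled model assumes one slope for everyone while an interaction term lets the slope differ by group:

```python
# Minimal illustration of the fixed-slope assumption; data are synthetic and
# the variable names (income, education, group) are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
group = rng.choice(["A", "B"], size=n)
education = rng.normal(14, 2, size=n)
# The "return" to education differs by group: the slope is not constant.
slope = np.where(group == "A", 3.0, 0.5)
income = 20 + slope * education + rng.normal(0, 5, size=n)
df = pd.DataFrame({"income": income, "education": education, "group": group})

pooled = smf.ols("income ~ education", data=df).fit()                 # one slope for everyone
interacted = smf.ols("income ~ education * C(group)", data=df).fit()  # slopes may differ
print(pooled.params)
print(interacted.params)  # education:C(group)[T.B] captures the slope difference
```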
An assumption of homogeneity is essentially an assumption of simplicity - that one variable can impact another in a simple and universal way. This limitation is part of a broader issue called omitted variable bias (OVB). In a regression model, if you omit a variable that influences the dependent variable and is correlated with an included independent variable, you will have biased and inconsistent estimates.
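The bias itself is easy to demonstrate. Below is a small simulated example (synthetic data, illustrative coefficients) in which a variable z influences y and is correlated with x; dropping z inflates the estimated coefficient on x:

```python
# Small simulation of omitted variable bias with synthetic data:
# z influences y and is correlated with x, so dropping z biases the estimate on x.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)          # x is correlated with z
y = 2.0 * x + 3.0 * z + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
omitted = sm.OLS(y, sm.add_constant(x)).fit()
print(full.params)     # [const, ~2.0 on x, ~3.0 on z]
print(omitted.params)  # coefficient on x is inflated, absorbing part of z's effect
```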
While some might consider interaction terms or subgroup analyses a solution, these have their own challenges. It would be impractical to create interaction terms for every possible combination of variables, or to analyze each subgroup separately, because of power issues and the risk of spurious findings from multiple testing.
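The combinatorial problem alone is decisive; a quick count (with illustrative predictor counts) shows how fast the number of candidate interaction terms grows:

```python
# How quickly exhaustive interaction terms become impractical.
from math import comb

for p in (10, 20, 50):
    pairwise = comb(p, 2)
    three_way = comb(p, 3)
    print(f"{p} predictors: {pairwise} pairwise and {three_way} three-way interactions")
```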
Furthermore, including too many variables creates a risk of overfitting. Overfitting occurs when a model learns the detail and noise in the training data to the extent that it degrades performance on new data: random fluctuations in the training data are picked up and learned as concepts by the model. This problem is particularly pronounced in small-sample studies.
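A toy example, using synthetic data, of how overfitting shows up in a small sample: a high-degree polynomial tracks the training noise almost perfectly yet fails on held-out data, while the simpler model generalizes:

```python
# Toy overfitting demonstration on a small synthetic sample.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(40, 1))
y = 1.5 * X.ravel() + rng.normal(0, 2, size=40)   # truly linear plus noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    # train R^2 vs. test R^2: the degree-12 fit wins on train and loses on test
    print(degree, round(model.score(X_tr, y_tr), 2), round(model.score(X_te, y_te), 2))
```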
The assumption that every independent variable is relevant for every individual in the study has led to the development of more flexible regression techniques. One such approach is multilevel (or hierarchical) models, which allow the regression coefficients themselves to vary randomly across groups or individuals, acknowledging that the relationship between independent and dependent variables might not be the same for all groups or individuals.
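As a rough sketch of what a random-slopes multilevel model looks like in practice, here is a minimal example using statsmodels MixedLM on synthetic data with hypothetical column names; the re_formula argument requests a random slope so each group can have its own coefficient on x:

```python
# Minimal random-slopes sketch with statsmodels MixedLM; synthetic data,
# hypothetical column names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
groups = np.repeat(np.arange(20), 30)
x = rng.normal(size=groups.size)
group_slope = rng.normal(2.0, 1.0, size=20)[groups]   # each group gets its own slope
y = 1.0 + group_slope * x + rng.normal(size=groups.size)
df = pd.DataFrame({"y": y, "x": x, "group": groups})

# re_formula="~x" adds a random slope on x alongside the random intercept.
model = smf.mixedlm("y ~ x", df, groups=df["group"], re_formula="~x")
result = model.fit()
print(result.summary())
```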
Many times, however, not all variables are relevant for all individuals, and this places a ceiling on the generalized accuracy of the prediction models being developed and evaluated. Another approach is to use machine learning techniques such as decision trees or random forests, which are capable of capturing complex interactions and non-linear relationships and which often require fewer assumptions about the underlying data.
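For comparison, here is a brief random forest sketch (scikit-learn, synthetic data) in which the outcome depends only on an interaction between two of five predictors, a structure an additive fixed-slope model would miss but a forest can recover:

```python
# Random forest on synthetic data whose signal is a pure interaction.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
# Only the first two predictors matter, and only through their interaction.
y = X[:, 0] * X[:, 1] + rng.normal(0, 0.5, size=500)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())   # out-of-sample R^2
forest.fit(X, y)
print(forest.feature_importances_)                  # importance concentrates on x0, x1
```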
While regression models are extremely useful in statistical analysis, it's important to understand their assumptions and limitations. One should carefully choose the model based on the nature of the data, the research question, and the assumptions that one is willing and able to make. Proper diagnostic checks should also be performed to test the validity of these assumptions, and if necessary, more complex models should be considered.
I have developed a class of models, not yet published, that can find an infinite number of equally good, perfectly accurate models when the predictors are highly heterogeneous. The class of models is entirely different from decision trees, random forests, bagging, and boosting. One interesting fact is that the exact parameter values of each variable become largely abstract and uninterpretable, unlike in regression models, where a coefficient is generally interpreted (in linear models) as "for each unit increase in x, the (predicted) value of y increases by beta."
The optimization engine is a genetic algorithm, coded to my specifications as an expert in genetics, with input parameters such as mutation rate and crossover (recombination) rate. Run the optimization a million times and you will get a million different, equally accurate results (if a prediction model CAN be found). If not, the test-set results usually land at 50:50, where they should.
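To be clear, the sketch below is not the unpublished model described here; it is only a generic genetic-algorithm illustration of the machinery named in this paragraph: a population of candidate weight vectors evolved by mutation, crossover, and selection on prediction accuracy, with every parameter value an illustrative assumption.

```python
# NOT the unpublished model described above -- a generic genetic-algorithm
# sketch only: candidate weight vectors evolved by selection, crossover, and
# mutation against prediction accuracy. All settings are illustrative.
import numpy as np

def evolve_weights(X, y, pop_size=100, generations=200,
                   mutation_rate=0.05, crossover_rate=0.7, seed=0):
    """Evolve linear score weights for a 0/1 outcome y. Illustrative only."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    pop = rng.normal(size=(pop_size, n_features))  # candidate weight vectors

    def fitness(w):
        # classification accuracy of a simple thresholded linear score
        return np.mean((X @ w > 0).astype(int) == y)

    for _ in range(generations):
        scores = np.array([fitness(w) for w in pop])
        # tournament selection: keep the better of two randomly drawn candidates
        idx = rng.integers(0, pop_size, size=(pop_size, 2))
        winners = np.where(scores[idx[:, 0]] >= scores[idx[:, 1]], idx[:, 0], idx[:, 1])
        parents = pop[winners]
        # uniform crossover between each parent and a shifted copy of the pool
        mates = np.roll(parents, 1, axis=0)
        cross = rng.random((pop_size, n_features)) < crossover_rate
        children = np.where(cross, mates, parents)
        # mutation: small Gaussian perturbations applied at the mutation rate
        mutate = rng.random((pop_size, n_features)) < mutation_rate
        pop = children + mutate * rng.normal(0.0, 0.5, size=children.shape)

    best = pop[np.argmax([fitness(w) for w in pop])]
    return best  # a different seed yields a different, often equally accurate, solution
```

Running evolve_weights with different seeds returns different weight vectors of similar accuracy, which echoes the point that many equally good solutions can exist.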
Regarding feature selection, the model identifies useless features automatically, by virtue of their lack of utility in prediction. In cases of true heterogeneity, the model will use predictors that are informative for even a tiny minority of the individuals in the study (as low as 1%, or lower), with no theoretical lower limit.
To purists, it may seem unsatisfying or even heretical that the model will never converge to fixed coefficient estimates. The key is that everything about the model prediction exercise is optimized, and the sum of the coefficients (learned weights) across all parameters becomes the meta-parameter, which is itself meta-optimized.
Those who are interested in learning more can email info@ipak-edu.org.
-JLW
Every model stands or falls on how closely it matches empirical observation.
The closer a model comes to the empirical data, the better the model. Yet for all models the essential understanding is that the model is the theory. The data--and only the data--is the reality.
We cannot have accurate predictions when they are weaponized to scare humanity into compliance with laws designed to destroy humanity, not protect it.
With regulatory capture we are more vulnerable than when we had no water treatment, sanitary facilities, refrigeration, or credible police authority. In those days a person could avoid water that seemed contaminated. How does one avoid a mandated poison?