Statistical modeling uses mathematics and statistical assumptions to reach conclusions from data. The use of statistical models is ubiquitous in scientific fields, including engineering, business, and the life sciences. It’s a valuable tool for drawing inferences and making quantitative predictions from data.
This article defines statistical modeling and shows where and why it’s used. It then examines the finer points of the most common forms of modeling, with actual published examples so you can see regression in action.
What you’ll learn in this post
• What statistical modeling is and how it’s used.
• Why statistical modeling is so valuable in all forms of studies.
• The role of sampling, and what to watch for.
• Common types of statistical models, focusing on regressions.
• A source for more assistance with statistical modeling for your own studies.
Definition of statistical modeling
In simple terms, statistical modeling is a way to learn and reach meaningful conclusions from data. A statistical model is defined by a mathematical equation, but breaking the term into its parts is a good place to start:
- Statistics: the science of collecting, displaying, and analyzing data
- Model: a mathematical representation of a phenomenon
The first step in any statistical analysis is to gather relevant data about the population. The population is a set of similar items or events that you want to study and acquire information about.
An example: U.S. voters
For instance, your population could be all U.S. voters. The population’s quantities you’re interested in are called population parameters. In this case, it could be the approval percentage for a presidential candidate. It would be impractical (and essentially impossible) to collect this data on all U.S. voters.
Obtaining a parameter for an entire population is typically difficult, and often impossible.
With statistical inference, you can estimate the population parameter by measuring it from part of the population. Choosing that part to collect information on is known as statistical sampling. Election polls do precisely that to predict winners (or losers) from a sample of U.S. voters.
The power of sampling, and things to watch for
That you can produce a good estimate from a relatively small sample is a big reason why statistics are so powerful.
But be careful about how you pick the sample so that you don’t introduce any significant bias into your estimate (i.e., so your sampling won’t favor a certain outcome). Bias is a source of systematic error, and we usually don’t know how large a bias is.
Another potential source of error in your estimate is the so-called chance error, or sampling error. It owes to the randomness of drawing a sample: the statistics of a sample will generally differ from the entire population’s statistics.
Unlike a bias, you can compute the sampling error given the sample size, and this error gets smaller as the sample size increases. Hence, the process of statistical modeling must account for the relationship between random and non-random variables.
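To see how sampling error shrinks as the sample size grows, here’s a minimal Python simulation. The 52% approval rate and the 100,000-person population are made-up numbers for illustration:

```python
import random
import statistics

random.seed(42)

# A hypothetical "population": 100,000 simulated voters (1 = approve, 0 = not)
population = [1 if random.random() < 0.52 else 0 for _ in range(100_000)]
true_p = statistics.mean(population)  # the population parameter

def sampling_error(sample_size, trials=500):
    """Average absolute gap between a sample's mean and the population mean."""
    errors = []
    for _ in range(trials):
        sample = random.sample(population, sample_size)
        errors.append(abs(statistics.mean(sample) - true_p))
    return statistics.mean(errors)

for n in (25, 100, 400, 1600):
    print(f"n={n:>5}: average sampling error = {sampling_error(n):.4f}")
```

Quadrupling the sample size roughly halves the average error, reflecting the familiar 1/√n behavior of the sampling error.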
Why use statistical modeling?
Prediction and information extraction are the main goals of data analysis.
The “cultures” of modeling
The late Leo Breiman, a noted statistician, stated there are two ways to approach these goals:
- Data modeling culture
- Algorithmic modeling culture
This notion of two cultures translates into a common conflict. How do you decide the best approach to analyze a given dataset?
Statistical modeling is often referred to as data modeling. Many machine learning models fall into the category of algorithmic modeling. These two share similar mathematical underpinnings but differ in their purposes.
Machine learning models suit large datasets, lend themselves to automation, and are very good at identifying hidden patterns in data. They’re an appropriate and necessary tool in several data science applications, given their strong predictive power. However, the predictive power offered by machine learning doesn’t exclude the need for statistical modeling.
Statistical models are better at explaining the relationship between variables. They seek some understanding of the structure underlying the observed data.
Statistics extracts population inferences from a sample, while machine learning identifies generalizable patterns. More importantly, the approaches complement each other.
If possible, aim for accurate predictions as well as good interpretations.
Types of statistical models
Maybe the “simplest” statistical model is the arithmetic mean, or average. With this measure, you’re estimating the expected value of a random draw from the population.
Regression analysis is an important set of statistical models. It allows you to estimate a variable using one or more independent variables. Those independent variables are also known as explanatory variables. A regression model is specified by selecting a functional form for the statistical model. Following are some of the most common regression models.
Simple linear regression
Simple linear regression is an approach to modeling the association between two variables through a linear relationship. Given a data set of n observations Y1, Y2, …, Yn and X1, X2, …, Xn, the relationship is given by:
Yi = b0 + b1Xi
for a given observation point (Yi, Xi). Y is the dependent variable (the one you want to predict) and X is the independent, or explanatory, variable. b0 and b1 are the model parameters, which are estimated from the data.
Note that it’s improbable that the data points will fall exactly on the line. Hence, a stochastic error term εi should be included to account for this variation:
Yi = b0 + b1Xi + εi
This new term represents the error of the model due to unexplained variation (or random disturbance).
Multiple linear regression
In most cases, the dependent variable may be related to more than one explanatory quantity. For the case of k explanatory variables X1, X2, …, Xk, the previous equation can be generalized to:
Yi = b0 + b1X1i + b2X2i + … + bkXki + εi
In this case, it’s called a multiple linear regression model, with k+1 model parameters b0, b1, …, bk that you need to estimate.
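Once a multiple linear regression has been fitted, a prediction is just a weighted sum of the explanatory variables plus the intercept. A quick Python sketch, with made-up coefficients for a model with three explanatory variables:

```python
# Hypothetical fitted coefficients: b[0] is the intercept b0, b[1:] are b1..bk
b = [1.5, 0.8, -0.3, 2.0]

def predict(x):
    """x = [x1, ..., xk]; returns b0 + b1*x1 + ... + bk*xk."""
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))

print(predict([2.0, 1.0, 0.5]))  # 1.5 + 1.6 - 0.3 + 1.0 = 3.8
```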
An example: COVID-19 mortality rate
In this study published in Nature, the factors associated with the COVID-19 mortality rate (the dependent variable) in 169 countries were identified using linear regression.
The researchers performed a simple linear regression analysis to test the correlation between COVID-19 mortality and a test number (the independent variable). They used multiple regression analysis for predicting mortality rates considering other explanatory factors (e.g., case numbers, critical cases, hospital bed numbers).
The results suggested that an increase in testing was effective at attenuating mortality when other means of control were insufficient. Higher mortality was found to be associated with:
- lower test number
- lower government effectiveness
- larger elderly population
- fewer hospital beds
- better transport infrastructure
Polynomial regression
In many settings, a linear relation can’t fit your data. Here, a polynomial equation can be used instead. The functional form of a polynomial regression model is an nth-degree polynomial in the independent variable:
Yi = b0 + b1Xi + b2Xi^2 + … + bnXi^n + εi
Note that the model still depends linearly on the n+1 model parameters b0, b1, …, bn. Polynomial regression is therefore a special case of multiple regression with only one independent variable.
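Because the model is linear in its parameters, fitting a polynomial regression amounts to running a multiple regression on powers of the single variable. This short Python sketch (with arbitrary x values) builds the corresponding columns for a quadratic model:

```python
# Arbitrary observed values of the single explanatory variable
xs = [0.0, 1.0, 2.0, 3.0]

degree = 2  # quadratic model
# Each row holds [1, x, x^2]: the powers of x act as separate "explanatory variables"
design_matrix = [[x ** d for d in range(degree + 1)] for x in xs]

for row in design_matrix:
    print(row)
```

Each column of this matrix can then be handed to any multiple-regression routine as if it were an independent variable of its own.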
An example: Gene expression
In this BMC Bioinformatics study, you’ll find an application of polynomial regression. This study focused on the temporal profiling of gene expression of olfactory receptor neurons. Gene expression is the process by which the information encoded in the gene is used to make a functional product, such as a protein.
The authors proposed a polynomial regression with the gene expression being the dependent variable, and time being the explanatory variable. A quadratic regression is shown to be effective in gene discovery and pattern recognition.
Logistic regression
In a logistic regression (or logit regression), the dependent variable has only two categories. If the event occurs, it’s coded as 1; the absence of the event is coded as 0. This coding is particularly useful when the dependent variable is naturally binary: for example, whether someone passed or failed a test, or whether they’re a smoker or non-smoker.
Because the dependent variable can take only two values, the probability the model depicts is bounded to the [0, 1] interval. This probability is given by the logistic function:
pi = 1 / (1 + e^(−Yi))
When Yi takes large values, the probability approaches 1. When Yi takes large negative values, the probability approaches 0.
If you’re fitting a simple linear relationship to your data, the probability is given by:
pi = 1 / (1 + e^(−(b0 + b1Xi)))
The coefficient b0 is usually called the intercept, and b1 the rate parameter. You may need computer assistance for the estimation because of the logistic function’s nonlinear nature. (Take a look at this article for a deeper academic dive into logistic regression.)
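Here’s a small Python sketch of the logistic function in action. The pass/fail setting and the coefficient values are purely illustrative, not taken from any fitted model:

```python
import math

def logistic(y):
    """Logistic function: maps any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical coefficients: intercept b0 and rate parameter b1
b0, b1 = -4.0, 0.08

def pass_probability(hours_studied):
    """Probability of passing a test, modeled as logistic(b0 + b1*X)."""
    return logistic(b0 + b1 * hours_studied)

for hours in (10, 50, 90):
    print(f"{hours} hours studied: P(pass) = {pass_probability(hours):.3f}")
```

The output stays strictly between 0 and 1, which is exactly why the logistic function is used for binary outcomes.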
An example: Twitter and U.S. elections
A practical application of logistic regression is given in this report on the 2010 U.S. elections. The authors built different logistic regression models to predict election results. The focus is on the usage patterns of Twitter by U.S. congressional candidates during the midterm elections in 2010.
The importance of content and structure of the tweets to the election outcome was tested for all candidates. The logistic regression model was able to predict the election outcome with:
- 88% accuracy when all variables were considered
- 81% accuracy when Twitter-derived variables were omitted
Multinomial logistic regression
Multinomial logistic regression extends the logistic regression model to support dependent variables with more than two categories. If there is a natural ordering between the categories, you should use ordinal logistic regression instead.
An example: Gas station workers’ biomarkers
Multinomial logistic regression is used in this occupational health study to assess the relationship between benzene concentration and biomarkers in urine samples from gas station workers.
There are three categories of workers:
- Gas station attendant (filler)
- Cashier
- Manager
The cashiers and fillers were taken as the two exposed groups. The managers were treated as a control, or reference, group. The explanatory variables were:
- Benzene concentration
- Two age groups (<25 years old and 25–34 years old)
- Years of experience
No significant difference was observed for years of experience. The biomarkers exceeding the recommended threshold limit values were more significant in the filler group.
Estimating the parameters
Let’s say you’ve collected the data and chosen a regression model. Now you want to find the best fit to the data, which gives you an estimate of the dependent variable. So, how can you estimate the parameters?
The least-squares method can do the job. The objective consists of minimizing the sum of the squared deviations of the dependent variable:
minimize Σ(Yi − Ŷi)^2
The hat denotes the values on the fitted line (the estimators). The difference Yi − Ŷi is called the residual.
Here’s how to do it for the simple linear regression model. In this case, the method is called ordinary least squares. Start by taking the sum of the linear relationship over all n observations:
ΣYi = n·b0 + b1·ΣXi
Dividing by n, you can easily find the estimator of the parameter b0:
b̂0 = Ȳ − b̂1·X̄
The overline denotes the mean value.
Now you need to obtain the estimator of the parameter b1 in terms of Xi and Yi. A simple trick is needed: multiply the linear relationship by Xi and then sum over all n observations:
ΣXiYi = b0·ΣXi + b1·ΣXi^2
Substituting the value of the estimator of the parameter b0, you get:
b̂1 = (ΣXiYi − n·X̄·Ȳ) / (ΣXi^2 − n·X̄^2)
Finally, the equation for the fitted line is given by:
Ŷ = b̂0 + b̂1·X
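The closed-form ordinary least-squares estimators are easy to compute directly. Here’s a Python sketch using a small made-up dataset that roughly follows Y = 2X:

```python
# Made-up observations, roughly y = 2x plus some noise
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n  # mean of X
y_bar = sum(ys) / n  # mean of Y

# Slope estimator: b1_hat = (sum XiYi - n*x_bar*y_bar) / (sum Xi^2 - n*x_bar^2)
b1_hat = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (
    sum(x * x for x in xs) - n * x_bar ** 2
)
# Intercept estimator: b0_hat = y_bar - b1_hat * x_bar
b0_hat = y_bar - b1_hat * x_bar

print(f"fitted line: Y = {b0_hat:.3f} + {b1_hat:.3f} X")
```

The fitted slope comes out close to 2, as expected for data built around Y = 2X.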
Help for your statistical modeling
This article might have excited you or scared you. Either way, statistical modeling is a vital part of creating impactful studies. If you need some help, Edanz Statistical Analysis services can match you with expert statisticians to assist you at any stage of your study. Contact us for more details.