Applications of Logistic Regression

Nathanael O'Donnell

What is Logistic Regression?

Along with linear regression, logistic regression is a widely used machine learning algorithm. In spite of its name, however—and in contrast to linear regression—logistic regression is not actually a regression algorithm. Rather, it is a classification algorithm, meaning that it is useful when categorizing (aka "classifying") observations based on any number of feature variables. Mathematically, logistic regression works by mapping observations to points along a logistic function, whose standard equation and graph are below:

Source: Andrew Ng

Note that g(z), called the logistic function or sigmoid function, tends towards 1 as z approaches infinity, and tends towards 0 as z approaches negative infinity. This property of the logistic function is what makes it useful for classification: the output of the function will also be a number between 0 and 1, and more extreme values of z—in either direction—will effectively result in higher confidence that a given observation should be classified one way or the other.

Applying Logistic Regression

Upon this relatively simply foundation, logistic regression can be applied to solve all manner of real-world problems, including those with only two categories and those with n categories. When we solve a problem with more than two categories, we call this multiclass classification. To keep things simple for this blog post, I will discuss a real-world application of logistic regression that categorizes observations into one of two classes.

The case study I will discuss if from a study that was conducted by Dr. Frank van der Meulen et al. at Delft University of Technology in the Netherlands. You can access a link to the paper here. The study aimed to identify the factors that influenced whether patients discharged from the cardiology department at Virga Jess Hospital in Belgium would participate in the post-operation voluntary rehabilitation program offered by the hospital. The motivation of this study was ultimately to highlight opportunities to increase the rate of participation in the rehabilitation program, and thus improve patient outcomes.

In general, when applying logistic regression to a problem, we need to identify the feature variables and response variable. In this case, the response variable is whether a patient enrolled in the rehabilitation program, and the domain of the variable is 0 or 1, with 0 representing patients who did not enroll and 1 representing patients who did enroll. The feature variables—that is, the variables that are used as factors to predict whether the patient will, or won't, enroll in the rehabilitation program, are the following (as chosen by the researchers):

distance from the patient's home to the hospital
age of the patient
mobility of the patient (a binary categorical variable representing whether or not the patient owns a car)
gender of the patient
place of residence

For each patient, their particular values for the above variables make up the x vector in the logistic function, and the theta values (parameter vector) must be learned by the logistic regression model. After running their data through the model, the researchers determined that age, distance, and mobility all had a strong influence on whether a patient would enroll in the rehabilitation program. However, since age and distance are "nuisance factors" (cannot be influenced by any reasonable interventions), they decided to focus their recommendations on increasing patients' mobility be arranging a carpooling service for patients without cars.

Other Applications

From the above example, it is relatively easy to image myriad other applications of logistic regression, including uses in commercial enterprises. For instance, an email marketing analyst could use logistic regression to predict whether a given subscriber would unsubscribe from their email list in the next month, and they could adjust the content and/or frequency of their marketing emails to this person accordingly. Another example could be a production engineer using logistic regression to predict whether a particular machine will need maintenance in the next planning period based on a feature vector containing variables such as age of the machine, time since last maintenance check, and average duration the machine is in use per day.

While there are other popular approaches to classification problems, logistic regression retains its appeal for being easy to understand and implement. In the often-complex world of machine learning, simplicity can be a virtue!

Thoughts on Theta