Estimate a Logistic regression for classification
To estimate a logistic regression we need a binary response variable and one or more explanatory variables. We also need to specify the level of the response variable that we will count as success (i.e., the Choose level: dropdown). In the example data file titanic, success for the variable survived would be the level Yes.
To access this dataset go to Data > Manage, select examples from the Load data of type dropdown, and press the Load examples button. Then select the titanic dataset.
In the Summary tab we can test if two or more variables together add significantly to the fit of a model by selecting variables in the Variables to test dropdown. This functionality can be very useful to test if the overall influence of a variable of type factor is significant.
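One common way to run a joint test of this kind is a likelihood-ratio (Chi-squared) comparison of the model with and without the selected variables. Whether Radiant uses exactly this procedure is an assumption here. The Python sketch below handles only the case of two coefficients being tested together, which is what a factor with three levels such as pclass contributes. The function name lr_test_df2 is hypothetical and the deviance values are invented for illustration; they are not Radiant output.

```python
import math

def lr_test_df2(deviance_restricted, deviance_full):
    """Likelihood-ratio test for dropping two coefficients (df = 2).

    For 2 degrees of freedom the Chi-squared survival function has the
    closed form exp(-x / 2), so no statistics library is needed.
    """
    stat = deviance_restricted - deviance_full
    if stat < 0:
        raise ValueError("the full model cannot fit worse than the restricted one")
    p_value = math.exp(-stat / 2)
    return stat, p_value

# Hypothetical deviances, invented purely for illustration
stat, p_value = lr_test_df2(deviance_restricted=1100.0, deviance_full=1000.0)
print(f"Chi-squared = {stat:.1f}, df = 2, p.value = {p_value:.3g}")
```

A small p.value suggests that the tested variables, taken together, improve the fit of the model. For other degrees of freedom a general Chi-squared distribution function would be needed.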
Additional output that requires re-estimation:
Additional output that does not require re-estimation:
We can use the Predict tab to predict probabilities for different values of the explanatory variable(s), a common use of logistic regression models. First, select the type of input for prediction using the Prediction radio buttons. Choose either an existing dataset for prediction (“Data”) or specify a command (“Command”) to generate the prediction inputs. If you choose to enter a command you must specify at least one variable and one value to get a prediction. If you do not specify a value for each variable in the model, either the mean value (for numeric variables) or the most frequent level (for factors) will be used. It is only possible to predict outcomes based on variables used in the model (e.g., age must be one of the selected explanatory variables to predict survival probability for a 90 year old passenger).
Type pclass = "3rd" and press enter
Type age = seq(0, 90, 10) and press enter
Type pclass = levels(pclass), sex = c("male","female") and press enter
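A command such as pclass = levels(pclass), sex = c("male","female") can be pictured as a grid of all value combinations, with any variable left out filled in by a default. The Python sketch below illustrates that idea only. It is not Radiant's implementation, and the function name expand_profiles and the defaults dictionary are assumptions made for illustration. The default values are the ones that appear in the prediction tables in this document.

```python
from itertools import product

def expand_profiles(command, defaults):
    """Build one prediction profile per combination of the supplied values.

    command  : dict mapping a variable name to the list of values requested
    defaults : dict mapping every model variable to its fallback value
               (mean for numeric variables, most frequent level for factors)
    """
    unknown = set(command) - set(defaults)
    if unknown:
        # Only variables that are in the model can be used for prediction
        raise ValueError(f"not in the model: {sorted(unknown)}")
    names = list(command)
    profiles = []
    for combo in product(*(command[name] for name in names)):
        row = dict(defaults)            # start from the fallback values
        row.update(zip(names, combo))   # overwrite with the requested values
        profiles.append(row)
    return profiles

# Fallback values as they appear in the prediction tables in this document
defaults = {"age": 29.881, "sex": "male", "pclass": "3rd"}
command = {"pclass": ["1st", "2nd", "3rd"], "sex": ["male", "female"]}
profiles = expand_profiles(command, defaults)
for row in profiles:
    print(row)
```

Three passenger classes combined with two levels of sex give six profiles, each evaluated at the average age because age was not part of the command.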
As an example of how to use data as input for prediction (e.g., predict the survival probabilities for 30 year old men and women in each of the passenger classes) you can use titanic as the dataset for analysis and specify a model in Model > Logistic regression (GLM) with pclass, sex, and age as explanatory variables. Choose Data from the Prediction input dropdown and select titanic_pred from the Predict for profiles dropdown to generate the predictions. Note that the variables in the data file and in the model must be the same.
To generate predicted values for all cases in, for example, the titanic dataset, select Data from the Prediction input dropdown and then select the titanic dataset. You can also create a dataset for input in a spreadsheet and then paste it into Radiant through the Data > Manage tab. You can also load csv data as input. For example, load the file at the following link into Radiant through the Data > Manage tab and try to generate the same predictions:
https://radiant-rstats.github.io/docs/examples/glm_pred.csv
Hint: Use csv (url) to load the data from the link above.
Once the desired predictions have been generated they can be saved to a csv file by clicking the download button on the top right of the screen. To add predictions to the dataset, click the
As an example we will use a dataset that describes the survival status of individual passengers on the Titanic. The principal source for data about Titanic passengers is the Encyclopedia Titanica. One of the original sources is Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay. Suppose we want to investigate which factors are most strongly associated with the chance of surviving the sinking of the Titanic. Let's focus on four variables in the database: survived, pclass, sex, and age.
Select survived as the response variable and Yes in Choose level. Select pclass, sex, and age as the explanatory variables. In the screenshot below we see that each of the coefficients is statistically significant (p.value < .05) and that the model has some predictive power (the p.value for the Chi-squared statistic is < .05). Unfortunately the coefficients from a logit model are difficult to interpret. The OR column provides estimated odds-ratios. We see that the odds of survival were significantly lower for 2nd and 3rd class passengers compared to 1st class passengers. The odds of survival for males were also lower than for females. While the effect of age is statistically significant, the odds of survival change only modestly with each extra year of age (see also the standardized coefficient).
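For each of the explanatory variables the following null and alternate hypotheses can be formulated for the odds ratios:
H0: The odds-ratio associated with explanatory variable x is equal to 1
Ha: The odds-ratio associated with explanatory variable x is not equal to 1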
The odds-ratios from the logistic regression can be interpreted as follows:
In addition to the numerical output provided in the Summary tab we can also evaluate the link between survived, pclass, sex, and age visually (see Plot tab). The settings in the side-panel are the same as before. In the screenshot below we see a coefficient (or rather an odds-ratio) plot with confidence intervals. The relative importance of gender and class compared to age clearly stands out. Note: click the check box for standardized coefficients (i.e., standardize) in the Summary tab and see if your conclusion changes.
Probabilities are more convenient for interpretation than coefficients or odds from a logit model. To see how survival probabilities change across passenger classes select Command from the Prediction input dropdown in the Predict tab, type pclass = levels(pclass) in the Prediction command box, and press return.
The figure above shows that the probabilities drop sharply for 2nd and 3rd class passengers compared to 1st class passengers. For 1st class males of average age (approx. 30 yo in the sample) the survival probability was close to 50%. For 30 yo, male, 3rd class passengers this probability was closer to 9%.
age     sex   pclass  pred
29.881  male  1st     0.499
29.881  male  2nd     0.217
29.881  male  3rd     0.092
To see the effects of gender type sex = levels(sex) in the Prediction command box and press return. For average age females in 3rd class the survival probability was around 55%. For males with the same age and class characteristics the chance of survival was closer to 9%.
age     pclass  sex     pred
29.881  3rd     female  0.551
29.881  3rd     male    0.092
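To connect these predicted probabilities back to the odds-ratios reported in the Summary tab, recall that odds are p / (1 - p) and that an odds-ratio is the ratio of two such odds. The Python sketch below applies this to the predictions in the two tables above. Because the inputs are rounded to three decimals, the results are approximations and are not a substitute for the OR column in the Summary tab. The helper name odds is chosen here for illustration.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

# Predicted survival probabilities at average age, copied from the tables
# above (rounded to three decimals)
pred = {
    ("male", "1st"): 0.499,
    ("male", "2nd"): 0.217,
    ("male", "3rd"): 0.092,
    ("female", "3rd"): 0.551,
}

# Odds-ratios relative to the base levels (1st class, female)
or_2nd_vs_1st = odds(pred[("male", "2nd")]) / odds(pred[("male", "1st")])
or_3rd_vs_1st = odds(pred[("male", "3rd")]) / odds(pred[("male", "1st")])
or_male_vs_female = odds(pred[("male", "3rd")]) / odds(pred[("female", "3rd")])

print(f"2nd vs 1st class: {or_2nd_vs_1st:.3f}")
print(f"3rd vs 1st class: {or_3rd_vs_1st:.3f}")
print(f"male vs female:   {or_male_vs_female:.3f}")
```

All three ratios are well below 1, in line with the lower odds of survival for 2nd class, 3rd class, and male passengers.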
To see the effects of age type age = seq(0, 100, 20) in the Prediction command box and press return. For male infants in 3rd class the survival probability was around 22%. For 60 year old males in 3rd class the probability drops to around 3.5%. At age 100, the upper end of the range used here, the model predicts a survival probability close to 1% for males in 3rd class.
pclass  sex   age  pred
3rd     male    0  0.220
3rd     male   20  0.124
3rd     male   40  0.067
3rd     male   60  0.035
3rd     male   80  0.018
3rd     male  100  0.009
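The age predictions above also show why a small per-year effect can still add up. In a logit model the log-odds change by a constant amount for each extra year, even though the probabilities themselves do not change linearly. The Python sketch below recovers an approximate per-year odds-ratio from the table. The inputs are rounded predictions, so the result is only an approximation of the value in the OR column, and the helper name logit is chosen here for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))

# Predictions for 3rd class males, copied from the table above
ages = [0, 20, 40, 60, 80, 100]
pred = [0.220, 0.124, 0.067, 0.035, 0.018, 0.009]

log_odds = [logit(p) for p in pred]

# Change in log-odds per year between consecutive rows of the table
slopes = [
    (lo_next - lo) / (age_next - age)
    for lo, lo_next, age, age_next in zip(log_odds, log_odds[1:], ages, ages[1:])
]

# Average change per year over the full range, expressed as odds-ratios
slope = (log_odds[-1] - log_odds[0]) / (ages[-1] - ages[0])
or_per_year = math.exp(slope)
or_per_decade = math.exp(10 * slope)

print([round(s, 4) for s in slopes])
print(f"odds-ratio per year:     {or_per_year:.3f}")
print(f"odds-ratio per 10 years: {or_per_decade:.3f}")
```

The per-year slopes are nearly identical across rows, which is what a model that is linear in the log-odds implies. The small differences come from rounding in the table.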
For a more comprehensive overview of the influence of gender, age, and passenger class on the chances of survival we can generate a full table of probabilities by selecting Data from the Prediction input dropdown in the Predict tab and selecting titanic from the Predict for profiles dropdown. There are too many numbers to easily interpret in table form but the figure gives a clear overview of how survival probabilities change with