Transform variables
All transformations applied in the Data > Transform tab can be logged. If, for example, you apply a log transformation to numeric variables the following code is generated and put in the Transform command log window at the bottom of your screen when you click the Store button.
## transform variable
r_data[['diamonds']] <- mutate_each(r_data[['diamonds']], funs(log), ext = '_log', price, carat)
This is an important feature if you want to re-run a report with new, but similar, data. Even more important is that there is a record of the steps taken to transform the data and to generate results, i.e., your work is now reproducible.
To add commands contained in the command log window to a report in R Report click the icon.
Even if a filter has been specified it will be ignored for (most) functions available in Data > Transform. To create a new dataset based on a filter navigate to the Data > View tab and click the Store button. Alternatively, to create a new dataset based on a filter, select Split data > Holdout sample from the Transformation type dropdown.
For larger datasets, or when summaries are not needed, it can useful to click Hide summariesbefore selecting the transformation type and specifying how you want to alter the data. If you do want to see summaries make sure that Hide summaries is not checked.
The Bin command is a convenience function for the xtile command discussed below when you want to create multiple quitile/decile/… variables. To calculate quintiles enter 5 as the Nr bins. The reverse option replaces 1 by 5, 2 by 4, …, 5 by 1. Choose an appropriate extension for the new variable(s).
When you select Type from the Transformation type drop-down another drop-down menu is shown that will allow you to change the type (or class) of one or more variables. For example, you can change a variable of type integer to a variable of type factor. Click the Store button to commit the changes to the data set. A description of the transformation options is provided below.
Choose Normalize from the Transformation type drop-down to standardize one or more variables. For example, in the diamonds data we may want to express price of a diamond per-carat. Select carat as the Normalizing variable and price in the Select variable(s) box. You will see summary statistics for the new variable (e.g., price_carat) in the main panel. Commit changes to the data by clicking the Store button.
To use the recode feature select the variable you want to change and choose Recode from the Transformation type drop-down. Provide one or more recode commands, separated by a ;, and press return to see information about the changed variable. Note that you can specify a name for the recoded variable in the Recoded variable name input box (press return to submit changes). Finally, click Store to add the recoded variable to the data. Some examples are given below.
Low and all others to Highlo:20 = 'Low'; else = 'High'
High and all others to Low20:hi = 'High'; else = 'Low'
A, 13:24 to B, and the remainder to C1:12 = 'A'; 13:24 = 'B'; else = 'C'
<25 and 25-34 are recoded to <35, 35-44 and 35-44 are recoded to 35-54, and 55-64 and >64 are recoded to >54'<25' = '<35'; '25-34' = '<35'; '35-44' = '35-54'; '45-54' = '35-54'; '55-64' = '>54'; '>64' = '>54'
sales that is equal to 400 we would (1) select the variable sales in the Select variable(s) box and enter the command below in the Recode box. Press return and Store to add the recoded variable to the data400 = NA
carat in the Select variable(s) box and enter the command below in the Recode box to set the value for carat to 2 in all rows where carat is currently larger than or equal to 2. Press return and Store to add the recoded variable to the data2:hi = 2
Note: Do not use = in a variable label when using the recode function (e.g., 50:hi = '>= 50') as this will cause an error.
If a (single) variable of type factor is selected in Select variable(s), choose Reorder/Remove levels from the Transformation type drop-down to reorder and/or remove levels. Drag-and-drop levels to reorder them or click the \(\times\) to remove them. Press Store to commit the changes. To temporarily exclude levels from the data use the Filter box (see the help file linked in the Data > View tab).
Choose Rename from the Transformation type drop-down, select one or more variables, and enter new names for them in the Rename box. Separate names by a ,. Press return to see summaries for the renamed variables on screen and press Store to alter the variable names in the data.
Choose Replace from the Transformation type drop-down if you want to replace existing variables in the data with new ones created using, for example, Create, Transform, Clipboard, etc.. Select one or more variables to overwrite and the same number of replacement variables. Press Store to alter the data.
When you select Transform from the Transformation type drop-down another drop-down menu is shown you can use to apply common transformations to one or more variables in the data. For example, to take the (natural) log of a variable select the variable(s) you want to transform and choose Ln (natural log) from the Apply function drop-down. The transformed variable will have the extension specified in the Variable name extension input (e.g,. _ln). Make sure to press return after changing the extension. Click the Store button to add the (changed) variable(s) to the data set. A description of the transformation functions included in Radiant is provided below.
Although not recommended, you can manipulate your data in a spreadsheet (e.g., Excel or Google sheets) and copy-and-paste the data back into Radiant. If you don’t have the original data in a spreadsheet already use the clipboard feature in Data > Manage so you can paste it into the spreadsheet or click the download icon on the top right of your screen in the Data > View tab. Apply your transformations in the spreadsheet program and then copy the new variable(s), with a header label, to the clipboard (i.e., CTRL-C on windows and CMD-C on mac). Select Clipboard from the Transformation type drop-down and paste the new data into the Paste from spreadsheet box. It is key that new variable(s) have the same number of observations as the data in Radiant. To add the new variables to the data click Store.
Note: Using the clipboard feature for data transformation is discouraged because it is not reproducible.
Choose Create from the Transformation type drop-down. This is the most flexible command to create new or transform existing variables. However, it also requires some basic knowledge of R-syntax. A new variable can be any function of other variables in the (active) dataset. Some examples are given below. In each example the name to the left of the = sign is the name of the new variable. To the right of the = sign you can include other variable names and basic R-functions. After you type the command press return to see summary statistics for the new variable. If the result is as expected press Store to add it to the dataset.
Note: If one or more variables is selected from the
Select variableslist they will be used to group the data before creating the new variable (see example 1. below). If this is not the intended result make sure that no variables are selected when creating new variables
z that is equal to the mean of price. To calculate the mean of price per group (e.g., per level of clarity) select clarity from the Select variables list before creating zz = mean(price)
z that is the difference between variables x and yz = x - y
z that is a transformation of variable x with mean equal to zero (see also Transform > Center):z = x - mean(x)
z that takes on the value TRUE when x > y and FALSE otherwisez = x > y
z that takes on the value TRUE when x is equal to y and FALSE otherwisez = x == y
z that is equal to x lagged by 3 periodsz = lag(x,3)
smaller and bigger)z = ifelse(x < y, 'smaller', 'bigger')
Recode function described belowz = ifelse(x < 60, '< 60', ifelse(x > 65, '> 65', '60-65'))
sales that is equal to 400 we could use an ifelse statement and enter the command below in the Create box. Press return and Store to add the sales_rc to the data. Note that if we had entered sales on the left-hand side of the = sign the original variable would have been overwrittensales_rc = ifelse(sales > 400, NA, sales)
ifelse statement and enter the command below in the Create box. As before, press return and Store to add sales_rc to the dataincome_rc = ifelse(ID == 3, income/1000, income)
Create and enter:income_rc = ifelse(ID %in% c(1, 3, 15), income/1000, income)
Type menu you can use the parse_date_time function. For a date formated as 2-1-14 you would specify the command below (note that this format will also be parsed correctly by the mdy function in the Type menu)date = parse_date_time(x, '%m%d%y')
tdiff = as_duration(time2 - time1)
m = month(date)
minute, hour, day, week, quarter, year, wday (for weekday). For wday and month it can be convenient to add label = TRUE to the call. For example, to extract the weekday from a date variable and use a label rather than a numberwd = wday(date, label = TRUE)
dist = as_distance(lat1, long1, lat2, long2)
recency by using the xtile command. To create deciles replace 5 by 10.rec_iq = xtile(recency, 5)
Note: For examples 7, 8, and 15 above you may need to change the new variable to type factor before using it for further analysis (see also Type above)
Choose Remove missing from the Transformation type drop-down to eliminate rows with one or more missing values. Rows with missing values for Select variables will be removed. Press Store to change the data. If missing values were present you will see the number of observations in the data summary change (i.e., the value of n changes) as variables are selected.
Choose Reorder/Remove variables from the Transformation type drop-down. Drag-and-drop variables to reorder them in the data. To remove a variable click the \(\times\) symbol next to the label. Press Store to commit the changes.
It is common to have one or more variables in a dataset that should have only unique values (i.e., no duplicates). Customers IDs, for example, should be unique unless the dataset contains multiple orders for the same customer. To remove duplicates select one or more variables to determine uniqueness. Choose Remove duplicates from the Transformation type drop-down and check how the summary statistics change. Press Store to change the data. If there are duplicate rows you will see the number of observations in the data summary change (i.e., the value of n and n_distinct will change).
If there are duplicates in the data use Show duplicates to get a better sense for the data points that have the same value in multiple rows. If you want to explore duplicates using the Data > View tab make sure to Store them in a different dataset (i.e., make sure not to overwrite the data you are working on). If you choose to show duplicates based on all columns in the data only one of the duplicate rows will be shown. These rows are exactly the same so showing 2 or 3 isn’t helpful. If, however, we look for duplicates based on a subset of the available variables Radiant will generate a dataset with all similar rows.
Create a dataset with all combinations of values for a selection of variables. This is useful to generate datasets for prediction in, for example, Model > Linear regression (OLS) or Model > Logistic regression (GLM). Suppose you want to create a dataset with all possible combinations of values for cut and color of a diamond. By selecting Expand grid from the Transformation type dropdown and cut and color in the Select variable(s) box we can see in the screenshot below that there are 35 possible combinations (i.e., cut has 5 unique values and color has 7 unique values so 5 x 7 combinations are possible). Choose a name for the new dataset (e.g., diamonds_expand) and click the Store button to add it to the Datasets dropdown.
Turn a frequency table into a dataset. The number of rows will equal the sum of all frequencies.
To create a holdout sample based on a filter, select Holdout sample from the Transformation type dropdown. By default the opposite of the active filter is used. For example, if analysis is conducted on observations with date < '2014-12-13' then the holdout sample will contain rows with date >= '2014-12-13' if the Reverse filter box is checked.
To create a variable that can be used to (randomly) filter a dataset for training and validation, select Training variable from the Transformation type dropdown. Specify either the number of observations to use for training (e.g., set Size to 2000) or a proportion of observations to select (e.g., set Size to .7). The new variable will have a value 1 for training and 0 holdout.
Combine multiple variables into one column. If you have the diamonds dataset loaded, select cut and color in the Select variable(s) box after selecting Gather columns from the Transformation type dropdown. This will create new variables key and value. key has two levels (i.e., cut and color) and value captures all values in cut and color.
Spread one column into multiple columns. The opposite of gather. For a detailed discussion about tidy data see the tidy-data vignette.