Ryan Swanstrom

R Commands for Cleaning Data

Mar 22, 2013

in Data Science 101, education

These are my notes from the Coursera Data Analysis course.

Here are some R commands that are helpful for cleaning data.

String Replacement

sub() replaces the first occurrence of a pattern in each string
gsub() replaces all occurrences of a pattern in each string
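
As a quick illustration of the difference, here is a minimal sketch. The strings are made up for the example and are not from the course data. Note that the pattern is treated as a regular expression unless you pass fixed = TRUE.

```r
# Hypothetical column-name-like strings, for illustration only.
x <- c("first_name_raw", "last_name_raw")

# sub() replaces only the first match in each element.
first_only <- sub("_", " ", x)

# gsub() replaces every match in each element.
all_matches <- gsub("_", " ", x)

print(first_only)
print(all_matches)
```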

Quantitative Variables in Ranges

cut(data$col, seq(0, 100, by=10)) returns a factor giving the range each observation falls into, in this example: (0,10], (10,20], (20,30], and so on. Intervals are closed on the right by default, so a value of exactly 0 becomes NA unless include.lowest=TRUE is set
cut2(data$col, g=6) in the Hmisc package, returns a factor variable with 6 quantile groups
cut2(data$col, m=25) in the Hmisc package, returns a factor variable that aims for at least 25 observations in each group
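
The binning behavior is easier to see on a small vector. This is only a sketch with made-up values; the cut2() call runs only if the Hmisc package happens to be installed, and g = 3 is an arbitrary choice for the example.

```r
# Hypothetical scores, for illustration only.
scores <- c(5, 10, 15, 42, 99, 100)

# Intervals are right-closed by default: (0,10], (10,20], ..., (90,100].
bins <- cut(scores, breaks = seq(0, 100, by = 10))
print(table(bins))

# A value equal to the lowest break (0) is NA unless include.lowest = TRUE.
with_zero <- cut(c(0, scores), breaks = seq(0, 100, by = 10),
                 include.lowest = TRUE)
print(with_zero)

# cut2() lives in Hmisc; skip this part if the package is not installed.
if (requireNamespace("Hmisc", quietly = TRUE)) {
  groups <- Hmisc::cut2(scores, g = 3)
  print(table(groups))
}
```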

Manipulating Rows/Columns

merge() combines two data frames by matching on their shared columns
sort() sorts a vector
order(data$col, na.last=TRUE) returns the row indexes that put col in sorted order, with NAs last
data[order(data$col, na.last=TRUE),] reorders the entire data frame based upon the col
melt() in the reshape2 package, reshapes data from wide to long format
rbind() adds more rows to a data frame
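
Here is a minimal sketch that strings these together. The data frames, column names, and values are hypothetical, and the melt() step runs only if reshape2 is installed.

```r
# Hypothetical data frames, for illustration only.
people <- data.frame(id = c(3, 1, 2), name = c("Cy", "Al", "Bo"))
scores <- data.frame(id = c(1, 2, 3), score = c(90, NA, 75))

# merge() joins on the shared column, here "id".
merged <- merge(people, scores, by = "id")

# sort() returns the sorted values and drops NAs by default.
sorted_scores <- sort(merged$score)

# order() returns row positions, so it can reorder the whole data frame.
by_score <- merged[order(merged$score, na.last = TRUE), ]

# rbind() appends rows; the column names must match.
extra <- data.frame(id = 4, name = "Di", score = 60)
combined <- rbind(merged, extra)
print(combined)

# melt() lives in reshape2; skip this part if the package is not installed.
if (requireNamespace("reshape2", quietly = TRUE)) {
  long <- reshape2::melt(combined, id.vars = c("id", "name"))
  print(long)
}
```

sort() is handy for a single vector, but it loses the link to the other columns, which is why order() is the one to use for reordering rows.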

These functions take other parameters that let them do a lot more. There are also a number of other helpful R functions, but these are enough to get you started. Check the R help (?functionname) for more details.
