This post is notes from the Coursera Data Analysis Course.
Here are some R commands that might prove helpful for cleaning data.
String Replacement
- sub() replaces the first occurrence
- gsub() replaces all occurrences
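As a rough sketch of the difference between the two, the snippet below cleans up a pair of made-up column names. The vector raw_names and its contents are assumptions chosen only for illustration.

```r
# Hypothetical column names, invented for this example
raw_names <- c("first_name_raw", "last_name_raw")

# sub() replaces only the first underscore in each string
first_only <- sub("_", " ", raw_names)

# gsub() replaces every underscore in each string
all_matches <- gsub("_", " ", raw_names)

print(first_only)
print(all_matches)
```

Both functions treat the pattern as a regular expression by default; passing fixed = TRUE matches it literally instead.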
Quantitative Variables in Ranges
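- cut(data$col, seq(0,100, by=10)) breaks the data up by the range each value falls into, in this example: whether the observation is between 0 and 10, 10 and 20, 20 and 30, and so on. By default each range excludes its lower bound and includes its upper bound, so a value of exactly 0 becomes NA unless include.lowest=TRUE is set
- cut2(data$col, g=6) from the Hmisc package returns a factor variable with 6 quantile groups
- cut2(data$col, m=25) returns a factor variable with at least 25 observations in each group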
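The binning above can be sketched as follows. The scores vector is made up for illustration, and the cut2() call is guarded because it assumes the Hmisc package is installed.

```r
# Hypothetical test scores, invented for this example
scores <- c(5, 10, 37, 82, 100)

# Bin each score into a 10-point range; intervals are (lower, upper]
ranges <- cut(scores, breaks = seq(0, 100, by = 10))
print(table(ranges))

# cut2() lives in Hmisc, so only run it when that package is available
if (requireNamespace("Hmisc", quietly = TRUE)) {
  # Split 1..100 into 4 groups of roughly equal size
  quartile_groups <- Hmisc::cut2(1:100, g = 4)
  print(table(quartile_groups))
}
```

Note that 10 lands in the first range rather than the second, because the upper bound of each interval is inclusive.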
Manipulating Rows/Columns
- merge() combines data frames by matching on common columns
- sort() sorts a vector
- order(data$col, na.last=TRUE) returns the indices that put the column in sorted order, with NA values last
- data[order(data$col, na.last=TRUE),] reorders the entire data frame based on the column
- melt() from the reshape2 package reshapes data from wide to long format
- rbind() appends rows to a data frame
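Here is a small base R sketch of several of these together. The data frames people and cities, and their columns, are hypothetical and exist only to show the calls; melt() is left out because it needs the reshape2 package.

```r
# Hypothetical data frames, invented for this example
people <- data.frame(id = c(3, 1, 2), age = c(41, NA, 29))
cities <- data.frame(id = c(1, 2, 3), city = c("Oslo", "Lima", "Pune"))

# sort() returns the sorted values and drops NA by default
sorted_ages <- sort(people$age)

# order() returns row indices instead; na.last = TRUE puts NA ages at the end
idx <- order(people$age, na.last = TRUE)
sorted_people <- people[idx, ]

# merge() joins the two data frames on the shared "id" column
joined <- merge(people, cities, by = "id")

# rbind() appends a row whose column names match the existing ones
more_people <- rbind(people, data.frame(id = 4, age = 35))

print(sorted_people)
print(joined)
print(more_people)
```

Using order() to index the data frame keeps each row intact, whereas sort() on a single column would only return that column's values.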
These functions all take additional arguments that let them do a lot more. There are also a number of other helpful R functions, but these are enough to get you started. Check the R help (?functionname) for more details.