Learn R Programming: Getting Started with R Language Cheatsheet

R is a programming language designed for statistical computing and graphics. It is free, open-source, and available on Linux, macOS, and Windows. R is an essential language in data science and is widely used by statisticians and data scientists.

As R is a community-driven language and software platform, it thrives and improves on user contributions. In this article, we will present R cheat sheets that are organized in the following manner:

1. Data processing and transformation

For any kind of analysis, input/output and transformation of data are core tasks. R is a robust platform with many features that we will cover in the following sections.

a) Data handling

To extract and load data for any kind of analysis, R provides pretty powerful and easy-to-use utility functions. Some of these are listed, as follows:

read.csv(<file_name>): This imports a standard .csv file
write.csv(<object_name>,<file_name>): This exports to a .csv file
data(<dataset_name>): This loads R’s built-in dataset
head(<object>): This prints the first few entries of the data imported
names(<object>): This lists variables in an object
read.table(<file_name>): This reads a delimited text file into a data frame
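The functions above can be combined into a small round trip. The following is a minimal sketch, assuming the built-in mtcars dataset and a temporary file path chosen only for illustration:

```r
# Load a built-in dataset, export it to .csv, and import it again.
data(mtcars)                               # built-in dataset (example choice)
csv_path <- tempfile(fileext = ".csv")     # hypothetical path for this sketch
write.csv(mtcars, csv_path, row.names = FALSE)

cars <- read.csv(csv_path)                 # import the .csv file
head(cars, 3)                              # print the first three rows
names(cars)                                # list the column names

# read.table() must be told about the header row and the separator.
cars_tbl <- read.table(csv_path, header = TRUE, sep = ",")
```

Here row.names = FALSE keeps write.csv from adding an extra column of row names to the file.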

b) Basic data types

Data types form the basic constructs of R, or of any other language for that matter. R provides a rich set of basic data types to handle varied kinds of data. These are as follows:

numeric (integer and double) and character: These are the basic data types for numbers and text
factor and complex: A factor stores categorical data, while the complex data type is used for complex numbers
is.<data_type> and as.<data_type>: These are used to check data types and to convert between types, respectively
length(<variable>): This gives you the number of elements in a variable; use nchar(<variable>) to count the characters in a string
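The following sketch illustrates the data types and the is./as. helpers listed above. The variable names are made up for the example, and nchar() is included only to contrast it with length():

```r
x <- 42L                                 # integer
y <- 3.14                                # double
s <- "hello"                             # character
f <- factor(c("low", "high", "low"))     # categorical data
z <- complex(real = 1, imaginary = 2)    # complex number

is.numeric(x)      # TRUE: integers and doubles are both numeric
is.character(s)    # TRUE
as.character(y)    # "3.14"
as.integer("7")    # 7
levels(f)          # "high" "low"

length(s)          # 1: the vector holds one element
nchar(s)           # 5: the string holds five characters
```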

2. Data structures

R provides many data structures out of the box, which we discuss in the following subsections.

a) Vectors

This is the most basic data structure in R. It is similar to a mathematical vector. The following are ways to interact with a vector in R:

r[1]: This accesses an element using square brackets. Indexing begins at 1.
r[r > 100]: Vectors support logical expressions as indices. The given example returns the elements of r that are greater than 100.
r[5:10]: Vectors support subselection. The given example returns the elements at indices 5 to 10.
r[-1]: This returns all elements except the first.
factor(x): This converts a vector x to a factor.
which.max(x) and which.min(x): These return the indices of the maximum and minimum values of x, respectively.
rev(x): This reverses the elements of x.
table(x): This gives you the frequency table for the elements of the x vector.
match(a,b): This returns, for each element of a, the position of its first match in b, or NA if there is no match.
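As a quick illustration of the access patterns above, here is a small sketch with made-up values:

```r
r <- c(50, 120, 75, 300, 10)

r[1]            # 50
r[r > 100]      # 120 300
r[2:4]          # 120 75 300
r[-1]           # 120 75 300 10

which.max(r)    # 4 (the position of 300, not the value)
which.min(r)    # 5 (the position of 10)
rev(r)          # 10 300 75 120 50

fruit <- c("apple", "pear", "apple")
table(fruit)                       # apple: 2, pear: 1
match(c("pear", "kiwi"), fruit)    # 2 NA
```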

b) Arrays and matrices

R supports multidimensional arrays. A matrix is a two-dimensional array. The following are access patterns for these data structures:

array(<vector>,<vector_dimensions>): This generates an array from an input vector
%o%: This gives you the outer product of two arrays
x[a,b,c]: This accesses an element of an array; the indices, one per dimension, are comma-separated within square brackets
matrix(<vector>,nrow=r,ncol=c): This generates an r X c matrix with values from <vector>
t(<matrix>): This is the transpose of a matrix
diag(<matrix>): This gives the diagonal of a matrix
colSums(<matrix>) and rowSums(<matrix>): These calculate the sums of the columns and rows of a matrix, respectively
colMeans(<matrix>) and rowMeans(<matrix>): These calculate the means of the columns and rows of a matrix, respectively
%*%: This is a matrix multiplication operator
lower.tri(<matrix>): This returns a logical matrix that is TRUE in the lower triangle; use it as an index to extract the lower-triangle values
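The sketch below exercises these matrix and array utilities on small, made-up inputs. Note that the function names are case-sensitive (colSums, rowMeans, and so on):

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

t(m)             # the 3 x 2 transpose
colSums(m)       # 3 7 11
rowSums(m)       # 9 12
colMeans(m)      # 1.5 3.5 5.5
rowMeans(m)      # 3 4

m %*% t(m)       # 2 x 2 matrix product
1:3 %o% 1:2      # 3 x 2 outer product

sq <- matrix(1:9, nrow = 3)
diag(sq)             # 1 5 9
sq[lower.tri(sq)]    # 2 3 6

a <- array(1:24, dim = c(2, 3, 4))
a[2, 3, 4]           # 24
```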

c) Lists

A list is an ordered collection of named or unnamed objects, which may or may not be homogeneous. These are recursive data structures; that is, a list’s element can itself be a list. A list can be manipulated using the following:

list(<object_1>,<object_2>,…): This generates a list of objects that are separated by a comma
L[[i]]: Double square brackets access the element at the ith index of the list
length(<list>): This returns the count of the top-level elements of a list
L$<name>: The $ operator accesses the element of list L called <name>; this is the same as L[["<name>"]]
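A short sketch of these list operations, using a hypothetical list with one nested element:

```r
L <- list(id = 1L, tags = c("a", "b"), nested = list(x = 10, y = 20))

L[[2]]          # "a" "b": access by position
L$tags          # the same element, accessed by name
L[["tags"]]     # the same element again
L$nested$y      # 20: lists can contain lists
length(L)       # 3: only top-level elements are counted
```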

d) Data frames

Data frames are tabular structures that can have columns of different data types and attributes. A data frame may contain components of the numeric, character, factor, or list types, or it may contain other data frames. The following utilities help in manipulating data frames:

data.frame(col1=<object1>,col2=<object2>,…): This generates a data frame with n columns or components, which have values from corresponding objects
attach(<data.frame>): This places a data frame on the search path so that its columns can be accessed by name
merge(x,y): This combines two data frames based on their common columns or row names
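As an illustration, the following sketch builds two small, made-up data frames and merges them on their shared id column. The with() call is shown as an assumption-free alternative to attach() that does not modify the search path:

```r
people <- data.frame(id = 1:3, name = c("Ann", "Bob", "Cy"))
scores <- data.frame(id = c(2, 3, 4), score = c(88, 92, 75))

merged <- merge(people, scores)     # keeps only ids present in both
merged
#   id name score
# 1  2  Bob    88
# 2  3   Cy    92

left <- merge(people, scores, all.x = TRUE)   # keep every row of people

attach(people)       # columns are now visible by name
first_name <- name[1]
detach(people)       # remove the data frame from the search path again

mean_score <- with(scores, mean(score))       # 85
```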

e) General utilities

Apart from the utilities and the other constructs that we just discussed, R provides a rich set of general utilities to make data analysis even easier. Check out the following utilities:

c(1:5): This is a generic function that concatenates values. The given example would generate a vector with values 1 to 5.
rep(<value>,<count>): This generates a vector with repeating <value> elements of the <count> size.
seq(from,to): This generates a sequence vector starting with from and ending with to. You can also specify increments; the default is 1.
sort(c(10,9,8,7)): This returns a sorted vector 7,8,9,10.
order(c(10,9,1,2)): This returns the indices that would sort the vector in ascending order as 3,4,2,1.
rank(c(10,5,6,9)): This returns the rank order of elements as 4,1,2,3.
summary(<object>): This gives summary details, such as min, max, mean, median, and so on, for the object.
choose(n,k): This returns the number of ways to choose k items from n, that is, the binomial coefficient.
na.omit(x): This removes all the missing values (NAs) from x.
na.fail(x): This errors out if x contains even a single missing value.
unique(x): This returns only distinct or unique values of x. This works with vectors and data frames.
paste(…): This converts objects to strings and concatenates them.
substr(cv,start,stop): This extracts substrings from the cv character vector from the start to the stop position.
grep(ptrn,cv): This searches for the ptrn pattern in the cv vector and returns the indices of the matching elements.
gsub(ptrn,rep,cv): This replaces all matches of the ptrn pattern with the rep replacement in the cv vector.
tolower and toupper: This converts character vector elements to lowercase and uppercase, respectively.
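The sketch below runs a selection of these utilities on small, made-up inputs so that the results can be checked by hand:

```r
c(1:5)                      # 1 2 3 4 5
rep("a", 3)                 # "a" "a" "a"
seq(2, 10, by = 2)          # 2 4 6 8 10

v <- c(10, 9, 1, 2)
sort(v)                     # 1 2 9 10
order(v)                    # 3 4 2 1
rank(c(10, 5, 6, 9))        # 4 1 2 3
choose(5, 2)                # 10

x <- c(1, NA, 3, 3)
as.numeric(na.omit(x))      # 1 3 3
unique(x)                   # 1 NA 3

cv <- c("apple pie", "banana", "grape")
paste("n", 1:2, sep = "_")  # "n_1" "n_2"
substr(cv, 1, 3)            # "app" "ban" "gra"
grep("ap", cv)              # 1 3
gsub("a", "A", cv)          # "Apple pie" "bAnAnA" "grApe"
toupper("r")                # "R"
```

Note that sort(), order(), and rank() each take a single vector, which is why the values are wrapped in c().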

3. Math and modeling

R has a rich set of inbuilt functions and packages to perform mathematical and modeling operations.

a) Math and modeling utilities

As R is a statistical language, it provides a rich set of mathematical functions that are available right out of the box (while more can be added using additional libraries or packages):

sum(x): This is the sum of the elements of x.
cumsum(x): This calculates the cumulative sum of the elements of x.
diff(x): This is the difference between consecutive elements of vector x.
prod(x): This is the product of the elements of x.
mean(x) and median(x): These are the mean and median of x, respectively.
var(x,y): With a single argument, var(x) is the variance of x. With two arguments, it is the covariance between x and y, which is the same as cov(x, y). It works with matrices and data frames as well.
quantile(x,probs): This returns the quantile breakup of x for given probabilities.
sd(x): This is the standard deviation for x.
weighted.mean(x,w): This returns the weighted mean of x using the w weight vector.
cor(x,y): This is the linear correlation between x and y.
round(x,n): This rounds the elements of x to n digits.
log(a,b): This calculates the logarithm of a to base b.
sin, cos, tan, asin, acos, atan, and so on: These are trigonometric functions.
exp(x): This exponentiates each element of the x vector.
scale(m): This centers and scales the columns of an m numeric matrix.
union(x,y), intersect(x,y), and is.element(e,x): These are set functions.
Conj(c): This returns the conjugate of the c complex number.
rnorm, rpois, rgamma, rexp, rcauchy, rt, and so on: These can be used to generate random values from the Gaussian, Poisson, gamma, exponential, Cauchy, and Student's t distributions.
fft(x): This calculates Fast Fourier Transform of the elements of x.
apply(m,MARGIN,FUN): This applies the FUN function over the rows (MARGIN = 1) or columns (MARGIN = 2) of the m matrix.
lapply(l,FUN): This applies the FUN function to each element of the l list and returns a list.
optim(par, fn, method): This is the general-purpose optimizer. It minimizes the fn function, starting from the initial par parameters and using the given method.
lm(frml): This fits a linear model on the frml formula. This is used for regression, analysis of variance, and analysis of covariance. Also, check glm for generalized linear models.
nls(frml): This computes nonlinear least-squares estimates of the parameters of a nonlinear model.
spline(x,y): This performs cubic spline interpolation of the given data points.
predict(fit,[…]): This is a generic function that computes predictions from a fitted model, optionally on new input data.
df.residual(fit): This returns the residual degrees of freedom of fit.
coef, residuals, and deviance: These return coefficients, residuals, and deviance of models fitted.
logLik(fit): This calculates the log likelihood of the model fitted.
aov(frml): This performs analysis of variance model calculations on frml.
anova(fit,[…]): This computes analysis of variance tables for one or more fitted models. (Anova, with a capital A, is a separate function provided by the car package.)
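To make a few of these utilities concrete, the following is a minimal sketch on made-up data. The variable names, the quadratic objective passed to optim(), and the choice of the BFGS method are assumptions for illustration only:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

sum(x)                                 # 15
cumsum(x)                              # 1 3 6 10 15
diff(x)                                # 1 1 1 1
prod(x)                                # 120
var(x)                                 # 2.5
all.equal(var(x, y), cov(x, y))        # TRUE: var(x, y) is the covariance
quantile(x, probs = c(0.25, 0.75))     # 2 and 4
weighted.mean(c(1, 2, 3), c(3, 1, 1))  # 1.6
round(pi, 2)                           # 3.14

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                       # row sums: 9 12
apply(m, 2, sum)                       # column sums: 3 7 11
means <- lapply(list(a = 1:3, b = 4:6), mean)   # list(a = 2, b = 5)

# Fit a straight line and inspect the fitted model.
fit <- lm(y ~ x)
coef(fit)                              # intercept and slope
df.residual(fit)                       # 3 (5 points minus 2 coefficients)
pred <- predict(fit, newdata = data.frame(x = 6))
aov_table <- anova(fit)

# Minimize a simple quadratic; the minimum is at p = 2.
opt <- optim(par = 0, fn = function(p) (p - 2)^2, method = "BFGS")
opt$par
```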

b) Math and modeling packages

The following is a list of popular and mature sets of packages, which enhance the power of R:

arules: This is association rule mining
cluster, fpc, mclust: This is clustering and classification
DMwR, dprep, Rlof: These are for outlier detection
multicore, snow: These are parallel processing libraries
nlme: This is for linear and nonlinear mixed-effects modeling
TraMineR: This is sequential pattern mining
party and rpart: These are recursive partitioning, decision trees, and survival analysis
nnet: This is neural networks
kernlab and e1071: These provide support vector machines, PCA, Naive Bayes, fuzzy clustering, and so on.
stats, ast, forecast: This is for time series analysis
RgoogleMaps, ggmap, plotKML, and spdep: These are for spatial analysis
sna, network, and igraph: These are for social network analysis
tm, lda, topicmodels, RTextTools, and tau: These are for text mining

4. Plotting

Statistical analysis and data science are way too difficult without graphs and visualization. R has a rich set of utilities and libraries for plotting. Let’s have a look at a few of these:

plot(y): This plots the values of y on the y axis ordered by indices on the x axis.
plot(x,y): This plots values on the x and y axis, respectively.
barplot(x): This is a bar plot of the values of x.
hist(x): This is a histogram of frequencies of the elements of x.
pie(x): This is a pie chart for the elements of x.
boxplot(x): This is a boxplot for the elements of x.
plot.ts(x): This plots x as a time series.
mosaicplot(x): This is a mosaic plot of a contingency table; the tiles can optionally be shaded by the residuals of a log-linear model.
contour(x,y,z): This is a contour plot of z over x and y, where x and y must be vectors and z should be a matrix of dimension length(x) X length(y).
qqplot(x,y): This is a quantile-quantile plot of y with respect to x.
abline(c,m): This draws a line with the c intercept and the m slope; note that the intercept comes first. This can also be used to draw horizontal, vertical, and regression lines.
rect(x1,y1,x2,y2): This draws a rectangle, based on two opposite corners, conventionally the bottom-left (x1,y1) and top-right (x2,y2) coordinates.
polygon(x,y): This draws a polygon, connecting the elements of x and y.
xlim, ylim: These arguments set the x and y limits of a graph.
col: This argument sets the line or symbol color.
text(), title(), and legend(): These are for text, title, and legends on a graph.
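The following sketch combines several of these plotting utilities. The data is made up, and the output is written to a temporary PDF file (an assumption made here so that the example also runs without a display):

```r
# Open a PDF graphics device at a hypothetical temporary path.
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)
par(mfrow = c(2, 2))                  # arrange plots in a 2 x 2 grid

x <- 1:20
y <- 2 * x + 1

# Scatter plot with axis limits, a colored reference line, and a legend.
plot(x, y, xlim = c(0, 25), ylim = c(0, 50), col = "blue")
abline(1, 2, col = "red")             # intercept 1, slope 2
title("Scatter plot with a line")
legend("topleft", legend = c("points", "line"),
       col = c("blue", "red"), pch = c(1, NA), lty = c(NA, 1))

hist(c(1, 2, 2, 3, 3, 3, 4, 4, 5), main = "Histogram")
barplot(c(a = 3, b = 5, c = 2), main = "Bar plot")

boxplot(y, main = "Boxplot")
text(1.3, 30, "label")                # add text at the given coordinates

dev.off()                             # close the device and write the file
```

In an interactive session, the pdf() and dev.off() calls can be dropped so that the plots appear in the default graphics window.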