R is a programming language designed for statistical computing and graphics. It is free, open source, and available on Linux, macOS, and Windows. R is an essential language in data science. R was built to be fast, and it is used by more than 10% of all statisticians and data scientists.
As R is a community-driven language and software platform, it thrives and improves on user contributions. In this article, we will present R cheat sheets that are organized in the following manner:
1. Data processing and transformation
For any kind of analysis, input/output and transformation of data are core tasks. R is a robust platform with many features that we will cover in the following sections.
a) Data handling
To extract and load data for any kind of analysis, R provides pretty powerful and easy-to-use utility functions. Some of these are listed, as follows:
- read.csv(<file_name>): This imports a standard .csv file
- write.csv(<object_name>,<file_name>): This exports to a .csv file
- data(<dataset_name>): This loads R’s built-in dataset
- head(<object>): This prints the first few entries of the data imported
- names(<object>): This lists variables in an object
- read.table(<file_name>): This reads contents from an ASCII file
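As a minimal sketch of the utilities above, the following round trip writes a built-in dataset to a .csv file and reads it back. The choice of the mtcars dataset and the temporary file path are assumptions made for illustration only.

```r
# Load one of R's built-in datasets (mtcars is chosen here only as an example)
data(mtcars)

# Hypothetical output location: a temporary .csv file
csv_path <- tempfile(fileext = ".csv")

# Export the data frame; row.names = FALSE leaves the row labels out of the file
write.csv(mtcars, csv_path, row.names = FALSE)

# Import the file again and inspect the result
cars <- read.csv(csv_path)
head(cars, 3)   # first three rows
names(cars)     # column (variable) names
```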
b) Basic data types
Data types form the basic constructs of R, as they do in any other language. What sets R apart is its extended list of basic data types for handling varied kinds of data. These are as follows:
- numeric (integer and double) and character: These are the basic data types for numbers and text
- factor and complex: A factor allows you to store categorical data, while the complex data type is used for complex numbers
- is.<data_type> and as.<data_type>: These are used to check data types and to convert between types, respectively
- length(<variable>): This gives you the number of elements in a variable (use nchar(<variable>) to count the characters in a string)
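The type checks and conversions above can be sketched as follows. The variable names and sample values are made up for illustration.

```r
x <- c(1.5, 2, 3)
is.numeric(x)          # TRUE
y <- as.integer(x)     # 1 2 3 (the fractional part is dropped)
is.integer(y)          # TRUE
s <- as.character(x)   # "1.5" "2" "3"

# A factor stores categorical data; its levels are sorted alphabetically by default
f <- factor(c("low", "high", "low"))
levels(f)              # "high" "low"

# A complex number
z <- complex(real = 1, imaginary = 2)
is.complex(z)          # TRUE

length(s)              # 3, the number of elements in s
nchar("hello")         # 5, the number of characters in the string
```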
2. Data structures
R provides many data structures out of the box, which we discuss in the following subsections.
a) Vectors
A vector is the most basic data structure in R. It is similar to a mathematical vector. The following are ways to interact with a vector r in R:
- r[i]: This allows you to access the ith element using square brackets. The element count begins from 1.
- r[r > 100]: Vectors support logical expressions as indices. The given example returns the elements of r that are greater than 100.
- r[5:10]: Vectors support subselection. The given example returns the vector values from index 5 to index 10, inclusive.
- r[-1]: This returns all elements except the first one.
- factor(x): This converts a vector x to factor.
- which.max(x) and which.min(x): These return the indices of the maximum and minimum values of x, respectively.
- rev(x): This reverses the elements of x.
- table(x): This gives you the frequency table for elements of the x vector.
- match(a,b): This returns, for each element of a, the position of its first match in b; elements with no match give NA.
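A short sketch of these vector operations, using a made-up vector r, may help to make the indexing rules concrete:

```r
r <- c(50, 120, 75, 300, 10, 220)

r[2]           # 120: indexing starts at 1
r[r > 100]     # 120 300 220: logical expression as an index
r[2:4]         # 120 75 300: subselection by a range of indices
r[-1]          # 120 75 300 10 220: everything except the first element

which.max(r)   # 4: the position of the largest value (300)
which.min(r)   # 5: the position of the smallest value (10)
rev(r)         # 220 10 300 75 120 50

table(c("a", "b", "a"))   # frequency table: a appears twice, b once
match(c(75, 999), r)      # 3 NA: 75 is at position 3, 999 is not in r
```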
b) Arrays and matrices
R supports multidimensional arrays. A matrix is a two-dimensional array. The following are access patterns for these data structures:
- array(<vector>,<vector_dimensions>): This generates an array from an input vector
- %o%: This gives you the outer product of two arrays
- x[a,b,c]: This accesses an element of an array; the indices, one per dimension, are comma-separated within square brackets
- matrix(<vector>,nrow=r,ncol=c): This generates an r x c matrix with values from <vector>
- t(<matrix>): This is the transpose of a matrix
- diag(<matrix>): This gives the diagonal of a matrix
- colSums(<matrix>) and rowSums(<matrix>): These calculate the sums of the columns and rows of a matrix, respectively
- colMeans(<matrix>) and rowMeans(<matrix>): These calculate the means of the columns and rows of a matrix, respectively
- %*%: This is a matrix multiplication operator
- lower.tri(<matrix>): This returns a logical matrix that is TRUE in the lower triangle; for a matrix m, m[lower.tri(m)] extracts the values from the lower triangle
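The following is a minimal sketch of the array and matrix utilities. The object names (m, s, a) and the values are arbitrary. Note that the column and row helpers are spelled with capital letters in base R: colSums, rowSums, colMeans, and rowMeans.

```r
# matrix() fills column by column, so m is:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2, ncol = 3)

t(m)           # transpose, a 3 x 2 matrix
colSums(m)     # 3 7 11
rowSums(m)     # 9 12
colMeans(m)    # 1.5 3.5 5.5
rowMeans(m)    # 3 4
m %*% t(m)     # 2 x 2 matrix product: rows (35, 44) and (44, 56)
1:3 %o% 1:2    # 3 x 2 outer product

s <- matrix(1:9, nrow = 3)
diag(s)             # 1 5 9
s[lower.tri(s)]     # 2 3 6: values below the diagonal

# A three-dimensional array, indexed with one position per dimension
a <- array(1:24, dim = c(2, 3, 4))
a[1, 2, 3]          # 15
```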
c) Lists
A list is an ordered collection of named or unnamed objects, which may or may not be homogeneous. Lists are recursive data structures; that is, a list's element can itself be a list. A list can be manipulated using the following:
- list(<object_1>,<object_2>,…): This generates a list of objects that are separated by a comma
- L[[i]]: Double square brackets access the element at the ith index of the list
- length(<list>): This returns the count of the top-level elements of a list
- L$<name>: The $ operator allows access to the <name> element of list L; this is the same as L[["<name>"]]
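As a brief illustration, the list below mixes types and nests another list. The element names are hypothetical.

```r
L <- list(name = "R", versions = c(3, 4), nested = list(a = 1, b = "x"))

length(L)      # 3: only top-level elements are counted
L[[2]]         # 3 4: access by position
L$name         # "R": access by name
L[["name"]]    # "R": equivalent to L$name
L$nested$b     # "x": lists can contain lists
```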
d) Data frames
Data frames are tabular structures that can have columns of different data types and attributes. A data frame may contain components of the numeric, character, factor, or list types, or it may contain other data frames. The following utilities help in manipulating data frames:
- data.frame(col1=<object1>,col2=<object2>,…): This generates a data frame with n columns or components, which have values from corresponding objects
- attach(<data.frame>): This exposes components of a data frame in a search path for easy access
- merge(x,y): This combines two data frames based on common columns or row names
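Here is a small sketch of building and merging data frames. The people and scores tables are invented for the example; by default, merge joins on the columns the two data frames share and keeps only the matching rows.

```r
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
scores <- data.frame(id = c(2, 3, 4), score = c(88, 92, 75))

# Join on the shared column "id"; ids 1 and 4 have no partner and are dropped
merged <- merge(people, scores)
merged
#   id name score
# 1  2  Bob    88
# 2  3   Cy    92

# Columns are accessed like list elements
merged$score   # 88 92
```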
e) General utilities
Apart from the utilities and the other constructs that we just discussed, R provides a rich set of general utilities to make data analysis even easier. Check out the following utilities:
- c(1:5): This is a generic function that concatenates values. The given example would generate a vector with values 1 to 5.
- rep(<value>,<count>): This generates a vector with repeating <value> elements of the <count> size.
- seq(from,to): This generates a sequence vector starting with from and ending with to. You can also specify the increment with by; the default is 1.
- sort(c(10,9,8,7)): This returns a sorted vector 7,8,9,10.
- order(c(10,9,1,2)): This returns the indices that would sort the vector in ascending order, that is, 3,4,2,1.
- rank(c(10,5,6,9)): This returns the rank order of elements as 4,1,2,3.
- summary(<object>): This has summary details, such as min, max, mean, median, and so on, for the object.
- choose(n,k): This returns the number of combinations of k elements chosen from n, that is, the binomial coefficient.
- na.omit(x): This removes all the missing values (NAs) from x.
- na.fail(x): This errors out if x contains even a single missing value.
- unique(x): This returns only distinct or unique values of x. This works with vectors and data frames.
- paste(…): This converts objects to strings and concatenates them.
- substr(cv,start,stop): This extracts substrings from the cv character vector, from the start to the stop position.
- grep(ptrn,cv): This searches for the ptrn pattern in the cv vector and returns the indices of the matching elements.
- gsub(ptrn,rep,cv): This replaces all matches of the ptrn pattern with the rep replacement in the cv vector.
- tolower and toupper: This converts character vector elements to lowercase and uppercase, respectively.
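The general utilities can be tried out with a few throwaway values, as in this sketch. Note that sort, order, and rank each take a single vector, so the values are wrapped in c().

```r
c(1:5)                       # 1 2 3 4 5
rep("a", 3)                  # "a" "a" "a"
seq(2, 10, by = 2)           # 2 4 6 8 10
sort(c(10, 9, 8, 7))         # 7 8 9 10
order(c(10, 9, 1, 2))        # 3 4 2 1: positions of the values in sorted order
rank(c(10, 5, 6, 9))         # 4 1 2 3
choose(5, 2)                 # 10 ways to pick 2 items out of 5
na.omit(c(1, NA, 3))         # keeps 1 and 3
unique(c(1, 1, 2))           # 1 2

paste("R", 4, sep = "-")     # "R-4"
substr("statistics", 1, 4)   # "stat"
grep("an", c("banana", "cherry", "mango"))   # 1 3: indices of matching elements
gsub("a", "o", "banana")     # "bonono"
toupper("r")                 # "R"
```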
3. Math and modeling
R has a rich set of inbuilt functions and packages to perform mathematical and modeling operations.
a) Math and modeling utilities
As R is a statistical language, it provides a rich set of mathematical functions that are available right out of the box (while more can be added using additional libraries or packages):
- sum(x): This is the sum of the elements of x.
- cumsum(x): This calculates the cumulative sum of the elements of x.
- diff(x): This is the difference between consecutive elements of vector x.
- prod(x): This is the product of the elements of x.
- mean(x) and median(x): These are the mean and median of x, respectively.
- var(x,y): This is the covariance between x and y, the same as cov(x, y). It works with matrices and data frames as well. With a single argument, var(x) is the variance of x.
- quantile(x,probs): This returns the quantile breakup of x for given probabilities.
- sd(x): This is the standard deviation for x.
- weighted.mean(x,w): This returns the weighted mean of x using the w weight vector.
- cor(x,y): This is the linear correlation between x and y.
- round(x,n): This rounds the elements of x to n digits.
- log(a,b): This calculates the log of a for base b.
- sin, cos, tan, asin, acos, atan, and so on: These are Trigonometric functions.
- exp(x): This exponentiates each element of the x vector.
- scale(m): This centers and scales the columns of an m numeric matrix.
- union(x,y), intersect(x,y), and is.element(e,x): These are Set functions that are also available.
- Conj(c): This returns the conjugate of the c complex number.
- rnorm, rpois, rgamma, rexp, rcauchy, rt, and so on: These can be used to generate random values from the Gaussian, Poisson, Gamma, Exponential, Cauchy, and Student's t distributions.
- fft(x): This calculates Fast Fourier Transform of the elements of x.
- apply(m,INDEX,FUNC): This applies the FUNC function over the INDEX margin of the m matrix (1 for rows, 2 for columns).
- lapply(l,FUNC): This applies the FUNC function to each element of the l list and returns a list.
- optim(params, func, mtds): This is the general-purpose method to optimize a func function, starting from the params initial parameter values and using the mtds method.
- lm(frml): This fits a linear model on the frml formula. This is used for regression and covariance analysis. Also, check glm for generalized linear models.
- nls(fml): This fits nonlinear least squares estimates for nonlinear models.
- spline(s): This calculates the cubic spline.
- predict(fit,[…]): This is a generic function to generate predictions from a fitted model, optionally on new input data.
- df.residual(fit): This returns the residual degrees of freedom of fit.
- coef, residuals, and deviance: These return coefficients, residuals, and deviance of models fitted.
- logLik(fit): This calculates the log likelihood of the model fitted.
- aov(frml): This performs analysis of variance model calculations on frml.
- anova(fit,[…]): This computes analysis of variance tables for one or more fitted models.
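To tie a few of these functions together, here is a minimal sketch that computes some summary values and fits a simple linear model. The x and y values are invented sample data, not taken from any real dataset.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)   # made-up, roughly linear in x

sum(x)        # 15
cumsum(x)     # 1 3 6 10 15
diff(x)       # 1 1 1 1
prod(x)       # 120
mean(x)       # 3
median(x)     # 3
cor(x, y)     # linear correlation, close to 1 for this data
round(3.14159, 2)   # 3.14

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                       # 9 12: sum over each row
lapply(list(a = 1:3, b = 4:6), mean)   # a list with a = 2 and b = 5

# Fit y as a linear function of x and inspect the model
fit <- lm(y ~ x)
coef(fit)                # intercept and slope
residuals(fit)           # one residual per observation
df.residual(fit)         # 3: five observations minus two coefficients
predict(fit, newdata = data.frame(x = 6))   # prediction for a new x value
```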
b) Math and modeling packages
The following is a list of popular and mature sets of packages, which enhance the power of R:
- arules: This is association rule mining
- cluster, fpc, mclust: This is clustering and classification
- DMwR, dprep, Rlof: These are for outlier detection
- multicore, snow: This is a multiprocessing library
- nlme: This is regression, linear, and nonlinear modeling
- TraMineR: This is sequential pattern mining
- party and rpart: These are recursive partitioning, decision trees, and survival analysis
- nnet: This is neural networks
- kernlab and e1071: These are for Support Vector Machines, PCA, Naive Bayes, fuzzy clustering, and so on
- stats, ast, forecast: This is for time series analysis
- RgoogleMaps, ggmap, plotKML, and spdep: These are for spatial analysis
- sna, network, and igraph: These are for social network analysis
- tm, lda, topicmodels, RTextTools, and tau: These are for text mining
4. Plotting
Statistical analysis and data science are much harder without graphs and visualization. R has a rich set of utilities and libraries for plotting. Let's have a look at a few of these:
- plot(y): This plots the values of y on the y axis ordered by indices on the x axis.
- plot(x,y): This plots values on the x and y axis, respectively.
- barplot(x): This is a bar plot of the values of x.
- hist(x): This is a histogram of frequencies of the elements of x.
- pie(x): This is a pie chart for the elements of x.
- boxplot(x): This is a boxplot for the elements of x.
- plot.ts(x): This is a plot with respect to time.
- mosaicplot(x): This is a mosaic plot of a contingency table x; with shade=TRUE, the cells are colored by the residuals of a log-linear model.
- contour(x,y,z): This is a contour plot, where x and y must be vectors and z must be a matrix with length(x) rows and length(y) columns.
- qqplot(x,y): This is a quantile-quantile plot of the quantiles of y against the quantiles of x.
- abline(c,m): This draws a line with the c intercept and the m slope. This can also be used to draw horizontal, vertical, and regression lines.
- rect(x1,y1,x2,y2): This draws a rectangle, based on the bottom-left (x1,y1) and top-right (x2,y2) coordinates.
- polygon(x,y): This draws a polygon, connecting the elements of x and y.
- xlim, ylim: These arguments set the x and y limits of a graph.
- col: This graphical parameter sets the line or symbol color.
- text(), title(), and legend(): These are for text, title, and legends on a graph.
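A minimal sketch that combines several of these plotting functions follows. It draws to a temporary PDF file rather than the screen so that it runs anywhere; the file path, data, colors, and labels are all assumptions made for the example. Note that abline takes the intercept first and the slope second.

```r
# Hypothetical output location; pdf() opens a file-based graphics device
plot_path <- tempfile(fileext = ".pdf")
pdf(plot_path)

x <- 1:10
y <- 2 * x + 1

# Scatter plot with explicit axis limits and a point color
plot(x, y, xlim = c(0, 11), ylim = c(0, 25), col = "blue")

# Line with intercept 1 and slope 2, which passes through every point
abline(1, 2, col = "red")

# Annotations
text(3, 20, "y = 2x + 1")
title("A simple scatter plot")
legend("bottomright", legend = c("points", "line"),
       col = c("blue", "red"), pch = c(1, NA), lty = c(NA, 1))

# Close the device so that the file is written
invisible(dev.off())
```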
Let’s now take a look at some plotting packages.
- ggplot2: This is the de facto graphics grammar for R
- ggvis: This is a rich and powerful plotting library
- googleVis: This brings the power of Google Visualization APIs to R
- lattice: This is specialized for multivariate data
- iplots: These are interactive plots