This tutorial, Basic Analytic Techniques - Using R, is offered by Simplilearn and is part of the Data Science with R Language Certification Training course.
After completing the Basic Analytic Techniques - Using R Tutorial, you will be able to:
Understand the basics of R and basic data exploration.
Learn how to explore data using R.
Visualize data using R.
Perform Data analysis using R.
Conduct Basic tests of diagnostic analytics.
Implement diagnostic analytics using R.
Let’s start our lesson with a basic introduction to R. R has become one of the most popular tools for data mining. Let’s look at the reasons for using R and cover some basic knowledge of R.
R is a freely available programming language for statistical computations and graphics.
R is published under the GNU General Public License.
R is mainly used in the fields of data mining and statistical analysis. Its applications include time series analysis, linear modeling, and nonlinear modeling.
The main advantage of R over other data mining tools is its active community, its built-in packages, and the packages contributed by community members.
Another reason for its popularity is that R needs very little programming knowledge to get started. You can download R from its official website: http://www.r-project.org/. The website has instructions on how to download and install R, along with the basic machine requirements.
RStudio is a freely available IDE for programming in R. Downloading RStudio is completely optional.
The programming in the following chapters will be taught using the R command line prompt. If you are using RStudio, you can type in the same commands as shown in the chapters.
Finally, you are strongly encouraged to check the community forums on R as you go through this lesson, to answer basic questions and to explore the different functionalities of R.
In the next section, let’s start with a very basic introduction to the commands in R.
Before we get into the specifics of statistical analysis in R, here are a few important commands that will be used throughout the lessons.
To install a particular package, use the install.packages() function.
This installation needs to be done only once.
Once a package is installed, its functions can be loaded into the current session by calling the library() function.
R comes with a set of built-in sample data sets, which we will use in our lessons. These data sets can be loaded using the data() function; the syntax is data(datasetname).
To use external data, the read functions, such as read.csv() and read.table(), are used.
The write functions, such as write.csv() and write.table(), write data from the R session to a file.
R’s default working directory depends on how R is started; on Windows it is typically the user’s Documents folder. It can be verified by using the getwd() function.
To change the working directory, use the setwd() function, with the full path name as the argument.
If the working directory is not set, the full path name needs to be specified for the read and write functions. For example, read.csv("C:/Rtutorials/Sampledata.csv").
Note that R uses the forward slash for specifying directories; on Windows, a backslash inside a path string must be doubled (\\).
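The commands above can be tried together in a short session. This is only a minimal sketch: the file name Sampledata.csv is taken from the example above, but it is written to a temporary folder (an assumption made here so that the sketch does not touch your own directories), and the install.packages() call is left commented out because it needs an internet connection.

```r
# Installing a package is needed only once, so it is commented out here.
# install.packages("MASS")

# Load a package into the current session.
# (datasets is already attached by default; it is used only to show the call.)
library(datasets)

# Load a built-in sample data set by name.
data(iris)

# Show the current working directory.
print(getwd())

# Write a data frame to a CSV file, then read it back.
# tempdir() is an assumption of this sketch; any writable folder would do.
csv_path <- file.path(tempdir(), "Sampledata.csv")
write.csv(iris, csv_path, row.names = FALSE)
sample_data <- read.csv(csv_path)

# The data read back has the same shape as the original.
print(dim(sample_data))
```

Passing row.names = FALSE keeps write.csv() from adding an extra column of row numbers, so the file reads back with the same five columns.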
The assignment operator in R is different from the equal to operator.
To assign a value to a variable in R, use the <- operator, for example, x <- 5.
Before getting into the functions and commands for data exploration, let’s look at how the data is stored in R.
Data in R is commonly stored in “data frames”, which are tabular structures of rows and columns.
There are other data structures as well, including tables, matrices, vectors, and single values.
Throughout our tutorial, we will primarily use data frames and time series data in the later chapters. In this section, you can see a sample data frame.
The iris data set is a very popular, commonly used data set introduced by Sir Ronald Fisher. It contains 150 entries belonging to three different species, with four features for each flower: sepal length and width, and petal length and width. Each of the three species has 50 entries.
As shown in the data frame, each row denotes a particular case, or in this context, features of a particular flower. The columns denote the different attributes measured.
In the next section, we will look at different commands to view data.
Let’s start with basic commands to view the data in R. Throughout this lesson, we will look at commands, and the screenshots of R will display a sample output. You are encouraged to pause the video and try out the commands on your command prompt for a better understanding.
The iris dataset is loaded by default in R.
To view the dataset, just type the dataset name on the prompt.
To view the top few records, use the head() function. The syntax is head(datasetname, number of rows). The number of rows is an optional argument, and the default number of rows is 6.
To view the last few records, the tail() function is used. The syntax is similar to the head function – tail(datasetname, number of rows).
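As a quick illustration of the two functions on the iris data set (the variable names below are only for this sketch):

```r
# First 6 rows (the default number of rows).
first_default <- head(iris)
print(first_default)

# First 10 rows.
first_ten <- head(iris, 10)
print(first_ten)

# Last 3 rows.
last_three <- tail(iris, 3)
print(last_three)
```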
In the next section, we will look at commands to view the dimensions of data.
Listed here are a few commands to view the dimensions of a data set.
Command | What does it do?
dim | Gives a vector result of the number of rows followed by the number of columns.
ncol | Gives the number of columns separately.
nrow | Gives the number of rows separately.
Try these commands with the sample iris dataset.
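For the iris data set, the three commands can be sketched as follows (the variable names are only for illustration):

```r
# Rows followed by columns, as a two-element vector.
iris_dim <- dim(iris)
print(iris_dim)    # 150 5

# Number of columns and number of rows separately.
iris_cols <- ncol(iris)
iris_rows <- nrow(iris)
print(iris_cols)   # 5
print(iris_rows)   # 150
```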
In the next section, we will look at attributes of the dataframe.
The attributes of the data frame can be viewed as follows.
To view the column names of a particular dataset, type names(dataset name).
To view all attributes, type attributes(dataset name).
Given here is a snapshot of the two commands on the iris data set. You can see that the column names are displayed as strings in the first output.
The attributes() function displays the column names, row names, and class of the dataset.
The class() function is used to display the data type of its argument.
The syntax for class command is class(variablename). For example, class(iris) would display “data.frame”.
To know the type of a particular column, type class(iris$Sepal.Length). The $ sign is used in referencing the columns of a dataset.
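A short sketch of these commands on the iris data set, with the results stored in illustrative variables:

```r
# Column names, returned as a character vector.
column_names <- names(iris)
print(column_names)

# All attributes: column names, class, and row names.
print(attributes(iris))

# Data type of the whole data set and of individual columns.
frame_class <- class(iris)
length_class <- class(iris$Sepal.Length)
species_class <- class(iris$Species)
print(frame_class)     # "data.frame"
print(length_class)    # "numeric"
print(species_class)   # "factor"
```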
In the next few sections, we will look at subsetting data.
Subsetting data can be classified into two types: column subsetting and row subsetting.
Let us look at the key points in column subsetting.
To view data of a column, the following two notations could be used.
As mentioned in the previous section, dataset$column name would display the particular column.
Another way is to use a matrix form, that is, dataset[ , "column name"].
Instead of the column name, the column number could also be used.
In this case, petal length is the third attribute, hence iris[ , 3] would also give the same result.
In the next section, we will discuss row subsetting.
Similar to columns, row data can also be subsetted using the square brackets. The format is datasetname[row numbers, ] to display all columns, or datasetname[row numbers, column name/numbers] to display particular rows of particular columns.
You can see the screenshot for the subsets using square brackets on the iris data set. Also, note that if a single column is selected, the output is a vector by default, whereas a single row of a data frame is still returned as a data frame. This concludes viewing data and exploration in R. You are advised to try out all these commands in R or RStudio for better understanding.
Next, we will look at ways of summarizing data.
The summary function in R can be described as follows.
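The column and row notations can be compared side by side. The sketch below uses the iris data set; the variable names are only for illustration.

```r
# Three equivalent ways to select the petal length column.
petal_a <- iris$Petal.Length
petal_b <- iris[, "Petal.Length"]
petal_c <- iris[, 3]
print(identical(petal_a, petal_b))   # TRUE
print(identical(petal_a, petal_c))   # TRUE

# Rows 1 to 5, all columns.
first_rows <- iris[1:5, ]
print(first_rows)

# Rows 1 to 5, two particular columns.
some_cells <- iris[1:5, c("Sepal.Length", "Species")]
print(some_cells)

# A single column comes back as a vector; a single row stays a data frame.
one_row <- iris[1, ]
print(class(petal_c))   # "numeric"
print(class(one_row))   # "data.frame"
```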
The summary function is a generic function in R that displays summaries of data or models, as we will see in later chapters. The syntax is summary(data frame).
As shown in the screenshot, the summary function displays the minimum value, maximum value, mean, median, and first and third quartiles of every numeric column.
For categorical data, like Species, it displays a table of the different values and their frequencies.
The table can be displayed separately by giving table(dataframe$column name).
Look at the example – table(iris$Species) displays a frequency distribution of the three different classes. Note that summaries for individual columns can also be obtained by using the summary function, and giving the referenced column name as an argument. For example, summary(iris$Sepal.Length) would display the results for Sepal length column alone.
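The calls described above can be sketched on the iris data set as follows:

```r
# Summary of every column in the data frame.
print(summary(iris))

# Frequency table of the categorical Species column.
species_counts <- table(iris$Species)
print(species_counts)

# Summary of a single numeric column.
sepal_summary <- summary(iris$Sepal.Length)
print(sepal_summary)
```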
In the next section, we will look at a list of commands for the individual summary statistics.
The summary function displays all the summary statistics for the particular data. Here you can see a list of commands to display individual summary statistics.
The argument for each function is the column name for which the statistics are to be obtained.
The commands for individual summary statistics are –
min – to get the minimum value
max – to get the maximum value
range – to get the minimum and maximum values of the data as a two-element vector; the spread, that is, maximum minus minimum, is then diff(range(x))
mean – to get the average value
median – to get the median or middle value
IQR – to get the interquartile range of the data, that is, the difference between the first and third quartiles
sd – to get the standard deviation
var – to get the variance.
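As a sketch, here are the individual statistics for the sepal length column of the iris data set. The variable x is only a shorthand introduced for this example.

```r
x <- iris$Sepal.Length

print(min(x))           # minimum value
print(max(x))           # maximum value
print(range(x))         # minimum and maximum together
print(diff(range(x)))   # maximum minus minimum
print(mean(x))          # average value
print(median(x))        # middle value
print(IQR(x))           # third quartile minus first quartile
print(sd(x))            # standard deviation
print(var(x))           # variance (the square of the standard deviation)
```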
The aggregate function is used to group data by the values of a particular column.
The compulsory arguments are the formula for aggregation and the function for aggregation.
The first command aggregates all the columns, as denoted by the dot symbol, by the value of Species in the iris dataset, using the average of all values.
You can see that the result has three rows, one for each class, with the mean values for each column. The second command aggregates only the sepal length column by species, again in the iris dataset.
In addition to the mean function, the sum function is commonly used in aggregation. You can see the other available options by typing ?aggregate to see the help content.
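The two aggregations described above can be sketched with the formula interface of aggregate(). The exact commands from the screenshot are not reproduced in the text, so the calls below are an assumption based on the description.

```r
# Mean of every column (the dot) for each value of Species.
all_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(all_means)

# Mean of the sepal length column only, for each value of Species.
sepal_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(sepal_means)
```

Replacing FUN = mean with FUN = sum would give group totals instead of group averages.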
Next, we will look at ways to visualize data in R.
plot() is a generic function used for plotting data in R. The function can be used to plot a variety of graphs on a variety of data, including data frames, time series, and even vectors.
The plot function creates a scatter plot by default. Other plots can be created using the type argument. Shown here is a plot of iris data.
Plotting a data frame creates a pairwise plot of all the attributes in the data frame.
In the next section, we will look at a simple scatter plot.
The plot function can be used to create scatter plots of one variable against another. For example, let us plot sepal length against species. We will use a few optional arguments of the plot function –
main: to specify the title of the plot
xlab: the x-axis label
ylab: the y-axis label.
The function would now be - plot(iris$Sepal.Length, iris$Species, main = "Iris Data", xlab = "Sepal Length", ylab = "Species"). The output plot is shown below. From the scatter plot, it is easier to notice the differences in the sepal length according to species.
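Both kinds of plot can be tried as follows. This is a minimal sketch; note that because Species is a factor, the y-axis of the second plot shows its numeric codes (1, 2, and 3) rather than the species names.

```r
# Pairwise plot of all the attributes in the data frame.
plot(iris)

# Scatter plot of sepal length against species, with a title and axis labels.
plot(iris$Sepal.Length, iris$Species,
     main = "Iris Data",
     xlab = "Sepal Length",
     ylab = "Species")

# The numeric codes used on the y-axis map to these species names.
species_levels <- levels(iris$Species)
print(species_levels)
```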
Next, we will look at pie charts.
Pie charts are the simplest form of visualizing the numerical proportion of the different classes through the sectors of the circle. The pie function is used to create pie charts in R.
The table() function is used to create a frequency table and then the pie function is called to create a chart of the table. The main attribute, as mentioned before, is used to specify the title for the chart.
Here is an example chart showing the different species of iris data. The circle is divided into three equal sectors for the three species.
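A minimal sketch of the pie chart described above; the title text is only an illustrative choice.

```r
# Build the frequency table of species, then chart it.
species_counts <- table(iris$Species)
pie(species_counts, main = "Iris Species")
```

Because each species has the same count, the three sectors come out equal.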
In the next section, we will look at bar charts.
Bar plots are used to depict values as bars, with the height of each bar equal to the value being shown. For this example, we will use another built-in dataset, USPersonalExpenditure.
The data set is displayed in the table. It contains personal expenditure data for several categories across the years 1940 to 1960. In R, bar plots can be created using the barplot() function.
As with the plot function, the main, xlab, and ylab arguments can be used for labeling.
This section shows an example function call that creates a bar plot from the expenditure data, along with the resulting bar plot.
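One possible call is sketched below. The labels and the legend placement are assumptions made for this example, not necessarily the ones used in the original screenshot.

```r
# USPersonalExpenditure is a matrix: categories in rows, years in columns.
print(USPersonalExpenditure)

# One stacked bar per year, with a legend built from the row names.
barplot(USPersonalExpenditure,
        main = "US Personal Expenditure",
        xlab = "Year",
        ylab = "Expenditure (billions of dollars)",
        legend.text = TRUE,
        args.legend = list(x = "topleft"))
```

Adding beside = TRUE would draw the categories side by side instead of stacking them.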
In the next section, we will create box plots.
Box plots are used to show numerical data with their quartile ranges. Also called a box-and-whisker plot, the box shows the interquartile range, with the middle line marking the median.
The edges of the box mark the lower and upper quartiles, the whiskers extend to the most extreme values that are not outliers, and the individual points show the outliers. Box plots are very useful for detecting outliers.
A box plot in R can be created using the boxplot() function.
Let’s see the example function displayed above.
The first argument is a formula giving the features to be plotted, that is, sepal length against species; the remaining arguments give the data and the labeling information.
On typing this into the R prompt, you will get a graph similar to the one shown in this section.
You can see that there is an outlier in the Virginica species.
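A sketch of such a call is given below; the title and axis labels are illustrative choices. boxplot() also returns the statistics it draws, which gives a way to check the outliers without reading them off the graph.

```r
# Sepal length for each species, with a title and axis labels.
sepal_box <- boxplot(Sepal.Length ~ Species, data = iris,
                     main = "Sepal Length by Species",
                     xlab = "Species",
                     ylab = "Sepal Length (cm)")

# Values drawn as outlier points, and the group each one belongs to.
print(sepal_box$out)
print(sepal_box$names[sepal_box$group])
```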
In the next section, we will look at histograms.
Histograms are used to depict frequency distribution data. R has a default islands dataset that is best suited to create histograms. Once you are done with this section, you are encouraged to try creating a histogram using the islands data.
In R, histograms can be created using the hist() function. The first argument gives the numeric vector to plot, and the usual labeling arguments can be used to label the plot.
It can be seen from the plot that the sepal length is mostly concentrated around 4.5 to 6.5. For numerical data such as sepal length, the data is put into buckets and the histograms are created. This section concludes data visualization.
In addition to these arguments, plots can take other arguments such as x- and y-axis limits and colors for points, bars, or lines.
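A histogram of sepal length, with an axis limit and a fill color added, might look like the sketch below. The title, limits, and color are illustrative choices.

```r
# Histogram of sepal length with labels, x-axis limits, and a fill color.
sepal_hist <- hist(iris$Sepal.Length,
                   main = "Histogram of Sepal Length",
                   xlab = "Sepal Length (cm)",
                   xlim = c(4, 8),
                   col = "lightblue")

# The buckets used and the number of flowers in each bucket.
print(sepal_hist$breaks)
print(sepal_hist$counts)
```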
Next, we will look at methods of diagnostic analysis.
Let’s look at the function to test the correlation between two variables.
Correlation is a statistical relationship between two variables that indicates some form of dependence between them. For example, is there a correlation between the height of parents and that of their offspring?
In the example given here, we try to find if there is a correlation between the sepal length and width of a flower.
In R, correlation can be calculated using the cor.test() function. By default, the function calculates Pearson’s correlation coefficient. Let’s look at how to interpret the results. The output shows the correlation method used and the data.
The important output for this test is the p-value, which is calculated using the t-statistic and degrees of freedom. As seen in earlier chapters, if the p-value is less than 0.05, we can reject the null hypothesis that there is no correlation between the two variables, that is, we conclude that a correlation exists.
The correlation coefficient is given as -0.1175698. This suggests there might be a weak negative correlation between the two variables, but since the p-value is above 0.05, the result is not significant, that is, we cannot conclude that the correlation differs from zero.
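The test can be run as in the sketch below, with the result stored in an illustrative variable so that the coefficient and the p-value can be read off directly.

```r
# Pearson's correlation test (the default method) on sepal length and width.
cor_result <- cor.test(iris$Sepal.Length, iris$Sepal.Width)
print(cor_result)

# The correlation coefficient and the p-value on their own.
print(cor_result$estimate)
print(cor_result$p.value)
```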
In the next section, we will see the analysis of variance.
Let us look at some important points about ANOVA:
Analysis of variance (ANOVA) is used to compare the means between different groups. Here, we show a simple example of a one way ANOVA. To illustrate, we use the InsectSprays dataset.
This data contains the insect counts after using 6 different sprays. Here, the null hypothesis is that the mean insect count is the same for all sprays. aov() takes as its first argument a formula with the dependent variable and the independent variable separated by a tilde.
Here, it is aov(count ~ spray). The data attribute specifies where the data is to be taken from. After fitting the aov() model, the summary function is used to display the result. It can be seen that the p-value, that is the last column is
Chi-squared test in R is used to calculate the goodness of a particular fit. It compares the observed values against expected values obtained from a null hypothesis. Here, we try to test the sepal
Paired t-tests are used to check if there is any difference in paired values, for example, marks obtained by a student before and after a training.
To illustrate, the anorexia dataset from the MASS package is used. The data contains pre-treatment and post-treatment weights of patients.
To implement it in R, type t.test(anorexia$Prewt, anorexia$Postwt, paired = TRUE).
The first two arguments are the features to be compared, and the argument paired = TRUE specifies that it is a paired t-test. Note that R is case-sensitive, so the logical value must be written as TRUE.
The low p-value suggests that the null hypothesis can be rejected, that is, there is a difference between the weights before and after treatment.
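The paired test can be sketched as follows. It assumes the MASS package is installed, which is normally the case because it ships with standard R distributions.

```r
# Load MASS to get access to the anorexia data set.
library(MASS)

# Paired t-test on the pre-treatment and post-treatment weights.
paired_result <- t.test(anorexia$Prewt, anorexia$Postwt, paired = TRUE)
print(paired_result)

# The p-value on its own.
print(paired_result$p.value)
```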
In the next section, we will look at the independent t-test.
Independent t-tests are used to compare the means of two groups where the values in one group are not directly related to the values in the other. For example, marks of students in two different schools.
Here, we implement a t-test on the sepal length and sepal width of the iris data set. From the result, it can be seen that the p-value is almost zero, and hence the null hypothesis that the two means are equal can be rejected.
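A minimal sketch of this test is shown below. By default, t.test() treats the two samples as independent and does not assume equal variances (the Welch test).

```r
# Independent two-sample t-test on sepal length and sepal width.
welch_result <- t.test(iris$Sepal.Length, iris$Sepal.Width)
print(welch_result)

# The p-value on its own.
print(welch_result$p.value)
```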
To quickly summarize what we have learned in this basic analytic techniques using R tutorial, we have discussed –
A basic introduction to R
Data exploration using R
Data visualizations in R
Pie Charts
Bar plots
Box plots
Histogram
Diagnostic analytics using R
Chi-Squared test
T-tests
Analysis of Variance
With this, we come to the end of the Basic Analytic Techniques - Using R Tutorial.