Introduction to R

A Short Introduction to R

1.1. Introduction

R is a software package used for data analysis and the graphical representation of data. It can be used as a statistical tool, and it is also a programming language, which makes it highly flexible and easy to customize. Its graphical tools make R a good environment for exploratory data analysis and for creating publication-ready figures (exportable, for example, as .jpg files). All work is done with text commands, unlike menu-driven programs such as SPSS that offer point-and-click options for predefined statistical procedures. Learning R takes substantial time, but once you have passed the first hurdles it is quite convenient to handle. It is not aimed at beginners; it is meant for advanced users for whom the statistical functions of Microsoft Excel are no longer sufficient, for example if you would like to do a Principal Component Analysis. In contrast to SAS and SPSS, which are costly commercial statistics programs, R is free software, distributed under the terms of the GNU General Public License (GPL).

The R Development Core Team maintains the base distribution of R. A large amount of further functionality is implemented in add-on packages authored and maintained by a large group of volunteers. The main source of information about the R system is the World Wide Web, with the official home page of the R project at

http://www.R-project.org

All resources are accessible from this page: the R system itself, a collection of add-on packages, manuals, documentation and more.

1.2. Installing R

The R system is made of two major parts: the base system and add-on packages contributed by users. The core R language is implemented in the base system, whereas implementations of statistical and graphical procedures are organized in the form of packages. A package is a collection of functions, examples and documentation, usually focused on a particular statistical methodology. The R software is distributed through the Comprehensive R Archive Network (CRAN), accessible at

http://CRAN.R-project.org

1.2.1 The Base System and the First Steps

Download the precompiled binary and install it on the local machine. For Windows users, the link is

http://CRAN.R-project.org/bin/windows/base/release.htm

Follow the step-by-step instructions given by the installer and you are done with the installation.

How R is started depends on the operating system. On Windows, click the R icon created by the installer; on Unix systems, type ‘R’ at the shell.

The user can change the appearance of the prompt with:

> options(prompt = "R> ")

1.2.2 Packages

The base distribution of R comes with the following packages:

Matrix boot lattice mgcv

rpart survival KernSmooth MASS

base class cluster codetools

compiler datasets foreign grDevices

graphics grid methods nlme

nnet parallel spatial splines

stats stats4 tcltk tools

utils

These packages implement standard statistical functionality, such as classical tests, linear models, and a large collection of high-level plotting functions. Packages that are not included in the base distribution can be installed directly from the R prompt.

For Windows users, precompiled versions of the packages are available; just download and install them. On Unix systems, packages are first compiled locally and then installed.
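As a minimal sketch of working with packages from the R prompt, the following checks whether a package is available and loads it. The choice of MASS is only an illustration, and the install.packages() call is left commented out because it assumes network access to a CRAN mirror.

```r
# Check whether a package is available before trying to load it.
pkg <- "MASS"   # illustrative choice; any package name can be used here
available <- requireNamespace(pkg, quietly = TRUE)

if (!available) {
  # install.packages(pkg)   # uncomment to download and install from CRAN
  message("Package ", pkg, " is not installed")
} else {
  library(pkg, character.only = TRUE)   # attach the package for this session
}
```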

1.3. Getting Started

R is a command-line based language, where all commands are entered directly at the console. In its simplest form, R can be used as a substitute for a pocket calculator. Type 4+3 into the console and press the Enter key; here is what appears on the screen:

> 4+3

[1] 7

Here the result is 7. The [1] says “the first requested element will follow”. Here, there is just one element. The > indicates that R is ready for another command.

Other simple operators include

4-3 # Subtraction

4*3 # Multiplication

4/3 # Division

4^3 # Exponentiation

sqrt(3) # Square roots

log(3) # Logarithms (to the base e)

One can use multiple operators, e.g.

(4 - 3) * 2

first subtracts 3 from 4 and then multiplies the result by 2.

Exit or quit command:

>q()

If commands are stored in an external file, say commands.R, in the working directory, they may be executed at any time in an R session with the command

>source("commands.R")

On Windows, Source is also available on the File menu. The function sink,

>sink("record.lis")

will divert all subsequent output from the console to an external file, record.lis. The command

>sink()

restores it to the console once again.
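The following sketch shows source() and sink() working together. The temporary file names stand in for commands.R and record.lis and are assumptions made only so that the example is self-contained.

```r
# Write a one-line script to a temporary file and run it with source().
script <- tempfile(fileext = ".R")
writeLines("total <- 4 + 3", script)
source(script)

# Divert printed output to a file, then restore output to the console.
record <- tempfile(fileext = ".lis")
sink(record)
print(total)
sink()

recorded <- readLines(record)   # the text that print() would have shown
recorded
```

After the final sink() call, output appears on the console again, and the diverted text can be read back from the file.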

1.4. Some Information on R Commands

Like most UNIX-based packages, R is a case-sensitive language with a simple syntax. Case sensitive means that in R capital A and small a are different symbols and would refer to different variables.

The set of symbols that can be used in R names depends on the operating system and the country in which R is being run. Normally all alphanumeric symbols are allowed (in some countries this includes accented letters), plus ‘.’ and ‘_’, with the restriction that a name must start with ‘.’ or a letter, and if it starts with ‘.’ the second character must not be a digit.

Separating Commands

Commands are separated either by a new line or by a semicolon (‘;’). Elementary commands can be grouped into one compound expression by braces (‘{’ and ‘}’).

Adding comments

A comment starts with a hash mark (‘#’); everything from there to the end of the line is a comment.
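A short sketch of these rules, using illustrative variable names that do not come from the text above:

```r
a <- 2; b <- 3        # two commands on one line, separated by a semicolon

# Braces group several commands into one compound expression;
# its value is the value of the last command inside the braces.
result <- {
  s <- a + b
  s * 2
}
result
```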

If a command is not complete at the end of a line, R gives a different prompt, by default +, on the second and subsequent lines and continues to read input until the command is syntactically complete. Command lines entered at the console are limited to about 4095 bytes (not characters).

R allows recalling and re-executing previous commands. With the vertical arrow keys on the keyboard one can scroll forward and backward through the command history. Once a command is located, one can move the cursor within the command with the horizontal arrow keys, and characters can be removed with the DEL key or added with the other keys.

1.5. Special Values

In R, the value NA is used to signify missing values. NA stands for “not available”. You will encounter NA values in text loaded into R or in data loaded from databases (where they replace NULL values).

When you expand the size of a vector, matrix or array beyond the values it already holds, the new positions will have the value NA:

> s <- c(5,7,9,11)

> s

[1] 5 7 9 11

> length(s) <- 6

> s

[1] 5 7 9 11 NA NA

Inf and -Inf

If the result of a calculation is a number too large to be represented, R returns Inf for a positive number and -Inf for a negative number:

> 3^1250

[1] Inf

> -3^1250

[1] -Inf

Dividing a nonzero number by 0 also returns Inf (or -Inf for a negative number):

> 3 / 0

[1] Inf
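Special values can be detected with helper functions. The sketch below uses is.na(), is.finite() and the na.rm argument of mean(), which are standard base R functions not discussed in the text above.

```r
s <- c(5, 7, 9, 11)
length(s) <- 6                    # the two new positions are filled with NA

is_missing <- is.na(s)            # TRUE where a value is missing
avg <- mean(s, na.rm = TRUE)      # drop the NAs before averaging
finite <- is.finite(c(1, Inf, -Inf, NA))   # only ordinary numbers are finite
```

Without na.rm = TRUE, mean(s) would itself be NA, because a missing value makes the result unknown.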

1.6. Objects

Simple calculations like those above do not produce output that R remembers: the answers are displayed in the console window and then lost. To use a result in further calculations, you need to give it a name and store it as an object in R.

answer<-3+2

tells R to add 3 and 2 and store the result in an object called answer. To retrieve the value stored in answer, just type the name of the object:

answer

The symbol <- in the middle is the assignment symbol. It consists of a “less than” sign and a hyphen, and it looks like an arrow pointing towards “answer”. It means “make the object on the left into the output of the command on the right”.

In earlier versions of R, and in S-Plus, the underscore character could be used for assignment, so if S-Plus code that uses it doesn’t work in R, this may be why.

One can use objects in calculations just as the numbers being used above.

answer2<- (5.5+2)^2

answer+answer2

[1] 61.25

You can store the results as another object.

answer3<-answer2/answer

answer3

[1] 11.25

When you first start R there are no stored objects, but after using it for a while there might be several. You can list what’s there with the ls() function:

ls()

[1] "answer" "answer2" "answer3"

To remove an object from R’s memory, use the rm() function.

rm(answer2)

Notice that when you type this it doesn’t ask you if you’re sure, or give you any other sort of warning, nor does it let you know whether it’s done as you asked. The object you asked it to remove has just gone: you can confirm this by using ls() again.

ls()

[1] "answer" "answer3"

It’s removed, sure enough. If you try to delete an object that doesn’t exist, you will receive a warning message. While using R you will often notice that when you type a command and simply get the command prompt back, it means there was no error.
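The steps in this section can be collected into one short sketch. The function exists() is a base R function that is not mentioned above; it is used here only to confirm that an object has gone.

```r
answer  <- 3 + 2
answer2 <- (5.5 + 2)^2
answer3 <- answer2 / answer      # 56.25 / 5

rm(answer2)                      # remove one object, no confirmation asked
gone <- !exists("answer2")       # TRUE once the object no longer exists
```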

1.7. Functions

R is a programming language rather than only a statistical package, although it is used for carrying out statistical analyses. R comes with a variety of short ready-made pieces of code for tasks such as managing data, performing complex mathematical operations on data, drawing graphs and carrying out statistical analyses ranging from the simple and straightforward to the eye-wateringly complex. These ready-made pieces of code are called functions. The name of each function is followed by a pair of brackets, and for the more straightforward functions all you have to do is type the name of the function and put the name of the object you’d like the procedure carried out on inside the brackets.

The natural log of 15

>log(15)

[1] 2.70805

e raised to the power 5

>exp(5)

[1] 148.4132

Square root of 64

>sqrt (64)

[1] 8

Absolute (i.e. unsigned) value of −5

>abs (-5)

[1] 5

For more complex calculations, the argument of the function (the bit between the brackets) can itself be a calculation:

sin(15+answer)

gives the sine of 15 plus whatever the value of the object “answer” is.

To ensure that complex calculations are carried out in the right order, use brackets within the function’s brackets: exp((x*3)^(1/3))

returns the value of e raised to the power of the cube root of 3 times x.

Functions can also be used to create new objects:

P<- 1/sqrt(y)

creates an object called “P” that has the value of 1 divided by the square root of the value of the object y.

So far we have only discussed functions with a single argument between the brackets. To control the way a function operates, you can add further arguments, separated by commas. These extra arguments may modify the way the function is applied, tell it which part of a dataset to use, or specify how it should deal with missing data points. For example, the function round() rounds a number to a certain number of decimal places. Put the number between the brackets after the function name, and specify how many decimal places to round to by adding a second argument, digits=, using a comma to separate it from the first argument.

>round(19.7564, digits=2)

[1] 19.76

>round(17.4325, digits=1)

[1] 17.4

Most R functions have default values for most of their arguments. If you do not specify the number of digits for round(), R returns the number rounded to no decimal places.

>round(13.7784)

[1] 14

Some other examples:

>logb(15, base=2.5)

[1] 2.955449

Here we have specified to calculate the logarithm of 15 to the base 2.5.

>signif(pi, digits=4)

[1] 3.142

>signif(pi, digits=2)

[1] 3.1

In the above examples the extra argument is named explicitly.
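As a small sketch of how arguments are supplied, the same call can be written with the second argument matched by position or by name, or left out so that its default applies. The object names are illustrative.

```r
by_position <- round(19.7564, 2)            # second argument matched by position
by_name     <- round(19.7564, digits = 2)   # same call with the argument named
default     <- round(19.7564)               # digits falls back to its default, 0
```

Naming arguments makes a call easier to read, especially for functions with many arguments.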

1.8.1. Vectors

A vector represents a sequence of data elements of the same basic type. Members in a vector are officially called components.

R operates on named data structures. The simplest such structure is the numeric vector, a single entity consisting of an ordered collection of numbers. To set up a vector named p, consisting of four numbers, namely 11.5, 6.8, 5.2, and 25.8, use the R command

> p <- c(11.5, 6.8, 5.2, 25.8)

This is an assignment statement using the function c(). In this context c() can take an arbitrary number of vector arguments, and its value is a vector obtained by concatenating its arguments end to end.

Assignment can also be done by using the function assign(). An equivalent way of making the same assignment as above is:

>assign("p", c(11.5, 6.8, 5.2, 25.8))

Assignments can also be made in the other direction, by reversing the assignment operator. The same assignment could be made using

>c(11.5, 6.8, 5.2, 25.8) -> p

If an expression is used as a complete command, its value is printed and lost. So if we now use the command

> 1/p

the reciprocals of the four values are printed at the terminal (and the value of p is, of course, unchanged):

[1] 0.08695652 0.14705882 0.19230769 0.03875969

The further assignment

> y <- c(p, 0, p)

would create a vector y with 9 entries consisting of two copies of p with a zero in the middle place.

[1] 11.5 6.8 5.2 25.8 0.0 11.5 6.8 5.2 25.8
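The three ways of assigning and the concatenation step can be checked in one short sketch. The name p_copy is an assumption introduced only for the comparison.

```r
p <- c(11.5, 6.8, 5.2, 25.8)
assign("p_copy", c(11.5, 6.8, 5.2, 25.8))   # same effect as p_copy <- c(...)
c(11.5, 6.8, 5.2, 25.8) -> p_right          # assignment in the other direction

y <- c(p, 0, p)     # two copies of p with a zero in the middle
length(y)           # 4 + 1 + 4 elements
```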

1.8.2. Vector Arithmetic


Vectors can be used in arithmetic expressions, in which case the operations are performed element by element. Vectors occurring in the same expression need not all be of the same length. If they are not, the value of the expression is a vector with the same length as the longest vector in the expression, and shorter vectors are recycled as often as needed to match it.

> p <- 4.5

> q <- 6.25

> p + q

[1] 10.75

The basic arithmetic operators are +, -, *, / and ^ for raising to a power. In addition, all of the common arithmetic functions are available, such as log, exp, sin, cos, tan, sqrt, and so on. The functions max and min select the largest and smallest elements of a vector, respectively. The value of range is a vector of length two, namely c(min(p), max(p)). length(p) is the number of elements in p, sum(p) gives the total of the elements in p, and prod(p) their product.

The statistical function mean(p) calculates the sample mean, which is the same as sum(p)/length(p), and var(p) gives sum((p-mean(p))^2)/(length(p)-1), the sample variance.

sort(p) returns a vector of the same size as p with the elements arranged in increasing order. Other, more flexible sorting facilities are available (see order() or sort.list(), which produce a permutation to do the sorting).
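The following sketch illustrates recycling and checks the formulas for the mean and the variance against the built-in functions, using the four-element vector from section 1.8.1. The object names are illustrative.

```r
p <- c(11.5, 6.8, 5.2, 25.8)

recycled <- p + c(1, 2)     # c(1, 2) is reused twice: 12.5 8.8 6.2 27.8

manual_mean <- sum(p) / length(p)
manual_var  <- sum((p - mean(p))^2) / (length(p) - 1)

all.equal(manual_mean, mean(p))   # the built-in mean agrees
all.equal(manual_var, var(p))     # and so does the built-in variance

sorted <- sort(p)           # 5.2 6.8 11.5 25.8
limits <- range(p)          # c(min(p), max(p))
```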

In most cases the user will not be concerned whether the “numbers” in a numeric vector are integers, reals or even complex. Internally, calculations are done as double precision real numbers, or as double precision complex numbers if the input data are complex.

To work with complex numbers, supply an explicit complex part. Thus

> sqrt(-17)

[1] NaN

Warning message:

In sqrt(-17) : NaNs produced

gives NaN and a warning, but

> sqrt(-17+0i)

does the computation as complex numbers:

[1] 0+4.123106i

1.8.3. Generating regular sequences

R has facilities for generating commonly used sequences of numbers. For example, 1:20 is the vector c(1, 2, ..., 19, 20). The colon operator has high priority within an expression, so, for example, 2*1:10 is the vector c(2, 4, ..., 18, 20).

> 1:20

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

> 2*1:10

[1] 2 4 6 8 10 12 14 16 18 20

The construction 10:1 generates a sequence backwards.

> 10:1

[1] 10 9 8 7 6 5 4 3 2 1

The function seq() is a more general facility for generating sequences. It has five arguments, only some of which may be specified in any one call. The first two arguments, if given, specify the beginning and end of the sequence; thus seq(1,20) is the same vector as 1:20.

>seq(1,20)

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Arguments can also be given in named form. The first two arguments may be named from=value and to=value; thus seq(1,10), seq(from=1, to=10) and seq(to=10, from=1) are all the same as 1:10.

The next two arguments to seq() may be named by=value and length=value; they specify a step size and a length for the sequence, respectively. If neither is given, the default by=1 is assumed.

For example

> seq(2, 3, by=.2) -> p

> p

[1] 2.0 2.2 2.4 2.6 2.8 3.0

Similarly

> p1 <- seq(length=6, from=2, by=.2)

> p1

[1] 2.0 2.2 2.4 2.6 2.8 3.0

generates the same vector in p1.

The fifth argument may be named along=vector. It is normally used as the only argument, to create the sequence 1, 2, ..., length(vector), or the empty sequence if the vector is empty.

A related function is rep(), which, as the name suggests, replicates an object in various ways. The simplest form is

> p2 <- rep(p, times=3)

> p2

[1] 2.0 2.2 2.4 2.6 2.8 3.0 2.0 2.2 2.4 2.6 2.8 3.0 2.0 2.2 2.4 2.6 2.8 3.0

which will put three copies of p end-to-end in p2. Another useful version is

> p3 <- rep(p, each=3)

> p3

[1] 2.0 2.0 2.0 2.2 2.2 2.2 2.4 2.4 2.4 2.6 2.6 2.6 2.8 2.8 2.8 3.0 3.0 3.0

which repeats each element of p three times before moving on to the next.
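A compact sketch of seq() and rep() follows. It spells out the full argument names length.out and along.with, of which length= and along= above are abbreviations; the object names are illustrative.

```r
by_step   <- seq(2, 3, by = 0.2)                        # start, end and step size
by_length <- seq(from = 2, by = 0.2, length.out = 6)    # start, step and length
index     <- seq(along.with = c("a", "b", "c"))         # 1 2 3

whole <- rep(1:2, times = 3)    # 1 2 1 2 1 2
each  <- rep(1:2, each = 3)     # 1 1 1 2 2 2
```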

1.8.4. Logical Vectors

As well as numerical vectors, R allows manipulation of logical quantities. The elements of a logical vector can have the values TRUE, FALSE, and NA. TRUE and FALSE are often abbreviated as T and F. Note, however, that T and F are just variables set to TRUE and FALSE by default; they are not reserved words and can be overwritten by the user. Hence, you should always use TRUE and FALSE. For example,

> x<-c(1,2,3)

> y<-c(5,6,3)

>x==y

[1] FALSE FALSE TRUE

The comparison operators are <, <=, >, >=, == for exact equality and != for inequality. In addition, if c1 and c2 are logical expressions, then c1 & c2 is their intersection (“and”), c1 | c2 is their union (“or”), and !c1 is the negation of c1.
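The operators can be combined element by element, as in this sketch built on the vectors x and y from the example above. The result names are illustrative.

```r
x <- c(1, 2, 3)
y <- c(5, 6, 3)

both    <- (x < y) & (y > 5)    # FALSE TRUE FALSE
either  <- (x < y) | (y > 5)    # TRUE TRUE FALSE
unequal <- !(x == y)            # TRUE TRUE FALSE
n_equal <- sum(x == y)          # TRUE counts as 1, so this counts matches
```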

1.8.5. Character Vector

Character vectors are widely used in R. They are denoted by a sequence of characters delimited by double quote characters, e.g., "y-values", "Old Calculations".

Character strings are entered using either matching double (") or single (') quotes, but are printed using double quotes (or sometimes without quotes).

Character vectors can be concatenated into a single vector with the c() function.

The paste() function takes an arbitrary number of arguments and concatenates them one by one into character strings. By default the arguments are separated in the result by a single blank character; the argument sep= changes this separator.

>pr<- paste(c("X","Y"), 1:10, sep="")

makes pr into the character vector

>pr

[1] "X1" "Y2" "X3" "Y4" "X5" "Y6" "X7" "Y8" "X9" "Y10"
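A short sketch of paste() with its default separator and with sep="". The collapse argument, which joins all elements into one string, is a further paste() option not discussed above.

```r
default_sep <- paste("Old", "Calculations")      # "Old Calculations"
labels      <- paste(c("X", "Y"), 1:4, sep = "") # "X1" "Y2" "X3" "Y4"
one_string  <- paste(labels, collapse = "+")     # "X1+Y2+X3+Y4"
```

Note how the shorter vector c("X", "Y") is recycled to match the length of 1:4.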

1.9. Matrices and arrays

A matrix is a two-dimensional array of values. In R, the elements of a matrix can be of any one type; for example, a matrix of character strings is possible. Matrices and arrays are simply vectors with dimensions:

> x <- 1:9

> dim(x) <- c(3,3)

> x

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

The dim assignment function sets the dimension attribute of x, causing R to treat the vector of 9 numbers as a 3 × 3 matrix. The storage is column-major; i.e. the elements of the first column are followed by those of the second, etc.

A convenient way to create matrices is to use the matrix function:

>matrix(1:9,nrow=3,byrow=T)

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 4 5 6

[3,] 7 8 9

The byrow=T switch causes the matrix to be filled row-wise rather than column-wise.

The transposition function t (notice the lowercase t as opposed to the uppercase T for TRUE) turns rows into columns and vice versa:

> x <- matrix(1:9,nrow=3,byrow=T)

> rownames(x) <- LETTERS[1:3]

> x

[,1] [,2] [,3]

A 1 2 3

B 4 5 6

C 7 8 9

The transpose of this matrix is:

> p <- t(x)

> p

A B C

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

The character vector LETTERS is a built-in variable that contains the capital letters A–Z.

Vectors can be joined together, column-wise or row-wise, with the cbind and rbind functions.

>cbind(P=1:4,Q=5:8,R=9:12)

P Q R

[1,] 1 5 9

[2,] 2 6 10

[3,] 3 7 11

[4,] 4 8 12

>rbind(P=1:4,Q=5:8,R=9:12)

[,1] [,2] [,3] [,4]

P 1 2 3 4

Q 5 6 7 8

R 9 10 11 12

The operator ‘*’ multiplies two matrices element by element, so both matrices must have the same dimensions. (Matrix multiplication in the linear-algebra sense uses the operator %*% instead.)

> p*x

A B C

[1,] 1 8 21

[2,] 8 25 48

[3,] 21 48 81
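To make the difference between the two kinds of multiplication concrete, here is a minimal sketch with a small 2 × 2 matrix. The matrix m is an illustrative assumption, not taken from the examples above.

```r
m <- matrix(1:4, nrow = 2)   # filled column by column: rows are (1 3) and (2 4)

elementwise <- m * m         # each element squared: rows (1 9) and (4 16)
product     <- m %*% m       # matrix product: rows (7 15) and (10 22)
transposed  <- t(m)          # rows become columns: rows (1 2) and (3 4)
```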

1.10. Factors

Statistical data commonly include categorical variables that indicate some subdivision of the data, such as social class, tumor stage, Tanner stage of puberty, primary diagnosis, etc. Typically these are entered using a numeric code. Such variables should be specified as factors in R.

A factor has a set of levels, say four levels for concreteness. Internally, a four-level factor consists of two items: (a) a vector of integers between 1 and 4 and (b) a character vector of length 4 containing the names of the four levels. Here is an example:

>unique<- c(0,4,1,1,2)

>funique<- factor(unique,levels=0:3)

>levels(funique) <- c("none","more","medium","large")

The first command creates a numeric vector, unique, holding five coded values. To treat it as a categorical variable, we create a factor funique from it with the function factor. This is called with one argument in addition to unique, namely levels=0:3, which specifies that the input coding uses the values 0–3. The value 4 is not among these levels, so it becomes NA. In the final line the level names are changed to the four specified character strings.

>funique

[1] none <NA> more more medium

Levels: none more medium large

>as.numeric(funique)

[1] 1 NA 2 2 3

>levels(funique)

[1] "none" "more" "medium" "large"
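The sketch below builds a factor in a single call and counts how often each level occurs. The data values and level names are assumptions chosen for illustration; labels= is an argument of factor() that names the levels directly, and table() is a base R function not discussed above.

```r
stage  <- c(0, 3, 1, 1, 2)                       # numeric codes
fstage <- factor(stage, levels = 0:3,
                 labels = c("none", "mild", "medium", "severe"))

codes  <- as.integer(fstage)   # internal integer codes: 1 4 2 2 3
counts <- table(fstage)        # number of observations per level
```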

1.11. Lists

A list combines a collection of objects into a larger composite object. A list is constructed with the function list.

For example, consider a set of data, and place the data in two vectors as follows:

> A <- c(7900,7090,2680,5170,6300,

+ 4875,6508,7010,6535,6250,6790)

> B <- c(5990,7270,4880,5290,5849,

+ 4640,5160,6995,7595,6005,5331)

Notice how the input lines are broken and continue on the next line. If you press the Enter key while an expression is syntactically incomplete, R assumes that the expression continues on the next line and changes its normal > prompt to the continuation prompt +. If this happens by mistake, either complete the expression on the next line or press ESC (Windows) or Ctrl-C (Unix). The “Stop” button can also be used under Windows.

To merge these individual vectors into a list:

>Totallist<- list(before=A,after=B)

>Totallist

$before

[1] 7900 7090 2680 5170 6300 4875 6508 7010 6535 6250 6790

$after

[1] 5990 7270 4880 5290 5849 4640 5160 6995 7595 6005 5331

Named elements may be extracted like this:

>Totallist$before

[1] 7900 7090 2680 5170 6300 4875 6508 7010 6535 6250 6790

Many of R’s built-in functions compute more than a single vector of values and return their results in the form of a list.
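List components can be extracted by name with $ or with double square brackets. The sketch below uses only the first three values of each vector from the example above, and the object name totals is illustrative.

```r
before <- c(7900, 7090, 2680)
after  <- c(5990, 7270, 4880)
totals <- list(before = before, after = after)

by_dollar  <- totals$after             # extract a component by name
by_bracket <- totals[["before"]][2]    # second value of the "before" component
parts      <- names(totals)            # names of the components
```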

1.12. Data Frames

A data frame is a two-dimensional array-like structure. Each column holds the values of one variable and each row contains one set of values from each column.

The basic characteristics of a data frame are as follows.

Column names should be non-empty.

Row names should be unique.

The data stored in a data frame can be of numeric, factor or character type.

Each column should contain the same number of data items.

Data frames help in managing tabular data and are a natural way to represent such data sets in R.

A data frame represents a table of data. Columns may differ in type, but all columns must have the same length:

>data.frame(a=c(1,2,3,4,5,6),b=c(1,2,3,4,5))

Error in data.frame(a = c(1, 2, 3, 4, 5, 6), b = c(1, 2, 3, 4, 5)) :

arguments imply differing number of rows: 6, 5

Here is a simple example of a data frame, showing the top travel countries:

> top_travel_countries <- data.frame(

+ country=c("India","Egypt","Norway","Switzerland",

+ "New Zealand"),

+ rank=c(1,2,18,

+ 15,25)

+ )

Here is what this data frame contains:

> top_travel_countries

country rank

1 India 1

2 Egypt 2

3 Norway 18

4 Switzerland 15

5 New Zealand 25

Data frames are implemented as lists with class data.frame:

>typeof(top_travel_countries)

[1] "list"

>class(top_travel_countries)

[1] "data.frame"
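Because a data frame is a list of columns, a column can be extracted with $, and rows can be selected with a logical condition. The following sketch uses a shortened version of the table above; the object names are illustrative.

```r
travel <- data.frame(
  country = c("India", "Egypt", "Norway"),
  rank    = c(1, 2, 18)
)

ranks <- travel$rank                   # one column as a vector
top   <- travel[travel$rank <= 2, ]    # rows whose rank is 1 or 2
shape <- dim(travel)                   # number of rows and columns
```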

1.12.1 Names and Indexing

R objects can also have names, which helps in writing readable code and self-describing objects. For example, here we create a vector holding the integer sequence

1, 2, 3

which by default has no names.

> x <- 1:3

> names(x)

NULL

> names(x) <- c("foo", "bar", "norf")

> x

foo bar norf

1 2 3

> names(x)

[1] "foo" "bar" "norf"
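Once a vector has names, its elements can be indexed by name as well as by position. A minimal sketch with the same vector; the result names are illustrative.

```r
x <- 1:3
names(x) <- c("foo", "bar", "norf")

by_name     <- x["bar"]               # element selected by name
reordered   <- x[c("norf", "foo")]    # several names, in any order
by_position <- unname(x[2])           # same element as "bar", name dropped
```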

1.13. Objects and Classes

1.13.1. Description

The simple generic function mechanism of R can be used for an object-oriented style of programming. Method dispatch takes place based on the class of the first argument to the generic function.

1.13.2. Usage

class(x)

class(x) <- value

unclass(x)

inherits(x, what, which = FALSE)

oldClass(x)

oldClass(x) <- value

1.13.3. Arguments

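x: an R object.

what, value: a character vector naming classes. value can also be NULL.

which: logical affecting the return value. If TRUE, inherits() returns an integer vector of the same length as what, giving the position in class(x) matched by each element (zero for no match); if FALSE, it returns TRUE or FALSE.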
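The sketch below shows these functions together with method dispatch. The class name "account", the list contents and the summary method are hypothetical, introduced only for illustration.

```r
acct <- list(owner = "Ann", balance = 100)
class(acct) <- "account"     # hypothetical class name

# A method for the generic summary(); dispatch uses the class
# of the first argument, so summary(acct) calls summary.account().
summary.account <- function(object, ...) {
  paste(object$owner, "has balance", object$balance)
}

msg        <- summary(acct)
is_account <- inherits(acct, "account")
position   <- inherits(acct, c("other", "account"), which = TRUE)   # 0 1
plain      <- unclass(acct)  # the underlying list with the class removed
```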

Summary

After completing this chapter, you will know how to get started with R. The chapter covers help and documentation related to R, how to customize R, what the R prompt is, and the different data types in R.

The chapter also explains operators, objects and factors, how to create a new function, and what object classes and methods are.

Friday, July 1, 2016