The R language for the Habr statistician. What is an R package: a user guide

10.10.2023 -


The beauty of R is this:

This program is free (distributed under the GPL license),
Many packages have been written for this program to solve a wide range of problems. All of them are also free.
The program is very flexible: the sizes of any vectors and matrices can be changed at the user's request, the data does not have a rigid structure. This property turns out to be extremely useful in the case of forecasting, when the researcher needs to give a forecast for an arbitrary period.

The latter property is especially relevant because other statistical packages (such as SPSS, EViews, Stata) assume that we are only interested in analyzing data with a fixed structure (for example, all data in a working file must have the same periodicity and the same start and end dates).

However, R is not the friendliest program. While working with it, forget about the mouse: almost all important actions are performed from the command line. To make life a little easier, and the program itself a little more welcoming, there is a front-end program called RStudio. It is installed after R itself has been installed. RStudio has many convenient tools and a nice interface, but analysis and forecasting in it are still carried out from the command line.

Let's try to take a look at this wonderful program.

Getting to know RStudio

The RStudio interface looks like this:

The upper right corner of RStudio shows the name of the project (for now it reads “None”, meaning there is no project). If we click on this label and select “New Project”, we will be prompted to create a project. For basic forecasting purposes, just select “New Directory” (a new folder for the project), “Empty Project” (an empty project), and then enter the name of the project and select the directory in which to save it. Use your imagination and try to come up with a name yourself :).

When working with one project, you can always access the data, commands and scripts stored in it.

On the left side of the RStudio window is the console. This is where we will enter various commands. For example, let's write the following:

x <- rnorm(100, 0, 1)

This command generates 100 random values from a normal distribution with zero mean and unit variance, then creates a vector called "x" and writes the resulting 100 values into it. The symbol "<-" is equivalent to "=" and shows which value to assign to the variable on its left. Sometimes it is more convenient to use "->" instead, in which case the variable must be on the right. For example, the following code creates an object "y" that is identical to the object "x":

x -> y

These vectors now appear in the upper right part of the screen, under the “Environment” tab:

Changes in the “Environment” tab

This part of the screen will display all the objects that we save during the session. For example, if we create a matrix like this:

$A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$

with this command:

A <- matrix(c(1, 0, 1, 1), 2, 2)

then it will appear in the “Environment” tab:

Any function we use requires us to assign values to certain parameters. The matrix() function has the following parameters:

data - a vector with the data to be written to the matrix,
nrow - the number of rows in the matrix,
ncol - the number of columns in the matrix,
byrow - a logical parameter. If TRUE, the matrix is filled row by row (from left to right, one row after another). By default, this parameter is set to FALSE, so the matrix is filled column by column.
dimnames - a list with row and column names.

Some of these parameters have default values (for example, byrow = FALSE), while others may be omitted (for example, dimnames).
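As a small illustration of the parameters listed above, here is a sketch that builds the same matrix in several ways. The row and column names ("r1", "c1" and so on) are made up for this example and do not come from the article.

```r
# Positional call: values are matched to data, nrow, ncol in that order.
A1 <- matrix(c(1, 0, 1, 1), 2, 2)

# Named call: the same matrix, with each parameter spelled out.
A2 <- matrix(data = c(1, 0, 1, 1), nrow = 2, ncol = 2)

# byrow = TRUE fills the matrix row by row instead of column by column.
B <- matrix(c(1, 0, 1, 1), nrow = 2, ncol = 2, byrow = TRUE)

# dimnames takes a list: row names first, then column names.
# The names below are hypothetical, chosen only for this example.
C <- matrix(c(1, 0, 1, 1), 2, 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))

print(A1)
print(B)
print(C)
```

Because the default is byrow = FALSE, B is the transpose of A1 here even though both calls receive the same data vector.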

One of the tricks of R is that any function (for example, our matrix()) can be called by setting the parameter values directly, in order, without naming the parameters.

Another option is to click on the object name in the "Environment" tab.

To get help on a function, type a question mark followed by its name in the console:

?matrix

where matrix is the name of the function we are interested in. In this case, RStudio will open the “Help” panel with a description:

You can also find help on a function by typing the name of the function in the “search” window (icon with a lens) in the “Help” tab.

If you don’t remember exactly how to write the name of a function or what parameters are used in it, just start writing its name in the console and press the “Tab” button:

In addition to all this, you can write scripts in RStudio. You may need them if you need to write a program or call a sequence of functions. Scripts are created using the button with a plus sign in the upper left corner (you need to select “R Script” in the drop-down menu). In the window that opens after this, you can write any functions and comments. For example, if we want to plot a line graph over the series x, we can do it like this:

plot(x)

lines(x)

The first function builds a simple scatter plot, and the second adds lines on top of it, connecting the points in sequence. If you select these two commands and press “Ctrl+Enter”, they will be executed, and RStudio will open the “Plots” tab in the lower right corner and display the resulting plot in it.

If we still need all the typed commands in the future, then this script can be saved (floppy disk in the upper left corner).

In case you need to refer to a command that you have already typed sometime in the past, there is a “History” tab at the top right of the screen. In it you can find and select any command you are interested in and double-click to paste it into the console. In the console itself, you can access previous commands using the Up and Down buttons on your keyboard. The “Ctrl+Up” key combination allows you to show a list of all recent commands in the console.

In general, RStudio has a lot of useful keyboard shortcuts that make working with the program much easier. You can read more about them.

As I mentioned earlier, there are many packages for R. Most of them are hosted on the CRAN server, and to install any of them you need to know its name. Packages are installed and updated from the “Packages” tab. By going to it and clicking the “Install” button, we will see something like the following menu:

Let's type "forecast" in the window that opens. This is a package written by Rob J. Hyndman that contains a bunch of functions that are useful for us. Click the “Install” button, after which the “forecast” package will be installed.

Alternatively, we can install any package, knowing its name, using the command in the console:

install.packages("smooth")

provided that it is, of course, in the CRAN repository. smooth is a package in which I develop and maintain functions.

Some packages are only available in source code on sites like github.com and require that they be built first. To build packages under Windows, you may need the Rtools program.

To use any of the installed packages, you need to enable it. To do this, you need to find it in the list and tick it, or use the command in the console:

library(forecast)

One unpleasant problem may appear in Windows: some packages are easily downloaded and assembled, but are not installed in any way. R in this case writes something like: “Warning: unable to move temporary installation...”. All you need to do in this case is to add the folder with R to the exceptions in your antivirus (or turn it off while installing packages).

After loading the package, all the functions included in it are available to us. For example, the function tsdisplay(), which can be used like this:

tsdisplay(x)

It will draw three plots, which we will discuss in the chapter “Forecaster Toolkit”.

Besides the forecast package, I quite often use the Mcomp package for various examples. It contains data series from the M-Competition database, so I recommend that you install it too.

Very often we will need not just data sets, but data of the “ts” class (time series). In order to make a time series from any variable, you need to run the following command:

x <- ts(x, start = c(1984, 1), frequency = 12)

Here the start parameter specifies the date from which our time series begins, and frequency sets the frequency of the data. The number 12 in our example indicates that we are dealing with monthly data. As a result of executing this command, we transform our vector “x” into a time series of monthly data starting from January 1984.
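To see what the conversion actually changes, here is a minimal sketch using only base R. The seed and the names x_ts and first_year are assumptions made for this example.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
x <- rnorm(100, 0, 1)

# Monthly series starting in January 1984.
x_ts <- ts(x, start = c(1984, 1), frequency = 12)

print(start(x_ts))                # first period: year and month
print(end(x_ts))                  # last period: 100 months after the start
print(frequency(x_ts))

# window() cuts out a sub-period by date rather than by index.
first_year <- window(x_ts, start = c(1984, 1), end = c(1984, 12))
print(first_year)
```

Since 100 monthly observations cover eight full years plus four months, the series ends in April 1992.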

Let's talk a little about the programming language called R. Our blog has recently covered areas where you simply need a powerful language for working with statistics and graphs at hand, and R is exactly one of those. A newcomer to programming may find this hard to believe, but today R is already more popular than SQL; it is actively used in commercial organizations, research, and universities.

Without getting into the rules, syntax, and specific uses, let's just look at the basic books and resources that will help you learn R from scratch.

What the R language is, why you need it, and how to use it wisely can be learned from the wonderful talk by Ruslan Kuptsov, which he gave a little less than a year ago as part of GeekWeek-2015.

Books

Now that there is a certain order in your head, you can start reading literature, fortunately there is more than enough of it. Let's start with domestic authors:

Internet resources

Anyone who wants to learn any programming language must visit two resources in search of knowledge: the official website of its developers and the largest online community. Well. Let's not make an exception for R:

But again, being imbued with concern for those who have not yet had time to learn English, but really want to learn R, let’s mention several Russian resources:

In the meantime, let’s complete the picture with a small list of English-language, but no less informative sites:

CRAN is actually a place where you can download the R development environment to your computer. In addition, manuals, examples and other useful reading;

Quick-R - briefly and clearly about statistics, methods of processing them and the R language;

Burns-Stat - about R and its predecessor S with a huge number of examples;

R for Data Science is another book from Garrett Grolemund, translated into an online textbook format;

Awesome R - a selection of the best code from the official website, posted on our beloved GitHub;

Mran - R language from Microsoft;

Tutorial R is another resource with organized information from the official website.

The following topic prompted me to write this article: In search of the ideal post, or the riddle of Habr. The fact is that after becoming familiar with the R language, I look extremely askance at any attempts to calculate something in Excel. But I must admit that I only became acquainted with R a week ago.

Goal: To collect data from your favorite HabraHabr using the R language and carry out, in fact, what the R language was created for, namely: statistical analysis.

So, after reading this topic you will learn:

How can you use R to extract data from Web resources?
How to transform data for later analysis
What resources are highly recommended reading for anyone who wants to get to know R better?

The reader is expected to be independent enough to familiarize himself with the basic constructions of the language. The links at the end of the article are best suited for this.

Preparation

We will need the following resources:

After installation you should see something like this:

In the lower right pane, on the Packages tab, you can find a list of installed packages. We will need to additionally install the following:

RCurl - for working with the network. Anyone who has worked with cURL will immediately understand all the opportunities that open up.
XML - a package for working with the DOM tree of an XML document. We need its functionality for finding elements by XPath.

Click “Install Packages”, select the ones you need, and then select them with a checkmark so that they are loaded into the current environment.

Getting data

To get the DOM object of a document received from the Internet, just run these lines:
library(RCurl)  # provides getURL()
library(XML)    # provides htmlParse() and xpathSApply()

url <- "http://habrahabr.ru/feed/posts/habred/page10/"
cookie <- "My top-secret cookies"
html <- getURL(url, cookie = cookie)
doc <- htmlParse(html)
Please pay attention to the cookies being sent. If you want to repeat the experiment, you will need to substitute your cookies, which your browser receives after logging in to the site. Next, we need to obtain the data we are interested in, namely:

When the post was published
How many views were there?
How many people have added this entry to favorites?
How many clicks on +1 and -1 were there (total)
How many +1 clicks were there?
How many -1 clicks were there?
Current rating
Number of comments

Without going into too much detail, I’ll just give you the code:
# Inner quotes are single so that they do not end the R string
published <- xpathSApply(doc, "//div[@class='published']", xmlValue)
pageviews <- xpathSApply(doc, "//div[@class='pageviews']", xmlValue)
favs <- xpathSApply(doc, "//div[@class='favs_count']", xmlValue)
scoredetailes <- xpathSApply(doc, "//span[@class='score']", xmlGetAttr, "title")
scores <- xpathSApply(doc, "//span[@class='score']", xmlValue)
comments <- xpathSApply(doc, "//span[@class='all']", xmlValue)
hrefs <- xpathSApply(doc, "//a[@class='post_title']", xmlGetAttr, "href")
Here we used xpath search for elements and attributes.
Next, it is highly recommended to create a data.frame from the received data; this is an analogue of a database table. It lets you run queries of varying complexity. Sometimes you are amazed at how elegantly you can do this or that thing in R.
posts<-data.frame(hrefs, published, scoredetailes, scores, pageviews, favs, comments)
After creating the data.frame, you will need to correct the received data: convert the strings into numbers, get the real date in a normal format, and so on. We do it this way:

# Convert the text columns to numbers
posts$comments <- as.numeric(as.character(posts$comments))
posts$scores <- as.numeric(as.character(posts$scores))
posts$favs <- as.numeric(as.character(posts$favs))
posts$pageviews <- as.numeric(as.character(posts$pageviews))
# Replace the Russian month names with numeric dates
posts$published <- sub(" декабря в ", "/12/2012 ", as.character(posts$published))
posts$published <- sub(" ноября в ", "/11/2012 ", posts$published)
posts$published <- sub(" октября в ", "/10/2012 ", posts$published)
posts$published <- sub(" сентября в ", "/09/2012 ", posts$published)
posts$published <- sub("^ ", "", posts$published)
posts$publishedDate <- as.Date(posts$published, format = "%d/%m/%Y %H:%M")
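The double conversion as.numeric(as.character(...)) used above may look redundant. A likely reason for it is that, before R 4.0.0, data.frame() turned strings into factors by default, and calling as.numeric() on a factor returns its internal codes rather than the values. A small sketch with made-up values shows the difference, along with the date conversion:

```r
# A factor, as text columns were stored by default before R 4.0.0.
# The values are made up for this example.
f <- factor(c("10", "25", "7"))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the actual numbers

print(codes)
print(values)

# as.Date() with an explicit format keeps the date and drops the time part.
d <- as.Date("05/12/2012 14:30", format = "%d/%m/%Y %H:%M")
print(d)
```

On a plain character column the extra as.character() does no harm, so the code in the article works either way.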

It is also useful to add additional fields that are calculated from those already received:
scoressplitted <- sapply(strsplit(as.character(posts$scoredetailes), "\\D+", perl = TRUE), unlist)
# Each string starts with a non-digit, so strsplit() returns an empty first
# element followed by the three numbers: 4 rows per post.
if (is.matrix(scoressplitted) && nrow(scoressplitted) == 4) {
  scoressplitted <- t(scoressplitted[2:4, , drop = FALSE])
  posts$actions <- as.numeric(scoressplitted[, 1])
  posts$plusactions <- as.numeric(scoressplitted[, 2])
  posts$minusactions <- as.numeric(scoressplitted[, 3])
}
posts$weekDay <- format(posts$publishedDate, "%A")
Here we have converted the well-known messages of the form “Total 35: ↑29 and ↓6” into an array of data on how many actions were performed, how many pluses there were and how many minuses there were.
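To make the splitting step easier to follow, here is a self-contained sketch on two made-up rating strings. The strings and the name details are assumptions for this example; the real titles on the site may look different.

```r
# Hypothetical rating strings in the shape described in the text.
details <- c("Total 35: +29 and -6", "Total 12: +10 and -2")

# Split on runs of non-digit characters.
parts <- strsplit(details, "\\D+", perl = TRUE)
print(parts[[1]])   # note the empty first element

# Drop the empty strings and collect the numbers into one row per string.
m <- t(sapply(parts, function(p) as.numeric(p[p != ""])))
colnames(m) <- c("actions", "plusactions", "minusactions")
print(m)
```

The empty first element appears because each string begins with text rather than a digit, which is why only the remaining three elements are used.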

At this point, we can say that all data has been received and converted to a format ready for analysis. I formatted the code above as a ready-to-use function. At the end of the article you can find a link to the source.

But the attentive reader has already noticed that this gives us data for only one page, whereas we need it for a whole series. To get data for a whole list of pages, the following function was written:

getPostsForPages <- function(pages, cookie, sleep = 0) {
  urls <- paste("http://habrahabr.ru/feed/posts/habred/page", pages, "/", sep = "")
  ret <- data.frame()
  for (url in urls) {
    ret <- rbind(ret, getPosts(url, cookie))
    Sys.sleep(sleep)  # pause between requests
  }
  return(ret)
}
Here we use the system function Sys.sleep so as not to accidentally cause a habra effect on the habr itself :)
This function is proposed to be used as follows:
posts<-getPostsForPages(10:100, cookie,5)
Thus, we download all pages from 10 to 100 with a pause of 5 seconds. We are not interested in pages before 10, since the ratings are not visible there yet. After a few minutes of waiting, all our data is in the posts variable. I recommend saving it right away so as not to bother Habr every time! This is done this way:
write.csv(posts, file="posts.csv")
And we read it as follows:
posts<-read.csv("posts.csv")
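One detail worth knowing about this round trip: by default write.csv() also writes the row names, which come back from read.csv() as an extra column named X. A minimal sketch on a made-up data frame (posts_demo and its values are assumptions for this example):

```r
# Made-up data, standing in for the real posts table.
posts_demo <- data.frame(scores = c(35, 12), comments = c(4, 0))
path <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an unnamed first column.
write.csv(posts_demo, file = path)
with_rn <- names(read.csv(path))
print(with_rn)

# row.names = FALSE keeps the column set unchanged after the round trip.
write.csv(posts_demo, file = path, row.names = FALSE)
back <- read.csv(path)
print(names(back))
```

Whether the extra column matters depends on the later analysis; dropping it simply keeps the saved file identical in shape to the original table.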

Hooray! We learned how to receive statistical data from Habr and save it locally for the next analysis!
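The page loop used for downloading can be tried out without touching the network. The sketch below relies on the fact that paste() is vectorized, and replaces the real download with a hypothetical stand-in, fetch_stub, which is not part of the article's code. It also collects the pieces with lapply() and do.call(rbind, ...), a common alternative to growing a data frame inside a for loop.

```r
pages <- 10:12

# paste() is vectorized: one URL is produced per page number.
urls <- paste("http://habrahabr.ru/feed/posts/habred/page", pages, "/", sep = "")
print(urls)

# Hypothetical stand-in for the real download function; it performs no request.
fetch_stub <- function(url) {
  data.frame(url = url, scores = 0)
}

# Fetch page by page with a pause, then bind all pieces together at once.
chunks <- lapply(urls, function(u) {
  Sys.sleep(0)   # use a positive value, such as 5, for real requests
  fetch_stub(u)
})
all_posts <- do.call(rbind, chunks)
print(all_posts)
```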

Data analysis

I will leave this section unfinished. I invite the reader to play with the data himself and draw his own far-reaching conclusions. For example, try to analyze how the mood of upvoters and downvoters depends on the day of the week. I will give only 2 interesting conclusions that I made.

Habr users are much more willing to upvote than downvote.

This can be seen from the following graph. Notice how much more uniform and wider the “cloud” of minuses is than the spread of pluses. The correlation between pluses and the number of views is much stronger than for minuses. In other words: we upvote without thinking, but we downvote for a reason!
(I apologize for the inscriptions on the graphs: I have not yet figured out how to display them correctly in Russian)

There are indeed several classes of posts

This statement was used as a given in the mentioned post, but I wanted to make sure of it in reality. To do this, it is enough to calculate the average share of pluses to the total number of actions, the same for minuses, and divide the second by the first. If everything were homogeneous, then we should not observe many local peaks in the histogram, but they are there.

As you can see, there are pronounced peaks around 0.1, 0.2 and 0.25. I invite the reader to find and “name” these classes himself.
I would like to note that R is rich in algorithms for data clustering, approximation, hypothesis testing, etc.
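One possible reading of the calculation described above is to compute the shares per post and then look at the distribution of their ratio. The sketch below follows that reading on made-up numbers, which are not real Habr data; the column names match the ones created earlier in the article.

```r
# Made-up counts purely for illustration.
demo <- data.frame(plusactions  = c(29, 10, 40, 8),
                   minusactions = c(6, 2, 4, 2))
demo$actions <- demo$plusactions + demo$minusactions

plus_share  <- demo$plusactions / demo$actions
minus_share <- demo$minusactions / demo$actions

# Share of minuses divided by share of pluses, per post.
ratio <- minus_share / plus_share
print(ratio)

# plot = FALSE returns the bin counts without drawing anything.
h <- hist(ratio, breaks = seq(0, 0.3, by = 0.05), plot = FALSE)
print(h$counts)
```

Note that the total number of actions cancels out, so the ratio is simply minuses divided by pluses. On the real data, calling hist(ratio) without plot = FALSE would draw the histogram.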

Useful Resources

If you really want to dive into the world of R, I recommend the following links. Please share your interesting blogs and sites on the topic of R in the comments. Is there anyone who writes about R in Russian?

Programming in R. Level 1. Basics

The R language is the world's most popular statistical data analysis tool. It contains a wide range of capabilities for analyzing data, visualizing it, and creating documents and web applications. Want to master this powerful language with the guidance of an experienced mentor? We invite you to the course "Programming in the R language. Level 1. Basic knowledge".

This course is intended for a wide range of specialists who need to look for patterns in large amounts of data, visualize them and build statistically correct conclusions: sociologists, clinical trial managers/pharmacologists, researchers (astronomy, physics, biology, genetics, medicine, etc.) , IT analysts, business analysts, financial analysts, marketers. The course will also appeal to specialists who are not comfortable with the functionality (or fees) / .

During the classes you will gain basic skills in data analysis and visualization in the R environment. Most of the time is devoted to practical tasks and working with real data sets. You will get to know new tools for working with data and learn how to apply them in your work.

After the course, a certificate of advanced training of the center is issued.

What is an R package?

An R package is an extension created to solve a specific problem in R. Packages without which it would be difficult to imagine working in R are included in the base distribution and are automatically available after installing R on your computer (the so-called R core). For example, the stats package allows you to conduct statistical tests, and thanks to the graphics package you can build graphs in R. However, most packages have highly specialized applications, and to work with them you need to “extend” your R library by installing the necessary package on your computer.

From a technical point of view, an R package is a collection of functions, data and documentation for them, collected into a single whole according to a standard scheme. Each package must be tested for errors and against the standards of the official R package archive (CRAN). If any discrepancies are detected, the package will not be accepted into CRAN. Thanks to this approach, the principles of working with any R package are the same, which makes them simple and easy to use. By the fall of 2018, the number of packages in CRAN exceeded !!!

How to install and load a package in R?

There are several ways to install an R package. Let's start with the most common case: installing a package from CRAN. To do this, simply enter the install.packages function into the console, and in the arguments write the name of the package you are looking for (for example, take the ggplot2 package):

install.packages("ggplot2")

In the window that opens with a list of countries, select any mirror for downloading. The installation of the package into your library will begin automatically. Sometimes you may notice that several packages are installed instead of one. This happens because a package often uses functions or data from other packages, without which it cannot fully work. So a package with dependencies “pulls in” other packages and is installed in the library along with them.

After installing the package, you need to load it into your current session using the library() function:

library("ggplot2")

If this is not done, the functions of the installed package will not work. This is explained by the fact that when R starts, only the basic packages (which we wrote about above) are automatically loaded into it, while the rest must be loaded manually.
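The difference between an installed package and a loaded one can be checked directly from the console. The sketch below uses only base R; it assumes nothing about whether ggplot2 is installed on your machine and installs nothing.

```r
# Packages attached to the current session; the base ones are there from the start.
attached <- search()
print(attached)

base_attached <- c("package:stats", "package:graphics") %in% attached

# requireNamespace() reports whether a package is installed and can be loaded,
# without attaching it to the search path.
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
print(has_ggplot2)

# Until library("ggplot2") is called, the package is not on the search path.
is_attached <- "package:ggplot2" %in% search()
print(is_attached)
```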

Installing an R package via GitHub

However, not all R packages are available on CRAN. Many researchers collectively work on R packages on the GitHub platform, where they share ideas, report bugs, fix them, and notify other users online. Most often, packages are published on GitHub that are still in the development/testing stage, so their stability is not guaranteed. In order to download an R package from GitHub, you must first install the "devtools" package, load it into the R environment and use the install_github() function, where we indicate in the arguments the name of the main developer of the package and, separated by a slash, the name of the package:

install.packages("devtools")
library("devtools")
install_github("Author/PackageName")

Install the R package manually (tar.gz or zip archive)

Some packages are located on other platforms (for example, ResearchGate), on the sites of research groups, or on the developer's personal website, from where you can download the R package to your computer as an archive with a .tar.gz or .zip extension. In this case, you should download the package manually using the same install.packages() command. However, in the first argument of the function you must specify not the name of the package, but the address of the downloaded archive, and also enter additional arguments:

install.packages("Desktop/PackageName.tar.gz", repos = NULL, type="source")

Reading documentation is the key to working with R packages!

Documentation is the most important element of user interaction with the R package. It could be in the form of a website post, educational video, scientific publication, or reference guide. The first three options allow you to clearly demonstrate the idea and capabilities of the package. It is with them that I recommend starting to get acquainted with a package unknown to you (if they are available on the Internet).

The reference manual, on the contrary, is a technical description of the R package, its functions and data. Unlike other types of documentation, any package available in CRAN has a reference manual. It is written in a specific format and is synchronized with the function code. As a result, you can look up help information using help commands in the R environment. For example, to see the description of the installed package ggplot2, simply enter a question mark followed by the package name:

?ggplot2

We now have all the available information about the ggplot2 package. In the same way, you can look at the documentation of a specific function: after the package name, put a double colon and the name of the function you are looking for (for example, the stat_ellipse function):

?ggplot2::stat_ellipse
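The question-mark form is shorthand for the help() function, which can be convenient inside scripts. The sketch below uses rnorm from the stats package instead of ggplot2, only because stats ships with every R installation; the variable name h is arbitrary.

```r
# Equivalent to typing ?stats::rnorm in the console.
h <- help("rnorm", package = "stats")

# The result is an object that points to the help page;
# printing it in an interactive session opens the page.
print(class(h))
```

For a package from CRAN the call has the same shape, for example help("stat_ellipse", package = "ggplot2"), provided that the package is installed.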

All the same can be found in PDF format on the official CRAN website (for example, the ggplot2 package reference manual). The first page contains a description of the R package, then a list of its functions and data tables, then a detailed technical description of each of them in alphabetical order.

Conclusion

After reading the documentation, you can safely use the R package for your own purposes. I can’t give universal instructions here, because we all have different tasks and accordingly use different packages. Therefore, if you have any difficulties or questions, write them in the comments, and I will be happy to answer.

And in the next article we will assemble the R package with our own hands!