Sections in this Chapter:

## Write & run scripts

In the first Chapter, we are coding within the interactive mode of R/Python. When we are working on the real world projects, using an Integrated development environment (IDE) is a more pragmatic choice. There are not many choices for R IDE, and among them RStudio1 is the best one I have used so far. As for Python, I would recommend either Visual Studio Code2 or PyCharm3. But of course, you could use any text editor to write R/Python scripts.

Let’s write a simple script to print Hello World! in R/Python. I have made a directory chapter2 on my disk, the R script is saved as hello_world.R and the Python script is saved as hello_world.py, inside the directory.

R

chapter2/hello_world.R

Python

chapter2/hello_world.py

There are a few ways to run the R script. For example, we can run the script from the console with the r -f filename command. Also, we can open the R interactive session and use the source() function. I would recommend the second approach with source() function. As for the Python script, we can run it from the console.

R
Python

## Debugging

Debugging is one of the most important aspects of programming. What is debugging in programming? The programs we write might include errors/bugs and debugging is a step-by-step process to find and remove the errors/bugs in order to get the desired results.

If you are smart enough or the bugs are evident enough then you can debug the program on your mind without using a computer at all. But in general we need some tools/techniques to help us with debugging.

### Print

Most of programming languages provide the functionality of printing, which is a natural way of debugging. By trying to place print statements at different positions we may finally catch the bugs. When I use print to debug, it’s feeling like playing the game of minesweeper. In Python, there is a module called logging4 which could be used for debugging like the print function, but in a more elegant fashion.

#### Browser in R and pdb in Python

In R, there is a function browser() which interrupts the execution and allows the inspection of the current environment. Similarly, there is a module pdb in Python that provides more debugging features. We would only focus on the basic usages of browser() and the set_trace() function in pdb module.
The essential difference between debugging using print() and browser() and set_trace() is that the latter functions allows us to debug in an interactive mode.

Let’s write a function which takes a sorted vector/list v and a target value x as input and returns the leftmost index pos of the sorted vector/list so that v[pos]>=x. Since v is already sorted, we may simply loop through it from left to right to find pos.

R

chapter2/find_pos.R

Python

chapter2/find_pos.py

Now let’s run these two scripts.

R
Python

When x=11, the function returns NULL in R and None in Python because there is no such element in v larger than x.
The implementation above is trivial, but not efficient. If you have some background in data structures and algorithms, you probably know this question can be solved by binary search. The essential idea of binary search comes from divide-and-conquer5. Since v is already sorted, we may divide it into two partitions by cutting it from the middle, and then we get the left partition and the right partition. v is sorted implies that both the left partition and the right partition are also sorted. If the target value x is larger than the rightmost element in the left partition, we can just discard the left partition and search x within the right partition. Otherwise, we can discard the right partition and search x within the left partition. Once we have determined which partition to search, we may apply the idea recursively so that in each step we reduce the size of v by half. If the length of v is denoted as n, in terms of big O notation6, the run time complexity of binary search is $\mathcal{O}(\log{}n)$, compared with $\mathcal{O}(n)$ of the for-loop implementation.

The code below implements the binary search solution to our question (It is more intuitive to do it with recursion but here I write it with iteration since tail recursion optimization7 in R/Python is not supported).

R

chapter2/find_binary_search_buggy.R

Python

chapter2/find_binary_search_buggy.py

Now let’s run these two binary search scripts.

R
Python

The binary search solutions don’t work as expected when x=11. We write two new scripts.

R

chapter2/find_binary_search_buggy_debug.R

Python

chapter2/find_binary_search_buggy_debug.py

Let’s try to debug the programs with the help of browser() and set_trace().

R

In the R code snippet above, we placed the browser() function on the top of the function binary_search_buggy. Then when we call the function we enter into the debugging environment. By calling ls() we see all variables in the current debugging scope, i.e., v,x. Typing n will evaluate the next statement. After typing n a few times, we finally exit from the while loop because start = 4 such that start < end is FALSE. As a result, the function just returns the value of start, i.e., 4. To exit from the debugging environment, we can type Q; to continue the execution we can type c.

The root cause is that we didn’t deal with the corner case when the target value x is larger than the last/largest element in v correctly.

Let’s debug the Python function using pdb module.

Python

Similar to R, command n would evaluate the next statement in pdb. Typing command l would show the current line of current execution. Command b line_number would set the corresponding line as a break point; and c would continue the execution until the next breakpoint (if exists).

In R, besides the browser() function there are a pair of functions debug() and undebug() which are also very handy when we try to debug a function; especially when the function is wrapped in a package. More specifically, the debug function would invoke the debugging environment whenever we call the function to debug. See the example below how we invoke the debugging environment for the sd function (standard deviation calculation).

R

The binary_search solutions are fixed below.

R

chapter2/find_binary_search.R

Python

chapter2/find_binary_search.py

## Benchmarking

By benchmarking, I mean measuring the entire operation time of a piece of program. There is another term called profiling which is related to benchmarking. But profiling is more complex since it commonly aims at understanding the behavior of the program and optimizing the program in terms of time elapsed during the operation.

In R, I like using the microbenchmark package. And in Python, timeit module is a good tool to use when we want to benchmark a small bits of Python code.

As mentioned before, the run time complexity of binary search is better than that of a for-loop search. We can do benchmarking to compare the two algorithms.

R

chapter2/benchmark.R

In the R code above, times=1000 means we want to call the function 1000 times in the benchmarking process. The sample() function is used to draw samples from a set of elements. Specifically, we pass the argument 1 to sample() to draw a single element. It’s the first time we use set.seed() function in this book. In R/Python, we draw random numbers based on the pseudorandom number generator (PRNG) algorithm8. The sequence of numbers generated by PRNG is completed determined by an initial value, i.e., the seed. Whenever a program involves the usage of PRNG, it is better to set the seed in order to get replicable results (see the example below).

R
R

Now let’s run the R script to see the benchmarking result.

R

The binary_search solution is much more efficient based on the benchmarking result.
Doing the same benchmarking in Python is a bit of complicated.

Python

chapter2/benchmark.py

The most interesting part of the Python code above is from __main__ import. Let’s ignore it for now, and we would revisit it later.

Below is the benchmarking result in Python (the unit is second).

Python

## Vectorization

In parallel computing, automatic vectorization9 means a program in a scalar implementation is converted to a vector implementation which process multiple pairs of operands simultaneously by compilers that feature auto-vectorization. For example, let’s calculate the element-wise sum of two arrays x and y of the same length in C programming language.

C

The C code above might be vectorized by the compiler so that the actual number of iterations performed could be less than 4. If 4 pairs of operands are processed at once, there would be only 1 iteration. Automatic vectorization may make the program runs much faster in some languages like C. However, when we talk about vectorization in R/Python, it is different from automatic vectorization. Vectorization in R/Python usually refers to the human effort paid to avoid for-loops. First, let’s see some examples of how for-loops may slow your programs in R/Python.

R

chapter2/vectorization_1.R

Running the R code results in the following result on my local machine.

R Python

Please note that in this Python example we are using the random submodule of numpy module instead of the built-in random module since random module doesn’t provide the vectorized version of random number generation function. Running the Python code results in the following result on my local machine.

Also, the timeit.timeit measures the total time to run the main statements number times, but not the average time per run.

Python

In either R or Python, the vectorized version of random normal random variable (r.v.) is significantly faster than the scalar version. It is worth noting the usage of the print(f'') statement in the Python code, which is different from the way how we print the object of Complex class in \autoref{ch:chp1. In the code above, we use the f-string10 which is a literal string prefixed with 'f' containing expressions inside \{\ which would be replaced with their values. f-string was a feature introduced since Python 3.6. If you are familiar with Scala, you may find that this feature is quite similar with the string interpolation mechanism introduced since Scala 2.10.

It’s also worth noting that lots of built-in functions in R are already vectorized, such as the basic arithmetic operators, comparison operators, ifelse(), element-wise logical operators &,|. But the logical operators &&, || are not vectorized.

In addition to vectorization, there are also some built-in functions which may help to avoid the usages of for-loops. For example, in R we might be able use the apply family of functions to replace for-loops; and in Python the map() function can also be useful. In the Python pandas module, there are also many usages of map/apply methods. But in general the usage of apply/map functions has little or nothing to do with performance improvement. However, appropriate usages of such functions may help with the readability of the program. Compared with the apply family of functions in R, I think the do.call() function is more useful in practice. We would spend some time in do.call() later.

Application – Biham–Middleton–Levine (BML) traffic model

Considering the importance of vectorization in scientific programming, let’s try to get more familiar with vectorization thorough the Biham–Middleton–Levine (BML) traffic model11. The BML model is very important in modern studies of traffic flow since it exhibits a sharp phase transition from free flowing status to a fully jammed status. A simplified BML model could be characterized as follows:

• Initialized on a 2-D lattice, each site of which is either empty or occupied by a colored particle (blue or red);
• Particles are distributed randomly through the initialization according to a uniform distribution; the two colors of particles are equally distributed.
• On even time steps, all blue particles attempt to move one site up and an attempt fails if the site to occupy is not empty;
• On Odd time steps, all red particles attempt to move one site right and an attempt fails if the site to occupy is not empty;
• The lattice is assumed periodic which means when a particle moves out of the lattice, it would move into the lattice from the opposite side.

The BML model specified above is implemented in both R/Python as follows to illustrate the usage of vectorization.

R

chapter2/BML.R

Now we can create a simple BML system on a $5\times5$ lattice using the R code above.

R

In the initialization step, we used the inverse transform sampling approach12 to generate the status of each site. Inverse transform sampling method is basic but powerful approach to generate r.v. from any probability distribution given its cumulative distribution function (CDF). Reading the wiki page is enough to master this sampling method.

Python

The Python implementation is also given.

R

Please note that although we have imported numpy in BML.py, we import it again in the code above in order to set the random seed. If we change the line to from BML import *, we don’t need to import numpy again. But it is not recommended to import * from a module.

## Embarrassingly parallelism in R/Python

According to the explanation of wikipedia13, single-threading is the processing of one command at a time, and its opposite is multithreading. A process is the instance of a computer program executed by one or many threads14. Multithreading is not the same as parallelism. In a single processor environment, the processor can switch execution between threads, which results in concurrent execution. However, it is possible a process with multithreads runs on on a multiprocessor environment and each of the threads on a separate processor, which results in parallel execution.

Both R and Python are single-threaded. In Python, there is a threading package15, which supports multithreading on a single core. It may suit some specific tasks. For example, in web scraping multithreading on a single core may be able to speed up the program if the download time is in excess of the CPU processing time.

Now let’s talk about embarrassingly parallelism by multi-processing. Embarrassingly parallel problem is one where little or no effort is needed to separate the problem into a number of parallel tasks16. In R there are various packages supporting multi-processing on multiple CPU cores, for example, the parallel package, which is my favorite one. In Python, there are also some available modules, such as multiprocessing, joblib and concurrent.futures. Let’s see an application of embarrassingly parallelism to calculate $\pi$ using Monte Carlo simulation17.

Application – Monte Carlo simulation to estimate $\pi$ via parallelization

Monte Carlo simulation provides a simple and straightforward way to estimate $\pi$. We know the area of a circle with radius 1 is just $\pi$. Thus, we can convert the original problem of $\pi$ calculation to a new problem, i.e., how to calculate the area of a circle with radius 1. We also know the area of a square with side length 2 is equal to 4. Thus, $\pi$ can be calculated as 4$r_{c/s}$ where $r_{c/s}$ denotes the ratio of areas of a circle with radius 1 and a square with side length 2. Now the problem is how to calculate the ratio $r_{c/s}$? When we randomly throw $n$ points into the square and $m$ of these pionts fall into the inscribed circle, then we can estimate the ratio as $m/n$. As a result, a natural estimate of $\pi$ is $4m/n$. This problem is an embarrassingly parallel problem by its nature. Let’s see how we implement the idea in R/Python.

R

chapter2/pi.R

In the above R code snippet, we use the function mclapply which is not currently available on some operation systems18. When it is not available, we may consider to use parLapply instead.

Python

chapter2/pi.py

In the above Python code snippet, we defined a function generate_points_parallel which returns a list of number of inside points. But the problem of this function is that each process in the pool shares the same random state. As a result, we will obtain the a list of duplicated numbers by calling this function. To overcome the issue, we defined another function generate_points_parallel_set_seed. Let’s see the results of our estimation for $\pi$ running on a laptop.

R Python

We see the winner in this example is vectorization, and the parallel solution is better than the naive solution. However, when the problem cannot be vectorized we may use parallelization to achieve better performance.

## Scope of variables

We have seen how to define variables in R/Python in Chapter 1, and we have known that a variable is an identifier to a location in memory. What is the scope of a variable and why does it matter? Let’s first have a look at the code snippets below.

R
Python

The results of the code above seem strange before knowing the concept of variable scope. Inside a function, a variable may refer to a function argument/parameter or it could be formally declared inside the function which is called a local variable. But in the code above, x is neither a function argument nor a local variable. How does the print() function know where the identifier x points to?

The scope of a variable determines where the variable is available/accessible (can be referenced). Both R and Python apply lexical/static scoping for variables, which set the scope of a variable based on the structure of the program. In static scoping, when an ‘unknown’ variable referenced, the function will try to find it from the most closely enclosing block. That explains how the print() function could find the variable x.

In the R code above, x=x+1 the first x is a local variable created by the = operator; the second x is referenced inside the function so the static scoping rule applies. As a result, a local variable x which is equal to 2 is created, which is independent with the x outside of the function var_func_2(). However, in Python when a variable is assigned a value in a statement the variable would treated as a local variable and that explains the UnboundLocalError.

Is it possible to change a variable inside a function which is declared outside the function? Based on the static scoping rule only, it’s impossible. But there are workarounds in both R/Python. In R, we need the help of environment; and in Python we can use the keyword global.

So what is an environment in R? An environment is a place where objects are stored. When we invoke the interactive R session, an environment named as .GlobalEnv is created automatically. We can also use the function environment() to get the present environment. The ls() function can take an environment as the argument to list all objects inside the environment.

R

The above code shows that each function has its own environment containing all function arguments and local variables declared inside the function. In order to change a variable declared outside of a function, we need the access of the environment enclosing the variable to change. There is a function parent_env(e) that returns the parent environment of the given environment e in R. Using this function, we are able to change the value of x declared in .GlobalEnv inside a function which is also declared in .GlobalEnv. The global keyword in Python works in a totally different way, which is simple but less flexible.

R Python

I seldomly use the global keyword in Python, if any. But the environment in R could be very handy in some occasions. In R, environment could be used as a purely mutable version of the list data structure.

R
Python

Actually, the object of an R6 class type is also an environment.

R

In Python, we can assign values to multiple variables in one line.

R
Python

Even though in the left snippet above there aren’t parentheses embracing 1,2 after the = operator, a tuple is created first and then the tuple is unpacked and assigned to x, y. Such mechanism doesn’t exist in R, but we can define our own multiple assignment operator with the help of environment.

R

chapter2/multi_assignment.R

Before going into the script deeper, first let’s see the usage of the multiple assignment operator we defined.

R

In the \%=\% operator defined above, we used two functions substitute, deparse which are very powerful but less known by R novices. To better understand these functions as well as some other less known R functions, the Rchaeology19 tutorial is worth reading.

It is also interesting to see that we defined the function recursive_assign inside the \%=\% function. Both R and Python support the concept of first class functions. More specifically, a function in R/Python is an object, which can be

• stored as a variable;
• passed as a function argument;
• returned from a function.

The essential idea behind the recursive_assign function is a depth-first search (DFS), which is a fundamental graph traversing algorithm20. In the context of the recursive_assign function, we use DFS to traverse the parse tree of the left argument created by calling substitute(left).

## Evaluation strategy

R is a lazy programming language since it uses lazy evaluation by default21. Lazy evaluation strategy delays the evaluation of an expression until its value is needed. When an expression in R is evaluated, it follows an outermost reduction order in general. But Python is not a lazy programming language and the expression in Python follows an innermost reduction order.

The code snippets below illustrate the innermost reduction vs. outermost reduction.

R
Python

Because of the outermost reduction order, the R code snippet evaluate the function power first and since if second argument is 0 the first argument is not required to evaluate. Thus, the function call returns 1. But the Python code snippet first evaluates the function call of divide and an exception is raised because of division by zero.

Although Python is an eager language, it is possible to simulate the lazy evaluation behavior. For example, the code inside a generator22 is evaluated when the generator is consumed but not evaluated when it is created. We can create a sequence of Fibonacci numbers with the help of generator.

Python

In the code snippet above, we create a generator which generates the sequence of Fibonacci numbers less than 10. When the generator is created, the sequence is not generated immediately. When we consume the generator in a loop, each element is then generated as needed. It is also common to consume the generator with the next function.

Python

## Speed up with C/C++ in R/Python

The purpose of this section is not to teach C/C++. If you know how to write C/C++ then it may help to improve the performance of R/Python program.

It is recommended to use vectorization technique to speed up your program whenever possible. But what if it’s infeasible to vectorize the algorithm? One potential option is to use C/C++ in the R/Python code. According to my limited experience, it is much easier to use C/C++ in R with the help of Rcpp package than in Python. In Python, more options are available but not as straightforward as the solution of Rcpp. In this section, let’s use Cython23. Actually, Cython itself is a programming language written in Python and C. In Rcpp we can write C++ code directly and use it in native R code. With the help of Cython, we can write python-like code which is able to be compiled to C or C++ and automatically wrapped into python-importable modules. Cython could also wrap independent C or C++ code into python-importable modules.

Let’s see how to calculate Fibonacci numbers with these tools. The first step is to install Rcpp package in R and the Cython module (use pip) in Python. Once they are installed, we can write the code in C++ directly for R. As for Cython, we write the python-like code which would be compiled to C/C++ later.

C

chapter2/Fibonacci.cpp

Cython

chapter2/Fibonacci.pyx

It’s worth noting the extension of the Cython file above is pyx, not py. And in Fibonacci.pyx, the function we defined is quite following the native Python syntax. But in Cython we can add static typing declarations which are often helpful to improve the performance, although not mandatory. The keyword cdef makes the function Fibonacci_c invisible to Python and thus we have to define the Fibonacci_static function as a wrapper which can be imported into Python. The function Fibonacci is not static typed for benchmarking.

Let’s also implement the same functions in native R/Python for benchmarking purpose.

R

chapter2/Fibonacci.R

Python

chapter2/Fibonacci.py

Now let’s compare the performance of different implementations.

R

The native implementation in R is much slower the C++ implementation (see the difference of units).

Python

The results show the static typed implementation is the fastest, and the native implementation in pure Python is the slowest. Again the time measured by timeit.timeit is the total time to run the main statement 1000 times. The average time per run of Fibonacci_static function is close to the average time per run of Fibonacci in R.

## A first impression of functional programming

Functional (programming) languages define programs as mathematical functions and treat them as first-class24. In pure functional programming, a function takes input and returns output without making any side effect. A side effect means a state variable outside the function is modified. For example, printing the value of a variable inside a function is a side effect.

Both R and Python are not purely functional languages per se, although R behaves more functional than Python25. If you wonder why the return function could be ignored to return a value in an R function, now you get the explanation – a function should return the output automatically from a FP perspective.

I don’t think it makes any sense to debate on if we should choose OOP or FP using R/Python. And both languages support multiple programming paradigms (FP, OOP etc.). If you want to do purist FP, neither R nor Python would be the best choice. For example, purist FP should avoid loop because loop always involves an iteration counter whose value changes over iterations, which is obviously a side effect. Thus, recursion is very important in FP. But tail recursion optimization is not supported in R and Python although it can be implemented via trampoline26.

Let’s introduce a few tools that could help to get a first impression of FP in R and Python.

### Anonymous function

We have seen anonymous functions before. Anonymous functions are commonly used in FP. To recap, let’s see how to define a useless anonymous function.

R
Python

Usually, we define anonymous functions because we want to pass them to other functions. And thus, the functions defined above are useless as they are not passed to any other function and after the definition they are deleted automatically from the memory.

Of course, we can assign names to the anonymous functions – but what is the point to name anonymous functions?

R
Python

### Map

Map could be used to avoid loops. The first argument to map is a function.

In R, we use Map which is shipped with base R; and in Python we use map which is also a built-in function.

R
Python

There are a few things to note from the code snippets above. First, the returned value from Map in R is a list rather than a vector. That is because Map is just a wrapper of mapply function in R, and the SIMPLIFY argument is set to FALSE. If we want to get a vector value returned, just use mapply instead.

R

The returned value from map in Python is a map object which is an iterator. The generator function we have seen before is also an iterator. To use an iterator, we could convert it to a list, or get the next value using the next function, or just use a for loop.

Python

### Filter

Similar to map, the first argument to Filter in R and filter in Python is also a function. This function is used to get the elements which satisfy a certain condition specified in the first argument. For example, we can use it to get the even numbers.

R
Python

### Reduce

Reduce behaves a bit different from Map and Filter. The first argument is a function with exactly two arguments. The second argument is a sequence, on each element of which the first function argument will be applied from left to right. And there is one optional argument called initializer. When the initializer is provided, it would be used as the first argument of the function argument of Reduce. The examples below depict how it works.

R
Python

Please note that in order to use reduce in Python, we have to import it from the functools module.

Utilizing these functions flexibly may help to make the code more concise, at the cost of readability.

## Miscellaneous

We have introduced the basics of R/Python programming so far. There are much more to learn to become an advanced user of R/Python. For example, the appropriate usages of iterator, generator, decorator could improve both the conciseness and readability of your Python code. The generator27 is commonly seen in machine learning programs to prepare training/testing samples. decorator is a kind of syntactic sugar to allow the modification of a function’s behavior in a simple way. In R there are no built-in iterator, generator, decorator, but you may find some third-party libraries to mimic these features; or you may try to implement your own.

One advantage of Python over R is that there are some built-in modules containing high-performance data structures or commonly-used algorithms implemented efficiently. For example, I enjoy using the deque structure in the Python collections module28, but there is no built-in counterpart in R. We have written our own binary search algorithm earlier in this Chapter, which can also be replaced by the functions in the built-in module bisect29 in Python.

Another important aspect of programming is testing. Unit testing is a typical software testing method that is commonly adopted in practice. In R there are two third-party packages testthat and RUnit. In Python, the built-in unittest is quite powerful. Also, the third-party module pytest30 is very popular.

1 https://www.rstudio.com/products/rstudio

2 https://code.visualstudio.com/

3 https://www.jetbrains.com/pycharm

4 https://docs.python.org/3/library/logging.html

5 https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm

6 https://en.wikipedia.org/wiki/Big_O_notation

7 https://en.wikipedia.org/wiki/Tail_call

8 https://en.wikipedia.org/wiki/Pseudorandom_number_generator

9 https://en.wikipedia.org/wiki/Automatic_vectorization

10 https://www.python.org/dev/peps/pep-0498

11 https://en.wikipedia.org/wiki/Biham-Middleton-Levine_traffic_model

12 https://en.wikipedia.org/wiki/Inverse_transform_sampling

14 https://en.wikipedia.org/wiki/Process_(computing)

16 https://en.wikipedia.org/wiki/Embarrassingly_parallel

17 https://en.wikipedia.org/wiki/Monte_Carlo_method

18 https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf

19 https://cran.r-project.org/web/packages/rockchalk/vignettes/Rchaeology.pdf

20 https://en.wikipedia.org/wiki/Depth-first_search

21 Floréal Morandat, Brandon Hill, Leo Osvald, and Jan Vitek. Evaluating the design of the r language. In European Conference on Object-Oriented Programming, pages 104–131. Springer, 2012.

22 https://www.python.org/dev/peps/pep-0255

23 https://en.wikipedia.org/wiki/Cython

24 https://en.wikipedia.org/wiki/List_of_programming_languages_by_type#Functional_languages

25 John M Chambers et al. Object-oriented programming, functional programming and r. Statistical Science, 29(2):167–180, 2014.

26 https://en.wikipedia.org/wiki/Tail_call#Through_trampolining

27 https://docs.python.org/3/howto/functional.html

28 https://docs.python.org/3/library/collections.html

29 https://docs.python.org/3.7/library/bisect.html

30 https://docs.pytest.org/en/latest