Sections in this Chapter:

Upon the feedbacks and requests from readers of the first few chapters, I made a decision to spend a whole chapter in the introduction of data.table and pandas. Of course, there are many other great data analysis tools in both R and Python. For example, many R users like using dplyr to build up data analysis pipelines. The performance of data.table is superior and that is the main reason I feel there is no need to use other tools for the same tasks in R. But if you are a big fan for the pipe operator \%>\% you may use data.table and dplyr together. Regarding the big data ecosystem, Apache Spark has API in both R and Python. Recently, there are also some emerging projects aiming at better usability and performance, such as Apache Arrow1, Modin2.

## SQL

Similar to the previous chapters, I will introduce the tools side by side. However, I feel before diving into the world of data.table and pandas, it is better to talk a little bit about SQL3. SQL is a Query language designed for managing data in relational database management system (RDBMS). Some of the most popular RDBMSs include MS SQL Server, MySQL, PostgreSQL, etc. Different RDBMSs may use SQL languages with major or subtle differences.

If you have never used RDBMS you may wonder why we need it?

• first, we need a system to store the data;
• second, we also need a system that allows us to easily access, manage and update the data.

Let’s assume there is a table mtcars in a database (I’m using sqlite3 in this book) and see some simple tasks we can do with SQL queries.

 name mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4 Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 Duster 360 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Merc 240D 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 Merc 450SL 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 Cadillac Fleetwood 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 Lincoln Continental 10.4 8 460 215 3 5.424 17.82 0 0 3 4 Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1 Toyota Corona 21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1 Dodge Challenger 15.5 8 318 150 2.76 3.52 16.87 0 0 3 2 AMC Javelin 15.2 8 304 150 3.15 3.435 17.3 0 0 3 2 Camaro Z28 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Pontiac Firebird 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2 Fiat X1-9 27.3 4 79 66 4.08 1.935 18.9 1 1 4 1 Porsche 914-2 26 4 120.3 91 4.43 2.14 16.7 0 1 5 2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 Ford Pantera L 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 Maserati Bora 15 8 301 335 3.54 3.57 14.6 0 1 5 8 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
SQL

In the example above, I select two rows from the table using the syntax select from. The keyword limit in sqlite specifies the number of rows to return. In other RMDBSs, we may need to use top instead.
It is straightforward to select on conditions with the where keyword.

SQL

We can definitely use more conditions with the where clause.

SQL

Staring at the example above, what are we doing? Actually, we are just accessing specific rows and columns from the table in database with select from where.
We can also do something a bit of more fancy, for example, to get the maximum, the minimum and the average of mpg for all vehicles grouped by the number of cylinders.

SQL

In the above example, there are a few things worth noting. We use as to create an alias for a variable; we group the original rows by the number of cylinders with the keyword group by; and we sort the output rows with the keyword order by. max, min and avg are all built-in functions that we can use directly.

It is also possible to have user-defined functions in SQL as what we usually do in other programming languages.

SQL is a very powerful tool for data analysis, but it works on RMDBS and generally we can’t apply R or Python functions to the database tables directly. Many practitioners in data science have to work with database at times but more often they need to work in a programming languages such as R or Python. We have introduced the data.frame in both R and Python in previous chapters. A data.frame is just like a database table that you may operate within the corresponding language. Usually a data.frame is stored in memory, but of course it can also be deserialized for storage in hard disks.

With the data.frame-like objects, we could build up a better data processing pipeline by only reading the original data from the database and storing the final output to the database if necessary. Most of works we can do with data.frame-like objects may also be done in RMBDS with SQL. But they may require intense interactions with a database, which is not preferred if they could be avoided.

Now, let’s get started with data.table and pandas. In this book, I will use data.table 1.12.0 and pandas 0.24.0.

## Get started with data.table & pandas

R
Python

I have saved the mtcars data to a csv4 file (code/chapter3/mtcars.csv), which is loaded from R 3.5.1. Although the mtcars data is loaded into R environment by default, let’s load the data by reading the raw csv file for learning purpose.

R Python

The type of mtcars_dt is data.table, not data.frame. Here we use the fread function from data.table to read a file and the output type is a data.table directly. Regarding reading csv in R, a very good package is readr for very large files, but the output has a data.frame type. In practice, it is very common to convert a data.frame to data.table with the function as.data.table.

## Indexing & selecting data

Before to introduce the indexing rules in data.table and pandas, it’s better to understand the key in data.table and the index in pandas.

What is the key in a data.table? We have talked about RMDBS and SQL in the previous section. With select from where we can easily access specific rows satisfying certain conditions. When the database table is too large, a database index is used to improve the performance of data retrieval operations. Essentially, a database index is a data structure, and to maintain the data structure additional cost (for example, space) may be required. The reason to use key in data.table and index in pandas is very similar. Now let’s see how to set key and index.

R Python

There are quite a few things worth noting from the above code snippets. When we use the setkey function the quotes for the column name is optional. So setkey(mtcars_dt, name) is equivalent to setkey(mtcars_dt, 'name'). But in pandas, quotes are required. The effect of setkey is in place, which means no copies of the data made at all. But in pandas, by default set_index set the index on a copy of the data and the modified copy is returned. Thus, in order to make the effect in place, we have the set the argument inplace=True explicitly. Another difference is that setkey would sort the original data in place automatically but set_index does not. It’s also worth noting every pandas data.frame has an index; and by default it is numpy.arange(n) where n is the number of rows. But there is no default key in a data.table.

In the above example, we only use a single column as the key/index. It is possible to use multiple columns as well.

R
Python

To use multiple columns as the key in data.table, we use the function setkeyv. It is also interesting that we have to use index.names rather than index.name to get the multiple column names of the index (which is called MultiIndex) in pandas. There are duplicated combinations of (cyl, gear) in the data, which implies key or index could be duplicated.

Once the key/index set, we can access rows with given the indices fast.

R

Here is a bit of explanation for the code above. We can simply use [] to access the rows with the specified key values if the key has a character type. But if the key has a numeric type, list() is required to enclose the key values. In data.table, .() is just an alias of list(), which means we would get the same results with mtcars_dt[list(6,4)]. Of course, we can use also do mtcars_dt[.('Merc 230')] which is equivalent to mtcars_dt[.('Merc 230')].

Python

Compared to data.table, we need to use the loc method when accessing rows based on index. The loc method also takes boolean conditions.

Python

When using boolean conditions, loc could be ignored for convenience.

Python

If the key/index is not needed, we can remove the key or reset the index. For data.table we can set a new key to override the existing one which then becomes a column. But in pandas, set_index method removes the exiting index which also disappears from the data.frame.

R Python

In Chapter 1, we introduced integer-based indexing for list/vector. It is also applicable to data frame and data.table

R Python

So far we have seen how to access specific rows. What about columns? Accessing columns in data.table and pandas is quite straightforward. For data.table, we can use \\$ sign to access a single column or a vector to specify multiple columns inside []. For data.frame in pandas, we can use . to access a single column or a list to specify multiple columns inside [].

R
Python

In addition to passing a vector to access multiple columns in data.table, we can also use the .(variable_1,variable_2,...).

It is also possible to do integer-based columns slicing.

R
Python

To access specific rows and specific columns, there are two strategies:

• select rows and then select columns in a chain;
• select rows and columns simultaneously.

Let’s see some examples.

R Python

As we have seen, using the setkey function for data.table sorts the data.table automatically. However, sorting is not always desired. In data.table, there is another function setindex/setindexv which has similar effects to setkey/setkeyv but doesn’t sort the data.table. In addition, one data.table could have multiple indices, but it cannot have multiple keys.

R

First, let’s see how to delete a single column.

R
Python

The := operator in data.table can be used to add/remove/update columns, by reference. Thus, when we use := no copies of the data is created. Getting familiar with this operator is critical to master data.table.
Next, let’s see how to delete multiple columns at the same time.

R
Python

The interesting fact of the code above is that in R the \%in\% function is vectorized, but the in function in Python is not.

Adding a single column to an existing data.table or DataFrame is as straightforward as removing.

R Python

Adding multiple columns is a bit of tricky compared with adding a single column.

R Python

In the R code, we use := to create multiple columns. In the Python code, we put the new columns into a dictionary and use the assign function with the dictionary unpacking operator **. To learn the dictionary unpacking operator, please refer to official document5. The assign method of a DataFrame doesn’t have inplace argument so we need to assign the modified DataFrame to the original one explicitly.

Now let’s see how to update values. We can update the entire column or just the column on specific rows.

R Python

We can also combine the technique of rows indexing with column update.

R Python

In addition to :=, we can also use set function to modify values in a data.table. When used properly, the performance gain could be significant. Let’s a fictional use case.

R

We see that updating the values with the set function in this example is as fast as updating the values in an array.

## Group by

At the beginning of this Chapter, we have seen an example with group by in SQL query. group by is a very powerful operation, and it is also available in data.table and pandas. Let’s try to get the average mpg grouped by cyl.

R Python

Group by also works on multiple columns.

R Python

We can also create multiple columns with group by. My feeling is data.table is more expressive.

R Python

In data.table, there is also a keyword called keyby which enables group by and sort operations together.

R

It is even possible to add expressions after the by keyword.

R

## Join

Join6 combines columns from one or more tables for RMDBs. We also have the Join operation available in data.table and pandas. We only talk about 3 different types of joins here, .i.e., inner join, left join, and right join. Left join and right join are also referred as outer join.

Let’s make two tables to join.

R Python

We can join tables with or without the help of index/key. In general, joining on index/key is more fast. Thus, I recommend always to set key/index for join operations.

R

To join data.table A and B, the syntax A[B] and B[A] only work when the keys of A and B are set. The function merge from data.table package works with or without key.

R

The join operations for data.table are summarized in table below.

 type syntax 1 syntax 2 A inner join B A[B, nomatch=0] or B[A, nomatch=0] merge(A, B, all.x=FALSE, all.y=FALSE) A left join B B[A] merge(A, B, all.x=TRUE, all.y=FALSE) B right join A A[B] merge(A, B, all.x=FALSE, all.y=TRUE)

In pandas, there are also different ways to join DataFrame. Let’s just focus on the basic method with the merge function (other methods may also be based on this function).

Python

We have learned the very basics of data.table and pandas. In fact, there are lots of other useful features in both tools which are not covered in this chapter. For example, the .I/.N symbol in data.table, and the stack/unstack method in pandas.

1 https://arrow.apache.org

2 https://github.com/modin-project/modin

3 https://en.wikipedia.org/wiki/SQL

4 https://en.wikipedia.org/wiki/Comma-separated_values

5 https://www.python.org/dev/peps/pep-0448

6 https://en.wikipedia.org/wiki/Join_(SQL)