Another Book on Data Science

Learn R and Python in Parallel

Nailong Zhang

Why am I writing this book?

Maybe a major reason is an existential crisis.

The feedback from readers is another important reason. A few months ago I submitted a git repo with three chapters of this book in PDF format to Hacker News, and surprisingly the repo got 500 stars in a week. I received a few emails expressing thanks and interest in more chapters. Since then, I have been working on this project constantly.

About

There has been considerable debate over choosing R vs. Python for Data Science. I started to learn Python when I was an undergraduate. At that time I had never heard of Data Science. A few years later I read an R script for the first time. After that, R was my primary programming language for quite a while during my Ph.D. studies. I also used to learn new programming languages as a hobby. Based on my limited knowledge and experience, both R and Python are great languages and are worth learning; so why not learn them together?

The book is still under development, and the code can be found at this git repo. If you have any ideas to share or find any errors in the book, please contact me directly by email at setseed2016@gmail.com.

You may find the structure of this book loose. That is deliberate, because the definition of Data Science is vague.

A PDF version of this book will be available soon!

Target audience

  • If you have little programming experience and want to learn to program
  • If you know either R or Python, and want to learn the other
  • If you are interested in Data Science or Quantitative Analytics
  • If you want to see how to implement some basic machine learning models from scratch, such as linear regression (ridge, Lasso), gradient boosting regression, etc.
  • Or if you just want to read it

Chapters

Introduction to R/Python Programming
calculator, variable & type, functions, control flows, some built-in data structures, object-oriented programming
More on R/Python Programming
write & run R/Python scripts, debugging, benchmarking, vectorization, embarrassingly parallel computing, scope of variables, evaluation strategy, speed up with C/C++, functional programming
data.table and pandas
SQL, introduction to data.table and pandas, indexing & selecting data, add/remove/update, group by, join
Linear Regression
basics of linear regression, linear hypothesis testing, ridge regression
Optimization in Practice
convexity, gradient descent, root-finding, general purpose minimization tools, linear programming, simulated annealing
Predictive Modeling in Practice (under development)
population & random samples, universal approximation, overfitting, gradient boosting machine, etc.