2 Introduction
This is not a book on learning to program with R or python. Consider it the companion for non-programmers who are learning python with some knowledge of R.
Learning a new programming or scripting language begs comparison. Often this is not overly helpful to the learner who, coming from a zone of comfort is often frustrated by the old subtleties not consciously recognized, being yet unknown in the new. I recall a family member, completely new to computers, saying why doesn’t it know what I mean expecting the implied parts of language to transfer from English (or German, or Chinese etc.) to computer language.
R writers early on recognize the absence of vectorization of functions in core python. It’s available though in numpy and pandas, but base python doesn’t have it and they miss it. You’ll get there, really. Object Orientation is also not natural to either R or Python. It came later to both, but it seems more smoothly accomplished by python than R.
I soon realized, though that you had to know the programming commonalities of python first; R comparisons really should start with the libraries of python. R in someways has cut right to the data science chase often at the expense of strong programming constructs. This is the nod to interactive coding and scripting rather than the need to undertake formal programming. Python began early on with a scripting functionality but it wasn’t until the development of iPython that interactive python could be realized in the way that R users were used to.
All this is not to say that R was lacking in basic programming constructs from the start, it wasn’t; but the proportion of users who used R in a quick-and-dirty small problem focused way was quite high. This suited it’s use as a learning environment for introductory statistics7. I should think that those who try to start python at the same time as their first statistics course will find the learning curve too steep to help with their need to learn statistics.
Dalgaard7 was the first source I used to learn R during my second semester of Statistics. I similarly used Lutz8 to learn Python well after I was comfortable with R and Probability and Statistics. One needs to get past the very basic programming knowledge of python to effectively get up to speed with matrix algebra, probability and statistics and data munging using the available libraries outside of the basic and Standard Library2 functions of python. There are books to help you cut some corners in applying python to data science9, 10, but it’s very easy to find oneself confused without a good level of comfort with the ideas and the pitfalls of importing modules in python. Namespaces and variable visibility are less forgiving in python than R.
Once you get comfortable with the math and stats functions available in Numpy and/or Scipy modules, the data structures and munging functions of Pandas and the graphics and plotting functions of Matplotlib you are up to the speed you had gotten to much earlier with R.
As a final word, it may seem like I distain style, in favor of substance. I recall in a list of interview questions somewhere whether software that worked was more important than well documented code or vis-a-versa. It is a trick question. Though software that doesn’t work is useless, it may be worse if it is unfixable too. You can’t debug what is not understandable and spagetti code that is undocumented and disappearing into thin air never to be seen again is impossible to work on. Not only will others who want to adapt or debug our old code be lost, but given a month away from it, we probably won’t know whatwe mwere doing either. Comment obscessively from the start.
There is a plethora of free open source code and applications out there. Poor documention for users is lazy and sloppy and betrays our responsibility to tell our clients how to use the solutions we offer. Sometimes those clients are us and we deserve better too. Writing user documentation helps us think about our projects in a new perspective - making our work more valuable.
Reproducibility is a necessity for any modern endeavor. In the face of the reseach fraud resurfacing more and more, we can no longer ask people to “just trust us”. Data Scientists must know this from the outset. markdown with integrated code, Rstudio and Jupyter addresses this issue superbly well and so their use is central to what I am writing. The fact that they allow multi-language processing in the documents is the bonus. See the References and Resources at the end of the book for other books and articles on Documentation, Style and Reprpoducibility .
So now we can forge ahead and look at R and python side-by-side and perhaps reduce the learning curve for python DataScience libraries.