Periodically someone asks me about books, courses, and other learning resources for data science. Given that I don’t have a numerate degree but have somehow ended up in my job, I’ve certainly worked my way through a lot of resources as an independent learner.
Does that mean I have opinions about those resources? Yes.
Does that make me an expert in the field of data science learning? No.
Does that mean these opinions will definitely be useful to you? Also no.
That’s not much of a sales pitch, but that’s OK because this primarily exists to help me stop repeating myself and give better answers to people. Mostly these are the resources that I think have been worth my time. Hopefully you’ll find this write-up helpful, but please remember that this is only my perspective and isn’t meant to be authoritative. For example: when I describe something as “essential” I mean that I find it essential, but of course you may not. I’m not your boss, your teacher, or your mum, and I’m not assigning you homework.
This list isn’t guaranteed to be complete, but I’ll update it whenever I remember to and redeploy this document. If you spot anything wrong (typos, broken links, etc.) please file an issue or tell me on Twitter.
Maths and Statistics
Stat 110, Introduction to Probability
This is (I think, I’m not going to look it up) the most popular undergraduate course at Harvard. It covers probability in (what I think is) a rigorous-enough way to be productive, which is to say that you learn about probability distributions, random variables, and conditioning in a careful way but without getting into measure theory. I use the ideas from this course probably every day.
At the start it all seems very simple (i.e. learning how to count) but it ratchets up the difficulty until you’re taking on quite challenging problems.
There is a slimmed-down version on edX, which looks good. But the full version was great for me. Without doubt the most valuable course I’ve ever taken.
MIT OCW 18.01SC and 18.02SC, Single Variable and Multivariable Calculus
Calculus comes up everywhere (including Stat 110). Lots of my colleagues (e.g. physicists) have enough calculus from their degrees to be able to take on most of the stuff you’ll encounter in data science. For the rest of us these two courses are a great start: video lectures and exercises to work through the course in full. Both of these are SC-edition courses on OCW, which means they are designed specifically for independent learners.
MIT OCW 18.06SC, Linear Algebra
Delivered by the recently-retired Gilbert Strang, and it is excellent. It’s a fairly practical course in linear algebra as opposed to a more theory-heavy course that starts with abstract vector spaces[1]. Linear algebra underpins so much of my work that understanding it really pays off.
MIT OCW 18.03SC, Differential Equations
Not strictly necessary but a brilliant course that I have found extremely useful. The lecturer (the late Arthur Mattuck) is amazing. His explanation of the Laplace transform is worth watching on its own, whether or not you want to study the course in full. The section on Fourier series is also very good.
Statistical Rethinking Second Edition, by Richard McElreath
I’m not the only one that likes this: “a pedagogical masterpiece”, says Rasmus Bååth; Gelman regularly recommends it as a more accessible alternative to his own book. It’s aimed at researchers (interpreted quite broadly) who need to use statistics to study their own field: the author is an evolutionary anthropologist who accidentally became a statistics teacher. The framework is Bayesian, and I found the justification for going Full-Luxury Bayesian Inference convincing.
Much like Stat 110 it starts simple (coincidentally also with counting) and builds up from there. The explanations of Markov Chain Monte Carlo and Hamiltonian Monte Carlo were the first to really click for me.
The book’s subtitle is A Bayesian Course with Examples in R and Stan, and the author’s examples are in base R and mostly using his own {rethinking} package as a Stan interface. But the popularity of the book means that there are plenty of secondary resources in other languages and/or dialects listed on his website.
Elements of Statistical Learning, by Trevor Hastie et al.
This is a pretty heavy book, but it’s rigorous. It treats its topics in depth, and I’ve often turned to it for an authoritative voice on something.
Understanding Analysis, Second Edition, by Stephen Abbott
I wanted to take on a rigorous treatment of probability theory (i.e. measure theory). The path to doing that goes through mathematical analysis, but the problem is that many of the famous texts out there (e.g. Baby Rudin) are notoriously difficult, which makes them tough for independent study. Having scoured Stack Exchange for recommendations I bought this book and was very happy with it. It’s written in a way that worked well for me. Also the publisher (Springer) has very reasonable prices, which helps when you’re self-funding.
Measure, Integration, and Real Analysis, by Sheldon Axler
Still a work in progress, but I am enjoying it.
Programming
Advanced R, by Hadley Wickham
This has repaid my time and effort many times over. I’ve returned to it countless times. A no-brainer for anyone working with R.
Machine Learning with PyTorch and Scikit-Learn, by Sebastian Raschka et al.
This one is a work in progress for me, but I’m getting a lot from it. It also straddles the border between this section on programming and the one on maths and statistics. But since I’m using it mostly as a way to improve in Python, here it stays.
The problem I had was that I needed something beyond a beginner’s Python book, but still accessible enough. This book is working really well as a way to up my Python game but with machine learning concepts that are familiar.
Data Science Specialisation, Johns Hopkins University on Coursera
A collection of courses on data science covering programming and statistics. It was ages ago that I completed these, so the material might not be bang up to date. I learned a lot from it though.
Machine Learning, Andrew Ng on Coursera
One of the most famous MOOCs ever, although it’s now changed from its original form: among those changes is a switch to Python from Matlab/Octave. Gets you to hand-code a neural network, then build up to more complex problems. Nice mix of theory and practice. Not sure I would recommend it now, because the version I completed no longer exists. But Andrew Ng has a pretty great track record as an educator (co-founded Coursera) and practitioner (e.g. he co-wrote the paper on Latent Dirichlet Allocation) so this might be worth a go.
Footnotes
[1] Such an abstract linear algebra course is not essential but well worth doing. I would recommend the book Linear Algebra Done Right by Sheldon Axler to anyone that’s interested. There is a free e-book, but the hard copy is great value.