An Introductory Book to Machine Learning, Statistics & Programming for Immunologists

Data Science for Immunologists

About Us

  • Niclas Thomas MMath MRes PhD

    Niclas is a professional data scientist with experience at several large retailers, tackling a wide range of business problems using advanced mathematical and statistical techniques.

    He gained a PhD at UCL using machine learning to build predictive models of the immune system, combining standard experimental techniques with advanced analytical methods. These methods were applied to a variety of data types, from high throughput sequencing data to lower dimensional cell frequency data. Following completion of his PhD, he worked as a post-doctoral research associate, using machine learning to predict renal transplant failure from flow cytometry data.

    Whilst academic data science differs from corporate data science in a few respects, such as scope for innovation and pace of delivery, there are far more similarities than differences. Both require sound theoretical knowledge, the ability to implement mathematical ideas programmatically, a collaborative mindset and the ability to communicate complex approaches and results to non-technical colleagues. The last of these is a real passion of his.

    He consults on several immunological projects, advising on statistical analysis, experimental design and visualisation techniques, amongst other things. As well as improving the quality of data analysis, he enjoys bringing data science to new audiences, and this website and the associated book are written with this in mind.

    datascienceforimmunologists (@) gmail.com

  • Laura Pallett BSc PhD

    Laura is a postdoctoral research scientist at UCL, focussing on understanding the mechanisms of immune dysfunction in chronic hepatitis B infection.

    Having gained a PhD in viral immunology with first-author publications in Nature Medicine and Journal of Experimental Medicine, she is now investigating immunological mechanisms at play in the human liver, with specific interests in immunometabolism and tissue-residency.

    Although she trained as a laboratory scientist, she has recently taken a keen interest in advanced in silico techniques for the analysis of immunological data. In particular, she believes that strong fundamentals in statistics, coupled with practical knowledge of visualisation techniques, are a must for the modern immunologist.

    In addition to her academic experience, she has worked at GlaxoSmithKline and acts on immunology advisory boards for Gilead.

    She is passionate about science communication, and is the co-founder and co-author of her lab's Twitter feed, which aims to bring their work to a broader audience. She is also an early careers representative on the British Society for Immunology Forum.

    datascienceforimmunologists (@) gmail.com

What is our book about?

Data science is a complex subject, but nevertheless one that can be made accessible to all through clear, intuitive explanations and worked examples.

Existing software that forms the backbone of an immunologist's analytical toolkit (such as FlowJo and Prism) is expensive and inflexible, and promotes a narrow mindset when it comes to analysing your data. On the other hand, the Python and R programming languages are open source, free and entirely customisable, giving the user the ability to implement any analysis they wish.

Although programming languages can seem daunting to the uninitiated, they are far easier to learn than many immunologists may think. You do not need to become an expert programmer: an understanding of the main concepts, coupled with a sound mathematical and statistical understanding, is more than enough to conduct your own bespoke analyses.

Our book focusses on the practical aspects of data science, providing sufficient theoretical background without delving into every detail of each of the methods presented. Introductory chapters are presented alongside the analysis of a publicly available data set, giving the reader practical, hands-on experience when learning about important concepts in statistics, machine learning and programming. Topics include:

  • How to build a predictive model
  • How to visualise high-dimensional data
  • Basics of programming in Python and R
  • What techniques exist to cluster data
  • Which statistical test to use, why and when
  • What is dimension reduction; when and how to use it

Once these fundamental topics have been covered, a number of case studies are presented, along with the underlying data, accompanying code and full explanations on topics such as automated, data-driven flow cytometry, building predictive models of disease using gene expression profiling and analysing high throughput sequencing data.

What is Data Science?

Whilst no single definition of data science exists, it broadly concerns the application of the scientific method to data. Data science is used to obtain insights from data through a combination of advanced analytic techniques.

Typically, data science brings together techniques from several fields, acting at the intersection of statistics, mathematics, machine learning and programming. Machine learning in particular has seen a huge growth in interest recently, and whilst the reality may not always live up to the hype, recent advances in this field allow novel insights to be drawn from data like never before.

Knowledge of these techniques means little without the power to implement them. Fortunately, many of these techniques can be easily implemented in Python and R, open source programming languages that benefit from a wide community of experts who contribute code to these projects and make it publicly available. A little bit of knowledge is a dangerous thing, however, and it would be unwise to implement these techniques without knowing how they work, and when and when not to use them.

The best data scientists combine a sound knowledge of machine learning and statistical techniques with strong programming capability. However, the demands of data science in a corporate setting are very different from those in an immunological research setting. For example, the practice of web scraping (automating the retrieval of information from web pages) is commonly required in a business setting, but in the context of immunology, this would be an irrelevant skill.

Courses and textbooks that focus on data science as a general skill set are not particularly relevant for the researcher in immunology. For this area of research, data science requires a heavy focus on frequentist statistics (i.e. statistical significance testing) and visualisation techniques (e.g. dimension reduction, clustering and alternative methods of displaying information).
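
As an illustration of the kind of significance testing referred to above, the following is a minimal sketch of a two-sided permutation test written in plain Python. The cell frequencies and the function name permutation_test are invented for illustration and are not taken from the book or its data set.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled measurements to the two groups
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical CD8+ T cell frequencies (% of lymphocytes), invented for illustration
healthy = [22.1, 25.4, 19.8, 24.0, 23.3, 21.7]
patients = [15.2, 17.9, 14.1, 18.5, 16.0, 13.8]

p_value = permutation_test(healthy, patients)
print(f"p = {p_value:.4f}")
```

A small p-value here would suggest that a difference in means this large is unlikely to arise from random group assignment alone. Whether a permutation test is the right choice depends on the experimental design, which is exactly the "which test, why and when" question.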

Rather than learning about clustering as a general method for grouping similar objects together, it's far more useful to learn about and apply clustering techniques in a more relevant context such as flow cytometry. Likewise, machine learning concepts such as classification models are easier to understand when introduced in the context of developing a clinical patient model.
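
To make the clustering idea concrete, here is a minimal sketch of k-means applied to simulated two-marker data, loosely in the spirit of flow cytometry. Everything in it is an assumption made for illustration: the data are synthetic, the marker intensities are in arbitrary units, the helper names are hypothetical, and the farthest-point initialisation is simply a convenient choice for this sketch rather than a method taken from the book.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Farthest-point initialisation: start from a random point, then
    # repeatedly add the point farthest from the centroids chosen so far
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = assign(points, centroids)
    for _ in range(max_iter):
        # Move each centroid to the mean of the points assigned to it
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments stable, so stop early
            break
        labels = new_labels
    return labels, centroids


# Synthetic "cells": two populations measured on two hypothetical markers
data_rng = random.Random(42)


def population(centre, n=50, sd=0.2):
    return [(data_rng.gauss(centre[0], sd), data_rng.gauss(centre[1], sd)) for _ in range(n)]


cells = population((1.0, 5.0)) + population((5.0, 1.0))
labels, centroids = kmeans(cells, k=2)
print(centroids)
```

With two populations this well separated, the recovered centroids should sit close to the simulated centres. Real cytometry data are rarely this clean, which is why the choice of clustering method and the number of clusters both need care.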

Data science for immunologists places more emphasis on relevant techniques, ignoring methods that will likely never be needed and instead focussing on those that must form the backbone of an immunologist's analytical toolkit.

Downloads

Files required to work through examples in our book are available here