8 Big Data

8.1 Big Data with R - Exercise book

by James Blair

This 2-day workshop covers how to analyze large amounts of data in R. We will focus on scaling up our analyses using the same dplyr verbs that we use in our everyday work. We will use dplyr with data.table, databases, and Spark. We will also cover best practices for visualizing, modeling, and sharing results when working against these data sources. Where applicable, we will review recommended connection settings, security best practices, and deployment options.

Link: https://rstudio-conf-2020.github.io/big-data/

8.2 Exploring, Visualizing, and Modeling Big Data with R

by Okan Bulut, Christopher Desjardins

Working with big data requires a particular suite of data analytics tools and advanced techniques, such as machine learning (ML). Many of these tools are readily and freely available in R. This full-day session will provide participants with hands-on training in how to use the data analytics tools and machine learning methods available in R to explore, visualize, and model big data.

Link: https://okanbulut.github.io/bigdata/

8.3 Larger-Than-Memory Data Workflows with Apache Arrow

by Danielle Navarro, Jonathan Keane, Stephanie Hazlitt

In this tutorial you will learn how to use the arrow R package to create seamless engineering-to-analysis data pipelines. You’ll learn how to use interoperable data file formats like Parquet or Feather for efficient storage and data access. You’ll learn how to exercise fine control over data types to avoid common data pipeline problems. During the tutorial you’ll be processing larger-than-memory files and multi-file datasets with familiar dplyr syntax, and working with data in cloud storage.

Link: https://arrow-user2022.netlify.app

8.4 Mastering Spark with R

by Javier Luraschi, Kevin Kuo, Edgar Ruiz

In this book you will learn how to use Apache Spark with R. The book is intended to take someone unfamiliar with Spark or R and help them become proficient by teaching a set of tools, skills, and practices applicable to large-scale data science.

PS the first chapter has a Jon Snow quote ;)

Link: https://therinspark.com/

8.5 Using Spark from R for performance with arbitrary code

by Jozef Hajnala

This book provides practical insights into using the sparklyr interface to gain the benefits of Apache Spark while still retaining the ability to use R code organized in custom-built functions and packages.

Link: https://sparkfromr.com


Created and maintained by Oscar Baruffa
