18 Getting, cleaning and wrangling data
18.1 21 Recipes for Mining Twitter Data with rtweet
by Bob Rudis
The recipes contained in this book use the rtweet package by Michael W. Kearney.
18.2 A Beginner’s Guide to Clean Data
by Benjamin Greve
This book will help you to become a better data scientist by showing you the things that can go wrong when working with data, particularly low-quality data. A key difference between a junior and a senior data scientist is the awareness of potential pitfalls. The experienced data scientist will expect them, navigate around them and avoid costly iteration cycles. After reading this book, you will be able to spot data quality problems and deal with them before they can break your work, saving yourself a lot of time.
Link: https://b-greve.gitbook.io/beginners-guide-to-clean-data/
18.3 Data Wrangling Essentials
by Mark Banghart
The R and Python communities have developed a set of tools in the tidyverse and the pandas packages respectively designed to wrangle table data. The intuitive nature of these packages makes learning to use them easy and the code easy to read and understand. These tools allow researchers to quickly and accurately complete data preparation for a wide variety of analyses. The application of these packages and their approaches to wrangling is the subject of this book.
The Data Wrangling Essentials title was chosen to emphasize both the use of these new tools and the importance of the work of gathering and preparing data.
18.4 Fundamentals of Wrangling Healthcare Data with R
by J. Kyle Armstrong
In this course we will review some of the tools of the trade, namely R’s tidyverse (Wickham and Grolemund 2017; Winter 2019), a collection of R packages designed with a common framework to aid in common data wrangling and data management tasks.
Data Wrangling is one subset of skills within the Data Science Process. We will carefully investigate how decisions made while collecting and preparing the data have downstream effects on model performance.
Link: https://bookdown.org/jkylearmstrong/jeff_data_wrangling/
18.5 Handling Strings With R
by Gaston Sanchez
Handling character strings in R? Wait a second… you exclaim, R is not a scripting language like Perl, Python, or Ruby. Why would you want to use R for handling and processing text? Well, because sooner or later (I would say sooner than later) you will have to deal with some kind of string manipulation for your data analysis. So it’s better to be prepared for such tasks and know how to perform them inside the R environment.
Paid: $20, with a free preview of the first 4 chapters
18.6 Spreadsheet Munging Strategies
by Duncan Garmonsway
This is a work-in-progress book about getting data out of spreadsheets, no matter how peculiar. The book is designed primarily for R users who have to extract data from spreadsheets and who are already familiar with the tidyverse. It has a cookbook structure, and can be used as a reference, but readers who begin in the middle might have to work backwards from time to time.
Link: https://nacnudus.github.io/spreadsheet-munging-strategies/
18.7 Text Mining with R
by Julia Silge, David Robinson
This book serves as an introduction to text mining using the tidytext package and other tidy tools in R. The functions provided by the tidytext package are relatively simple; what is important are the possible applications. Thus, this book provides compelling examples of real text mining problems.
18.8 Text Mining With Tidy Data Principles
by Julia Silge
Text data sets are diverse and ubiquitous, and tidy data principles provide an approach to make text mining easier, more effective, and consistent with tools already in wide use. In this tutorial, you will develop your text mining skills using the tidytext package in R, along with other tidyverse tools.
Created and maintained by Oscar Baruffa.