Create Toy Survey Data

An R package designed to simplify the process of creating synthetic survey data.

Why create this package?

This package began as a project to create a simple synthetic dataset based on Community Health Assessment surveys. I was preparing a presentation on cleaning and validating survey data, but I could not use real survey data, for which privacy and confidentiality are major concerns. I reviewed some existing survey creation packages, but they either did not allow the level of customization I needed or relied on models that take real data as a starting point, and I had no real data. I also wanted a tool that did not require much knowledge of modeling or coding.

What does this package do?

The toysurveydata package is designed to be a quick, easy way to create synthetic data that mimic real data structures for the following purposes:

to test workflows and error handling
to teach data management

It generates individual responses for categorical questions based on a priori knowledge of response proportions. It also contains functions to handle numeric and date variables. Most functions include an option to add missingness, and one function adds error to numeric data.
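
To illustrate the general idea of drawing categorical responses from known proportions and then adding missingness, here is a minimal sketch in base R. It does not call any toysurveydata function; the question, the proportions, and every variable name are assumptions made up for this example.

```r
# Illustrative base R only; this is not the toysurveydata API.
set.seed(42)  # fixed seed so the sketch is reproducible

n_respondents <- 200

# Assumed a priori response proportions for one categorical question
options <- c("Excellent", "Good", "Fair", "Poor")
proportions <- c(0.20, 0.45, 0.25, 0.10)

# Draw one response per respondent according to those proportions
general_health <- sample(options, size = n_respondents,
                         replace = TRUE, prob = proportions)

# Blank out 5% of responses at random to simulate missingness
percent_missing <- 5
n_missing <- round(n_respondents * percent_missing / 100)
general_health[sample(n_respondents, n_missing)] <- NA

# Compare the observed shares with the target proportions
round(prop.table(table(general_health, useNA = "ifany")), 2)
```

With only 200 respondents, the observed shares will be close to the targets but will not match them exactly.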

Getting started

To install this package directly from GitHub, run the following code:

# install devtools if you do not already have it
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# install toysurveydata with all required packages from CRAN
# and build the vignettes so they are available locally
devtools::install_github("ajstamm/toysurveydata", dependencies = TRUE,
                         build_vignettes = TRUE)

While functions can be used individually, the package is designed to allow you to build a settings table as described in the Settings Table Design vignette, then run most functions on that table to generate the full dataset.
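
The table-driven workflow can be pictured with a small base R sketch. The real settings table layout is defined in the Settings Table Design vignette; the column names, the `|` separator, and the `make_column()` helper below are all hypothetical and exist only to show the pattern of one settings row per variable.

```r
# Illustrative base R only; the table layout and helper are hypothetical,
# not the format that toysurveydata expects.
set.seed(7)
n <- 50

# Hypothetical settings table: one row per survey variable
settings <- data.frame(
  variable = c("smoker", "insured"),
  options = c("Yes|No", "Yes|No|Not sure"),
  proportions = c("0.15|0.85", "0.80|0.15|0.05"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one column from one settings row
make_column <- function(row, n) {
  opts <- strsplit(row$options, "|", fixed = TRUE)[[1]]
  probs <- as.numeric(strsplit(row$proportions, "|", fixed = TRUE)[[1]])
  sample(opts, size = n, replace = TRUE, prob = probs)
}

# Apply the helper to every row and bind the results into one dataset
columns <- lapply(seq_len(nrow(settings)),
                  function(i) make_column(settings[i, ], n))
survey <- as.data.frame(setNames(columns, settings$variable))
head(survey)
```

Keeping the question definitions in a table means a new variable is added by adding a row, not by writing more code.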

Limitations of this package

This package is designed to be very simple. It would not be appropriate for research.

This package does not perform any special modeling and does not require existing data. Most functions generate only one variable at a time and do not account for the values of, or relationships with, other variables. It includes optional missingness and a function to introduce random error of different kinds to data in a numeric variable.
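
As a rough picture of what these limitations mean in practice, the base R sketch below generates two numeric variables independently and then injects two kinds of random error into one of them. The variables and the error types (a shifted decimal place and a flipped sign) are assumptions chosen for illustration; they are not necessarily the error types the package implements.

```r
# Illustrative base R only; this is not the toysurveydata API.
set.seed(1)
n <- 100

# Each variable is generated on its own, so age and weight are unrelated
age <- sample(18:90, size = n, replace = TRUE)
weight_kg <- round(rnorm(n, mean = 75, sd = 12), 1)

# Pick 10 rows at random to corrupt
bad_rows <- sample(n, 10)
weight_err <- weight_kg

# Assumed error type 1: decimal point shifted one place
weight_err[bad_rows[1:5]] <- weight_err[bad_rows[1:5]] * 10

# Assumed error type 2: sign flipped
weight_err[bad_rows[6:10]] <- -weight_err[bad_rows[6:10]]

# The corrupted values stand out against the plausible range
summary(weight_err)
```

Because the variables are drawn independently, any correlation between `age` and `weight_kg` in a dataset like this is noise, which is one reason such data suit workflow testing and teaching but not research.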

Possible future plans

If you would like to see any of these, add an issue with an example of what you need them for.

Function changes

Add an option in the select-many function to require an exact number of selections
Add a function to handle ranked choice questions
Add error-creation functions for text values, such as random upper/lower case and misspellings
Add non-random missingness
Maybe make the IP function at least nominally geographically sensitive

Documentation changes

Rethink or improve instructions for percent missing and number of options in the settings table
Maybe integrate with or suggest packages that handle things like random addresses