A Step-By-Step Guide to Data Management
In 1999, the US Office of Management and Budget (OMB) Circular A-110 was amended to require Federal awarding agencies to ensure that all data produced under an award be made available to the public through the procedures established under the Freedom of Information Act (FOIA). The federal government strengthened its position on research data management in January 2011, when the National Science Foundation (NSF) instituted a requirement that all proposals submitted to that agency include a supplementary document of no more than two pages labeled “Data Management Plan” (DMP). The DMP should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results. Specifically, the plan should address five points: expected data; the period of data retention; data formats and dissemination; roles and responsibilities; and data storage and preservation of access. Other funders and even some journal publishers are instituting similar requirements.
Here’s a step-by-step guide to thinking about and preparing a DMP. It is based primarily on the DataONE Data Life Cycle and represents a highly condensed version of the DataONE Best Practices Primer.
STEP 1. ASSEMBLE YOUR DATA MANAGEMENT TOOLKIT
Fortunately, there are online resources that can help you develop a DMP. Here are three especially useful ones:
- DMPTool (https://dmp.cdlib.org/). DMPTool contains sample DMPs from a number of institutions and for a number of federal funding agencies, as well as guidance on writing new DMPs.
- DataONE (http://www.dataone.org/). DataONE is a resource for environmental science. However, it is an excellent source of information on scientific data management in general. See especially the Best Practices Primer.
- Databib (http://databib.org/). Databib is a collaborative, annotated bibliography of primary research data repositories developed with support from the Institute of Museum and Library Services (IMLS). Specific guidelines on which of the repositories accept submissions and how to go about submitting data can be found within the bibliography itself, under “Guidelines for Bibliographers”.
In addition to these resources, you should also explore your institutional resources. Some institutions have data management plan templates, suggested institutional data centers, budget suggestions, and useful tools for planning your project. Contact your office of sponsored research, your library, or your office of information technology.
STEP 2. PLAN
You should plan for data management as your research proposal is being developed. Revisit your data management plan frequently during the project and make changes as necessary. Questions to consider include:
- Based on the hypotheses and sampling plan, how much data will your project generate?
- How will the samples be collected and analyzed?
- What instruments will be used?
- How will data be organized within a file?
- What file formats will be used? Does your community have standard formats (file formats, units, parameter names)? Consider whether a relational database or other data organization strategy might be most appropriate for your research.
- Who is in charge of managing the data? How will version control be handled? How will data be backed up, and how often?
- What types of personnel will be required to carry out your data management plan? What types of hardware, software, or other computational resources will be needed? What other expenses might be necessary, such as data center donation or payment? The budget prepared for your research project should include estimated costs for data management.
- How long do you intend to use or keep the data?
- How do you intend to provide access to the data? Develop a plan for sharing data with the project team, with other collaborators, and with the broader science community. Under what conditions will data be released to each of these groups? What are the target dates for release to these groups? How will the data be released?
- What are your plans for long-term preservation of the data?
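One planning question above, how data will be backed up and verified, lends itself to a concrete check. The sketch below compares checksums of original files against a backup copy; the directory layout and function names are illustrative, not part of any standard tool.

```python
import hashlib
from pathlib import Path

def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Compute a checksum of a file, reading in chunks so large data files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(original_dir, backup_dir):
    """Return the paths of files whose backup copy is missing or differs from the original."""
    problems = []
    for original in Path(original_dir).rglob("*"):
        if not original.is_file():
            continue
        backup = Path(backup_dir) / original.relative_to(original_dir)
        if not backup.is_file() or file_checksum(original) != file_checksum(backup):
            problems.append(str(original))
    return problems
```

Running a check like this on a schedule (and after every transfer to a second location) turns "data will be backed up" from a statement of intent into something routinely verified.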
STEP 3. COLLECT AND CHECK YOUR DATA
Tools: DataONE Best Practices
It is important to collect data in a way that ensures its usability later. Consider methods and documentation carefully before collection begins.
- Consider creating a template for use during data collection. This will ensure that any relevant contextual data are collected, especially if there are multiple data collectors.
- Describe the contents of your data files. Define each parameter, including its format, the units used, and codes for missing values.
- Provide examples of formats for common parameters. Data descriptions should accompany the data files as a “readme.txt” file, a metadata file using an accepted metadata standard, or both.
- Perform basic quality assurance and quality control on your data during data collection, entry, and analysis.
- Describe any conditions during collection that might affect the quality of the data.
- Identify values that are estimated, double-check data that are entered by hand (preferably entered by more than one person), and use quality level flags to indicate potential problems.
- Check the format of the data to be sure it is consistent across the data set.
- Perform statistical and graphical summaries (e.g. max/min, average, range) to check for questionable or impossible values and to identify outliers.
- Communicate data quality using quality flags within the data set, in the metadata, or in the data documentation.
- Identify missing values.
- Check data using similar data sets to identify potential problems.
Additional problems with the data may also be identified during analysis and interpretation, prior to manuscript preparation.
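Several of the checks above (identifying missing values, computing min/max summaries, flagging questionable values) can be automated. The sketch below is a minimal example in Python; the column name, missing-value sentinel, and valid range are hypothetical and should be replaced with your own parameter definitions.

```python
import csv
import io
import statistics

# Hypothetical sentinel and plausible range for a temperature column;
# adjust these to match your own data dictionary.
MISSING = "-9999"
VALID_RANGE = (-60.0, 60.0)  # degrees Celsius

def qc_column(rows, column):
    """Summarize one numeric column: min, max, mean, plus the file row
    numbers of missing and out-of-range (suspect) values."""
    values, missing, suspect = [], [], []
    for row_num, row in enumerate(rows, start=2):  # row 1 is the header
        raw = row[column]
        if raw == MISSING or raw.strip() == "":
            missing.append(row_num)
            continue
        x = float(raw)
        if not (VALID_RANGE[0] <= x <= VALID_RANGE[1]):
            suspect.append(row_num)
        values.append(x)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "missing_rows": missing,
        "suspect_rows": suspect,
    }

# A tiny made-up data file, inline for illustration.
data = io.StringIO("date,temp_c\n2024-01-01,3.5\n2024-01-02,-9999\n2024-01-03,99.9\n")
report = qc_column(csv.DictReader(data), "temp_c")
```

Here `report["missing_rows"]` and `report["suspect_rows"]` point back to specific rows in the file, which is exactly the kind of record-level quality flagging the metadata should describe.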
STEP 4. DESCRIBE AND DOCUMENT YOUR DATA
Comprehensive data documentation is the key to future understanding of data. Without a thorough description of the contents of the data file, the context in which the data were collected, the measurements that were made, and the quality of the data, it is unlikely that the data can be easily discovered, understood, or effectively used. Data descriptions should include the following elements:
- A data dictionary
- A defined data model
- A description of the research project
- Descriptive file names
- Standard terminology to enable discovery
- Formats for spatial location and time
- A description of the contents of data files
- A description of the overall organization of your dataset
- The spatial extent and resolution of your dataset
- The temporal extent and resolution of your dataset
- Units of measurement for each observation
- Steps used in data processing
- Taxonomic information
- A data organization strategy
- Consistent data typing
- Citation and document provenance for your dataset
- Capabilities for tagging and annotation of your data by the community
- An identifier for the dataset used
- Appropriate field delimiters
- Consistent codes
Examples of these and other elements can be found under "Describe" in the DataONE Best Practices Primer.
Using an appropriate metadata standard is a crucial part of data description. Comprehensive metadata enables others to discover, understand, and use your data. Metadata should be generated in a metadata format commonly used by the discipline or community that is most relevant to your research. There are discipline-specific metadata editing tools (e.g. for the environmental sciences, see the list at https://www.dataone.org/software-tools/tags/metadata_editor) to help researchers generate comprehensive descriptions of their data. Researchers in scientific communities that don’t have established metadata standards can find generic best practices for metadata at the NCSU Libraries or the Wikipedia “Metadata Standards” page.
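Where no community metadata standard applies, even a generated readme.txt covering the elements above (parameter definitions, formats, units, missing-value codes) is far better than nothing. The sketch below is one minimal way to render such a file; the parameter entries are hypothetical examples, not a standard schema.

```python
# Hypothetical parameter definitions for illustration; a real project should
# draw these from its own data dictionary, and use a community metadata
# standard where one exists.
data_dictionary = [
    {"name": "date", "format": "YYYY-MM-DD (ISO 8601)",
     "units": "n/a", "missing": "n/a"},
    {"name": "temp_c", "format": "decimal number",
     "units": "degrees Celsius", "missing": "-9999"},
]

def render_readme(title, parameters):
    """Render a plain-text data description suitable for a readme.txt file."""
    lines = [title, "=" * len(title), "", "Parameters:"]
    for p in parameters:
        lines.append(
            f"  {p['name']}: format={p['format']}; "
            f"units={p['units']}; missing value code={p['missing']}"
        )
    return "\n".join(lines) + "\n"

readme = render_readme("Daily air temperature, Site A (hypothetical)", data_dictionary)
```

Writing this file from the same definitions used by your quality-control code keeps the documentation and the data checks from drifting apart.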
STEP 5. SELECT A REPOSITORY FOR YOUR DATA
Use Databib or the OpenDOAR Directory of Open Access Repositories to identify a data repository that is most appropriate for the data you will generate and for the community that will make use of the data. Talk with colleagues, librarians, and research sponsors about the best repository for your discipline and your type of data. Check with the repository about requirements for submission, including required data documentation, metadata standards, and any possible restrictions on use (e.g. intellectual property rights).
Many universities also have institutional repositories (IRs) for publishing locally produced research. Check with your institution’s librarians, or consult the OpenDOAR Directory of Open Access Repositories.
STEP 6. STORE AND PRESERVE YOUR DATA
Tools: DataONE Best Practices
As files are created, implement a data preservation plan that ensures that data can be recovered in the event of file loss (e.g. storing data routinely in different locations).
There are a number of options for preserving your data. You can work with a data center or archiving service that is familiar with your area of research. They can provide guidance as to how to prepare formal metadata, how to preserve the data, what file formats to use, and how to provide additional services to future users of your data. Data centers can provide tools that support data discovery, access, and dissemination of data in response to users’ needs.
There are also community-based and commercial digital preservation networks and services. Among them are:
- Amazon Simple Storage Service (Amazon S3)
- Amazon Glacier
- The MetaArchive Cooperative
- UC3Merritt (University of California Curation Center)
OTHER DMP RESOURCES