Data collection

Learn how to document the data collection process in your data management plan and explore existing sources of data.

 

Data management plan

It is important to think about how much data you will collect, which types of data, which file formats you plan to use, and how you will organise it. You should consider which standards or methodologies you will use and whether the data you collect will be sensitive or confidential.

You should also consider whether there are any secondary data that might impact and influence your project. If you plan to use secondary data, you should describe the source, noting its content, type and coverage.

Types of research data

Research data is any information that has been collected, observed, generated or created to validate original research findings. A good way of deciding what might be classed as data is to ask yourself the following questions:

  • What is the information that I need to use and write about in my publication?
  • What information will I need to back up my conclusions?
  • What information is needed by others to understand and replicate my research?

There are many different types of research data. The data you collect will depend on your chosen methodology as well as your discipline. Some common examples include:

  • Documents, spreadsheets
  • Laboratory notebooks, field notebooks, diaries
  • Questionnaires, transcripts, survey responses
  • Audio recordings, videos, photographs
  • Specimens, samples, instrument measurements
  • Database contents (video, audio, text, images)
  • Models, algorithms, scripts, codebooks
  • Contents of an application (input, output, log files for analysis software, simulation software, schemas)
  • Methodologies, protocols, standard operating procedures, workflows

Methodology

Your data management plan should include a summary of your research methodology, addressing the goals of data collection and how you plan to collect data. Include information on how the consistency and quality of data collection will be controlled.

Example

An interview methodology will be employed. Semi-structured interviews and focus groups will be digitally recorded and subsequently transcribed into anonymised transcripts.

The quality of interview recordings will be ensured by calibrating recording equipment ahead of each interview. Interviewers will be trained to use a standardised approach to engaging with participants. A single transcription protocol will be used for all interviews.

File formats

You should identify the most appropriate file formats for the data that you plan to collect. Your chosen file formats should enable long-term preservation and facilitate reuse where you have chosen to share your data.

Your decision may be based on the expertise of project members, a preference for open formats, the standards accepted by repositories, or existing conventions within your discipline.

Recommended formats for preservation:

  • Tabular (e.g. spreadsheets): CSV, TAB
  • Textual: TXT, XML, RTF, PDF/A, HTML
  • Audio: FLAC, WAV, MP3
  • Video: MP4, OGV, MJ2
  • Image: TIF, DCM, JPEG
  • Geospatial: ESRI Shapefile, DWG, Geo-referenced TIF

The UK Data Service provides more detailed information on recommended file formats.
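
As a rough illustration, the recommended formats listed above can be turned into a simple check on file extensions. This is a minimal sketch rather than an official tool: the RECOMMENDED mapping and the check_format function are hypothetical names, the extension spellings (for example .shp for ESRI Shapefile, or .jpg alongside .jpeg) are assumptions, and an extension alone cannot confirm that a PDF is actually PDF/A.

```python
from pathlib import Path

# Extensions assumed for the recommended preservation formats listed above.
# Note: ".pdf" cannot tell you whether a file is PDF/A; check that separately.
RECOMMENDED = {
    "tabular": {".csv", ".tab"},
    "textual": {".txt", ".xml", ".rtf", ".pdf", ".html"},
    "audio": {".flac", ".wav", ".mp3"},
    "video": {".mp4", ".ogv", ".mj2"},
    "image": {".tif", ".tiff", ".dcm", ".jpeg", ".jpg"},
    "geospatial": {".shp", ".dwg", ".tif", ".tiff"},
}


def check_format(file_name: str) -> list[str]:
    """Return the data types whose recommended formats match the extension."""
    suffix = Path(file_name).suffix.lower()
    return sorted(kind for kind, exts in RECOMMENDED.items() if suffix in exts)


if __name__ == "__main__":
    for name in ["FG1_CONS_2010-02-12.docx", "interview_01.wav", "survey.csv"]:
        matches = check_format(name)
        print(name, "->", matches or "not in a recommended format")
```

An empty result suggests the file may need converting before deposit, for example exporting a DOCX transcript to TXT or PDF/A.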

Data volume

It is important to estimate the overall volume of data you will collect as part of your data management plan. Break the estimate down by data type (e.g. the number and size of audio files, the number and size of transcripts). You can use tools such as the image file size calculator and the Omni Calculator to help calculate the size of the files you will generate.

Example

The researchers will be collecting audio and textual data. Below is an estimated breakdown of the maximum volume of each type of data collected by this research study.

Audio

Interview recordings (15 one-hour WAV files, 635 MB each; total volume: 9.52 GB)

Focus group recordings (3 two-hour WAV files, 1.27 GB each; total volume: 3.81 GB)

Estimated volume of all audio files: 13.33 GB

Text

Interview questions file (1 PDF/A file, 7 MB in size)

Interview transcript files (15 TXT files, 113 KB each; total volume: 1.7 MB)

Interviewer protocol sheet (1 PDF/A file, 277 KB in size)

Estimated volume of all text files: 8.97 MB

Total

Estimated volume of all research data: 13.34 GB
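
The arithmetic behind an estimate like this can be sketched in a few lines of code. The example below is illustrative only: it assumes decimal units (1 MB = 1,000 KB and 1 GB = 1,000 MB), and the names AUDIO, TEXT, total_kb and kb_to_gb are hypothetical. The file counts and sizes are taken from the example above.

```python
# Each entry: (description, number of files, size per file in KB).
# Decimal units are assumed: 1 MB = 1,000 KB and 1 GB = 1,000,000 KB.
AUDIO = [
    ("Interview recordings (WAV)", 15, 635_000),
    ("Focus group recordings (WAV)", 3, 1_270_000),
]
TEXT = [
    ("Interview questions (PDF/A)", 1, 7_000),
    ("Interview transcripts (TXT)", 15, 113),
    ("Interviewer protocol sheet (PDF/A)", 1, 277),
]


def total_kb(items):
    """Sum (number of files x size per file) over a list of entries."""
    return sum(count * size_kb for _, count, size_kb in items)


def kb_to_gb(kb):
    """Convert kilobytes to gigabytes using decimal units."""
    return kb / 1_000_000


if __name__ == "__main__":
    audio_kb = total_kb(AUDIO)
    text_kb = total_kb(TEXT)
    print(f"Audio: {kb_to_gb(audio_kb):.2f} GB")
    print(f"Text:  {text_kb / 1_000:.2f} MB")
    print(f"Total: {kb_to_gb(audio_kb + text_kb):.2f} GB")
```

Because the text files come to under 0.01 GB, the overall estimate is driven almost entirely by the audio recordings. Keeping the inputs in a list like this makes it easy to update the estimate if the number of interviews changes.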

File naming conventions

Effective file naming and organisation can improve the efficiency of searching, enable logical sorting, and allow you to quickly distinguish between files. When developing a naming convention, bear these points in mind:

Consistency. Establish rules for your naming convention early in the project which are consistent and logical. You can use different conventions for different file sets.

Organisation. Order the elements within file names logically to enable sorting. You should put the most important information first. YYYYMMDD is the preferred format for dates (YYYY-MM-DD also works) because dates written year-first sort chronologically.

Context. File names should be short but sufficiently descriptive. You should identify what metadata are needed to easily locate files, for example experiment conditions, type of data, researcher initials, or the date or date range of the experiment. Be careful not to identify any individuals in your file names.

Example

The UK Data Service provides the following example of good file naming:

Current file name: interview 01.docx

This file name provides no context beyond indicating that the file relates to an interview and may be the first in a collection. The space between elements also makes the file more difficult for a computer to find and process.

Recommended file name: FG1_CONS_2010-02-12.docx

This is the transcript of the first focus group (FG1) with consumers (CONS), which took place on 12 February 2010. Elements are clearly separated by underscores, making the name easier for someone to understand and for a computer to find.
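
If you generate many files, it can help to build names programmatically so the convention is applied consistently. The sketch below is one possible approach, not a prescribed method: build_file_name is a hypothetical helper, and it assumes that elements are joined with underscores and that the date is written as YYYY-MM-DD, as in the recommended file name above.

```python
from datetime import date


def build_file_name(*elements: str, on: date, extension: str) -> str:
    """Join the name elements and an ISO date with underscores.

    Elements containing whitespace or underscores are rejected so that
    underscores remain unambiguous separators.
    """
    for element in elements:
        if not element or any(ch.isspace() or ch == "_" for ch in element):
            raise ValueError(f"Invalid name element: {element!r}")
    parts = [*elements, on.isoformat()]  # isoformat() gives YYYY-MM-DD
    return "_".join(parts) + "." + extension.lstrip(".")


if __name__ == "__main__":
    name = build_file_name("FG1", "CONS", on=date(2010, 2, 12), extension="docx")
    print(name)
```

To use the compact YYYYMMDD form instead, on.strftime("%Y%m%d") could replace on.isoformat().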

Data reuse

Data reuse, also known as secondary analysis, occurs when a researcher conducts their own analysis of data collected by others. Reusing existing data can make your research more efficient by reducing the time you would otherwise spend generating, processing and preserving your own data.

Data sources

To find relevant data repositories, you can use the re3data database, which provides information on over 2,000 repositories from different academic disciplines. To search for individual datasets, you can use DataCite Commons, a web search interface covering records that have been assigned a DOI.

UK-based researchers working in the life and health sciences, for example, might be able to make use of the UK Biobank database, which contains anonymised health information from 500,000 volunteer participants.

When searching for data, you should always assess the quality of the data source as you would for any other source of research information.

Considerations

  • What licence has been applied to the dataset? Are you permitted to adapt and reuse the data?
  • Have the original participants given consent for the data to be reused?
  • Are metadata and documentation available? Are they sufficiently detailed to support understanding and reuse of the data?
  • What data formats have been used? Is the data available in a common, open file format?

Data citation

Data citation is part of good research practice. You should cite a dataset as you would any other academic source. We also recommend that you include a citation to your own data within the text and include a full reference in your reference list.

A simple data citation format is:

  • Format: Creator (Publication Year): Title. Version. Publisher. Resource Type. Identifier
  • Example: Green, Nathan (2024): Sound and Sound Quality Metric Data. University of Salford. Dataset. https://doi.org/10.17866/rd.salford.24998678.v1
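
If you keep dataset details in a spreadsheet or script, a citation in this format can be assembled automatically. The following is a minimal sketch under stated assumptions: the DatasetCitation class and its field names are hypothetical, it follows the "Creator (Publication Year): Title. ..." pattern given above, and it simply leaves out the version when none is recorded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetCitation:
    creator: str
    year: int
    title: str
    publisher: str
    resource_type: str
    identifier: str
    version: Optional[str] = None  # left out of the citation when not known

    def format(self) -> str:
        """Return 'Creator (Year): Title. Version. Publisher. Type. Identifier'."""
        parts = [self.title, self.version, self.publisher, self.resource_type]
        body = ". ".join(part for part in parts if part)
        return f"{self.creator} ({self.year}): {body}. {self.identifier}"


if __name__ == "__main__":
    citation = DatasetCitation(
        creator="Green, Nathan",
        year=2024,
        title="Sound and Sound Quality Metric Data",
        publisher="University of Salford",
        resource_type="Dataset",
        identifier="https://doi.org/10.17866/rd.salford.24998678.v1",
    )
    print(citation.format())
```

Always check the output against the citation style required by your publisher or funder, as punctuation and element order vary between styles.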

Further information on how to cite data is available from the Digital Curation Centre.