Finding usable data for technical research

Data work is a major part of computer science research. This guide helps you locate and evaluate data quickly.

Common data sources

  • Government and public agencies
  • Research group releases and project websites
  • Open repositories and data portals
  • APIs and platform data exports

What to check before using a dataset

  • License: does it permit academic use and redistribution?
  • Format: CSV, JSON, SQL dump, or raw text?
  • Documentation: is there a data dictionary?
  • Coverage: does it match your time range and scope?
  • Bias: who created it and what might be missing?
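
Some of these checks (format, coverage, missing values) can be partly automated on a downloaded sample. The following is a minimal sketch, assuming the sample is a CSV file with a header row and an ISO-formatted (YYYY-MM-DD) date column. The function name `summarize_csv`, the column names, and the sample data are all hypothetical.

```python
import csv
import io
from datetime import date


def summarize_csv(text, date_column):
    """Return column names, row count, date range, and empty-cell counts."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    dates = []
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            # Treat absent or whitespace-only cells as missing.
            if not (row.get(name) or "").strip():
                missing[name] += 1
        raw = (row.get(date_column) or "").strip()
        if raw:
            # Assumption: dates are written as YYYY-MM-DD.
            dates.append(date.fromisoformat(raw))
    return {
        "columns": columns,
        "rows": rows,
        "first_date": min(dates) if dates else None,
        "last_date": max(dates) if dates else None,
        "missing": missing,
    }


# Hypothetical sample standing in for a downloaded file.
SAMPLE = """day,station,temp_c
2021-01-01,A,3.5
2021-01-02,A,
2021-03-15,B,7.1
"""

summary = summarize_csv(SAMPLE, "day")
print(summary)
```

Comparing `first_date` and `last_date` with the period your question covers gives a quick coverage check. License, documentation, and bias still need a human read of the source's own pages.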

Practical steps

  1. Write 5 keywords about the data you need.
  2. Search for “dataset” plus each keyword.
  3. Check the data size and file format.
  4. Download a sample and inspect 20 rows.
  5. Record the source and citation details.
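
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed workflow: the keywords, the in-memory sample, the URL, and the fields of `source_record` are all hypothetical placeholders, and the search itself is left to whichever search engine or portal you use.

```python
import csv
import io
import itertools
import json


def build_queries(keywords):
    """Step 2: pair the word "dataset" with each keyword."""
    return [f"dataset {keyword}" for keyword in keywords]


def peek_rows(lines, limit=20):
    """Step 4: read at most `limit` data rows from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    return list(itertools.islice(reader, limit))


# Step 1: five keywords (hypothetical examples).
keywords = ["air quality", "hourly", "sensor", "urban", "PM2.5"]
queries = build_queries(keywords)

# Stand-in for a downloaded sample; a real file opened with open() works the same way.
sample = io.StringIO("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(100)))
rows = peek_rows(sample)

# Step 5: keep source and citation details next to the data (placeholder values).
source_record = {
    "title": "Example air quality readings",
    "url": "https://example.org/data.csv",
    "retrieved": "2024-01-01",
    "license": "unknown",
    "format": "CSV",
}
record_json = json.dumps(source_record, indent=2)

print(queries)
print(len(rows), "rows inspected")
print(record_json)
```

Reading only the first rows with `itertools.islice` avoids loading a large file just to inspect its structure.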

If the data is not available

  • Create a small dataset through scraping, logs, or experiments.
  • Use a related dataset and justify the substitution.
  • Reframe the question to fit accessible data.
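
As one illustration of the first option, a small dataset can be built from application logs. The sketch below assumes a hypothetical log format of `timestamp LEVEL message`; the pattern, field names, and sample lines are invented for the example and would need adapting to real logs.

```python
import csv
import io
import re

# Assumed line format: "<timestamp> <LEVEL> <free-text message>".
LOG_PATTERN = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def logs_to_rows(lines):
    """Parse log lines into dicts and count the lines that do not match."""
    rows, skipped = [], 0
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            rows.append(match.groupdict())
        else:
            skipped += 1
    return rows, skipped


def rows_to_csv(rows):
    """Serialize parsed rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["timestamp", "level", "message"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical log lines; the last one is deliberately malformed.
LOG_LINES = [
    "2024-01-01T10:00:00 INFO job started",
    "2024-01-01T10:00:05 ERROR disk full",
    "not a log line",
]

rows, skipped = logs_to_rows(LOG_LINES)
csv_text = rows_to_csv(rows)
print(csv_text)
print("skipped lines:", skipped)
```

Counting skipped lines instead of silently dropping them makes it easier to report how much of the raw material the dataset actually covers.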

Data hygiene checklist

  • The data is versioned and backed up, and the raw files are kept unmodified.
  • You can reproduce how it was collected.
  • You have a clear plan for cleaning and filtering.
  • You are aware of privacy or ethics constraints.
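
One lightweight way to support the first two items is to store a small manifest next to each data file, recording a content hash and how the file was obtained. The following is a sketch only: the manifest fields, the function names, and the sample values are assumptions, and a dedicated data versioning tool may suit larger projects better.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


def make_manifest(name, data, source_url, collected_on, notes=""):
    """Describe a data file so that later changes can be detected."""
    return {
        "name": name,
        "sha256": fingerprint(data),
        "size_bytes": len(data),
        "source_url": source_url,
        "collected_on": collected_on,
        "notes": notes,
    }


def is_unchanged(manifest, data: bytes) -> bool:
    """Check that the bytes still match the hash recorded in the manifest."""
    return manifest["sha256"] == fingerprint(data)


# Hypothetical file contents and metadata.
raw = b"id,value\n1,10\n2,20\n"
manifest = make_manifest(
    name="readings.csv",
    data=raw,
    source_url="https://example.org/readings.csv",
    collected_on="2024-01-01",
    notes="Downloaded manually; no filtering applied.",
)

print(json.dumps(manifest, indent=2))
print(is_unchanged(manifest, raw))
print(is_unchanged(manifest, raw + b"3,30\n"))
```

Re-running `is_unchanged` before an analysis confirms that the file is still the version described in your notes.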