Finding usable data for technical research#
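Finding, vetting, and preparing data takes up a large share of computer science research. This guide covers where to look for datasets, what to check before relying on one, and what to do when nothing suitable exists.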
Common data sources#
- Government and public agencies
- Research group releases and project websites
- Open repositories and data portals
- APIs and platform data exports
What to check before using a dataset#
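- License: does it permit academic use, and under what conditions?
- Format: is it CSV, JSON, a SQL dump, or raw text, and can your tools read it?
- Documentation: is there a data dictionary that explains each field?
- Coverage: does it match the time range and scope of your question?
- Bias: who created it, for what purpose, and what might be missing?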
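The coverage and missing-data checks can be partly automated. The following is a minimal sketch, assuming a CSV file with an ISO-formatted (`YYYY-MM-DD`) date column. The function name `summarize_coverage`, the column names, and the inline sample values are illustrative only and do not come from any real dataset.

```python
import csv
import io
from datetime import date


def summarize_coverage(rows, date_field):
    """Return the row count, date range, and empty-value count per column."""
    first, last = None, None
    missing = {}
    total = 0
    for row in rows:
        total += 1
        for name, value in row.items():
            # DictReader yields None for fields absent from a short row.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[name] = missing.get(name, 0) + 1
        raw = (row.get(date_field) or "").strip()
        if raw:
            day = date.fromisoformat(raw)  # assumes YYYY-MM-DD dates
            first = day if first is None or day < first else first
            last = day if last is None or day > last else last
    return {"rows": total, "first": first, "last": last, "missing": missing}


# Tiny inline sample standing in for a downloaded file (made-up values).
SAMPLE = """date,station,temp_c
2021-01-03,A,4.5
2021-02-10,B,
2020-12-30,A,3.1
"""

summary = summarize_coverage(csv.DictReader(io.StringIO(SAMPLE)), "date")
print(summary)
```

Comparing `first` and `last` against the period you plan to study shows quickly whether the dataset covers it. The `missing` counts point to columns that may need cleaning or may be unusable.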
Practical steps#
- Write down five keywords that describe the data you need.
- Search for “dataset” plus each keyword.
- Check the size and file format of each candidate dataset.
- Download a sample and inspect about 20 rows for obvious problems.
- Record the source, access date, and citation details.
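The steps above can be sketched in a few lines of Python. This is one possible approach and assumes the sample is a CSV file. The helper names (`search_queries`, `peek_rows`, `provenance_record`), the generated demo rows, and the title, URL, license, and date in the example record are all placeholders.

```python
import csv
import io
import json
from itertools import islice


def search_queries(keywords):
    """Pair the word "dataset" with each keyword, one query per keyword."""
    return [f"dataset {keyword}" for keyword in keywords]


def peek_rows(stream, limit=20):
    """Return at most `limit` data rows from a CSV stream as dictionaries."""
    return list(islice(csv.DictReader(stream), limit))


def provenance_record(title, url, license_name, retrieved_on, keywords):
    """Bundle source and citation details into a JSON string."""
    record = {
        "title": title,
        "url": url,
        "license": license_name,
        "retrieved_on": retrieved_on,
        "keywords": list(keywords),
    }
    return json.dumps(record, indent=2, sort_keys=True)


# Stand-in for a downloaded sample: 50 generated rows, not real data.
demo = io.StringIO("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(50)))
rows = peek_rows(demo)
for row in rows[:3]:
    print(row)

keywords = ["traffic", "sensor", "hourly", "city", "2019"]
print(search_queries(keywords))

note = provenance_record(
    title="Example sensor readings",     # placeholder title
    url="https://example.org/data.csv",  # placeholder URL
    license_name="CC BY 4.0",            # placeholder license
    retrieved_on="2024-01-15",           # placeholder date
    keywords=keywords,
)
print(note)
```

For a real file, pass `open(path, newline="")` to `peek_rows` instead of the in-memory stream. Saving the JSON note next to the data keeps the citation details from getting lost.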
If the data is not available#
- Create a small dataset through scraping, logs, or experiments.
- Use a related dataset and justify the substitution.
- Reframe the question to fit accessible data.
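As an illustration of building a small dataset from logs, the sketch below turns log lines into CSV rows. The log layout (timestamp, upper-case level, message) is an assumption made for this example, as are the name `parse_log` and the sample lines. Real logs will need their own pattern.

```python
import csv
import io
import re

# Assumed layout: "<timestamp> <LEVEL> <message>"; adjust for real logs.
LINE = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def parse_log(lines):
    """Return (rows, skipped): parsed rows and the count of unparsed lines."""
    rows, skipped = [], 0
    for line in lines:
        match = LINE.match(line.rstrip("\n"))
        if match:
            rows.append(match.groupdict())
        else:
            skipped += 1
    return rows, skipped


# Made-up log excerpt, including one line that does not fit the pattern.
LOG = """2024-03-01T10:00:00Z INFO job started
2024-03-01T10:00:05Z WARN retrying request
not a log line
2024-03-01T10:00:09Z INFO job finished
"""

rows, skipped = parse_log(LOG.splitlines())

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["timestamp", "level", "message"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
print("skipped lines:", skipped)
```

Counting the skipped lines rather than dropping them silently makes it easier to report how much of the raw material ended up in the dataset.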
Data hygiene checklist#
- The data is versioned and stored safely.
- You can reproduce how it was collected.
- You have a clear plan for cleaning and filtering.
- You know which privacy and ethics constraints apply to the data.
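One lightweight way to approach the versioning item is a checksum manifest: hash every file and store the hashes alongside the data, so that any later change is detectable. The sketch below uses SHA-256 from the Python standard library. The function names and the temporary demo file are illustrative, and a checksum manifest is only one option next to tools such as version control.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's relative path to its SHA-256 hex digest."""
    folder = Path(folder)
    return {
        path.relative_to(folder).as_posix(): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(old, new):
    """List files that were added, removed, or modified between manifests."""
    return sorted(
        name for name in old.keys() | new.keys() if old.get(name) != new.get(name)
    )


# Demo on a temporary folder with a made-up file.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "sample.csv"
    data.write_text("id,value\n1,10\n")
    before = build_manifest(tmp)
    data.write_text("id,value\n1,11\n")  # simulate an unnoticed edit
    after = build_manifest(tmp)

print(changed_files(before, after))
```

Writing the manifest to a file and keeping it with your notes gives you a record of exactly which version of the data each result was computed from.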