- Kaggle Dataset
Link:
https://www.kaggle.com/datasets
Kaggle is one of the most popular and widely used sources of datasets in data science. Many datasets are associated with a competition, and participants can discuss the data, find public code, or start their own projects within the community. Kaggle hosts a large number of real-world datasets of different types, sizes, and formats. Participants can also browse the kernels (notebooks) associated with each dataset, in which data scientists share their analyses and algorithm implementations for problems specific to that dataset.
- Amazon Dataset
Link:
https://registry.opendata.aws/
Amazon's open data registry contains datasets from many fields, such as public transportation, ecological resources, and satellite imagery. The site has a search box to help users quickly find a dataset, and each dataset comes with a description and usage examples.
Because these datasets are stored in the cloud on Amazon Web Services (AWS), for example in Amazon S3, they benefit from highly scalable services. This is especially convenient for users who already use AWS for machine learning development and experimentation, since transferring data within the cloud is very fast.
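As a rough illustration of how an S3-hosted dataset file can be addressed, the sketch below builds the virtual-hosted-style HTTPS URL for an object in a bucket. The bucket name and key are hypothetical placeholders, not entries from the registry, and the sketch assumes the bucket allows anonymous public reads, which is not true of every dataset.

```python
from typing import Optional
from urllib.parse import quote
import urllib.request


def public_s3_url(bucket: str, key: str, region: Optional[str] = None) -> str:
    """Build a virtual-hosted-style HTTPS URL for an object in an S3 bucket."""
    if region is None:
        host = f"{bucket}.s3.amazonaws.com"
    else:
        host = f"{bucket}.s3.{region}.amazonaws.com"
    # Percent-encode the key but keep "/" so the path structure is preserved.
    return f"https://{host}/{quote(key, safe='/')}"


def download(url: str, dest: str) -> None:
    """Fetch the object over HTTPS (assumes anonymous public read access)."""
    urllib.request.urlretrieve(url, dest)


if __name__ == "__main__":
    # "example-bucket" and the key are hypothetical; substitute the values
    # listed on the dataset's registry page.
    print(public_s3_url("example-bucket", "data/sample file.csv"))
```

Each registry page lists the actual bucket name and region for its dataset. Inside AWS, the same objects are usually read with the AWS CLI or an SDK instead of plain HTTPS.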
- UCI Machine Learning Dataset
Link:
https://archive.ics.uci.edu/ml/datasets.html
This is a large repository created by researchers at the School of Information and Computer Science at the University of California, Irvine (UCI). It contains more than 100 datasets of different types, organized by machine learning problem. Users can find univariate, multivariate, and time series datasets, as well as datasets for classification, regression, recommender systems, and more. Some datasets in the repository have already been cleaned and can be used directly.
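To show what "used directly" can look like in practice, here is a minimal sketch that parses a headerless comma-separated file into feature vectors and labels. It assumes numeric features with the class label in the last column, a layout that some files in the repository follow but by no means all. The sample rows are made up for illustration and are not taken from any real dataset.

```python
import csv
import io

# Hypothetical sample in a headerless "features..., label" layout.
SAMPLE = """5.1,3.5,1.4,0.2,class-a
4.9,3.0,1.4,0.2,class-a

6.3,3.3,6.0,2.5,class-b
"""


def load_rows(text):
    """Parse headerless CSV text into a list of (features, label) pairs."""
    samples = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines, which csv.reader yields as []
            continue
        features = [float(value) for value in row[:-1]]
        label = row[-1]
        samples.append((features, label))
    return samples


if __name__ == "__main__":
    for features, label in load_rows(SAMPLE):
        print(label, features)
```

For a downloaded file, the same function can be applied to the file's contents read as text. Datasets with headers, missing values, or non-numeric features would need extra handling.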
- Google Dataset Search Engine
Link:
https://toolbox.google.com/datasetsearch
Google launched this service in September 2018. It is a search tool that lets users find datasets by name. Its goal is to bring together tens of thousands of different datasets and make them available to users.
- Microsoft Dataset
Link:
In July 2018, Microsoft, in collaboration with its external research community, announced the release of the Microsoft Research Open Data service. This cloud-based database contains a series of datasets used in published research and is dedicated to promoting research collaboration in the global research community.
- Awesome Public Datasets
Link:
Awesome Public Datasets is a collection organized by topic, covering important datasets in fields such as biology, economics, and education. Most of the listed datasets are free to try, but users may need to authenticate and obtain permission before using a dataset.
- Government Datasets
This category covers datasets published by governments. To demonstrate the transparency of their work, many national agencies have publicly released datasets in various fields, including the following:
EU Open Data: European government datasets
Link:
https://data.europa.eu/euodp/data/dataset
US Gov Data: US government datasets (not related to political issues; the website's datasets have been temporarily unavailable since an adjustment by the Trump administration)
Link:
New Zealand's Government Dataset: New Zealand government datasets
Link:
https://catalogue.data.govt.nz/dataset
Indian Government Dataset: Indian government datasets
Link:
- Computer Vision Datasets
Link:
If you work in image processing, computer vision, or deep learning, Visual Data is an excellent experimental resource. It collects datasets that can be used to build computer vision (CV) models. Users can find datasets for specific CV tasks such as semantic segmentation, image captioning, and image generation, as well as datasets for autonomous driving solutions.