Sources of Data

Another post for data analytics students.

There are several ways to collect data: using public sources, collecting your own data, and automating collection through data pipelines.

Utilizing public sources of data:

• Public databases offer free datasets that are readily available for download. These datasets may already be cleaned and organized.

• Open sources like Kaggle and data.world provide datasets posted by individuals, companies, universities, and organizations. Kaggle also offers courses and competitions.

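• Application programming interfaces (APIs) and web services let a program request data from a provider's systems over the internet, usually receiving it in a structured format such as JSON. Major tech companies like Facebook, Amazon, Apple, Netflix, and Google have offered APIs, though availability and access terms vary by company and change over time.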
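
To make the API idea concrete, here is a minimal sketch of the two halves of an API call: building the request URL and parsing the JSON that comes back. The endpoint `https://api.example.com/v1/datasets`, its `topic` and `limit` parameters, and the `results` field are all made up for illustration, and the response is a hard-coded stand-in rather than a real network call.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only (not a real service).
BASE_URL = "https://api.example.com/v1/datasets"


def build_request_url(base_url, params):
    """Attach query parameters to an endpoint URL."""
    return f"{base_url}?{urlencode(params)}"


def parse_records(response_text):
    """Turn a JSON response body into a list of Python dicts."""
    payload = json.loads(response_text)
    # Assumes the provider wraps its rows in a "results" field.
    return payload["results"]


url = build_request_url(BASE_URL, {"topic": "housing", "limit": 2})

# Stand-in for the body a real request would return, for example via
# urllib.request.urlopen(url).read().decode("utf-8").
sample_response = (
    '{"results": [{"id": 1, "name": "prices"}, {"id": 2, "name": "rents"}]}'
)

records = parse_records(sample_response)
for record in records:
    print(record["id"], record["name"])
```

A real API will usually also require an API key and will document its own parameter names and response layout, so check the provider's documentation before adapting this.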

Collecting your own data:

• Web scraping involves collecting data directly from web pages instead of from databases. Typical targets include price information, stock information, and social media posts.

• Surveying involves collecting information from a sample of individuals through a set of questions. Surveys can be administered electronically, via paper forms, or through direct questioning, and are often used to collect information on demographics and customer satisfaction. Types of survey answers include text-based, single-choice, multiple-choice, drop-down, and Likert scale.

• Observing can be done through physical observation or automated observation. Automated observations involve software generating metrics, such as counting page views or link clicks on a website.
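
As a rough illustration of web scraping, the sketch below pulls product names and prices out of a small piece of HTML using only Python's standard library. The HTML snippet and its `product`, `name`, and `price` class names are invented for this example; a real page will have its own structure, and you should check a site's terms of use before scraping it.

```python
from html.parser import HTMLParser

# Made-up page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">31.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collect one dict per product from the assumed page layout."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # "name" or "price" while inside a span
        self.rows = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})  # start a new product row
        elif tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a name or price span.
        if self.current_field and self.rows:
            self.rows[-1][self.current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


parser = PriceParser()
parser.feed(SAMPLE_HTML)

# Convert the scraped price text into numbers for analysis.
products = [{"name": r["name"], "price": float(r["price"])} for r in parser.rows]
print(products)
```

Many people use third-party libraries for this job; the standard-library parser is used here only to keep the example self-contained.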

Automated data collection:

• Data pipelines automate the process of pulling data from a source, preparing it for use, and moving it to a new location. These pipelines involve three steps: extraction, transformation, and loading. Extraction involves pulling data from the original source. Transformation prepares the data for use by cleaning and organizing it. Loading moves the transformed data to a new location.

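• ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two types of data pipelines. ETL transforms the data before loading it into the destination, while ELT loads the raw data first and transforms it inside the destination system. ELT can be faster than ETL when working with complicated transformations, because loading does not have to wait for them to finish, but it can be more expensive, since the destination has to store the raw data and do the processing.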

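• Delta load is a loading method that moves only the data that is new or has changed since the last run, instead of reloading the entire dataset (a full load). Because less data is processed on each run, delta loading usually makes ETL and ELT pipelines more efficient.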
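
The pipeline steps above can be sketched in a few lines of Python. This is a toy ETL example, not a production design: the source rows, the `updated_at` column, and the dictionary standing in for a data warehouse are all assumptions made for illustration. The first run is a full load; the second run is a delta load, meaning it pulls only rows changed after the last run's watermark.

```python
# Hypothetical source rows; "updated_at" is an ISO date string, so plain
# string comparison orders the dates correctly.
SOURCE = [
    {"id": 1, "city": " boston ", "sales": "100", "updated_at": "2024-01-01"},
    {"id": 2, "city": "Denver", "sales": "250", "updated_at": "2024-01-02"},
    {"id": 3, "city": "austin", "sales": "75", "updated_at": "2024-01-03"},
]


def extract(source, since=None):
    """Pull all rows, or only rows changed after the watermark (delta)."""
    if since is None:
        return list(source)
    return [row for row in source if row["updated_at"] > since]


def transform(rows):
    """Clean the data: tidy city names and convert sales to integers."""
    return [
        {
            "id": r["id"],
            "city": r["city"].strip().title(),
            "sales": int(r["sales"]),
            "updated_at": r["updated_at"],
        }
        for r in rows
    ]


def load(rows, target):
    """Insert or overwrite rows by id; return how many rows were written."""
    for r in rows:
        target[r["id"]] = r
    return len(rows)


def run_pipeline(source, target, watermark=None):
    """Run extract, transform, load; return rows written and new watermark."""
    rows = transform(extract(source, watermark))
    written = load(rows, target)
    new_watermark = max((r["updated_at"] for r in rows), default=watermark)
    return written, new_watermark


warehouse = {}  # stand-in for the destination system

# First run: no watermark yet, so every row is loaded (full load).
full_count, watermark = run_pipeline(SOURCE, warehouse)

# A new row arrives in the source.
SOURCE.append(
    {"id": 4, "city": "miami", "sales": "300", "updated_at": "2024-01-04"}
)

# Second run: only rows newer than the watermark are processed (delta load).
delta_count, watermark = run_pipeline(SOURCE, warehouse, watermark)

print(full_count, delta_count, watermark)
```

To turn this into an ELT sketch, you would call `load` on the raw rows first and run `transform` afterwards on the data already sitting in the destination.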