Polimetrics
About
Datasets are tables of rows and columns, commonly stored as spreadsheets. The intersection of a row and a column creates a cell. Numeric, alpha, and alphanumeric data can reside in these cells.
The image below is a screenshot of a Microsoft Excel spreadsheet, a very common software application. There are four rows marked 1, 2, 3, and 4, and four columns marked A, B, C, and D. These 4 rows and 4 columns create 16 cells. Cells A1, B1, and C1 are populated with the following data: “123” (numeric), “abc” (alpha), and “123abc” (alphanumeric), respectively. Note that the remaining 13 cells are empty.
Figure 3‑1: Screenshot of Excel spreadsheet with 4 rows and 4 columns
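For readers who prefer code to screenshots, here is a minimal sketch of the same 4-by-4 grid in Python. Python is not part of this activity, and the names `cells` and `kind` are made up for illustration. The sketch only shows how 4 rows and 4 columns produce 16 cells, 3 of which are populated.

```python
# Rebuild the grid from the screenshot: columns A-D, rows 1-4.
columns = ["A", "B", "C", "D"]
rows = [1, 2, 3, 4]

# Every cell starts empty (None), keyed by its address, such as "A1".
cells = {f"{col}{row}": None for row in rows for col in columns}

# Populate the three cells described in the text.
cells["A1"] = "123"
cells["B1"] = "abc"
cells["C1"] = "123abc"


def kind(value):
    """Label a cell value as numeric, alpha, or alphanumeric (simplified)."""
    if value.isdigit():
        return "numeric"
    if value.isalpha():
        return "alpha"
    if value.isalnum():
        return "alphanumeric"
    return "other"


total_cells = len(cells)
empty_cells = sum(1 for value in cells.values() if value is None)
kinds = {addr: kind(value) for addr, value in cells.items() if value is not None}

print(total_cells, empty_cells)  # 16 13
print(kinds)
```

The `kind` helper is deliberately simple: it treats any string of digits as numeric and does not handle decimals or negative numbers.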
Datasets are essential for data analysis because they contain the observations and variables that you are interested in analyzing.
Estimated Time
An estimated 90-120 minutes is needed to complete this activity.
Cross-Section dataset
A cross-section, or cross-sectional, dataset looks at many objects in a single time period.
Observations can be persons, cities, states, countries, legislation, committees, schools, and so on. Variables are concepts that have at least two values. For example, the variable age can have values from 0 to 100+. Or the variable race can have the values African American, White, Hispanic, Asian American, and so on.
To illustrate a cross-section dataset, I first updated the Microsoft Excel spreadsheet. Cells A1 through E1 contain the variable names. It is common to use the first row of cells to state the variable name of each column. In rows 2 through 5, I have four notable people listed: Cardi B, Joe Biden, Dolores Huerta, and Andrew Yang. For each person, I have information about their gender, age, and race, and the year the data was collected.
The data is cross-sectional because we are looking at many objects (notable persons) in a single time period (year 2020).
Figure 3‑2: Example of a cross-sectional dataset
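The "many objects, one time period" rule can also be checked mechanically. The sketch below is only an illustration in Python, which this activity does not require. It keeps just the name and year columns, because the gender, age, and race values live in the figure and are not repeated here. The variable names are made up for the example.

```python
# One row per observation. Only name and year are included; the other
# columns shown in the figure (gender, age, race) are left out on purpose.
cross_section = [
    {"name": "Cardi B", "year": 2020},
    {"name": "Joe Biden", "year": 2020},
    {"name": "Dolores Huerta", "year": 2020},
    {"name": "Andrew Yang", "year": 2020},
]

# Collect the distinct objects and the distinct time periods.
objects = {row["name"] for row in cross_section}
periods = {row["year"] for row in cross_section}

# Cross-section: more than one object, exactly one time period.
is_cross_section = len(objects) > 1 and len(periods) == 1

print(len(objects), sorted(periods), is_cross_section)  # 4 [2020] True
```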
Time Series dataset
A time series dataset looks at a single object over multiple time periods.
To illustrate a time series dataset, I decided to focus on Cardi B, one of my notable persons from the cross-section dataset. In cells A1 through F1, we see six variables: name, gender, age, race, year, and singlerecords. The variable singlerecords refers to the number of singles with Cardi B as lead artist (Cardi B discography – Wikipedia).
The data is time series because we are looking at one object (Cardi B) over multiple time periods (years 2017 to 2020). And in this case, our variables age, year, and singlerecords change for each row of data.
Figure 3‑3: Example of a time series dataset
Panel dataset
A panel dataset looks at multiple objects over multiple time periods.
To demonstrate a panel dataset, I updated the time series dataset to include a second musical artist: Harry Styles. Again, in cells A1 through F1, we see six variables: name, gender, age, race, year, and singlerecords.
The data is panel because we are looking at multiple objects (Cardi B and Harry Styles) over multiple time periods (years 2017 to 2020). And again, our variables age, year, and singlerecords change for each row of data for each artist. For example, in 2017, both Cardi B and Harry Styles (Harry Styles discography – Wikipedia) had 3 singles. But in 2019, Cardi B had 3 singles compared to Harry Styles's 2.
Figure 3‑4: Example of a panel dataset
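The three definitions differ only in how many objects and how many time periods a dataset contains, so they can be summarized in one small function. This is a rough sketch in Python, not part of the activity; the function name `classify` and the `name` and `year` keys are assumptions chosen to match the column names above. Only names and years are included, since the remaining values are in the figures.

```python
def classify(rows):
    """Name the dataset type from its count of objects and time periods."""
    objects = {row["name"] for row in rows}
    periods = {row["year"] for row in rows}
    if len(objects) > 1 and len(periods) > 1:
        return "panel"
    if len(objects) > 1:
        return "cross-section"
    if len(periods) > 1:
        return "time series"
    return "single observation"


years = range(2017, 2021)  # 2017, 2018, 2019, 2020

# One artist over four years.
time_series = [{"name": "Cardi B", "year": year} for year in years]

# Two artists over the same four years.
panel = time_series + [{"name": "Harry Styles", "year": year} for year in years]

# Keeping a single year of the panel leaves two artists in one period.
one_year = [row for row in panel if row["year"] == 2017]

print(classify(time_series))  # time series
print(classify(panel))        # panel
print(classify(one_year))     # cross-section
```

The last line shows how the types relate: a single year taken from a panel is a cross-section, and a single object taken from a panel is a time series.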
Mini-Assignment #1: Instructions
Step 1: Select 1 dataset type that interests you.
Your dataset choices are:
- Cross-section
- Time series
- Panel
Step 2: In 4 or more sentences, explain why you selected this dataset type.
- To help write your explanation, consider the following questions:
- What is one strength of the dataset you selected?
- What is one weakness of the dataset you selected?
- How does your dataset compare to one of the other datasets?