Assignment 2 – Intro to SQL: Data Manipulation Language (DML)

Due date: Feb 12, 2019

Introduction

  • In this exercise you’ll be working in the nys schema within the gep664_2019 database.
  • This exercise is worth 7 points. Each question is worth 1/2 a point.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as documented in the SQL Homework Guidelines. Failure to do so will result in point deductions.

Background

In this exercise you’ll be working in the nys schema within the gep664_2019 database. This schema includes four tables with county-level census data that describe the economy and labor force in NY State.

metadata
a description of every column for every table in this schema
metros
counties that are part of metropolitan areas that fall within NY State
popworkers
number of residents (population) and number of workers who work in the county from the 2012-2016 American Community Survey
resworkers
number of workers who live in the county and their place of work from the 2012-2016 American Community Survey

In pgAdmin 4 you can select each table and view the Properties tab for more information about the data. Read this information, view the metadata table, and examine each of the tables prior to beginning the exercise.

“Residents” are the total population of an area. “Workers” are the number of people that work in an area regardless of where they live. Both of these variables are in the popworkers table. “Resident Workers” are the population of the area who work, regardless of where they work. The resworkers table provides information on where resident workers work: inside or outside their state of residence, and inside or outside their county of residence. The data in these tables can be used to calculate the daytime population for an area, i.e. where most people are during the day based on their residency and work status.

Metropolitan areas are defined by the US Office of Management and Budget and are used by many federal agencies, including the Census Bureau. Metro areas consist of one or more adjacent counties that meet certain thresholds for total population and density, and that are linked by commuting patterns. Metro areas can cross state boundaries, and are subdivided between large areas (called Metropolitan Areas) and smaller areas (called Micropolitan areas). Core Based Statistical Area (CBSA) is an umbrella term that includes both areas, while the term “Metro Area” is also used colloquially to refer to both.

Questions

For the following, provide the SQL statement and the record count that is automatically returned by pgAdmin (flip from the Data Output tab to the Messages tab to see the count) or psql. There may be more than one way to answer certain questions (just choose one).

Part I – Popworkers Table

1. Select all columns for all counties that have less than 50000 residents.

2. Select the id, name, and residents columns for the two Long Island counties (Nassau and Suffolk).

3. Select the id, name, residents, and workers columns, calculate the difference between residents and workers in a calculated field called difference, and sort the data by difference in descending order (largest to smallest).

4. Summarize the entire table: count the total number of counties, sum the total number of residents, and sum the total number of workers for all counties.

Part II – Resworkers Table

5. For all counties, return ids, names, resident workers, and workers who work in their county of residence, and create a calculated field that shows the percentage of resident workers who work in their county of residence. Sort the result by the calculated field in ascending order (smallest to largest), and limit the result to display the bottom ten counties (smallest values). Since these columns are integers and you need a decimal result, use the cast operator:

(CAST(column1 as numeric) / CAST(column2 as numeric))*100

6. Do a regular inner join on the resworkers and popworkers table using their id. Create a field where you calculate the daytime population (residents + workers – resident workers). Your result should just include columns that are relevant to the calculation. Create a second calculated field that’s the difference between this daytime population and the resident population, and sort by this value.

Part III – Metro Table

7. Return a list of all the distinct cbsas from the metro table and sort them in alphabetical order.

8. Select all columns and records in the metro table that are not within New York State (these are counties that are outside of NY State but are inside a metro area that’s within NY State).

9. Write a query that counts the number of counties in each cbsa.

10. Do a regular inner join between the popworkers and the metro tables to show which counties are part of which cbsas. Include ids, names, and total residents in the result. Sort by metro area name and county name.

11. Do a left outer join with popworkers on the left to illustrate which NY counties are not part of a cbsa. Include ids, names, and total residents in the result.

12. Do a regular inner join between the popworkers and metro table and summarize the number of residents and workers by cbsa. Sort the data by population in descending order and only include columns that are relevant.

Part IV – Written Responses

Please answer the following questions in two to four sentences.

13. What are the benefits for using aliases (with the AS statement) for column names and table names?

14. What is the difference between a regular (inner) join and a left-outer join? For what purpose would you use a left-outer join?