Category Archives: assignment

css.php

Extra Credit – Review

Due Date: Apr 30, 2019

Introduction

  • In this optional exercise you’ll be working in the nys schema within the gep664_2019 database.
  • This exercise is worth 3 points. Each question is worth 1 point.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as documented in the SQL Homework Guidelines. Failure to do so will result in point deductions.
  • For this assignment you will need to submit a SQL file and a map as an image file. When you submit your electronic assignment to the Box, attach the SQL file and the image as attachments in the same message.

Background

In this exercise you’ll be working in the nys schema within the gep664_2019 database. This schema includes tables with county-level census data that describe the economy and labor force in NY State. You’ll be using these county-level tables:

metros
counties that are part of metropolitan areas that fall within NY State
resworkers
number of workers who live in the county and their place of work from the 2012-2016 American Community Survey

Refer to the metadata table and back to assignment 3 if you want more information about these tables. You will also need to download a shapefile of NY state counties from the Data and Software page, Extra Credit data under Data for Homework.

Questions

For the following, provide the SQL statements and the record count that is automatically returned by pgAdmin.

1. Create a view called nys.metro_work_outside where you join the metro table to the resworkers table. For the CBSAs that are classified by type as Metropolitan Statistical Area, summarize the number of resident workers in one column and the number of workers who work outside their state of residence in another column by metro area. Sort the results by metro name in alphabetical order. The final result should represent one row for each metro (NOT one row for each county).

2. Load the shapefile of the NY State counties into the nys schema. When loading it you must define the system as NAD 83 SRID 4269 and rename the table counties. After the table is loaded, write an ALTER TABLE / COLUMN statement to transform the geometry to NYSP Central Zone, SRID 2261.

3. By joining the resworker and counties tables, create a new spatial table called nys.working_incounty that contains: county ID, county name, total resident workers, resident workers who work in their county of residence, a new column with the percentage of resident workers who work in their county of residence, and geometry of the counties. Remember there are multiple methods that you can use to create a new table from existing data – choose any one. To calculate the percentage you’ll need to use the CAST operator that we’ve seen before:

(CAST(column1 as numeric) / CAST(column2 as numeric))*100

Once you’ve created the table, add it to QGIS and create a map of the percentage of workers who work in their county of residence.

Assignment 10 – Organizing Spatial Data

Due date: Apr 16, 2019

Introduction

  • This assignment is dedicated to exploring what you would like to do for your final project.
  • This exercise is worth 6 points. Each question is worth 2 points each.
  • You do not have to use the SQL template for this assignment, as the questions do not involve writing code. Submit either a text file or a PDF.

Questions

Part I – Final Project Brainstorming

For this part, refer back to the journal articles that I emailed you, and to the description of the final project. You will need to do some background research on what you are thinking about pursuing for your project. Please allocate sufficient time for doing this: this will be your opportunity to receive feedback from me on your initial ideas.

1. Skim through all of the journal articles I sent you, choose one that interests you, and read it thoroughly. For the article that you chose, in your own words write a paragraph or two that explains what the authors’ research question or problem is, and describe how they used a spatial database to answer that question.

2. Write a paragraph or two that describes two or three final project ideas that you are considering pursuing. For each idea, explain what the research question or goal for the different projects would be.

3. For the ideas you have listed above, explore and list some possible data sources that you could use. Be specific: provide names of specific datasets or even variables – don’t just list general websites or repositories. For example: “I could use TIGER Line shapefiles for census tracts and educational attainment data from the latest 5-year American Community Survey from the Census Bureau”. Not: “I would use the census website and some shapefiles”.

Assignment 9 – Proximity Analysis

Due date: Apr 9, 2019

Introduction

  • In this exercise you’ll be working with the gep664 database in the nyc schema and the sample data from Chapter 10 in PostGIS in Action. It’s assumed that you have already loaded the ZCTA shapefile and census tract population centers into the nyc schema (which we did in class 7).
  • This exercise is worth 6 points. Each question is worth 1 point each.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.
  • For the following, provide the SQL statement and the row count.

Questions

This assignment will give you additional practice in working with the various methods for calculating distances, and with geography data types. In answering these questions, you should refer back to the slides from the lecture for class 9 and to Chapter 10 in PostGIS in Action.

Do NOT use ST_Buffer to answer any of these questions: use the distance functions (ST_DWithin, ST_Distance, etc). You do not have to use the KNN methods or operators unless it’s specifically asked for.

Part I – Measuring Distance with Geometry

In this first part, you’ll be working with data in the nyc schema; this includes the ZCTA feature that we loaded from a shapefile and the census tract population centers we loaded from a text file back in class 7. If you don’t have these files, download them from the course website (data files used for Class 7) and load them into the gep664 database. Refer to the in-class notes for class 7 for the steps we took to import both of these layers.

Remember the features in the nyc schema are in the NYSP Long Island spatial reference system, which uses feet. 5280 feet = 1 mile.

1. Calculate the distance from every subway station in the Bronx to every census tract population center (nyc.tract_popctr) in the Bronx. The unique id code for the Bronx is 36005; it appears in a dedicated column in the subway table, but is part of the geoid in the tract population center table. Your result should include the tract and station ids and names. Sort the data by stop_id and distance.

2. Select the ids of all census tract population centers that are within a distance of 1 mile from the two Yankee Stadium subway stations. Make sure to remove duplicates (a tract should only appear once in the results).

3. Sum the population for all census tract population centers that are within 1/2 mile of the Lincoln Center subway station. Your result should just show the total population for this area.

4. Using the KNN bounding box centroid operator , find the nearest 10 subway stations to ZCTA 10010. Return station ids and names.

Part II – Measuring Distance With Geography

In this part, you’ll be working with one of the sample data tables that’s included in the data for Chapter 10 of PostGIS in Action. A link for downloading this data is available on the course website; it’s a ZIP file that contains subfolders for every chapter in the book. You will work specifically with the airports table, which has a geography column.

Remember that units returned by the geography data type are in meters. 1000 meters = 1 kilometer.

5. For all of the airports that are of the type ‘large_airport’, measure the distance from each airport in Japan to each airport in Australia. Show the distance in kilometers, and the ids and names of the airports. Remember, to perform operations within a single table, you add the table to the FROM statement twice, and give each a different alias so you can reference them as separate entities (see the lecture slides for an example).

6. For all airports in Australia that are of the type ‘large airport’, return the closest airport of any type. Return distance in kilometers and names and ids of the airports. This question also requires you to perform operations within a single table. To remove results where each airport is being measured to itself, add an item to the WHERE clause to insure the airport id is not equal to the neighbor airport id: a.id != n.id. Hint – you can consider using the DISTINCT ON approach for this question.

Assignment 8 – Spatial Relationships

Due date: Apr 2, 2019

Introduction

  • In this exercise you’ll be working with the gep664 database in the nyc schema. It’s assumed that you have also loaded the ZCTA shapefile into this schema (which we did in our previous class). You’ll also need to use QGIS.
  • This exercise is worth 8 points. Each question is worth 1 point each.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.
  • For this assignment you will need to submit a SQL file and a map as an image file. When you submit your electronic assignment to the Box, attach the SQL file and the image as attachments in the same message.

Questions

This assignment will give you additional practice with spatial relationships and joins, and you’ll pull together several concepts that you have learned thus far in the course.

You’ll be working with data in the nyc schema; this includes the ZCTA feature that we loaded from a shapefile in our previous class. If you don’t have the ZCTA file download it from the course website (data files used for Class 7) and load it into the gep664 database using the PostGIS shapefile loader or the database manager in QGIS. It’s in NAD83 (4269). After you load it, you’ll need to transform it to NYSP (2263) before you can use it. Refer to the in-class notes for class 7 for the steps we took to do this.

For the following, provide: the SQL statement, the row count, and any specific deliverable that the questions asks for.

Part I -Spatial Relationships

There are 339 census tracts and 25 ZCTAs in the Bronx. ZIP Codes / ZCTAs often do not align well with other census or legal geographies. In this exercise you’ll test and see how well they do or do not align with census tracts. Pay careful attention to the relationship you’re being asked to test. Use the census_tracts and zctas layers.

1. Select all census tracts (the tractid and tractnum) that are within ZCTAs in the Bronx.

2. Select all census tracts (the tractid and tractnum) that overlap ZCTAs in the Bronx. Use DISTINCT to remove tracts that overlap more than one ZCTA.

3. Select all census tracts (the tractid and tractnum) that have their geographic centroid within ZCTAs in the Bronx.

4. Modify your statement in question 3 to select just the tracts in ZCTA 10468. Create an additional field where you calculate the total area of the census tracts (i.e. the area of the entire tract, regardless of whether the area is inside or outside the ZCTA).

Part II – Spatial Join

5. Write a SELECT statement where you select the subway stations (stop id, name, and trains) for the entire city and assign the stations the ZCTA code that they are within. For this statement, use the JOIN clause and make it a left outer join so that you keep all subway records on the left whether they have a matching ZCTA or not. Sort the result in decending order by ZCTA number.

6. Create a new spatial table called nyc.bedford_subways that contains all the subway stations in ZCTAs 10458 and 10468. The table should contain: stop id, name, trains, ZCTA, and subway geometry. Be sure to assign stop_id as the primary key. Create the table using spatial relationships, not by manually selecting stations yourself. Then create a spatial index for the new layer. Provide all SQL statements in your answer.

Part III – Spatial Joins and Summaries

To refresh your memory when answering these questions, refer back to the slides and the course exercises. When writing these longer statements, construct them bit by bit, getting one piece to work before moving on to add the next piece.

In the nyc schema in the course database:

  • census_tracts is a spatial table with census tract geometry; it also contains a column that identifies the Neighborhood Tabulation Area (NTA) that each tract is part of.
  • census2010_tracts contains data from the 2010 Census for each census tract.
  • census2010_lookup contains the list of census variables and the codes used to identify them as column headings in the census2010_tracts table.

7. Identify the codes for the variables that represent the total population, and the total population who are 16 years old and over. Then write a SELECT query where you relate these variables to the census tract geographies and create a summary for NTAs, so each row in the result is for an NTA and the columns represent: total population, total population 16 and over, and the percent of the NTA’s population that is 16 and over for each NTA. To avoid division by zero, select only tracts that have an over 16 population greater than zero. In order to get decimals for your percent total, remember to use the CAST operator around your integers:

(CAST(value1 as numeric))/(CAST(value 2 as numeric))

8. Take the query you wrote in question 8 and save it as spatial view called ntas_workforce (remember to do this, you must add the geometry to the select clause and union it). Provide this query as your answer. Then, using QGIS connect to the course database and add this view to the map. Symbolize the NTAs to display a graduated map showing the percent of the population who are 16 and over. Classify the data using equal intervals with 5 categories. It’s normal if you have some “holes” in your map, as these represent unpopulated areas. You don’t have to create a fancy, finished map (with a layout etc) – just take a screenshot of the window and save the result as an image file (under Project menu – Save as Image). When naming the file, use the same convention as the sql template (i.e. use your name and the assignment number).

Assignment 7 – Spatial Reference Systems

Due date: Mar 26, 2019

Introduction

  • In this exercise you’ll be working with the gep664 database in the copper schema you created from assignment 5. You’ll also need to use QGIS.
  • This exercise is worth 6 points. Each question is worth 1 point each.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.
  • For this assignment you will need to submit a text SQL file and a map as an image file. When you submit your electronic assignment to the Box, attach the SQL file and the image as attachments in the same message.

Questions

This assignment will give you additional practice with adding geometry columns, loading spatial data, and transforming coordinate systems.

You’ll be working with data in the copper schema; if you don’t have the files the entire schema is available as a backup file on the course website. Download the course data for class 7 on the Data and Software page, which contains the copper schema backup and a country shapefile that you’ll need to load.

Provide the SQL statements and any specific deliverable that each question asks for.

Part I – Data Loading and Transformation

1. Add a geometry column to the copper smelters table, and build geometry for the table using the longitude and latitude fields. These coordinates are currently in WGS 84 (SRID 4326). When building the geometry, transform them to Pseudo Mercator (SRID 3857).

2. In the course data files there is a shapefile of countries from Natural Earth. Import this shapefile into the copper schema using the PostGIS shapefile loader, located under the Start menu under the heading PostGIS bundle. Name it countries_temp and assign it the appropriate SRID (it is in WGS 84). Then write a SQL statement to SELECT the iso_a3 and name columns from the table. Provide that statement AND the row count as your answer.

Note – if you receive an error about UTF-8 encoding while trying to import, hit the options button and change the encoding from UTF-8 to LATIN1. If you are using a Mac and don’t have the PostGIS shapefile loader, use the DB Manager in QGIS instead to import the shapefile.

3. Create a new table called countries_bndy that will hold a cleaned-up version of countries_temp. Use the new column names (1st column in the list below) and look at the existing columns (2nd column in the list below) in the countries_temp table to assign appropriate data types. Make uid the primary key, and make sure to include a geometry column in the new table called geom that can hold multipolygons and assign it the SRID for Pseudo Mercator.

uid - adm0_a3
iso - iso_a3
name - name
name_long - name_long
ctype - type
continent - continent
subregion - subregion

4. Insert the appropriate data from countries_temp into the countries_bndy file, and as part of the insert operation transform the geometry from WGS 84 to Pseudo Mercator.

5. Create a view called smelter_sum where you join the smelters table to the countries_bndy file using the iso country codes in each table. In the view, group the data by country, count the number of smelters, and sum the smelter capacity. Make sure to include the geometry column (in SELECT and GROUP BY). In your answer, return the SQL statement AND the number of rows that are in the view.

Part II – Make a Map

6. Using QGIS, connect to the course database and add the smelter_sum view to the map. Symbolize the countries to display a graduated map showing total smelter capacity. Classify the data using equal intervals with 5 categories. You don’t have to create a fancy, finished map (with a layout etc) – just take a screenshot of the window and save the result as an image file (under Project menu – Save as Image). When naming the file, use the same convention as the sql template (i.e. use your name and the assignment number).

Need help using QGIS?

  • Consult the course handout on Readings and Docs for accessing PostgreSQL and PostGIS databases (there’s a section on QGIS)
  • Take a look at the QGIS user documentation
  • Look at my workshop manual
  • Search Youtube and the GIS Stack Exchange
  • Ask your classmates or the lab tutors

Assignment 6 – Spatial Data Basics

Due date: Mar 19, 2019

Introduction

  • In this exercise you’ll be working with the gep664 database and the sample data from Chapter 1 in PostGIS in Action.
  • This exercise is worth 6 points. Each question is worth 1 point each.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.

Questions

The first part of this assignment echoes what we did in class and covers material from PostGIS in Action Chapter 2.

The second part of this assignment is derived from Chapter 1 in PostGIS in Action. It assumes that you have read the chapter and did all of the exercises using the sample data provided with the book.

For the following, provide: the SQL statement (questions 1-6) and the record count (questions 4-6) that is automatically returned by pgAdmin.

Part I – Geometry Basics

For this exercise, you’ll be working in the course database gep664 in the nyc schema.

For the questions in this part, all of the geometry should be defined using SRID 2263, which is the local state plane zone for NYC (NAD 83 NY State Plane Long Island (ft-us)).

1. Create a point feature to represent the city’s primary train stations. Create an empty table called nyc.train_stations with 3 columns: tid integer, name text, and geom geometry(point,2263). Write an INSERT statement where you insert the VALUES for id, name, and point geometry into the table. Use the manual method for constructing geometry using ST_GeomFromText. In your answer, just provide the final statement (the INSERT).

1,Penn Station,986029,212733
2,Grand Central Station,990607,213615

2. Create a line feature to represent the Times Square Shuttle subway. Create an empty table called nyc.subway_shuttle with 3 columns: sid varchar(1), name text, and geom geometry(linestring,2263). Look in the subway stations table and find the stops for Grand Central (stop_id 901) and Times Square (stop_id 902). Return the geometry from these stations as text so you can see the coordinates (they’re in NY State Plane). Using these coordinates, write an INSERT statement where you insert: an id (S), name (Times Square Shuttle) and linestring geometry into the nyc.subway_shuttle table. Use the manual method for constructing geometry using ST_GeomFromText. In your answer, just provide the final statement (the INSERT).

3. Create point features for the NJ PATH stations in Manhattan. Create an empty table called nyc.path_stations with 5 columns: sid integer, name text, xcoord numeric(6), ycoord numeric(6), and geom geometry(point,2263). Write an INSERT statement to insert the data into the table. Then use the ST_SetSRID and ST_Point functions with an UPDATE statement to build geometry (do not use the manual record by record methods you used for the previous questions). In your answer provide the INSERT statement AND the statement used for creating the geometry.

1,33rd Street,987455,211986
2,23rd Street,986185,209902
3,14th Street,985064,207918
4,9th Street,984618,206718
5,Christopher Street,982237,206324
9,World Trade Center,981276,198593

Part II – PostGIS in Action Chapter 1

For this part, you’ll work with the sample data from Chapter 1 of PostGIS in Action. This part assumes that you read and completed the chapter’s exercises.

For the exercises in chapter 1:

A. Download the data for PostGIS In Action from the Data and Software page.

B. In section 1.4.3 you can use either the COPY command or pgAdmin to import files.

C. In section 1.4.3 to load shapefiles use the shapefile loader listed under the PostGIS tools in your start menu. Alternatively, you can load shapefiles using the DB Manager in QGIS.

D. In section 1.4.5 do NOT install OpenJump. You can use QGIS if you wish to visualize your results. See the course handout Accessing PostgreSQL and PostGIS Databases on the Readings and Docs page for details.

Once you’ve finished the chapter, answer the following questions. Your answers must include the SQL statement AND the row count.

4. Count all of the restaurants that are within 1/2 mile of a highway within the states of NY, NJ, and CT. Group by franchise, so the result shows the total count of each restaurant within all three states (i.e. do not group by state).

5. Return the coordinates in plain text of all Pizza Hut restaurants in the restaurants table.

6. Calculate and sum the lengths of all the principal highways in miles within the state of NJ by highway name. Since many of the highways exist as multiple features be sure to group them. Sort the results from longest to shortest distance.

Assignment 5 – Data Processing and SQL Review

Due date: Mar 12, 2019

Introduction

  • In this exercise you’ll be working with the copper data tables we processed in class and the gep664 database.
  • This exercise is worth 8 points. Each question is worth 1 point each.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.

Questions

For the following, provide the SQL statement AND the record count that is automatically returned by pgAdmin (for the tables you import, verify the record count after import).

This assignment assumes that you have already processed and cleaned the copper data tables as we did in class (since we didn’t finish, I’ve posted 3 of the 4 finished tables in copper_working.xlsx on the Software and Data page). The original data files are also available for download. You will be importing data into the gep664 database.

When importing data into the database you may either use the pgAdmin GUI (create blank table to hold data, right click, choose Import) or the COPY command (create blank table to hold data, write SQL COPY statement). The safest approach is to export spreadsheet files out as CSV files (using Save As – and select comma-delimited format) and then import CSVs into the database.

Remember: 1 – be careful when assigning number types to columns, to insure that you don’t truncate data and 2 – the order and number of columns in the database table must match the order and number in the import table, otherwise the import will fail. If you receive an encoding error when trying to import any table, try setting the encoding to WIN1252 in the import screen.

Part I – Table Creation and Import

For this exercise create a schema called copper for the user postgres (CREATE SCHEMA copper AUTHORIZATION postgres;) in the gep664 database. Then create well-structured tables with appropriate data types and keys, and import the copper data that we cleaned in class into the tables. Provide the CREATE TABLE statements and row counts.

1. Create a new database table called smelters and import the copper smelter data.

2. Create a new database table called trade and import the copper ore trade data.

3. Create a new database table called mines_porcu and import the main porphyry copper mining data. As a reminder – in class we deleted several columns; these are the only ones that should be in the final main table: rec_id, depname, country, iso, stprov, latitude, longitude, oreton, cugrd, mogrd, augrd, aggrd, deptype, comments.

Part II – Data cleaning and Import

For this part, you’ll work with the sedimentary copper mine table that we did not use in class, but that is stored with the other sample data in a folder called sedcu. It is in a CSV format called main.csv and is similar to the porphyry table in structure. Import it into a spreadsheet.

4. Using the ISO country code table that we used in class (country_codes.csv), write an Excel VLOOKUP formula to assign the proper 3-letter ISO country code to each country. For codes not found, fill them in manually. Instead of SQL code, submit the VLOOKUP formula as your answer and comment it out in the template. After the formula write one sentence that indicates which columns the formula is referring to for the lookup value, range, and returned value.

5. Delete unnecessary columns and save the table. Create a new database table in the copper schema called mines_sedcu and import the main sedimentary copper mining data. Provide the CREATE TABLE statement and row count. The columns that should appear in the final main table are: rec_id, depname, country, iso, stprov, latitude, longitude, oreton, cugrd, cogrd, aggrd, cuton, deptype, comments. Most of the columns are similar to the porphyry table, but you will need to re-order them so they’re consistent.

Part III – Select and Summarize

For this part, you will create summary views in the copper schema for the data you just imported. Hint: experiment with writing your SELECT statement first to get it working, then once it’s right, create the view.

6. Create a view called exports_5yr that sums the total copper ore exports by country for the past 5 years, and display just the top 20 countries by weight exported.

7. Create a view called import_smelt that sums the capacity of smelters by country and that shows the amount of copper ore imported into that country in 2015, for countries that have smelters and imports. The capacity of the smelters is in thousands of metric tons (TMT), while the weight of imports is in kilograms. 1,000 metric tons = 1,000,000 kilograms. Divide the imported ore by a million to get the amount of ore in TMT, and sort the data by country.

8. Create two separate views, one called pmine_total for porphyry mines and one called smine_total for sedimentary mines, that counts the number of copper mines (a count of records) and sums the total ore in tons available by country. Sort each view by number of mines in descending order.

BONUS Question. Optional, and worth 1 point. Count the mines and sum the ore totals for porphyry and sedimentary sites in one query to create a single view called copper.mines_total. You may not change the underlying structure of the tables or create different tables, but you may import the ISO country code table into the database as is and use it in the query. I have identified three possible solutions to this question: can you figure out one of them?

Assignment 4 – Intro to modeling and normalization

Due date: Mar 5, 2019

Introduction

  • In this exercise you’ll be working with the two readings posted on the library’s e-reserve page that were assigned for this class.
  • This exercise is worth 6 points. Part I is worth 3 points and Part II is worth 3 points.
  • As this assignment does not require SQL statements, the normal SQL submission guidelines do not apply. You can submit the responses in MS Word format.

Part I – ER Diagram

Read and refer to the Entity Relationship Model article by Song and Chen to answer the following questions. These three questions all refer to figure 3, which depicts a model for a company.

chen_erm
Image source: Song & Chen, Encyclopedia of Database Systems, 2009

In figure 3 on page 1006, the relationship between the dependent (spouse or child of an employee) and employee is indicated by the relationship diamond “Dependent of”. Cardinality (direction of the relationship) is a one to many relationship as indicated by the number 1 for the dependent and letter N for employee. Participation is indicated by the lines that connect each entity to the relationship. The double lines from dependent indicate mandatory participation, while the single line from employee indicates optional participation. Thus, the relationship can be described:

A dependent is a dependent of an employee, because they receive health benefits through the employee. Dependents can only have one employee, because they can’t receive benefits from multiple people. Each dependent must have one employee, because they cannot receive benefits from the company if no family member works there. An employee may have many dependents, but does not have to have a dependent because they may have no immediate family members.

1. Based on this example, describe the relationship that exists between the employee and department entity. Model your answer on the quoted text above and address: what the relationship is, the logic behind it, cardinality, and participation. Use your imagination to express the logic behind the relationship, for example the reasons why an employee may or may not be assigned to a department.

2. Similar to question 1, describe the relationship that exists between the employee and project entity. Model your answer on the quoted text above and addresses: what the relationship is, the logic behind it, cardinality, and participation. Make sure to address the logic behind the realtionship.

3. The diagram has a department entity. Imagine you have to extend this model by adding an office entity, which represents distinct physical locations where all departments are located (based on the example below). Describe the relationship, cardinality, and participation between the department and office entity. In addition, provide two attributes that could be associated with the office entity.

  • Department – Office
  • Administration – New York Office
  • Accounting – New York Office
  • Sales – New York Office
  • R & D – New Brunswick Office
  • Human Resources – New York Office
  • Manufacturing – Allentown Office
  • Shipping – Allentown Office

Part II – Normalization

4-6. Referring to the slides and the article by Kreibich, take the following data on dependent territories (areas that are not independent countries but are controlled by other sovereign nations) and put it into 1st, then 2nd, then 3rd normal form. The region and continent indicate where the dependency is located (these are part of a hierarchical scheme the United Nations uses for classifying countries; regions fall within continents). Columns that begin with a ‘T’ refer to the territory while columns beginning with an ‘S’ refer to the sovereign nation that controls the territory.

Show each step of the transformation just as it’s demonstrated in the slides (1st, then 2nd, then 3rd), where the tabular outcome of each normalization phase is shown. Do not conflate any of the forms. You can copy and paste the table into Word, and use Word’s table functions to display each one, and under each result write one to two sentences describing what you did.

Territory Tcapital Tpop Tregion Tcontinent Sid Sovereign Scapital
Falkland Islands (FLK) Stanley 2,931 South America Americas GBR United Kingdom London
Faroe Islands (FRO) Torshavn 49,188 Northern Europe Europe DNK Denmark Copenhagen
Falkland Islands (FLK) Stanley 2,931 South America Americas GBR United Kingdom London
French Polynesia (PYF) Papeete 285,735 Polynesia Oceania FRA France Paris
Gibraltar (GIB) Gibraltar 29,328 Southern Europe Europe GBR United Kingdom London
Greenland (GRL) Nuuk 56,483 Northern America Americas DNK Denmark Copenhagen
Guam (GUM) Hagatna 162,742 Micronesia Oceania USA United States Washington DC
Guernsey (GGY) St Peter Port 63,026 Northern Europe Europe GBR United Kingdom London

Assignment 3 – Intro to SQL: Data Definition Language (DDL)

Due date: Feb 26, 2019

Introduction

  • In this exercise you’ll be working in the nyc and nys schemas within the gep664_2019 database.
  • This exercise is worth 7 points. Each question is work 1/2 a point, except the final question which is worth 1 point.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as documented in the SQL Homework Guidelines. Failure to do so will result in point deductions.

For the following, provide the SQL statement that answers each question; provide the row count only if asked. There may be more than one way to answer certain questions (just choose one).

Part I – Weather Stations table (nyc schema)

In order to select weather stations by state, we currently have to use the LIKE statement to do pattern searching. This is extremely inefficient; let’s add a dedicated state column that will allow for more efficient queries.

1. Alter the weather stations table by adding a column called state that will hold the two-letter state code (i.e. NY, NJ).

2. Update the new state column by adding the two letter state code that indicates where the station is located. You can do this using either one or two statements.

Part II – NYC Borough Data (nyc schema)

You’ll get some practice in table creation by adding a couple of tables with data for the five boroughs of NYC. Each of the boroughs is also considered a county; a borough is a local designation for subdivisions of the City of New York, while counties are a federal designation for subdivisions of states.

3. Create a table called nyc.boroughs_area to hold the following data: codes and names for the boroughs, and their area in water and land in square miles (for column names, use the names in question 4). Designate a primary key, specify that borough names should not be null, and add a check constraint that land area must be greater than zero.

4. Insert the data into the new table with an INSERT INTO VALUES statement.

bcode,bname,area_wtr,area_lnd
bx,Bronx,15.38,42.10
bk,Brooklyn,26.10,70.82
mn,Manhattan,10.76,22.83
qn,Queens,69.68,108.53
si,Staten Island,43.92,58.37

5. Create a view called nyc.boroughs_sum that has the borough name and calculated fields that have the total land area, the percent total that is water, and the percent total that is land. Optionally – you can round the percentages to whole numbers.

6. Create a table called nyc.counties_pop to hold the county code, county name, borough code, and population for 2010 (for column names, use the names in question 7). Make county code the primary key and borough code the foreign key.

7. Insert the data into the new table with an INSERT INTO VALUES statement.

ccode,cname,bcode,pop2010
36005,Bronx County,bx,1385108
36047,Kings County,bk,2504700
36061,New York County,mn,1585873
36081,Queens County,qn,2230722
36085,Richmond County,si,468730

8. Create a new table called nyc.boroughs_pop with data from the borough area and county population tables that contains: the borough code as primary key, borough name, total population, and population density (land area divided by population).

Part III – Import Labor Force Table (nys schema)

We’ll import a table from the Census 2012-2016 American Community Survey that indicates whether resident workers in a county work in the same municipality in which they live. There is one record for each county in New York State, and like the previous ACS tables we’ve worked with each estimate column has a margin of error (me) column associated with it (for this exercise you can ignore the me columns).

Go to the Data & Software tab on the course website and download the homework data for Class 3, which is a CSV file for place of work by MCD of residence. This CSV has a header row and uses commas for delimiters.

9. In the nys schema create a table called nys.resworkers_mcd that will accommodate all of the data in the CSV file. Make sure to designate a primary key.

10. Use pgAdmin to insert the data from the CSV file into the table. Once it’s imported, write a basic select statement to select all the records to verify the result. Then write a second statement that returns all the counties where the number of workers who work in their MCD is greater than the number that work outside their MCD. In your answer provide both SELECT statements and the number of rows returned (via the Messages tab in pgAdmin).

Part IV – Written Responses

Please answer the following questions in two to four sentences.

11. What is the difference between a primary key and a foreign key?

12. What is a view and how is it different from a table? For what purpose would you create a view?

13. Imagine you have a data table with population data for places, which is a US Census term for cities and towns. The data includes: a 5-digit code that contains digits from 0 to 9 that uniquely identifies each place, a place name, the two-letter state abbreviation, a footnote that indicates whether a place is incorporated or not, the total area in square miles with two decimal places, the 2010 census population count, and a 2017 population estimate. Write a CREATE TABLE statement for this data, paying attention to best practices for creating names, assigning data types, and building well structured tables. Then, write a few sentences describing your rationale for creating the table the way that you did.

Assignment 2 – Intro to SQL: Data Manipulation Language (DML)

Due date: Feb 12, 2019

Introduction

  • In this exercise you’ll be working in the nys schema within the gep664_2019 database.
  • This exercise is worth 7 points. Each question is worth 1/2 a point.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as documented in the SQL Homework Guidelines. Failure to do so will result in point deductions.

Background

In this exercise you’ll be working in the nys schema within the gep664_2019 database. This schema includes four tables with county-level census data that describe the economy and labor force in NY State.

metadata
a description of every column for every table in this schema
metros
counties that are part of metropolitan areas that fall within NY State
popworkers
number of residents (population) and number of workers who work in the county from the 2012-2016 American Community Survey
resworkers
number of workers who live in the county and their place of work from the 2012-2016 American Community Survey

In pgAdmin 4 you can select each table and view the Properties tab for more information about the data. Read this information, view the metadata table, and examine each of the tables prior to beginning the exercise.

“Residents” are the total population of an area. “Workers” are the number of people that work in an area regardless of where they live. Both of these variables are in the popworkers table. “Resident Workers” are the population of the area who work, regardless of where they work. The resworkers table provides information on where resident workers work: inside or outside their state of residence, and inside or outside their county of residence. The data in these tables can be used to calculate the daytime population for an area, i.e. where most people are during the day based on their residency and work status.

Metropolitan areas are defined by the US Office of Management and Budget and are used by many federal agencies, including the Census Bureau. Metro areas consist of one or more adjacent counties that meet certain thresholds for total population and density, and that are linked by commuting patterns. Metro areas can cross state boundaries, and are subdivided between large areas (called Metropolitan Areas) and smaller areas (called Micropolitan areas). Core Based Statistical Area (CBSA) is an umbrella term that includes both areas, while the term “Metro Area” is also used colloquially to refer to both.

Questions

For the following, provide the SQL statement and the record count that is automatically returned by pgAdmin (flip from the Data Output tab to the Messages tab to see the count) or psql. There may be more than one way to answer certain questions (just choose one).

Part I – Popworkers Table

1. Select all columns for all counties that have less than 50000 residents.

2. Select the id, name, and residents columns for the two Long Island counties (Nassau and Suffolk).

3. Select the id, name, residents, and workers columns, calculate the difference between residents and workers in a calculated field called difference, and sort the data by difference in descending order (largest to smallest).

4. Summarize the entire table: count the total number of counties, sum the total number of residents, and sum the total number of workers for all counties.

Part II – Resworkers Table

5. For all counties, return ids, names, resident workers, and workers who work in their county of residence, and create a calculated field that shows the percentage of resident workers who work in their county of residence. Sort the result by the calculated field in ascending order (smallest to largest), and limit the result to display the bottom ten counties (smallest values). Since these columns are integers and you need a decimal result, use the cast operator:

(CAST(column1 as numeric) / CAST(column2 as numeric))*100

6. Do a regular inner join on the resworkers and popworkers table using their id. Create a field where you calculate the daytime population (residents + workers – resident workers). Your result should just include columns that are relevant to the calculation. Create a second calculated field that’s the difference between this daytime population and the resident population, and sort by this value.

Part III – Metro Table

7. Return a list of all the distinct cbsas from the metro table and sort them in alphabetical order.

8. Select all columns and records in the metro table that are not within New York State (these are counties that are outside of NY State but are inside a metro area that’s within NY State).

9. Write a query that counts the number of counties in each cbsa.

10. Do a regular inner join between the popworkers and the metro tables to show which counties are part of which cbsas. Include ids, names, and total residents in the result. Sort by metro area name and county name.

11. Do a left outer join with popworkers on the left to illustrate which NY counties are not part of a cbsa. Include ids, names, and total residents in the result.

12. Do a regular inner join between the popworkers and metro table and summarize the number of residents and workers by cbsa. Sort the data by population in descending order and only include columns that are relevant.

Part IV – Written Responses

Please answer the following questions in two to four sentences.

13. What are the benefits for using aliases (with the AS statement) for column names and table names?

14. What is the difference between a regular (inner) join and a left-outer join? For what purpose would you use a left-outer join?