
Assignment 11 – DB Management

Due Date: May 9, 2017

Introduction

  • In this exercise you’ll be working with the gep664 database.
  • This exercise is worth 6 points. Each question is worth 1 point.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.
  • For the following, provide SQL statements (there are no row counts to return).

Questions

This assignment will give you additional practice in working with some of the database management concepts we covered in class. You will need to use the airport_geog table that we created from a transaction in class 10. If you don’t have this table in the public schema in gep664, you will need to get the file from the course website and run this transaction to create the table.

For guidance, refer to the slides for class 11 and the links in the slides to the PostgreSQL documentation. You can also refer back to Connolly Chapter 6 (in the library e-reserve readings).

Part I – Transactions and Efficiencies

1. Write a transaction where you add the following values to the airport geography table and build geography for them.

('ORD','IL',-87.9047,41.9786)
('ATL','GA',-84.4281,33.6367)
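A transaction for this has the general shape below. This is a sketch only: the column names (iata, state, lon, lat, geog) are assumptions – match them to the airport_geog table you built in class 10.

```sql
BEGIN;

-- Add the new rows (column names are assumptions)
INSERT INTO public.airport_geog (iata, state, lon, lat)
VALUES ('ORD', 'IL', -87.9047, 41.9786),
       ('ATL', 'GA', -84.4281, 33.6367);

-- Build geography from the lon/lat values for just the new rows
UPDATE public.airport_geog
SET geog = ST_SetSRID(ST_MakePoint(lon, lat), 4326)::geography
WHERE iata IN ('ORD', 'ATL');

COMMIT;
```

Because the INSERT and UPDATE sit between BEGIN and COMMIT, they succeed or fail as a single unit.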

2. Use the Explain statement to test how long it would take to execute a query that selects all airports that are within 3000 kilometers of Chicago O’Hare (IATA code ORD). Remember that distances in the geography type are in meters.
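The general pattern is to prefix the query with EXPLAIN (estimated costs only) or EXPLAIN ANALYZE (which actually runs the query and reports real timings). The table and column names in this sketch are assumptions:

```sql
EXPLAIN ANALYZE
SELECT a.iata
FROM public.airport_geog a, public.airport_geog o
WHERE o.iata = 'ORD'
  AND ST_DWithin(a.geog, o.geog, 3000000);  -- 3000 km = 3,000,000 m
```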

3. Create a spatial index on the geography column. Then re-run the Explain statement in the previous question and see how long it would take. Provide just the index statement in your answer.
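The index statement itself is short; GiST is the index type used for PostGIS geometry and geography columns (index and column names here are assumptions):

```sql
CREATE INDEX airport_geog_gix
  ON public.airport_geog USING GIST (geog);
```

After creating the index, re-running the same EXPLAIN should show a different, cheaper plan.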

Part II – Users

4. Write a statement to create a role and password for a new user called testuser. Then, in a second statement, grant testuser SELECT and UPDATE privileges for the following tables in the nys schema: acs2014_labforce, zbp2014, zbp_flags, and zips.
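As a sketch (the password is a placeholder), creating a login role and granting privileges on a list of tables looks like:

```sql
CREATE ROLE testuser LOGIN PASSWORD 'changeme';  -- placeholder password

GRANT SELECT, UPDATE
  ON nys.acs2014_labforce, nys.zbp2014, nys.zbp_flags, nys.zips
  TO testuser;
```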

5. Write a statement to create a role and password for a new user called testadmin. Then, in a second statement, grant testadmin all privileges on all tables in the nys schema, with the ability to grant privileges to other users.

6. Revoke testuser’s UPDATE privileges on all tables in the nys schema.
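The ALL TABLES IN SCHEMA form lets you revoke from every existing table in the schema in one statement:

```sql
REVOKE UPDATE ON ALL TABLES IN SCHEMA nys FROM testuser;
```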

Within pgAdmin you can see the results of these statements by selecting one of the tables in the nys schema, right clicking and choosing properties, and then selecting the Privileges tab. You can also disconnect from the database and log back in as one of these users to see what they would see.

Assignment 10 – Proximity Analysis

Due Date: May 2, 2017

Introduction

  • In this exercise you’ll be working with the gep664 database in the nyc schema and the sample data from Chapter 10 in PostGIS in Action. It’s assumed that you have already loaded the ZCTA shapefile and census tract population centers into the nyc schema (which we did in class 7).
  • This exercise is worth 6 points. Each question is worth 1 point.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.
  • For the following, provide the SQL statement and the row count.

Questions

This assignment will give you additional practice in working with the various methods for calculating distances, and with geography data types. In answering these questions, you should refer back to the slides from the lecture for class 10 and to Chapter 10 in PostGIS in Action.

Do NOT use ST_Buffer to answer any of these questions: use the distance functions (ST_DWithin, ST_Distance, etc). You do not have to use the KNN methods or operators unless it’s specifically asked for.

Part I – Measuring Distance with Geometry

In this first part, you’ll be working with data in the nyc schema; this includes the ZCTA feature that we loaded from a shapefile and the census tract population centers we loaded from a text file back in class 7. If you don’t have these files, download them from the course website (data files used for Class 7) and load them into the gep664 database.

Remember the features in the nyc schema are in the NYSP Long Island spatial reference system, which uses feet. 5280 feet = 1 mile.
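As a generic pattern (the table, column, and id values below are hypothetical), ST_DWithin with geometry takes its distance in the units of the spatial reference system – here, feet:

```sql
-- Hypothetical names: match them to your own tables
SELECT t.tractid
FROM nyc.tract_popctr t, nyc.subway_stations s
WHERE s.stop_id = '101'                  -- placeholder stop id
  AND ST_DWithin(t.geom, s.geom, 5280);  -- 1 mile = 5,280 ft
```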

1. Calculate the distance from every subway station in the Bronx to every census tract population center (nyc.tract_popctr) in the Bronx. The unique id code for the Bronx is 36005; it appears in a dedicated column in the subway table, but is part of the geoid in the tract population center table. Your result should include the tract and station ids and names. Sort the data by stop_id and distance.

2. Select the ids of all census tract population centers that are within a distance of 1 mile from the two Yankee Stadium subway stations. Make sure to remove duplicates (a tract should only appear once in the results).

3. Sum the population for all census tract population centers that are within 1/2 mile of the Lincoln Center subway station. Your result should just show the total population for this area.

4. Using the KNN bounding box centroid operator, find the nearest 10 subway stations to ZCTA 10010. Return station ids and names.
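The <-> operator returns a distance measure suitable for ORDER BY; combined with LIMIT it performs an index-assisted nearest-neighbor search. A sketch, with assumed table and column names:

```sql
SELECT s.stop_id, s.stop_name
FROM nyc.subway_stations s, nyc.zctas z  -- names are assumptions
WHERE z.zcta = '10010'
ORDER BY s.geom <-> z.geom
LIMIT 10;
```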

Part II – Measuring Distance With Geography

In this part, you’ll be working with one of the sample data tables that’s included in the data for Chapter 10 of PostGIS in Action. A link for downloading this data is available on the course website; it’s a ZIP file that contains subfolders for every chapter in the book. You will work specifically with the airports table, which has a geography column.

Remember that units returned by the geography data type are in meters. 1000 meters = 1 kilometer.

5. For all of the airports that are of the type ‘large_airport’, measure the distance from each airport in Japan to each airport in Australia. Show the distance in kilometers, and the ids and names of the airports. Remember, to perform operations within a single table, you add the table to the FROM statement twice, and give each a different alias so you can reference them as separate entities (see the lecture slides for an example).
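The self-join pattern looks roughly like this (the schema and column names are assumptions, and you will still need the type filter the question asks for):

```sql
SELECT j.ident, j.name, a.ident, a.name,
       ST_Distance(j.geog, a.geog) / 1000 AS dist_km
FROM airports j, airports a   -- same table, two aliases
WHERE j.iso_country = 'JP'
  AND a.iso_country = 'AU';
```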

6. For all airports in Australia that are of the type ‘large_airport’, return the closest airport of any type. Return distance in kilometers and names and ids of the airports. This question also requires you to perform operations within a single table. To remove results where each airport is being measured to itself, add an item to the WHERE clause to ensure the airport id is not equal to the neighbor airport id: a.id != n.id. Hint – you can consider using the DISTINCT ON approach for this question.
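DISTINCT ON keeps only the first row per group after sorting, so sorting each group by distance leaves each airport paired with its nearest neighbor. A sketch with assumed table and column names:

```sql
SELECT DISTINCT ON (a.id)
       a.id, a.name, n.id AS near_id, n.name AS near_name,
       ST_Distance(a.geog, n.geog) / 1000 AS dist_km
FROM airports a, airports n
WHERE a.id != n.id
ORDER BY a.id, ST_Distance(a.geog, n.geog);
```

Note that the ORDER BY list must start with the DISTINCT ON expression.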

Assignment 9 – Organizing Spatial Data

Due Date: Apr 20, 2017

Introduction

  • Please note that this assignment is due on Thu Apr 20 by 11:59pm. Submit it electronically to the Box; you will not be printing out or handing in a paper copy.
  • In this exercise you’ll be referring back to material from Chapter 14 in PostGIS in Action, but most of the assignment is dedicated to exploring what you would like to do for your final project.
  • This exercise is worth 6 points. Each question is worth 1.5 points.
  • You do not have to use the SQL template for this assignment, as the questions do not involve writing code.

Questions

This assignment will review some of the concepts we covered for organizing spatial data, but is primarily designed to get you to start thinking about your final project.

Part I – Spatial Storage Approaches

In this part, refer back to Chapter 14 in PostGIS in Action, section 14.1 on pages 337-346.

1. Write a paragraph in your own words that summarizes the three different methods for storing spatial data in geometry columns. In your explanation, provide some examples of the most important advantages and disadvantages of each approach.

Part II – Final Project Brainstorming

For this part, refer back to the journal articles that I emailed you, and to the description of the final project. You will need to do some background research on what you are thinking about pursuing for your final project. Please allocate sufficient time for doing this: this will be your opportunity to receive feedback from me on your initial ideas.

2. Skim through all of the journal articles I sent you, choose one that interests you, and read it thoroughly. For the article that you chose, in your own words write a paragraph or two that explains what the author’s research question or problem is, and describe how they used a spatial database to answer that question.

3. Write a paragraph or two that describes two or three final project ideas that you are considering pursuing. For each idea, explain what the research question or goal for the different projects would be.

4. For the ideas you have listed above, explore and list some possible data sources that you could use. Be specific: provide names of specific datasets or even variables – don’t just list general websites or repositories. For example: “I could use TIGER Line shapefiles for census tracts and educational attainment data from the latest 5-year American Community Survey from the Census Bureau”. Not: “I would use the census website and some shapefiles”.

Assignment 8 – Spatial Relationships

Due Date: Apr 4, 2017

Introduction

  • In this exercise you’ll be working with the gep664 database in the nyc schema. It’s assumed that you have also loaded the ZCTA shapefile into this schema (which we did in our previous class). You’ll also need to use QGIS.
  • This exercise is worth 6 points. Each question is worth 1 point.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.
  • For this assignment you will need to submit a SQL file and a map as an image file. When you submit your electronic assignment to the Box, attach the SQL file and the image as attachments in the same message. At the beginning of class hand in a paper copy of your SQL file; you don’t have to print out and hand in the map.

Questions

This assignment will give you additional practice with spatial relationships and joins, and you’ll pull together several concepts that you have learned thus far in the course.

You’ll be working with data in the nyc schema; this includes the ZCTA feature that we loaded from a shapefile in our previous class. If you don’t have the ZCTA file, download it from the course website (under Class 7 data files) and load it into the gep664 database using the shapefile loader under the PgAdmin Plugins menu or the database manager in QGIS.

For the following, provide: the SQL statement, the row count, and any specific deliverable that the question asks for.

Part I – Spatial Relationships

There are 339 census tracts and 25 ZCTAs in the Bronx. ZIP Codes / ZCTAs often do not align well with other census or legal geographies. In this exercise you’ll test and see how well they do or do not align with census tracts. Pay careful attention to the relationship you’re being asked to test.

1. Select all census tracts (the tractid and tractnum) that are within ZCTAs in the Bronx.

2. Select all census tracts (the tractid and tractnum) that overlap ZCTAs in the Bronx. Use DISTINCT to remove tracts that overlap more than one ZCTA.

3. Select all census tracts (the tractid and tractnum) that have their geographic centroid within ZCTAs in the Bronx.

Part II – Spatial Join

4. Write a SELECT statement where you select the subway stations (stop id, name, and trains) for the entire city and assign the stations the ZCTA code that they are within. For this statement, use the JOIN clause and make it a left outer join so that you keep all subway records on the left whether they have a matching ZCTA or not. Sort the result in descending order by ZCTA number.
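The general shape of a left outer spatial join is sketched below (the table and column names are assumptions):

```sql
SELECT s.stop_id, s.stop_name, s.trains, z.zcta
FROM nyc.subway_stations s
LEFT JOIN nyc.zctas z
  ON ST_Within(s.geom, z.geom)  -- spatial condition in the ON clause
ORDER BY z.zcta DESC;
```

Stations that fall outside every ZCTA are kept, with NULL in the ZCTA column.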

Part III – Spatial Joins and Summaries

To refresh your memory when answering these questions, refer back to the slides and the Boundless tutorial. When writing these longer statements, construct them bit by bit, getting one piece to work before moving on to add the next piece.

In the nyc schema in the course database:

  • census_tracts is a spatial table with census tract geometry; it also contains a column that identifies the Neighborhood Tabulation Area (NTA) that each tract is part of.
  • census2010_tracts contains data from the 2010 Census for each census tract.
  • census2010_lookup contains the list of census variables and the codes used to identify them as column headings in the census2010_tracts table.

5. Identify the codes for the variables that represent the total population, and the total population who are 16 years old and over. Then write a SELECT query where you relate these variables to the census tract geographies and create a summary for NTAs, so each row in the result is for an NTA and the columns represent: total population, total population 16 and over, and the percent of the NTA’s population that is 16 and over for each NTA. To avoid division by zero, select only tracts that have an over 16 population greater than zero. In order to get decimals for your percent total, remember to use the CAST operator around your integers:

(CAST(value1 as numeric))/(CAST(value2 as numeric))

6. Take the query you wrote in question 5 and save it as a spatial view called ntas_workforce (remember, to do this you must add the geometry to the select clause and union it – refer back to the slides). Or if you prefer, you can create a new table and insert the data into it (instead of creating a view). Provide the full query as your answer. Then, using QGIS connect to the course database and add this view to the map. Symbolize the NTAs to display a graduated map showing the percent of the population who are 16 and over. Classify the data using equal intervals with 5 categories. It’s normal if you have some “holes” in your map, as these represent unpopulated areas. You don’t have to create a fancy, finished map (with a layout etc) – just take a screenshot of the window and save the result as an image file (under Project menu – Save as Image). When naming the file, use the same convention as the sql template (i.e. use your name and the assignment number).

Assignment 7 – Spatial Reference Systems

Due Date: Mar 28, 2017

Introduction

  • In this exercise you’ll be working with the gep664 database in the copper schema you created from assignment 5. You’ll also need to use QGIS.
  • This exercise is worth 6 points. Each question is worth 1 point.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.
  • For this assignment you will need to submit a SQL file and a map as an image file. When you submit your electronic assignment to the Box, attach the SQL file and the image as attachments in the same message. At the beginning of class hand in a paper copy of your SQL file; you do not have to print out and hand in the map.

Questions

This assignment will give you additional practice with adding geometry columns, loading spatial data, and transforming coordinate systems.

You’ll be working with data in the copper schema; if you don’t have the files the entire schema is available as a backup file on the course website. Download the course data for class 7 from the Data and Software page, which contains the copper schema backup and a country shapefile that you’ll need to load.

For the following, provide: the SQL statements (questions 1, 3-5) and any specific deliverable that each question asks for.

Part I – Data Loading and Transformation

1. Add a geometry column to the copper smelters table, and build geometry for the table using the longitude and latitude fields. These coordinates are currently in WGS 84 (SRID 4326). When building the geometry, transform them to Pseudo Mercator (SRID 3857).
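The two-step pattern – add the column, then populate it – looks roughly like this (the table and lon/lat column names are assumptions):

```sql
ALTER TABLE copper.smelter ADD COLUMN geom geometry(Point, 3857);

UPDATE copper.smelter
SET geom = ST_Transform(
             ST_SetSRID(ST_MakePoint(longitude, latitude), 4326),
             3857);
```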

2. In the course data files there is a shapefile of countries from Natural Earth. Import this shapefile into the copper schema using the shp2pgsql-gui, located under the plugins menu. Name it countries_temp and assign it the appropriate SRID (it is in WGS 84). Note – if you receive an error about UTF-8 encoding while trying to import, hit the options button and change the encoding from UTF-8 to LATIN1. Since there is no SQL to return, copy the output that’s printed in the Log Window of the shp2pgsql-gui after you’ve loaded the shapefile, and paste that into your text file as your answer.

3. Create a new table called countries_bndy that will hold a cleaned-up version of countries_temp. Use the new column names (1st column in the list below) and look at the existing columns (2nd column in the list below) in the countries_temp table to assign appropriate data types. Make uid the primary key, and make sure to include a geometry column in the new table that can hold multipolygons and assign it the SRID for Pseudo Mercator.

uid - adm0_a3
iso - iso_a3
name - name
name_long - name_long
ctype - type
continent - continent
subregion - subregion

4. Insert the appropriate data from countries_temp into the countries_bndy file, and as part of the insert operation transform the geometry from WGS 84 to Pseudo Mercator.
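The INSERT ... SELECT pattern with a transform on the geometry is sketched below (the column list is abbreviated – match it to your countries_bndy definition):

```sql
INSERT INTO copper.countries_bndy (uid, iso, name, geom)
SELECT adm0_a3, iso_a3, name, ST_Transform(geom, 3857)
FROM copper.countries_temp;
```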

5. Create a view called smelter_sum where you join the smelters table to the countries_bndy file using the iso country codes in each table. In the view, group the data by country, count the number of smelters, and sum the smelter capacity. Make sure to include the geometry column (in SELECT and GROUP BY). In your answer, return the SQL statement AND the number of rows that are in the view.

Part II – Make a Map

6. Using QGIS, connect to the course database and add the smelter_sum view to the map. Symbolize the countries to display a graduated map showing total smelter capacity. Classify the data using equal intervals with 5 categories. You don’t have to create a fancy, finished map (with a layout etc) – just take a screenshot of the window and save the result as an image file (under Project menu – Save as Image). When naming the file, use the same convention as the sql template (i.e. use your name and the assignment number).


Assignment 6 – Spatial Data Basics

Due Date: Mar 14, 2017

Introduction

  • In this exercise you’ll be working with the gep664 database and the sample data from Chapter 1 in PostGIS in Action.
  • This exercise is worth 6 points. Each question is worth 1 point.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.

Questions

The first part of this assignment echoes what we did in the lecture and covers material from PostGIS in Action Chapter 2 and the Boundless PostGIS tutorial (parts 9 and 10).

The second part of this assignment is derived from Chapter 1 in PostGIS in Action. It assumes that you have read the chapter and done all of the exercises using the sample data provided with the book. A link for downloading this data is available on the course website on the Data & Software page; it’s a ZIP file that contains subfolders for every chapter in the book.

For the following, provide: the SQL statement (questions 1-6) and the record count (questions 4-6) that is automatically returned by pgAdmin or psql.

Part I – Geometry Basics

For this exercise, you’ll be working in the course database gep664 in the nyc schema.

For the questions in this part, all of the geometry should be defined using SRID 2263, which is the local state plane zone for NYC (NAD 83 NY State Plane Long Island (ft-us)).

1. Create a point feature to represent the city’s primary train stations. Create an empty table called nyc.train_stations with 3 columns: tid integer, name text, and geom geometry(point,2263). Write an INSERT statement where you insert the VALUES for id, name, and point geometry into the table. Use the manual method for constructing geometry as outlined in section 9.1 in the Boundless tutorial. In your answer, just provide the final statement (the INSERT).

1,Penn Station,986029,212733
2,Grand Central Station,990607,213615
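One common form of the manual method is a well-known-text literal plus an SRID; shown below for the first station only, as a pattern:

```sql
INSERT INTO nyc.train_stations (tid, name, geom)
VALUES (1, 'Penn Station',
        ST_GeomFromText('POINT(986029 212733)', 2263));
```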

2. Create a line feature to represent the Times Square Shuttle subway. Create an empty table called nyc.subway_shuttle with 3 columns: sid varchar(1), name text, and geom geometry(linestring,2263). Look in the subway stations table and find the stops for Grand Central (stop_id 901) and Times Square (stop_id 902). Return the geometry from these stations as text so you can see the coordinates (they’re in NY State Plane). Using these coordinates, write an INSERT statement where you insert: an id (S), name (Times Square Shuttle) and linestring geometry into the nyc.subway_shuttle table. Use the manual method for constructing geometry as outlined in section 9.1 in the Boundless tutorial. In your answer, just provide the final statement (the INSERT).

3. Create point features for the NJ PATH stations in Manhattan. Create an empty table called nyc.path_stations with 5 columns: sid integer, name text, xcoord numeric(6), ycoord numeric(6), and geom geometry(point,2263). Write an INSERT statement to insert the data into the table. For this question, do NOT use the manual geometry methods you used above; create the geometry using the ST_Point function. In your answer provide the INSERT statement and the statement used for creating the geometry.

1,33rd Street,987455,211986
2,23rd Street,986185,209902
3,14th Street,985064,207918
4,9th Street,984618,206718
5,Christopher Street,982237,206324
9,World Trade Center,981276,198593

Part II – PostGIS in Action Chapter 1

For this part, you’ll work with the sample data from Chapter 1 of PostGIS in Action. This part assumes that you read and completed the chapter’s exercises. Your answer must include the SQL statement AND the row count.

4. Count all of the restaurants that are within 1/2 mile of a highway within the states of NY, NJ, and CT. Group by franchise, so the result shows the total count of each restaurant within all three states (i.e. do not group by state).

5. Return the coordinates in plain text of all Pizza Hut restaurants in the restaurants table.

6. Calculate and sum the lengths of all the principal highways in miles within the state of NJ by highway name. Since many of the highways exist as multiple features be sure to group them. Sort the results from longest to shortest distance.

Assignment 5 – Data Processing and SQL Review

Due Date: Mar 7, 2017

Introduction

  • In this exercise you’ll be working with the copper data tables we processed in class and the gep664 database.
  • This exercise is worth 6 points. Questions 1-5 are worth 1 point each, questions 6-7 are worth 1/2 point each.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.

Questions

For the following, provide the SQL statement AND the record count that is automatically returned by pgAdmin or psql (for the tables you import, verify the record count after import).

This assignment assumes that you have already processed and cleaned the copper data tables as we did in class. The original data files are available for download on the course website on the Data & Software page. You will be importing data into the gep664 database.

When importing data into the database you may use the \copy command or the pgAdmin GUI (create blank table to hold data, right click, choose Import).

Remember: 1 – be careful when assigning number types to columns, to ensure that you don’t truncate data, and 2 – in order for the \copy command or GUI Import to work, the order and number of columns in the database table must match the order and number in the import table.
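The \copy pattern in psql looks like this (the file path is a placeholder):

```sql
\copy copper.smelter FROM 'C:/data/smelter.csv' WITH (FORMAT csv, HEADER true)
```

Unlike server-side COPY, \copy reads the file from the client machine, so it works without server file access.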

Part I – Table Creation and Import

For this exercise, create a schema called copper for the user postgres (CREATE SCHEMA copper AUTHORIZATION postgres;) in the gep664 database, and import the copper data tables that we cleaned in class.

1. Create a new database table called smelter and import the copper smelter data.

2. Create a new database table called trade and import the copper ore trade data (you don’t have to do anything with the pivot table we created in class).

3. Create a new database table called mines_porcu and import the main porphyry copper mining data (from main.csv). As a reminder – in class we deleted several columns; these are the only ones that should be in the final table: rec_id, depname, country, ccode, stprov, latitude, longitude, oreton, cugrd, mogrd, augrd, aggrd, deptype, comments.

Part II – Data cleaning and Import

For this part, you’ll work with the main sedimentary copper mine data (main.csv in the sedcu folder) that we did not use in class, but that is stored with the other sample data. It is in a CSV format and is similar to the porphyry table in structure.

4. Using the ISO country code table that we used in class, write an Excel VLOOKUP formula to assign the proper 3-letter ISO country code to each country. For codes not found, fill them in manually. Instead of SQL code, submit the VLOOKUP formula as your answer and comment it out in the template. After the formula write one sentence that indicates which columns the formula is referring to for the lookup value, range, and returned value.
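A VLOOKUP has this general shape (the cell and sheet references below are placeholders – yours will depend on where the country column and the ISO code table sit in your workbook):

```
=VLOOKUP(C2, codes!$A$2:$B$250, 2, FALSE)
```

Here C2 would be the country name to look up, codes!$A$2:$B$250 the lookup range (country names in the first column, ISO codes in the second), 2 the column to return, and FALSE forces an exact match.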

5. Delete unnecessary columns and save the table. Create a new database table in the copper schema called mines_sedcu and import the main sedimentary copper mining data. The columns that should appear in the final table are: rec_id, depname, country, ccode, stprov, latitude, longitude, oreton, cugrd, cogrd, aggrd, deptype, comments. Most of the columns are similar to the porphyry table, but you will need to re-order them so they’re consistent.

Part III – Select and Summarize

For this part, you will create summary views in the copper schema for the data you just imported.

6. Create a view called exports_5yr that sums the total copper ore exports (just exports, not re-exports) by country for the past 5 years, and display just the top 20 countries.

7. Create two separate views, one called pmine_total for porphyry mines and one called smine_total for sedimentary mines, that counts the number of copper mines (a count of records) and sums the total ore in tons available by country. Sort each view by number of mines in descending order.

BONUS Question. Optional, and worth 1 point. Count the mines and sum the ore totals for porphyry and sedimentary sites in one query to create a single view called copper.mines_total. Counts from both tables must appear, even if they don’t have a corresponding match in the other table. You may not change the underlying structure of the tables or create different tables, but (hint) you may import the ISO country code table into the database as is and use it in the query.

Assignment 4 – Modelling and Normalization

Due Date: Feb 28, 2017

Introduction

  • In this exercise you’ll be working with data posted on the course website.
  • This exercise is worth 6 points. The diagram is worth 4 points, the write-up is worth 1 point, and the normalization question is worth 1 point.
  • As this assignment does not require SQL statements, the normal submission guidelines do not apply. Sketch your ER diagram by hand, type up your answers to the written questions, and print them out. Submit both to me as hardcopies at the beginning of class.

Part I – ER Diagram – Modeling GIS Workshops

For this part, you will sketch an entity relationship diagram to model a database to store information about GIS workshops that I teach. Use pencil, ruler, and paper and sketch an ER diagram by hand using Chen’s ER methods that we covered in class and in the readings. If you prefer to use software or online tools to do the sketching you may, but it’s purely optional.

I have placed a spreadsheet on the course website that contains data related to the workshops – on the Data and Software page under Individual class exercises. Use this information and the following description to create the diagram:

Every semester I offer two or three day-long workshops called Introduction to GIS Using Open Source Software. Students, faculty, and staff from throughout CUNY are able to register. Each participant is required to submit their name and email address, and identify their: status, department or field of study, and CUNY school affiliation (for each option they must choose one; i.e. they can't identify multiple fields or schools).

I've held the workshops in a few different rooms at different campuses (but so far, only at Baruch and Lehman). If the rooms are equipped with computers then participants can use them; if the room does not have computers then the participants must bring their own laptops.

At the end of the workshop many of the participants fill out a course evaluation form. This form is anonymous and cannot be tied back to the participant list. For the evaluators all I know is their status, and the status category is more generic than what appears on the participant list.

The goal is to create a database where I can store all of this information in one place, and create queries where I can summarize the data in different ways. Up until now I have used individual spreadsheets, but as I teach these workshops year after year there is no efficient way for me to pull all the data together.

Some pointers:

  • Do NOT worry about the difference between registrants (people who sign up) and participants (people who actually show up). Ignore the registration aspect completely and just look at people who participated.
  • The participant and evaluation data cannot be directly tied together, as evaluations are anonymous. Treat participants and evaluators separately.
  • The evaluation data in the spreadsheet is summarized from individual paper forms that people fill in. For the database model, you can imagine that you have an anonymous record for each individual respondent with their responses.
  • A participant can’t take the workshop more than once.
  • I am the only instructor. There’s no need to store any instructor information.
  • Keep normalization guidelines in mind. If a lot of standardized information would be repeated within an entity, create a separate entity for it.
  • In your diagram you MUST show cardinality and participation, but you do NOT have to show whether an entity or relationship is strong or weak.

Part II – Diagram Write-up

In a short paragraph, summarize how you created the ER diagram. Discuss your decision making process and describe some of the key factors in creating the various entities, relationships, and attributes in the model.

Part III – Normalization

This part is completely separate from the preceding exercise – base your decisions solely on the data provided below. Take the following workshop participant data and put it into 1st, then 2nd, then 3rd normal form. Show each step of the transformation (note – you can keep the person’s name as a single field).

name               email                 department / college        workshop_date  workshop_room
Darth Vader        dv@deathstar.org      Economics, Baruch           9/30/2016      951
Eleanor Roosevelt  omg@gmail.com         Marketing, Baruch           10/31/2016     322
Peter Parker       spidey@yahoo.com      Criminal Justice, John Jay  9/30/2016      951
Genghis Khan       gkhan@aol.com         Geography, Lehman           10/31/2016     322
Roy Rogers         kingcowboy@gmail.com  Geography, Lehman           10/31/2016     322
Eleanor Roosevelt  omg@gmail.com         Marketing, Baruch           10/31/2016     322

Assignment 3 – Intro to SQL DDL

Due Date: Feb 21, 2017

Introduction

  • In this exercise you’ll be working in the nyc and nys schemas within the gep664 database.
  • This exercise is worth 6 points. Each question is worth 1/2 a point.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.

For the following, provide the SQL statement that answers each question; provide the row count only if asked. There may be more than one way to answer certain questions (just choose one).

Part I – Weather Stations table (nyc schema)

In order to select weather stations by state, we currently have to use the LIKE operator to do pattern matching. This is extremely inefficient; let’s add a dedicated state column that will allow for more efficient queries.
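As a sketch of the general pattern only (the table and column names below are hypothetical, not the assignment’s), adding and then populating a new column looks like this:

```sql
-- Hypothetical table for illustration; substitute the real table
-- and source column when you write your own statements.
ALTER TABLE demo.places ADD COLUMN state varchar(2);

-- Populate the column from an existing text field, assuming the
-- two-letter code appears at the end of the string.
UPDATE demo.places
SET state = right(place_name, 2);
```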

1. Alter the weather stations table by adding a column called state that will hold the two-letter state code (e.g. NY, NJ).

2. Update the new state column by adding the two-letter state code that indicates where the station is located. You can do this using either one or two statements; you may not insert the codes manually (i.e. by typing each one in individually).

Part II – NYC Borough Data (nyc schema)

You’ll get some practice in table creation by adding a couple of tables with data for the five boroughs of NYC. Each of the boroughs is also considered a county; a borough is a local designation for subdivisions of the City of New York, while counties are a federal designation for subdivisions of states.

3. Create a table called nyc.borough_area to hold the following data: codes and names for the boroughs, and their area in water and land in square miles (for column names, use the names in question 4). Designate a primary key, specify that borough names should not be null, and add a check constraint that land area must be greater than zero.

4. Insert the data into the new table:

bcode,bname,area_wtr,area_lnd
bx,Bronx,15.38,42.10
bk,Brooklyn,26.10,70.82
mn,Manhattan,10.76,22.83
qn,Queens,69.68,108.53
si,Staten Island,43.92,58.37

5. Create a view called nyc.borough_sum that has the borough name and calculated fields that have the total area (land plus water), the percent of the total that is water, and the percent that is land. Optionally – you can round the percentages to whole numbers.

6. Create a table called nyc.counties_pop to hold the county code, county name, borough code, and population for 2010 (for column names, use the names in question 7). Make county code the primary key and borough code the foreign key.

7. Insert the data into the new table:

ccode,cname,bcode,pop2010
36005,Bronx County,bx,1385108
36047,Kings County,bk,2504700
36061,New York County,mn,1585873
36081,Queens County,qn,2230722
36085,Richmond County,si,468730

8. Create a new table called nyc.boroughs_pop from the borough area and county population tables that contains: the borough code as primary key, borough name, total population, and population density (number of people per sq mile of land).

Part III – Import Labor Force Table (nys schema)

We’ll import a table from the Census 2010-2014 American Community Survey that indicates whether workers work in the same municipality in which they live. There is one record for each ZCTA in New York State, and like the previous ACS tables we’ve worked with, each estimate column has a margin of error column associated with it.

Go to the Data tab on the course website and download the CSV file for place of work by MCD of residence. This CSV has a header row and uses commas for delimiters.
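As a general sketch (the table name and file path below are placeholders, not the assignment’s), a psql \copy command for a comma-delimited file with a header row takes this form:

```sql
-- Run inside psql; replace the table name and path with your own.
\copy nys.example_table FROM 'C:/data/example.csv' WITH (FORMAT csv, HEADER true)
```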

9. Create a table called nys.acs2014_mcdwork that will accommodate all of the data in the CSV file. Make sure to designate a primary key.

10. Use the \copy command to insert the data from the CSV file into the table. Once it’s imported, write a select statement to verify the results. Provide the copy command, the select statement, and the row count (returned by pgAdmin) in your answer.

Part IV – Written Responses

Please answer the following questions in two to four sentences.

11. What is the difference between a primary key and a foreign key?

12. What is a view and how is it different from a table? For what purpose would you create a view?

Assignment 2 – Intro to SQL DML

Due Date: Feb 14, 2017

Introduction

  • In this exercise you’ll be working in the nys schema within the gep664 database.
  • This exercise is worth 6 points. Each question is worth 1/2 a point.
  • Be sure to follow the proper conventions for SQL statements and the guidelines for submitting assignments, as posted on the course website. Failure to do so will result in point deductions.

Background

In this exercise you’ll be working in the nys schema within the gep664 database. This schema includes four tables with census data that describe the economy and labor force in NY State.

nys.acs2014_labforce
resident labor force participation and employment from the 2010-2014 American Community Survey
nys.zbp2014
number of business establishments with employees and payroll from the 2014 ZIP Code Business Patterns
nys.zbp_flags
codes and descriptions for footnotes in the ZBP table
nys.zips
list of all ZIP Codes, with their postal cities and the ZCTA that each ZIP is correlated with (for residential ZIPs) or located in (for non-residential ZIPs)

The ACS data on the labor force is reported by ZIP Code Tabulation Area (ZCTAs), while the ZBP data on business establishments is reported by ZIP Code. What’s the difference?

Even though we think of ZIP Codes as areas, they are not. ZIP Codes are identifiers assigned by the US Postal Service to addresses along street segments. The Census Bureau, as well as private agencies and services like Google Maps, will take this data and attempt to create areas based on concentrations of addresses that share the same ZIP. The Census Bureau does this by aggregating census blocks where the majority of addresses share the same ZIP Code; the resulting areas are called ZCTAs. Some ZIP Codes cannot be aggregated into areas, either because they represent one large organization or building that has its own ZIP, or because the code represents a large cluster of PO Boxes at a post office. As a result, there are more ZIP Codes than ZCTAs.

In our database, the ACS labor force data is reported by ZCTA, while the ZBP business establishment data is summarized by ZIP Code. You can aggregate ZIP Code data to the ZCTA level by assigning the non-residential ZIPs to the ZCTA where they are physically located. The zips table in our database allows you to do this, as each ZIP Code is related to a ZCTA.
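The general shape of that aggregation (a sketch only, with a placeholder value column, and assuming both tables share a zipcode column) is a join to the zips table followed by a GROUP BY:

```sql
-- Placeholder column some_value stands in for whichever ZIP-level
-- measure you want to sum up to the ZCTA level.
SELECT z.zcta, sum(b.some_value) AS zcta_total
FROM nys.zbp2014 b
JOIN nys.zips z ON b.zipcode = z.zipcode
GROUP BY z.zcta;
```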

Lastly, the data in the ACS tables represent estimates (at a 90% confidence interval) with margins of error. So for every estimate there are two columns: the estimate itself, and a margin of error for the estimate. The latter are stored in columns with the suffix _me. For example, the labor force (labforce) for ZCTA 10001 was 14,547, plus or minus 947 (labforce_me).
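One way to see these paired columns in action (assuming the table’s identifier column is named zcta) is to compute the bounds of each estimate’s confidence interval:

```sql
-- Lower and upper bounds of the 90% confidence interval
-- for the labor force estimate in each ZCTA.
SELECT zcta,
       labforce - labforce_me AS labforce_low,
       labforce + labforce_me AS labforce_high
FROM nys.acs2014_labforce;
```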

Questions

For the following, provide the SQL statement and the record count that is automatically returned by pgAdmin or psql. There may be more than one way to answer certain questions (just choose one).

Part I – ZIP Code Business Patterns 2014 (ZBP) Table

1. Select all columns for all ZIP Codes that have more than 5000 employees.

2. Select all ZIP Codes in the Bronx that have more than 5000 employees.

3. Calculate the average number of employees for the entire table in a column called avg_emp.

4. For all ZIP Codes calculate the average payroll for employees in a new column called avg_pay. Sort the data and display the top ten ZIPs. Note: the payroll data is rounded to 1,000s of dollars, so multiply the result by 1,000 to get the right value. Make sure to exclude records where the employee value is null.

Part II – ZBP 2014 and USPS ZIP Code (ZIPs) Tables

5. Do a regular join between the zbp table and the zips table, to show which ZIPs are affiliated with which ZCTAs. Select just the zipcode and zipname from the zbp table and zcta from the zips table, and use aliases for the table names.

6. Do a left-outer join from the ZIP table to the ZBP table to identify which ZCTAs have no matching zipcode data. Sort the data in descending order by zcta. Select just the zipcode and zipname from the zbp table and zcta from the zips table, and use aliases for the table names.

7. Go back and do a regular join between the zbp table and the zips table. Aggregate and sum the establishments and employees from ZBP table by zcta. Use aliases for the table names and for the new columns.

Part III – American Community Survey (ACS) Table

8. Calculate the unemployment rate – the unemployed divided by total civilian labor force (lab_civilian) – for each ZCTA in a new column called unemp_rate. Do this for records where the number of unemployed is greater than 0. Since these columns are integers and you need a decimal result, use the cast operator:

(cast(lab_civunemp as numeric)/ cast(lab_civilian as numeric))*100
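The cast matters because PostgreSQL truncates the result when both operands are integers:

```sql
SELECT 1/2;                    -- integer division: returns 0
SELECT cast(1 AS numeric)/2;   -- numeric division: returns 0.5
```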

9. Select all records where the margin of error (_me) for the unemployed population is less than the unemployed population.

10. Now, calculate the unemployment rate for every ZCTA where the margin of error for the ZCTA’s unemployed population does not exceed the unemployed population.

Part IV – Written Responses

Please answer the following questions in two to four sentences.

11. What are the benefits of using aliases (with the AS statement) for column names and table names?

12. What is the difference between a regular (inner) join and a left-outer join? For what purpose would you use a left-outer join?