2  Introduction to SQL

2.1 Spatial databases

A database management system (DBMS) allows users to store, insert, delete, and update information in a database. Spatial databases go a step further because they record data with geographic coordinates.

From Esri Geodatabase to PostGIS, spatial databases have quickly become the primary method of managing spatial data.

To learn more about spatial databases, check out the resources below:

2.2 DuckDB

DuckDB is an in-process SQL OLAP database management system. It is designed to be used as an embedded database in applications, but it can also be used as a standalone SQL database.

  • In-process SQL means that DuckDB’s features run in your application, not an external process to which your application connects. In other words: there is no client sending instructions nor a server to read and process them. SQLite works the same way, while PostgreSQL, MySQL…, do not.
  • OLAP stands for OnLine Analytical Processing, and Microsoft defines it as a technology that organizes large business databases and supports complex analysis. It can be used to perform complex analytical queries without negatively affecting transactional systems.

DuckDB is a great option if you’re looking for a serverless data analytics database management system.

3 Getting Started

This week we’ll be learning how to use SQL to query data from a database. To get started, work your way through the following two notebooks:

  1. Loading Data
  2. Introduction to SQL

Clicking the link will take you to Google Colab, where you can run the notebook in your browser. You can also download the notebook and run it locally if you prefer.

Once you’ve completed these you can test your knowledge by answering the questions in the lab below. You can also access it on Google Colab here.

4 Lab

Datasets:

The following datasets are used in this lab. You don’t need to download them manually, they can be accessed directly from the notebook.

# %pip install duckdb duckdb-engine jupysql
import duckdb

%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

4.1 Question 1: Creating Tables

Create a database, then write a SQL query to create a table named nyc_subway_stations and load the data from the file nyc_subway_stations.tsv into it. Similarly, create a table named nyc_neighborhoods and load the data from the file nyc_neighborhoods.tsv into it.

4.2 Question 2: Column Filtering

Write a SQL query to display the ID, NAME, and BOROUGH of each subway station in the nyc_subway_stations dataset.

4.3 Question 3: Row Filtering

Write a SQL query to find all subway stations in the nyc_subway_stations dataset that are located in the borough of Manhattan.

4.4 Question 4: Sorting Results

Write a SQL query to list the subway stations in the nyc_subway_stations dataset in alphabetical order by their names.

4.5 Question 5: Unique Values

Write a SQL query to find the distinct boroughs represented in the nyc_subway_stations dataset.

4.6 Question 6: Counting Rows

Write a SQL query to count the number of subway stations in each borough in the nyc_subway_stations dataset.

4.7 Question 7: Aggregating Data

Write a SQL query to list the number of subway stations in each borough, sorted in descending order by the count.

4.8 Question 8: Joining Tables

Write a SQL query to join the nyc_subway_stations and nyc_neighborhoods datasets on the borough name, displaying the subway station name and the neighborhood name.

4.9 Question 9: String Manipulation

Write a SQL query to display the names of subway stations in the nyc_subway_stations dataset that contain the word “St” in their names.

4.10 Question 10: Filtering with Multiple Conditions

Write a SQL query to find all subway stations in the nyc_subway_stations dataset that are in the borough of Brooklyn and have routes that include the letter “R”.

5 References