2 Introduction to SQL
2.1 Spatial databases
A database management system (DBMS) allows users to store, insert, delete, and update information in a database. Spatial databases go a step further because they record data with geographic coordinates.
From Esri Geodatabase to PostGIS, spatial databases have quickly become the primary method of managing spatial data.
To learn more about spatial databases, check out the resources below:
2.2 DuckDB
DuckDB is an in-process SQL OLAP database management system. It is designed to be used as an embedded database in applications, but it can also be used as a standalone SQL database.
- In-process SQL means that DuckDB’s features run in your application, not an external process to which your application connects. In other words: there is no client sending instructions nor a server to read and process them. SQLite works the same way, while PostgreSQL, MySQL…, do not.
- OLAP stands for OnLine Analytical Processing, and Microsoft defines it as a technology that organizes large business databases and supports complex analysis. It can be used to perform complex analytical queries without negatively affecting transactional systems.
DuckDB is a great option if you’re looking for a serverless data analytics database management system.
3 Getting Started
This week we’ll be learning how to use SQL to query data from a database. To get started, work your way through the following two notebooks:
Clicking the link will take you to Google Colab, where you can run the notebook in your browser. You can also download the notebook and run it locally if you prefer.
Once you’ve completed these you can test your knowledge by answering the questions in the lab below. You can also access it on Google Colab here.
4 Lab
Datasets:
The following datasets are used in this lab. You don’t need to download them manually, they can be accessed directly from the notebook.
# %pip install duckdb duckdb-engine jupysql
import duckdb
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
4.1 Question 1: Creating Tables
Create a database, then write a SQL query to create a table named nyc_subway_stations
and load the data from the file nyc_subway_stations.tsv
into it. Similarly, create a table named nyc_neighborhoods
and load the data from the file nyc_neighborhoods.tsv
into it.
4.2 Question 2: Column Filtering
Write a SQL query to display the ID
, NAME
, and BOROUGH
of each subway station in the nyc_subway_stations
dataset.
4.3 Question 3: Row Filtering
Write a SQL query to find all subway stations in the nyc_subway_stations
dataset that are located in the borough of Manhattan.
4.4 Question 4: Sorting Results
Write a SQL query to list the subway stations in the nyc_subway_stations
dataset in alphabetical order by their names.
4.5 Question 5: Unique Values
Write a SQL query to find the distinct boroughs represented in the nyc_subway_stations
dataset.
4.6 Question 6: Counting Rows
Write a SQL query to count the number of subway stations in each borough in the nyc_subway_stations
dataset.
4.7 Question 7: Aggregating Data
Write a SQL query to list the number of subway stations in each borough, sorted in descending order by the count.
4.8 Question 8: Joining Tables
Write a SQL query to join the nyc_subway_stations
and nyc_neighborhoods
datasets on the borough name, displaying the subway station name and the neighborhood name.
4.9 Question 9: String Manipulation
Write a SQL query to display the names of subway stations in the nyc_subway_stations
dataset that contain the word “St” in their names.
4.10 Question 10: Filtering with Multiple Conditions
Write a SQL query to find all subway stations in the nyc_subway_stations
dataset that are in the borough of Brooklyn and have routes that include the letter “R”.