I took Databases and SQL for Data Science on Coursera, which covers basic SQL usage and the related Python APIs. These notes summarize the common SQL syntax from the course so I don't lose track of it.
P.S.
Links to the course Quiz and Assignment write-ups:
Coursera | Databases and SQL for Data Science(IBM) | Quiz
Coursera | Databases and SQL for Data Science(IBM) | Assignment
SQL
Introduction to SQL
What is data?
- Facts
- Pictures
- One of the most critical assets of any business
- Needs to be secure
What is a database?
- A repository of data
- Provides the functionality for adding, modifying and querying that data
What is SQL?
- A language used for relational databases
- Query data
Relational Database
- Data stored in tabular form - columns and rows
- Columns contain item properties e.g. Last Name, First Name, etc.
- Table is a collection of related things, e.g. Employees, Authors, etc.
- Relationships can exist between tables (hence: "relational")
DBMS
- Database: repository of data
- DBMS: Database Management System - software to manage databases
- Database, Database Server, Database System, Data Server, DBMS - often used interchangeably
What is RDBMS?
- RDBMS = Relational database management system
- A set of software tools that controls the data
- access, organization, and storage
- Examples are: MySQL, Oracle Database, IBM Db2, etc.
Cloud databases
- Ease of Use and Access
- API
- Web Interface
- Cloud or Remote Applications
- Scalability & Economics
- Expand/Shrink Storage & Compute Resources
- Pay per use
- Disaster Recovery
Examples:
- IBM Db2
- Databases for PostgreSQL
- Oracle Database Cloud Service
- Microsoft Azure SQL Database
- Amazon Relational Database Service (RDS)
Available as:
- VMs or Managed Service
- Single or Multi-tenant
Database service instances
- DBaaS provides users with access to database resources in the cloud without the need to set up hardware or install software
- A database service instance holds data in data objects / tables
- Once data is loaded, it can be queried using web interfaces and applications
Relational Model
- Most used data model
- Allows for data independence
- Data is stored in tables
Entity-Relationship Model
- Used as a tool to design relational databases
Types of SQL Statements (DDL & DML)
DDL (Data Definition Language) statements:
- Used to define, change, or drop database objects such as tables
- Common DDL:
- CREATE
- ALTER
- TRUNCATE
- DROP
DML (Data Manipulation Language) statements:
- Read and modify data
- CRUD operations (Create, Read, Update & Delete rows)
- Common DML:
- INSERT
- SELECT
- UPDATE
- DELETE
CHAR VS VARCHAR
| | CHAR | VARCHAR |
|---|---|---|
| Summary | Fixed length, more efficient; typically used for fixed-length form data such as national ID numbers | Variable length, somewhat less efficient |

Reposted:
1. The length of CHAR is fixed while the length of VARCHAR is variable. If you define CHAR(10) and VARCHAR(10) and store 'ABCD' in each, the CHAR column still occupies a length of 10, i.e. the characters 'ABCD' followed by six spaces, whereas the VARCHAR length becomes 4. When reading the data back, CHAR values need trim() to strip the trailing spaces, while VARCHAR values do not.
2. CHAR is much faster to store and look up than VARCHAR because its fixed length makes storage and retrieval simpler, but it pays for this in space: the padding spaces inevitably occupy room. CHAR trades space for time, whereas VARCHAR puts space efficiency first.
3. In terms of storage, CHAR uses 1 byte for an English (ASCII) character and 2 bytes for a Chinese character, while VARCHAR uses 2 bytes for an English character and 2 bytes for a Chinese character.
4. Both store non-Unicode character data.
Basic SQL
CREATE
Syntax:

```sql
CREATE TABLE table_name ( ... )
```

Example:

```sql
CREATE TABLE provinces( ... )
```
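A fuller sketch of both blocks; the generic column list and the provinces column definitions (a two-letter id plus a name, as used in the course videos) are assumptions:

```sql
-- Generic syntax (column list is illustrative)
CREATE TABLE table_name (
    column_name_1 datatype optional_parameters,
    column_name_2 datatype
);

-- Example: a table of provinces keyed by a two-letter code
CREATE TABLE provinces (
    id CHAR(2) PRIMARY KEY NOT NULL,
    name VARCHAR(24)
);
```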
INSERT
Syntax:

```sql
INSERT INTO [TableName] ...
```

Example:

```sql
INSERT INTO AUTHOR ...
```
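A sketch of the full INSERT syntax and an example row; the AUTHOR column names and values are assumptions based on the course lab:

```sql
-- Generic syntax
INSERT INTO table_name (column_1, column_2, ...)
VALUES (value_1, value_2, ...);

-- Example (AUTHOR columns assumed from the course lab)
INSERT INTO AUTHOR (AUTHOR_ID, LASTNAME, FIRSTNAME, EMAIL, CITY, COUNTRY)
VALUES ('A1', 'Chong', 'Raul', 'rfc@ibm.com', 'Toronto', 'CA');
```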
UPDATE
Syntax:

```sql
UPDATE [TableName] ...
```

Example:

```sql
UPDATE AUTHOR ...
```
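A sketch of the full UPDATE syntax and an example; the SET and WHERE values are illustrative assumptions:

```sql
-- Generic syntax
UPDATE table_name
SET column_1 = value_1
WHERE condition;

-- Example
UPDATE AUTHOR
SET LASTNAME = 'KATTA'
WHERE AUTHOR_ID = 'A2';
```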
DELETE
Syntax:

```sql
DELETE FROM [TableName] ...
```

Example:

```sql
DELETE FROM AUTHOR ...
```
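A sketch of the full DELETE syntax and an example; the WHERE values are illustrative assumptions:

```sql
-- Generic syntax
DELETE FROM table_name
WHERE condition;

-- Example
DELETE FROM AUTHOR
WHERE AUTHOR_ID IN ('A2', 'A3');
```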
SELECT

```sql
SELECT * from <tablename>
```

WHERE

```sql
select book_id, title from Book
```

```sql
WHERE firstname LIKE 'R%'
```

```sql
WHERE pages >= 290 AND pages <= 300
```

```sql
WHERE country='AU' OR country='BR'
```
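A sketch that turns the WHERE fragments above into complete statements; the Author table name and the specific predicate values (such as 'B1') are illustrative assumptions:

```sql
-- Exact match on a predicate
select book_id, title from Book
WHERE book_id = 'B1';

-- Pattern match with LIKE: first names starting with R
select firstname, lastname from Author
WHERE firstname LIKE 'R%';

-- Range condition on a numeric column
select title, pages from Book
WHERE pages >= 290 AND pages <= 300;

-- Either of two values for a column
select firstname, lastname, country from Author
WHERE country = 'AU' OR country = 'BR';
```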
COUNT

```sql
select COUNT(COUNTRY) as CountC from MEDALS
```

DISTINCT

```sql
select DISTINCT COUNTRY from MEDALS
```

LIMIT

```sql
select * from MEDALS
LIMIT 5
```

ORDER BY

```sql
select title from Book
ORDER BY title
```

GROUP BY

```sql
select country, count(country)
from MEDALS GROUP BY country
```
Built-in Functions
SUM, MIN, MAX, AVG

```sql
select SUM(COST) as SUM_OF_COST, MAX(QUANTITY), MIN(ID), AVG(COST)
from PETRESCUE
```

ROUND

```sql
select ROUND(COST) from PETRESCUE
```

LENGTH

```sql
select LENGTH(ANIMAL) from PETRESCUE
```

UCASE, LCASE

```sql
select UCASE(ANIMAL), LCASE(ANIMAL) from PETRESCUE
```
Date and Time
DATE: YYYYMMDD
TIME: HHMMSS
TIMESTAMP: YYYYXXDDHHMMSSZZZZZZ (XX represents the month, ZZZZZZ the microseconds)
DATE / Time functions:

```sql
YEAR(), MONTH(), DAY(), DAYOFMONTH(), DAYOFWEEK(), DAYOFYEAR(), WEEK(), HOUR(), MINUTE(), SECOND()
```

Example:

```sql
select DAY(RESCUEDATE) from PETRESCUE
```

Special Registers:

```sql
CURRENT_DATE, CURRENT_TIME
```

Example:

```sql
select (CURRENT_DATE - RESCUEDATE) from PETRESCUE
```
REPLACE
Example:
Remove the ‘%’ sign from the above result set for Average Student Attendance column
Use the REPLACE() function to replace ‘%’ with ‘’
See documentation for this function.
```sql
SELECT Name_of_School, REPLACE(Average_Student_Attendance, '%', '')
from SCHOOLS
```
CAST AS, DECIMAL
The datatype of the “Average_Student_Attendance” column is varchar.
So you cannot use it as is in the where clause for a numeric comparison.
First use the CAST() function to cast it as a DECIMAL or DOUBLE
e.g. CAST("Column_Name" as DOUBLE)
or simply: DECIMAL("Column_Name")
Don't forget that the '%' (percentage) sign needs to be removed before casting.

```python
%sql SELECT Name_of_School, Average_Student_Attendance \
```
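A sketch of the complete query: strip the '%' sign, cast the cleaned string with DECIMAL(), and filter on the numeric value (the 70% threshold is just an illustrative assumption):

```python
%sql SELECT Name_of_School, Average_Student_Attendance \
     from SCHOOLS \
     where DECIMAL(REPLACE(Average_Student_Attendance, '%', '')) < 70 \
     order by Average_Student_Attendance
```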
Multiple Tables
A full join / Cartesian join:
Every row in the first table is joined with every row in the second table.

```sql
select * from Table1, Table2, ...
```

Implicit Join

```sql
select E.EMP_ID, D.DEP_ID_DEP from ...
```

Inner Join

```sql
SELECT B.BORROWER_ID, B.LASTNAME, B.COUNTRY, L.BORROWER_ID, L.LOAN_DATE ...
```

Outer Join: LEFT JOIN, RIGHT JOIN, FULL JOIN

```sql
SELECT B.BORROWER_ID, B.LASTNAME, B.COUNTRY, L.BORROWER_ID, L.LOAN_DATE ...
```
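Sketches of the three join styles; the EMPLOYEES/DEPARTMENTS table names, the E.DEP_ID join key, and the BORROWER/LOAN join condition are assumptions inferred from the snippets above:

```sql
-- Implicit join: the Cartesian product restricted in the WHERE clause
select E.EMP_ID, D.DEP_ID_DEP
from EMPLOYEES E, DEPARTMENTS D
where E.DEP_ID = D.DEP_ID_DEP;

-- Inner join: only borrowers that have at least one loan
SELECT B.BORROWER_ID, B.LASTNAME, B.COUNTRY, L.BORROWER_ID, L.LOAN_DATE
FROM BORROWER B
INNER JOIN LOAN L ON B.BORROWER_ID = L.BORROWER_ID;

-- Left outer join: all borrowers, with loan details where they exist
SELECT B.BORROWER_ID, B.LASTNAME, B.COUNTRY, L.BORROWER_ID, L.LOAN_DATE
FROM BORROWER B
LEFT JOIN LOAN L ON B.BORROWER_ID = L.BORROWER_ID;
```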
IBM Cloud
Open Service
Click Services
Open Console
Click on your Db2 service -> Manage
-> Open Console
RUN SQL
Open Console
-> bar menu icon
(in the top left corner) -> Click on RUN SQL
Look at tables
bar menu icon
-> EXPLORE
-> Tables
-> Schema
(corresponding to your Db2 userid)
P.S. The schema name typically starts with 3 letters (not SQL) followed by 5 numbers.
Load Data
bar menu icon
-> LOAD -> Load Data
-> Select File -> Select schema -> Select Table / Click New Table
NOTE: if you only see 2-3 schemas and not your Db2 schema then scroll down in that list till you see the desired one in which you previously created the tables.
Pay attention to the first row and related formats.
If there is an issue, it is usually identified with a Warning icon (red triangle with an exclamation mark) next to the datatype of the column.
First check if there is a pre-defined format in the drop down list that matches the format the date/time/timestamp is in the source dataset. If not, type the correct format. Upon doing so, the Mismatch Warning (and exclamation sign) should disappear. In this example below we changed/overwrote the default Timestamp format of YYYY-MM-DD HH:MM:SS to MM/DD/YYYY HH:MM:SS TT to match the value of 08/28/2004 05:50:56 PM in the dataset.
Date / Time / Timestamp format
Elements
| Element | Description |
|---|---|
| YYYY | Year (four digits ranging from 0000 - 9999) |
| M | Month (one or two digits ranging from 1 - 12) |
| MM | Month (two digits ranging from 01 - 12; mutually exclusive with M and MMM) |
| MMM | Month (first three letters of English month name; case of letters does not matter; mutually exclusive with M and MM) |
| D | Day (one or two digits ranging from 1 - 31) |
| DD | Day (two digits ranging from 01 - 31; mutually exclusive with D) |
| DDD | Day of the year (three digits ranging from 001 - 366; mutually exclusive with other day or month elements) |
| H | Hour (one or two digits ranging from 0 - 12 for a 12 hour system, and 0 - 24 for a 24 hour system) |
| HH | Hour (two digits ranging from 00 - 12 for a 12 hour system, and 00 - 24 for a 24 hour system; mutually exclusive with H) |
| M | Minute (one or two digits ranging from 0 - 59) |
| MM | Minute (two digits ranging from 00 - 59; mutually exclusive with M, minute) |
| S | Second (one or two digits ranging from 0 - 59) |
| SS | Second (two digits ranging from 00 - 59; mutually exclusive with S) |
| SSSSS | Second of the day after midnight (5 digits ranging from 00000 - 86400; mutually exclusive with other time elements) |
| U (1 to 12 times) | Fractional seconds (number of occurrences of U represent the number of digits, with each digit ranging from 0 to 9) |
| TT | Meridian indicator (AM or PM) |
The valid values for the MMM element include: ‘jan’, ‘feb’, ‘mar’, ‘apr’, ‘may’, ‘jun’, ‘jul’, ‘aug’, ‘sep’, ‘oct’, ‘nov’ and ‘dec’. These values are case insensitive.
Separators
The separator between timestamp elements can be any character you want except:
It can’t be a character in the ranges a-z, A-Z, or 0-9.
It can’t be the same as the separator between columns.
It can’t be the same character you use to indicate strings.
If the date in your timestamp format uses DD and MM elements, a separator character is optional. If your date format uses D or M elements, a separator character is required.
If the time in your timestamp format uses HH, MM, and SS elements, a separator character is optional. If your time format uses H, M, or S elements, a separator character is required.
Examples
If the timestamp in your data looks like Jan 09 2017 8.49.15.008, your custom timestamp format would be MMM DD YYYY H.MM.SS.UUU. However, if the timestamp in your data looks like 1.9.2017 08.49.15, then your custom timestamp format would be M.D.YYYY HH.MM.SS.
A default value of 1 is assigned for unspecified YYYY, M, MM, D, DD, or DDD elements. A default value of ‘Jan’ is assigned to an unspecified MMM element. A default value of 0 is assigned for all other unspecified elements. The following is another example of a timestamp format: YYYY/MM/DD HH:MM:SS.UUUUUU
Create credentials
Click on your Db2 service -> Service credentials
-> New credential
SQL API
Introduction
APIs used by popular SQL-based DBMS systems
Application or Database | SQL API |
---|---|
MySQL | MySQL C API |
PostgreSQL | psycopg2 |
IBM DB2 | ibm_db |
SQL Server | dblib API |
Database access for Microsoft Windows OS | ODBC |
Oracle | OCI |
Java | JDBC |
DB-API
- Python’s standard API for accessing relational databases
- Allows a single program to work with multiple kinds of relational databases
- Learn DB-API functions once, use them with any database
Concepts
Connection Objects
- Database connections
- Manage transactions
Cursor Objects
- Database Queries
- Scroll through result set
- Retrieve results
Cursor
Cursor methods:
- .callproc()
- .execute()
- .executemany()
- .fetchone()
- .fetchmany()
- .fetchall()
- .nextset()
- .arraysize()
- .close()
Usage
```python
from dbmodule import connect
```
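A sketch of the generic DB-API flow from the course slides; dbmodule, mydatabase, and mytable are placeholders rather than a real driver or schema:

```python
from dbmodule import connect  # placeholder: substitute the actual driver module

# Create a connection object
connection = connect('mydatabase', 'username', 'password')

# Create a cursor object and run a query
cursor = connection.cursor()
cursor.execute('select * from mytable')
results = cursor.fetchall()

# Free resources
cursor.close()
connection.close()
```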
Import the ibm_db Python library
The ibm_db API provides a variety of useful Python functions for accessing and manipulating data in an IBM® data server database, including functions for connecting to a database, preparing and issuing SQL statements, fetching rows from result sets, calling stored procedures, committing and rolling back transactions, handling errors, and retrieving metadata.
We first import the ibm_db library into our Python Application
```python
import ibm_db
```
Identify the database connection credentials
Connecting to a dashDB or DB2 database requires the following information:
- Driver Name
- Database name
- Host DNS name or IP address
- Host port
- Connection protocol
- User ID (or username)
- User Password
Note: to obtain these credentials, refer to the Create credentials section above.

```python
# Replace the placeholder values with your actual Db2 hostname, username, and password:
```
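A sketch of the credential variables with placeholder values (the variable names follow the lab notebook):

```python
# Replace the placeholder values with your actual Db2 credentials
dsn_driver   = "{IBM DB2 ODBC DRIVER}"
dsn_database = "BLUDB"            # e.g. "BLUDB"
dsn_hostname = "YourDb2Hostname"  # e.g. "dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net"
dsn_port     = "50000"            # e.g. "50000"
dsn_protocol = "TCPIP"            # i.e. "TCPIP"
dsn_uid      = "YourDb2Username"  # e.g. "abc12345"
dsn_pwd      = "YourDb2Password"
```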
Create the DB2 database connection
The ibm_db API uses the IBM Data Server Driver for ODBC and CLI APIs to connect to IBM DB2 and Informix.
Let's build the dsn connection string using the credentials you entered above.
```python
# DO NOT MODIFY THIS CELL. Just RUN it with Shift + Enter
```

```
DRIVER={IBM DB2 ODBC DRIVER};DATABASE=BLUDB;HOSTNAME=dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net;PORT=50000;PROTOCOL=TCPIP;UID=wls55462;PWD=2ls9wcg2s5l7n@08;
```

```python
# DO NOT MODIFY THIS CELL. Just RUN it with Shift + Enter
```

```
Connected to database: BLUDB as user: wls55462 on host: dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net
```

```python
# Retrieve Metadata for the Database Server
```
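A sketch of what those cells contain: building the dsn string, connecting, and reading server metadata (based on the lab notebook; in the lab these are separate cells):

```python
# Build the dsn connection string from the credentials entered above
dsn = (
    "DRIVER={0};"
    "DATABASE={1};"
    "HOSTNAME={2};"
    "PORT={3};"
    "PROTOCOL={4};"
    "UID={5};"
    "PWD={6};"
).format(dsn_driver, dsn_database, dsn_hostname, dsn_port, dsn_protocol, dsn_uid, dsn_pwd)
print(dsn)

# Create the connection
try:
    conn = ibm_db.connect(dsn, "", "")
    print("Connected to database:", dsn_database, "as user:", dsn_uid, "on host:", dsn_hostname)
except Exception:
    print("Unable to connect:", ibm_db.conn_errormsg())

# Retrieve metadata for the database server
server = ibm_db.server_info(conn)
print("DBMS_NAME:", server.DBMS_NAME)
print("DBMS_VER: ", server.DBMS_VER)
print("DB_NAME:  ", server.DB_NAME)
```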
Close the Connection
We free all resources by closing the connection. Remember that it is always important to close connections so that we can avoid unused connections taking up resources.
```python
ibm_db.close(conn)
```
Create a table
In this step we will create an INSTRUCTOR table (columns ID, FNAME, LNAME, CITY, CCODE) in the database, using:

```python
ibm_db.exec_immediate(connection, statement, options)
```

```python
# Let's first drop the table INSTRUCTOR in case it exists from a previous attempt
```
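A sketch of that drop cell; the error output below is what appears when the table does not exist yet:

```python
# Lets first drop the table INSTRUCTOR in case it exists from a previous attempt
dropQuery = "drop table INSTRUCTOR"

# Now execute the drop statement
dropStmt = ibm_db.exec_immediate(conn, dropQuery)
```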
```
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-5-83413676a2ca> in <module>
      3
      4 #Now execute the drop statment
----> 5 dropStmt = ibm_db.exec_immediate(conn, dropQuery)

Exception: [IBM][CLI Driver][DB2/LINUXX8664] SQL0204N "WLS55462.INSTRUCTOR" is an undefined name. SQLSTATE=42704 SQLCODE=-204
```
Don't worry if you get this error:
If you see an exception/error similar to the above, indicating that INSTRUCTOR is an undefined name, that's okay. It just means that the INSTRUCTOR table does not exist in the database yet, which would be the case if you had not created it previously.

```python
# Construct the Create Table DDL statement - replace the ... with rest of the statement
```
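A sketch of the completed CREATE TABLE cell; the column definitions are inferred from the rows queried later in the lab:

```python
# Construct the Create Table DDL statement
createQuery = (
    "create table INSTRUCTOR("
    "ID INTEGER PRIMARY KEY NOT NULL, "
    "FNAME VARCHAR(20), "
    "LNAME VARCHAR(20), "
    "CITY VARCHAR(20), "
    "CCODE CHAR(2))"
)

# Now execute the statement
createStmt = ibm_db.exec_immediate(conn, createQuery)
```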
Insert data into the table
In this step we will insert some rows of data into the table.
We will insert 3 rows of data into the INSTRUCTOR table we created in the previous step:
We will start by inserting just the first row of data, i.e. for instructor Rav Ahuja
```python
# Construct the query - replace ... with the insert statement
```

Now use a single query to insert the remaining two rows of data

```python
# replace ... with the insert statement that inserts the remaining two rows of data
```
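A sketch of both insert cells; the row values match the query results shown further down:

```python
# Insert the first row: instructor Rav Ahuja
insertQuery = "insert into INSTRUCTOR values (1, 'Rav', 'Ahuja', 'TORONTO', 'CA')"
insertStmt = ibm_db.exec_immediate(conn, insertQuery)

# Insert the remaining two rows in a single statement
insertQuery2 = (
    "insert into INSTRUCTOR values "
    "(2, 'Raul', 'Chong', 'Markham', 'CA'), "
    "(3, 'Hima', 'Vasudevan', 'Chicago', 'US')"
)
insertStmt2 = ibm_db.exec_immediate(conn, insertQuery2)
```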
Query data in the table
In this step we will retrieve data we inserted into the INSTRUCTOR table.
```python
# Construct the query that retrieves all rows from the INSTRUCTOR table
```
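A sketch of the select-and-fetch cell; the dictionary printed below is the first row fetched with ibm_db.fetch_both (keyed by both column name and position):

```python
# Construct the query that retrieves all rows from the INSTRUCTOR table
selectQuery = "select * from INSTRUCTOR"

# Execute the statement
selectStmt = ibm_db.exec_immediate(conn, selectQuery)

# Fetch the first row as a dictionary
ibm_db.fetch_both(selectStmt)
```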
```
{'ID': 1,
 0: 1,
 'FNAME': 'Rav',
 1: 'Rav',
 'LNAME': 'Ahuja',
 2: 'Ahuja',
 'CITY': 'TORONTO',
 3: 'TORONTO',
 'CCODE': 'CA',
 4: 'CA'}
```
```python
# Fetch the rest of the rows and print the ID and FNAME for those rows
```
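A sketch of the fetch loop using ibm_db.fetch_row and ibm_db.result, which produces the two lines below:

```python
# Fetch the rest of the rows and print the ID and FNAME for those rows
while ibm_db.fetch_row(selectStmt) != False:
    print("ID:", ibm_db.result(selectStmt, 0), "FNAME:", ibm_db.result(selectStmt, "FNAME"))
```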
ID: 2 FNAME: Raul
ID: 3 FNAME: Hima
Bonus: now write and execute an update statement that changes Rav's CITY to MOOSETOWN

```python
# Enter your code below
```
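One possible answer for the bonus, as a sketch:

```python
# Change Rav's city to MOOSETOWN
updateQuery = "update INSTRUCTOR set CITY = 'MOOSETOWN' where FNAME = 'Rav'"
updateStmt = ibm_db.exec_immediate(conn, updateQuery)
```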
Pay attention to column-name case in some of these queries; see Other tips -> Use quotes in Jupyter notebooks below for more details.
Retrieve data into Pandas
In this step we will retrieve the contents of the INSTRUCTOR table into a Pandas dataframe
```python
import pandas
```

```python
# query statement to retrieve all rows in INSTRUCTOR table
```
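A sketch of the pandas retrieval cell; ibm_db_dbi (which ships with the ibm_db package) wraps the existing connection so pandas can read from it, and pdf.LNAME[0] produces the 'Ahuja' output below:

```python
import pandas
import ibm_db_dbi

# Wrap the existing ibm_db connection for use with pandas
pconn = ibm_db_dbi.Connection(conn)

# Query statement to retrieve all rows in the INSTRUCTOR table
selectQuery = "select * from INSTRUCTOR"

# Retrieve the query results into a pandas dataframe
pdf = pandas.read_sql(selectQuery, pconn)

# Print just the LNAME for the first row
pdf.LNAME[0]
```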
'Ahuja'
```python
# print the entire data frame
pdf
```

| | ID | FNAME | LNAME | CITY | CCODE |
|---|---|---|---|---|---|
0 | 1 | Rav | Ahuja | MOOSETOWN | CA |
1 | 2 | Raul | Chong | Markham | CA |
2 | 3 | Hima | Vasudevan | Chicago | US |
For example, you can use the shape attribute to see how many rows and columns are in the dataframe:

```python
pdf.shape
```
(3, 5)
SQL Magic
Introduction
Jupyter notebooks have a concept of Magic commands that can simplify working with Python, and are particularly useful for data analysis. Your notebooks can have two types of magic commands:
- Cell magics: start with a double %% sign and apply to the entire cell
- Line magics: start with a single % (percent) sign and apply to a particular line in a cell

Their usage is of the format: %magicname arguments
So far in the course you learned to access data from a database using the Python DB-API (and specifically ibm_db). With this API, executing queries and fetching their results involves multiple steps. You can use the SQL Magic commands to execute queries more easily.
For example, if you want to execute a query to select some data from a table and fetch its results, you can simply enter a command like the following in your Jupyter notebook cell: %sql select * from tablename
Although SQL magic simplifies working with databases, it has some limitations. For example, unlike DB-API, there are no explicit methods to close a connection and free up resources.
Below, we'll use the %load_ext magic to load the ipython-sql extension.

```python
%load_ext sql
```
Now we have access to SQL magic. With our first SQL magic command, we’ll connect to a Db2 database. However, in order to do that, you’ll first need to retrieve or create your credentials to access your Db2 database.
Your connection string comes from your Db2 service credentials if you're using Db2 on IBM Cloud. If you're using another host, the format is: username:password@hostname:port/database-name

```python
# Enter your Db2 credentials in the connection string below
```
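A sketch of the connection magic with placeholder credentials:

```python
# Enter your Db2 credentials in the connection string below
# (format: ibm_db_sa://username:password@hostname:port/database-name)
%sql ibm_db_sa://my-username:my-password@my-hostname:50000/BLUDB
```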
'Connected: wls55462@BLUDB'
For convenience, we can use %%sql (two %'s instead of one) at the top of a cell to indicate we want the entire cell to be treated as SQL. Let's use this to create a table and fill it with some test data for experimenting.

```python
%%sql
```
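A sketch of that cell, assuming the INTERNATIONAL_STUDENT_TEST_SCORES table used later in these notes; only a few of the 99 rows are shown, taken from the query output further down:

```sql
%%sql
CREATE TABLE INTERNATIONAL_STUDENT_TEST_SCORES (
    country VARCHAR(50),
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    test_score INT
);
INSERT INTO INTERNATIONAL_STUDENT_TEST_SCORES (country, first_name, last_name, test_score)
VALUES
    ('Canada', 'Cristionna', 'Wadmore', 46),
    ('Canada', 'Wilhelm', 'Deeprose', 54),
    ('Canada', 'Carma', 'Schule', 49);
```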
* ibm_db_sa://wls55462:***@dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net:50000/BLUDB
Done.
99 rows affected.
Splitting into multiple lines in Jupyter
Use a backslash \ to split the query into multiple lines:

```python
%sql select "Id", "Name_of_Dog", \
```

Or use %%sql in the first row of the cell in the notebook:

```python
%%sql
```
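A sketch of the line-magic form with backslash continuations, using the DOGS columns from the Other tips section below:

```python
%sql select "Id", "Name_of_Dog", \
     "Breed__dominant_breed_if_not_pure_breed_" \
     from dogs
```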
Using Python Variables in your SQL Statements
You can use Python variables in your SQL statements by adding a ":" prefix to your Python variable names.
For example, if I have a Python variable country with a value of "Canada", I can use this variable in a SQL query to find all the rows of students from Canada.

```python
country = "Canada"
%sql select * from INTERNATIONAL_STUDENT_TEST_SCORES where country = :country
```
* ibm_db_sa://wls55462:***@dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net:50000/BLUDB
Done.
country | first_name | last_name | test_score |
---|---|---|---|
Canada | Cristionna | Wadmore | 46 |
Canada | Wilhelm | Deeprose | 54 |
Canada | Carma | Schule | 49 |
Assigning the Results of Queries to Python Variables
You can use the normal Python assignment syntax to assign the results of your queries to Python variables.
For example, I have a SQL query to retrieve the distribution of test scores (i.e. how many students got each score). I can assign the result of this query to the variable test_score_distribution using the = operator.

```python
test_score_distribution = %sql SELECT test_score as "Test Score", count(*) as "Frequency" from INTERNATIONAL_STUDENT_TEST_SCORES GROUP BY test_score;
```
* ibm_db_sa://wls55462:***@dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net:50000/BLUDB
Done.
Test Score | Frequency |
---|---|
38 | 2 |
40 | 1 |
... | ... |
Converting Query Results to DataFrames
You can easily convert a SQL query result to a pandas dataframe using the DataFrame() method. Dataframe objects are much more versatile than SQL query result objects. For example, we can easily graph our test score distribution after converting to a dataframe.

```python
dataframe = test_score_distribution.DataFrame()
```
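A sketch of one way to graph the distribution after the conversion, assuming seaborn and matplotlib are available:

```python
import matplotlib.pyplot as plt
import seaborn as sns

dataframe = test_score_distribution.DataFrame()

# Bar chart of how many students got each test score
plot = sns.barplot(x='Test Score', y='Frequency', data=dataframe)
plt.show()
```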
Now you know how to work with Db2 from within JupyterLab notebooks using SQL “magic”!
```python
%%sql
```
Store the dataset in a Table
In many cases the dataset to be analyzed is available as a .CSV (comma separated values) file, perhaps on the internet. To analyze the data using SQL, it first needs to be stored in the database.
We will first read the dataset source .CSV from the internet into a pandas dataframe.
Then we need to create a table in our Db2 database to store the dataset. The PERSIST command in SQL "magic" simplifies the process of table creation and writing the data from a pandas dataframe into the table.
Connect to the database
```python
%load_ext sql
```
Store the dataset in a Table
```python
import pandas
```
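A sketch of the cell, assuming the Chicago socioeconomic indicators CSV used in the lab (the URL comes from the course lab and may change):

```python
import pandas

# Read the source .CSV from the internet into a pandas dataframe
chicago_socioeconomic_data = pandas.read_csv(
    'https://data.cityofchicago.org/resource/jcxq-k9xf.csv'
)

# PERSIST creates a table named after the dataframe and writes the dataframe into it
%sql PERSIST chicago_socioeconomic_data
```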
* ibm_db_sa://wls55462:***@dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net:50000/BLUDB
You can verify that the table creation was successful by making a basic query like:
```python
%sql SELECT * FROM chicago_socioeconomic_data limit 5;
```
* ibm_db_sa://wls55462:***@dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net:50000/BLUDB
Done.
index | ca | community_area_name | percent_of_housing_crowded | percent_households_below_poverty | percent_aged_16_unemployed | percent_aged_25_without_high_school_diploma | percent_aged_under_18_or_over_64 | per_capita_income_ | hardship_index |
---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | Rogers Park | 7.7 | 23.6 | 8.7 | 18.2 | 27.5 | 23939 | 39.0 |
1 | 2.0 | West Ridge | 7.8 | 17.2 | 8.8 | 20.8 | 38.5 | 23040 | 46.0 |
2 | 3.0 | Uptown | 3.8 | 24.0 | 8.9 | 11.8 | 22.2 | 35787 | 20.0 |
3 | 4.0 | Lincoln Square | 3.4 | 10.9 | 8.2 | 13.4 | 25.5 | 37524 | 17.0 |
4 | 5.0 | North Center | 0.3 | 7.5 | 5.2 | 4.5 | 26.2 | 57123 | 6.0 |
Query the system catalog to get a list of tables and their properties:

```python
%load_ext sql
```

You can verify that the table creation was successful by retrieving the list of all tables in your schema and checking whether the SCHOOLS table was created.

```python
%sql select * from syscat.tables LIMIT 3
```
tabschema | tabname | owner | ownertype | TYPE | status | base_tabschema | base_tabname | rowtypeschema | rowtypename | create_time | alter_time | invalidate_time | stats_time | colcount | tableid | tbspaceid | card | npages | mpages | fpages | npartitions | nfiles | tablesize | overflow | tbspace | index_tbspace | long_tbspace | parents | children | selfrefs | keycolumns | keyindexid | keyunique | checkcount | datacapture | const_checked | pmap_id | partition_mode | log_attribute | pctfree | append_mode | REFRESH | refresh_time | LOCKSIZE | VOLATILE | row_format | property | statistics_profile | compression | rowcompmode | access_mode | clustered | active_blocks | droprule | maxfreespacesearch | avgcompressedrowsize | avgrowcompressionratio | avgrowsize | pctrowscompressed | logindexbuild | codepage | collationschema | collationname | collationschema_orderby | collationname_orderby | encoding_scheme | pctpagessaved | last_regen_time | secpolicyid | protectiongranularity | auditpolicyid | auditpolicyname | auditexceptionenabled | definer | oncommit | logged | onrollback | lastused | control | temporaltype | tableorg | extended_row_size | pctextendedrows | remarks |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SYSIBM | SYSTABLES | SYSIBM | S | T | N | None | None | None | None | 2018-03-13 03:17:18.009191 | 2018-03-13 03:17:18.009191 | 2018-03-13 03:30:12.705161 | 2021-01-26 20:35:22.785259 | 83 | 5 | 0 | 14489 | 554 | 0 | 786 | -1 | -1 | -1 | 73 | SYSCATSPACE | None | None | 0 | 0 | 0 | 0 | 0 | 0 | 0 | N | YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY | 0 | 0 | -1 | N | None | R | N | None | N | F | None | 0 | 999 | 0 | 0.0 | 512 | 0.0 | None | 1208 | SYSIBM | IDENTITY | SYSIBM | IDENTITY | 0 | 2018-03-13 03:17:18.009191 | 0 | None | None | N | SYSIBM | 2021-01-25 | R | N | R | N | -1.0 | None | |||||||||||
SYSIBM | SYSCOLUMNS | SYSIBM | S | T | N | None | None | None | None | 2018-03-13 03:17:18.009191 | 2018-03-13 03:17:18.009191 | 2018-03-13 03:30:02.043364 | 2021-01-26 12:25:23.950308 | 45 | 6 | 0 | 206681 | 3126 | 0 | 5862 | -1 | -1 | -1 | 1373 | SYSCATSPACE | None | None | 0 | 0 | 0 | 0 | 0 | 0 | 0 | N | YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY | 0 | 0 | -1 | N | None | R | N | None | N | F | None | 0 | 999 | 0 | 0.0 | 248 | 0.0 | None | 1208 | SYSIBM | IDENTITY | SYSIBM | IDENTITY | 0 | 2018-03-13 03:17:18.009191 | 0 | None | None | N | SYSIBM | 2021-01-26 | R | N | R | N | -1.0 | None | |||||||||||
SYSIBM | SYSINDEXES | SYSIBM | S | T | N | None | None | None | None | 2018-03-13 03:17:18.009191 | 2018-03-13 03:17:18.009191 | 2018-03-13 03:30:05.393131 | 2021-01-26 18:55:22.824781 | 71 | 7 | 0 | 9909 | 303 | 0 | 404 | -1 | -1 | -1 | 32 | SYSCATSPACE | None | None | 0 | 0 | 0 | 0 | 0 | 0 | 0 | N | YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY | 0 | 0 | -1 | N | None | R | N | None | N | F | None | 0 | 999 | 0 | 0.0 | 402 | 0.0 | None | 1208 | SYSIBM | IDENTITY | SYSIBM | IDENTITY | 0 | 2018-03-13 03:17:18.009191 | 0 | None | None | N | SYSIBM | 2021-01-26 | R | N | R | N | -1.0 | None |
Type in your query to retrieve the list of all tables in the database for your Db2 schema (username):

```python
# In Db2 the system catalog table called SYSCAT.TABLES contains the table metadata
```
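A sketch of that query; replace the schema with your own Db2 username:

```python
# In Db2 the system catalog table called SYSCAT.TABLES contains the table metadata
%sql select TABSCHEMA, TABNAME, CREATE_TIME from SYSCAT.TABLES where TABSCHEMA = 'WLS55462'
```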
tabschema | tabname | create_time |
---|---|---|
WLS55462 | EMPLOYEES | 2021-01-24 11:14:20.259934 |
WLS55462 | JOB_HISTORY | 2021-01-24 11:14:20.445895 |
WLS55462 | LOCATIONS | 2021-01-24 11:14:20.941819 |
Or, you can retrieve the list of all tables where the schema name is not one of the system-created ones:

```python
%sql select TABSCHEMA, TABNAME, CREATE_TIME from SYSCAT.TABLES \
     where TABSCHEMA not in ('SYSIBM', 'SYSCAT', 'SYSSTAT', 'SYSIBMADM', 'SYSTOOLS', 'SYSPUBLIC')
```
tabschema | tabname | create_time |
---|---|---|
WLS55462 | EMPLOYEES | 2021-01-24 11:14:20.259934 |
WLS55462 | JOB_HISTORY | 2021-01-24 11:14:20.445895 |
WLS55462 | LOCATIONS | 2021-01-24 11:14:20.941819 |
Or, just query for a specific table that you want to verify exists in the database:

```python
%sql select * from SYSCAT.TABLES where TABNAME = 'SCHOOLS'
```
tabschema | tabname | owner | ownertype | TYPE | status | base_tabschema | base_tabname | rowtypeschema | rowtypename | create_time | alter_time | invalidate_time | stats_time | colcount | tableid | tbspaceid | card | npages | mpages | fpages | npartitions | nfiles | tablesize | overflow | tbspace | index_tbspace | long_tbspace | parents | children | selfrefs | keycolumns | keyindexid | keyunique | checkcount | datacapture | const_checked | pmap_id | partition_mode | log_attribute | pctfree | append_mode | REFRESH | refresh_time | LOCKSIZE | VOLATILE | row_format | property | statistics_profile | compression | rowcompmode | access_mode | clustered | active_blocks | droprule | maxfreespacesearch | avgcompressedrowsize | avgrowcompressionratio | avgrowsize | pctrowscompressed | logindexbuild | codepage | collationschema | collationname | collationschema_orderby | collationname_orderby | encoding_scheme | pctpagessaved | last_regen_time | secpolicyid | protectiongranularity | auditpolicyid | auditpolicyname | auditexceptionenabled | definer | oncommit | logged | onrollback | lastused | control | temporaltype | tableorg | extended_row_size | pctextendedrows | remarks |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
WLS55462 | SCHOOLS | WLS55462 | U | T | N | None | None | None | None | 2021-01-26 03:34:00.898765 | 2021-01-26 03:34:00.898765 | 2021-01-26 03:34:00.898765 | None | 78 | 13 | 833 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | wls55462space1 | None | None | 0 | 0 | 0 | 0 | 0 | 0 | 0 | N | YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY | 1 | 0 | -1 | N | None | R | N | None | N | F | None | -1 | N | 999 | -1 | -1.0 | -1 | -1.0 | None | 1208 | SYSIBM | IDENTITY | SYSIBM | IDENTITY | -1 | 2021-01-26 03:34:00.898765 | 0 | None | None | N | WLS55462 | 0001-01-01 | N | R | N | -1.0 | None |
To obtain the column names query syscat.columns:
```python
%sql select * from SYSCAT.COLUMNS where TABNAME = 'SCHOOLS' LIMIT 3
```
tabschema | tabname | colname | colno | typeschema | typename | length | scale | typestringunits | stringunitslength | DEFAULT | NULLS | codepage | collationschema | collationname | logged | compact | colcard | high2key | low2key | avgcollen | keyseq | partkeyseq | nquantiles | nmostfreq | numnulls | target_typeschema | target_typename | scope_tabschema | scope_tabname | source_tabschema | source_tabname | dl_features | special_props | hidden | inline_length | pctinlined | IDENTITY | rowchangetimestamp | GENERATED | text | compress | avgdistinctperpage | pagevarianceratio | sub_count | sub_delim_length | avgcollenchar | implicitvalue | seclabelname | rowbegin | rowend | transactionstartid | pctencoded | avgencodedcollen | qualifier | func_path | randdistkey | remarks |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
WLS55462 | SCHOOLS | ZIP_Code | 6 | SYSIBM | INTEGER | 4 | 0 | None | None | None | Y | 0 | None | None | 51 | 60707 | 60607 | 5 | None | 0 | 20 | 10 | 0 | None | None | None | None | None | None | None | None | 0 | -1 | N | N | None | O | None | -1.0 | -1 | -1 | -1 | None | None | N | N | N | -1 | -1.0 | None | None | N | None | ||||
WLS55462 | SCHOOLS | Phone_Number | 7 | SYSIBM | VARCHAR | 14 | 0 | OCTETS | 14 | None | Y | 1208 | SYSIBM | IDENTITY | 566 | '(773) 535-9930' | '(773) 534-0146' | 19 | None | 0 | 20 | 10 | 0 | None | None | None | None | None | None | None | None | 0 | -1 | N | N | None | O | None | -1.0 | -1 | -1 | 14 | None | None | N | N | N | -1 | -1.0 | None | None | N | None | ||||
WLS55462 | SCHOOLS | Link | 8 | SYSIBM | VARCHAR | 78 | 0 | OCTETS | 78 | None | Y | 1208 | SYSIBM | IDENTITY | 2 | '' | 'http://schoolreports.cps.edu/Scho' | 83 | None | 0 | 20 | 10 | 1 | None | None | None | None | None | None | None | None | 0 | -1 | N | N | None | O | None | -1.0 | -1 | -1 | 78 | None | None | N | N | N | -1 | -1.0 | None | None | N | None |
To obtain specific column properties:
```python
%sql select distinct(name), coltype, length \
     from SYSIBM.SYSCOLUMNS where TBNAME = 'SCHOOLS'
```
name | coltype | length |
---|---|---|
10th Grade PLAN (2009) | VARCHAR | 4 |
10th Grade PLAN (2010) | VARCHAR | 4 |
11th Grade Average ACT (2011) | VARCHAR | 4 |
Other tips
Query column names with mixed case
Retrieve the Id column from the DOGS table. Try:

```sql
select id from DOGS
```

If you run this query, you will get this error:

```
Error: "ID" is not valid in the context where it is used .. SQLCODE=-206, SQLSTATE=42703, DRIVER=4.22.36
```

Use double quotes to specify mixed-case column names:

```sql
select "Id" from DOGS
```
Query column names with spaces and special characters
By default, spaces are mapped to underscores, e.g. Name of Dog -> Name_of_Dog.
Other characters may also get mapped to underscores, e.g. Breed (dominant breed if not pure breed) -> Breed__dominant_breed_if_not_pure_breed_.

```sql
select "Id", "Name_of_Dog", "Breed__dominant_breed_if_not_pure_breed_" from dogs
```
Use quotes in Jupyter notebooks
First assign queries to variables:
```python
selectQuery = 'select "Id" from dogs'
```
Use a backslash \ as the escape character in cases where the query contains single quotes, for example (the dog name is just an illustrative value):

```python
selectQuery = 'select * from dogs where "Name_of_Dog"=\'Huggy\''
```