I am trying to describe the schema of this kaggle dataset.
Are there primary keys?
If there are primary keys, how do I figure out which they are?
Which columns are foreign keys?
# Reconnect to the database
db_path = os.path.join(path, db_filename)
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
# Loop through tables and print schema
for i, table_name in enumerate(tables):
print(f"\n Schema for {table_name[0]} (table {i+1}):")
#finding PKs
cursor.execute(f"PRAGMA table_info({table_name[0]})")
columns = cursor.fetchall()
# Filter out columns where the primary key flag (5th column) is non-zero
primary_keys = [col[1] for col in columns if col[5] > 0]
print(f"Primary keys for {table_name}: {primary_keys}")
cursor.execute(f"PRAGMA table_info({table_name[0]});")
schema = cursor.fetchall()
for column in schema:
print(column)
conn.close()
sample of the output: (with empty lists as PKs) (I looked at the create table statements for each of these tables and they also have no PKs)
Schema for game (table 1):
Primary keys for ('game',): []
(0, 'season_id', 'TEXT', 0, None, 0)
(1, 'team_id_home', 'TEXT', 0, None, 0)
(2, 'team_abbreviation_home', 'TEXT', 0, None, 0)
(3, 'team_name_home', 'TEXT', 0, None, 0)
(4, 'game_id', 'TEXT', 0, None, 0)
(5, 'game_date', 'TIMESTAMP', 0, None, 0)
(6, 'matchup_home', 'TEXT', 0, None, 0)
(7, 'wl_home', 'TEXT', 0, None, 0)
(8, 'min', 'INTEGER', 0, None, 0)
(9, 'fgm_home', 'REAL', 0, None, 0)
(10, 'fga_home', 'REAL', 0, None, 0)
(11, 'fg_pct_home', 'REAL', 0, None, 0)
(12, 'fg3m_home', 'REAL', 0, None, 0)
(13, 'fg3a_home', 'REAL', 0, None, 0)
(14, 'fg3_pct_home', 'REAL', 0, None, 0)
(15, 'ftm_home', 'REAL', 0, None, 0)
(16, 'fta_home', 'REAL', 0, None, 0)
(17, 'ft_pct_home', 'REAL', 0, None, 0)
(18, 'oreb_home', 'REAL', 0, None, 0)
(19, 'dreb_home', 'REAL', 0, None, 0)
(20, 'reb_home', 'REAL', 0, None, 0)
(21, 'ast_home', 'REAL', 0, None, 0)
(22, 'stl_home', 'REAL', 0, None, 0)
(23, 'blk_home', 'REAL', 0, None, 0)
(24, 'tov_home', 'REAL', 0, None, 0)
(25, 'pf_home', 'REAL', 0, None, 0)
(26, 'pts_home', 'REAL', 0, None, 0)
(27, 'plus_minus_home', 'INTEGER', 0, None, 0)
(28, 'video_available_home', 'INTEGER', 0, None, 0)
(29, 'team_id_away', 'TEXT', 0, None, 0)
(30, 'team_abbreviation_away', 'TEXT', 0, None, 0)
(31, 'team_name_away', 'TEXT', 0, None, 0)
(32, 'matchup_away', 'TEXT', 0, None, 0)
(33, 'wl_away', 'TEXT', 0, None, 0)
(34, 'fgm_away', 'REAL', 0, None, 0)
(35, 'fga_away', 'REAL', 0, None, 0)
(36, 'fg_pct_away', 'REAL', 0, None, 0)
(37, 'fg3m_away', 'REAL', 0, None, 0)
(38, 'fg3a_away', 'REAL', 0, None, 0)
(39, 'fg3_pct_away', 'REAL', 0, None, 0)
(40, 'ftm_away', 'REAL', 0, None, 0)
(41, 'fta_away', 'REAL', 0, None, 0)
(42, 'ft_pct_away', 'REAL', 0, None, 0)
(43, 'oreb_away', 'REAL', 0, None, 0)
(44, 'dreb_away', 'REAL', 0, None, 0)
(45, 'reb_away', 'REAL', 0, None, 0)
(46, 'ast_away', 'REAL', 0, None, 0)
(47, 'stl_away', 'REAL', 0, None, 0)
(48, 'blk_away', 'REAL', 0, None, 0)
(49, 'tov_away', 'REAL', 0, None, 0)
(50, 'pf_away', 'REAL', 0, None, 0)
(51, 'pts_away', 'REAL', 0, None, 0)
(52, 'plus_minus_away', 'INTEGER', 0, None, 0)
(53, 'video_available_away', 'INTEGER', 0, None, 0)
(54, 'season_type', 'TEXT', 0, None, 0)
Schema for game_summary (table 2):
Primary keys for ('game_summary',): []
(0, 'game_date_est', 'TIMESTAMP', 0, None, 0)
(1, 'game_sequence', 'INTEGER', 0, None, 0)
(2, 'game_id', 'TEXT', 0, None, 0)
(3, 'game_status_id', 'INTEGER', 0, None, 0)
(4, 'game_status_text', 'TEXT', 0, None, 0)
(5, 'gamecode', 'TEXT', 0, None, 0)
(6, 'home_team_id', 'TEXT', 0, None, 0)
(7, 'visitor_team_id', 'TEXT', 0, None, 0)
(8, 'season', 'TEXT', 0, None, 0)
(9, 'live_period', 'INTEGER', 0, None, 0)
(10, 'live_pc_time', 'TEXT', 0, None, 0)
(11, 'natl_tv_broadcaster_abbreviation', 'TEXT', 0, None, 0)
(12, 'live_period_time_bcast', 'TEXT', 0, None, 0)
(13, 'wh_status', 'INTEGER', 0, None, 0)
Schema for other_stats (table 3):
Primary keys for ('other_stats',): []
(0, 'game_id', 'TEXT', 0, None, 0)
(1, 'league_id', 'TEXT', 0, None, 0)
(2, 'team_id_home', 'TEXT', 0, None, 0)
(3, 'team_abbreviation_home', 'TEXT', 0, None, 0)
(4, 'team_city_home', 'TEXT', 0, None, 0)
(5, 'pts_paint_home', 'INTEGER', 0, None, 0)
(6, 'pts_2nd_chance_home', 'INTEGER', 0, None, 0)
(7, 'pts_fb_home', 'INTEGER', 0, None, 0)
(8, 'largest_lead_home', 'INTEGER', 0, None, 0)
(9, 'lead_changes', 'INTEGER', 0, None, 0)
(10, 'times_tied', 'INTEGER', 0, None, 0)
(11, 'team_turnovers_home', 'INTEGER', 0, None, 0)
(12, 'total_turnovers_home', 'INTEGER', 0, None, 0)
(13, 'team_rebounds_home', 'INTEGER', 0, None, 0)
(14, 'pts_off_to_home', 'INTEGER', 0, None, 0)
(15, 'team_id_away', 'TEXT', 0, None, 0)
(16, 'team_abbreviation_away', 'TEXT', 0, None, 0)
(17, 'team_city_away', 'TEXT', 0, None, 0)
(18, 'pts_paint_away', 'INTEGER', 0, None, 0)
(19, 'pts_2nd_chance_away', 'INTEGER', 0, None, 0)
(20, 'pts_fb_away', 'INTEGER', 0, None, 0)
(21, 'largest_lead_away', 'INTEGER', 0, None, 0)
(22, 'team_turnovers_away', 'INTEGER', 0, None, 0)
(23, 'total_turnovers_away', 'INTEGER', 0, None, 0)
(24, 'team_rebounds_away', 'INTEGER', 0, None, 0)
(25, 'pts_off_to_away', 'INTEGER', 0, None, 0)
Schema for officials (table 4):
Primary keys for ('officials',): []
(0, 'game_id', 'TEXT', 0, None, 0)
(1, 'official_id', 'TEXT', 0, None, 0)
(2, 'first_name', 'TEXT', 0, None, 0)
(3, 'last_name', 'TEXT', 0, None, 0)
(4, 'jersey_num', 'TEXT', 0, None, 0)
Schema for inactive_players (table 5):
Primary keys for ('inactive_players',): []
(0, 'game_id', 'TEXT', 0, None, 0)
(1, 'player_id', 'TEXT', 0, None, 0)
(2, 'first_name', 'TEXT', 0, None, 0)
(3, 'last_name', 'TEXT', 0, None, 0)
(4, 'jersey_num', 'TEXT', 0, None, 0)
(5, 'team_id', 'TEXT', 0, None, 0)
(6, 'team_city', 'TEXT', 0, None, 0)
(7, 'team_name', 'TEXT', 0, None, 0)
(8, 'team_abbreviation', 'TEXT', 0, None, 0)
Schema for game_info (table 6):
Primary keys for ('game_info',): []
(0, 'game_id', 'TEXT', 0, None, 0)
(1, 'game_date', 'TIMESTAMP', 0, None, 0)
(2, 'attendance', 'INTEGER', 0, None, 0)
(3, 'game_time', 'TEXT', 0, None, 0)
Schema for line_score (table 7):
Primary keys for ('line_score',): []
(0, 'game_date_est', 'TIMESTAMP', 0, None, 0)
(1, 'game_sequence', 'INTEGER', 0, None, 0)
(2, 'game_id', 'TEXT', 0, None, 0)
(3, 'team_id_home', 'TEXT', 0, None, 0)
(4, 'team_abbreviation_home', 'TEXT', 0, None, 0)
(5, 'team_city_name_home', 'TEXT', 0, None, 0)
(6, 'team_nickname_home', 'TEXT', 0, None, 0)
(7, 'team_wins_losses_home', 'TEXT', 0, None, 0)
(8, 'pts_qtr1_home', 'TEXT', 0, None, 0)
(9, 'pts_qtr2_home', 'TEXT', 0, None, 0)
(10, 'pts_qtr3_home', 'TEXT', 0, None, 0)
(11, 'pts_qtr4_home', 'TEXT', 0, None, 0)
(12, 'pts_ot1_home', 'INTEGER', 0, None, 0)
(13, 'pts_ot2_home', 'INTEGER', 0, None, 0)
(14, 'pts_ot3_home', 'INTEGER', 0, None, 0)
(15, 'pts_ot4_home', 'INTEGER', 0, None, 0)
(16, 'pts_ot5_home', 'INTEGER', 0, None, 0)
(17, 'pts_ot6_home', 'INTEGER', 0, None, 0)
...
I am trying to describe the schema of this kaggle dataset.
Are there primary keys?
If there are primary keys, how do I figure out which they are?
Which columns are foreign keys?
# Reconnect to the database
db_path = os.path.join(path, db_filename)
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
# Loop through tables and print schema
for i, table_name in enumerate(tables):
print(f"\n Schema for {table_name[0]} (table {i+1}):")
#finding PKs
cursor.execute(f"PRAGMA table_info({table_name[0]})")
columns = cursor.fetchall()
# Filter out columns where the primary key flag (5th column) is non-zero
primary_keys = [col[1] for col in columns if col[5] > 0]
print(f"Primary keys for {table_name}: {primary_keys}")
cursor.execute(f"PRAGMA table_info({table_name[0]});")
schema = cursor.fetchall()
for column in schema:
print(column)
conn.close()
sample of the output: (with empty lists as PKs) (I looked at the create table statements for each of these tables and they also have no PKs)
Schema for game (table 1):
Primary keys for ('game',): []
(0, 'season_id', 'TEXT', 0, None, 0)
(1, 'team_id_home', 'TEXT', 0, None, 0)
(2, 'team_abbreviation_home', 'TEXT', 0, None, 0)
(3, 'team_name_home', 'TEXT', 0, None, 0)
(4, 'game_id', 'TEXT', 0, None, 0)
(5, 'game_date', 'TIMESTAMP', 0, None, 0)
(6, 'matchup_home', 'TEXT', 0, None, 0)
(7, 'wl_home', 'TEXT', 0, None, 0)
(8, 'min', 'INTEGER', 0, None, 0)
(9, 'fgm_home', 'REAL', 0, None, 0)
(10, 'fga_home', 'REAL', 0, None, 0)
(11, 'fg_pct_home', 'REAL', 0, None, 0)
(12, 'fg3m_home', 'REAL', 0, None, 0)
(13, 'fg3a_home', 'REAL', 0, None, 0)
(14, 'fg3_pct_home', 'REAL', 0, None, 0)
(15, 'ftm_home', 'REAL', 0, None, 0)
(16, 'fta_home', 'REAL', 0, None, 0)
(17, 'ft_pct_home', 'REAL', 0, None, 0)
(18, 'oreb_home', 'REAL', 0, None, 0)
(19, 'dreb_home', 'REAL', 0, None, 0)
(20, 'reb_home', 'REAL', 0, None, 0)
(21, 'ast_home', 'REAL', 0, None, 0)
(22, 'stl_home', 'REAL', 0, None, 0)
(23, 'blk_home', 'REAL', 0, None, 0)
(24, 'tov_home', 'REAL', 0, None, 0)
(25, 'pf_home', 'REAL', 0, None, 0)
(26, 'pts_home', 'REAL', 0, None, 0)
(27, 'plus_minus_home', 'INTEGER', 0, None, 0)
(28, 'video_available_home', 'INTEGER', 0, None, 0)
(29, 'team_id_away', 'TEXT', 0, None, 0)
(30, 'team_abbreviation_away', 'TEXT', 0, None, 0)
(31, 'team_name_away', 'TEXT', 0, None, 0)
(32, 'matchup_away', 'TEXT', 0, None, 0)
(33, 'wl_away', 'TEXT', 0, None, 0)
(34, 'fgm_away', 'REAL', 0, None, 0)
(35, 'fga_away', 'REAL', 0, None, 0)
(36, 'fg_pct_away', 'REAL', 0, None, 0)
(37, 'fg3m_away', 'REAL', 0, None, 0)
(38, 'fg3a_away', 'REAL', 0, None, 0)
(39, 'fg3_pct_away', 'REAL', 0, None, 0)
(40, 'ftm_away', 'REAL', 0, None, 0)
(41, 'fta_away', 'REAL', 0, None, 0)
(42, 'ft_pct_away', 'REAL', 0, None, 0)
(43, 'oreb_away', 'REAL', 0, None, 0)
(44, 'dreb_away', 'REAL', 0, None, 0)
(45, 'reb_away', 'REAL', 0, None, 0)
(46, 'ast_away', 'REAL', 0, None, 0)
(47, 'stl_away', 'REAL', 0, None, 0)
(48, 'blk_away', 'REAL', 0, None, 0)
(49, 'tov_away', 'REAL', 0, None, 0)
(50, 'pf_away', 'REAL', 0, None, 0)
(51, 'pts_away', 'REAL', 0, None, 0)
(52, 'plus_minus_away', 'INTEGER', 0, None, 0)
(53, 'video_available_away', 'INTEGER', 0, None, 0)
(54, 'season_type', 'TEXT', 0, None, 0)
Schema for game_summary (table 2):
Primary keys for ('game_summary',): []
(0, 'game_date_est', 'TIMESTAMP', 0, None, 0)
(1, 'game_sequence', 'INTEGER', 0, None, 0)
(2, 'game_id', 'TEXT', 0, None, 0)
(3, 'game_status_id', 'INTEGER', 0, None, 0)
(4, 'game_status_text', 'TEXT', 0, None, 0)
(5, 'gamecode', 'TEXT', 0, None, 0)
(6, 'home_team_id', 'TEXT', 0, None, 0)
(7, 'visitor_team_id', 'TEXT', 0, None, 0)
(8, 'season', 'TEXT', 0, None, 0)
(9, 'live_period', 'INTEGER', 0, None, 0)
(10, 'live_pc_time', 'TEXT', 0, None, 0)
(11, 'natl_tv_broadcaster_abbreviation', 'TEXT', 0, None, 0)
(12, 'live_period_time_bcast', 'TEXT', 0, None, 0)
(13, 'wh_status', 'INTEGER', 0, None, 0)
Schema for other_stats (table 3):
Primary keys for ('other_stats',): []
(0, 'game_id', 'TEXT', 0, None, 0)
(1, 'league_id', 'TEXT', 0, None, 0)
(2, 'team_id_home', 'TEXT', 0, None, 0)
(3, 'team_abbreviation_home', 'TEXT', 0, None, 0)
(4, 'team_city_home', 'TEXT', 0, None, 0)
(5, 'pts_paint_home', 'INTEGER', 0, None, 0)
(6, 'pts_2nd_chance_home', 'INTEGER', 0, None, 0)
(7, 'pts_fb_home', 'INTEGER', 0, None, 0)
(8, 'largest_lead_home', 'INTEGER', 0, None, 0)
(9, 'lead_changes', 'INTEGER', 0, None, 0)
(10, 'times_tied', 'INTEGER', 0, None, 0)
(11, 'team_turnovers_home', 'INTEGER', 0, None, 0)
(12, 'total_turnovers_home', 'INTEGER', 0, None, 0)
(13, 'team_rebounds_home', 'INTEGER', 0, None, 0)
(14, 'pts_off_to_home', 'INTEGER', 0, None, 0)
(15, 'team_id_away', 'TEXT', 0, None, 0)
(16, 'team_abbreviation_away', 'TEXT', 0, None, 0)
(17, 'team_city_away', 'TEXT', 0, None, 0)
(18, 'pts_paint_away', 'INTEGER', 0, None, 0)
(19, 'pts_2nd_chance_away', 'INTEGER', 0, None, 0)
(20, 'pts_fb_away', 'INTEGER', 0, None, 0)
(21, 'largest_lead_away', 'INTEGER', 0, None, 0)
(22, 'team_turnovers_away', 'INTEGER', 0, None, 0)
(23, 'total_turnovers_away', 'INTEGER', 0, None, 0)
(24, 'team_rebounds_away', 'INTEGER', 0, None, 0)
(25, 'pts_off_to_away', 'INTEGER', 0, None, 0)
Schema for officials (table 4):
Primary keys for ('officials',): []
(0, 'game_id', 'TEXT', 0, None, 0)
(1, 'official_id', 'TEXT', 0, None, 0)
(2, 'first_name', 'TEXT', 0, None, 0)
(3, 'last_name', 'TEXT', 0, None, 0)
(4, 'jersey_num', 'TEXT', 0, None, 0)
Schema for inactive_players (table 5):
Primary keys for ('inactive_players',): []
(0, 'game_id', 'TEXT', 0, None, 0)
(1, 'player_id', 'TEXT', 0, None, 0)
(2, 'first_name', 'TEXT', 0, None, 0)
(3, 'last_name', 'TEXT', 0, None, 0)
(4, 'jersey_num', 'TEXT', 0, None, 0)
(5, 'team_id', 'TEXT', 0, None, 0)
(6, 'team_city', 'TEXT', 0, None, 0)
(7, 'team_name', 'TEXT', 0, None, 0)
(8, 'team_abbreviation', 'TEXT', 0, None, 0)
Schema for game_info (table 6):
Primary keys for ('game_info',): []
(0, 'game_id', 'TEXT', 0, None, 0)
(1, 'game_date', 'TIMESTAMP', 0, None, 0)
(2, 'attendance', 'INTEGER', 0, None, 0)
(3, 'game_time', 'TEXT', 0, None, 0)
Schema for line_score (table 7):
Primary keys for ('line_score',): []
(0, 'game_date_est', 'TIMESTAMP', 0, None, 0)
(1, 'game_sequence', 'INTEGER', 0, None, 0)
(2, 'game_id', 'TEXT', 0, None, 0)
(3, 'team_id_home', 'TEXT', 0, None, 0)
(4, 'team_abbreviation_home', 'TEXT', 0, None, 0)
(5, 'team_city_name_home', 'TEXT', 0, None, 0)
(6, 'team_nickname_home', 'TEXT', 0, None, 0)
(7, 'team_wins_losses_home', 'TEXT', 0, None, 0)
(8, 'pts_qtr1_home', 'TEXT', 0, None, 0)
(9, 'pts_qtr2_home', 'TEXT', 0, None, 0)
(10, 'pts_qtr3_home', 'TEXT', 0, None, 0)
(11, 'pts_qtr4_home', 'TEXT', 0, None, 0)
(12, 'pts_ot1_home', 'INTEGER', 0, None, 0)
(13, 'pts_ot2_home', 'INTEGER', 0, None, 0)
(14, 'pts_ot3_home', 'INTEGER', 0, None, 0)
(15, 'pts_ot4_home', 'INTEGER', 0, None, 0)
(16, 'pts_ot5_home', 'INTEGER', 0, None, 0)
(17, 'pts_ot6_home', 'INTEGER', 0, None, 0)
...
Share
Improve this question
asked Mar 13 at 2:27
Carolyn RyanCarolyn Ryan
91 silver badge1 bronze badge
2
- No, there aren't any primary keys. The "pk" column will be zero for columns that aren't part of the primary key. Your output shows all zeros, so you can safely assume the db doesn't make use of them. (Primary keys are only required for without rowid tables). You can find a table's foreign keys (which don't require primary key references) using pragma foreign_key_list. – ekhumoro Commented Mar 13 at 12:52
- This post should not have been moved out to the Staging Ground. Please act on all the comments in there. – philipxy Commented Mar 14 at 0:38
1 Answer
Reset to default 1It seems the dataset in question has tables directly imported from CSV, without any column being marked as primary key.
As such, you would need to detect the primary key yourself. One heuristics that you can use is that, if there is a single column that is intended to be a primary key, most people will name it id
or {table}_id
. So, for example, it would make sense that game_id
was the primary key in the game
table.
However, the real proof is a successful creation of a unique index (which is almost interchangeable in SQLite3 with a primary key).
CREATE UNIQUE INDEX pk ON game(game_id);
Unfortunately, you will find that game_id
is not eligible:
Runtime error: UNIQUE constraint failed: game.game_id (19)
If you investigate where the uniqueness fails:
SELECT game_id FROM game GROUP BY game_id HAVING COUNT(game_id) > 1;
you will see that e.g. 0030100001
repeats itself. Let's check it out:
SELECT * FROM game WHERE game_id = '0030100001';
It seems that the two rows are completely identical. This means there is no column, or even combination of columns, that can function as a primary key. The only possible primary key is the one SQLite is using: ROWID
. In order to actually be able to use any other column(s) as a primary key, you would have to deduplicate the rows first.
Some other tables cannot be solved this way. For example, game_summary
seems to have four entries for game_id = '0011600001'
, which differ by natl_tv_broadcaster_abbreviation
and live_period_time_bcast
. You would need to figure out by the semantics of the dataset whether a composite primary key makes sense, which columns to use for it (presumably game_id
and one of the other two columns), or if a new column should be created as a primary key (in effect, doing the same work as ROWID
is doing right now).
tl;dr: No magic answer. You need to investigate the database semantics and decide on the primary key strategy yourself.