scripts and baselines for Spider: Yale complex and cross-domain semantic parsing and text-to-SQL challenge

taoyds, updated 🕥 2022-03-22 19:54:50

Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

Spider is a large human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task (natural language interfaces for relational databases). It is released along with our EMNLP 2018 paper: Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. This repo contains all code for evaluation, preprocessing, and all baselines used in our paper. Please refer to the task site for more general introduction and the leaderboard.

:+1: 03/20/2022: We open-sourced a simple but SOTA model (just T5) for the task! Please check out our code in the UnifiedSKG repo!!


-11/15/2020 We will use Test Suite Accuracy as our official evaluation metric for Spider, SParC, and CoSQL. Please find the evaluation code from here. - 08/03/2020 Corrected column_name and column_name_original mismatches in 2 dbs (scholar and formula_1) in tables.json, and reparsed SQL queries (this only affects some models (e.g. RATSQL) which use our parsed SQL as the SQL input). Please download the Spider dataset from the page again. - 06/07/2020 We corrected some annotation errors and label mismatches (not errors) in Spider dev and test sets (~4% of dev examples updated, click here for more details). Please download the Spider dataset from the page again. - 01/16/2020 For value prediction (in order to compute the execution accuracy), your model should be able to 1) copy from the question inputs, 2) retrieve from the database content (database content is available), or 3) generate numbers (e.g. 3 in "LIMIT 3"). - 1/14/2019 The submission toturial is ready! Please follow it to get your results on the unreleased test data. - 12/17/2018 We updated 7 sqlite database files. Please download the Spider data from the official website again. Please refer to the issue 14 for more details. - 10/25/2018: evaluation script is updated so that the table in count(*)cases will be evaluated as well. Please check out the issue 5 for more info. Results of all baselines and syntaxSQL on the papers are updated as well. - 10/25/2018: to get the latest SQL parsing results (a few small bugs fixed), please use preprocess/ to update. Please refer to the issue 3 for more details.


The dataset is annotated by 11 college students. When you use the Spider dataset, we would appreciate it if you cite the following:

@inproceedings{Yu&al.18c, title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task}, author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev} booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing", address = "Brussels, Belgium", publisher = "Association for Computational Linguistics", year = 2018 }

Installation and are written in Python 3. Enviroment setup for each baseline is in README under each baseline directory.

Data Content and Format

Question, SQL, and Parsed SQL

Each file intrain.json and dev.json contains the following fields: - question: the natural language question - question_toks: the natural language question tokens - db_id: the database id to which this question is addressed. - query: the SQL query corresponding to the question. - query_toks: the SQL query tokens corresponding to the question. - sql: parsed results of this SQL query using Please refer to parsed_sql_examples.sql in thepreprocess directory for the detailed documentation.

``` { "db_id": "world_1", "query": "SELECT avg(LifeExpectancy) FROM country WHERE Name NOT IN (SELECT T1.Name FROM country AS T1 JOIN countrylanguage AS T2 ON T1.Code = T2.CountryCode WHERE T2.Language = \"English\" AND T2.IsOfficial = \"T\")", "query_toks": ["SELECT", "avg", "(", "LifeExpectancy", ")", "FROM", ...], "question": "What is average life expectancy in the countries where English is not the official language?", "question_toks": ["What", "is", "average", "life", ...], "sql": { "except": null, "from": { "conds": [], "table_units": [ ... }, "groupBy": [], "having": [], "intersect": null, "limit": null, "orderBy": [], "select": [ ... ], "union": null, "where": [ [ true, ... { "except": null, "from": { "conds": [ [ false, 2, [ ... }, "groupBy": [], "having": [], "intersect": null, "limit": null, "orderBy": [], "select": [ false, ... "union": null, "where": [ [ false, 2, [ 0, ... } },



tables.json contains the following information for each database: - db_id: database id - table_names_original: original table names stored in the database. - table_names: cleaned and normalized table names. We make sure the table names are meaningful. [to be changed] - column_names_original: original column names stored in the database. Each column looks like: [0, "id"]. 0 is the index of table names in table_names, which is city in this case. "id" is the column name. - column_names: cleaned and normalized column names. We make sure the column names are meaningful. [to be changed] - column_types: data type of each column - foreign_keys: foreign keys in the database. [3, 8] means column indices in the column_names. These two columns are foreign keys of two different tables. - primary_keys: primary keys in the database. Each number is the index of column_names.

{ "column_names": [ [ 0, "id" ], [ 0, "name" ], [ 0, "country code" ], [ 0, "district" ], . . . ], "column_names_original": [ [ 0, "ID" ], [ 0, "Name" ], [ 0, "CountryCode" ], [ 0, "District" ], . . . ], "column_types": [ "number", "text", "text", "text", . . . ], "db_id": "world_1", "foreign_keys": [ [ 3, 8 ], [ 23, 8 ] ], "primary_keys": [ 1, 8, 23 ], "table_names": [ "city", "sqlite sequence", "country", "country language" ], "table_names_original": [ "city", "sqlite_sequence", "country", "countrylanguage" ] }


All table contents are contained in corresponding SQLite3 database files.


Update 11/15/20: We will use Test Suite Accuracy as our official evaluation metric for Spider, SParC, and CoSQL. Please find the evaluation code from here. Our evaluation metrics include Component Matching, Exact Matching, and Execution Accuracy. For component and exact matching evaluation, instead of simply conducting string comparison between the predicted and gold SQL queries, we decompose each SQL into several clauses, and conduct set comparison in each SQL clause.

For Execution Accuracy, our current models do not predict any value in SQL conditions so that we do not provide execution accuracies. However, we encourage you to provide it in the future submissions. For value prediction, you can assume that a list of gold values for each question is given. Your model has to fill them into the right slots in the SQL.

Please refer to our paper and this page for more details and examples.

``` python --gold [gold file] --pred [predicted file] --etype [evaluation type] --db [database dir] --table [table file]

arguments: [gold file] gold.sql file where each line is a gold SQL \t db_id [predicted file] predicted sql file where each line is a predicted SQL [evaluation type] "match" for exact set matching score, "exec" for execution score, and "all" for both [database dir] directory which contains sub-directories where each SQLite3 database is stored [table file] table.json file which includes foreign key info of each database




I couldn't find any measure of query difficulty in the Spider dataset

opened on 2023-02-24 01:50:56 by Atlamtiz

"I didn't find any difficulty metric in Spider, but in the latest paper, Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing, I saw that they distinguished different difficulty levels. However, there seems to be no difficulty measurement in the dataset. Why is that?"

Exact Match for JOINs without the ON clause

opened on 2022-10-23 13:02:56 by hrshtv

Consider the below case: predicted = "SELECT professionals.first_name , professionals.last_name FROM professionals JOIN treatments ON professionals.professional_id = treatments.professional_id WHERE treatments.cost_of_treatment<(SELECT AVG( treatments.cost_of_treatment ) FROM treatments)" gold = "SELECT DISTINCT T1.first_name , T1.last_name FROM Professionals AS T1 JOIN Treatments AS T2 WHERE cost_of_treatment < ( SELECT avg(cost_of_treatment) FROM Treatments )" I get exact match = 1 for this (using eval_exact_match). Is this by design? Since the predicted SQL has an additional ON clause after the JOIN.

Tokenize new data

opened on 2022-09-06 10:07:14 by eliotwalt

Hi, I am trying to use the model on new data and struggle to reproduce the tokenization method to obtain the query_toks and query_toks_no_value fields. I tried using process_sql.tokenize which does not produce the same results as the dataset.
Is the code for this provided somewhere? Thanks.

The results are inconsistent with the notes

opened on 2022-03-23 08:27:59 by vcbeaut

the part of sql parser is inconistent about col_unit. the notes show it's a tuple, but result is a number or string. 1 2

Fix bibtex for citation

opened on 2022-01-21 04:47:34 by dylan-slack

Part of the citation used brackets and the other half didn't, so it wasn't working in latex

[Bug] Training Instance 3153, gold query refers to a table that doesn't exist in the db.

opened on 2022-01-12 21:31:12 by hXtreme

Training Instance 3153:

json { "db_id": "assets_maintenance", "query": "SELECT T1.company_name FROM Third_Party_Companies AS T1 JOIN Maintenance_Contracts AS T2 ON T1.company_id = T2.maintenance_contract_company_id JOIN Ref_Company_Types AS T3 ON T1.company_type_code = T3.company_type_code ORDER BY T2.contract_end_date DESC LIMIT 1", "query_toks": [...], "query_toks_no_value": [...], "question": "What is the description of the type of the company who concluded its contracts most recently?", "question_toks": [...], "sql": {...} }

The table Ref_Company_Types doesn't exist in assets_maintenance db.

assets_maintenance schema:

json { "db_id": "assets_maintenance", "table_names": [ "third party companies", "maintenance contracts", "parts", "skills", "staff", "assets", "asset parts", "maintenance engineers", "engineer skills", "fault log", "engineer visits", "part faults", "fault log parts", "skills required to fix" ], "table_names_original": [ "Third_Party_Companies", "Maintenance_Contracts", "Parts", "Skills", "Staff", "Assets", "Asset_Parts", "Maintenance_Engineers", "Engineer_Skills", "Fault_Log", "Engineer_Visits", "Part_Faults", "Fault_Log_Parts", "Skills_Required_To_Fix" ], "column_names": [...], "column_names_original": [...], "column_types": [...], "primary_keys": [...], "foreign_keys": [...] }