Convert CSV files into a SQLite database

simonw, updated 🕥 2022-03-02 20:40:17

csvs-to-sqlite

PyPI Changelog Tests License

Convert CSV files into a SQLite database. Browse and publish that SQLite database with Datasette.

Basic usage:

csvs-to-sqlite myfile.csv mydatabase.db

This will create a new SQLite database called mydatabase.db containing a single table, myfile, containing the CSV content.

You can provide multiple CSV files:

csvs-to-sqlite one.csv two.csv bundle.db

The bundle.db database will contain two tables, one and two.

This means you can use wildcards:

csvs-to-sqlite ~/Downloads/*.csv my-downloads.db

If you pass a path to one or more directories, the script will recursively search those directories for CSV files and create tables for each one.

csvs-to-sqlite ~/path/to/directory all-my-csvs.db

Handling TSV (tab-separated values)

You can use the -s option to specify a different delimiter. If you want to use a tab character you'll need to apply shell escaping like so:

csvs-to-sqlite my-file.tsv my-file.db -s $'\t'

Refactoring columns into separate lookup tables

Let's say you have a CSV file that looks like this:

county,precinct,office,district,party,candidate,votes
Clark,1,President,,REP,John R. Kasich,5
Clark,2,President,,REP,John R. Kasich,0
Clark,3,President,,REP,John R. Kasich,7

(Real example taken from the Open Elections project)

You can now convert selected columns into separate lookup tables using the new --extract-column option (shortname: -c) - for example:

csvs-to-sqlite openelections-data-*/*.csv \
    -c county:County:name \
    -c precinct:Precinct:name \
    -c office -c district -c party -c candidate \
    openelections.db

The format is as follows:

column_name:optional_table_name:optional_table_value_column_name

If you just specify the column name e.g. -c office, the following table will be created:

CREATE TABLE "office" (
    "id" INTEGER PRIMARY KEY,
    "value" TEXT
);

If you specify all three options, e.g. -c precinct:Precinct:name the table will look like this:

CREATE TABLE "Precinct" (
    "id" INTEGER PRIMARY KEY,
    "name" TEXT
);

The original tables will be created like this:

CREATE TABLE "ca__primary__san_francisco__precinct" (
    "county" INTEGER,
    "precinct" INTEGER,
    "office" INTEGER,
    "district" INTEGER,
    "party" INTEGER,
    "candidate" INTEGER,
    "votes" INTEGER,
    FOREIGN KEY (county) REFERENCES County(id),
    FOREIGN KEY (party) REFERENCES party(id),
    FOREIGN KEY (precinct) REFERENCES Precinct(id),
    FOREIGN KEY (office) REFERENCES office(id),
    FOREIGN KEY (candidate) REFERENCES candidate(id)
);

They will be populated with IDs that reference the new derived tables.

Installation

$ pip install csvs-to-sqlite

csvs-to-sqlite now requires Python 3. If you are running Python 2 you can install the last version to support Python 2:

$ pip install csvs-to-sqlite==0.9.2

csvs-to-sqlite --help

``` Usage: csvs-to-sqlite [OPTIONS] PATHS... DBNAME

PATHS: paths to individual .csv files or to directories containing .csvs

DBNAME: name of the SQLite database file to create

Options: -s, --separator TEXT Field separator in input .csv -q, --quoting INTEGER Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

--skip-errors Skip lines with too many fields instead of stopping the import

--replace-tables Replace tables if they already exist -t, --table TEXT Table to use (instead of using CSV filename) -c, --extract-column TEXT One or more columns to 'extract' into a separate lookup table. If you pass a simple column name that column will be replaced with integer foreign key references to a new table of that name. You can customize the name of the table like so: state:States:state_name

                              This will pull unique values from the 'state'
                              column and use them to populate a new 'States'
                              table, with an id column primary key and a
                              state_name column containing the strings from
                              the original column.

-d, --date TEXT One or more columns to parse into ISO formatted dates

-dt, --datetime TEXT One or more columns to parse into ISO formatted datetimes

-df, --datetime-format TEXT One or more custom date format strings to try when parsing dates/datetimes

-pk, --primary-key TEXT One or more columns to use as the primary key -f, --fts TEXT One or more columns to use to populate a full- text index

-i, --index TEXT Add index on this column (or a compound index with -i col1,col2)

--shape TEXT Custom shape for the DB table - format is csvcol:dbcol(TYPE),...

--filename-column TEXT Add a column with this name and populate with CSV file name

--fixed-column ... Populate column with a fixed string --fixed-column-int ... Populate column with a fixed integer --fixed-column-float ... Populate column with a fixed float --no-index-fks Skip adding index to foreign key columns created using --extract-column (default is to add them)

--no-fulltext-fks Skip adding full-text index on values extracted using --extract-column (default is to add them)

--just-strings Import all columns as text strings by default (and, if specified, still obey --shape, --date/datetime, and --datetime-format)

--version Show the version and exit. --help Show this message and exit.

```

Issues

Don't touch tables that already exist?

opened on 2022-12-21 21:58:44 by chrismp

If I run something like this

csvs-to-sqlite some-data.csv database.db

How do I make the script skip this command if table 'some-data' already exists?

Column `filing_id` not found despite its manifest existence

opened on 2022-07-01 09:05:55 by MichaelTiemannOSC

I am trying to load 18 CSV files from CorpWatch into a SQLite database using csvs-to-sqlite. I will spare you the deprecation warnings for error_bad_lines.

``` michael$ time /Users/michael/Library/Python/3.8/bin/csvs-to-sqlite ~/Downloads/corpwatch_api_tables_csv ~/Temp/corpwatch.db -s $'\t'

return pd.read_csv( /Users/michael/Library/Python/3.8/lib/python/site-packages/csvs_to_sqlite/utils.py:38: DtypeWarning: Columns (8,10) have mixed types. Specify dtype option on import or set low_memory=False.

return pd.read_csv( /Users/michael/Library/Python/3.8/lib/python/site-packages/csvs_to_sqlite/utils.py:38: DtypeWarning: Columns (17) have mixed types. Specify dtype option on import or set low_memory=False.

return pd.read_csv( /Users/michael/Library/Python/3.8/lib/python/site-packages/csvs_to_sqlite/utils.py:38: DtypeWarning: Columns (5,6,7,8,9) have mixed types. Specify dtype option on import or set low_memory=False.

return pd.read_csv( /Users/michael/Library/Python/3.8/lib/python/site-packages/csvs_to_sqlite/utils.py:38: DtypeWarning: Columns (3,6,7) have mixed types. Specify dtype option on import or set low_memory=False. return pd.read_csv( Loaded 18 dataframes Traceback (most recent call last): File "/Users/michael/Library/Python/3.8/bin/csvs-to-sqlite", line 8, in sys.exit(cli()) File "/Users/michael/Library/Python/3.8/lib/python/site-packages/click/core.py", line 829, in call return self.main(args, kwargs) File "/Users/michael/Library/Python/3.8/lib/python/site-packages/click/core.py", line 782, in main rv = self.invoke(ctx) File "/Users/michael/Library/Python/3.8/lib/python/site-packages/click/core.py", line 1066, in invoke return ctx.invoke(self.callback, ctx.params) File "/Users/michael/Library/Python/3.8/lib/python/site-packages/click/core.py", line 610, in invoke return callback(args, **kwargs) File "/Users/michael/Library/Python/3.8/lib/python/site-packages/csvs_to_sqlite/cli.py", line 251, in cli df.to_sql(df.table_name, conn, if_exists="append", index=False) File "/Users/michael/Library/Python/3.8/lib/python/site-packages/pandas/core/generic.py", line 2951, in to_sql return sql.to_sql( File "/Users/michael/Library/Python/3.8/lib/python/site-packages/pandas/io/sql.py", line 697, in to_sql return pandas_sql.to_sql( File "/Users/michael/Library/Python/3.8/lib/python/site-packages/pandas/io/sql.py", line 2190, in to_sql return table.insert(chunksize, method) File "/Users/michael/Library/Python/3.8/lib/python/site-packages/pandas/io/sql.py", line 950, in insert num_inserted = exec_insert(conn, keys, chunk_iter) File "/Users/michael/Library/Python/3.8/lib/python/site-packages/pandas/io/sql.py", line 1896, in _execute_insert conn.executemany(self.insert_statement(num_rows=1), data_list) sqlite3.OperationalError: table ./company_filings has no column named filing_id

real 1m30.052s user 1m12.763s sys 0m16.640s ```

So the complaint is about company_filings.csv not having a filing_id. Yet:

michael$ head ~/Downloads/corpwatch_api_tables_csv/company_filings.csv filing_id cik year quarter period_of_report filing_date form_10k_url sec_21_url 47 1000180 2008 1 20071230 2008-02-25 http://www.sec.gov/Archives/edgar/data/1000180/0000950134-08-003259.txt http://www.sec.gov/Archives/edgar/data/1000180/000095013408003259/f38194exv21w1.htm 104 1000209 2008 1 20071231 2008-03-13 http://www.sec.gov/Archives/edgar/data/1000209/0001193125-08-055644.txt http://www.sec.gov/Archives/edgar/data/1000209/000119312508055644/dex211.htm 121 1000228 2008 1 20071229 2008-02-26 http://www.sec.gov/Archives/edgar/data/1000228/0000950123-08-002119.txt http://www.sec.gov/Archives/edgar/data/1000228/000095012308002119/y50229exv21w1.htm 152 1000229 2008 1 0 2008-02-22 http://www.sec.gov/Archives/edgar/data/1000229/0001000229-08-000005.txt NULL 174 1000230 2008 1 0 2008-01-29 http://www.sec.gov/Archives/edgar/data/1000230/0001193125-08-014289.txt NULL 186 1000232 2008 1 20071231 2008-03-31 http://www.sec.gov/Archives/edgar/data/1000232/0001000232-08-000005.txt http://www.sec.gov/Archives/edgar/data/1000232/000100023208000005/exhibit21subsidiaries.txt 213 1000234 2008 1 20070924 2008-03-17 http://www.sec.gov/Archives/edgar/data/1000234/0001000234-08-000003.txt http://www.sec.gov/Archives/edgar/data/1000234/000100023408000003/ex21.htm 214 1000234 2008 1 20071231 2008-03-17 http://www.sec.gov/Archives/edgar/data/1000234/0001000234-08-000004.txt http://www.sec.gov/Archives/edgar/data/1000234/000100023408000004/ex21.htm 462 1000623 2008 1 20071231 2008-03-07 http://www.sec.gov/Archives/edgar/data/1000623/0001047469-08-002365.txt http://www.sec.gov/Archives/edgar/data/1000623/000104746908002365/a2183428zex-21.htm ```

Why is this column not being found?

low_memory=False not a valid option

opened on 2022-05-26 22:12:31 by frafra

Hi, csvs-to-sqlite suggests to use low_memory=False, but there is no way to use such option from the command line, and it is not mentioned in the documentation.

/home/user/.local/pipx/venvs/csvs-to-sqlite/lib/python3.8/site-packages/csvs_to_sqlite/utils.py:38: DtypeWarning: Columns (14) have mixed types. Specify dtype option on import or set low_memory=False. return pd.read_csv(

https://github.com/simonw/csvs-to-sqlite/blob/a8a37a016790dc93270c74e32d0a5051bc5a0f4d/csvs_to_sqlite/utils.py#L43

error_bad_lines argument has been deprecated

opened on 2022-05-26 09:48:24 by frafra

I get a deprecation warning on the latest stable.

csvs_to_sqlite/utils.py:38: FutureWarning: The error_bad_lines argument has been deprecated and will be removed in a future version. Use on_bad_lines in the future.

minor version 1.3.1: adds click 8.0 dep

opened on 2022-03-02 20:40:17 by dannguyen

As per issue #80: https://github.com/simonw/csvs-to-sqlite/issues/80

Tested in 3.x, including 3.10. Seems fine.

option to lower case column names, replace space with underscore, and remove reservered character from column names

opened on 2022-02-23 18:34:37 by fgregg

it's generally easier to if the column names of a sqlite table are lower case, have no spaces, and have no reserved characters. It would be nice if there was an option in this tool to make transform the columns names thus.

Releases

1.3 2021-11-18 16:33:36

  • New options for populating a fixed column - so every inserted row will have the same column with the same value. Options are --fixed-column column-name string-value, --fixed-column-int column-name integer-value and --fixed-column-float column-name float-value. Thanks, William Rowell. #81

1.2 2020-11-03 23:27:20

  • Upgraded dependencies, including Pandas. #73

1.1 2020-08-09 18:45:30

  • --just-strings feature to disable type detection and import columns as strings by default. Thanks, Dan Nguyen!. #58

csvs-to-sqlite 1.0 2019-08-03 10:50:48

This release drops support for Python 2.x #55

csvs-to-sqlite 0.9.2 2019-07-03 04:36:26

Bumped dependencies and pinned pytest to version 4 (5 is incompatible with Python 2.7).

csvs-to-sqlite 0.9.1 2019-06-24 15:16:54

  • Fixed bug where -f option used FTS4 even when FTS5 was available (#41)
Simon Willison
GitHub Repository

sqlite python csv pandas click datasette datasette-io datasette-tool