carto-etl

SDK for ETL with CARTO's SQL API

Installation and usage

Ideally, the SDK should be installed in a separate Python virtual environment, for instance:

$ virtualenv env

Before working with the SDK, the environment needs to be activated:

$ source env/bin/activate

The first time, you need to install the dependencies:

$ pip install -r requirements.txt

Now, you can start an interactive session with IPython or create a .py file to be run separately.

Configuration file

There is a template file, etl.conf.example, that can be used to produce the final etl.conf file. Example of an etl.conf file:

```
[carto]
base_url=https://cartouser.carto.com/
api_key=5565dgfgfd2b8ajjhgjgfa94d311aa60lk89vnber45errfg5gb
table_name=samples
delimiter=,
columns=object_id,privacy,resource_type,country_code,date
date_columns=date
x_column=lon
y_column=lat
srid=4326

[etl]
chunk_size=500
max_attempts=3
file_encoding=utf-8
force_no_geometry=false
force_the_geom=
date_format=
datetime_format=
float_comma_separator=
float_thousand_separator=

[log]
file=etl.log
level=30

[geocoding]
input_delimiter=,
output_delimiter=,
output_columns=recId,displayLatitude,displayLongitude,locationLabel,houseNumber,street,district,city,postalCode,county,state,country,relevance
max_results=1
```

Parameters:

  • Related to the CARTO account:
      • base_url: CARTO API endpoint root for the user.
      • api_key: API key for the CARTO user.
      • table_name: Name of the target table in CARTO.
      • delimiter: Character used as delimiter in the CSV file, typically a comma.
      • columns: Columns of the CSV file that will be transferred to CARTO.
      • date_columns: Columns of the CSV file that represent a date or timestamp in a format other than CARTO's (%Y-%m-%d %H:%M:%S+00) and therefore need to be transformed. Columns in date_columns must also appear in the columns key. If date_columns is set, then either date_format or datetime_format must be set to the format of those columns in the CSV file (see the sketch after this list).
      • x_column: Name of the column that contains the x coordinate.
      • y_column: Name of the column that contains the y coordinate.
      • srid: The SRID of the geometry.
  • Related to ETL:
      • chunk_size: Number of items to be grouped in a single INSERT or DELETE request. POST requests can carry several MB of data (i.e. characters), so this number can go quite high if you wish.
      • max_attempts: Number of attempts before giving up on an API request to CARTO.
      • file_encoding: Encoding of the file. It defaults to utf-8; if your file contains accented characters or is in Spanish, it may be ISO-8859-1.
      • force_no_geometry: Set this to true if your destination table does not have a geometry column.
      • force_the_geom: Name of the geometry column in the CSV file, in case it holds a hexstring value that has to be inserted directly into PostGIS.
      • date_format: Format of the date_columns, expressed with the format codes supported by Python's datetime module.
      • datetime_format: Format of the date_columns in case they are timestamps, expressed with the format codes supported by Python's datetime module.
      • float_comma_separator: Character used as decimal separator in float columns.
      • float_thousand_separator: Character used as thousands separator in float columns.
  • Related to logging:
      • file: File name (or path) to the log file.
      • level: Numeric log level for the log file:

        | Level    | Numeric value |
        |----------|---------------|
        | CRITICAL | 50            |
        | ERROR    | 40            |
        | WARNING  | 30            |
        | INFO     | 20            |
        | DEBUG    | 10            |
        | NOTSET   | 0             |

  • Related to geocoding:
      • input_delimiter: The field delimiter in the input CSV for the batch geocoding job.
      • output_delimiter: The field delimiter to be used for the output geocoded CSV.
      • output_columns: The output columns that will appear in the output geocoded CSV. See the [HERE API docs](https://developer.here.com/rest-apis/documentation/batch-geocoder/topics/data-output.html).
      • max_results: Max number of results per address in the input CSV.
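
For reference, here is a minimal sketch of loading these values with Python's standard configparser module and parsing a date column with datetime. This only illustrates the file layout and the date_format semantics; it is not the SDK's own loader, and the "%d/%m/%Y" format is an assumed example.

```python
# Sketch only: reading etl.conf with the standard library, not the SDK's loader.
from configparser import ConfigParser
from datetime import datetime

config = ConfigParser()
config.read("etl.conf")

table_name = config.get("carto", "table_name")   # "samples" in the example above
chunk_size = config.getint("etl", "chunk_size")  # 500 in the example above
log_level = config.getint("log", "level")        # 30, i.e. WARNING

# If date_columns is set, date_format (or datetime_format) must describe the
# CSV values using datetime format codes; "%d/%m/%Y" is an assumed example.
date_format = config.get("etl", "date_format") or "%d/%m/%Y"
parsed = datetime.strptime("31/12/2017", date_format)
print(table_name, chunk_size, log_level, parsed)
```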

ETL

Important notice: The ETL script assumes the target table already exists in CARTO, and has the appropriate column definitions. The best way to achieve this is by uploading a small sample directly with Builder and then truncating the table before actually using the ETL script.
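
One way to truncate the table is through CARTO's SQL API. The following is a sketch using the requests library, with the base_url, api_key and table_name values taken from etl.conf; it assumes the standard /api/v2/sql endpoint and is not part of the SDK itself.

```python
# Sketch: truncate the target table through CARTO's SQL API (not part of the SDK).
import requests

base_url = "https://cartouser.carto.com/"  # base_url from etl.conf
api_key = "..."                            # api_key from etl.conf
table_name = "samples"                     # table_name from etl.conf

response = requests.get(
    base_url + "api/v2/sql",
    params={"q": "TRUNCATE TABLE {}".format(table_name), "api_key": api_key},
)
response.raise_for_status()  # raises if the truncation failed
```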

Insert new items into CARTO

```python
from etl import *

job = InsertJob("my_new_samples.csv")
job.run()
```

InsertJob can be created with these parameters:

  • csv_file_path: Path to the CSV file.
  • x_column: CSV column where the X coordinate can be found. Defaults to "longitude".
  • y_column: CSV column where the Y coordinate can be found. Defaults to "latitude".
  • srid: SRID of the coordinates. Defaults to "4326".

The run method can be called with these parameters:

  • start_chunk: First chunk to load from the CSV file. Defaults to "1", i.e., start from the beginning.
  • end_chunk: Last chunk to load from the CSV file. Defaults to "None", i.e., keep going until the end of the file.
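
For example, a fuller invocation overriding the defaults with the column names from the sample etl.conf; passing these as keyword arguments is an assumption based on the parameter names listed above:

```python
from etl import *

# Assumes keyword arguments named as in the parameter lists above.
job = InsertJob("my_new_samples.csv", x_column="lon", y_column="lat", srid=4326)
job.run(start_chunk=1, end_chunk=10)  # load only chunks 1 through 10
```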

Update existing items in CARTO

```python
from etl import *

job = UpdateJob("object_id", "my_existing_samples.csv")
job.run()
```

UpdateJob can be created with these parameters:

  • id_column: Name of the column that will be used to match the records in CARTO.
  • csv_file_path: Path to the CSV file.
  • x_column: CSV column where the X coordinate can be found. Defaults to "longitude".
  • y_column: CSV column where the Y coordinate can be found. Defaults to "latitude".
  • srid: SRID of the coordinates. Defaults to "4326".

The run method can be called with these parameters:

  • start_row: First row to load from the CSV file. Defaults to "1", i.e., start from the beginning.
  • end_row: Last row to load from the CSV file. Defaults to "None", i.e., keep going until the end of the file.

It is recommended that the column referred to by id_column be indexed in CARTO (see the sketch below).
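
A sketch of creating such an index through the SQL API, following the same requests-based approach as the truncation example above; the index name is arbitrary, and the table and column names are the ones from the sample configuration:

```python
# Sketch: index the id_column so UPDATE ... WHERE object_id = ... stays fast.
import requests

query = "CREATE INDEX IF NOT EXISTS samples_object_id_idx ON samples (object_id)"
response = requests.get(
    "https://cartouser.carto.com/api/v2/sql",
    params={"q": query, "api_key": "..."},  # api_key from etl.conf
)
response.raise_for_status()
```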

Delete existing items in CARTO

```python
from etl import *

job = DeleteJob("object_id", "my_existing_samples_to_be_deleted.csv")
job.run()
```

DeleteJob can be created with these parameters:

  • id_column: Name of the column that will be used to match the records in CARTO. Only this column needs to be present in the file, although it does not hurt if there are others.
  • csv_file_path: Path to the CSV file.

The run method can be called with these parameters:

  • start_chunk: First chunk to load from the CSV file. Defaults to "1", i.e., start from the beginning.
  • end_chunk: Last chunk to load from the CSV file. Defaults to "None", i.e., keep going until the end of the file.

As with updates, it is recommended that the column referred to by id_column be indexed in CARTO (see the index sketch above).
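
As with inserts, a run can be limited to a range of chunks; a short sketch, again assuming keyword arguments named as listed above:

```python
from etl import *

# Delete only the records covered by chunks 1 through 5 of the CSV.
job = DeleteJob("object_id", "my_existing_samples_to_be_deleted.csv")
job.run(start_chunk=1, end_chunk=5)
```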

Creating and regenerating overviews

There is a small utility to create or regenerate overviews for large point datasets. Once the ETL job is finished, you can run the following methods:

  • job.regenerate_overviews(): will start an asynchronous CARTO SQL execution, returning the identifier of the corresponding batch job.
  • job.check_job(batch_job_id): will return a dictionary with information about the batch SQL execution, including a status property.

An example of running this process and waiting until it is finished:

```python
import time

# ... job.run(), etc.

batch_job_id = job.regenerate_overviews()

while job.check_job(batch_job_id)['status'] != 'done':
    time.sleep(5)

# ...
```

Caveats:

  • If you are going to run more than one ETL job, overviews should be regenerated only after all of them have finished.
  • Mind that generating overviews can take a long time; that is why CARTO's Batch SQL API is used, so this process runs asynchronously.

Geocoding

See test_geocoding.py for a usage example.

There is a sample input CSV file in test_files/sample.csv. The columns of the input CSV are fixed, meaning that any input CSV to geocode must have the same structure. Field delimiters can be configured via the etl.conf file.
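
For reference, a minimal standard-library sketch of reading the sample file with the configured delimiter; this mirrors the fixed-column expectation but is not the SDK's own reader:

```python
# Sketch: read the geocoding input CSV with the delimiter from etl.conf.
import csv
from configparser import ConfigParser

config = ConfigParser()
config.read("etl.conf")
delimiter = config.get("geocoding", "input_delimiter")  # "," in the example

with open("test_files/sample.csv", newline="") as f:
    for row in csv.reader(f, delimiter=delimiter):
        print(row)  # the columns are fixed for every geocoding input CSV
```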

To run the tests, do the following:

```
cp etl.conf.example etl.conf

# Configure the etl.conf file properly, especially your HERE API keys

virtualenv env
source env/bin/activate
pip install -r requirements.txt
pip install pytest
pip install .
py.test tests
```

Issues

Bump lxml from 3.7.2 to 4.9.1

opened on 2022-07-06 19:31:53 by dependabot[bot]

Bumps lxml from 3.7.2 to 4.9.1.

Changelog

Sourced from lxml's changelog.

4.9.1 (2022-07-01)

Bugs fixed

  • A crash was resolved when using iterwalk() (or canonicalize()) after parsing certain incorrect input. Note that iterwalk() can crash on valid input parsed with the same parser after failing to parse the incorrect input.

4.9.0 (2022-06-01)

Bugs fixed

  • GH#341: The mixin inheritance order in lxml.html was corrected. Patch by xmo-odoo.

Other changes

  • Built with Cython 0.29.30 to adapt to changes in Python 3.11 and 3.12.

  • Wheels include zlib 1.2.12, libxml2 2.9.14 and libxslt 1.1.35 (libxml2 2.9.12+ and libxslt 1.1.34 on Windows).

  • GH#343: Windows-AArch64 build support in Visual Studio. Patch by Steve Dower.

4.8.0 (2022-02-17)

Features added

  • GH#337: Path-like objects are now supported throughout the API instead of just strings. Patch by Henning Janssen.

  • The ElementMaker now supports QName values as tags, which always override the default namespace of the factory.

Bugs fixed

  • GH#338: In lxml.objectify, the XSI float annotation "nan" and "inf" were spelled in lower case, whereas XML Schema datatypes define them as "NaN" and "INF" respectively.

... (truncated)

Commits
  • d01872c Prevent parse failure in new test from leaking into later test runs.
  • d65e632 Prepare release of lxml 4.9.1.
  • 86368e9 Fix a crash when incorrect parser input occurs together with usages of iterwa...
  • 50c2764 Delete unused Travis CI config and reference in docs (GH-345)
  • 8f0bf2d Try to speed up the musllinux AArch64 build by splitting the different CPytho...
  • b9f7074 Remove debug print from test.
  • b224e0f Try to install 'xz' in wheel builds, if available, since it's now needed to e...
  • 897ebfa Update macOS deployment target version from 10.14 to 10.15 since 10.14 starts...
  • 853c9e9 Prepare release of 4.9.0.
  • d3f77e6 Add a test for https://bugs.launchpad.net/lxml/+bug/1965070 leaving out the a...
  • Additional commits viewable in compare view



Bump ipython from 5.1.0 to 7.16.3

opened on 2022-01-21 19:26:00 by dependabot[bot]

Bumps ipython from 5.1.0 to 7.16.3.


"parse_float_value" parsing values incorrectly

opened on 2017-12-19 04:02:39 by allthesignals

Somehow, Python is parsing values such as 385.606955914854 into 385606955914854.0.

This will throw an "integer too large" error in Postgres but is also very wrong! Not sure what's causing this.

Parses strings as floats for SQL insert queries

opened on 2017-12-18 21:10:51 by allthesignals

When possible, carto-etl seems to parse strings as floats. For example, "50170121000" becomes 50170121000.0. It's not clear why it's not honoring the CSV's double quote, which implies a string. Thoughts?

AttributeError: 'InsertJob' object has no attribute 'columns'

opened on 2017-12-18 20:30:31 by allthesignals

Following the example, I get an error that InsertJob object has no attribute called columns. I assume this has something to do with my config file - maybe it doesn't see the file? I'm not sure. Which version of Python should I be using? Is there a step somewhere else?

Also, in the readme, you say: from etl import *

Which gives access to InsertJob, but using Python 3, I'm not seeing this behavior. Instead, I need to use: from etl.etl import InsertJob

Was I supposed to cd into the etl folder?

Releases

v1.0.1 2018-01-02 17:06:01

  • Avoid parsing reserved strings (such as INFINITY) as float numbers

v1.0 2017-10-26 15:39:48
