Ideally, the SDK should be installed in a separate Python virtual environment, for instance:

$ virtualenv env

Before working with the SDK, the environment needs to be activated:

$ source env/bin/activate

The first time, you need to install the dependencies:

$ pip install -r requirements.txt

Now you can start an interactive session with IPython or create a .py file to be run separately.
There is a template file, `etl.conf.example`, that can be used to create the final `etl.conf` file. Example of an `etl.conf` file:
```
[carto]
base_url=https://cartouser.carto.com/
api_key=5565dgfgfd2b8ajjhgjgfa94d311aa60lk89vnber45errfg5gb
table_name=samples
delimiter=,
columns=object_id,privacy,resource_type,country_code,date
date_columns=date
x_column=lon
y_column=lat
srid=4326

[etl]
chunk_size=500
max_attempts=3
file_encoding=utf-8
force_no_geometry=false
force_the_geom=
date_format=
datetime_format=
float_comma_separator=
float_thousand_separator=

[log]
file=etl.log
level=30

[geocoding]
input_delimiter=,
output_delimiter=,
output_columns=recId,displayLatitude,displayLongitude,locationLabel,houseNumber,street,district,city,postalCode,county,state,country,relevance
max_results=1
```
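To sanity-check your `etl.conf` before running a job, you can read it with Python's standard `configparser`; a minimal sketch (the section and key names are the ones shown above, and the ETL itself may load the file differently):

```python
import configparser

# Parse the INI-style configuration shown above
config = configparser.ConfigParser()
config.read("etl.conf")

# Inspect the settings the ETL jobs will rely on
print(config["carto"]["base_url"])
print(config["carto"]["columns"].split(","))
print(config.getint("etl", "chunk_size"))
print(config.getint("log", "level"))
```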
Parameters:

`base_url`
: CARTO API endpoint root for the user.

`api_key`
: API key for the CARTO user.

`table_name`
: Name of the target table in CARTO.

`delimiter`
: Character used as delimiter in the CSV file, typically a comma.

`columns`
: Columns of the CSV file that will be transferred to CARTO.

`date_columns`
: Columns of the CSV file that represent a date or timestamp and have a different format than the CARTO date format (`%Y-%m-%d %H:%M:%S+00`), so they need to be transformed. Columns in `date_columns` must also appear in the `columns` key. If `date_columns` is set, then either `date_format` or `datetime_format` must be properly set to indicate the format of the `date_columns` in the CSV file.

`x_column`
: Name of the column that contains the x coordinate.

`y_column`
: Name of the column that contains the y coordinate.

`srid`
: The SRID of the geometry.

`chunk_size`
: Number of items to be grouped in a single INSERT or DELETE request. POST requests can deal with several MBs of data (i.e. characters), so this number can go quite high if you wish.

`max_attempts`
: Number of attempts before giving up on an API request to CARTO.

`file_encoding`
: Encoding of the file. By default it is `utf-8`; if your file contains accents or is in Spanish, it may be `ISO-8859-1`.

`force_no_geometry`
: Set this to `true` if your destination table does not have a geometry column.

`force_the_geom`
: Name of the geometry column in the CSV file, in case it is a hexstring value that has to be inserted directly into PostGIS.

`date_format`
: Format of the `date_columns`, expressed in the formats supported by the `datetime` Python module (see the sketch after the log-level table below).

`datetime_format`
: Format of the `date_columns` in case they are timestamps, expressed in the formats supported by the `datetime` Python module.

`float_comma_separator`
: Character used as the decimal comma separator in float columns.

`float_thousand_separator`
: Character used as the thousands separator in float columns.

`file`
: File name (or path) of the log file.

`level`
: Numeric log level for the log file, as in:
| Level | Numeric value |
|--------|---------------|
| CRITICAL | 50 |
| ERROR | 40 |
| WARNING | 30 |
| INFO | 20 |
| DEBUG | 10 |
| NOTSET | 0 |
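To illustrate the `datetime`-style format strings that `date_format` and `datetime_format` expect, here is a small sketch; the format strings and sample values are illustrative only:

```python
from datetime import datetime

# A date column like "31/12/2017" would need date_format=%d/%m/%Y
parsed_date = datetime.strptime("31/12/2017", "%d/%m/%Y")

# A timestamp column like "31/12/2017 23:59:59" would need
# datetime_format=%d/%m/%Y %H:%M:%S
parsed_ts = datetime.strptime("31/12/2017 23:59:59", "%d/%m/%Y %H:%M:%S")

# Values end up in CARTO's own date format (%Y-%m-%d %H:%M:%S+00)
print(parsed_ts.strftime("%Y-%m-%d %H:%M:%S+00"))  # 2017-12-31 23:59:59+00
```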
Related to geocoding:

`input_delimiter`
: The field delimiter in the input CSV for the batch geocoding job.

`output_delimiter`
: The field delimiter to be used for the output geocoded CSV.

`output_columns`
: The output columns that will appear in the output geocoded CSV. See the [HERE API docs](https://developer.here.com/rest-apis/documentation/batch-geocoder/topics/data-output.html).

`max_results`
: Max number of results per address in the input CSV.

Important notice: The ETL script assumes the target table already exists in CARTO and has the appropriate column definitions. The best way to achieve this is by uploading a small sample directly with Builder and then truncating the table before actually using the ETL script.
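If you prefer to truncate from code rather than from Builder, the statement can be sent through the CARTO SQL API; a hedged sketch using `requests` (the user, key and table name are illustrative and should be taken from your `etl.conf`):

```python
import requests

# Values are illustrative; reuse base_url, api_key and table_name from etl.conf
BASE_URL = "https://cartouser.carto.com/"
API_KEY = "your_api_key"
TABLE_NAME = "samples"

# CARTO SQL API: runs a single SQL statement against the user's database
response = requests.get(
    BASE_URL + "api/v2/sql",
    params={"q": "TRUNCATE TABLE {0}".format(TABLE_NAME), "api_key": API_KEY},
)
response.raise_for_status()
```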
```python
from etl import *

job = InsertJob("my_new_samples.csv")
job.run()
```
`InsertJob` can be created with these parameters:
* `csv_file_path`: Path to the CSV file.
* `x_column`: CSV column where the X coordinate can be found. Defaults to "longitude".
* `y_column`: CSV column where the Y coordinate can be found. Defaults to "latitude".
* `srid`: SRID of the coordinates. Defaults to "4326".
The `run` method can be called with these parameters:
* `start_chunk`: First chunk to load from the CSV file. Defaults to "1", i.e., start from the beginning.
* `end_chunk`: Last chunk to load from the CSV file. Defaults to "None", i.e., keep going until the end of the file.
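For instance, a sketch of resuming an `InsertJob` load that stopped partway through; the chunk numbers are illustrative, and this assumes the parameters can be passed as keyword arguments:

```python
from etl import *

job = InsertJob("my_new_samples.csv")

# Chunks 1-10 were already inserted in a previous run, so skip them
# and stop after chunk 20 (both values are just examples)
job.run(start_chunk=11, end_chunk=20)
```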
```python
from etl import *

job = UpdateJob("object_id", "my_existing_samples.csv")
job.run()
```
`UpdateJob` can be created with these parameters:
* `id_column`: Name of the column that will be used to match the records in CARTO.
* `csv_file_path`: Path to the CSV file.
* `x_column`: CSV column where the X coordinate can be found. Defaults to "longitude".
* `y_column`: CSV column where the Y coordinate can be found. Defaults to "latitude".
* `srid`: SRID of the coordinates. Defaults to "4326".
The `run` method can be called with these parameters:
* `start_row`: First row to load from the CSV file. Defaults to "1", i.e., start from the beginning.
* `end_row`: Last row to load from the CSV file. Defaults to "None", i.e., keep going until the end of the file.
It is recommended that the column referred to by `id_column` be indexed in CARTO.
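Creating that index can be done from the Builder SQL console, or through the same SQL API pattern as the truncate sketch above; the index, table and column names below are illustrative and refer to the `samples`/`object_id` example used earlier:

```python
import requests

# Index the id_column so that UPDATE/DELETE matching stays fast
requests.get(
    "https://cartouser.carto.com/api/v2/sql",   # base_url from etl.conf + api/v2/sql
    params={
        "q": "CREATE INDEX samples_object_id_idx ON samples (object_id)",
        "api_key": "your_api_key",
    },
).raise_for_status()
```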
```python
from etl import *

job = DeleteJob("object_wid", "my_existing_samples_to_be_deleted.csv")
job.run()
```
`DeleteJob` can be created with these parameters:
* `id_column`: Name of the column that will be used to match the records in CARTO. Only this column needs to be present in the file, although it does not hurt if there are others.
* `csv_file_path`: Path to the CSV file.
The `run` method can be called with these parameters:
* `start_chunk`: First chunk to load from the CSV file. Defaults to "1", i.e., start from the beginning.
* `end_chunk`: Last chunk to load from the CSV file. Defaults to "None", i.e., keep going until the end of the file.
It is recommended that the column referred to by `id_column` be indexed in CARTO.
There is a small utility to create or regenerate overviews for large point datasets. Once the ETL job is finished, you can run the following methods:
`job.regenerate_overviews()`
: will start an asynchronous CARTO SQL execution, returning the identifier of the corresponding batch job.

`job.check_job(batch_job_id)`
: will return a dictionary with information about the batch SQL execution, including a `status` property.

An example of running this process and waiting until it is finished:
```python
import time

batch_job_id = job.regenerate_overviews()

while job.check_job(batch_job_id)['status'] != 'done':
    time.sleep(5)
```
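The loop above will spin forever if the batch job ends in any state other than `done`; a slightly more defensive sketch, assuming the Batch SQL job can also end up as `failed` or `canceled` (the statuses documented for CARTO's Batch SQL API):

```python
import time

batch_job_id = job.regenerate_overviews()

while True:
    status = job.check_job(batch_job_id)['status']
    if status == 'done':
        break
    if status in ('failed', 'canceled'):
        # Give up instead of polling forever
        raise RuntimeError("Overview regeneration ended with status: " + status)
    time.sleep(5)
```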
Caveats:
See `test_geocoding.py` for a usage example.

There is a sample input CSV file in `test_files/sample.csv`. The columns of the input CSV are fixed; that means any input CSV to geocode has to have the same structure. Field delimiters can be configured via the `etl.conf` file.
To run the tests, do the following:

```
cp etl.conf.example etl.conf
virtualenv env
source env/bin/activate
pip install -r requirements.txt
pip install pytest
pip install .
py.test tests
```
Bumps lxml from 3.7.2 to 4.9.1.

Sourced from lxml's changelog.

4.9.1 (2022-07-01)

Bugs fixed

- A crash was resolved when using `iterwalk()` (or `canonicalize()`) after parsing certain incorrect input. Note that `iterwalk()` can crash on valid input parsed with the same parser after failing to parse the incorrect input.

4.9.0 (2022-06-01)

Bugs fixed

- GH#341: The mixin inheritance order in `lxml.html` was corrected. Patch by xmo-odoo.

Other changes

- Built with Cython 0.29.30 to adapt to changes in Python 3.11 and 3.12.
- Wheels include zlib 1.2.12, libxml2 2.9.14 and libxslt 1.1.35 (libxml2 2.9.12+ and libxslt 1.1.34 on Windows).
- GH#343: Windows-AArch64 build support in Visual Studio. Patch by Steve Dower.

4.8.0 (2022-02-17)

Features added

- GH#337: Path-like objects are now supported throughout the API instead of just strings. Patch by Henning Janssen.
- The `ElementMaker` now supports `QName` values as tags, which always override the default namespace of the factory.

Bugs fixed

- GH#338: In lxml.objectify, the XSI float annotation "nan" and "inf" were spelled in lower case, whereas XML Schema datatypes define them as "NaN" and "INF" respectively.

... (truncated)

Commits

- d01872c Prevent parse failure in new test from leaking into later test runs.
- d65e632 Prepare release of lxml 4.9.1.
- 86368e9 Fix a crash when incorrect parser input occurs together with usages of iterwa...
- 50c2764 Delete unused Travis CI config and reference in docs (GH-345)
- 8f0bf2d Try to speed up the musllinux AArch64 build by splitting the different CPytho...
- b9f7074 Remove debug print from test.
- b224e0f Try to install 'xz' in wheel builds, if available, since it's now needed to e...
- 897ebfa Update macOS deployment target version from 10.14 to 10.15 since 10.14 starts...
- 853c9e9 Prepare release of 4.9.0.
- d3f77e6 Add a test for https://bugs.launchpad.net/lxml/+bug/1965070 leaving out the a...

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Bumps ipython from 5.1.0 to 7.16.3.

Commits

- d43c7c7 release 7.16.3
- 5fa1e40 Merge pull request from GHSA-pq7m-3gw7-gq5x
- 8df8971 back to dev
- 9f477b7 release 7.16.2
- 138f266 bring back release helper from master branch
- 5aa3634 Merge pull request #13341 from meeseeksmachine/auto-backport-of-pr-13335-on-7...
- bcae8e0 Backport PR #13335: What's new 7.16.2
- 8fcdcd3 Pin Jedi to <0.17.2.
- 2486838 release 7.16.1
- 20bdc6f fix conda build

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Somehow, Python is parsing values such as:
385.606955914854
into:
385606955914854.0
This will throw an "integer too large" error in Postgres but is also very wrong! Not sure what's causing this.
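One plausible explanation (a hypothesis, not a confirmed trace of what carto-etl actually does) is that the dot is being treated as `float_thousand_separator` and stripped before the float conversion, which reproduces the reported value exactly:

```python
value = "385.606955914854"

# If "." were treated as a thousands separator and removed before
# converting to float, the result matches the mangled value:
print(float(value.replace(".", "")))  # 385606955914854.0
```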
When possible, carto-etl seems to parse strings as floats. For example, `"50170121000"` becomes `50170121000.0`. It's not clear why it's not honoring the CSV's double quotes, which imply a string. Thoughts?
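Note that Python's `csv` module returns every field as a plain string regardless of quoting, so a blanket float conversion downstream would turn a quoted id into a float; a small sketch of that standard-library behaviour with an illustrative row:

```python
import csv
import io

# Quoted or not, csv.reader yields plain strings...
row = next(csv.reader(io.StringIO('"50170121000",foo\n')))
print(row)            # ['50170121000', 'foo']

# ...so an unconditional float() conversion discards the "string" intent
print(float(row[0]))  # 50170121000.0
```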
Following the example, I get an error that the `InsertJob` object has no attribute called `columns`. I assume this has something to do with my config file - maybe it doesn't see the file? I'm not sure. Which version of Python should I be using? Is there a step somewhere else?

Also, in the readme, you say:

`from etl import *`

which gives access to `InsertJob`, but using Python 3, I'm not seeing this behavior. Instead, I need to use:

`from etl.etl import InsertJob`

Was I supposed to `cd` into the etl folder?
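For what it's worth, which of the two imports resolves usually depends on where the interpreter is started relative to the package; this is only a sketch of the distinction, assuming the repository contains an `etl/` package with an `etl.py` module inside (which is what the working import suggests):

```python
# Started from the repository root, the package form resolves:
from etl.etl import InsertJob   # etl/ is the package, etl.py the module

# Started from inside the etl/ folder (or with that folder on PYTHONPATH),
# the flat form from the readme would resolve instead:
# from etl import *
```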