Tap Configuration
Adding a new tap to your project is as simple as adding a new key to the taps section of the alto.toml file. The key is a user-defined name for the tap and the value is a dictionary of configuration options for that tap.
Example Tap Configuration
Here is an example of a single tap configuration. As you can see, it can be quite concise:
TOML:
[default.taps.tap-postgres]
pip_url = "pipelinewise-tap-postgres"
load_path = "raw_pg"
capabilities = ["state", "catalog"]
# These can all be in the .env file or in the environment
config.host = "@format {env[PG_HOST]}"
config.port = "@format {env[PG_PORT]}"
config.user = "@format {env[PG_USER]}"
config.password = "@format {env[PG_PASSWORD]}"
config.dbname = "prod"

YAML:
default:
  taps:
    tap-postgres:
      pip_url: pipelinewise-tap-postgres
      load_path: raw_pg
      capabilities:
        - state
        - catalog
      config:
        # These can all be in the .env file or in the environment
        host: "@format {env[PG_HOST]}"
        port: "@format {env[PG_PORT]}"
        user: "@format {env[PG_USER]}"
        password: "@format {env[PG_PASSWORD]}"
        dbname: prod

JSON:
{
  "default": {
    "taps": {
      "tap-postgres": {
        "pip_url": "pipelinewise-tap-postgres",
        "load_path": "raw_pg",
        "capabilities": [
          "state",
          "catalog"
        ],
        "config": {
          "host": "@format {env[PG_HOST]}",
          "port": "@format {env[PG_PORT]}",
          "user": "@format {env[PG_USER]}",
          "password": "@format {env[PG_PASSWORD]}",
          "dbname": "prod"
        }
      }
    }
  }
}
Below, we will go over each of the configuration options for a tap.
Settings
pip_url
The pip_url is the pip installable URL of the tap. This is the same URL that you would use to install the tap via pip install. This is the only required field for a tap configuration.
Remember that pip installable URLs can be a local path to a directory, tarball, zip file, or a git repo. Anything that can be used by pip can be used by alto.
load_path
The load_path of a tap overrides the project-level load path during configuration rendering. This means that targets can uniformly access the context-specific load_path by using the this.load_path variable. It would specifically look like this:
TOML:
[default.targets.target-snowflake]
config.schema = "@format {this.load_path}"

YAML:
default:
  targets:
    target-snowflake:
      config:
        schema: "@format {this.load_path}"

JSON:
{
  "default": {
    "targets": {
      "target-snowflake": {
        "config": {
          "schema": "@format {this.load_path}"
        }
      }
    }
  }
}
This is useful because targets almost always need some context about the tap that is feeding them data in order to write that data to the correct location, and this.load_path makes that context as convenient as possible to access. If you don't specify a load_path for a tap, the project-level load_path will be used.
capabilities
The capabilities of a tap are a list of capabilities that the tap has. These flags determine how alto should interface with the tap. The available capabilities are:
- state: The tap supports state management
- catalog: The tap supports catalog management
- properties: The tap supports (legacy) property management
- about: The tap supports the about command
- test: The tap supports the test command
state
The state capability indicates that the tap supports state management. This means that the tap can be used to extract data incrementally. If not specified, the --state flag will not be passed by alto.
catalog or properties
These two capabilities are very similar. They both indicate that the tap supports catalog management. The difference is that catalog is the preferred capability and properties is a legacy capability. I recommend going with catalog if you are unsure.
about and test
These capabilities enable the about and test commands for the tap. These commands are useful for debugging and testing the tap. They are not required for the tap to work and are only implemented by Meltano SDK taps.
The two most common capabilities you will specify are state and catalog. These are not set by default as we prefer explicitness here.
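To make the effect of these flags concrete, here is a minimal sketch of how a runner could turn declared capabilities into command-line arguments. The function name build_tap_args and its parameters are hypothetical and this is not alto's actual implementation. The --config, --catalog, --properties and --state flags are the conventional Singer tap arguments.

```python
from typing import List, Optional


def build_tap_args(
    capabilities: List[str],
    config_path: str,
    catalog_path: Optional[str] = None,
    state_path: Optional[str] = None,
) -> List[str]:
    """Assemble Singer-style CLI arguments from a tap's declared capabilities.

    Hypothetical helper: only flags backed by a declared capability are emitted.
    """
    args = ["--config", config_path]
    if catalog_path:
        if "catalog" in capabilities:
            args += ["--catalog", catalog_path]
        elif "properties" in capabilities:
            # Legacy taps receive the same file through --properties.
            args += ["--properties", catalog_path]
    if state_path and "state" in capabilities:
        # Without the state capability, --state is never passed.
        args += ["--state", state_path]
    return args


print(build_tap_args(["state", "catalog"], "config.json", "catalog.json", "state.json"))
```

A tap declared with no capabilities would only ever receive --config in this sketch, which is why state and catalog usually need to be listed explicitly.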
config
This is the configuration for the tap. This is exactly what you would put in the config.json if running the tap manually. The only difference is that you can use the @format directive to reference environment variables and other config values via this. This is useful for sensitive information like passwords and API keys. Both the Meltano Hub and most GitHub repos will tell you exactly what to put here.
executable and entrypoint
By default, alto will assume that the tap exposes a script named after the tap key name in the alto config. For example, if the tap is named tap-postgres, alto will assume that the tap exposes a script named tap-postgres. If this is not the case, you can specify the executable explicitly. You can alternatively specify the entrypoint if the package does not expose a script or you want to use something other than the default.
Up until now, we have always used tap-name as the name of our plugins. This is convenient because it often means we don't have to specify the executable. However, there is nothing stopping you from using a completely different name. You could call a tap platform_data or marketing_saas or master_extractor. It doesn't matter. As long as you specify the executable or entrypoint correctly, alto will be able to run it. It also means commands might be more meaningful to you and your team. For example, alto pg_platform_data:sf_staging_db may be more meaningful than alto tap-postgres:target-snowflake.
select
The select field is a list of patterns that will be used to prune the catalog. This allows you to selectively replicate streams, which is useful if you want to replicate a subset of the data in a database. The patterns are matched against the stream name using the fnmatch module. This means that you can use * to match any number of characters and ? to match a single character. It is also possible to use ! to negate a pattern. This is useful if you want to replicate everything except a few streams. It should be functionally similar to the equivalent select feature in Meltano.
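As a rough illustration of the matching rules described above, the following sketch filters stream names with fnmatch-style patterns and ! negation. The function name select_streams and the sample stream names are hypothetical, and the rule that an exclusion always wins over an inclusion is an assumption rather than confirmed alto behaviour.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_streams(streams: Iterable[str], patterns: List[str]) -> List[str]:
    """Keep streams that match an include pattern and no '!' exclude pattern."""
    includes = [p for p in patterns if not p.startswith("!")]
    excludes = [p[1:] for p in patterns if p.startswith("!")]
    selected = []
    for name in streams:
        included = any(fnmatchcase(name, p) for p in includes)
        excluded = any(fnmatchcase(name, p) for p in excludes)
        # Assumption: exclusions take precedence over inclusions.
        if included and not excluded:
            selected.append(name)
    return selected


streams = ["public-users", "public-orders", "audit-log"]
print(select_streams(streams, ["public-*", "!public-orders"]))
```

In this sketch a negation on its own selects nothing, so "everything except a few streams" is expressed as "*" followed by the negated patterns.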
PII Hashing
alto has extended the select syntax to allow for PII hashing. It works by prefixing a pattern with ~. This will cause the tap to hash any fields that match the pattern. This is useful if you want to replicate a subset of the data but don't want to replicate any PII.
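The idea can be sketched as follows. This is only an illustration: the source does not specify the hash algorithm or the exact pattern form, so the use of SHA-256, the stream.field pattern shape, and the function name hash_pii are all assumptions.

```python
import hashlib
from fnmatch import fnmatchcase
from typing import Any, Dict, List


def hash_pii(stream: str, record: Dict[str, Any], select: List[str]) -> Dict[str, Any]:
    """Return a copy of the record with fields matching '~' patterns hashed."""
    # Assumption: PII patterns look like '~stream.field' and support globs.
    pii_patterns = [p[1:] for p in select if p.startswith("~")]
    hashed = {}
    for field, value in record.items():
        qualified = f"{stream}.{field}"
        is_pii = any(fnmatchcase(qualified, p) for p in pii_patterns)
        if is_pii and value is not None:
            # Assumption: SHA-256 hex digest of the value's string form.
            hashed[field] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        else:
            hashed[field] = value
    return hashed


record = hash_pii("users", {"id": 1, "email": "a@example.com"}, ["users.*", "~users.email"])
print(record)
```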
metadata
The metadata field is again familiar to users of Meltano. It allows you to mutate the catalog more efficiently. It is a dictionary where each key is a stream name (glob syntax is supported) and the value is a dictionary to merge into the catalog entry. This is useful if you want to change the replication method or key properties of a stream. It is also useful if you want to add custom metadata to a stream. It should be functionally similar to the equivalent metadata feature in Meltano.
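A simplified sketch of that merge is shown below. It assumes a Singer-style catalog with a top-level "streams" list keyed by tap_stream_id and performs a shallow merge on each matching entry. The function name apply_metadata and the sample keys are hypothetical, and the real merge semantics may differ.

```python
import copy
from fnmatch import fnmatchcase
from typing import Any, Dict


def apply_metadata(catalog: Dict[str, Any], metadata: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-stream overrides into every catalog entry whose name matches."""
    result = copy.deepcopy(catalog)  # leave the input catalog untouched
    for entry in result.get("streams", []):
        name = entry.get("tap_stream_id", "")
        for pattern, overrides in metadata.items():
            if fnmatchcase(name, pattern):
                # Assumption: a shallow merge of the override dictionary.
                entry.update(overrides)
    return result


catalog = {
    "streams": [
        {"tap_stream_id": "orders", "replication_method": "FULL_TABLE"},
        {"tap_stream_id": "users", "replication_method": "FULL_TABLE"},
    ]
}
overrides = {"ord*": {"replication_method": "INCREMENTAL", "replication_key": "updated_at"}}
updated = apply_metadata(catalog, overrides)
print(updated["streams"][0])
```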
environment
The environment field is a dictionary of environment variables that will be set when running the tap. It is fully scoped to the tap and will not affect other processes.
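Scoping of this kind is typically achieved by passing a merged environment to the child process rather than mutating the parent's. The sketch below demonstrates that general technique with the standard library. The run_with_env helper and the TAP_DEMO_VAR variable are hypothetical, and this is not necessarily how alto launches taps.

```python
import os
import subprocess
import sys
from typing import Dict, List


def run_with_env(command: List[str], extra_env: Dict[str, str]) -> subprocess.CompletedProcess:
    """Run a command with extra variables visible only to the child process."""
    child_env = {**os.environ, **extra_env}  # the parent's os.environ is not modified
    return subprocess.run(command, env=child_env, capture_output=True, text=True, check=True)


# A tiny child process stands in for the tap and echoes the scoped variable.
result = run_with_env(
    [sys.executable, "-c", "import os; print(os.environ['TAP_DEMO_VAR'])"],
    {"TAP_DEMO_VAR": "hello"},
)
print(result.stdout.strip())
```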
stream_maps
The stream_maps field is a dictionary of stream maps. Stream maps are a way to mutate the JSON objects moving between a tap and a target. They are useful if you want to rename a field or add a field to every record. They are also useful if you want to add custom metadata to every record. Apart from the built-in PII hashing feature, alto leaves it to users to write their own stream maps. This gives users the most flexibility and allows them to create stream maps that are specific to their use case.
Adding a stream map looks like this:
TOML:
[[default.taps.tap-salesforce.stream_maps]]
path = "./path/to/custom_map.py"
select = ["*.*"]

YAML:
default:
  taps:
    tap-salesforce:
      stream_maps:
        - path: ./path/to/custom_map.py
          select: ["*.*"]

JSON:
{
  "default": {
    "taps": {
      "tap-salesforce": {
        "stream_maps": [
          {
            "path": "./path/to/custom_map.py",
            "select": ["*.*"]
          }
        ]
      }
    }
  }
}
You will notice that the select field is similar to the select field at the top level of the tap. This is because stream maps need to be able to be selectively applied. You can alternatively just supply a path and alto will assume that you want to apply the stream map to all streams.
inherit_from
The inherit_from key is a way to inherit config from another tap. This is useful if you have a tap that is very similar to another tap and you want to reuse the config. It supports chaining. For example, if you have a tap named tap-salesforce and a tap named tap-salesforce-rest and you want tap-salesforce-rest to inherit from tap-salesforce, you can do this:
TOML:
[default.taps.tap-salesforce]
pip_url = "tap-salesforce"
config.api_type = "bulk"
config.username = "..."
config.security_token = "..."

[default.taps.tap-salesforce-rest]
inherit_from = "tap-salesforce"
config.api_type = "rest"

YAML:
default:
  taps:
    tap-salesforce:
      pip_url: tap-salesforce
      config:
        api_type: "bulk"
        username: "..."
        security_token: "..."
    tap-salesforce-rest:
      inherit_from: tap-salesforce
      config:
        api_type: "rest"

JSON:
{
  "default": {
    "taps": {
      "tap-salesforce": {
        "pip_url": "tap-salesforce",
        "config": {
          "api_type": "bulk",
          "username": "...",
          "security_token": "..."
        }
      },
      "tap-salesforce-rest": {
        "inherit_from": "tap-salesforce",
        "config": {
          "api_type": "rest"
        }
      }
    }
  }
}
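The inheritance in the example above, including chaining, can be sketched as a recursive deep merge. The function names deep_merge and resolve_tap are hypothetical, and the assumption that nested config dictionaries are merged key by key (rather than replaced wholesale) is inferred from the example, not confirmed.

```python
import copy
from typing import Any, Dict, Tuple


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return base updated with override, merging nested dictionaries."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_tap(name: str, taps: Dict[str, Dict[str, Any]], _seen: Tuple[str, ...] = ()) -> Dict[str, Any]:
    """Resolve a tap's effective config by following inherit_from links."""
    if name in _seen:
        raise ValueError(f"circular inherit_from chain involving {name!r}")
    own = {k: v for k, v in taps[name].items() if k != "inherit_from"}
    parent = taps[name].get("inherit_from")
    if parent is None:
        return copy.deepcopy(own)
    # The child's own keys win over anything inherited from the parent chain.
    return deep_merge(resolve_tap(parent, taps, _seen + (name,)), own)


taps = {
    "tap-salesforce": {
        "pip_url": "tap-salesforce",
        "config": {"api_type": "bulk", "username": "...", "security_token": "..."},
    },
    "tap-salesforce-rest": {
        "inherit_from": "tap-salesforce",
        "config": {"api_type": "rest"},
    },
}
resolved = resolve_tap("tap-salesforce-rest", taps)
print(resolved)
```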
Accents
Alto supports the idea of accents. An accent is a way for a tap to override target config when used in combination with the target. This is useful if you are using a target that supports different load methods and you want a particular tap to use a particular method.
Accents are searched for during configuration rendering based on the tap containing a key of the same name as the target. For example, if you have a tap named tap-salesforce and a target named target-bigquery, an accent may look like this:
TOML:
[default.taps.tap-salesforce]
pip_url = "tap-salesforce"
# This key will override the target config
# when the tap is used with a target matching the key
target-bigquery.denormalized = true

[default.targets.target-bigquery]
pip_url = "z3-target-bigquery"
config.project = "my-project"
config.denormalized = false

YAML:
default:
  taps:
    tap-salesforce:
      pip_url: tap-salesforce
      target-bigquery:
        # This key will override the target config
        # when the tap is used with a target matching the key
        denormalized: true
  targets:
    target-bigquery:
      pip_url: z3-target-bigquery
      config:
        project: "my-project"
        denormalized: false

JSON:
{
  "default": {
    "taps": {
      "tap-salesforce": {
        "pip_url": "tap-salesforce",
        "target-bigquery": {
          "denormalized": true
        }
      }
    },
    "targets": {
      "target-bigquery": {
        "pip_url": "z3-target-bigquery",
        "config": {
          "project": "my-project",
          "denormalized": false
        }
      }
    }
  }
}
This says that when the tap-salesforce tap is used with the target-bigquery target, the denormalized key will be set to true in the target config, for example when running alto tap-salesforce:target-bigquery.
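The override described above can be sketched as a lookup followed by a merge. The function name apply_accent is hypothetical, and the assumption that accent keys are merged directly into the target's config dictionary is drawn from the description rather than from alto's source.

```python
import copy
from typing import Any, Dict


def apply_accent(
    tap_name: str,
    target_name: str,
    taps: Dict[str, Dict[str, Any]],
    targets: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Return the target definition with any accent from the tap applied."""
    target = copy.deepcopy(targets[target_name])
    # An accent is a key on the tap named exactly like the target.
    accent = taps[tap_name].get(target_name, {})
    # Assumption: accent keys override keys in the target's config.
    target.setdefault("config", {}).update(accent)
    return target


taps = {
    "tap-salesforce": {"pip_url": "tap-salesforce", "target-bigquery": {"denormalized": True}},
    "tap-postgres": {"pip_url": "pipelinewise-tap-postgres"},
}
targets = {
    "target-bigquery": {
        "pip_url": "z3-target-bigquery",
        "config": {"project": "my-project", "denormalized": False},
    }
}
accented = apply_accent("tap-salesforce", "target-bigquery", taps, targets)
plain = apply_accent("tap-postgres", "target-bigquery", taps, targets)
print(accented["config"], plain["config"])
```

A tap without a matching key leaves the target config as declared, so the accent only takes effect for the specific tap and target pairing.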