If you’re a B2B developer building a product, one of the earliest product development phases is creating a data integration pipeline to import customer data.
In this article, I’ll show you how to leverage Singer’s tap-salesforce to extract data from Salesforce. From there I’ll walk you through how to parse the JSON output data from Singer using target-csv and standardize it using a simple Python script.
Notes
The code for these examples is available publicly on GitHub here, along with descriptions that mirror the information I’ll walk you through.
These samples rely on a few open source Python packages:
- **tap-salesforce:** a Singer tap to extract data from Salesforce. More info on GitHub.
- **target-csv:** a Singer target which converts input JSON data to CSV files. More info on GitHub. We'll use the hotglue fork, which uses updated dependencies.
- **singer-discover:** an open source utility to select streams from a Singer catalog. More info on GitHub.
- **pandas:** a widely used open source data analysis and manipulation tool. More info on their site and PyPI.
- **gluestick:** a small open source Python package containing util functions for ETL, maintained by the hotglue team. More info on PyPI and GitHub.
Without further ado, let’s dive in!
Step 1: Set up our environment
Create the virtualenv
Singer taps tend to have a lot of dependency conflicts with each other — to avoid dependency hell, I highly recommend running through this example in a virtual environment.
# Install JupyterLab if you don't have it already
$ pip3 install jupyterlab

# Create the virtual env
$ python3 -m venv ~/env/tap-salesforce

# Activate the virtual env
$ source ~/env/tap-salesforce/bin/activate

# Install the dependencies
$ pip install tap-salesforce git+https://github.com/hotgluexyz/target-csv.git gluestick pandas ipykernel singer-python==5.3.1 https://github.com/chrisgoddard/singer-discover/archive/master.zip

# Make our venv available to JupyterLab
$ python -m ipykernel install --user --name=tap-salesforce

# Create a workspace for this
$ mkdir salesforce-integration

# Enter the directory
$ cd salesforce-integration
These commands may vary depending on your OS and Python version. For more info on venvs with Jupyter, check out this TowardsDataScience article.
Step 2: Configure the Singer tap
Get the OAuth credentials
First off, you’re going to need Salesforce OAuth credentials. This process is already well-documented by Salesforce, so I’ll assume you can follow that guide.
Create the Singer tap config
Now we have to create a Singer config. This specifies our OAuth credentials and some Singer-specific settings. The tap's example config has the following format:
{
"client_id": "secret_client_id",
"client_secret": "secret_client_secret",
"refresh_token": "abc123",
"start_date": "2017-11-02T00:00:00Z",
"api_type": "BULK",
"select_fields_by_default": true
}
Fill in your credentials, and save this to a file called config.json in the local directory.
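If you'd rather keep credentials out of files in version control, a short script can assemble config.json from environment variables instead. This is just a convenience sketch; the SF_* variable names (and the placeholder fallbacks) are my own, not anything tap-salesforce itself reads:

```python
import json
import os

# Placeholder fallbacks mirror the example config above -- replace with real credentials
config = {
    "client_id": os.environ.get("SF_CLIENT_ID", "secret_client_id"),
    "client_secret": os.environ.get("SF_CLIENT_SECRET", "secret_client_secret"),
    "refresh_token": os.environ.get("SF_REFRESH_TOKEN", "abc123"),
    "start_date": "2017-11-02T00:00:00Z",
    "api_type": "BULK",
    "select_fields_by_default": True,
}

# Write the config where tap-salesforce expects to find it
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```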
Run Singer discover
The first step of getting data from Salesforce is to figure out what data is actually available. Singer taps offer a discover command which prints a JSON object describing all of this. Let’s run it now:
# Do the Singer discover and save to catalog.json
$ tap-salesforce --config config.json --discover > catalog.json
If this worked successfully, your catalog.json should resemble this:
# Check discover output
$ less catalog.json
{
"streams": [
{
"stream": "UserProvisioningRequestShare",
"tap_stream_id": "UserProvisioningRequestShare",
"schema": {
"type": "object",
"additionalProperties": false,
"properties": {
"Id": {
"type": "string"
},
...
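Because the catalog is plain JSON, you can also inspect the available streams programmatically instead of paging through it with less. A minimal sketch, using an inline fragment as a stand-in for your real catalog.json:

```python
import json

# A tiny stand-in for the discover output shown above
catalog_json = """
{
  "streams": [
    {"stream": "UserProvisioningRequestShare", "tap_stream_id": "UserProvisioningRequestShare"},
    {"stream": "Lead", "tap_stream_id": "Lead"}
  ]
}
"""

catalog = json.loads(catalog_json)

# List the stream (object) names available for syncing
stream_names = [s["stream"] for s in catalog["streams"]]
print(stream_names)
```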
Tell Singer what we want
From here, we want to select what objects we actually want to sync. We’ll use the singer-discover utility we downloaded earlier for this.
# Switch singer-python version to meet singer-discover dep
$ pip install singer-python==5.4.1

# Build our selected catalog
$ singer-discover --input catalog.json --output properties.json
This will launch an interactive utility to select which streams (objects) you want from Salesforce. I am going to select Lead (press space) and hit enter. This will prompt you with the option to select specific fields; I'll accept the defaults and press enter.
This should give you the following output:
? Select fields from stream: `Lead` done (55 selections)
INFO Catalog configuration saved.
Run Singer sync
We can now finally pull the data from Salesforce, using the files we've generated and the following command:
# Get Lead data from Salesforce and save as a CSV
$ tap-salesforce --config config.json --properties properties.json | target-csv > state.json
This will output two files:
- a CSV containing the data from Salesforce (something like Lead-20210128T125258.csv)
- a JSON file, state.json, telling tap-salesforce what it last synced. This can be fed back to tap-salesforce in the future to avoid syncing the same data again.
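On later runs, you can feed the saved state back to the tap so it only pulls records changed since the previous sync; Singer taps accept this via the --state flag. A sketch, assuming the files generated above; note we redirect to a new file first, since writing straight onto state.json would truncate it before the tap reads it:

```shell
# Re-run the sync, only pulling records changed since the last run
$ tap-salesforce --config config.json --properties properties.json --state state.json | target-csv > state_new.json

# Replace the old state once the run succeeds
$ mv state_new.json state.json
```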
Finally! We’ve pulled our data from Salesforce! Not too bad, right? If you wanted to use this in production, you’d have to automate the process of creating the properties.json and likely stick all of this into a Docker container (very similar to how hotglue and Airbyte work).
Step 3: Standardize the data
You can follow along with this part directly in the Jupyter Notebook (feel free to clone and try your own transformations).
You can check out recipes here.
Look at the data
Let’s take a peek at what tap-salesforce gave us.
Not too bad, right? Let’s load the data into a Jupyter Notebook and clean the data up a bit. For this article, I’ll keep it very simple but if you’d like to learn about other ETL operations check out my TowardsDataScience article.
Launch Jupyter
Let’s launch Jupyter
# You may have some issues in Jupyter if you don't do this
$ pip install prompt-toolkit==3.0.14

# Deactivate the virtualenv
$ deactivate

# Start Jupyter Lab
$ jupyter lab
This should start Jupyter in the current directory and open the browser.
JupyterLab started
If all the setup commands worked, you should see tap-salesforce available under the Notebook section. Let’s create a new Notebook with the tap-salesforce kernel. I am going to name mine salesforce.ipynb.
Load the data
Let’s use the gluestick and pandas libraries to load the data and take a look. Our goal here is to be able to easily manipulate the output from tap-salesforce.
# We'll use gluestick to read the data, and pandas to manipulate it
import gluestick as gs
import pandas as pd
# Read input data
input_data = gs.read_csv_folder(".", index_cols={'Lead': 'Id'})
# Take a peek
input_df = input_data['Lead']
input_df.head()
Clean up the data
Now that we have the data in a pandas DataFrame, you can transform it however you like. Of course, you’re not limited to using pandas — you could use Spark or any other Python-based data transformation tool you like.
# Let's only select the columns we want
leads = input_df[["Id", "Name", "Title", "Phone", "Email"]]
leads.head()
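As a sketch of where you might take the cleanup, here is the same column selection plus a couple of extra steps, run against a tiny stand-in DataFrame. The sample rows and the leads_clean.csv filename are illustrative, not part of the original pipeline:

```python
import pandas as pd

# A tiny stand-in for the Lead data loaded above
input_df = pd.DataFrame({
    "Id": ["00Q1", "00Q2", "00Q3"],
    "Name": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
    "Title": ["CTO", None, "Admiral"],
    "Phone": ["555-0100", "555-0101", None],
    "Email": ["ada@example.com", None, "grace@example.com"],
})

# Keep only the columns we want, and drop leads with no email address
leads = input_df[["Id", "Name", "Title", "Phone", "Email"]].dropna(subset=["Email"])

# Write the standardized output for downstream consumers
leads.to_csv("leads_clean.csv", index=False)
print(len(leads))
```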
Conclusion
Next steps
This is really just a starting point for a data integration pipeline. If you’re looking to take this further (orchestrating this on the cloud, connecting it to your product) it’s worth taking a look at tools like hotglue and Meltano, both of which aim to make data integration easier for developers.
Considerations
I recently published an article on TowardsDataScience about the pros and cons of building off Singer. I would recommend checking out Airbyte before committing to building your pipeline on Singer.
Feel free to check out the open source hotglue recipes for more samples in the future. Thanks for reading! I’d love to answer any comments or questions below.