Data Engineering

Airbyte: The Open-Source Alternative to Proprietary Data Integration Solutions
Sangita Paul
March 17, 2023

In today's data-driven world, organizations gather vast amounts of data from many sources, but data scattered across different systems and departments is difficult to access and use effectively. A robust data integration strategy is needed to collect and store data in a centralized location. This can be accomplished by writing custom code against APIs, using ETL tools, or implementing database replication.

Scenario prior to the introduction of Airbyte:

Before Airbyte emerged, organizations used various methods for data integration. Some of the most frequently used methods include:

• ETL tools:

ETL tools were commonly used for data integration, but these GUI-based tools for building integration workflows required costly software licenses and infrastructure. Examples include Talend, Informatica, IBM InfoSphere DataStage, and Pentaho.

• APIs:

APIs can be used to extract data programmatically. But creating and managing custom integrations with APIs can be a time-consuming process that may require continual maintenance.

• Custom code:

Companies also built custom data integration solutions by writing code to extract data from one system and transfer it to another. This approach required a lot of development effort and ongoing maintenance.

Each of these approaches had its own limitations. To address them, newer data integration tools such as Airbyte, Fivetran, and others were launched.

What is Airbyte?

Airbyte is an open-source platform that enables hassle-free, reliable data transfer between different systems, with flexibility and ease of use. With this tool, you can link your source and destination systems through its UI in a few clicks. A key benefit of Airbyte is its ability to connect to a diverse range of data sources and destinations. It supports popular databases (MySQL, PostgreSQL) and cloud storage solutions (Amazon S3, Google Cloud Storage). Along with this range of connectivity options, Airbyte offers various features that make it a strong data integration solution.

Data Ingestion in Airbyte:

The process of ingesting data in Airbyte consists of the following steps:

1. Establish a connection to the data source from which the data is to be extracted.

2. Extract data from the source using specific connectors.

3. Transform the data to fit the destination schema.

4. Establish a connection to the destination database where the data will be inserted.

5. Insert the transformed data into the destination database.

6. Validate the data to ensure accuracy and completeness.

Overall, Airbyte makes data ingestion easy and flexible, allowing users to extract data from different sources and integrate it into their preferred destination systems.
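
The six steps listed above can be sketched in a few lines of Python. This is only an illustration of the flow, not Airbyte's implementation: the source is a hypothetical in-memory list of records, the destination is an in-memory SQLite database standing in for a real warehouse, and all function and table names are assumptions made for the example.

```python
import sqlite3

# Hypothetical records standing in for what a source connector would emit.
SOURCE_ROWS = [
    {"Id": "001", "LastName": "Rao", "Email": "rao@example.com"},
    {"Id": "002", "LastName": "Sen", "Email": None},
]


def extract(rows):
    """Steps 1-2: connect to the source and read its records."""
    yield from rows


def transform(record):
    """Step 3: map source field names onto the destination column order."""
    return (record["Id"], record["LastName"], record["Email"])


def load(conn, records):
    """Steps 4-5: create the destination table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lead "
        "(id TEXT PRIMARY KEY, last_name TEXT, email TEXT)"
    )
    conn.executemany("INSERT INTO lead VALUES (?, ?, ?)", records)
    conn.commit()


def validate(conn, expected_count):
    """Step 6: check that every extracted record reached the destination."""
    (count,) = conn.execute("SELECT COUNT(*) FROM lead").fetchone()
    return count == expected_count


def run_pipeline(rows):
    conn = sqlite3.connect(":memory:")  # stand-in for the destination database
    records = [transform(r) for r in extract(rows)]
    load(conn, records)
    return conn, validate(conn, len(rows))


conn, ok = run_pipeline(SOURCE_ROWS)
print("sync complete and validated:", ok)
```

In Airbyte, the extract and load steps are handled by source and destination connectors, so none of this code has to be written or maintained by hand.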

The use case:

We tackled a straightforward scenario where data resided in a Salesforce account, and our objective was to migrate it to a PostgreSQL database.

Set up an Airbyte Salesforce Source:

• Create a Salesforce source.

• Enter a unique source name, the Client ID and Client Secret (both provided by Salesforce), and a Refresh Token (generated through the API).
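
Before entering these credentials in Airbyte, it can help to confirm that they work together. The sketch below builds, but does not send, a request to Salesforce's standard OAuth 2.0 token endpoint using the refresh-token grant. The function name and the placeholder values are assumptions for illustration, and a sandbox org would typically use the test.salesforce.com host instead.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Standard Salesforce OAuth 2.0 token endpoint for production orgs.
TOKEN_URL = "https://login.salesforce.com/services/oauth2/token"


def build_token_request(client_id, client_secret, refresh_token):
    """Build a POST request that exchanges a refresh token for an access token."""
    body = urlencode(
        {
            "grant_type": "refresh_token",
            "client_id": client_id,
            "client_secret": client_secret,
            "refresh_token": refresh_token,
        }
    ).encode("ascii")
    return Request(TOKEN_URL, data=body, method="POST")


# Placeholder values; substitute the real credentials from your Salesforce app.
request = build_token_request("my-client-id", "my-client-secret", "my-refresh-token")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; a successful JSON response containing an access token suggests the three values are valid.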

Set up an Airbyte Postgres destination:

• Create a Postgres destination to receive the data from the source.

• Provide the parameters required to connect to the destination.
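
The post does not list the destination parameters, so the sketch below assumes the usual PostgreSQL connection details (host, port, database, schema, username, password). The class and its helper methods are hypothetical; they only show one way to check a configuration for gaps and avoid printing the password before typing the values into the Airbyte form.

```python
from dataclasses import asdict, dataclass


@dataclass
class PostgresDestination:
    """Assumed connection parameters for a Postgres destination."""

    host: str
    database: str
    username: str
    password: str
    port: int = 5432        # PostgreSQL's default port
    schema: str = "public"  # PostgreSQL's default schema

    def missing_fields(self):
        """Return the names of parameters that are empty or unset."""
        return [name for name, value in asdict(self).items() if value in ("", None)]

    def redacted(self):
        """Return the configuration with the password masked, safe for logging."""
        config = asdict(self)
        config["password"] = "********"
        return config


# Placeholder values for illustration only.
destination = PostgresDestination(
    host="localhost", database="analytics", username="airbyte", password="change-me"
)
print(destination.missing_fields())
print(destination.redacted())
```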

Adding Connection: Salesforce --> Postgres:

Once the source and destination are set up, we create a connection between Salesforce and PostgreSQL in the Airbyte interface. There, we enable the streams that we want to sync from Salesforce to Postgres and select the replication frequency, which determines how often the streams sync.
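
Conceptually, a connection ties together a source, a destination, a set of enabled streams, and a replication frequency. The sketch below models that idea with a hypothetical class; it does not use Airbyte's API, and the stream names and the 24-hour frequency are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ConnectionSketch:
    """Hypothetical model of a source-to-destination connection."""

    source: str
    destination: str
    frequency: timedelta                          # how often the streams sync
    streams: dict = field(default_factory=dict)   # stream name -> enabled?

    def enabled_streams(self):
        """Return the streams that will be synced, in alphabetical order."""
        return sorted(name for name, enabled in self.streams.items() if enabled)

    def next_sync(self, last_sync):
        """Return when the next sync is due, given the previous sync time."""
        return last_sync + self.frequency


connection = ConnectionSketch(
    source="Salesforce",
    destination="Postgres",
    frequency=timedelta(hours=24),
    streams={"Lead": True, "Account": False},
)
print(connection.enabled_streams())
print(connection.next_sync(datetime(2023, 3, 17, 9, 0)))
```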

Extracted Data size and time taken:

From Salesforce, we synced the Lead table: 3.22 GB of data was extracted and 463,909 records were committed to Postgres in 1 hour and 19 minutes.
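
To put these figures in perspective, the short calculation below derives the average throughput and record size from the numbers reported above. It assumes that 1 GB means 10^9 bytes, which may not match how the Airbyte UI reports sizes.

```python
# Figures reported for the Salesforce -> Postgres sync.
SYNC_BYTES = 3.22e9                 # 3.22 GB, assuming 1 GB = 10**9 bytes
RECORDS = 463_909
DURATION_S = (1 * 60 + 19) * 60     # 1 hour 19 minutes in seconds

records_per_second = RECORDS / DURATION_S
mb_per_second = SYNC_BYTES / 1e6 / DURATION_S
avg_record_bytes = SYNC_BYTES / RECORDS

print(f"{records_per_second:.1f} records/s")
print(f"{mb_per_second:.2f} MB/s")
print(f"{avg_record_bytes / 1000:.1f} KB per record")
```

That works out to roughly 98 records per second, about 0.68 MB per second, and an average of about 6.9 KB per record.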

Why organizations should consider Airbyte as their data integration platform:

• Versatile and scalable architecture:

  Airbyte's architecture is versatile and scalable, making it capable of handling a variety of data sources and use cases.

• Efficient integration of data from various sources:

  Airbyte simplifies multi-source data integration, saving time and resources over manual methods.

• Seamless connectivity:

  The platform simplifies the connection to multiple data sources, data extraction, transformation, and loading into a target warehouse.

• Enhanced management and protection:

  Airbyte simplifies management and enhances security by using a service account as the project owner.

To further enhance security and management in Airbyte for organizations, the following practices can be implemented:

1. Encryption at rest:

   By using a dedicated secrets store (KMS) rather than database storage, Airbyte ensures that credentials are encrypted and stored independently of Airbyte instances.

2. Encryption in transit:

   Airbyte implements HTTPS security across all its services.

3. SOC 2 Type II assessment:

   Airbyte has undergone an independent third-party SOC 2 Type II assessment and has affirmed its commitment to its Security and Data Privacy Policy.

4. Network Security:

   To prevent unauthorized access to its systems and data, Airbyte employs a strong set of network security measures that includes firewalls and intrusion detection and prevention systems.

5. Access Controls:

   Airbyte makes sure that only authorized people can access data by using strict access controls, including multi-factor authentication, role-based access control, and auditing of user activity. Note, however, that Airbyte Open Source does not currently have any user management or role-based access controls (RBAC) in place to prevent unauthorized access to its API or UI.

6. Data privacy policies:

   Airbyte has developed extensive data privacy policies that define the procedures for collecting, using, and storing data.

• Easy integration with external tools:

  Airbyte offers a convenient solution for organizations dealing with several data repositories, with easy integration into other tools and systems.

With these features and capabilities, Airbyte empowers organizations to streamline their data integration processes, boost data quality, and achieve their business goals.

We hope you found our blog post informative. If you have any project inquiries or would like to discuss your data and analytics needs, please don't hesitate to contact us. We're here to help! Thank you for reading.