The Visitor

Though looking up at the clear night sky and my memories the same for a moment it seems that I remember whence I came But these visions are far between I feel desperate, gasping for time and as I…

独家优惠奖金 100% 高达 1 BTC + 180 免费旋转

Data Versioning for Customer Reports Using DataBricks and lakeFS

Similarweb offers an AI-based market intelligence platform that helps monitor web and mobile app traffic. The company gives a global multi-device market intelligence to understand, track, and grow digital market share. We analyze data from 3 Million applications and 80 Million websites to support thousands of clients ranging from small businesses to global enterprises to gain insight into any website’s statistics and strategy.

We’ve been using DataBricks since 2015. Initially, as a means to reduce DataOps efforts in enabling a spark environment for data scientists. It also supported the migration of code from research to production. Over the years we have shifted to running production workloads on DataBricks.

The data services group at Similarweb is responsible for producing data reports customized to the client’s requirements. These reports include competitive analysis of funnel conversion, competitive lost opportunities reports, etc.

To perform the analysis, we rely on data generated by other groups at Similarweb and data that we deduce from publicly available information. The data we use is constantly changing as websites constantly evolve, algorithms improve and our training sets continue to grow. Since a report is provided using the best data we have at the time it was produced, consecutive reports may use different data sets as input. This is expected, but also requires a good system to maintain the versions of data used to create a report, to ensure auditing, reproducibility, and troubleshooting.

Similarweb’s internal reporting application is named 360. It has a front-end interface to configure the report’s parameters and their scheduling. 360 manages a queue of jobs on Redis that are executed by workers who initiate a DataBricks cluster for the analysis. The DataBricks jobs run over an S3 bucket. The input is read from S3 and the output is written to S3 and can be consumed through the 360 application interface. We also maintain and run a daily airflow DAG that does all our pre-processing. The DAG tasks are also executed on a DataBricks cluster that is spun up for that purpose. Input data is read from S3 buckets owned by other groups, and the output is written to our data bucket to be consumed by the 360 application reports.

For data versioning and manageability we decided to test lakeFS, an open-source tool that provides git-operations over S3. It’s compatibility and seamless integration with DataBricks was an important consideration.

When we run a report for a customer, it is usually a recurring report produced once a day, a week or a month. The report depends on input data that can be updated by backfills or replaced due to accuracy improvements. For example, the analysis code can change to improve accuracy or fix bugs, and the complementary data may change due to the evolution of digital assets analyzed. . Consequently, when we run a report we want the ability to follow versions of code, input data, and output results.

Introducing data versioning capabilities provides us with the following:

For every report issued a lakeFS commit ID is added. This ID marks the report version. We can access any version of the report our retention policy allows us, and analyze it if customers require clarification or have follow up questions.

We can time travel between different versions of the report and for each version we have not only the report itself, but also a snapshot of our data repository at the time the report was calculated. The commit metadata includes the code version we used. We can now run the report again and receive the same result. We can open a lakeFS branch from that commit and test a new code version on the same input data, etc’

For some of the industries we serve, being able to present the data used for the analysis is required as part of the business audit. With lakeFS we can easily retain this information and share it upon request.

As a first phase we decided to test a single Spark job, running on DataBricks 7.6 and using lakeFS. It was as easy as advertised.

Afterwards, we could use our existing code to read data from lakeFS. The only necessary modification was to include the lakeFS branch name in the paths. Here is how we created a Spark DataFrame from our lakeFS data:

df = spark.read.parquet(“s3a://example-repo/example-branch/example.parquet”)

We were now able to operate on this DataFrame:

results = df.groupBy(“example-column”).count()

Then we wrote the results back to lakeFS:

results.write.parquet(“s3a://example-repo/example-branch/output/”)

The results now appeared as uncommitted changes on our lakeFS repository:

We could then commit these changes. After verifying the results, we merged our branch into our main branch, making them visible to everyone.

Since we had not yet refactored the Python code to version 3.0, we couldn’t upgrade DataBricks for that pipeline, but we still wanted lakeFS to work.

To solve this problem, the team at lakeFS implemented a fallback mechanism. For a given request, this feature checks whether the request refers to an existing repository in lakeFS, forwarding it to S3 if it doesn’t. This way we could simply set the S3 endpoint to the lakeFS API, which would handle only relevant requests, forwarding the rest to S3.

The Visitor

Data Versioning for Customer Reports Using DataBricks and lakeFS

Add a comment

Related posts:

Machine Learning 101

Is Anxiety and Depression the same?

Growth In Crisis is not an option