Straight TDD with Spark is perfectly fine if you know what you're doing. I'm not saying it's easy, or that there's an easy guide somewhere, but it's possible.
If you're using PySpark through its API, testing is likely an incredibly important part of your process.
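For context, "straight TDD" here typically means spinning up a local SparkSession inside the test suite, along these lines (a minimal illustrative sketch, not our actual setup):

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # One local session shared across the test run; startup is the slow part.
    session = (
        SparkSession.builder
        .master("local[1]")
        .appName("unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()


def test_dedup(spark):
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "value"])
    assert df.dropDuplicates().count() == 2
```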
Our CI/CD platform and its owners get unhappy if we spawn an ad hoc Spark session for testing purposes.
There is also a general expectation that unit tests are self-contained and portable, so you can execute them on macOS, Linux, and ARM without much effort.
Another point was that we need to make this mocking and test setup easy, because data scientists and ML modellers are the most important personas, and ideally they are the ones writing these tests.
So mocking the data source with an abstraction layer and passing pandas DataFrames worked reasonably well for our use case.
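To make that concrete, here is a minimal sketch of the pattern; the SourceReader protocol, the function names, and the filtering rule are hypothetical, not our actual code:

```python
from typing import Protocol
import pandas as pd


class SourceReader(Protocol):
    """Abstraction over 'where the data comes from'."""
    def read(self, table: str) -> pd.DataFrame: ...


def keep_positive_orders(reader: SourceReader) -> pd.DataFrame:
    # Business logic only ever sees DataFrames, never a SparkSession or a live connection.
    orders = reader.read("orders")
    return orders[orders["amount"] > 0].reset_index(drop=True)


class FakeReader:
    """In-memory stand-in used by the unit tests."""
    def __init__(self, tables: dict[str, pd.DataFrame]):
        self._tables = tables

    def read(self, table: str) -> pd.DataFrame:
        return self._tables[table]


def test_keep_positive_orders():
    reader = FakeReader({"orders": pd.DataFrame({"amount": [10.0, -1.0]})})
    result = keep_positive_orders(reader)
    assert list(result["amount"]) == [10.0]
```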
When the tests pass, we switch the execution backend from DuckDB to Spark. This decouples testing Spark pipelines from the SparkSession and the underlying infrastructure, which saves a lot of compute during iteration.
This setup requires an abstraction layer that makes the SQL execution platform-agnostic and the data sources mockable. We use the open source Fugue layer to define the business logic once and have it run on both DuckDB and Spark.
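A rough sketch of that Fugue pattern is below; the function and column names are illustrative, and the exact transform() signature may differ across Fugue versions, so check the Fugue docs:

```python
import pandas as pd
from fugue import transform


def keep_positive(df: pd.DataFrame) -> pd.DataFrame:
    # Business logic written once, against plain pandas.
    return df[df["amount"] > 0]


input_df = pd.DataFrame({"amount": [10.0, -1.0]})

# Unit tests: execute on the lightweight DuckDB engine, no SparkSession needed.
local_result = transform(input_df, keep_positive, schema="*", engine="duckdb")

# Production: hand the same function to a Spark-backed engine instead, e.g.
#   spark = SparkSession.builder.getOrCreate()
#   spark_result = transform(spark_df, keep_positive, schema="*", engine=spark)
```

Because the logic is an ordinary Python function over pandas DataFrames, the mocked inputs from the previous sketch plug straight into it.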
It is also worth noting that FugueSQL's roadmap includes support for warehouses like BigQuery and Snowflake, so eventually you will be able to unit test SQL logic locally and then bring it to BigQuery/Snowflake when ready.
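For the SQL side, the FugueSQL flavour of the same idea looks roughly like this; the yield syntax and the fsql entry point are written from memory, so verify them against the Fugue documentation for your version:

```python
import pandas as pd
from fugue_sql import fsql

orders = pd.DataFrame({"amount": [10.0, -1.0]})

query = """
SELECT * FROM orders WHERE amount > 0
YIELD DATAFRAME AS positive
"""

# Unit test: run the SQL on DuckDB against an in-memory pandas DataFrame.
result = fsql(query, orders=orders).run("duckdb")
print(result["positive"].as_pandas())

# Later, the same query can be pointed at a Spark engine, e.g. .run(spark_session).
```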
For more information, there is this talk from PyData NYC (see the SQL testing part): https://www.youtube.com/watch?v=yQHksEh1GCs&t=1766s
Fugue project repo: https://github.com/fugue-project/fugue/
Being able to know the true health of a service is an absolute godsend.
So many times a service had been dead for hours before anyone noticed. Well, our customers noticed, but the report had to funnel up the pipeline from customer to support to engineering before we were made aware of a real issue.
Nothing says good PR like being dead in the water for half a day and having no idea.
This also holds for services that have internal clients. In other words, if your output is consumed only by other services within the same company, the same high monitoring standards must apply. Otherwise failure detection becomes very delayed and the productivity of many teams suffers. There is no worse buzzkill than having to explain to other service owners what is wrong with their application.
One other important lesson we have learned is that alerts require time to mature. The thresholds need to be tuned and the alert formulation revised. Our alerts usually give a couple of false positives in the first two weeks after their creation, and during those two weeks we frequently refine the alert conditions.
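As an illustration of what "maturing" an alert can mean in practice, here is a hypothetical thresholded check where both the threshold and the firing condition are cheap to revise; this is a sketch, not our actual alerting stack:

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str
    threshold: float       # typically revised several times in the first weeks
    min_breaches: int = 3  # require consecutive breaches to cut down false positives


def should_fire(rule: AlertRule, recent_values: list[float]) -> bool:
    # Fire only when the last `min_breaches` samples all exceed the threshold.
    window = recent_values[-rule.min_breaches:]
    return len(window) == rule.min_breaches and all(v > rule.threshold for v in window)


# Example: an error-rate alert that tolerates a single noisy sample.
rule = AlertRule(metric="error_rate", threshold=0.05)
print(should_fire(rule, [0.02, 0.07, 0.09, 0.12]))  # True
print(should_fire(rule, [0.02, 0.07, 0.01, 0.12]))  # False
```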