Hey HN!
We're Max and Thibault, building OpenStatus.dev, an open-source synthetic monitoring platform with incident management.
1 min demo: https://twitter.com/mxkaske/status/1685666982786404352
We just reached 2,000 stars on GitHub:
https://github.com/openstatusHQ/openstatus
We are really excited to hear your feedback/questions and connect further: our emails are max@openstatus.dev and thibault@openstatus.dev.
Thank you!
I previously founded a synthetic monitoring startup, devraven.io.
Just sharing my experience: monitoring is brutally competitive. From my conversations, most large enterprises do very little synthetic monitoring; they use DDOG or other APM tools and don't want to try any new tools for a few thousand dollars in savings. In a lot of cases they are comfortable with their custom test frameworks built on Selenium. Some are even worried that setting up synthetic monitoring will bring down their environment or trash their database with junk data ::sigh::
Most smaller companies we spoke to weren't mature enough to have monitoring and didn't have people who could set it up. They used to ask us for help building tests for them. Asks for discounts on a $29.99/mo price point were not uncommon.
After a few months of operating the product, we did find a few angels who were interested in investing in us (not the product). But in the end, we didn't feel we could make good use of investor money and provide a decent return to them, so we backed out of the investment and chose to shut down the product.
[1] https://github.com/louislam/uptime-kuma
Why did you end up going with a SaaS model? 30 Euros or $31.50 USD is pretty expensive for something like a status site. You'd have a lot less to manage day to day and be able to focus more on innovating the product if you just sold the software, imo.
Why the focus on synthetic monitoring? As an SRE, I actively eschew synthetic monitoring. It's highly error prone and doesn't actually indicate regional availability. I'd like a status site that I could push an internally derived SLA for a given service to, with the status page reflecting the average of that windowed SLA over time.
SLAs are intended to incur customer refunds when they're violated, if they're meaningful. If your synthetic monitoring shows an SLA of 4 nines but it was actually closer to 4.8 or 4.9, then you could be on the hook for causing your customers a good bit of legal pain. Just something to think about in this space.
Other status sites don't build external SLAs off of internal metrics because the process of deriving internal metrics that align with external outcomes is sufficiently difficult. Instead, they calculate an SLA based off of posted statuses over a period of time eg: Degraded, Down, Up. Supporting both modes could be a boon to potential customers.
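To make the status-derived mode concrete, here's a minimal sketch of computing an uptime percentage from posted status events over a window. All names here are invented for illustration, not OpenStatus code:

```typescript
type Status = "up" | "degraded" | "down";

interface StatusEvent {
  status: Status;
  start: number; // epoch ms
  end: number;   // epoch ms
}

// Derive an SLA figure purely from manually posted statuses, the way
// many status pages do. Only "down" windows count against uptime here;
// counting "degraded" at partial weight is a policy choice some pages make.
function uptimePercent(events: StatusEvent[], windowMs: number): number {
  const downMs = events
    .filter((e) => e.status === "down")
    .reduce((sum, e) => sum + (e.end - e.start), 0);
  return (100 * (windowMs - downMs)) / windowMs;
}
```

With a 30-day window and 43 minutes of posted downtime, this comes out to roughly 99.9%.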
Overall looks like a great start; good luck on your venture!
As an end user, hard disagree.
GitHub is a great example of this. Their status almost always shows 100% uptime while the service is entirely unstable.
It is clear that their uptime SLAs do not align with end user experience.
As an end user, I care whether I can access and use the service. I don't care what broke in between.
I agree with you that the ultimate value is in customer impact. I was saying "that's hard", and synthetic monitoring is not the solution because it doesn't achieve what it sounds like it achieves.
However, if I read your comment carefully, you are suggesting an alternative where the company (owner) could manually decide when a system is down or up. If that's the case, wouldn't the status page just be a page template where someone logs into a panel, toggles a button to say "down" or "up", and posts updates? If there is no automatic monitoring, the service would look more like a blog/tumblr/twitter than anything else.
Or maybe I'm missing something because of my lack of experience; I'm curious and would like to know!
Now combine all of the above with a client that has retry capabilities. That client could be a modern web app or a desktop app. Eventually consistent systems often rely on retry behavior and rate limiting to achieve smooth user transitions. Now I can't simply rely on 500s being sent because they may indicate a timeout or a caching problem. Now I need to rely on statistics on specific endpoints that will definitely result in a user facing error. Collecting that in real-time (real-time enough for alerting, anyway) is challenging as a company at that scale could be dealing with an abundance of requests per second.
When SREs get into an incident they'll often try to determine customer impact in order to know what hemorrhaging to stop first. Looking at a list of 500s in a system like that is often unhelpful, so we'll build dashboards of specific endpoints that show a level of degradation, eg: "Show me all requests that did not have 2xx where the number of retries is 3". In my contrived example the client shows an error after the third exponential retry. If you calculate availability purely off of the number of 500s, you're not actually calculating customer impact, you're calculating the number of errors. That said, it's easier said than done to build a data system that can answer a query like that, much less export it. So in order to provide accurate information the status site is updated manually.
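The retry-aware filter described above could be sketched like this. Field names are invented for illustration, and a real system would run this as a streaming aggregation over a huge request volume rather than an in-memory filter:

```typescript
interface RequestLog {
  endpoint: string;
  status: number;  // HTTP status code
  retries: number; // retries the client performed before giving up
}

// Requests that exhausted their retry budget (3 here) and still failed:
// these are the ones a user actually saw as an error, as opposed to
// transient 500s that a retry papered over.
function customerFacingErrors(
  logs: RequestLog[],
  maxRetries = 3
): RequestLog[] {
  return logs.filter(
    (r) => (r.status < 200 || r.status >= 300) && r.retries === maxRetries
  );
}
```

Counting `customerFacingErrors(logs).length` against total requests gives an impact number; counting every non-2xx just gives an error count.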
On the flip side of what you described, some errors don't have a statistic. For instance, if I force rotate everyone's password and kill logins then I might post that on the status site as well. If it's the result of a security action or vulnerability I might declare the service degraded for a period of time.
Tbh we hadn't thought about SLA violations.
For regional availability we are planning to add multi-region checks per monitor.
At the moment you can only set one region per monitor.
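For illustration, a multi-region check could be as simple as fanning the same probe out per region. This is a hypothetical sketch, not the OpenStatus implementation; all names are invented, and in production each probe would run from inside its region (e.g. on an edge runtime) rather than just being labeled with one:

```typescript
type FetchLike = (url: string) => Promise<{ ok: boolean }>;

interface CheckResult {
  region: string;
  ok: boolean;
  latencyMs: number;
}

// Run one probe and tag the result with its region label.
async function probe(
  url: string,
  region: string,
  fetchFn: FetchLike
): Promise<CheckResult> {
  const start = Date.now();
  try {
    const res = await fetchFn(url);
    return { region, ok: res.ok, latencyMs: Date.now() - start };
  } catch {
    return { region, ok: false, latencyMs: Date.now() - start };
  }
}

// One monitor, many regions: run the probes in parallel and collect
// one result per region.
function multiRegionCheck(
  url: string,
  regions: string[],
  fetchFn: FetchLike
): Promise<CheckResult[]> {
  return Promise.all(regions.map((r) => probe(url, r, fetchFn)));
}
```

The injected `fetchFn` keeps the sketch transport-agnostic; per-region results then let the status page show availability per region instead of a single global number.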
Convenience.
More companies want Datadog, etc. than want to manage Datadog, etc.
This project relies on 4 different paid services. Why?
Why do you need a SaaS to handle your auth, mailing, database, and logging?
Aren't there libraries for these things in TypeScript? Why pay for them?
In 2018-2020 my company (an FMCG company) asked me to temporarily lead the IT & product teams of 2 startups that it had invested in (as majority shareholder) a few years before. One is a telemedicine company and the other is an e-commerce company.
Both of them ran almost all of their auth, DB, etc. on SaaS products from other obscure, newly founded startups.
After a few meetings I realized that these startups were "guided" by the VC to use the services of other startups the VC had invested in, and in return those other startups would (must) use our service (telemedicine) for their employees.
That way, all of these startups can claim the monthly active users and the companies that use their products; we also get the topline revenue, and those numbers are then included in a pitch deck for the next round of investment.
To top that off, for the telemedicine company I also got a KPI to hire 200 programmers so we could include that number in the pitch deck too. In 2 years, I got 3 talented ones and fewer than 30 who could code FizzBuzz or simple CRUD (in their language of choice).
Turso: Has an insanely large free tier and means no need to run your own DB (though you can run your own SQLite locally); their free tier was even just drastically expanded.
Clerk: 5,000 free users, and not having to deal with your own authentication.
Resend: Avoids dealing with and managing mail, spam filtering, etc. I don't know if they allow just using an internal SMTP server, but it seems OK given 3,000 mails per month.
Tinybird I don't know enough about, but it also has a free plan...
So mostly I'd imagine these aren't about paying for third-party platforms; it's about offloading tasks you don't want to worry about implementing yourself, and that also gives you the ability to scale beyond the small initial deployment, at a cost.
There are hundreds of auth libraries out there that you can use. Not one of them charges you per user lol. We've been doing this for decades. Why are we now paying companies to do it for us?
The same can be said about mailing, logging, and databases. I've spent decades building web applications, and not once was it hard to implement these features using libraries.
In fact it's easier than ever with the tooling we have today.
No wonder 99% of startups are losing money and going out of business. They're giving all their money away to the few that survive lol.
I guess the TypeScript people don't appreciate frameworks like Rails, Django, and Phoenix that implement all these features for you lol.
There's absolutely no reason to require SaaS to handle database & logging, but:
1. For mail, in 2023, it's a de facto requirement for any app. Sure, you can do it yourself, but handling spam filters will be a challenge. Defaulting to SaaS on this is extremely defensible.
2. For auth, in 2023, rolling your own auth that is secure & offers decent MFA is a similarly daunting task. Would it be nice if they offered an optional local auth backend? Maybe. Would it be nicer if they offered a choice of multiple SaaS backends? Definitely. But it's ultimately pretty defensible.
3. It seems to me the DB can be local SQLite / libsqld (which looks primarily aimed at dev environments, but at least it's an option).
---
On aggregate though you're right, this does seem excessively SaaS-y.
OP, you might want to do a PR here: https://github.com/ivbeg/awesome-status-pages
Everyone else might be interested in that list of similar projects.
I think your core offering is around status tracking and stakeholder notification. However, you're also pulling in monitoring/APM by running your own status checks, for example. I would expect any paying customer to already have monitoring and alerting of some type: New Relic, DataDog, Amazon CloudWatch Synthetics, etc. Wouldn't your customers want to use their own existing metrics for SLOs, or existing alarms & alerts for incident detection? Similarly, it seems like you're implementing alerting/engagement as well. Are you asking your customers to reimplement their PagerDuty/OpsGenie/VictorOps configuration? There's a lot of organisational inertia around the business processes that define alerting & engagement. I haven't looked at user base numbers in a long time, but I would guess the vast majority of your target customers are already using one of those three.
If I were to guess, initial adoption would be aided by "ease of use", particularly integration with the customer's existing tools & processes. Then differentiation and value come from what those existing monitoring/alerting tools can't do, eg alternative data sources (APM vs RUM), automated/predefined response, approval processes, customized visibility & communication per client, etc.
disclosure: Principal at AWS. Comments are my own personal opinion, based on public information only.