Hey HN!
We're Max and Thibault, building OpenStatus.dev, an open-source synthetic monitoring platform with incident management.
1 min demo: https://twitter.com/mxkaske/status/1685666982786404352
We just reached 2,000 stars on GitHub:
https://github.com/openstatusHQ/openstatus
We are really excited to hear your feedback/questions and connect further: our emails are max@openstatus.dev and thibault@openstatus.dev.
Thank you!
I previously founded a synthetic monitoring startup, devraven.io.
Just sharing my experience: monitoring is brutally competitive. From my conversations, most large enterprises do very little synthetic monitoring; they use DDOG or other APM tools and don't want to try any new tools for a few thousand dollars in savings. In a lot of cases they are comfortable with their custom test frameworks built on Selenium. Some are even worried that setting up synthetic monitoring will bring down their environment or trash their database with junk data ::sigh::
Most smaller companies we spoke to weren't mature enough to have monitoring and didn't have people who could set it up. They used to ask us for help building tests for them. Asks for discounts on a $29.99/mo price point were not uncommon.
After a few months of operating the product, we did find a few angels who were interested in investing in us (not the product). But in the end, we didn't feel we could make good use of investor money and provide a decent return to them, so we backed out of the investment and chose to shut down the product.
[1] https://github.com/louislam/uptime-kuma
Why did you end up going with a SaaS model? 30 Euros or $31.50 USD is pretty expensive for something like a status site. You'd have a lot less to manage day to day and be able to focus more on innovating the product if you just sold the software, imo.
Why the focus on synthetic monitoring? As an SRE, I actively eschew synthetic monitoring. It's highly error prone and doesn't actually indicate regional availability. I'd like a status site that I could push an internally derived SLA for a given service to, with the status page reflecting the average of that windowed SLA over time.
SLAs are intended to incur customer refunds when they're violated, if they're meaningful. If your synthetic monitoring shows an SLA of 4 nines but it was actually closer to 4.8 or 4.9, then you could be on the hook for causing your customers a good bit of legal pain. Just something to think about in this space.
Other status sites don't build external SLAs off of internal metrics because the process of deriving internal metrics that align with external outcomes is sufficiently difficult. Instead, they calculate an SLA based off of posted statuses over a period of time eg: Degraded, Down, Up. Supporting both modes could be a boon to potential customers.
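To make the status-derived mode concrete, here's a minimal sketch of computing an uptime percentage from posted status events over a window. All names here are invented for illustration, not OpenStatus code:

```typescript
type Status = "up" | "degraded" | "down";

interface StatusEvent {
  status: Status;
  start: number; // epoch ms
  end: number;   // epoch ms
}

// Derive an SLA figure purely from manually posted statuses, the way
// many status pages do. Only "down" windows count against uptime here;
// counting "degraded" at partial weight is a policy choice some pages make.
function uptimePercent(events: StatusEvent[], windowMs: number): number {
  const downMs = events
    .filter((e) => e.status === "down")
    .reduce((sum, e) => sum + (e.end - e.start), 0);
  return (100 * (windowMs - downMs)) / windowMs;
}
```

With a 30-day window and 43 minutes of posted downtime, this comes out to roughly 99.9%.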
Overall looks like a great start; good luck on your venture!
As an end user, hard disagree.
GitHub is a great example of this. Their status almost always shows 100% uptime while the service is entirely unstable.
It is clear that their uptime SLAs do not align with end user experience.
As an end user, I care whether I can access and use the service. I don't care what broke in between.
I agree with you that the ultimate value is in customer impact. I was saying "that's hard", and synthetic monitoring is not the solution because it doesn't achieve what it sounds like it achieves.
However, if I read your comment carefully, you are suggesting an alternative where the company (owner) could manually decide when a system is down or up. If that's the case, wouldn't the status page just be a page template where someone logs into a panel, toggles a button to say "down" or "up", and posts updates? If there is no automatic monitoring, the service would look more like a blog/tumblr/twitter than anything else.
Or maybe I'm missing something because of my lack of experience; I'm curious and would like to know!
Now combine all of the above with a client that has retry capabilities. That client could be a modern web app or a desktop app. Eventually consistent systems often rely on retry behavior and rate limiting to achieve smooth user transitions. Now I can't simply rely on 500s being sent because they may indicate a timeout or a caching problem. Now I need to rely on statistics on specific endpoints that will definitely result in a user facing error. Collecting that in real-time (real-time enough for alerting, anyway) is challenging as a company at that scale could be dealing with an abundance of requests per second.
When SREs get into an incident they'll often try to determine customer impact in order to know what hemorrhaging to stop first. Looking at a list of 500s in a system like that is often unhelpful, so we'll build dashboards of specific endpoints that show a level of degradation, eg: "Show me all requests that did not have 2xx where the number of retries is 3". In my contrived example the client shows an error after the third exponential retry. If you calculate availability purely off of the number of 500s, you're not actually calculating customer impact, you're calculating the number of errors. That said, it's easier said than done to build a data system that can answer a query like that, much less export it. So in order to provide accurate information the status site is updated manually.
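The retry-aware filter described above could be sketched like this. Field names are invented for illustration, and a real system would run this as a streaming aggregation over a huge request volume rather than an in-memory filter:

```typescript
interface RequestLog {
  endpoint: string;
  status: number;  // HTTP status code
  retries: number; // retries the client performed before giving up
}

// Requests that exhausted their retry budget (3 here) and still failed:
// these are the ones a user actually saw as an error, as opposed to
// transient 500s that a retry papered over.
function customerFacingErrors(
  logs: RequestLog[],
  maxRetries = 3
): RequestLog[] {
  return logs.filter(
    (r) => (r.status < 200 || r.status >= 300) && r.retries === maxRetries
  );
}
```

Counting `customerFacingErrors(logs).length` against total requests gives an impact number; counting every non-2xx just gives an error count.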
On the flip side of what you described, some errors don't have a statistic. For instance, if I force rotate everyone's password and kill logins then I might post that on the status site as well. If it's the result of a security action or vulnerability I might declare the service degraded for a period of time.
Tbh we hadn't thought about SLA violations.
For regional availability we are planning to add multi-region checks per monitor.
At the moment you can only set one region per monitor.
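For illustration, a multi-region check could be as simple as fanning the same probe out per region. This is a hypothetical sketch, not the OpenStatus implementation; all names are invented, and in production each probe would run from inside its region (e.g. on an edge runtime) rather than just being labeled with one:

```typescript
type FetchLike = (url: string) => Promise<{ ok: boolean }>;

interface CheckResult {
  region: string;
  ok: boolean;
  latencyMs: number;
}

// Run one probe and tag the result with its region label.
async function probe(
  url: string,
  region: string,
  fetchFn: FetchLike
): Promise<CheckResult> {
  const start = Date.now();
  try {
    const res = await fetchFn(url);
    return { region, ok: res.ok, latencyMs: Date.now() - start };
  } catch {
    return { region, ok: false, latencyMs: Date.now() - start };
  }
}

// One monitor, many regions: run the probes in parallel and collect
// one result per region.
function multiRegionCheck(
  url: string,
  regions: string[],
  fetchFn: FetchLike
): Promise<CheckResult[]> {
  return Promise.all(regions.map((r) => probe(url, r, fetchFn)));
}
```

The injected `fetchFn` keeps the sketch transport-agnostic; per-region results then let the status page show availability per region instead of a single global number.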
Convenience.
More companies want Datadog, etc. than want to manage Datadog, etc.
This project relies on 4 different paid services. Why?
Why do you need a SaaS to handle your auth, mailing, database, and logging?
Aren't there libraries for these things in TypeScript? Why pay for them?
In 2018-2020 my company (an FMCG company) asked me to temporarily lead the IT & product teams of 2 startups that it had invested in (as majority shareholder) a few years before. One is a telemedicine company and the other is an e-commerce company.
Both of them ran almost all of their auth, DB, etc. on SaaS products from other obscure, newly founded startups.
After a few meetings I realized that these startups were "guided" by the VC to use the services of other startups the VC had invested in, and in return those other startups would (must) use our service (telemedicine) for their employees.
That way, all of these startups can claim the monthly active users and the companies that use their products; we also get the topline revenue, and those numbers are then included in a pitch deck for the next round of investment.
To top that off, for the telemedicine company I also got a KPI to hire 200 programmers so we could include that number in the pitch deck too. In 2 years, I got 3 talented ones and fewer than 30 who could code FizzBuzz or simple CRUD (in their language of choice).
Turso: Has an insanely large free tier and means no need to run your own DB (though you can run your own SQLite locally); their free tier was even just drastically expanded.
Clerk: 5,000 free users, and not having to deal with your own authentication.
Resend: Avoids dealing with and managing mail, spam filtering, etc. I don't know if they allow just using an internal SMTP server, but it seems OK given 3,000 mails per month.
Tinybird I don't know enough about, but it also has a free plan...
So mostly I'd imagine these aren't about paying for third-party platforms; it's about offloading tasks you don't want to worry about implementing yourself, and that also gives you the ability to scale beyond the small initial deployment, at a cost.
There are hundreds of auth libraries out there that you can use. Not one of them charges you per user lol. We've been doing this for decades. Why are we now paying companies to do it for us?
The same can be said about mailing, logging, and databases. I've spent decades building web applications, and not once was it hard to implement these features using libraries.
In fact it's easier than ever with the tooling we have today.
No wonder 99% of startups are losing money and going out of business. They're giving all their money away to the few that survive lol.
I guess the TypeScript people don't appreciate frameworks like Rails, Django, and Phoenix that implement all these features for you lol.
There's absolutely no reason to require SaaS to handle database & logging, but:
1. For mail, in 2023, it's a de facto requirement for any app. Sure, you can do it yourself, but handling spam filters will be a challenge. Defaulting to SaaS on this is extremely defensible.
2. For auth, in 2023, rolling your own auth that is secure & offers decent MFA is a similarly daunting task. Would it be nice if they offered an optional local auth backend? Maybe. Would it be nicer if they offered a choice of multiple SaaS backends? Definitely. But it's ultimately pretty defensible.
3. It seems to me the DB can be local SQLite / libsqld (which looks primarily aimed at dev environments, but at least it's an option).
---
On aggregate though you're right, this does seem excessively SaaS-y.
OP, you might want to do a PR here: https://github.com/ivbeg/awesome-status-pages
Everyone else might be interested in that list of similar projects.
I think your core offering is around status tracking and stakeholder notification. However, you're also pulling in monitoring/APM by running your own status checks, for example. I would expect any paying customer to already have monitoring and alerting of some type: New Relic, DataDog, Amazon CloudWatch Synthetics, etc. Wouldn't your customers want to use their own existing metrics for SLOs, or existing alarms & alerts for incident detection? Similarly, it seems like you're implementing alerting/engagement as well. Are you asking your customers to reimplement their PagerDuty/OpsGenie/VictorOps configuration? There's a lot of organisational inertia around the business processes that define alerting & engagement. I haven't looked at user base numbers in a long time, but I would guess the vast majority of your target customers are already using one of those three.
If I were to guess, initial adoption would be aided by "ease of use", particularly integration with the customer's existing tools & processes. Then differentiation and value come from what those existing monitoring/alerting tools can't do, eg alternative data sources (APM vs RUM), automated/predefined response, approval processes, customized visibility & communication per client, etc.
disclosure: Principal at AWS. Comments are my own personal opinion, based on public information only.