Cue, an open-source data validation language

wikibob · 5 years ago

Some context:

Cue is a project originally started by Marcel van Lohuizen who previously was part of BCL (Borg Config Lang) at Google. The main use is to generate config files.

See the Kubernetes examples at: https://cuelang.org/docs/tutorials/

Here are two posts discussing the motivations for Cue over BCL/Jsonnet:

- https://github.com/cuelang/cue/issues/33#issuecomment-483615...

- https://github.com/cuelang/cue/discussions/669

A very interesting development is that Grafana appears to be adopting Cue as a first-class configuration option. See: "Bring new CUE-based config schema system to release-readiness" https://github.com/grafana/grafana/issues/33139

This could mean that a future where Grafana dashboards can be two-way synced with a git repo will eventually exist.

----

Other tools with some industry adoption in the "Infrastructure as Code" space include

- Dhall

- Jsonnet (from BCL)

- kustomize

- Helm

- kubecfg

- Tanka

- SkyCfg

- jkcfg

- Krane

- HCL (Terraform)

And two tools that fall into a separate class of enabling "Infrastructure as Software"

- Pulumi (TypeScript/Go/Python/.NET)

- CDK

sdboyer · 5 years ago

re: Grafana (i'm the author of the linked issue) - i'm quite excited, i do think there's a world of possibilities here.

Two-way sync with a git repo is one possible path, and we've talked a lot internally at GL about how to best support it. My sense is that we can do it with relatively little friction and likely will - but if you're just syncing with a git repo, there's still a lot of arbitrary, opaque repo layout decisions that still have to be made (how do you map a filesystem position for a dashboard to a position in Grafana? In a way that places the dashboards next to the systems they're intended to observe? With many teams? With many Grafana instances?) which induce new kinds of friction at scale.

Fortunately - and not mutually exclusively with the above - by building the system for schema in CUE, we've made a composable thing that we can make into larger systems. That's what we're starting to do with Polly: https://github.com/pollypkg/polly

Conveniently, my part of a GrafanaCONline talk tomorrow discusses both of these: https://grafana.com/go/grafanaconline/2021/dashboards-as-cod... :D

thelastbender12 · 5 years ago

This seems really exciting. I haven't had the chance to use Grafana yet; from the linked issue, am I understanding correctly that you'll be able to serialize dashboards to Cue schema, and hence get all the niceties of a structured representation - versioning, non-visual editing, and reproducibility?

I recall seeing another project on HN which created dashboards out of a yaml description. This seems like a fantastic idea, given that a lot of business panels and dashboard apps can be implemented with a limited set of UI interactions.

tamalsaha001 · 5 years ago

It might sound a bit pedantic, but kustomize strictly avoids the "Infrastructure as Code" space and stays in the "Infrastructure as Data" space. The main difference is that since it just deals with "data", you can build any higher level tooling on this. One of the major proponents of this idea is Brian Grant from Google. He tweets about this from time to time. Here is a recent one: https://twitter.com/bgrant0607/status/1404461906186833927

chalst · 5 years ago

Is this distinction really about whether the customisation language is declarative? It seems to me that Dhall has the advantages Brian Grant attributes to "Infrastructure as Data", although it is an executable specification.

bookofsand · 5 years ago

Thanks for the "why cue" posts. The two key points appear to be inheritance vs. unification and nothing vs. typed. Somehow I'm unable to grok why unification is better than inheritance. Going a bit deeper:

* "Inheritance, is not commutative and idempotent in the general case"

* "A value is always final in CUE, it can only be made more specific."

From an engineering perspective, the latter is definitely more appealing. But I lack well-articulated stories to understand how inheritance falls short, and how graph unification fares better. I wonder if there is a simple concrete example somewhere that contrasts the not-idempotent inheritance approach with the graph unification approach.

curryst · 5 years ago

I believe they're discussing commutation and idempotency in the sense of types, rather than the sense of values.

Inheritance allows you to override properties/attributes. If you inherit from 2 classes that both specify the same attribute/property, but with different types for the same attribute, one of them takes precedence and overrides the other. A inherits from B inherits from C is not the same as A inherits from C inherits from B if C says attribute X is a string and B says attribute X is an int.

From my understanding, the equivalent graph unification is invalid. If type A is a unification of type B and C, then B and C cannot have any overlap. Each property is either a member of B or a member of C, but never both. It's commutative because A = B & C (A is the unification of B and C) is the same as A = C & B (A is the unification of C and B). If x is a member of B, and I access A.x, I will always end up accessing B. With inheritance, there can be a B.x and a C.x. Which one I end up accessing depends on which one is A's parent.

Inheritance is not idempotent because if A inherits from B inherits from C, then A is implicitly also B and C. However, A can override B's and C's behavior, so I can't trust that calling C.x will always return the same value. It might return the type C has for that attribute, the type B has for that attribute, or the type A has for that attribute. You can prevent overriding the types in children, but at that point you've basically built graph unification.

To give a concrete example, Python allows inheritance. If we are provided with this:

    import datetime

    class MyCar:
        # Epoch time for when the car was made
        created_at: int

    class MyCarV2(MyCar):
        # Time it was created in RFC3339 format
        created_at: str

    class MyCarV3(MyCarV2):
        # Using an actual datetime object
        created_at: datetime.datetime

And we have a function like this:

    def time_since_created(car: MyCar) -> datetime.timedelta:
        ...

That function has no idea what the type of car.created_at will be. Mypy will complain at you because it's bad practice, but it's valid inheritance. Even if they all start with the same conceptual time, MyCar.created_at, MyCarV2.created_at and MyCarV3.created_at return different types, despite all supposedly being valid instances of MyCar.

Graph unification forces you to pick a single type for each attribute of a single type. Rather than having 3 types that behave differently, graph unification forces you to condense them into one:

    import datetime
    import typing
    class MyCar:
        created_at: typing.Union[int, str, datetime.datetime]

That time_since_created function now knows exactly what type created_at is. Nothing else can change the type of created_at. If you need to add another possible type you have to either add it to the typing.Union, or create a new class. You can't create a subclass of MyCar with a different type for created_at.
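
A minimal sketch of the override-vs-unification contrast, using plain Python dicts rather than CUE itself. The helper names `override` and `unify` and the config fields are hypothetical, and this `unify` only handles flat dicts of concrete values; real CUE unification also works on types and constraints.

```python
def override(base: dict, child: dict) -> dict:
    """Inheritance-style merge: on a clash, the child silently wins."""
    return {**base, **child}


def unify(a: dict, b: dict) -> dict:
    """Unification-style merge: shared fields must agree, otherwise it is an error."""
    result = dict(a)
    for key, value in b.items():
        if key in result and result[key] != value:
            raise ValueError(f"conflict on {key!r}: {result[key]!r} vs {value!r}")
        result[key] = value
    return result


b = {"port": 8080, "proto": "http"}
c = {"port": 9090}
d = {"tls": True}

# Override depends on order, so the result changes with who counts as the parent.
print(override(b, c) == override(c, b))

# Unification gives the same result in either order, and repeating it changes nothing.
print(unify(b, d) == unify(d, b))
print(unify(b, b) == b)

# Disagreeing values are rejected instead of one silently replacing the other.
try:
    unify(b, c)
except ValueError as err:
    print("rejected:", err)
```

The point of the sketch is only the shape of the two operations: with `override` you have to know the merge order to know the final value, while with `unify` any order yields the same answer or an explicit conflict.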

purpleidea · 5 years ago

cough how can you leave out https://github.com/purpleidea/mgmt/ =D It's in golang, and is the only reactive DSL.

dolmen · 5 years ago

The language is named Go.

gervwyk · 5 years ago

This is very interesting! Working through the docs now and I'm enjoying the schema, and I've come up with similar ideas regarding data validation / generation in the past. It's nice to find a project like this! Thanks!

In most projects data validation becomes problematic. In most cases the schema could be far more precisely defined than what type definitions offer. This allows for test cases to make sure data fits the model.

We've also been creating a DSL to build web apps. Check out Lowdefy [0] - I'm trying to come up with an "Infrastructure as Code" word for Lowdefy. "UI as config" is the closest fit, but not sure...

[0] - https://github.com/lowdefy/lowdefy

shykes · 5 years ago

If you’ve ever had to wrangle yaml configuration files… do yourself a favor and learn Cue. It’s still young and the website can seem intimidating; but it’s simpler than it looks, and the language is unbelievably powerful. There simply isn’t anything else like it. In my opinion it’s in a league of its own compared to other configuration languages like HCL, Jsonnet, Dhall, Starlark etc. Marcel, the creator of Cue, is basically the godfather of configuration languages - most of the state of the art can be traced back to his work at Google. Despite his deep knowledge of the subject and unparalleled experience, he is modest, pragmatic and responsive to questions and feedback. The momentum behind Cue reminds me of Go in its early days.

I’ve been using Cue for over a year now, using it as the foundation for a new project, and will gladly answer questions about our experience.

throwaway894345 · 5 years ago

I tried dabbling with Cue, but it doesn't seem to solve the problem that I care about, which is that I have a whole bunch of configs that vary only slightly and I want to DRY them up.

For example, for any given application we have several fixed environments--dev, staging, prod--as well as "on demand" environments for things like pull requests or individual developer environments. The configs for these environments are almost the same, but they vary based on a handful of parameters. I want to be able to write a "generic environment" module for each application and then parameterize it accordingly for each environment.

Cue doesn't seem to care much about this problem, but rather it's just trying to make sure your data is type checked. It seems more like an advanced JSONSchema rather than a typed Starlark. I think the latter would be more powerful (albeit Cue's type system is more powerful than an ordinary generic type system with things like range types).

Cue almost has an answer to the DRY problem, but you can't quite emulate functions as far as I can tell (due, I think, to shadowing problems). I wonder what people who are convinced that Cue is the future would say to this? Am I just thinking about the problem wrong?

sdboyer · 5 years ago

I'd say all of these problems have answers in CUE.

> I want to be able to write a "generic environment" module for each application and then parameterize it accordingly for each environment.

This is pretty solidly in the target use case range, i'd say - managing variations of the same "object type" over some dimension is a lot of what's targeted by the way that CUE treats directory hierarchies when loading files: https://cuelang.org/docs/concepts/packages/#instances

The main thing you have to consider in designing a layout is that you have to take a compositional approach to how you define individual config instances. That is, you can't start from prod's config, then override a value or two for staging.

If i were to do it - i have not, this is not how i currently use CUE - my first approach would probably be by defining defaults at the "policy" level (per the above link), which effectively allows you to get exactly one "override"-ish behavior.

Lots of possible approaches to this, though.
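
As a rough illustration of that "exactly one override-ish behavior", here is a minimal Python sketch rather than real CUE. `Default`, `unify_field`, `unify` and `resolve` are hypothetical helpers, the config fields are made up, and CUE's actual default mechanism is richer than this.

```python
class Default:
    """Marks a value as a fallback that a concrete value may replace."""

    def __init__(self, value):
        self.value = value


def unify_field(a, b):
    # A default yields to a concrete value; otherwise both sides must agree.
    if isinstance(a, Default) and not isinstance(b, Default):
        return b
    if isinstance(b, Default) and not isinstance(a, Default):
        return a
    left = a.value if isinstance(a, Default) else a
    right = b.value if isinstance(b, Default) else b
    if left != right:
        raise ValueError(f"conflict: {left!r} vs {right!r}")
    return a


def unify(a: dict, b: dict) -> dict:
    out = dict(a)
    for key, value in b.items():
        out[key] = unify_field(out[key], value) if key in out else value
    return out


def resolve(config: dict) -> dict:
    """Replace any default that was never made concrete with its fallback value."""
    return {k: v.value if isinstance(v, Default) else v for k, v in config.items()}


policy = {"replicas": Default(1), "image": "app:1.4"}  # shared by every environment
staging = unify(policy, {"replicas": 2})               # makes the default concrete
dev = unify(policy, {"debug": True})                   # keeps the default

print(resolve(staging))
print(resolve(dev))

# Once a value is concrete it is final: a second, different value is a conflict.
try:
    unify(staging, {"replicas": 3})
except ValueError as err:
    print("rejected:", err)
```

Each environment is composed from the shared policy plus its own fields, rather than derived from another environment's finished config.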

> but you can't quite emulate functions as far as I can tell

Function-like capability is present, just in a form that's less familiar. I think of them as "function structs." This post has a bunch of examples https://github.com/cuelang/cue/issues/139#issuecomment-55677.... It seems there's a plan to add a more comfortable notation (https://github.com/cuelang/cue/issues/943), but it's fundamentally possible now.

dqpb · 5 years ago

Cue solves the DRY problem using a lattice data structure instead of inheritance. This is precisely why cue is better than everything else.

Kinrany · 5 years ago

> I want to be able to write a "generic environment" module for each application and then parameterize it accordingly for each environment.

This specifically looks like you want inheritance, which Cue eschews.

You can set up a generic config with default values for everything, and then have more specific configs that override the defaults.

A concrete example would help figure out if Cue can do what you want.

bitfield · 5 years ago

A quick overview of CUE for those who are wondering what all the fuss is about: https://bitfieldconsulting.com/golang/cuelang-exciting

wstrange · 5 years ago

This is a fantastic read - well done.

I'd recommend starting with this article as it nicely lays out the motivations for Cue.

debarshri · 5 years ago

I met Marcel van Lohuizen when he was on the board of my previous company. He is a passionate techie and a down-to-earth guy. He was working on cuelang at the time and had not released it yet. After he gave us a presentation on Cue, my first thought was that it is not easy for beginners to grasp, but then the language is not meant for beginners. My second thought, which was completely whack, was that maybe you could use it as an add-on for Protobufs, since the schema definitions in Cuelang have validations built into them, which might remove boilerplate validation code in gRPC services.

hawaiianSpork · 5 years ago

If you are looking to do data validation from the JVM, you may try Baleen (written in Kotlin): https://github.com/ShopRunner/baleen/

I'm one of the contributors. We created a DSL in the language to describe the data and create tests. You can then use that data description to validate against json, csv, avro... One of the neat things we came up with was the concept of a data trace which is like a stack trace but is a path through the data to a particular error.
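
To make the data trace idea concrete, here is a small Python sketch of a validator that reports the path through the data to each error. This is not Baleen's API (Baleen is Kotlin); `validate`, the schema shape and the sample data are all hypothetical and only illustrate the concept.

```python
from typing import Any, List, Tuple


def validate(data: Any, schema: Any, path: Tuple = ()) -> List[Tuple[Tuple, str]]:
    """Return (path, message) pairs; the path is the trace through the data.

    A schema is a dict of field schemas, a one-element list holding the item
    schema, or a predicate applied to a leaf value.
    """
    if isinstance(schema, dict):
        if not isinstance(data, dict):
            return [(path, "expected an object")]
        errors = []
        for key, sub in schema.items():
            if key not in data:
                errors.append((path + (key,), "missing field"))
            else:
                errors.extend(validate(data[key], sub, path + (key,)))
        return errors
    if isinstance(schema, list):
        if not isinstance(data, list):
            return [(path, "expected a list")]
        errors = []
        for index, item in enumerate(data):
            errors.extend(validate(item, schema[0], path + (index,)))
        return errors
    return [] if schema(data) else [(path, f"invalid value {data!r}")]


schema = {
    "owner": lambda v: isinstance(v, str),
    "dogs": [{
        "name": lambda v: isinstance(v, str),
        "legs": lambda v: isinstance(v, int) and 0 <= v <= 4,
    }],
}
data = {"owner": "sam", "dogs": [{"name": "rex", "legs": 4}, {"name": "fido", "legs": 5}]}

for path, message in validate(data, schema):
    print(" -> ".join(map(str, path)), ":", message)
```

Carrying the path down through the recursion is what turns a bare "invalid value" into something you can locate in a large document.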

carlosf · 5 years ago

At this point one might consider using a real language and common software practices for type checking, extending, modularization, testing, etc... Instead of building an ecosystem just to keep Infrastructure as Yaml sane.

My experience with Pulumi and AWS CDK has been absolutely brilliant in this regard; hopefully good DevOps/SRE/WhateverNewTerm practices and patterns will resemble good software development practices in the future.

dqpb · 5 years ago

Cue has a unique lattice type system that allows you to refine a property from type->constraint->value, but does not allow you to override an existing value (or change it in any way that conflicts with the existing type/constraints).

In my view, this is the insight and value proposition that sets cue apart from everything else, including general programming languages.

Inheritance + property overriding is the source of most problems in configuration because you can never know if a value is the source of truth.
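
A toy Python sketch of that type->constraint->value refinement, assuming a single integer field modelled as an interval. `IntRange` and `exactly` are hypothetical names, and CUE's real lattice covers far more than integer ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntRange:
    """All ints in [lo, hi]; a concrete value is the range where lo == hi."""

    lo: float = float("-inf")
    hi: float = float("inf")

    def unify(self, other: "IntRange") -> "IntRange":
        # Unification is the intersection: it can only narrow, never widen.
        lo, hi = max(self.lo, other.lo), min(self.hi, other.hi)
        if lo > hi:
            raise ValueError(f"conflict: {self} and {other} share no value")
        return IntRange(lo, hi)


def exactly(n: int) -> IntRange:
    return IntRange(n, n)


port = IntRange()                         # type: any int
port = port.unify(IntRange(1024, 65535))  # constraint: unprivileged ports only
port = port.unify(exactly(8080))          # value: one concrete port
print(port)

# Stating the same value again is fine; stating a different one is an error.
port.unify(exactly(8080))
try:
    port.unify(exactly(9090))
except ValueError as err:
    print("rejected:", err)
```

Because every step is an intersection, whatever value you read at the end is consistent with every place that mentioned the field, which is the "source of truth" property that overriding loses.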

sdboyer · 5 years ago

Cannot +1 this hard enough. It is the kernel from which all other useful things flow.

sdboyer · 5 years ago

> real language

In what sense is CUE not a real language?

Micoloth · 5 years ago

I think parent means a General Purpose language, i.e. capable of computations.

Personally, on one hand I know allowing computations into configuration immediately destroys any hope of having a tidy, rational schema in real-world projects.

On the other hand though, i do believe configuration and code should be built with related tools, possibly the same tool, or at least tools using the same syntax!

(a bit like how the json syntax is the same as the Python dict syntax, except that this is a terrible example, so poorly thought out that it does more harm than good)

This unlocks a much greater degree of freedom and power than all the gluing together technologies that we have to do...

heywherelogingo · 5 years ago

I think he means one language that includes config, instead of yaml plus yaml-taming ecosystem.

andix · 5 years ago

It looks quite cool. I think it would be really useful if you have a lot of integrations into different programming languages, frameworks, and maybe even SQL servers.

So you could do data validation on the frontend, backend and the database server based on the same definitions.

It would save us a lot of bugs caused by different opinions of valid data in different layers of software.

Kinrany · 5 years ago

I really like the language and look forward to seeing more adoption.

Is there a second implementation in another programming language?

As I understand it, it would make sense to have libraries for processing Cue data in every language.

I'm a bit concerned about Cue relying on Go too much. A data validation language should be independent of the implementation language.

rjrodger · 5 years ago

Started working on a JS version a few weeks ago [1]. Even with 20% of the features it’s already so useful we’re building systems with it. And not just config - model all the things!

Overrides and inheritance are a world of pain. Unification and commutative operations restore sanity to the actual work of coding with a domain representation language because WYSIWYG. And you get type safety for your domain model.

The project is still at the “Read the Source, Luke” stage so caveat emptor until we get a respectable release out.

[1] https://github.com/rjrodger/aontu

sdboyer · 5 years ago

Very excited to see someone doing this! Right now, Grafana is [planned to be] relying on an anemic CUE->Typescript translator for getting its schema to the frontend - https://github.com/sdboyer/cuetsy. (Somebody also pointed me to Project Cambria recently, which could be an interesting compilation target for what we have https://www.inkandswitch.com/cambria.html)

Being able to work with CUE natively in TS, though, would be a huge gamechanger for what we can do with CUE in Grafana

wdb · 5 years ago

Nice, I was looking for a Javascript version. I will check it out. Regarding "Read the Source, Luke" you are in good company with Apple and its Swift ABI :)

d0100 · 5 years ago

Does this make it possible to add syntax highlighting and validation in an editor like monaco?