I think it would be useful to differentiate more clearly between what is offered by Python's type system, and what is offered by Pydantic.
That is, you can approximate Rust's enum (sum type) with pure Python using some combination of Literal, Enum, Union and dataclasses. For example (more here[1]):
from dataclasses import dataclass

@dataclass
class Foo: ...

@dataclass
class Bar: ...

Frobulated = Foo | Bar
Pydantic adds de/ser, but if you're not doing that then you can get very far without it. (And even if you are, there are lighter-weight options that play with dataclasses like cattrs, pyserde, dataclasses-json).
[1] https://threeofwands.com/algebraic-data-types-in-python/
Yep, someone brought this up on another discussion forum. The post was intended to be explicitly about accomplishing the ser/de half as well, hence the emphasis on Pydantic :-)
(Python’s annotated types are very powerful, and you can do this and more with them if you don’t immediately need ser/de! But they also have limitations, e.g. I believe Union wasn’t allowed in isinstance checks or matching until a recent version.)
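For instance, with the Foo/Bar dataclasses from the snippet upthread, PEP 604 unions (the X | Y form) have been accepted by isinstance since Python 3.10, and class patterns in match cover the variants (a small illustrative sketch, not from the original comment):

def describe(x: Frobulated) -> str:
    assert isinstance(x, Foo | Bar)  # PEP 604 union accepted by isinstance on 3.10+
    match x:
        case Foo():
            return "it's a Foo"
        case Bar():
            return "it's a Bar"
    raise AssertionError("unreachable")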
If you're looking for serialization/deserialization, you might consider confactory [1]. I created it to be a factory for objects defined in configs. It builds the Python objects without much effort from the user, simply by making use of type annotations (though you can define your own serializers and deserializers).
It also supports complex structures like union types, lists, etc. I used it to create cresset [2], a package that allows building PyTorch models directly from config files.
[1]: https://pypi.org/project/confactory/
[2]: https://pypi.org/project/cresset/
I think it’s quite useful to separate ser/de, structural validation, and semantic validation. This is where I struggle with a library like ruamel.yaml, which runs deserialization and structural validation together, or Pydantic, which runs structural and semantic validation together. It’s not hard to write a Python type annotation for what you get from json.loads, and it’s also not hard to write a recursive function with a 200-line match statement that reflects on type annotations to convert that into TypedDicts, dataclasses, and so forth. But semantic validation is a whole other problem, one that tends to be so domain-specific it’s better deferred. Not that you shouldn’t do it, but that it belongs in its own data-processing layer. This also lets you be specific about what’s wrong with a piece of input. Bad JSON? A list where a dictionary was expected? An end timestamp that’s before the start? Sure, check each of these, and in context make invalid state unrepresentable, but invalid state after json.loads is very different from invalid state after validating your timestamps.
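For illustration, a toy version of such a reflection-based converter (structural conversion only; my own sketch, with semantic checks left for a later layer):

from dataclasses import dataclass, fields, is_dataclass
from typing import get_args, get_origin

def structure(value, typ):
    # Recursively convert json.loads output into dataclasses, reflecting on annotations.
    if is_dataclass(typ):
        if not isinstance(value, dict):
            raise TypeError(f"expected dict for {typ.__name__}, got {type(value).__name__}")
        return typ(**{f.name: structure(value[f.name], f.type) for f in fields(typ)})
    if get_origin(typ) is list:
        (item_type,) = get_args(typ)
        return [structure(v, item_type) for v in value]
    if not isinstance(value, typ):
        raise TypeError(f"expected {typ.__name__}, got {type(value).__name__}")
    return value

@dataclass
class Window:
    start: int
    end: int

w = structure({"start": 3, "end": 9}, Window)
# Whether end >= start is a semantic question, checked elsewhere.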
Good point, but that's not always desirable. If you have strict type-checking and _aren't_ doing ser/de, it's likely not necessary (e.g. Rust doesn't do runtime checks).
I needed to reflect Rust enums and went a bit further with that approach. All variants are wrapped in a decorated class, where the decorator automatically computes the union type and adds de/serialization hooks for `cattrs`.
@enumclass
class MyEnum:
    class UnitLikeVariant(Variant0Arg): ...
    class TupleLikeVariant(Variant2Arg[int, str]): ...

    @dataclass
    class StructLikeVariant:
        foo: float
        bar: int

    # The following class variable is automatically generated:
    #
    # type = UnitLikeVariant | TupleLikeVariant | StructLikeVariant
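The decorator itself isn't shown above; a rough sketch of how it might compute that union (the cattrs hooks are omitted, and this is my guess at the approach rather than the commenter's actual code):

import inspect
from typing import Union

def enumclass(cls):
    # Collect the classes defined inside the decorated class body...
    variants = [
        member for _, member in inspect.getmembers(cls, inspect.isclass)
        if member.__qualname__.startswith(cls.__qualname__ + ".")
    ]
    # ...and expose their union as a class attribute.
    cls.type = Union[tuple(variants)]
    return cls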
Everyone is offering their suggestions, but no one has posted about marshmallow which handles everything out of the box including serialization and de-serialization. It's the perfect balance of dataclasses, (de)serialization, and lack of useless features and umpteen hacks that libraries like Pydantic and FastAPI have.
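For anyone who hasn't used it, a minimal marshmallow example (class and field names are just illustrative):

from dataclasses import dataclass
from marshmallow import Schema, fields, post_load

@dataclass
class Point:
    x: int
    y: int

class PointSchema(Schema):
    x = fields.Int(required=True)
    y = fields.Int(required=True)

    @post_load
    def make_point(self, data, **kwargs):
        return Point(**data)

point = PointSchema().load({"x": 1, "y": 2})   # Point(x=1, y=2), or raises ValidationError
payload = PointSchema().dump(point)            # {"x": 1, "y": 2}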
The problem I see with it is this: Now, instead of understanding Python, which is straightforward, you have to understand a bunch about Pydantic and type unions. In a large shop of Python programmers, I would expect many would not follow most of this.
Essentially, if this is a feature you must have, Python seems like the wrong language. Maybe if you only need it in spots this makes sense...
I think an important piece of context here is that this is not useful for non-ser/de patterns in Python: if all you have is pure Python types that don't need to cross serialization boundaries, then you can do all of this in pure Python (and refine it with Python's very mature type annotations).
In practice, however, Pydantic is one of the most popular packages/frameworks in Python because people do in fact need this kind of complexity. In particular, it makes wrangling complicated object hierarchies that come from REST APIs much easier and less error-prone.
> instead of understanding Python, which is straightforward, you have to understand a bunch about Pydantic and type unions.
This is like saying "instead of understanding Python, you have to understand a bunch about SQLAlchemy and ORMs" or "instead of understanding Python, you need to understand gRPC and data streaming."
Ultimately every library you add to a project is cognitive overhead. Major frameworks or tools like sqlalchemy, Flask/Django, Pandas, etc. have a lot of cognitive overhead. The engineering decision is whether that cognitive overhead is worth what the library provides.
The measurement of worth is really dependent on your use case. If your use for Python is data scientists doing iterative, interactive work in Jupyter notebooks, Pydantic is probably not worth it. If you're building a robust data pipeline or backend web app with high availability requirements but dealing with suspect data parsing, Pydantic might be worth it.
You're not wrong, but the distinction here that I was responding to was the idea of needing to use Pydantic routinely for typechecking. Libraries that you have to know might as well be language features.
The phrasing of "The engineering decision" in your reply is telling -- you are coming at it as an engineer. But I'm looking at the population of Python programmers, which extends far beyond software engineers. The more such people have to learn, the more problematic the language becomes. Python succeeded despite not being a statically compiled language with clear typechecking because there is an audience for which those aren't the critical factors.
As I said in another response, it reminds me of what happened to Java. Maybe that's just my own quirk, but none of these changes are free.
> Ultimately every library you add to a project is cognitive overhead. Major frameworks or tools like sqlalchemy, Flask/Django, Pandas, etc. have a lot of cognitive overhead. The engineering decision is whether that cognitive overhead is worth what the library provides.
IMO a library that provides regular functions and values that follow the rules of the language adds zero cognitive overhead. Frameworks that change/break the rules, that let you do things that you can't normally do with regular values, or don't let you do things that you normally could do, are the ones that add overhead, and it sounds like Pydantic is more in that category.
Pydantic is truly a godsend to the Python ecosystem. It is a full implementation of "parse don't validate" and does so using Python's existing type declarations. It uses the same forms as dataclasses, SQLAlchemy, and Django that have been part of Python forever so most Python programmers are familiar with it. And the reason you reach for it is that it eliminates whole classes of errors when the boundary between your program and the outside world is only via .model_validate() and .model_dump(). The outside world including 3rd-party API calls. The data either comes back to you exactly like you expect it to, or it errs. It's hundreds of tests that you simply don't have to write.
In the same way that SQLite bills itself as the better alternative to fopen(), Pydantic is the better alternative to json.loads()/json.dumps().
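A minimal sketch of that boundary with the Pydantic v2 API (model and field names are purely for illustration):

from typing import Literal, Union
from pydantic import BaseModel, Field

class Card(BaseModel):
    kind: Literal["card"] = "card"
    last4: str

class BankTransfer(BaseModel):
    kind: Literal["bank"] = "bank"
    iban: str

class Payment(BaseModel):
    method: Union[Card, BankTransfer] = Field(discriminator="kind")

# Everything from the outside world comes in through model_validate...
payment = Payment.model_validate({"method": {"kind": "card", "last4": "4242"}})
assert isinstance(payment.method, Card)
# ...and goes back out through model_dump.
wire = payment.model_dump()   # {'method': {'kind': 'card', 'last4': '4242'}}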
I don't think you are wrong and I have at times missed having such an option. But... I saw Java go down this path of cool features that you needed to learn, on top of the basic language, and eventually it took Java to an environment where learning the toolset and environment was complex, and vastly changed the calculus of how approachable the language was. In my mind, anyway, it went from being a useful if incomplete tool to being a more complete language that was not really worth messing with unless you were going to make a big commitment.
Every step that takes Python in that direction is a mistake, because if we need to make a huge commitment, Python probably isn't the right language. A large part of the appeal of Python is that it is easy to learn, easy to bring devs up to speed on if they don't know it, easy to debug and understand. That's why people use it despite its performance shortcomings, despite its concurrency issues, etc. (That and the benefit of a large and fairly high quality library.)
As a matter of fact this would not be a "major undertaking" in Python, unless your definition of the term is majorly loose:
from dataclasses import dataclass
from enum import Enum, auto
from typing import Union

@dataclass
class InnerCircle:
    s: str

@dataclass
class NPC:
    s: str

@dataclass
class Dissenter:
    s: str

Committer = Union[InnerCircle, NPC, Dissenter]

class CocReaction(Enum):
    DoNothing = auto()
    ThreeMonthsWithoutHumiliation = auto()
    PublicDefamation = auto()

def adjudicate(c: Committer) -> CocReaction:
    match c:
        case InnerCircle():
            return CocReaction.DoNothing
        case NPC():
            return CocReaction.ThreeMonthsWithoutHumiliation
        case Dissenter():
            return CocReaction.PublicDefamation
Although in reality you'd likely model Committer as a product of a status and name, and adjudicate as a map of status to reaction, unless there are other strong reasons to make Committer a sum.
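Roughly, reusing CocReaction from the snippet above (one possible shape, not the only one):

from dataclasses import dataclass
from enum import Enum, auto

class Status(Enum):
    InnerCircle = auto()
    NPC = auto()
    Dissenter = auto()

@dataclass
class Committer:
    name: str
    status: Status

# CocReaction as defined in the snippet above.
REACTIONS = {
    Status.InnerCircle: CocReaction.DoNothing,
    Status.NPC: CocReaction.ThreeMonthsWithoutHumiliation,
    Status.Dissenter: CocReaction.PublicDefamation,
}

def adjudicate(c: Committer) -> CocReaction:
    return REACTIONS[c.status]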
I think it is normal to know the popular libraries of a language. For Python, Django, DRF, FastAPI, Pydantic, or Jinja are very common.
There are some people resisting type checks in Python, but I think fewer and fewer. I don't think people refusing to learn basic concepts and libraries is a reason to not use something.
Also, I am not a big fan of not doing something useful because we need to do a bit of learning. It seems like a variant of "we have always done it this way". Plus, it is a strawman attributed to Python developers, IMO.
As it just so happens, I was struggling with this in Python recently and this post describes a better solution than what I came up with.
> Essentially, if this is a feature you must have, Python seems like the wrong language.
While I don't disagree in the absolute sense, there are constraints. You can't just switch language or change the problem you're solving. If you have the need for more type safety, then this is a price worth paying.
Okay? Programmers have to understand lots of things that aren't just the bare basics of the language they're using. When did we decide that all software developers are helpless? When can we get back to expecting experts to know things?
When? Probably around the time the people hoping they won't need to know anything, because AI will write what they want, discover that desire without knowledge doesn't work so well?
One caveat of the tip in the "Deduplicating shared variant state" section about including an underspecified discriminator field in the base class is that it doesn't play well if you're using Literals instead of Enums as the discriminator type. Python does not allow you to narrow a literal type of a field in a subclass, so the following doesn't type check:
from typing import Literal

class _FrobulatedBase:
    kind: Literal['foo', 'bar']
    value: str

class Foo(_FrobulatedBase):
    kind: Literal['foo'] = 'foo'
    foo_specific: int

class Bar(_FrobulatedBase):
    kind: Literal['bar'] = 'bar'
    bar_specific: bool

"kind" overrides symbol of same name in class "_FrobulatedBase"
  Variable is mutable so its type is invariant
  Override type "Literal['foo']" is not the same as base type "Literal['foo', 'bar']"
> it doesn't play well if you're using Literals instead of Enums as the discriminator type
The original example code with Enums doesn't type-check either, and for the same reason:
If the type checker allowed that, someone could take an object of type Foo, assign it to a variable of type _FrobulatedBase, then use that variable to modify the kind field to 'bar' and now you have an illegal Foo with kind 'bar'.
However, I think that's possibly a bug :-) -- I agree that narrowing a literal via subclassing is unsound. That's why the example in the blog used `str` for the superclass, not the closure of all `Literal` variants.
(I use this pattern pretty extensively in Python codebases that are typechecked with mypy, and I haven't run into many issues with mypy failing to understand the variant shapes -- the exception to this so far has been with `RootModel`, where mypy has needed Pydantic's mypy plugin[2] to understand the relationship between the "root" type and its underlying union. But it's possible that this is essentially unsound as well.)
[1]: https://mypy-play.net/?mypy=latest&python=3.12&gist=f35da62e...
[2]: https://docs.pydantic.dev/latest/integrations/mypy/
Using str in the superclass is equally unsound and also doesn't type-check. There's no good way to do it, as the discriminator type is by definition disjoint between all kinds.
Something I've wondered of late. I keep seeing these articles pop up and they're trying to recreate ADTs for Python in the manner of Rust. But there's a long history of ADTs in other languages. For instance we don't see threads on recreating Haskell's ADT structures in Python.
Is this an artifact of Rust being the hype right now, especially on HN? As in, the typical reader is more familiar with Rust than Haskell, and thus "I want to do what I'm used to in Rust in Python" is more likely to resonate than "I want to do what I'm used to in Haskell in Python"?
At the end of the day it doesn't *really* matter as the underlying construct being modeled is the same. It's the translation layer that I'm wondering about.
I think so, in the sense that Rust has successfully translated ADTs and other PLT-laden concepts from SML/Haskell into syntax that a large base of engineers finds intuitive. Whether or not that’s hype is a value judgement, but that is the reason I picked it for the example snippet: I figured more people would “get” it with less explanation required :-)
Apologies for my meta-meta-comment :) I've been writing code for ~30 years in various languages, and today my brain can't compute how people find any syntax other than this more intuitive:
data Thing
  = ThingA Int
  | ThingB String Bool
  | ThingC
To me, the above syntax takes away all the noise and just states what needs to be stated.
Got it. It makes sense and was what I figured was the case. I find it interesting as a sign of the times, watching the evolution of what the "base language" is in threads like this over time. I mentioned in another comment that several years ago it'd have been Haskell or Scala. If one went back further (before my time!) it'd probably have been OCaml or something.
I think "hype" has some connotations that I wouldn't necessarily agree with, and I don't think it's as much "on HN" as "people who write Python," but I would agree that I would expect at this point more Python folks to be familiar with Rust than Haskell, and so that to be the reason, yes.
The reason I said hype is that it's a cycle here. If you go back 10 years every example *would* have been in Haskell. Or perhaps Scala. They were the cool languages of the era. And the topics here painted a picture that their use in the broader world was more common than they really were. And I say that as someone who used both Haskell & Scala in my day job at the time. HN would have you believe that I was the norm, but I very much was not.
That's not to say it's bad, or a problem. If it gets more people into these concepts that's great.
It is quite common to see people in Rust circles mention Rust as being innovative for feature XYZ that was initially in an ML variant, Ada, Eiffel, ....
I would say familiarity, and lack of exposure to programming languages in general.
Nowhere in this post or in any Rust community post I'm aware of does anybody claim that sum types (or product types, or affine/linear types, etc.) are a Rust novelty.
As a stretch, I've seen Rust content where people claim that Rust has successfully popularized a handful of relatively obscure PLT concepts. But this is a much, much weaker claim than Rust innovating or inventing them outright, and it's one that's largely supported by the size of the Rust community versus the size of Haskell or even the largest ML variant communities.
(I say this as someone who wrote OCaml for a handful of years before I touched Rust.)
Is there any reason why you've singled out Rust as particularly notable here and not any of the many other languages with them? OCaml, Elm, F#, Scala, I think more recent versions of Java, Kotlin, Nim, TypeScript, and Swift all support ADTs. Python already supports them, albeit with very little runtime support. Rust doesn't particularly stand out in such a broad field of languages. They're so useful a language needs a good reason these days to not support them.
FWIW I seem to often find myself reaching for Haskell-isms when writing TypeScript or Scala. And I’ve never actually written production Haskell code! But so many concepts like this just map nicely. “Parse, don’t validate”, “make illegal states unrepresentable”, etc - all those patterns.
typedload does this without needing to pass a "discriminator" parameter.
Just having each type define the same field as a Literal with a different value is sufficient.
I've also implemented an algorithm to inspect the data and find the type directly from the literal field, to avoid having to try multiple types when loading a union. Pydantic later implemented the same strategy.
typedload is faster than Pydantic at loading tagged unions. It is written in pure Python.
edit: Also, typedload just uses completely regular dataclasses or attrs. No need for all those different BaseModel, RootModel and understanding when to use them.
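Something like this, if I have the typedload API right (the type names are illustrative):

from dataclasses import dataclass
from typing import Literal, Union
import typedload

@dataclass
class Circle:
    kind: Literal["circle"]
    radius: float

@dataclass
class Square:
    kind: Literal["square"]
    side: float

shape = typedload.load({"kind": "circle", "radius": 2.0}, Union[Circle, Square])
# shape == Circle(kind='circle', radius=2.0)
data = typedload.dump(shape)   # back to a plain dict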
I know that Foo and Frobulator and so on have history in code examples, but I personally find examples with them require more careful reading than examples built on real concepts.
Something I've learned is that in general, people find it easier to follow concrete examples than abstract ones.
I agree with you that the article would have been improved if they'd used real-world examples, e.g. a ContactMethod type that has Address or PhoneNumber or something like that.
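e.g. something along these lines (names purely illustrative):

from dataclasses import dataclass

@dataclass
class Address:
    street: str
    city: str

@dataclass
class PhoneNumber:
    number: str

ContactMethod = Address | PhoneNumber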
It's a shame there's so many different names for a set of very related (or identical?) concepts. For example wikipedia says "tagged union" is also known as "variant, variant record, choice type, discriminated union, disjoint union, sum type, or coproduct". [https://en.wikipedia.org/wiki/Tagged_union]
Also I’d add msgspec to your list at the end. Lightweight and fast, handles validation during decoding.
Do you have anything public that elaborates on this?
If you don’t care about types and just want ser/de that’s great, but I think it’s clearly on topic here to care about types.
It was also an order of magnitude slower than other libraries, and at the time all these libraries were much slower.