Don't let dicts spoil your code (2020)

The article is not explaining the point, which I believe is: type your dicts if you want to provide strict guarantees to your downstream about data shape.

If you know precisely what the data is used for - great, go ahead - type system is your friend.

If you don't know how the data should be used, it's often a different story. Wrapping data in hand-typed classes is a terrible idea in typical data engineering scenarios, where there might be hundreds of these API endpoints, which might also change as the upstream sees fit. A perfect way to piss off your downstream users is to keep telling them "sorry, the data is not available because I overspecified the data type and now it failed on TypeError again". Usually the downstream is the domain expert: they know which fields should be used, but they don't know which ones until they start using the data. Typically the best way is to pass ALL the upstream data down, materialize extra fields and NOT modify any existing field names, even when you think you're super smart and know better than the domain experts. Too often a "smart" engineer thought he knew better and included only some fields, only for it to be realized later that the data source contained many more gold nuggets, and it was never documented that these were cleverly dropped.
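
A minimal Python sketch of the "pass everything down, only add fields" approach described above. The record layout, the `enrich` helper and the `derived_` prefix are hypothetical illustrations, not something taken from the article:

```python
from typing import Any, Mapping


def enrich(record: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy of the upstream record with derived fields added."""
    out = dict(record)  # keep every upstream field, names unchanged
    # Derived fields get a prefix so they are unlikely to collide with upstream keys.
    first = record.get("first_name", "")
    last = record.get("last_name", "")
    out["derived_full_name"] = f"{first} {last}".strip()
    return out


upstream = {"first_name": "Ada", "last_name": "Lovelace", "undocumented_field": 42}
row = enrich(upstream)
```

The undocumented field survives untouched, and the input record is not mutated, so downstream users still see exactly what the upstream sent.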

fulafel · 3 years ago

Another option besides types is using a schema library. You can do more things, like define custom validation rules over eg several fields, publish the schema as data (eg at an API endpoint, OpenAPI or JSON Schema etc), reuse it in another language (depending on schema system), version it explicitly, and machine generate it if it comes from some external spec (like a db schema).

Also great for property testing / fuzzing. And other fun meta datamodel stuff like eg inferring schema from example data.

In general programming language type systems are pretty weak in comparison because they're not very programmable. (In most languages, for most people, etc .. there are fancy level type systems approaching formal proof toolkits but they're hard to use)
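
To make the "schema as data" idea concrete, here is a rough standard-library-only sketch. It is not the API of any real schema library; `ORDER_SCHEMA`, `TYPES` and `validate` are made-up names, and a real project would more likely reach for an existing library:

```python
import json

# Hypothetical schema expressed as plain data, so it can be serialized and published.
ORDER_SCHEMA = {
    "required": {"id": "int", "start": "str", "end": "str"},
    "rules": [{"kind": "ordered", "fields": ["start", "end"]}],
}

TYPES = {"int": int, "str": str}


def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    errors = []
    for field, type_name in schema["required"].items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], TYPES[type_name]):
            errors.append(f"{field} should be {type_name}")
    if errors:
        return errors  # cross-field rules only make sense on well-formed fields
    for rule in schema["rules"]:
        if rule["kind"] == "ordered":
            first, second = rule["fields"]
            if record[first] > record[second]:
                errors.append(f"{first} must not come after {second}")
    return errors


published = json.dumps(ORDER_SCHEMA)  # the same schema, ready to serve or version
```

Because the schema is just data, the cross-field rule travels with it, and fields the schema does not mention are left alone rather than rejected.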

skybrian · 3 years ago

This sounds specific to a particular company's organization where there are at least three different systems involved and no single source of truth. It seems like that's a problem in itself - how do you get everyone to refer to and update the same document?

Ideally everyone would be using a single type definition. Admittedly that's more common with protobufs, though, where you can't send any data that's not in the definition.

Come to think of that, it's true of plain old structs too.

tomazio · 3 years ago

This is more common than you might otherwise think. I've worked at multiple companies that have multiple systems/sources of truth for various reasons. One example is my current company, which has stored and handled all its transactional data in a legacy point of sale system from the early 90s. They decided to upgrade to a modern ERP system a couple of years ago, but it takes a while to fully implement and roll over to a new source system, especially in a high-transaction system that cannot go down, otherwise the company will start losing a lot of money. Thus it's being rolled out incrementally, resulting in both systems running together and being read and written to simultaneously.

oivey · 3 years ago

Sometimes defining who should have authority over a singular original type definition isn't possible. This is sometimes true at companies, and it's even more true in open source projects. Even when possible, single type definitions in those cases often end up as Homer-car monstrosities that are too big and difficult to construct when only a small subset of fields are needed.

throwaway894345 · 3 years ago

This is one of the things I appreciate about languages like Go and Rust (I'm sure there are others as well). If the data is static, use a struct. If the data is dynamic, use a map/HashMap. No need to worry about TypedDict vs classes vs DataClasses vs etc, and no one uses HashMap for static data (they could, but virtually no one in those communities is such a glutton for punishment).

From Zen of Python:

> There should be one--and preferably only one--obvious way to do it
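
For comparison, the struct-versus-map split can be approximated in Python like this. It is only a sketch of one convention (a frozen dataclass for static shapes, a plain dict for dynamic ones); `Point` and `count_words` are invented for the example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    # Static shape: the fields are known when the code is written.
    x: float
    y: float


def count_words(text: str) -> dict[str, int]:
    # Dynamic shape: the keys come from the data itself.
    counts: dict[str, int] = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts
```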

mathisonturing · 3 years ago

Forget about DataClasses, TypedDict etc. Can't you achieve the same in python with a class and a dict? Is there a difference, other than perhaps being overloaded with options?

bottled_poe · 3 years ago

Python is one of those languages where everything starts to look like a nail.

xdfgh1112 · 3 years ago

A popular AWS API library does this and it is infuriating. AWS added a new field but the library hasn't been updated yet? Too bad, you can't use that field then!

imankulov · 3 years ago

True. Don't slap in types just because you can — add types when you need to work with your data. Most of the time, I worked with systems where my python code *was* the downstream and required data to run some business logic. In that context, types make the most sense.

Python's strapped on type annotations have been designed around traditional OOP, and it feels like a bad fit for the language. Duck typing is a tremendously powerful form of polymorphism, and none of the PEPs for type annotations do a great job of supporting it. Protocols don't work well with dataclasses and not at all with dicts. TypedDicts could have been perfect, but they explicitly disallow extra keys. Why even use a TypedDict instead of a dataclass? Why make yet another traditional OOP abstraction that was already well served by multiple other features of the language? Even more frustratingly, TypedDicts show that it could have been done. They just decided to break it on purpose.

TFA accidentally even brings up the reason why dicts are so powerful: they enable easy interoperability between libraries (like a wire format). Using two libraries together that insist on their own bespoke class hierarchy is an exercise in data conversion pain. Further, if I want a point to be an object containing fields for "x" and "y", I'd much rather just use a dict rather than construct an object in some incompatible inheritance nightmare.
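
A small sketch of the TypedDict behaviour being complained about. At runtime a TypedDict value is an ordinary dict, so the restriction on extra keys lives entirely in the static checker; the `Point` and `shift` names are hypothetical, and the comment about what a checker reports is an expectation, not captured output:

```python
from typing import TypedDict


class Point(TypedDict):
    x: int
    y: int


def shift(p: Point, dx: int) -> Point:
    return {"x": p["x"] + dx, "y": p["y"]}


ok: Point = {"x": 1, "y": 2}

# Nothing stops extra keys at runtime. A static checker such as mypy is
# expected to reject passing this dict where a Point is required.
extra = {"x": 1, "y": 2, "label": "origin"}
moved = shift(extra, 5)  # runs fine; only the type checker would object
```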

djrobstep · 3 years ago

Dealing with all these differences is one of the most frustrating, stupid things about programming today.

99% of the data i deal with on a day-to-day basis is lists and mappings.

Very conceptually simple, but with a million different implementations. Particularly in python where we have dicts, namedtuples, dataclasses, regular objects, etc etc etc, then you deal with databases (which are really just mappings of keys to rows), where the interaction works completely differently again (with annoying differences for each database of course). Then hundreds of different encodings again to send things across a network or save them to files.

None of this complexity is inherent to the problems being solved - it's all accumulated cruft and bullshit.

oivey · 3 years ago

At least with things like databases and Pandas you can claim that there might be a valid performance reason for a different abstraction. Regular objects allow for inheritance which I usually find bad, but lots of people do like it. NamedTuples, TypedDicts, and Dataclasses are basically all rapid iterations on the same idea with the same purpose.

pharmakom · 3 years ago

I completely agree.

You would probably love Clojure. Perhaps you tried it already?

elcritch · 3 years ago

I haven’t programmed in Python a lot for years. Though I still somewhat follow the new features and versions and wow it surprises me how often modern Python misses an elegant solution that could simplify the ecosystem in favor of bespoke new syntax and new ways to do more incompatible OO.

Interestingly I’ve actually been using _more_ duck-typing style programming in Nim as it’s become my daily driver.

It’s kinda funny: since Nim is a statically typed language you’d think it’d be hard, yet it’s so seamless to use compile-time checks that it’s easy to think of it as runtime duck typing. You can add overloaded types to a function like `proc foo(arg1: int | string, arg2: float)` and then use the `when` statement to change behavior in parts of the function handling the specifics for the types. It’s a really powerful way to handle polymorphism and things like visitor patterns without a bunch of OO infrastructure. I take it the Python type annotations aren’t embracing that overloaded type setup?

You can even trivially use duck typing with type declarations https://nim-lang.org/docs/manual.html#generics-is-operator There’s another pattern I’ve taken to of just declaring getter/setters for things like “X” and “Y”, except just from a generic array. I mean “X” is just a convention for arr[0] right? https://github.com/treeform/vmath/blob/5d7c5e411598cd5cf9071...

Really I hope “duck typing” becomes more the norm rather than the OO stuff. I’m curious what the story in Swift on this topic is nowadays.
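
On the question about Python: annotations do allow a union of types on a parameter. A rough Python counterpart of the `foo` signature above might look like the sketch below; the function body is invented, and unlike Nim's compile-time branching the `isinstance` check here runs at runtime:

```python
from typing import Union


def foo(arg1: Union[int, str], arg2: float) -> str:
    # isinstance plays the role of the per-type branch, but at runtime
    if isinstance(arg1, int):
        return f"int {arg1 * 2} scaled by {arg2}"
    return f"str {arg1.upper()} scaled by {arg2}"
```

A type checker can narrow `arg1` inside each branch, so `arg1.upper()` is only reached when `arg1` is known to be a `str`.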

cerved · 3 years ago

Having a proper type system can be immensely powerful. IMHO, duck typing is just adding the burden of type checking to the application layer instead of letting a compiler or linter deal with it. Python's lack of a good type system is what I miss most.

Doxin · 3 years ago

The compiler can still do type checking even when using duck typing. It's important to note that duck typing and weak typing are entirely orthogonal. You can have either, both, or neither.

E.g. an example in D of a function that doesn't care too much about the type you pass in:

    T doublify(T)(T v){
        return v*2;
    }

These are all fine:

    writeln(doublify(3));
    writeln(doublify(3.0));
    writeln(doublify(3u));

But this still throws a compile error like you'd expect:

    writeln(doublify("3"));
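
For readers more at home in Python, a loosely equivalent sketch uses a constrained `TypeVar`. The important difference, assuming a checker such as mypy is in use, is that the rejection happens in the checker rather than in a compiler, so the last call still runs:

```python
from typing import TypeVar

Number = TypeVar("Number", int, float)


def doublify(v: Number) -> Number:
    return v * 2


print(doublify(3))
print(doublify(3.0))

# A checker should reject this call because str is not one of the TypeVar's
# constraints, but nothing stops it at runtime: "3" * 2 is "33".
print(doublify("3"))
```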

oivey · 3 years ago

Duck typing is a superset of inheritance. If your language only supports polymorphism via inheritance, then it is strictly less expressive than a language with duck typing.

ambrose2 · 3 years ago

Prior to dataclasses, didn’t the attrs library come about to address a gap, and then dataclasses were added, inspired by attrs? I mean yeah, ideally the best structures would have been designed from the start, but the history is understandable.

oivey · 3 years ago

Dataclasses came out in 3.7, and TypedDicts and Protocols in 3.8. I had to check. I knew they were pretty close.

fortzi · 3 years ago

> Further, if I want a point to be an object containing fields for "x" and "y", I'd much rather just use a dict rather than construct an object in some incompatible inheritance nightmare.

That’s what Protocols are for
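
A minimal sketch of what that could look like, with hypothetical names (`HasXY`, `PointA`, `PointB`, `manhattan`). The protocol declares read-only properties so that both a mutable dataclass and an immutable NamedTuple are intended to match; it does not settle how a given checker handles every dataclass case, and a plain dict with "x" and "y" keys does not satisfy it because keys are not attributes:

```python
from dataclasses import dataclass
from typing import NamedTuple, Protocol


class HasXY(Protocol):
    @property
    def x(self) -> float: ...

    @property
    def y(self) -> float: ...


@dataclass
class PointA:
    x: float
    y: float


class PointB(NamedTuple):
    x: float
    y: float


def manhattan(p: HasXY) -> float:
    # Works for anything exposing .x and .y, with no shared base class.
    return abs(p.x) + abs(p.y)
```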

oivey · 3 years ago

They don't work in the way you would think for dataclasses or at all for dicts/TypedDicts.

See this for dataclasses: https://github.com/python/mypy/issues/5374#issuecomment-8841....

BerislavLopac · 3 years ago

> none of the PEPs for type annotations do a great job of supporting it

Except for protocols.

snidane · 3 years ago

valbaca · 3 years ago

Interesting how Clojure takes the complete opposite approach by simply making dicts immutable.

https://chasemerick.files.wordpress.com/2011/07/choosingtype...

epgui · 3 years ago

Yeah, as a clojurist this made me laugh: just like people will naturally feel an urge to fill up conversational silence with words, people can’t seem to be able to go without their classes for more than 5 minutes.

I don’t have anything against classes in theory, but I’m of the opinion that 99.9% of classes out there just shouldn’t exist.

lkrubner · 3 years ago

Clojure has established the gold standard for beautiful abstractions that unify broad categories of data types. Its seq interface is elegant and powerful. Python's efforts towards optional data typing or strict data typing look especially clunky, awkward, forced, and painful when compared to Clojure.

andreareina · 3 years ago

Clojure also makes working with hashes a whole lot more ergonomic with destructuring and symbol keys.

fastball · 3 years ago

Could you clarify a bit here? Python also has destructuring for its dicts and I'm not entirely sure what you mean by symbol keys.
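
The comment does not say which Python feature it has in mind. Two candidates that work with plain dicts are sketched below, using an invented `user` record; Python 3.10 and later also offer mapping patterns in `match` statements:

```python
from operator import itemgetter

user = {"name": "Ada", "lang": "Python", "year": 1843}

# Pull several keys out in one go.
name, lang = itemgetter("name", "lang")(user)


def describe(name, lang, **rest):
    # Keyword unpacking binds the named keys and collects the remainder.
    return f"{name} writes {lang}", rest


summary, leftovers = describe(**user)
```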

josh_fyi · 3 years ago

You still need to know what keys to expect. Clojure maps that get replaced with a new map have the same problem as a mutable dict.

nicbou · 3 years ago

This is something I enforced in a big rewrite at a previous company.

People would take a full API response, and pass bits of it around with mutations. Understanding what the object looked like 5 functions deep was really hard. If the API changed... Oh boy.

I found many bugs just tracing the code like this. It made me a big proponent of strong typing, or at least strong type hinting.
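
One way to picture that kind of rewrite is to convert the response into a small immutable type at the boundary, so a changed API fails in one place instead of five functions deep. This is a sketch with a hypothetical `User` payload, not the code from that company:

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class User:
    id: int
    email: str

    @classmethod
    def from_api(cls, payload: Mapping[str, Any]) -> "User":
        # Fail here, at the boundary, if the response shape changes.
        return cls(id=int(payload["id"]), email=str(payload["email"]))


def greeting(user: User) -> str:
    # Downstream code sees a fixed shape and cannot mutate it.
    return f"Hello, {user.email}"
```

Note that this version silently ignores fields it does not name, which is exactly the trade-off the top comment warns about.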

mejutoco · 3 years ago

Same experience here.

It even has additional advantages, such as generating open api files automatically from the types and validating payloads between microservices.

Pydantic and Typeguard are two very useful libraries in this context.

djhaskin987 · 3 years ago

This opinion gets at the heart of the reason to use typed languages or not. After all, what is a dict but an untyped struct?

Untyped languages are excellent for smaller code bases because they are more comfortable to program in and faster and more general. Types of polymorphism possible in these languages are simply not possible or much harder in typed languages. Also, as others have said, the problem domain may not be as explored yet.

Typed languages really start to shine as a code base gets huge. In these instances well-maintained untyped code bases start collapsing under the weight of their own unit tests, while moderately or poorly maintained ones become a mess. Mostly this is due to difficulties in communication when the code base gets worked on by so many people that it's hard for them all to communicate with each other. In these cases a typed language keeps everyone on the same page to some extent.

Both camps will hate me for saying this I think, but it's what I've observed over the years.

It also may sound like I prefer typed languages, but in fact my favorite languages to work in are Clojure and Python. My code bases as a DevOps engineer rarely pass the 10,000 line mark and never pass 100,000 line mark. It's much more comfortable for me in these untyped languages.

Untyped languages also really shine in microservices for the same reason.

madsbuch · 3 years ago

* Don't let dicts spoil your python code

Maybe that was implied?

Anyways, a lot of languages take another stance. E.g. Elixir, where using dicts along with pattern matching makes for quite powerful abstractions.

As long as the dicts are kept shallow, and the amount of indirection in the code in general is kept low too, they are alright to navigate and use.

Yes, the context is Python.

ampgt · 3 years ago

Glad to see pydantic get mentioned here. It’s a great solution for this exact problem. I was introduced to it by FastAPI and have been using it in all my projects since.

At the end of the day you really can’t escape typing. It just makes life easier. We should stop letting languages try to remove it.

asddubs · 3 years ago

Took me a really long time to learn this lesson. IMO this is a variation of the primitive obsession code smell, although I'd say it's way more harmful. I was really reluctant to add data classes to my code when the good old PHP array could get the job done without holding me up with a bunch of bureaucracy. Of course they give no guarantees and enforce no structure, so inevitably you get slight variations depending on what you need, or maybe you happen to have a dict that's a superset of what you're feeding in, and it just becomes really hard to reason about things. And of course since it's not a named type, tracing things back becomes really hard.

Supermancho · 3 years ago

> so inevitably you get slight variations depending on what you need

If you have a generic collection, you know it's generic. It does remove a class of errors when you start adding types, but it also adds problems in making changes as a tradeoff. Now I have to make a PR that is the change I want AND I have to modify the type, which comes with explaining/understanding that there isn't a reason to use 2 different types or what the consequences would be to create a second generic collection from the first and modify THAT instead (eg lists with different types, how big are they?).

Never was a big problem using generic collections over the last 30 years and plenty of languages are fine without the training wheels of defining every data structure as a type, so I'm not sure what this ranting is all about.

I'm not sure if I'm misunderstanding your point or you're misunderstanding mine, so I'll just carefully say that the PHP array is really a dictionary + array combination type, and I was referring specifically to its use as a dictionary (since that's what TFA is about). If you're returning a list of things that are all the same type, I agree that an array, or an array of a certain type if generics are available, is totally fine and serviceable.

But if you're passing in/out some monstrosity which has a structure that you can only really find out by reading the code, often from top to bottom if different parts of the dictionary are referenced in different parts of the code, you are really setting yourself up for trouble down the line.

BurningFrog · 3 years ago

Whenever you use the same string key in different parts of the code, you take one more step on the Legacy Code Road...

I still walk that road sometimes, but not for very long.

klyrs · 3 years ago

Who needs hashes when you've got variable variables? ~ me, 20 years ago, learning the hard way

I still don't know how I feel about the fact that in PHP $$var (a) works, and (b) does exactly what you'd expect.