shoo · 2 years ago

  > group related concepts together
  > The hardest part of this process is deciding what “related concepts” mean.
The article talks about "readability", but arguably the unnamed hard problem it is dancing around is how to structure an application or system by decomposing it into modules.

I'd argue the baseline reasonable approach to structuring applications or systems is the one given in Parnas' 1972 paper "On the Criteria To Be Used in Decomposing Systems into Modules":

  > We propose instead that one begins with a list of difficult design decisions or design decisions which are likely to change. Each module is then designed to hide such a decision from the others.
http://sunnyday.mit.edu/16.355/parnas-criteria.html

Parnas' criterion embeds the understanding that code and systems are not static but need to evolve over time as requirements change or decisions are made, and that different decompositions can be better or worse at accommodating that change.

"Don't repeat yourself" refactoring rules of thumb can give poor results if blindly applied. Suppose two sections of application logic just so happen to look similar at this moment in time and get refactored to "remove the duplication" coupling them together, when the two sections of code are subject to different constraints and reasons for change, and will need to evolve separately.
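A minimal Python sketch of the trap (the domains and limits here are invented for illustration): the two checks are textually identical today, but merging them into one helper would couple accounting rules to fraud policy.

```python
def validate_invoice_total(total_cents: int) -> bool:
    # Constraint comes from accounting rules.
    return 0 < total_cents <= 10_000_00

def validate_refund_amount(amount_cents: int) -> bool:
    # Constraint comes from fraud policy; it only *coincidentally*
    # matches the invoice rule right now. DRYing these into one
    # "validate_amount" couples two things with different reasons to change.
    return 0 < amount_cents <= 10_000_00
```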

TeMPOraL · 2 years ago
People forget that readability isn't a function of a specific program - there is no one optimal readability. On the contrary, it's a function of the program and the goals of the reader. So after fixing and DRYing all the generally bad/inefficient decisions, what counts as readable code becomes solely a question of why you're reading it - trying to debug or add an entirely new feature will have opposite readability criteria to extending some high-level feature.

Even in the best case, readability just becomes a Pareto frontier[0], given by the expressive limits of the dominant programming paradigm - the same single plaintext source code for all. There's only so much complexity, so many cross-cutting concerns, we can cram into the same piece of plaintext until something gives - until the same code is beautiful to you one week and incomprehensible the next, when the only thing that changed is the type of work you're doing on it.

So, beyond evolving over time, I'd also consider the orthogonal aspect of different decompositions being good for different purposes, and that you can't have it all and work on the same, single, high-level plaintext code.

EDIT: And I believe the solution to this, the step forward beyond the Pareto frontier, is what 'valty described here: https://news.ycombinator.com/item?id=39426895 - not coding directly in the same plaintext, but treating the single-source-of-truth code as a database, which you query and update through views/lenses that best fit whatever work you're doing at the moment.

--

[0] - https://en.wikipedia.org/wiki/Pareto_front

UweSchmidt · 2 years ago
If it's not too much trouble, could you create a minimal demonstration of a simple piece of code, structured for various goals - easy to extend, easy to debug, etc.? I can't defend my code from the best-practice people with a Pareto front Wikipedia article.
lupusreal · 2 years ago
In my experience, most programmers go way too hard with anticipating future changes and end up creating systems with entirely too much abstraction. Most of the time those changes never occur and the result is a codebase which has been obfuscated with excessive abstraction that bogs down anybody trying to maintain it. Future changes end up needing different abstractions than the ones which were preemptively created, and as the programmer is adding new abstractions to cover their present need they also create new premature abstractions, thinking they're saving themselves future trouble. The cycle then continues.

Better to KISS and leave abstraction for the future when it actually becomes necessary. If you start out with code that is only as complex as it needs to be in that moment, then it will generally be much easier to change it in the future.

jpc0 · 2 years ago
> Better to KISS and leave abstraction for the future when it actually becomes necessary.

I love this statement, and for me personally it's also perfectly encompassed in YAGNI: you ain't gonna need it.

Until it's proven you need that piece of code, don't write it.

Izkata · 2 years ago
> "Don't repeat yourself" refactoring rules of thumb can give poor results if blindly applied. Suppose two sections of application logic just so happen to look similar at this moment in time and get refactored to "remove the duplication" coupling them together, when the two sections of code are subject to different constraints and reasons for change, and will need to evolve separately.

I'm fairly sure I remember reading somewhere that that piece of advice was originally meant for data/values/configuration, not code, and that applying it to code is itself a mistake that keeps getting repeated.

aranchelk · 2 years ago
Regardless of who said it and what they meant, I don’t want more code to write tests for, more pages to read through when stuff breaks, more material for new engineers to learn. You can always start copy-pasting and make a mess later - less true in the reverse.

Like almost anything it can be taken too far or misapplied.

In the quoted example, when I have multiple occurrences of related business logic, I build a vocabulary of reusable sub elements - find the joints and carve, don’t build a giant mutant.

surprisetalk · 2 years ago
Author here :)

Wasn't familiar with Parnas' criteria, thanks for sharing.

I do something similar in a different way, which I call "IKEA-oriented development". IME, semi-disposable code is very easy to change over time as mental models and product goals evolve:

[1] https://taylor.town/ikea-oriented-development

shoo · 2 years ago
Thank you for the post, and the link to this second one as well.

Re: "IKEA-oriented development", you make a very good point about the cost of change. I think the semi-disposable code idea overlaps comments from folks elsewhere in this discussion thread talking about the horror of codebases that introduced premature abstractions to cope with expected future changes that then never actually appeared ("YAGNI" is indeed a good rule of thumb).

Your point about "make experimentation effortless" is a good one. The highest productivity environment I worked with that supported rapid experimentation was a small business' monorepo codebase with good test coverage and rapid feedback from CI, where the library code was only used internally by the company's software products (i.e. all the abstractions were implementation details, not part of any external interface). Over time we'd learn that some of our early ideas for abstractions in the internal libraries were flawed, but because these abstractions were internal, and we had confidence in the automated test coverage, it was possible to make quite large scale improvements to abstractions rapidly with confidence as we learned more.

The kind of environment that really bogs down experimentation and impedes change and improvements to abstractions is where an initial idea for an abstraction is resourced with its own development team and turned into a production service, and then another half a dozen internal company services start depending on it. Then it's very easy to end up in situations where everyone becomes aware that the abstraction is flawed, but improving it is less "one developer goes dark for a week or two and emerges with a 50-patch PR that atomically replaces the flawed v1 abstraction with v2 while passing all test suites in all projects that depend upon it" and more "project managers, product owners and enterprise architects compare roadmaps for the next few quarters to figure out how many years it might be until a prototype of the v2 abstraction can be ready for manual testing in the integrated test environment".

Maybe in the worst case there's some initial decomposition of the system that is flawed, then an org chart is spun up defining teams that own components matching the flawed system decomposition, so refactoring to improve the decomposition would also require refactoring the org chart to change people's teams. Then instead of having colleagues indifferent to or supporting a purely technical refactor, people will resist it to avoid change to their roles!

calvinv · 2 years ago
I haven't worked on a project where we've known all our problems up front, and most of the time the complexity is added to cater for "flexibility" that rarely ends up being a useful implementation for what we actually needed. It's great to hide this from other areas, but you will still need to work on it, and it will impact how the software is architected.
BenoitEssiambre · 2 years ago
That's interesting. Knowing when to decompose systems into modules indeed seems to be key. This is a complex problem because, I think, the choice of the optimal model depends on the uncertainty you have about the reality behind the data, about what you know and don't know about the domain you are modeling.

But there might be optimal solutions rooted in information theory and Bayesian probabilities that you can strive to approach while programming. This is about avoiding over-fitting or under-fitting your domain knowledge.

Theoretically speaking, finding the right Bayesian fit optimizes for future evolution of the code and how it generalizes into the unknown, how correct your software will be when faced with things you haven't specifically designed for. More here: https://benoitessiambre.com/abstract.html

If I were to add something to abstract.html blog post, it would be something about Dependency Length Minimization ( https://www.pnas.org/doi/full/10.1073/pnas.1502134112#:~:tex... ) which has important information theoretic ramifications (for example, files with shortened dependencies tend to compress better and LLMs became much better when they solved for managing dependencies with their "attention" mechanism). When an abstraction breaks out a piece of code to enable reuse, the reduction in redundancy should be weighted against the stretching of dependencies to decide whether the abstraction is warranted.

The original article acknowledges this by mentioning "locality".

Other things to take into account is how tests fit into all this. Again more here: https://benoitessiambre.com/abstract.html

shoo · 2 years ago
Your linked blog post "Abstraction and the Bayesian Occam's Razor" is very interesting. I'll play back my understanding to you, to see if I'm approximately following and summarising your thesis.

Context:

When programming we attempt to design an effective abstraction that models some domain. When designing this abstraction, there are trade-offs between reducing the amount of code required, enabling reuse, reducing coupling, flexibility to accommodate future use cases.

Key Problem:

How do we design an abstraction for our domain model?

Claim 1:

Apply the "Minimum Description Length" (MDL) model selection principle: prefer a domain model embedded in the shortest program able to recreate the dataset of domain knowledge.

Applying MDL model selection will result in an abstraction for the domain model that is both smaller -- giving less code to maintain -- and more likely to generalize to future unknown use cases.

Complication:

Applying the MDL model selection principle relies on having access to a dataset of domain knowledge. We can think of this dataset of domain knowledge as a list of (situation, expected behaviour) pairs -- c.f. a labelled supervised learning dataset, or a gigantic list of requirements. Unfortunately, in typical software projects, no such explicit dataset cataloguing the requirements or expected behaviour in each situation exists.

Claim 2:

We can use the automated test suite as a proxy for the dataset of domain knowledge. When designing our abstraction we should prefer a domain model where the combined size of the logic for the domain model and the size of the corresponding test suite* is minimal.

* with the important caveat that "just cutting out the tests, or removing other safeties like strict types doesn't give you a lower MDL, in that case, you're missing the descriptions of important parts of your data or knowledge".

valty · 2 years ago
> In my experience, the key to maintaining readability is developing a healthy respect for locality

I think this pursuit of "locality" is what actually causes more complexity. And I think it's mainly down to our obsession with representing our code as text files in folder hierarchies.

> coarsely structure codebases around CPU timelines and dataflow

This is why I would prefer code to be in a database, instead of files and folders, so that structure doesn't matter, and the tree view UI can be organized based on runtime code paths, and data flow - via value tracing.

> don’t pollute your namespace – use blocks to restrict variables/functions to the smallest possible scope

Everyone likes to be all modular and develop in tiny little pieces that they assemble together. Relying on modularization means that when stuff changes upstream in the call stack, we just hack around these changes adding some conditionals to handle these changes instead of resorting to larger refactors. People like this because things can keep moving instead of everything breaking.

Instead, what we need to do is make it easier to trace all the data dependencies in our programs so that when we make a change to anything, we can instantly see what depends on it and needs updating.

I have actually started to think that, against conventional wisdom, everything should be a global variable. All the issues with global variables can be solved with better tracing tooling.

Instead we end up with all these little mini-databases spread all over our code, when what we should have is one central one from which we can clearly see all the data dependencies.
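A toy sketch of what that central store plus tracing tooling could look like (all names are invented; an `inspect`-based caller lookup is just one crude way to record dependencies):

```python
import inspect
from collections import defaultdict

class TracedStore:
    """One global store that records which function touched which key."""

    def __init__(self):
        self._data = {}
        self.readers = defaultdict(set)  # key -> functions that read it
        self.writers = defaultdict(set)  # key -> functions that wrote it

    def _caller(self) -> str:
        # Two frames up: past _caller and past get/set.
        return inspect.stack()[2].function

    def get(self, key):
        self.readers[key].add(self._caller())
        return self._data[key]

    def set(self, key, value):
        self.writers[key].add(self._caller())
        self._data[key] = value

STORE = TracedStore()

def update_price():
    STORE.set("price", 42)

def render_price():
    return STORE.get("price")
```

After running both functions, `STORE.writers["price"]` and `STORE.readers["price"]` answer "who depends on this global?" instantly, which is the mitigation being argued for.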

> group related concepts together

Instead, we should query a database of code as needed...just like we do with our normalized data.

verinus · 2 years ago
I was thinking about code along the same lines: we are modeling, not writing text. This just happens to be the best way to express our models in a way a computer can be made to understand it, be formal enough and still be understandable by others.

What current languages are bad about is expressing architecture, and the problem of having one way to structure our models (domain models) vs. the actions/transformations that run on them (flow of execution).

I strongly disagree on the global variable side though...

valty · 2 years ago
> I strongly disagree on the global variable side though...

My thinking is that software has been terrible (over-complex) for such a long time that it's time to start questioning our most dogmatic principles, such as "global variables are bad".

Imagine you can instantly see all the dependencies to/from every global variable whenever you select it. This mitigates most of the traditional complaints.

I would argue that adequate tooling that allows for this would dramatically simplify all development. It's the only thing that matters, and it's so absent from every development platform/language/workflow.

If we could only see what was going on in our programs, we would see the complexity, and we could avoid it.

Another related bit of dogma is _static scoping_. Why does a function have to explicitly state all its arguments? Why aren't we allowed to access variables from anywhere higher up in a call stack?

What you realize is that all of these rules exist so you can look at plain text code and (kind of) see what is going on. This is a holdover from low-powered computers without GUIs, like most of programming. Even if an argument is explicit, if it's passed down via 10 layers, you still have to go look.

Sakos · 2 years ago
I think the main problem is that we think of code as text. So the only way to determine if code is related is by parsing all of the text again. I'm not sure if a database representation is really the correct path to take, but I think we need some other way to represent code and give parts of code meaning.
surprisetalk · 2 years ago
Author here!

You may be interested in the programming language I've been working on :)

[1] https://scrapscript.org

dack · 2 years ago
reminds me of unison in some ways. did that provide some inspiration?
hnben · 2 years ago
Your ideas sound intriguing. Are they original, or can I read up on them somewhere?
hcs · 2 years ago
I've heard the name Intentional Programming applied to this or a similar concept https://en.wikipedia.org/wiki/Intentional_programming

> Tight integration of the environment with the storage format brings some of the nicer features of database normalization to source code. Redundancy is eliminated by giving each definition a unique identity, and storing the name of variables and operators in exactly one place.

TeMPOraL · 2 years ago
They're both old and completely ignored. People occasionally reinvent them when they e.g. store code in DBs, or add scripting languages to their programs, or build a new programming language because Hello World in Java is too verbose.

Unison plays with these ideas (I tried it, it's taking things in the right direction, though I still can't figure out how to write anything more complex than sorting numbers in the REPL with it; the examples are too Haskelly, IMHO.) Smalltalk language is, I believe, the original - built around the assumption that code is in the database, and coming with a built-in IDE for this. Glamorous Toolkit is trying to push this further, to give programmers better ability to create ad-hoc problem-specific views into their programs.

I've seen a few other articles written about this over the years, but I don't have any link handy.

valty · 2 years ago
I've worked them up from questioning the things about programming that seem most rigid and dogmatic over many years. But there is a lot of literature I have found along the way.

Intentional Programming is an interesting read as someone has already mentioned...from the guy who brought us `strHungarianNotation`. Storing code in a database but retaining the joy of the plain text cut/copy/paste experience is the key challenge, as well as all the unix file goodness.

It's quite fun to talk to ChatGPT about these topics and just question everything and delve back into the history of programming.

elbear · 2 years ago
As far as I'm aware, the Unison Language implements some of his ideas: https://www.unison-lang.org
camgunz · 2 years ago
To me they felt very similar to Joe Armstrong's "Why do we need modules at all?" [0]:

---

Why do we need modules at all?

This is a brain-dump-stream-of-consciousness-thing. I've been thinking about this for a while.

I'm proposing a slightly different way of programming here The basic idea is

- do away with modules

- all functions have unique distinct names

- all functions have (lots of) meta data

- all functions go into a global (searchable) Key-value database

- we need letrec

- contribution to open source can be as simple as contributing a single function

- there are no "open source projects" - only "the open source Key-Value database of all functions"

- Content is peer reviewed

... ---

Whole thread's worth a read.

[0]: https://erlang.org/pipermail/erlang-questions/2011-May/05876...
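Armstrong's proposal can be caricatured in a few lines - a global, searchable key-value store of functions with metadata (a toy sketch, not his actual design):

```python
# Global key-value database of all functions, instead of modules.
FUNCTIONS = {}

def register(name, *, tags, doc):
    """Give a function a unique name plus metadata and put it in the store."""
    def wrap(fn):
        FUNCTIONS[name] = {"fn": fn, "tags": set(tags), "doc": doc}
        return fn
    return wrap

def search(tag):
    """Discovery happens by querying metadata, not browsing modules."""
    return sorted(name for name, meta in FUNCTIONS.items()
                  if tag in meta["tags"])

@register("math.fib.naive", tags=["math", "recursion"],
          doc="Naive recursive Fibonacci")
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```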

adius · 2 years ago
I am working on storing code in SQLite: https://mailchi.mp/62e9b4a81f16/cosuz
jihiggins · 2 years ago
this is sort of what modern ides (e.g. jetbrains stuff) already do in the bg. when im working on stuff, i almost never navigate via text or the file explorer, i use things like "goto usages or definition" and navigate via what is essentially data tracing. this only works well with statically typed languages ime, though.

the indexing step is basically building this db in the background, it's just kept out of view / hidden unless you're building ide plugins or whatever.

valty · 2 years ago
> via what is essentially data tracing

Value tracing is at runtime. JetBrains cannot trace how values flow through your code.

To do this, you need to instrument all your code, and track all the transformations that occur for each value. It's really difficult to do if the language is not designed for it and there are a lot of performance implications.

If your code is written in a functional paradigm it becomes much easier to trace...such as with rxjs.
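A toy sketch of what runtime value tracing looks like in a functional style (names invented; real instrumentation would need language-level support, as the parent notes):

```python
def traced_pipeline(*steps):
    """Wrap (name, fn) steps so every transformation of a value is recorded."""
    def run(value):
        history = [("input", None, value)]
        for name, fn in steps:
            new_value = fn(value)
            history.append((name, value, new_value))  # provenance record
            value = new_value
        return value, history
    return run

normalize = traced_pipeline(
    ("strip", str.strip),
    ("lower", str.lower),
)
result, trace = normalize("  Hello ")
```

`trace` now shows exactly how the value flowed through each step, which is the kind of visibility an IDE's static "go to usages" can't give you.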

jpc0 · 2 years ago
> All the issues with global variables can be solved with better tracing tooling

I would argue this problem is solved in most current languages with strict types.

Stop making all the things strings or abstract base classes.

Easy example I've worked on recently: an IPv4 address is an IPv4Address in code. I don't care if it is just represented as a uint32 or a string in memory; in your code it should be an IPv4 address, and if a function expects an IPv4 address and you pass it a string, that is a compilation error.
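A sketch of that idea in Python (a hypothetical `IPv4Address` wrapper; Python only gets the "compilation error" via a type checker such as mypy or pyright):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IPv4Address:
    value: int  # stored as a uint32 internally; callers never see that

    @classmethod
    def parse(cls, text: str) -> "IPv4Address":
        parts = [int(p) for p in text.split(".")]
        if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
            raise ValueError(f"not an IPv4 address: {text!r}")
        n = 0
        for p in parts:
            n = (n << 8) | p
        return cls(n)

    def __str__(self) -> str:
        return ".".join(str((self.value >> s) & 0xFF) for s in (24, 16, 8, 0))

def connect(addr: IPv4Address) -> str:
    # A type checker rejects connect("10.0.0.1"); only a parsed,
    # validated IPv4Address gets this far.
    return f"connecting to {addr}"
```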

loup-vaillant · 2 years ago
Here's what I have to say about locality: https://loup-vaillant.fr/articles/source-of-readability

TL;DR: locality is likely one of the most important concepts in all of knowledge work. Of course we're a little obsessed with it.

valty · 2 years ago
> Our screens offers only a small window, and even the smartest IDE can’t give us instant access to everything

This is the real problem that needs solving.

> code that is read together should be written together

Code is a database of functions. This approach is like trying to design a database in denormalized form.

silon42 · 2 years ago
How would you do the diff/apply?

Those tools are essential and they basically rely on locality (context).

valty · 2 years ago
Semantic diffs.
Chris_Newton · 2 years ago
I have actually started to think that, against conventional wisdom, everything should be a global variable. All the issues with global variables can be solved with better tracing tooling.

This is an interesting premise, and actually I think we have quite a lot of examples both of successfully applying the idea up to a point and of where it starts to break down in practice.

Modern distributed applications mostly have a back end with some kind of database and front end UIs that depend on that back end via an API. Those databases are often global data stores, accessible from anywhere in the back end implementation. A lot of work is done to design and manage them, probably modelling some real world system that our application is concerned with, and there are varying degrees of abstraction/isolation used to preserve that design intent.

If the data model is simple then this works OK, particularly if you have a SQL database that can enforce some basic constraints and relationships to make illegal states unrepresentable.

What we usually see as the data models and the actions that update them become more complicated is the introduction of some business logic layer. The rest of the system isn’t allowed free access to update the state any more; it’s required to go through some defined interface that provides specific actions that guarantee the state remains valid.

That’s the writing side. On the reading side, aside from security/privacy issues, we generally don’t have the same concerns with allowing free access to the whole database from anywhere. However, often we need some form of derived data that isn’t directly stored in the database itself but instead can be constructed from other state that is. So again we end up with some kind of abstraction/isolation layer between the rest of our system and the database.

In each of these cases, there is probably data that we’re working with that is not the state that ultimately persists within the database. So the question immediately arises, if we only have global data in our programs, where does all of this transient, intermediary data go? If we put it into our database as well then all the usual problems with concurrency and integrity immediately appear, so we are back to needing something that is local to our immediate logic and can’t conflict with any other logic or indeed any other instance of the same logic that happens to be running in some other context at the same time.

We see analogous issues in the front end UI code for those distributed systems. If there is a relatively simple model then maybe the front end can effectively just fetch/cache the state from the back end API. As things get more complicated, maybe you end up with a front end data store analogous to the back end database that becomes the central, authoritative store of your front end state. And again maybe this provides some defined interface for accepting valid updates to the state and/or for accessing derived data. And again the questions arise about where the intermediate data generated by all of that logic should go if we have only our global store to hold state, and the answer is likely to be some form of more local data.

On top of the persistent state and anything acting upon or derived from it, we also have other kinds of information we work with in front end code. Many UIs will have state that is used purely to control the user’s view into the inner world: the sorting criteria and current page of a table, the current position and zoom level over a map, the last item we’re currently showing in an infinitely scrolling list, look and feel settings like whether we’re using a dark mode theme. Some of this data might apply across the whole UI while other aspects might only apply to, say, a specific table, with each table needing its own instance of that “UI state” data. So again, if everything were global, that would mean we’d need to include every possible piece of UI state for the whole application in the global store.

This comment is already far too long so I’ll just quickly note that there are other recurring themes. One is how to synchronise “global” stores in a distributed system where you might have multiple front ends running with their own copies of the state, or perhaps multiple microservices on a back end that have duplication in their databases because everything is supposed to be independent and denormalised. A related issue is how to represent temporary clones of significant parts of the data during user interactions, like building up a transaction with several changes before atomically committing it or rolling it back (think dialog box on the UI side or an internally consistent batch of changes sent to the back end), or supporting an undo facility that needs to reconstruct a previous version of the persistent state one way or another.

I do believe there’s a lot more we have to learn about different types of state and transient data and how we can model those cleanly in our systems. There are certainly common patterns we touch on in a lot of different contexts. And I think both extremes of having too much data trapped too locally and having too much lifted to global storage have their own difficulties and probably there is some sort of structure in between that would be better than what we typically write today. But it’s not an easy problem or we’d all be solving it by now…

valty · 2 years ago
The comment was mainly related to in-memory variables within an application process... focusing on scoping/syntax... but the thinking was definitely inspired by the fact that most apps center around an external database without realizing it's essentially a global variable.

In application code, when I talk of global vars, I mean that every function has access to all data...as opposed to access being abstracted and modularized into various services which are exposed via being passed through a chain of function args, or some kind of dependency injection system.

But this global variable could actually be an abstraction (a store) allowing data integrity checks on writes.

> there is probably data that we’re working with that is not the state that ultimately persists within the database

If you think about all your data in one big graph, this transient data still has a relation to the final persisted state. There is a data flow of transient values into the persisted data values. And separately, your intermediate data structures might also contain relations to your persisted data structures.

Most dev tools don't track these relationships, and you have this tangled ad-hoc mess where data is dumped from one structure to the next.

> “UI state” data...we’d need to include every possible piece of UI state for the whole application in the global store.

Yep, this is what you should do.

If I am sorting something on a webapp and I refresh the browser, I probably want to see the same things sorted the same way. This might vary between use cases, but adding the functionality should be easy to do if necessary. So it is good practice to allow all local UI state to be persisted by default.

The UI state is being persisted anyway, inside a component or inside the HTML document. Somewhere in the heap this data is stored. And if we think about our one big graph again, this data is related to other things... it's just that we lose these relations.

> A related issue is how to represent temporary clones of significant parts of the data

All rendered UI values should be in their own _ui_ models (view model is similar), separate from the source of truth models.

These ui models basically allow all rendered UI to be editable without immediately committing changes to the database. This allows for optimistic UI updates. They get notified of any incoming changes from the source of truth, and can decide what to do with them.

If you want to batch them up, you just create a Batch entity, and add a relation to these ui models. The main thing is to treat the ui models like any other models. Whether they are persisted or not should simply be flipping a flag in your code.

For UI, everything should be in one big graph. Code is data.

I find with modern programming, all of the popular programming languages, frameworks, libraries, databases, platforms, really get in the way of being able to do things simply.

lmm · 2 years ago
To my mind the conclusion is backwards. A file with a high compressed size might be doing something useful; a file with a low compressed size but a high uncompressed size is a file that's full of repetitive junk, and those are the files that should be a target for refactoring.
TeMPOraL · 2 years ago
Exactly. As ways to differentiate between "essential complexity" and "accidental complexity" go, the idea of looking at what compresses well sounds quite good - but it's the accidental complexity that will compress the best, and the essential the worst. And the latter is not the problem; the former is.

Deleted Comment

userbinator · 2 years ago
In other words, the compression ratio is what's important.
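A quick sketch of why the ratio, rather than raw compressed size, flags repetitive code (the snippets are invented for illustration):

```python
import gzip

def ratio(text: str) -> float:
    """Uncompressed size over compressed size: higher means more redundancy."""
    raw = text.encode()
    return len(raw) / len(gzip.compress(raw, 9))

# Repetitive "accidental complexity" compresses far better...
repetitive = "if x == 1: y = 1\n" * 200
# ...than code that genuinely varies at every line.
varied = "".join(f"if x == {i}: y = {i * i}\n" for i in range(200))
```

The repetitive file ends up with a much higher ratio, marking it as the refactoring target even though both files have similar raw size.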
nyrikki · 2 years ago
That is basically the entire concept of complexity theory.

The reals are compressed into the computable reals; the real numbers are uncomputable 'almost everywhere'.

Semi-decidability is just recursive enumerability; it gets you to finite time with unlimited resources.

NP-hard is brute forcible in exponential time, with most likely no approximate polynomial reductions.

P has exact polynomial time reductions...

Most code is made of IF and FOR loops, because that produces primitive recursive functions, which are most of the intuitive computable functions that always halt.

The problem comes with complex systems, where you need to balance coupling with cohesion along with free variables, WHILE and GOTO.

Note that the above compression was lossy.

If you consider that information loss as setting constraints through heuristics, (educated guesses) those constraints may or may not work for a particular use case.

The problem with how we often try to set coding style is that we want simple universal rules.

Unless you have a system that fits those ideals all the time, that is problematic.

Gödel demonstrated that isn't possible for complex systems. Either our rules will be inconsistent or incomplete.

This is why I think selling conventions as ideals, that need to yield to less preferred cohesion models when appropriate is the real solution.

Unfortunately that requires a lot more thought and vigilance.

Functional cohesion with loose coupling is what we shoot for when it is appropriate, but not as a hard-and-fast rule.

aappleby · 2 years ago
The article is somewhat silly, but there's a kernel of good advice here -

To estimate the "complexity" of a codebase:

1. Remove all comments

2. Replace all spans of whitespace with a single space

3. Concatenate all source together into a single file

4. Compress the resulting text file using gzip -9 (or your favorite compression engine)

The size of the resulting file is a good proxy for overall complexity. It's not heavily affected by naming conventions, and a refactoring that reduces the number is probably good for overall complexity.

It's not a perfect metric as it doesn't include any notion of cyclomatic complexity, but it's a good start and useful to track over time.
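The four steps above can be sketched in a few lines of Python. This is a rough sketch: the function name `complexity_proxy` is my own, and the comment-stripping regex only handles `#`-style line comments — a real version would use a language-aware parser.

```python
import gzip
import re
from pathlib import Path

def complexity_proxy(root: str, exts: tuple = (".py",)) -> int:
    """Proxy for codebase complexity: size of the gzip'd concatenation of
    all source files, with comments stripped and whitespace collapsed."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            text = path.read_text(errors="ignore")
            text = re.sub(r"#.*", "", text)   # 1. remove (line) comments
            text = re.sub(r"\s+", " ", text)  # 2. collapse whitespace spans
            parts.append(text)
    blob = " ".join(parts)                    # 3. concatenate into one "file"
    return len(gzip.compress(blob.encode(), compresslevel=9))  # 4. gzip -9
```

Tracking this number across commits is cheap enough to run in CI, which is where a proxy like this earns its keep.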

sltkr · 2 years ago
I think the value of metrics like this is limited, since code base size only very roughly corresponds to implementation complexity.

Here are some examples where you would increase the compressed code size while not making the project more complex:

1. Adding unit tests to code that was previously untested. Unit tests add little complexity because they don't introduce new interfaces.

2. Splitting a God class up into multiple independent classes. Usually this improves readability thanks to separation of concerns, but it often increases raw code size because each new class adds some boilerplate.

etc.

kqr · 2 years ago
> I think the value of metrics like this is limited, since code base size only very roughly corresponds to implementation complexity.

This sounds a lot like the "your model is wrong because nuance X" argument. I want to remind you that all models are wrong, but some of them are useful anyway. In particular, I have found the size of source code to be a highly useful predictor of complexity. It has helped me predict where bugs are, where changes are made, where developers point out areas of large technical debt, and many other variables associated with complexity.

The test of a model is not whether it accounts for all theoretical nuances, but rather whether it's empirically useful – and critically, has higher return-on-investment than alternative models. What model do you suggest for implementation complexity that you have verified to be better than code size? Genuinely interested!

(Additionally, I have also successfully used the compressed size of input data to predict the resource requirements of processing that data, without actually having to process it first. This is useful because the compressed size can be approximated on-line rather cheaply.)
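The on-line approximation in that parenthetical can be done with a streaming compressor, so you never buffer the whole input. A minimal sketch (the class name is my own invention) using zlib's incremental API:

```python
import zlib

class StreamingCompressedSize:
    """Track the approximate compressed size of a data stream on-line,
    without storing or fully processing the data first."""
    def __init__(self, level: int = 6):
        self._comp = zlib.compressobj(level)
        self.size = 0

    def feed(self, chunk: bytes) -> int:
        # Compressed bytes are emitted incrementally as buffers fill.
        self.size += len(self._comp.compress(chunk))
        return self.size

    def finish(self) -> int:
        # Flush remaining buffered data to get the final total.
        self.size += len(self._comp.flush())
        return self.size
```

The running `size` gives a cheap lower-bound estimate mid-stream; `finish()` gives the exact compressed size.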

TeMPOraL · 2 years ago
> Usually this improves readability thanks to separation of concerns, but it often increases raw code size because each new class adds some boiler plate.

That is why compression is mentioned. Boilerplate is something that disappears under good enough compression. It's literally why we call it boilerplate and generally dislike it - because once we spot the pattern, we can mentally compress it away, and then are annoyed that we have to do that mental compression whenever reading or modifying that code. Feels like pointless work, which it is.

gkbrk · 2 years ago
Why would you include unit tests in the code size or complexity calculations?
CapsAdmin · 2 years ago
Sometimes I've scanned code bases of my own for all user definable variable names and just levenshtein distanced them. It's kind of useful, but the hurdle for me at least is that I need to run something in a terminal to get the results. Maybe I'd use it more if it was a plugin in my ide of choice.

Something else you could maybe do is to simplify the code and compare sequences of statements and expressions to each other.

I.e., the two statement sequences "foo = bar; foo += 20" and "zoo = war; zoo += 20" are identical.
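That equivalence can be checked by alpha-renaming identifiers before comparing. A toy sketch (regex-based, so it would also rename language keywords — real use needs a language-aware lexer):

```python
import re

def normalize_identifiers(stmts: str) -> str:
    """Rename identifiers to v0, v1, ... in order of first appearance,
    so structurally identical statement sequences compare equal."""
    mapping = {}
    def rename(m):
        name = m.group(0)
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]
    # Matches identifier-shaped tokens; numeric literals are untouched.
    return re.sub(r"\b[a-zA-Z_]\w*\b", rename, stmts)
```

Both example statement sequences normalize to `v0 = v1; v0 += 20`, so a plain string comparison (or Levenshtein distance of zero) flags them as duplicates.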

smburdick · 2 years ago
This is what a minifier does, and those go even further to rename variables.

Another thing that should be pruned away entirely is data files, including all constant strings within the code, since humans should avoid those when focusing on algorithms.

At that point you pretty much have a highly compressed version of what you'd find in CLRS or any other algorithmic text.

crq-yml · 2 years ago
Many of the issues that come up with applying information theory to practical code are of the form "oh, but it won't be fast if we do it that way". The end of the article links Alan Kay discussing STEPS and how it solves many computing fundamentals needed for a desktop in minuscule amounts of code. One of the comments on that video, made five years ago, dismisses it as unrealistic ivory-tower nonsense that can't run fast enough. (Notwithstanding that the presentation was given on a running system demonstrating the proof of concept.)

But there is a similar sentiment to Kay's from the bottom-up viewpoint. The Forth community, who have made livings on implementing this kind of succinct design in commercial settings, tend to point to hardware manufacturers themselves as the primary difficulty. Their business is to sell you more hardware than you need, and that leads them towards doing nothing to help with the software crisis, but rather, to encourage processing and I/O to be complex things to reason about, with complex protocols and mystery-meat drivers. If you have to use USB, Bluetooth, TCP/IP...you're stuck. Nobody wants to deal with those hot potatoes. You can't address it properly by running up the abstraction stack and doing "everything in the browser". That's playing nicely with the standards instead of attacking them. When software companies play along and say "well, it's the standard so we have to use it," their problem gets deeper.

Some room could be conceded to say that some of that complexity is essential, but one of the ways in which we describe progress in science and technology is to find solutions that are lighter and simpler to understand, e.g. instead of astronomical tables describing "Earth at the center of the universe" epicycles, smaller equations describing orbits around the Sun.

nighthawk454 · 2 years ago
A bit of a roundabout way to say 'DRY'. Information isn't a universal, context-free quantity; it depends on the models involved. In this case the target is removing repeated words/code symbols and being concise.
userbinator · 2 years ago
I've also thought of the idea of using some LZ-based compression on source code files to determine which ones have the most redundancy (the ones that have the best ratios) and could be simplified by refactoring, which is not too different from the entropy-based approach described here. It's worth noting that this also identifies languages that trend towards boilerplate and egregious verbosity --- for example, I've noticed that the average C# or Java codebase will compress much better than C, while (much) denser stuff like APL-family languages don't compress as much.
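A minimal version of that per-file redundancy scan (the function name is mine) just ranks files by compression ratio:

```python
import gzip
from pathlib import Path

def redundancy_ranking(root: str, ext: str = ".py"):
    """Rank source files by gzip compression ratio; the best-compressing
    (most redundant) files are candidates for refactoring."""
    rows = []
    for p in Path(root).rglob(f"*{ext}"):
        raw = p.read_bytes()
        if not raw:
            continue
        ratio = len(gzip.compress(raw, 9)) / len(raw)
        rows.append((ratio, str(p)))
    return sorted(rows)  # lowest ratio = most redundant, listed first
```

As the comment notes, the absolute ratios also reflect the language's baseline verbosity, so this is most useful for comparing files within one codebase rather than across languages.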
Tade0 · 2 years ago
I can half-agree with this, but I would measure it differently.

It's not the words themselves that are surprising, but the range of language features used.

I have a friend who produces particularly readable (TypeScript) code - at least in my opinion. For a long time I couldn't figure out what was so special about it, then it dawned on me that he simply doesn't use the more fancy and recent features of the language - by recent I mean stuff that came out in the last four years or so.

I wouldn't be so radical in my approach, but I believe feature usage should obey a power law, with more advanced parts occurring less frequently. I'm sure there's a balance to be struck between terseness and using as minimal a subset of the language as possible.

Also it doesn't hurt to have whitespace here and there. We use paragraphs in written speech for a reason.