shoo · 2 years ago

  > group related concepts together
  > The hardest part of this process is deciding what “related concepts” mean.
The article talks about "readability", but arguably the unnamed hard problem it is dancing around is how to structure an application or system by decomposing it into modules.

I'd argue the baseline reasonable approach to structuring applications or systems is the one given in Parnas' 1972 paper "On the Criteria To Be Used in Decomposing Systems into Modules":

  > We propose instead that one begins with a list of difficult design decisions or design decisions which are likely to change. Each module is then designed to hide such a decision from the others.
http://sunnyday.mit.edu/16.355/parnas-criteria.html

Parnas' criterion embeds the understanding that code and systems are not static but need to evolve over time as requirements change or decisions are made, and that different decompositions can be better or worse at accommodating that change.

"Don't repeat yourself" refactoring rules of thumb can give poor results if blindly applied. Suppose two sections of application logic just so happen to look similar at this moment in time and get refactored to "remove the duplication" coupling them together, when the two sections of code are subject to different constraints and reasons for change, and will need to evolve separately.
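A minimal Python sketch of the trap (the domains and limits here are invented for illustration): the two checks are textually identical today, but merging them into one helper would couple accounting rules to fraud policy.

```python
def validate_invoice_total(total_cents: int) -> bool:
    # Constraint comes from accounting rules.
    return 0 < total_cents <= 10_000_00

def validate_refund_amount(amount_cents: int) -> bool:
    # Constraint comes from fraud policy; it only *coincidentally*
    # matches the invoice rule right now. DRYing these into one
    # "validate_amount" couples two things with different reasons to change.
    return 0 < amount_cents <= 10_000_00
```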

TeMPOraL · 2 years ago
People forget that readability isn't a function of a specific program - there is no one optimal readability. On the contrary, it's a function of the program and the goals of the reader. So after fixing and DRYing all the generally bad/inefficient decisions, what counts as readable code becomes solely a question of why you're reading it - trying to debug or add an entirely new feature will have opposite readability criteria to extending some high-level feature.

Even in the best case, readability just becomes a Pareto frontier[0], given by the expressive limits of the dominant programming paradigm - the same single plaintext source code for all. There's only so much complexity, so many cross-cutting concerns, we can cram into the same piece of plaintext until something gives - until the same code is beautiful to you one week and incomprehensible the next, when the only thing that changed is the type of work you're doing on it.

So, beyond evolving over time, I'd also consider the orthogonal aspect of different decompositions being good for different purposes, and that you can't have it all and work on the same, single, high-level plaintext code.

EDIT: And I believe the solution to this, the step forward beyond the Pareto frontier, is what 'valty described here: https://news.ycombinator.com/item?id=39426895 - not coding directly in the same plaintext, but treating the single-source-of-truth code as a database, which you query and update through views/lenses that best fit whatever work you're doing at the moment.

--

[0] - https://en.wikipedia.org/wiki/Pareto_front

UweSchmidt · 2 years ago
If it's not too much trouble, could you create a minimal demonstration of a simple piece of code, structured for various goals - easy to extend, easy to debug, etc.? I can't defend my code from the best-practice people with a Pareto front Wikipedia article.
lupusreal · 2 years ago
In my experience, most programmers go way too hard with anticipating future changes and end up creating systems with entirely too much abstraction. Most of the time those changes never occur and the result is a codebase which has been obfuscated with excessive abstraction that bogs down anybody trying to maintain it. Future changes end up needing different abstractions than the ones which were preemptively created, and as the programmer is adding new abstractions to cover their present need they also create new premature abstractions, thinking they're saving themselves future trouble. The cycle then continues.

Better to KISS and leave abstraction for the future when it actually becomes necessary. If you start out with code that is only as complex as it needs to be in that moment, then it will generally be much easier to change it in the future.

jpc0 · 2 years ago
> Better to KISS and leave abstraction for the future when it actually becomes necessary.

I love this statement, and for me personally it's also perfectly encompassed in YAGNI: you ain't gonna need it.

Until it's proven you need that piece of code, don't write it.

Izkata · 2 years ago
> "Don't repeat yourself" refactoring rules of thumb can give poor results if blindly applied. Suppose two sections of application logic just so happen to look similar at this moment in time and get refactored to "remove the duplication" coupling them together, when the two sections of code are subject to different constraints and reasons for change, and will need to evolve separately.

I'm fairly sure I remember reading somewhere that that piece of advice was originally meant for data/values/configuration, not code, and that applying it to code is itself a mistake that keeps getting repeated.

aranchelk · 2 years ago
Regardless of who said it and what they meant, I don’t want more code to write tests for, more pages to read through when stuff breaks, more material for new engineers to learn. You can always start copy-pasting and make a mess later - less true in the reverse.

Like almost anything it can be taken too far or misapplied.

In the quoted example, when I have multiple occurrences of related business logic, I build a vocabulary of reusable sub elements - find the joints and carve, don’t build a giant mutant.

surprisetalk · 2 years ago
Author here :)

Wasn't familiar with Parnas' criteria, thanks for sharing.

I do something similar in a different way, which I call "IKEA-oriented development". IME, semi-disposable code is very easy to change over time as mental models and product goals evolve:

[1] https://taylor.town/ikea-oriented-development

shoo · 2 years ago
Thank you for the post, and the link to this second one as well.

Re: "IKEA-oriented development", you make a very good point about the cost of change. I think the semi-disposable code idea overlaps comments from folks elsewhere in this discussion thread talking about the horror of codebases that introduced premature abstractions to cope with expected future changes that then never actually appeared ("YAGNI" is indeed a good rule of thumb).

Your point about "make experimentation effortless" is a good one. The highest productivity environment I worked with that supported rapid experimentation was a small business' monorepo codebase with good test coverage and rapid feedback from CI, where the library code was only used internally by the company's software products (i.e. all the abstractions were implementation details, not part of any external interface). Over time we'd learn that some of our early ideas for abstractions in the internal libraries were flawed, but because these abstractions were internal, and we had confidence in the automated test coverage, it was possible to make quite large scale improvements to abstractions rapidly with confidence as we learned more.

The kind of environment that really bogs down experimentation and impedes change and improvements to abstractions is where an initial idea for an abstraction is resourced with its own development team and turned into a production service, and then another half a dozen internal company services start depending on it. Then it's very easy to end up in situations where everyone becomes aware that the abstraction is flawed, but improving it is less "one developer goes dark for a week or two and emerges with a 50-patch PR that atomically replaces the flawed v1 abstraction with v2 while passing all test suites in all projects that depend upon it" and more "project managers, product owners and enterprise architects compare roadmaps for the next few quarters to figure out how many years it might be until a prototype of the v2 abstraction can be ready for manual testing in the integrated test environment".

Maybe in the worst case there's some initial decomposition of the system that is flawed, then an org chart is spun up defining teams that own components matching the flawed system decomposition, so refactoring to improve the decomposition would also require refactoring the org chart to change people's teams. Then instead of having colleagues indifferent to or supporting a purely technical refactor, people will resist it to avoid change to their roles!

calvinv · 2 years ago
I haven't worked on a project where we've known all our problems up front, and most of the time the complexity is added to cater for "flexibility" that rarely ends up being a useful implementation for what we actually needed. It's great to hide this from other areas, but you will still need to work on it, and it will impact how the software is architected.
BenoitEssiambre · 2 years ago
That's interesting. Knowing when to decompose systems into modules indeed seems to be key. This is a complex problem because, I think, the choice of the optimal model depends on the uncertainty you have about the reality behind the data, about what you know and don't know about the domain you are modeling.

But there might be optimal solutions rooted in information theory and Bayesian probabilities that you can strive to approach while programming. This is about avoiding over-fitting or under-fitting your domain knowledge.

Theoretically speaking, finding the right Bayesian fit optimizes for future evolution of the code and how it generalizes into the unknown, how correct your software will be when faced with things you haven't specifically designed for. More here: https://benoitessiambre.com/abstract.html

If I were to add something to abstract.html blog post, it would be something about Dependency Length Minimization ( https://www.pnas.org/doi/full/10.1073/pnas.1502134112#:~:tex... ) which has important information theoretic ramifications (for example, files with shortened dependencies tend to compress better and LLMs became much better when they solved for managing dependencies with their "attention" mechanism). When an abstraction breaks out a piece of code to enable reuse, the reduction in redundancy should be weighted against the stretching of dependencies to decide whether the abstraction is warranted.

The original article acknowledges this by mentioning "locality".

Other things to take into account is how tests fit into all this. Again more here: https://benoitessiambre.com/abstract.html

shoo · 2 years ago
Your linked blog post "Abstraction and the Bayesian Occam's Razor" is very interesting. I'll play back my understanding to you, to see if I'm approximately following and summarising your thesis.

Context:

When programming we attempt to design an effective abstraction that models some domain. When designing this abstraction, there are trade-offs between reducing the amount of code required, enabling reuse, reducing coupling, flexibility to accommodate future use cases.

Key Problem:

How do we design an abstraction for our domain model?

Claim 1:

Apply the "Minimum Description Length" (MDL) model selection principle: prefer a domain model embedded in the shortest program able to recreate the dataset of domain knowledge.

Applying MDL model selection will result in an abstraction for the domain model that is both smaller -- giving less code to maintain -- and more likely to generalize to future unknown use cases.

Complication:

Applying the MDL model selection principle relies on having access to a dataset of domain knowledge. We can think of this dataset of domain knowledge as a list of (situation, expected behaviour) pairs -- c.f. a labelled supervised learning dataset, or a gigantic list of requirements. Unfortunately, in typical software projects, no such explicit dataset cataloguing the requirements or expected behaviour in each situation exists.

Claim 2:

We can use the automated test suite as a proxy for the dataset of domain knowledge. When designing our abstraction we should prefer a domain model where the combined size of the logic for the domain model and the size of the corresponding test suite* is minimal.

* with the important caveat that "just cutting out the tests, or removing other safeties like strict types doesn't give you a lower MDL, in that case, you're missing the descriptions of important parts of your data or knowledge".

valty · 2 years ago
> In my experience, the key to maintaining readability is developing a healthy respect for locality

I think this pursuit of "locality" is what actually causes more complexity. And I think it's mainly down to our obsession with representing our code as text files in folder hierarchies.

> coarsely structure codebases around CPU timelines and dataflow

This is why I would prefer code to be in a database, instead of files and folders, so that structure doesn't matter, and the tree view UI can be organized based on runtime code paths, and data flow - via value tracing.

> don’t pollute your namespace – use blocks to restrict variables/functions to the smallest possible scope

Everyone likes to be all modular and develop in tiny little pieces that they assemble together. Relying on modularization means that when stuff changes upstream in the call stack, we just hack around these changes adding some conditionals to handle these changes instead of resorting to larger refactors. People like this because things can keep moving instead of everything breaking.

Instead, what we need to do is make it easier to trace all the data dependencies in our programs so that when we make a change to anything, we can instantly see what depends on it and needs updating.

I have actually started to think that, against conventional wisdom, everything should be a global variable. All the issues with global variables can be solved with better tracing tooling.

Instead we end up with all these little mini-databases spread all over our code, when what we should have is one central one from which we can clearly see all the data dependencies.
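A toy sketch of what that central store plus tracing tooling could look like (all names are invented; an `inspect`-based caller lookup is just one crude way to record dependencies):

```python
import inspect
from collections import defaultdict

class TracedStore:
    """One global store that records which function touched which key."""

    def __init__(self):
        self._data = {}
        self.readers = defaultdict(set)  # key -> functions that read it
        self.writers = defaultdict(set)  # key -> functions that wrote it

    def _caller(self) -> str:
        # Two frames up: past _caller and past get/set.
        return inspect.stack()[2].function

    def get(self, key):
        self.readers[key].add(self._caller())
        return self._data[key]

    def set(self, key, value):
        self.writers[key].add(self._caller())
        self._data[key] = value

STORE = TracedStore()

def update_price():
    STORE.set("price", 42)

def render_price():
    return STORE.get("price")
```

After running both functions, `STORE.writers["price"]` and `STORE.readers["price"]` answer "who depends on this global?" instantly, which is the mitigation being argued for.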

> group related concepts together

Instead, we should query a database of code as needed...just like we do with our normalized data.

verinus · 2 years ago
I was thinking about code along the same lines: we are modeling, not writing text. This just happens to be the best way to express our models in a way a computer can be made to understand it, be formal enough and still be understandable by others.

What current languages are bad about is expressing architecture, and the problem of having one way to structure our models (domain models) vs. the actions/transformations that run on them (flow of execution).

I strongly disagree on the global variable side though...

valty · 2 years ago
> I strongly disagree on the global variable side though...

My thinking is that software has been terrible (over-complex) for such a long time that it's time to start questioning our most dogmatic principles, such as "global variables are bad".

Imagine you can instantly see all the dependencies to/from every global variable whenever you select it. This mitigates most of the traditional complaints.

I would argue that adequate tooling that allows for this would dramatically simplify all development. It's the only thing that matters, and it's so absent from every development platform/language/workflow.

If we could only see what was going on in our programs, we would see the complexity, and we could avoid it.

Another related bit of dogma is _static scoping_. Why does a function have to explicitly state all its arguments? Why aren't we allowed to access variables from anywhere higher up in a call stack?

What you realize is that all of these rules exist so you can look at plain text code and (kind of) see what is going on. This is a holdover from low-powered computers without GUIs, like most of programming. Even if an argument is explicit, if it's passed down via 10 layers, you still have to go look.

Sakos · 2 years ago
I think the main problem is that we think of code as text. So the only way to determine if code is related is by parsing all of the text again. I'm not sure if a database representation is really the correct path to take, but I think we need some other way to represent code and give parts of code meaning.
surprisetalk · 2 years ago
Author here!

You may be interested in the programming language I've been working on :)

[1] https://scrapscript.org

dack · 2 years ago
reminds me of unison in some ways. did that provide some inspiration?
hnben · 2 years ago
Your ideas sound intriguing. Are they original, or can I read up on them somewhere?
hcs · 2 years ago
I've heard the name Intentional Programming applied to this or a similar concept https://en.wikipedia.org/wiki/Intentional_programming

> Tight integration of the environment with the storage format brings some of the nicer features of database normalization to source code. Redundancy is eliminated by giving each definition a unique identity, and storing the name of variables and operators in exactly one place.

TeMPOraL · 2 years ago
They're both old and completely ignored. People occasionally reinvent them when they e.g. store code in DBs, or add scripting languages to their programs, or build a new programming language because Hello World in Java is too verbose.

Unison plays with these ideas (I tried it, it's taking things in the right direction, though I still can't figure out how to write anything more complex than sorting numbers in the REPL with it; the examples are too Haskelly, IMHO.) Smalltalk language is, I believe, the original - built around the assumption that code is in the database, and coming with a built-in IDE for this. Glamorous Toolkit is trying to push this further, to give programmers better ability to create ad-hoc problem-specific views into their programs.

I've seen a few other articles written about this over the years, but I don't have any link handy.

valty · 2 years ago
I've worked them up from questioning the things about programming that seem most rigid and dogmatic over many years. But there is a lot of literature I have found along the way.

Intentional Programming is an interesting read as someone has already mentioned...from the guy who brought us `strHungarianNotation`. Storing code in a database but retaining the joy of the plain text cut/copy/paste experience is the key challenge, as well as all the unix file goodness.

It's quite fun to talk to ChatGPT about these topics and just question everything and delve back into the history of programming.

elbear · 2 years ago
As far as I'm aware, the Unison Language implements some of his ideas: https://www.unison-lang.org
camgunz · 2 years ago
To me they felt very similar to Joe Armstrong's "Why do we need modules at all?" [0]:

---

Why do we need modules at all?

This is a brain-dump-stream-of-consciousness-thing. I've been thinking about this for a while.

I'm proposing a slightly different way of programming here The basic idea is

- do away with modules

- all functions have unique distinct names

- all functions have (lots of) meta data

- all functions go into a global (searchable) Key-value database

- we need letrec

- contribution to open source can be as simple as contributing a single function

- there are no "open source projects" - only "the open source Key-Value database of all functions"

- Content is peer reviewed

... ---

Whole thread's worth a read.

[0]: https://erlang.org/pipermail/erlang-questions/2011-May/05876...
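Armstrong's proposal can be caricatured in a few lines - a global, searchable key-value store of functions with metadata (a toy sketch, not his actual design):

```python
# Global key-value database of all functions, instead of modules.
FUNCTIONS = {}

def register(name, *, tags, doc):
    """Give a function a unique name plus metadata and put it in the store."""
    def wrap(fn):
        FUNCTIONS[name] = {"fn": fn, "tags": set(tags), "doc": doc}
        return fn
    return wrap

def search(tag):
    """Discovery happens by querying metadata, not browsing modules."""
    return sorted(name for name, meta in FUNCTIONS.items()
                  if tag in meta["tags"])

@register("math.fib.naive", tags=["math", "recursion"],
          doc="Naive recursive Fibonacci")
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```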

adius · 2 years ago
I am working on storing code in SQLite: https://mailchi.mp/62e9b4a81f16/cosuz
jihiggins · 2 years ago
this is sort of what modern ides (e.g. jetbrains stuff) already do in the bg. when im working on stuff, i almost never navigate via text or the file explorer, i use things like "goto usages or definition" and navigate via what is essentially data tracing. this only works well with statically typed languages ime, though.

the indexing step is basically building this db in the background, it's just kept out of view / hidden unless you're building ide plugins or whatever.

valty · 2 years ago
> via what is essentially data tracing

Value tracing is at runtime. JetBrains cannot trace how values flow through your code.

To do this, you need to instrument all your code, and track all the transformations that occur for each value. It's really difficult to do if the language is not designed for it and there are a lot of performance implications.

If your code is written in a functional paradigm it becomes much easier to trace...such as with rxjs.
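A toy sketch of what runtime value tracing looks like in a functional style (names invented; real instrumentation would need language-level support, as the parent notes):

```python
def traced_pipeline(*steps):
    """Wrap (name, fn) steps so every transformation of a value is recorded."""
    def run(value):
        history = [("input", None, value)]
        for name, fn in steps:
            new_value = fn(value)
            history.append((name, value, new_value))  # provenance record
            value = new_value
        return value, history
    return run

normalize = traced_pipeline(
    ("strip", str.strip),
    ("lower", str.lower),
)
result, trace = normalize("  Hello ")
```

`trace` now shows exactly how the value flowed through each step, which is the kind of visibility an IDE's static "go to usages" can't give you.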

jpc0 · 2 years ago
> All the issues with global variables can be solved with better tracing tooling

I would argue this problem is solved in most current languages with strict types.

Stop making all the things strings or abstract base classes.

Easy example I've worked on recently: an IPv4 address is an IPv4Address in code. I don't care if it is just represented as a uint32 or a string in memory; in your code it should be an IPv4 address, and if a function expects an IPv4 address and you pass it a string, that is a compilation error.
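A sketch of that idea in Python (a hypothetical `IPv4Address` wrapper; Python only gets the "compilation error" via a type checker such as mypy or pyright):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IPv4Address:
    value: int  # stored as a uint32 internally; callers never see that

    @classmethod
    def parse(cls, text: str) -> "IPv4Address":
        parts = [int(p) for p in text.split(".")]
        if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
            raise ValueError(f"not an IPv4 address: {text!r}")
        n = 0
        for p in parts:
            n = (n << 8) | p
        return cls(n)

    def __str__(self) -> str:
        return ".".join(str((self.value >> s) & 0xFF) for s in (24, 16, 8, 0))

def connect(addr: IPv4Address) -> str:
    # A type checker rejects connect("10.0.0.1"); only a parsed,
    # validated IPv4Address gets this far.
    return f"connecting to {addr}"
```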

loup-vaillant · 2 years ago
Here's what I have to say about locality: https://loup-vaillant.fr/articles/source-of-readability

TL;DR: locality is likely one of the most important concepts in all of knowledge work. Of course we're a little obsessed with it.

valty · 2 years ago
> Our screens offers only a small window, and even the smartest IDE can’t give us instant access to everything

This is the real problem that needs solving.

> code that is read together should be written together

Code is a database of functions. This approach is like trying to design a database in denormalized form.

silon42 · 2 years ago
How would you do the diff/apply?

Those tools are essential and they basically rely on locality (context).

valty · 2 years ago
Semantic diffs.
Chris_Newton · 2 years ago
I have actually started to think that, against conventional wisdom, everything should be a global variable. All the issues with global variables can be solved with better tracing tooling.

This is an interesting premise, and actually I think we have quite a lot of examples both of successfully applying the idea up to a point and of where it starts to break down in practice.

Modern distributed applications mostly have a back end with some kind of database and front end UIs that depend on that back end via an API. Those databases are often global data stores, accessible from anywhere in the back end implementation. A lot of work is done to design and manage them, probably modelling some real world system that our application is concerned with, and there are varying degrees of abstraction/isolation used to preserve that design intent.

If the data model is simple then this works OK, particularly if you have a SQL database that can enforce some basic constraints and relationships to make illegal states unrepresentable.

What we usually see as the data models and the actions that update them become more complicated is the introduction of some business logic layer. The rest of the system isn’t allowed free access to update the state any more; it’s required to go through some defined interface that provides specific actions that guarantee the state remains valid.

That’s the writing side. On the reading side, aside from security/privacy issues, we generally don’t have the same concerns with allowing free access to the whole database from anywhere. However, often we need some form of derived data that isn’t directly stored in the database itself but instead can be constructed from other state that is. So again we end up with some kind of abstraction/isolation layer between the rest of our system and the database.

In each of these cases, there is probably data that we’re working with that is not the state that ultimately persists within the database. So the question immediately arises, if we only have global data in our programs, where does all of this transient, intermediary data go? If we put it into our database as well then all the usual problems with concurrency and integrity immediately appear, so we are back to needing something that is local to our immediate logic and can’t conflict with any other logic or indeed any other instance of the same logic that happens to be running in some other context at the same time.

We see analogous issues in the front end UI code for those distributed systems. If there is a relatively simple model then maybe the front end can effectively just fetch/cache the state from the back end API. As things get more complicated, maybe you end up with a front end data store analogous to the back end database that becomes the central, authoritative store of your front end state. And again maybe this provides some defined interface for accepting valid updates to the state and/or for accessing derived data. And again the questions arise about where the intermediate data generated by all of that logic should go if we have only our global store to hold state, and the answer is likely to be some form of more local data.

On top of the persistent state and anything acting upon or derived from it, we also have other kinds of information we work with in front end code. Many UIs will have state that is used purely to control the user’s view into the inner world: the sorting criteria and current page of a table, the current position and zoom level over a map, the last item we’re currently showing in an infinitely scrolling list, look and feel settings like whether we’re using a dark mode theme. Some of this data might apply across the whole UI while other aspects might only apply to, say, a specific table, with each table needing its own instance of that “UI state” data. So again, if everything were global, that would mean we’d need to include every possible piece of UI state for the whole application in the global store.

This comment is already far too long so I’ll just quickly note that there are other recurring themes. One is how to synchronise “global” stores in a distributed system where you might have multiple front ends running with their own copies of the state, or perhaps multiple microservices on a back end that have duplication in their databases because everything is supposed to be independent and denormalised. A related issue is how to represent temporary clones of significant parts of the data during user interactions, like building up a transaction with several changes before atomically committing it or rolling it back (think dialog box on the UI side or an internally consistent batch of changes sent to the back end), or supporting an undo facility that needs to reconstruct a previous version of the persistent state one way or another.

I do believe there’s a lot more we have to learn about different types of state and transient data and how we can model those cleanly in our systems. There are certainly common patterns we touch on in a lot of different contexts. And I think both extremes of having too much data trapped too locally and having too much lifted to global storage have their own difficulties and probably there is some sort of structure in between that would be better than what we typically write today. But it’s not an easy problem or we’d all be solving it by now…

valty · 2 years ago
The comment was mainly related to in-memory variables within an application process... focusing on scoping/syntax... but the thinking was definitely inspired by the fact that most apps center around an external database without realizing it's essentially a global variable.

In application code, when I talk of global vars, I mean that every function has access to all data...as opposed to access being abstracted and modularized into various services which are exposed via being passed through a chain of function args, or some kind of dependency injection system.

But this global variable could actually be an abstraction (a store) allowing data integrity checks on writes.

> there is probably data that we’re working with that is not the state that ultimately persists within the database

If you think about all your data in one big graph, this transient data still has a relation to the final persisted state. There is a data flow of transient values into the persisted data values. And separately, your intermediate data structures might also contain relations to your persisted data structures.

Most dev tools don't track these relationships, and you have this tangled ad-hoc mess where data is dumped from one structure to the next.

> “UI state” data...we’d need to include every possible piece of UI state for the whole application in the global store.

Yep, this is what you should do.

If I am sorting something on a webapp and I refresh the browser, I probably want to see the same things sorted the same way. This might vary between use cases, but adding the functionality should be easy to do if necessary. So it is good practice to allow all local UI state to be persisted by default.

The UI state is being persisted anyway, inside a component or inside the HTML document. Somewhere in the heap this data is stored. And if we think about our one big graph again, this data is related to other things... it's just that we lose these relations.

> A related issue is how to represent temporary clones of significant parts of the data

All rendered UI values should be in their own _ui_ models (view model is similar), separate from the source of truth models.

These ui models basically allow all rendered UI to be editable without immediately committing changes to the database. This allows for optimistic UI updates. They get notified of any incoming changes from the source of truth, and can decide what to do with them.

If you want to batch them up, you just create a Batch entity, and add a relation to these ui models. The main thing is to treat the ui models like any other models. Whether they are persisted or not should simply be flipping a flag in your code.

For UI, everything should be in one big graph. Code is data.

I find with modern programming, all of the popular programming languages, frameworks, libraries, databases, platforms, really get in the way of being able to do things simply.

lmm · 2 years ago
To my mind the conclusion is backwards. A file with a high compressed size might be doing something useful; a file with a low compressed size but a high uncompressed size is a file that's full of repetitive junk, and those are the files that should be a target for refactoring.
TeMPOraL · 2 years ago
Exactly. As ways to differentiate between "essential complexity" and "accidental complexity" go, the idea of looking at what compresses well sounds quite good - but it's the accidental complexity that will compress the best, and the essential the worst. And the latter is not the problem; the former is.

Deleted Comment

userbinator · 2 years ago
In other words, the compression ratio is what's important.
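A quick sketch of why the ratio, rather than raw compressed size, flags repetitive code (the snippets are invented for illustration):

```python
import gzip

def ratio(text: str) -> float:
    """Uncompressed size over compressed size: higher means more redundancy."""
    raw = text.encode()
    return len(raw) / len(gzip.compress(raw, 9))

# Repetitive "accidental complexity" compresses far better...
repetitive = "if x == 1: y = 1\n" * 200
# ...than code that genuinely varies at every line.
varied = "".join(f"if x == {i}: y = {i * i}\n" for i in range(200))
```

The repetitive file ends up with a much higher ratio, marking it as the refactoring target even though both files have similar raw size.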
nyrikki · 2 years ago
That is basically the entire concept of complexity theory.

The reals are compressed into the computable reals; the real numbers are uncomputable 'almost everywhere'.

Semi-decidability is just recursive enumerability; it gets you to finite time with unlimited resources.

NP-hard is brute forcible in exponential time, with most likely no approximate polynomial reductions.

P has exact polynomial time reductions...

Most code is made of IF and FOR loops, because that produces primitive recursive functions, which are most of the intuitive computable functions that always halt.

The problem comes with complex systems, where you need to balance coupling with cohesion along with free variables, WHILE and GOTO.

Note that the above compression was lossy.

If you consider that information loss as setting constraints through heuristics, (educated guesses) those constraints may or may not work for a particular use case.

The problem with how we often try to set coding style is that we want simple universal rules.

Unless you have a system that fits those ideals all the time, that is problematic.

Gödel demonstrated that isn't possible for complex systems. Either our rules will be inconsistent or incomplete.

This is why I think selling conventions as ideals, that need to yield to less preferred cohesion models when appropriate is the real solution.

Unfortunately that requires a lot more thought and vigilance.

Functional cohesion with loose coupling is what we shoot for when it is appropriate, but not as a hard-and-fast rule.

aappleby · 2 years ago
The article is somewhat silly, but there's a kernel of good advice here -

To estimate the "complexity" of a codebase:

1. Remove all comments

2. Replace all spans of whitespace with a single space

3. Concatenate all source together into a single file

4. Compress the resulting text file using gzip -9 (or your favorite compression engine)

The size of the resulting file is a good proxy for overall complexity. It's not heavily affected by naming conventions, and a refactoring that reduces the number is probably good for overall complexity.

It's not a perfect metric as it doesn't include any notion of cyclomatic complexity, but it's a good start and useful to track over time.
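The four steps above can be sketched in a few lines of Python. This is a rough sketch: the function name `complexity_proxy` is my own, and the comment-stripping regex only handles `#`-style line comments — a real version would use a language-aware parser.

```python
import gzip
import re
from pathlib import Path

def complexity_proxy(root: str, exts: tuple = (".py",)) -> int:
    """Proxy for codebase complexity: size of the gzip'd concatenation of
    all source files, with comments stripped and whitespace collapsed."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            text = path.read_text(errors="ignore")
            text = re.sub(r"#.*", "", text)   # 1. remove (line) comments
            text = re.sub(r"\s+", " ", text)  # 2. collapse whitespace spans
            parts.append(text)
    blob = " ".join(parts)                    # 3. concatenate into one "file"
    return len(gzip.compress(blob.encode(), compresslevel=9))  # 4. gzip -9
```

Tracking this number across commits is cheap enough to run in CI, which is where a proxy like this earns its keep.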

sltkr · 2 years ago
I think the value of metrics like this is limited, since code base size only very roughly corresponds to implementation complexity.

Here are some examples where you would increase the compressed code size while not making the project more complex:

1. Adding unit tests to code that was previously untested. Unit tests add little complexity because they don't introduce new interfaces.

2. Splitting a God class up into multiple independent classes. Usually this improves readability thanks to separation of concerns, but it often increases raw code size because each new class adds some boilerplate.

etc.

kqr · 2 years ago
> I think the value of metrics like this is limited, since code base size only very roughly corresponds to implementation complexity.

This sounds a lot like the "your model is wrong because nuance X" argument. I want to remind you that all models are wrong, but some of them are useful anyway. In particular, I have found the size of source code to be a highly useful predictor of complexity. It has helped me predict where bugs are, where changes are made, where developers point out areas of large technical debt, and many other variables associated with complexity.

The test of a model is not whether it accounts for all theoretical nuances, but rather whether it's empirically useful – and critically, has higher return-on-investment than alternative models. What model do you suggest for implementation complexity that you have verified to be better than code size? Genuinely interested!

(Additionally, I have also successfully used the compressed size of input data to predict the resource requirements of processing that data, without actually having to process it first. This is useful because the compressed size can be approximated on-line rather cheaply.)
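The on-line approximation in that parenthetical can be done with a streaming compressor, so you never buffer the whole input. A minimal sketch (the class name is my own invention) using zlib's incremental API:

```python
import zlib

class StreamingCompressedSize:
    """Track the approximate compressed size of a data stream on-line,
    without storing or fully processing the data first."""
    def __init__(self, level: int = 6):
        self._comp = zlib.compressobj(level)
        self.size = 0

    def feed(self, chunk: bytes) -> int:
        # Compressed bytes are emitted incrementally as buffers fill.
        self.size += len(self._comp.compress(chunk))
        return self.size

    def finish(self) -> int:
        # Flush remaining buffered data to get the final total.
        self.size += len(self._comp.flush())
        return self.size
```

The running `size` gives a cheap lower-bound estimate mid-stream; `finish()` gives the exact compressed size.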

TeMPOraL · 2 years ago
> Usually this improves readability thanks to separation of concerns, but it often increases raw code size because each new class adds some boiler plate.

That is why compression is mentioned. Boilerplate is something that disappears under good enough compression. It's literally why we call it boilerplate and generally dislike it - because once we spot the pattern, we can mentally compress it away, and then are annoyed that we have to do that mental compression whenever reading or modifying that code. Feels like pointless work, which it is.

gkbrk · 2 years ago
Why would you include unit tests in the code size or complexity calculations?
CapsAdmin · 2 years ago
Sometimes I've scanned code bases of my own for all user definable variable names and just levenshtein distanced them. It's kind of useful, but the hurdle for me at least is that I need to run something in a terminal to get the results. Maybe I'd use it more if it was a plugin in my ide of choice.

Something else you could maybe do is to simplify the code and compare sequences of statements and expressions to each other.

I.e., the two statement sequences "foo = bar; foo += 20" and "zoo = war; zoo += 20" are identical.
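That equivalence can be checked by alpha-renaming identifiers before comparing. A toy sketch (regex-based, so it would also rename language keywords — real use needs a language-aware lexer):

```python
import re

def normalize_identifiers(stmts: str) -> str:
    """Rename identifiers to v0, v1, ... in order of first appearance,
    so structurally identical statement sequences compare equal."""
    mapping = {}
    def rename(m):
        name = m.group(0)
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]
    # Matches identifier-shaped tokens; numeric literals are untouched.
    return re.sub(r"\b[a-zA-Z_]\w*\b", rename, stmts)
```

Both example statement sequences normalize to `v0 = v1; v0 += 20`, so a plain string comparison (or Levenshtein distance of zero) flags them as duplicates.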

smburdick · 2 years ago
This is what a minifier does, and those go even further to rename variables.

Another thing that should be pruned away entirely is data files, including all constant strings within the code, since humans should avoid those when focusing on algorithms.

At that point you pretty much have a highly compressed version of what you'd find in CLRS or any other algorithmic text.

crq-yml · 2 years ago
Many of the issues that come up with applying information theory to practical code are of the form "oh, but it won't be fast if we do it that way". The end of the article links Alan Kay discussing STEPS and how it solves many computing fundamentals needed for a desktop in minuscule amounts of code. One of the comments on that video, made five years ago, dismisses it as unrealistic ivory-tower nonsense that can't run fast enough. (Notwithstanding that the presentation was given on a running system demonstrating the proof of concept.)

But there is a similar sentiment to Kay's from the bottom-up viewpoint. The Forth community, who have made livings on implementing this kind of succinct design in commercial settings, tend to point to hardware manufacturers themselves as the primary difficulty. Their business is to sell you more hardware than you need, and that leads them towards doing nothing to help with the software crisis, but rather, to encourage processing and I/O to be complex things to reason about, with complex protocols and mystery-meat drivers. If you have to use USB, Bluetooth, TCP/IP...you're stuck. Nobody wants to deal with those hot potatoes. You can't address it properly by running up the abstraction stack and doing "everything in the browser". That's playing nicely with the standards instead of attacking them. When software companies play along and say "well, it's the standard so we have to use it," their problem gets deeper.

Some room could be conceded to say that some of that complexity is essential, but one of the ways in which we describe progress in science and technology is to find solutions that are lighter and simpler to understand, e.g. instead of astronomical tables describing "Earth at the center of the universe" epicycles, smaller equations describing orbits around the Sun.

nighthawk454 · 2 years ago
A bit of a roundabout way to say 'DRY'. Information isn't a universal, context-free quantity; it depends on the models involved. In this case the target is removing repeated words/code symbols and being concise.
userbinator · 2 years ago
I've also thought of the idea of using some LZ-based compression on source code files to determine which ones have the most redundancy (the ones that have the best ratios) and could be simplified by refactoring, which is not too different from the entropy-based approach described here. It's worth noting that this also identifies languages that trend towards boilerplate and egregious verbosity --- for example, I've noticed that the average C# or Java codebase will compress much better than C, while (much) denser stuff like APL-family languages don't compress as much.
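A minimal version of that per-file redundancy scan (the function name is mine) just ranks files by compression ratio:

```python
import gzip
from pathlib import Path

def redundancy_ranking(root: str, ext: str = ".py"):
    """Rank source files by gzip compression ratio; the best-compressing
    (most redundant) files are candidates for refactoring."""
    rows = []
    for p in Path(root).rglob(f"*{ext}"):
        raw = p.read_bytes()
        if not raw:
            continue
        ratio = len(gzip.compress(raw, 9)) / len(raw)
        rows.append((ratio, str(p)))
    return sorted(rows)  # lowest ratio = most redundant, listed first
```

As the comment notes, the absolute ratios also reflect the language's baseline verbosity, so this is most useful for comparing files within one codebase rather than across languages.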
Tade0 · 2 years ago
I can half-agree with this, but I would measure it differently.

It's not the words themselves that are surprising, but the range of language features used.

I have a friend who produces particularly readable (TypeScript) code - at least in my opinion. For a long time I couldn't figure out what was so special about it, then it dawned on me that he simply doesn't use the more fancy and recent features of the language - by recent I mean stuff that came out in the last four years or so.

I wouldn't be so radical in my approach, but I believe feature usage should obey a power law, with more advanced parts occurring less frequently. I'm sure there's a balance to be struck between terseness and using as minimal a subset of the language as possible.

Also it doesn't hurt to have whitespace here and there. We use paragraphs in written speech for a reason.