Hey everyone. Mito cofounder here. Thanks to whoever posted this - was a real surprise to find it here :-)
Mito (pronounced my-toe) was born out of our personal experience with spreadsheets, and a previous (failed) spreadsheet version control product.
Spreadsheets were the original killer app for computers, and are the most popular programming language used worldwide today. That being said, spreadsheets have some growing to do! They don’t handle large datasets well, they don’t lead to repeatable or auditable processes, and generally they disrespect many of the hard-won software engineering principles that we engineers fight for.
More than that, as spreadsheet users run into these problems and turn to Python to solve them, they struggle to use pandas to accomplish what would have been two clicks in a spreadsheet. Pandas is great, but the syntax is not always so obvious (nor is learning to program in the first place!)
Mito is our first step in addressing these problems. Take any dataframe, edit it like a spreadsheet, and generate code that corresponds to those edits. You can then take this Python code and use it in other scripts, send it to your colleagues, or just rerun it.
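If I recall the docs correctly, the basic flow looks roughly like this (the CSV path is a placeholder):
```
import pandas as pd
import mitosheet  # assumes the mitosheet package is installed

df = pd.read_csv("sales.csv")  # placeholder file

# Opens the spreadsheet UI for df in the notebook; each edit you make
# appends the equivalent pandas code to the cell below the sheet
mitosheet.sheet(df)
```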
We’ve been working on Mito for over a year now. Growth has really picked up in the past few months - and we’ve begun working with larger companies to help accelerate their transition to Python.
To any companies who are somewhere in that Python transition process - please do reach out - we would love to see if we can be helpful for all your spreadsheet users!
Feel free to browse my profile for other spreadsheet related thoughts, I’m a bit of a HN junkie. Of course, any and all feedback (positive or negative) is appreciated.
My cofounders and I will be trolling about in the comments. Say hey! :-)
Heyo! Another co-founder here. Excited to see Mito on HN :) Thanks @alefnula for posting!
+1 to everything @narush said.
It's important to us that the software we build is empowering to users and not restrictive. This plays out in two primary ways:
1) Since Mito is open source and generates Python code for every edit, Mito doesn't lock users into a 'Mito ecosystem'; instead, it helps users interact with the powerful & robust Python ecosystem.
2) Because Mito is an extension to Jupyter Notebooks + JupyterLab, Mito improves your existing workflows instead of completely altering your data analytics stack.
Excited to interact with you all in the comments :)
Last time I checked, the code was under a proprietary license.
Edit: I found in another comment below that Mito is now available under a GPL license here: https://github.com/mito-ds/monorepo/blob/dev/LICENSE
Edit2: Just saw your answer now - thanks for the clarification and links!
If you are a large company trying to migrate to Python, you might also want to have a look at bamboolib.com which was acquired by Databricks.
bamboolib is very similar to mito (hard to tell who was first).
The advantage is that it runs within Databricks, which gives you the ability to scale to any amount of data easily, and Databricks has many (and growing) security certifications, e.g. HIPAA compliance.
bamboolib can be used in plain Jupyter. Also, the bamboolib private preview within Databricks is about to start in the next few days.
Full disclosure: I am a co-founder of bamboolib and employed by Databricks
Hey, a bit late to the party (HN newsletter crowd). This really seems like something my BigCorp could use. I am on holiday right now, so I won't fire up my computer to try it. But I was wondering: does it allow easy copy-pasting of the table into standard MS documents (Word? Outlook mails?)
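For what it's worth, plain pandas can already round-trip a table through the clipboard in a format Excel, Word and Outlook understand; I can't speak for Mito's UI specifically. A minimal sketch:
```
import pandas as pd

df = pd.DataFrame({"region": ["EU", "US"], "sales": [100, 250]})

# Copies tab-separated values, which Office apps paste as a table
# (requires a clipboard backend such as pyperclip/xclip)
df.to_clipboard(excel=True, index=False)

# The reverse also works: copy a range in Excel, then
# df = pd.read_clipboard()
```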
I like this. It's a "friendlier" way to browse data. That said, I have to add:
Exploring large datasets requires a COMPLETELY different mindset. When your data starts growing, it's impossible to keep it all in a visual format (for 2 reasons[0]) and you have to start thinking analytically. You have to start looking at the statistical values of your data to understand its shape. That's why the `.describe()` and `.info()` methods in Pandas are so useful. After many years doing this, I can "see" the shape of my data just by looking at the statistical information about it (mean, median, std, min, max, etc).
After some time you don't need to rely on visual tools; you can just run a few methods, look at some numbers, and understand all your data. It kinda feels like being the operator in The Matrix who watches the green numbers descend and knows what's going on behind the scenes.
[0] Your eyes are really inefficient at capturing information and there's only so much memory available: try loading a 15GB CSV in Excel.
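A rough sketch of that workflow (the file name is a placeholder):
```
import pandas as pd

df = pd.read_csv("measurements.csv")  # placeholder path

# Structure first: dtypes, non-null counts, memory footprint
df.info()

# Then the statistical shape: mean, std, min/max, quartiles
print(df.describe())

# include="all" adds counts, uniques and top values for non-numeric columns
print(df.describe(include="all"))
```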
I would caution against this approach in general (unless you're working with unusually uniform data from a deterministic source - in my world that is rarely the case). Summary statistics are useful, but taken in isolation they can mislead. One loses the ability to get a feel for interesting non-aggregated phenomena.
I find it's important to actually "touch" the raw data, even if only in a buffered, random-sampling sort of way, to get a feel for it. Sometimes with big datasets, looking through rows of data feels tedious and meaningless, but I've found that I've often picked up on things I wouldn't have without actually looking at the raw data. Raw data is often flawed, but there's often some signal in it that tells a story, hence it's important not to overlook these things through a lens of aggregate statistics.
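A minimal version of that buffered, random-sampling habit in pandas (the column and value are made up):
```
# Reproducible random slices of raw rows, not just the top of the file
print(df.sample(n=50, random_state=42))

# Or sample within a segment you care about, to see flawed records in context
print(df[df["source"] == "sensor_3"].sample(n=20, random_state=42))
```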
The next step is to visualize the data multidimensionally in something like Tableau. Tableau works on very large datasets (it has an internal columnstore format called Hyper) and can dynamically disaggregate and drill down. Insights are usually obtained by looking at details, not aggregates.
https://en.wikipedia.org/wiki/Anscombe's_quartet
If you want to use open-source Python-based visualizations instead of Tableau, the following tools allow the creation of custom plots - including the ability to export the underlying code.
- bamboolib (proprietary license - acquired by Databricks in order to run within the Databricks notebooks)
- mito (GPL license)
- dtale (MIT license)
Of course, `.head()`, `.tail()`, `.iloc` and other mechanisms to visualize subsets of the data are always important. But would you really caution AGAINST this? Like, literally telling someone NOT to use summary statistics to explore a dataset?
This is a great point and something that we're actively working on improving in Mito. If you have millions of rows of data, it's not enough to just scroll through your data; you need tools to build your understanding.
Some of the tools that you mentioned exist in Mito today. For example, Mito generates summary information about each column (all of the .describe() info along with a histogram of the data). And we're creating features for gaining a global understanding of the data too.
In practice, one of the main ways that we see people use Mito is for that initial exploration of the data. Often the first thing that users do when they import data into Mito is to correct the column dtype, delete columns that are irrelevant to their analysis, and filter out/replace missing values.
It would be super fun to implement an intelligent head() function that shows a representative sample rather than the first X rows. Do the profiling & identify a collection of rows that represent the overall distribution.
You could develop some IP around efficient and effective ways to do this. Probably would require an ensemble of unsupervised methods.
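A toy sketch of that idea, stratifying on a single numeric column; a real version would profile every column and, as suggested, likely needs an ensemble of methods (function and column names are hypothetical):
```
import pandas as pd

def representative_head(df, column, n=10, seed=0):
    # Cut the column into up to n quantile buckets, then pull one row
    # per bucket, so the preview spans the distribution instead of
    # just showing the first rows in file order
    bins = pd.qcut(df[column], q=n, duplicates="drop")
    return (
        df.groupby(bins, observed=True, group_keys=False)
          .apply(lambda g: g.sample(1, random_state=seed))
    )

# e.g. representative_head(df, "price")
```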
Good points! I also think that this is an area where Mito could do better. While we do provide pretty cool summary stats [1] and graphing capabilities [2], there isn't a great view for the summary stats of the entire dataframe. It's definitely on the roadmap -- but this comment makes me think we should move on it quickly.
Thanks for the feedback!
[1] https://docs.trymito.io/how-to/summary-statistics
[2] https://docs.trymito.io/how-to/graphing
I find the world is full of datasets with < 200 datapoints, and that is where Excel (in my experience) is great. With such datasets it often makes sense to look through the data at particular outliers.
Also, even with huge datasets I tend to always look at a random sample, and the "most extreme" datapoints -- mainly because in my experience there is a good chance some parts of the data are malformed, and need to be recollected/fixed. Of course, if you trust your data collection you don't need this!
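In pandas terms, that habit is cheap to keep (the column name is a stand-in):
```
# A random sample plus the most extreme rows, where malformed
# data tends to hide
print(df.sample(n=25, random_state=0))
print(df.nlargest(10, "reading"))
print(df.nsmallest(10, "reading"))
```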
Or visualising it in R or pandas without meaningful subsampling.
It allows you to use Altair in Python for visualising data, but does the computation in the backend using Arrow DataFusion. Not for 15GB perhaps, but cool nonetheless.
I have an Excel template for handling a relatively large amount of data - nowhere near 15GB on one sheet. I use it for preprocessing experimental data from a single experiment. There are about 10 chart tabs built in so I can visually inspect the data looking for errors (and go back and inspect the raw instrument data when something looks off).
The aggregate data is around 1.5 million experimental results. MiniTab is too unwieldy and requires too much manual reformatting of the data sheets.
Is this something I should be looking at in R or project Jupyter? Does one make better visualizations than the other?
Usually aggregated... then you can start looking at "subsets". For example, step 1 is to look at the whole dataset. Then you identify that there are a lot of rows with a type of missing value, so you look at the statistical attributes of that subset (all the rows where column X is null).
From time to time you can do a `.head()`/`.tail()` or an `.iloc[X:Y]` to check some things visually. But just as a "refresher".
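Sketching the subset step from the comment above (column names are illustrative):
```
# Stats for the whole dataset, then for the suspicious subset
print(df.describe())

null_x = df[df["x"].isnull()]      # rows where column x is missing
print(null_x.describe())           # does this subset look different?
print(null_x["y"].value_counts())  # how another column distributes within it
```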
- https://github.com/quantopian/qgrid
- https://github.com/man-group/dtale
I find that I'm actually a lot faster using basic Pandas methods to get the data I want in exactly the form I want it. If I really want to show everything, I just use:
```
with pd.option_context('display.max_rows', None):
    display(df)
```
I use a similar function when I want to see everything:
```
def showAllRows(dataframeToShow):
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        display(dataframeToShow)

# calling it while limiting the number of returned rows
showAllRows(df.head(1000))
```
Be warned though! If you call this function without limiting the number of rows to be fetched, it is guaranteed you will crash your machine. Always use head, sample or slices.
If you do get a crash, then your only option is to open the .ipynb file with vi and manually delete the millions of lines this function created.
Another function that I like is:
```
def showColumns(df, substring):
    # find all column names containing the substring
    matches = [x for x in df.columns if substring in x]
    print(matches)
    return matches

# calling it
showColumns(df, "year")
```
This is useful in dataframes with many columns, when you want to find all the columns that have a specific string in their name. It returns the list of matching column names, which you can then pass to the dataframe to show only those columns (e.g. `df[showColumns(df, "year")]`).
For those who are going through the thread finding new tools: pandas-profiling[0] is a library for automatic EDA (which bamboolib[1], mentioned elsewhere, also does).
[0]: https://github.com/pandas-profiling/pandas-profiling
[1]: https://bamboolib.com/
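Minimal pandas-profiling usage, assuming a reasonably recent version of the library:
```
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data.csv")  # placeholder

# One-shot automated EDA: distributions, correlations, missing values
profile = ProfileReport(df, title="EDA report", minimal=True)
profile.to_file("report.html")  # standalone HTML report
# or, inside a notebook: profile.to_notebook_iframe()
```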
Thanks for that feedback. Mito's approach to telemetry is that we never log any of your data or metadata about your data. We don't track things like the size, shape, or content of your data.
We do collect info about app usage, things like which buttons users click. This allows us to focus development time on improving the features that are used most often.
That being said, it's important to us that there is a way to be totally telemetry-free if users don't want any information to leave their computer. Compared to most other cloud-based SaaS data science tools, where you pretty much have no hope of total privacy, we're proud of the flexibility that we offer.
But of course, we're always open to feedback about how we can continue to improve our practices!
Edit: To remove telemetry, just call:
No licensing or payment required, and it doesn't violate the license.
To the founders of mito, regarding the mito GPL license:
What is your take on that regarding usage inside cloud provider's notebooks like AWS, GCP, Azure, Databricks?
Is it allowed or not allowed by the license? And who should/can control the usage since users can install any kind of Python library in those environments.
And, separately from the maybe ambiguous legal answer: What is your personal intention with the license?
Disclosure: I am employed by Databricks.
Hiya kite_and_code - thanks for the question + good to see you here :)
Our understanding of our license is evolving - we're first time open source devs, and as I'm sure you know it can be a tricky process. That being said: we totally support Mito users using Mito from notebooks hosted in the cloud!
Currently, we have quite a few users using Mito in notebooks hosted through AWS, GCP, etc. We’re aiming to be good stewards of open source software, and want to see Mito exist wherever it is solving users' problems!
We’ve had lots of folks in lots of environments request Mito, and are actively working on prioritizing supporting those other environments. We added classic Notebook support last month (funnily, I thought it’d take weeks to support, and it took 2 days lol) - and are looking into VS Code, Streamlit, Dash, and more!
EDIT: due to comment below, I edited this comment for clarity that we 100% support users using Mito from notebooks in the cloud!
I can totally relate that finding a suitable open-source business model is a fuzzy journey.
Nevertheless, from the user perspective I would love to hear a clearer answer - at least for, e.g., the next 6-12 months.
Currently, it seems like you are tolerating usage inside the cloud providers without taking a clear stance. I think this creates fear, uncertainty, doubt and slows down mito adoption within the cloud.
I would appreciate a clear statement in the near future around your thinking on how Mito should be made available in those environments. After all, the cloud is an environment that more and more users are migrating to, or at least using in parallel to local setups.
I can understand if you don't want to answer on the spot in case you don't have a clear stance yet. In this case, please take your time and let us know when you've made your decision.
Really love what you're doing and the innovation that you are pushing for! <3
As a potential user, this is pretty troubling. I can understand your intentions, but if the license doesn’t match your intentions (and if you don’t completely understand the license), how can we be sure our workflows will be supported/possible in the future?
Looks neat - pandas is very powerful, and this makes it more approachable for non-programmers. However, with a paid product like this, I probably wouldn't want to make the switch and then have the company go belly up, leaving users stranded. Too much risk.
Hope for the best though - pandas is pretty fantastic.
You might want to check out a tool called Vizier: https://vizierdb.info (I'm one of the devs). Direct interaction with notebook state (e.g., dataframes as spreadsheets) is one of the central ideas, and it's fully open source.
One of the creators of Mito, here. Thanks for your feedback. I wanted to share a couple of nuggets about Mito that have been helpful in talking about this with other users.
1. The core Mito product is open source. You can see our GitHub here [1]. We also have a pro version that has some additional, code-visible, but non-open-source features. The way that we think about which features belong in which version of the product is as follows: features that are needed to just get any average analysis done are open source features. On the other hand, features that are specifically useful in an organization -- connecting to company databases, formatting / styling data and graphs for a presentation, etc. -- are pro features. So if you are a team that is relying on our pro features, you're helping support the longevity & progress of Mito. If you are not one of those users and are using the open source version, then you will always have access to Mito (and can even help improve it!). Of course, the line between what features are specifically helpful in an organization and what features are needed for an average analysis is a bit blurry, and is a moving target as we continue to expand Mito's offering.
2. Mito is designed specifically to not force users to make a big 'switch'. I've commented this elsewhere in this thread, but just to recap: because Mito is an extension to Jupyter and because we generate Python code for every edit you make, Mito is designed to improve your existing workflow instead of locking you into a new system. Many Mito users use Mito as a starting point! They do as much of their analysis as they can in the Mito spreadsheet and then continue writing more customized Python code to finish up their work.
Not requiring a big switch is nice for the user, and it's nice for Mito too! Lots of large companies have been able to get up and running with Mito in 30 minutes because it fits into their data stack.
Anyways, these may not be the only reasons you might feel uneasy about adopting Mito, but I at least wanted to share why the switch to Mito might be less scary than switching to other tools.
[1] https://github.com/mito-ds/monorepo
I love how mito enables companies to use the power of open-source!
You might want to think about enabling companies to create company-specific extensions themselves, e.g. via a plugin API. You might still require them to pay for this version of Mito, but they would be able to extend it with their own engineering power instead of relying on you.
We had good experiences with this at bamboolib (I am one of the co-founders). In addition to recurring license revenue, it also increased demand for consulting from our end, because the internal company devs started working on plugins and then wanted our direct guidance on how to get the trickier things to work.
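Mito has no public plugin API that I know of; purely as a hypothetical sketch, such an interface could be as small as a registry of transforms that also emit their equivalent code, mirroring Mito's code-generation model (every name here is made up):
```
import pandas as pd

class TransformPlugin:
    # Hypothetical plugin contract: apply a dataframe transform and
    # emit the equivalent Python code, like Mito's built-in edits do
    name: str

    def apply(self, df: pd.DataFrame) -> pd.DataFrame: ...
    def generate_code(self) -> str: ...

PLUGINS: dict = {}

def register(plugin: TransformPlugin) -> None:
    PLUGINS[plugin.name] = plugin

class DropInternalColumns(TransformPlugin):
    # Example company-specific plugin: drop columns prefixed with "_"
    name = "drop_internal_columns"

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.drop(columns=[c for c in df.columns if c.startswith("_")])

    def generate_code(self) -> str:
        return "df = df.drop(columns=[c for c in df.columns if c.startswith('_')])"

register(DropInternalColumns())
```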
Heyo, Mito cofounder here, bridging that gap is one of the main ways that enterprises are using Mito today! Helping business users become data self-sufficient in a world where Excel's data size limitations make it a non-option is where Mito shines :)
Another tool like Mito is bamboolib: https://bamboolib.8080labs.com/