Hey everyone. Mito cofounder here. Thanks to whoever posted this - was a real surprise to find it here :-)
Mito (pronounced my-toe) was born out of our personal experience with spreadsheets, and a previous (failed) spreadsheet version control product.
Spreadsheets were the original killer app for computers, and are the most popular programming language used worldwide today. That being said, spreadsheets have some growing to do! They don’t handle large datasets well, they don’t lead to repeatable or auditable processes, and generally they disrespect many of the hard-won software engineering principles that we engineers fight for.
More than that, as spreadsheet users run into these problems and turn to Python to solve them, they struggle to use pandas to accomplish what would have been two clicks in a spreadsheet. Pandas is great, but the syntax is not always so obvious (nor is learning to program in the first place!)
Mito is our first step in addressing these problems. Take any dataframe, edit it like a spreadsheet, and generate code that corresponds to those edits. You can then take this Python code and use it in other scripts, send it to your colleagues, or just rerun it.
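If I recall the docs correctly, the basic flow looks roughly like this (the CSV path is a placeholder):
```
import pandas as pd
import mitosheet  # assumes the mitosheet package is installed

df = pd.read_csv("sales.csv")  # placeholder file

# Opens the spreadsheet UI for df in the notebook; each edit you make
# appends the equivalent pandas code to the cell below the sheet
mitosheet.sheet(df)
```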
We’ve been working on Mito for over a year now. Growth has really picked up in the past few months - and we’ve begun working with larger companies to help accelerate their transition to Python.
To any companies who are somewhere in that Python transition process - please do reach out - we would love to see if we can be helpful for all your spreadsheet users!
Feel free to browse my profile for other spreadsheet related thoughts, I’m a bit of a HN junkie. Of course, any and all feedback (positive or negative) is appreciated.
My cofounders and I will be trolling about in the comments. Say hey! :-)
Heyo! Another co-founder here. Excited to see Mito on HN :) Thanks @alefnula for posting!
+1 to everything @narush said.
It's important to us that the software we build is empowering to users and not restrictive. This plays out in two primary ways:
1) Since Mito is open source and generates Python code for every edit, Mito doesn't lock users into a 'Mito ecosystem'; instead, it helps users interact with the powerful & robust Python ecosystem.
2) Because Mito is an extension to Jupyter Notebooks + JupyterLab, Mito improves your existing workflows instead of completely altering your data analytics stack.
Excited to interact with you all in the comments :)
Last time I checked, the code was under a proprietary license.
Edit: I found in another comment below that Mito is now available under a GPL license here: https://github.com/mito-ds/monorepo/blob/dev/LICENSE
Edit2: Just saw your answer now - thanks for the clarification and links!
If you are a large company trying to migrate to Python, you might also want to have a look at bamboolib.com which was acquired by Databricks.
bamboolib is very similar to mito (hard to tell who was first).
The advantage is that it runs within Databricks, which gives you the ability to scale to any amount of data easily, and Databricks has many (and growing) security certifications, e.g. HIPAA compliance.
bamboolib can be used in plain Jupyter. Also, the bamboolib private preview within Databricks is about to start in the next few days.
Full disclosure: I am a co-founder of bamboolib and employed by Databricks
Hey, a bit late to the party (HN newsletter crowd). This really seems like something my BigCorp could use. I am on holiday right now, so I won't fire up my computer to try it. But I was wondering: does it allow easy copy-pasting of the table into standard MS documents (Word? Outlook mails?)
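For what it's worth, plain pandas can already round-trip a table through the clipboard in a format Excel, Word and Outlook understand; I can't speak for Mito's UI specifically. A minimal sketch:
```
import pandas as pd

df = pd.DataFrame({"region": ["EU", "US"], "sales": [100, 250]})

# Copies tab-separated values, which Office apps paste as a table
# (requires a clipboard backend such as pyperclip/xclip)
df.to_clipboard(excel=True, index=False)

# The reverse also works: copy a range in Excel, then
# df = pd.read_clipboard()
```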
I like this. It's a "friendlier" way to browse data. That said, I have to add:
Exploring large datasets requires a COMPLETELY different mindset. When your data starts growing, it's impossible to keep it all in a visual format (for 2 reasons[0]) and you have to start thinking analytically. You have to start looking at the statistical values of your data to understand its shape. That's why the `.describe()` and `.info()` methods in Pandas are so useful. After many years doing this, I can "see" the shape of my data just by looking at the statistical information about it (mean, median, std, min, max, etc).
After some time you don't need to rely on visual tools; you can just run a few methods, look at some numbers, and understand all your data. It kinda feels like being the operator in The Matrix who watches the green numbers descend and knows what's going on behind the scenes.
[0] Your eyes are really inefficient at capturing information and there's only so much memory available: try loading a 15GB CSV in Excel.
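A rough sketch of that workflow (the file name is a placeholder):
```
import pandas as pd

df = pd.read_csv("measurements.csv")  # placeholder path

# Structure first: dtypes, non-null counts, memory footprint
df.info()

# Then the statistical shape: mean, std, min/max, quartiles
print(df.describe())

# include="all" adds counts, uniques and top values for non-numeric columns
print(df.describe(include="all"))
```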
I would caution against this approach in general (unless you're working with unusually uniform data from a deterministic source - in my world that is rarely the case). Summary statistics are useful, but taken in isolation they can mislead. One loses the ability to get a feel for interesting non-aggregated phenomena.
I find it's important to actually "touch" the raw data, even if only in a buffered, random-sampling sort of way, to get a feel for it. Sometimes with big datasets, looking through rows of data feels tedious and meaningless, but I've found that I've often picked up on things I wouldn't have without actually looking at the raw data. Raw data is often flawed, but there's often some signal in it that tells a story, hence it's important not to overlook these things through a lens of aggregate statistics.
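A minimal version of that buffered, random-sampling habit in pandas (the column and value are made up):
```
# Reproducible random slices of raw rows, not just the top of the file
print(df.sample(n=50, random_state=42))

# Or sample within a segment you care about, to see flawed records in context
print(df[df["source"] == "sensor_3"].sample(n=20, random_state=42))
```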
The next step is to visualize the data multidimensionally in something like Tableau. Tableau works on very large datasets (it has an internal columnstore format called Hyper) and can dynamically disaggregate and drill down. Insights are usually obtained by looking at details, not aggregates.
https://en.wikipedia.org/wiki/Anscombe's_quartet
If you want to use open-source Python-based visualizations instead of Tableau, the following tools allow the creation of custom plots - including the ability to export the underlying code.
- bamboolib (proprietary license - acquired by Databricks in order to run within the Databricks notebooks)
- mito (GPL license)
- dtale (MIT license)
Of course, `.head()`, `.tail()`, `.iloc` and other mechanisms to visualize subsets of the data are always important. But would you really caution AGAINST this? Like, literally telling someone NOT to use summary statistics to explore a dataset?
This is a great point and something that we're actively working on improving in Mito. If you have millions of rows of data, it's not enough to just scroll through your data; you need tools to build your understanding.
Some of the tools that you mentioned exist in Mito today. For example, Mito generates summary information about each column (all of the .describe() info along with a histogram of the data). And we're creating features for gaining a global understanding of the data too.
In practice, one of the main ways that we see people use Mito is for that initial exploration of the data. Often the first thing that users do when they import data into Mito is to correct the column dtype, delete columns that are irrelevant to their analysis, and filter out/replace missing values.
It would be super fun to implement an intelligent head() function that shows a representative sample rather than the first X rows. Do the profiling & identify a collection of rows that represent the overall distribution.
You could develop some IP around efficient and effective ways to do this. Probably would require an ensemble of unsupervised methods.
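A toy sketch of that idea, stratifying on a single numeric column; a real version would profile every column and, as suggested, likely needs an ensemble of methods (function and column names are hypothetical):
```
import pandas as pd

def representative_head(df, column, n=10, seed=0):
    # Cut the column into up to n quantile buckets, then pull one row
    # per bucket, so the preview spans the distribution instead of
    # just showing the first rows in file order
    bins = pd.qcut(df[column], q=n, duplicates="drop")
    return (
        df.groupby(bins, observed=True, group_keys=False)
          .apply(lambda g: g.sample(1, random_state=seed))
    )

# e.g. representative_head(df, "price")
```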
Good points! I also think that this is an area where Mito could do better. While we do provide pretty cool summary stats [1] and graphing capabilities [2], there isn't a great view for the summary stats of the entire dataframe. It's definitely on the roadmap -- but this comment makes me think we should move on it quickly.
Thanks for the feedback!
[1] https://docs.trymito.io/how-to/summary-statistics
[2] https://docs.trymito.io/how-to/graphing
I find the world is full of datasets with < 200 datapoints, and that is where Excel (in my experience) is great. With such datasets it often makes sense to look through the data at particular outliers.
Also, even with huge datasets I tend to always look at a random sample, and the "most extreme" datapoints -- mainly because in my experience there is a good chance some parts of the data are malformed, and need to be recollected/fixed. Of course, if you trust your data collection you don't need this!
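In pandas terms, that habit is cheap to keep (the column name is a stand-in):
```
# A random sample plus the most extreme rows, where malformed
# data tends to hide
print(df.sample(n=25, random_state=0))
print(df.nlargest(10, "reading"))
print(df.nsmallest(10, "reading"))
```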
Or visualising it in R or pandas without meaningful subsampling.
It allows you to use Altair in Python for visualising data, but does the computation in the backend using Arrow DataFusion. Not for 15GB perhaps, but cool nonetheless.
I have an Excel template for handling a relatively large amount of data - nowhere near 15GB on one sheet. I use it for preprocessing experimental data from a single experiment. There are about 10 chart tabs built in so I can visually inspect the data looking for errors (and go back and inspect the raw instrument data when something looks off).
The aggregate data is around 1.5 million experimental results. MiniTab is too unwieldy and requires too much manual reformatting of the data sheets.
Is this something I should be looking at in R or project Jupyter? Does one make better visualizations than the other?
Usually aggregated... then you can start looking at "subsets". For example, step 1 is to look at the whole dataset. Then you identify that there are a lot of rows with a type of missing value, so you look at the statistical attributes of that subset (all the rows where column X is null).
From time to time you can do a `.head()`/`.tail()` or an `.iloc[X:Y]` to check some things visually. But just as a "refresher".
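Sketching the subset step from the comment above (column names are illustrative):
```
# Stats for the whole dataset, then for the suspicious subset
print(df.describe())

null_x = df[df["x"].isnull()]      # rows where column x is missing
print(null_x.describe())           # does this subset look different?
print(null_x["y"].value_counts())  # how another column distributes within it
```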
- https://github.com/quantopian/qgrid
- https://github.com/man-group/dtale
I find that I'm actually a lot faster using basic Pandas methods to get the data I want in exactly the form I want it. If I really want to show everything, I just use:
```
with pd.option_context('display.max_rows', None):
    display(df)
```
I use a similar function when I want to see everything:
```
def showAllRows(dataframeToShow):
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        display(dataframeToShow)

# calling it while limiting the number of returned rows
showAllRows(df.head(1000))
```
Be warned though! If you call this function without limiting the number of rows to be fetched, it is guaranteed you will crash your machine. Always use head, sample or slices.
If you do get a crash, then your only option is to open the .ipynb file with vi and manually delete the millions of lines this function created.
Another function that I like is:
```
def showColumns(df, substring):
    # find all column names containing the substring
    matches = [x for x in df.columns if substring in x]
    print(matches)
    return matches

# calling it
showColumns(df, "year")
```
This is useful in dataframes with many columns, when you want to find all the columns that have a specific string in their name. It returns the list of matching column names, which you can then pass to the dataframe to show only those columns (e.g. `df[showColumns(df, "year")]`).
For those who are going through the thread finding new tools: pandas-profiling[0] is a library for automatic EDA (which bamboolib[1], mentioned elsewhere, also does).
[0]: https://github.com/pandas-profiling/pandas-profiling
[1]: https://bamboolib.com/
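Minimal pandas-profiling usage, assuming a reasonably recent version of the library:
```
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data.csv")  # placeholder

# One-shot automated EDA: distributions, correlations, missing values
profile = ProfileReport(df, title="EDA report", minimal=True)
profile.to_file("report.html")  # standalone HTML report
# or, inside a notebook: profile.to_notebook_iframe()
```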
Thanks for that feedback. Mito's approach to telemetry is that we never log any of your data or metadata about your data. We don't track things like the size, shape, or content of your data.
We do collect info about app usage, things like which buttons users click. This allows us to focus development time on improving the features that are used most often.
That being said, it's important to us that there is a way to be totally telemetry-free if users don't want any information to leave their computer. Compared to most other cloud-based SaaS data science tools, where you pretty much have no hope of total privacy, we're proud of the flexibility that we offer.
But of course, we're always open to feedback about how we can continue to improve our practices!
Edit: To remove telemetry, just call:
No licensing or payment required, and it doesn't violate the license.
To the founders of mito, regarding the mito GPL license:
What is your take on that regarding usage inside cloud provider's notebooks like AWS, GCP, Azure, Databricks?
Is it allowed or not allowed by the license? And who should/can control the usage since users can install any kind of Python library in those environments.
And, separately from the maybe ambiguous legal answer: What is your personal intention with the license?
Disclosure: I am employed by Databricks.
Hiya kite_and_code - thanks for the question + good to see you here :)
Our understanding of our license is evolving - we're first time open source devs, and as I'm sure you know it can be a tricky process. That being said: we totally support Mito users using Mito from notebooks hosted in the cloud!
Currently, we have quite a few users using Mito in notebooks hosted through AWS, GCP, etc. We’re aiming to be good stewards of open source software, and want to see Mito exist wherever it is solving users' problems!
We’ve had lots of folks in lots of environments request Mito, and are actively working on prioritizing supporting those other environments. We added classic Notebook support last month (funnily, I thought it’d take weeks to support, and it took 2 days lol) - and are looking into VS Code, Streamlit, Dash, and more!
EDIT: due to comment below, I edited this comment for clarity that we 100% support users using Mito from notebooks in the cloud!
I can totally relate that finding a suitable open-source business model is a fuzzy journey.
Nevertheless, from the user perspective I would love to hear a clearer answer - at least for, e.g., the next 6-12 months.
Currently, it seems like you are tolerating usage inside the cloud providers without taking a clear stance. I think this creates fear, uncertainty, doubt and slows down mito adoption within the cloud.
I would appreciate a clear statement in the near future around your thinking on how Mito should be made available in those environments. After all, the cloud is an environment that more and more users are migrating to, or at least using in parallel to local setups.
I can understand if you don't want to answer on the spot in case you don't have a clear stance yet. In this case, please take your time and let us know when you've made your decision.
Really love what you're doing and the innovation that you are pushing for! <3
As a potential user, this is pretty troubling. I can understand your intentions, but if the license doesn’t match your intentions (and if you don’t completely understand the license), how can we be sure our workflows will be supported/possible in the future?
Looks neat - pandas is very powerful, and this makes it more approachable for non-programmers. However, with a paid product like this, I probably wouldn't want to make the switch and then have the company go belly up, leaving users stranded. Too much risk.
Hope for the best though - pandas is pretty fantastic.
You might want to check out a tool called Vizier: https://vizierdb.info (I'm one of the devs). Direct interaction with notebook state (e.g., dataframes as spreadsheets) is one of the central ideas, and it's fully open source.
One of the creators of Mito, here. Thanks for your feedback. I wanted to share a couple of nuggets about Mito that have been helpful in talking about this with other users.
1. The core Mito product is open source. You can see our GitHub here [1]. We also have a pro version that has some additional, code-visible, but non-open-source features. The way that we think about which features belong in which version of the product is as follows: features that are needed to just get any average analysis done are open source features. On the other hand, features that are specifically useful in an organization -- connecting to company databases, formatting / styling data and graphs for a presentation, etc. -- are pro features. So if you are a team that is relying on our pro features, you're helping support the longevity & progress of Mito. If you are not one of those users and are using the open source version, then you will always have access to Mito (and can even help improve it!). Of course, the line between what features are specifically helpful in an organization and what features are needed for an average analysis is a bit blurry, and is a moving target as we continue to expand Mito's offering.
2. Mito is designed specifically to not force users to make a big 'switch'. I've commented this elsewhere in this thread, but just to recap: because Mito is an extension to Jupyter and because we generate Python code for every edit you make, Mito is designed to improve your existing workflow instead of locking you into a new system. Many Mito users use Mito as a starting point! They do as much of their analysis as they can in the Mito spreadsheet and then continue writing more customized Python code to finish up their work.
Not requiring a big switch is nice for the user, and it's nice for Mito too! Lots of large companies have been able to get up and running with Mito in 30 minutes because it fits into their data stack.
Anyways, these may not be the only reasons you might feel uneasy about adopting Mito, but I at least wanted to share why the switch to Mito might be less scary than switching to other tools.
[1] https://github.com/mito-ds/monorepo
I love how mito enables companies to use the power of open-source!
You might want to think about enabling companies to create company-specific extensions themselves, e.g. via a plugin API. You might still require them to pay for this version of Mito, but they would be able to extend it with their own engineering power instead of relying on you.
We had good experiences with this at bamboolib (I am one of the co-founders). In addition to recurring license revenue, it also increased demand for consulting from our end, because the internal company devs started working on plugins and then wanted our direct guidance on how to get the trickier things to work.
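Mito has no public plugin API that I know of; purely as a hypothetical sketch, such an interface could be as small as a registry of transforms that also emit their equivalent code, mirroring Mito's code-generation model (every name here is made up):
```
import pandas as pd

class TransformPlugin:
    # Hypothetical plugin contract: apply a dataframe transform and
    # emit the equivalent Python code, like Mito's built-in edits do
    name: str

    def apply(self, df: pd.DataFrame) -> pd.DataFrame: ...
    def generate_code(self) -> str: ...

PLUGINS: dict = {}

def register(plugin: TransformPlugin) -> None:
    PLUGINS[plugin.name] = plugin

class DropInternalColumns(TransformPlugin):
    # Example company-specific plugin: drop columns prefixed with "_"
    name = "drop_internal_columns"

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.drop(columns=[c for c in df.columns if c.startswith("_")])

    def generate_code(self) -> str:
        return "df = df.drop(columns=[c for c in df.columns if c.startswith('_')])"

register(DropInternalColumns())
```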
Heyo, Mito cofounder here, bridging that gap is one of the main ways that enterprises are using Mito today! Helping business users become data self-sufficient in a world where Excel's data size limitations make it a non-option is where Mito shines :)
Another tool like Mito is bamboolib: https://bamboolib.8080labs.com/