Transform Data by Example [video]

teddyh · 9 years ago

You know what this reminds me of? Those trained neural-net things which, however many training examples you give it, always seem to find some way to “cheat” and not do what you want while still obeying all your training data correctly.

Something like this: Suppose we have a table of strings of digits, some including spaces, and we’d like to remove the spaces. From

  123 456
  234567
  345 678

to

  123456
  234567
  345678

Now, what happens if it encounters, say

Would the result be unchanged (as we would probably want), or would it “cheat” and remove the middle “7” character, giving “456890”?

function_seven · 9 years ago

This is why I want any ML device to be able to explain itself. It could train on your before-and-after examples and come up with a list of what it thinks you want it to do.

For your example, it could list:

    “Remove interior spaces from each item”

or it could say:

    “Remove the middle character from any 7-character strings to make them 6 characters in length”

You would be able to do something with that.

jnpatel · 9 years ago

DataWrangler [0] (now productionized as Trifacta Wrangler [1]) does pretty much that. It gives you suggested lists of transformations such as "Cut from position 18-25 as the Year column", that you can chain together as your data cleaning pipeline.

[0]: http://vis.stanford.edu/wrangler/

[1]: https://www.trifacta.com/products/wrangler/

pishpash · 9 years ago

This add-on already does that. (Did nobody try it??) It shows in the pane a list of candidate transforms, seemingly ranked in some descending plausibility order. They have semi-readable names. You get to choose one to apply.

paulddraper · 9 years ago

Neural nets are infamous for not doing this.

Learning algorithms that produce decision trees are usually used in this situation.

randomsearch · 9 years ago

> This is why I want any ML device to be able to explain itself.

This is the problem of lacking explanatory mechanisms in ML.

Note that some techniques that are very out of vogue at the moment, such as Genetic Programming, are much better than neural nets in this regard.

ChristianBundy · 9 years ago

IIRC you _can_ get this, but it's a huge algorithm that doesn't do things in a way that would probably make sense to a human. It would be amazing to be able to transform code into human language.

zitterbewegung · 9 years ago

Some ML systems (like decision trees) can give you a comprehenable way that they made the decision (would give you a list of if conditionals). Unfortunately many can't do this (random forests) . Having an AI that can explain itself in every situation why it does it also has to do with the underlying techniques. For instance random forests generate a random sample of the features and creates decision trees. So the explanation wouldn't be much useful to you.

sundarurfriend · 9 years ago

This is the stated goal of the Explainable AI initiative, (spearheaded afaik by DARPA, though Google tells me corporates have also began work on it). I hope it works out well because there's going to be a lot of AI code in the near future, and the thought of them all being inscrutable black boxes is pretty scary.

teddyh · 9 years ago

But, you know, if you saw something like, all your visible examples were like the strings

  123 456
  234567
  345 678

and the program replies with something like what you wrote: “Remove the middle character from any 7-character strings to make them 6 characters in length”, it would actually take a programmer’s mind to be able to envision why this might in some cases be wrong. Most people who are not programmers would, I think, see this as equivalent to “Remove interior spaces from each item”. I suspect that the skill required to choose an algorithm correctly is the exact same skill required to actually being a programmer.

All this then buys you is that you don’t have to remember the function names.

OscarTheGrinch · 9 years ago

Yes, this system should do have an intermediate step of spiting out a checklist of clear rules and the user can select the best fit, saving the human the time it would take to search from a bloated dropdown of all possible rules.

tonmoy · 9 years ago

MS had been experimenting with this for a while[1]. They even included this in Excel 2013 as "FlashFill". It does not use any NN/ML at all. It uses "program synthesis", which by definition can tell you exactly what "program" it has synthesized to convert you data. In fact in you example it would not cheat, rather leave the string unchanged as explained in the paper.

[1] https://www.microsoft.com/en-us/research/publication/automat...

randomsearch · 9 years ago

More generally, anything by Sumit Gulwani's group at MSR should be of interest.

teddyh · 9 years ago

But maybe I did want it to remove the middle character! Using my training data, there’s no way for the system to actually know for sure what I meant. There is also no way (in general) for it to detect “outliers” and ask me about them, because there is no good way to know what is an outlier and what is not.

Giroflex · 9 years ago

This isn't the machine's fault, though; for a small number of linearly independent examples there exists an enormous number of possible functions that match the training data. It has no way of guessing, really.

thomasahle · 9 years ago

If the machine had a large background knowledge of what humans would typically like to do, it would help.

randomsearch · 9 years ago

Sounds like you're talking about over-fitting the data. Or perhaps just not providing an evaluation function that is sufficiently general. All ML can fall into this trap.

ke1vin · 9 years ago

When you only provide the system with a few examples, there are many possible transformations which satisfy the examples.

One way to eliminate this ambiguity is to also provide a natural language description of what you want, e.g. "remove the spaces".

In the natural language processing community, we call this semantic parsing.

But sometimes the semantic parser can misinterpret the language too and generate a program which still "cheats" in the same manner as you described. We call these "spurious programs".

Shameless plug-- my group has been working on how to deal with these spurious programs:

From Language to Programs: Bridging Reinforcement Learning and Maximum Marginal Likelihood https://arxiv.org/abs/1704.07926

thewarrior · 9 years ago

You don't realise how big this is. This is the beginning of the automation of coding.

coldtea · 9 years ago

We had such things for decades.

Besides, it depends on the slope of "coding". If it gets really difficult really quick (exponentially say), this could just be forever stuck in the "low hanging fruit" stage.

teddyh · 9 years ago

No, assembly language was the beginning of automation of coding. Almost nobody codes using raw machine code anymore. Everything since is just more added abstractions.

dimaggiosghost · 9 years ago

I agree, thank you for peovoking this thought. It is raw and if so I apologize.

This is where hinting is important. Metadata. That sequence if I know it's a phone number, or a sequence of increasing digits, depends a lot on metadata.

Given some reasonable sample size, i believe machine learning could provide hints as to some of the common types of formats. Semi automated data hinting or structuring?

There is a bidirectional connection between interpreting your data and how your data is structured

Is it possible to use your data column to statistically hint at metadata characteristics by some sort of clustering, then use that to automatically clean input data?

ktamura · 9 years ago

This is a great product idea. If you ask any Excel power users, by far the most time-consuming and hard-to-automate task is text and date manipulation.

The beauty of this product is that its adoption strategy is baked into the product itself: I'd share this with all Excel user friends of mine because I want the algorithm to get smarter, and I might even learn a bit of C# myself so that I can contribute and scratch my own itch. This in turn makes the product better (because of the larger training data), lending itself to more word of mouth.

One concern I have is security: I'd love to hear from folks who built this/more familiar with this about how to ensure the security of suggested transformations.

haswell · 9 years ago

I, too, wonder about security. Just as important: performance/scaleability. What happens when this runs on 100K rows against a service some guy stood up as a weekend project? Now what happens with 100 people hitting that service?

Either way, this looks very useful. Having spent more than my fair share of time massaging data prior to import, this looks pretty great.

Analemma_ · 9 years ago

> What happens when this runs on 100K rows against a service some guy stood up as a weekend project? Now what happens with 100 people hitting that service?

Then they complain to Microsoft, who helpfully suggests the product they should upgrade to. This has always been a strong spot of Microsoft's. "I see you've scaled beyond the capacity of [Product A]. Well, fortunately for you we have [Product B] which can handle it, with a nice import wizard to get you started painlessly." It typically goes Excel > Access > On-prem SQL Server > Azure.

This sounds very negative and I swear I don't mean it that way. It's a great sales tactic if you offer products at every level of scale.

dandare · 9 years ago

Slightly different use case but still very useful for data manipulation is the OpenRefine (formerly Google Refine) https://www.youtube.com/watch?v=B70J_H_zAWM

Cieplak · 9 years ago

I wonder if it uses Z3 under the hood for solving constraints. Very nice of MSFT to MIT license Z3. It's super useful for problems that result in circular dependencies when modeled in Excel, and require iterative solvers (e.g., goal seek). I use the python bindings, but unfortunately it's not as simple as `pip install` and requires a lengthy build/compilation. Well worth the effort, though.

https://github.com/Z3Prover/z3

https://github.com/Z3Prover/z3/issues/288

tonmoy · 9 years ago

Love Z3. It is easy to use and very decent performance! I don't think MS is using Z3 on this product though, looks more like the smart enumeration based program synthesis

c-cube · 9 years ago

I think most of the enumeration-based synthesis tools rely on a SMT solver (Z3 or CVC4, say) to learn from bad solutions.

gergoerdi · 9 years ago

Check out MagicHaskeller which figures out list processing functions from examples: http://nautilus.cs.miyazaki-u.ac.jp/~skata/MagicHaskeller.ht...

For example, given the rule `f "abcde" 2 == "aabbccddee"`, it even figures out the role of the parameter `2`, so `f "zq" 3` gives `"zzzqqq"`.

bcherny · 9 years ago

Wait, Excel had this built in since 2013!

https://support.office.com/en-us/article/Use-AutoFill-and-Fl...

taeric · 9 years ago

I recall a joke from someone... Can't remember who, that went like, "does your startup compete with Excel? Did you just rediscover pivot tables?"

netvarun · 9 years ago

Is this related/a commercial application of the 'Deep Learning for Program Synthesis' post[0][1] from Microsoft Research on HN a month ago?

[0]https://www.microsoft.com/en-us/research/blog/deep-learning-...

[1]HN Discussion: https://news.ycombinator.com/item?id=14168027

martinthenext · 9 years ago

Oh man, we did it before Microsoft!

http://comnsense.io/

https://youtu.be/ALF9GY2K-wc

wayneprice · 9 years ago

I'm playing around with a client-side js implementation of this at https://www.robosheets.com/

It's not production ready / launched yet, but it's getting there.

I'd be interested to finds (or really doesn't find) this useful :)

eob · 9 years ago

I wasn't able to figure out how to use the app, but I did want to drop in to say that I like how you've placed the "Upgrade to enable" notices in cells past a limited range.