Show HN: Regex.ai – AI-powered regular expression generator

> Regex.ai is an AI-powered tool that generates regular expressions.

Or, just write regular expressions?

> ... Regex.ai's intuitive interface makes it easy to input sample text and generate complex regular expressions quickly and efficiently.

See: https://www.ibm.com/topics/overfitting

Inputting the sample text:

  foo bar baz
  baz bar foo

And highlighting the first "baz" produced patterns which all had "[A-Z][a-z]*@libertylabs\\.ai" included, assumedly due to the default inclusions.

Removing those and highlighting the second "baz" produced "<Agent B>" as the result in one case.

There is no explanation of any patterns generated. If a person is to use one of the generated patterns and Regex.ai is supposed to "save you time and streamline your workflow", no matter "[w]hether you're a novice or an expert", then some form of verification and/or explanation must exist.

Otherwise, a person must know how to formulate regular expressions in order to determine which, if any, of the presented options are applicable. And if a person knows how to formulate regular expressions, then why would they use Regex.ai?

anileated · 2 years ago

I often find it faster to write something from scratch rather than to work with someone else’s code to fix it. In the latter case I need to understand the intent, the whys behind the choices.

Well guess what, LLM-generated code is someone else’s code: an amalgamation derived from many peoples’ code. Except those people are ‘helpfully’ “abstracted away” from you by the middleman, so you can’t know their original intents and choices. What’s worse, it’s someone else’s code that will be treated as your code—unlike working with a legacy system that everyone knows was written by some guy, in this case any bugs will be squarely on you.

AdieuToLogic · 2 years ago

This offering, and the other half-dozen like it this past week or so, is like giving a kid a flamethrower.

It's all fun and games until they burn down your house.

> ... I need to understand the intent, the whys behind the choices.

As do I.

And that is something ChatGPT-X (for any given X) cannot provide, regardless of whether or not what is produced is correct. Perhaps with some form of backward chaining[0] a ChatGPT-X can someday explain how it arrived at what it produced.

But "the why" is the domain of people.

0 - https://en.wikipedia.org/wiki/Backward_chaining

cuuupid · 2 years ago

It's even worse. When working with someone else's code e.g. StackOverflow there's a reputation system gating people from the platform and incentivizing them to provide correct answers. You can reasonably expect that someone else's code has at least been thought through to some extent to solve the problem at hand, and very likely tested.

With LLM-generated code, especially ChatGPT-style decoder models, none of that is true. All of the posts and comments I see about it here seem to be anecdotes "it can do all of my job for me" yet asking it to write the simplest code creates several issues on my end.

Personally I think a model geared towards code generation isn't an unsolvable task; the Spider dataset was released some time ago (text to SQL task) and the winning approach there was no fanciness on the model side, but rather to just test all the output queries to ensure it's at least valid SQL. That got a 20%+ boost in accuracy.
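A rough sketch of that "keep only the output that is at least valid" filter, using Python's built-in sqlite3 module. The schema, the candidate queries and the helper name keep_valid_sql are all made up for illustration; this is not the actual Spider-winning pipeline, just the general idea.

```python
import sqlite3

def keep_valid_sql(candidates, schema_sql):
    """Return only the candidate queries that SQLite can compile against the schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)
    valid = []
    for query in candidates:
        try:
            # EXPLAIN compiles the statement without running the query itself.
            conn.execute("EXPLAIN " + query)
        except sqlite3.Error:
            # Syntax errors and unknown tables/columns both end up here.
            continue
        valid.append(query)
    conn.close()
    return valid

# Hypothetical schema and hypothetical model outputs.
schema = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);"
candidates = [
    "SELECT name FROM users WHERE id = 1",  # fine
    "SELEC name FROM users",                # syntax error
    "SELECT email FROM users",              # column does not exist
]
valid_queries = keep_valid_sql(candidates, schema)
print(valid_queries)
```

This only proves a query compiles, not that it answers the question that was asked, so it is a floor rather than a correctness check.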

cyanydeez · 2 years ago

Like autopilot in planes that fall back to experienced pilots, we're embarking on the most dangerous "uncanny valley" maneuver where these systems will be adopted by experienced pilots who know the limits but who will inevitably be followed by either no one or students whose conception is entirely synthetic.

At that point the plane AI better be 100% TRUSTWORTHY cause there's no safe fallback.

regexLL · 2 years ago

Thanks for your feedback! Updated ver 1.1 coming soon with more descriptions and better performance :)

AdieuToLogic · 2 years ago

If you have a choice between descriptions or performance, I humbly suggest detailed descriptions, perhaps with links to tutorials and/or further reading. Who cares if the wrong thing is returned quickly if that means it lacks any context?

Also, consider how to express anchoring and/or grouping preferences in the UI or weighting based on highlight positioning. These are oft used features of regex languages.

barbariangrunge · 2 years ago

If you don’t understand regexes well enough to write them yourself, you should not get some AI to generate them for you. You won’t be able to verify whether they do what you want, and the bugs can be subtle and destructive.

soiler · 2 years ago

I read a few weeks ago here on HN about one large SaaS grinding to a halt because of a greedy selector in one line of regex. Not sure how people find old stories, it's lost to me now. But it was an excellent example of why regex is dangerous and requires a lot of care to write. I wouldn't trust an AI to write my regex unless I saw that people were finding it to be consistently better than they were at writing what they need.

jameshart · 2 years ago

You gave it an example where inferring the semantics you were after was basically a crapshoot. It’s not going to do well under those conditions. Nor will a random human who lacks insight into what specifically you are after. Did you want all the bazzes that are at the end of lines? The bazzes that follow bars? Who knows?

Try giving it examples where the data provides context cues.

qwertox · 2 years ago

Using your example and deselecting the email addresses I end up with these suggestions:

\b(foo|bar|baz)\b

\w(foo|bar|baz)\w

\bbaz\b

[fF][oO][oO]|[Bb][Aa][Rr]|[Bb][Aa][Zz]

It only lacks a dice button which randomly selects the "correct" answer.
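For anyone who wants to see what those four suggestions actually do, here is a quick check with Python's re module against the two-line sample from the top of the thread. The helper name match_spans is just for illustration.

```python
import re

# The sample text from the original comment.
sample = "foo bar baz\nbaz bar foo"

# The four suggestions quoted above.
suggestions = [
    r"\b(foo|bar|baz)\b",
    r"\w(foo|bar|baz)\w",
    r"\bbaz\b",
    r"[fF][oO][oO]|[Bb][Aa][Rr]|[Bb][Aa][Zz]",
]

def match_spans(pattern, text):
    """List (start offset, matched text) for every match of pattern in text."""
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, text)]

results = {p: match_spans(p, sample) for p in suggestions}
for pattern, spans in results.items():
    print(f"{pattern!r}: {len(spans)} match(es) {spans}")
```

On this sample the second suggestion matches nothing at all, because every word is surrounded by whitespace rather than word characters, and none of the four singles out just one of the two "baz" occurrences.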

6510 · 2 years ago

There are tools that somewhat explain what each part of a regex does.

qsort · 2 years ago

Just so that you know, your problem is called "regular expression synthesis", there's vast literature on it and a LLM is by no means necessary.

https://arts.units.it/retrieve/handle/11368/2758954/57751/20...

https://arxiv.org/pdf/1908.03316

https://cs.stanford.edu/~minalee/pdf/gpce2016-alpharegex.pdf

__lm__ · 2 years ago

The first one is available here: http://regex.inginf.units.it/

It uses genetic programming to build the regular expression.

nicolaslegland · 2 years ago

https://regex.inginf.units.it/ only needed 5 seconds to generate /(?<=Rd )\d++/ when I highlighted 9856, 10190, 9753 and 8883.

https://regex.ai/ was stuck with /9856|10190|9753|8883/ and confidently emitted /\d{4}/ as an alternative.

https://regex101.com/r/cAaV1z/1 confirms the former.
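A small sketch of the difference between those patterns. The sample lines below are invented (only the four highlighted numbers come from the comment above), and \d+ stands in for the possessive \d++, which Python's re module only accepts from version 3.11 and which matches the same text here.

```python
import re

# Hypothetical sample text; only the four numbers are taken from the comment.
text = "\n".join([
    "Elm Rd 9856",
    "Oak Rd 10190",
    "Pine Rd 9753",
    "Main Rd 8883",
    "Suite 1200",
])
highlighted = ["9856", "10190", "9753", "8883"]

# Lookbehind: digits that directly follow "Rd ".
lookbehind = re.findall(r"(?<=Rd )\d+", text)
# Exactly four digits, anywhere.
four_digits = re.findall(r"\d{4}", text)
# Plain alternation of the highlighted values.
alternation = re.findall(r"9856|10190|9753|8883", text)

print(lookbehind)
print(four_digits)
print(alternation)
```

On this made-up input the \d{4} alternative truncates 10190 to 1019 and also picks up the unrelated 1200, while the alternation reproduces the highlights but would not find a fifth "Rd" number it has never seen.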

hackernewds · 2 years ago

and yet a decent regex generator has not existed before.

eviks · 2 years ago

This one isn't decent

roncesvalles · 2 years ago

My Google search results for "regex generator" returned a full page of decent ones.

florianfmmartin · 2 years ago

How about instead of an AI generating a regex we can't understand, we put energy into using actually well-developed methods for parsing & validating text? Why put code you can't understand in your database?

For complex inputs, use actual PEG parsers : https://docs.rs/peg/latest/peg/

For simpler inputs, express your intent with readable methods using a lib : https://github.com/sgreben/regex-builder/ & https://github.com/francisrstokes/super-expressive

b5n · 2 years ago

Once you know regex well enough to replace regex you realize that regex is pretty well developed.

There are certainly cases where different parsing methods/grammars are a better fit, but regex shines in many places.

column · 2 years ago

Reality check : there are people like my colleague who aren't software engineers and still have to occasionally maintain/create a regex in some corporate software config.

soiler · 2 years ago

That's even worse. They might not have the knowledge to realize the regex an AI gives them is bunk, or to debug it when it fails.

I'd like to see some numbers on a tool like this. If a huge majority of people are seeing genuine improvements in their workflow with it, I won't be a luddite yelling at them. Rare, low-severity failures shouldn't hold us back.

But the potential cost of failure with (any) regex is very high, so I personally wouldn't want to trust anything remotely mission-critical to a person who doesn't understand regex well enough to write it themselves, and if they can write it on their own that's often faster than debugging AI-generated regex.

thunky · 2 years ago

> How about instead of an AI generating a regex we can't understand

Would you feel better if it generated a regex-builder expression instead of a regex?

Even if regex-builder generates a regex under the hood?

In any case, the regex itself is only an implementation detail.

textread · 2 years ago

If you would like to generate a regular expression by giving an example input text and an example output match, you could use this closed form solution tool:- https://regex-generator.olafneumann.org/

There is an excellent HN comment that provides more reading material around regex generation:- https://news.ycombinator.com/item?id=32037544

elif · 2 years ago

I had a problem so I used an AI generated regex. Now I have an unknowable number of problems.

jedberg · 2 years ago

Usually when you have an AI like this that is supposed to generate verifiable results, you do an adversarial test where you ask it to solve problems that you already know the answer to, to make sure it works.

It looks like no one did that here. Even using the sample data provided, if you highlight a few of the addresses, it can't find the rest of them, mainly because it generates a regex with ST/AVE/LN in it, missing all the ones with RD. And if you add an RD sample, it just adds that to the list.

There's lots of great innovation coming with LLMs, but people are forgetting their "AI basics" when it comes to verifying them.
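A minimal sketch of that kind of known-answer check, assuming a handful of hypothetical addresses: the pattern is scored against examples whose expected matches are already known, including suffixes it was never shown. The addresses, both patterns and the helper name recall are illustrative, not taken from the tool.

```python
import re

def recall(pattern, expected, text):
    """Fraction of the expected strings that the pattern matches exactly in text."""
    found = {m.group(0) for m in re.finditer(pattern, text)}
    hits = [e for e in expected if e in found]
    return len(hits) / len(expected)

# Hypothetical addresses whose correct matches are known in advance.
addresses = ["12 OAK ST", "480 PINE AVE", "7 ELM LN", "93 BIRCH RD", "2201 CEDAR RD"]
text = "\n".join(f"Deliver to {a} by noon" for a in addresses)

# A pattern that memorised the suffixes it was shown, and a more general one.
overfit = r"\d+ [A-Z]+ (?:ST|AVE|LN)\b"
general = r"\d+ [A-Z]+ [A-Z]{2,3}\b"

overfit_recall = recall(overfit, addresses, text)
general_recall = recall(general, addresses, text)
print(overfit_recall, general_recall)
```

Here the suffix-list pattern finds only the three addresses with suffixes it has seen and misses both RD ones, which is the failure described above.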

pyuser583 · 2 years ago

I really wonder if this sort of thing is how AI will work.

We tell AI what we want. AI produces a hyper-specific, but barely comprehensible result. We look over the result to make sure it’s all good.

Then execute.

yawnxyz · 2 years ago

I just used ChatGPT to create a ton of permutations for product pricing that I'm putting on Stripe as products.

Except... it made ONE ERROR that I just spent two hours tracking down and fixing in my JSON file and now in the Stripe dash. (I coincidentally found the error using ChatGPT lol).

It's probably still faster and less error-prone than I could have done it manually. But it's still error-prone...

sacred_numbers · 2 years ago

The Reflexion paper (https://arxiv.org/abs/2303.11366) that came out recently shows how this kind of mistake might be overcome. Asking the model to think about the answer after it's generated a first draft greatly improves accuracy. Also, prompt engineering such as copying the generated code, pasting it in a new chat and saying "There's a bug in this code, please find it" can go a long way. There is so much low hanging fruit in harnessing the power of these models that is just being ignored because some even lower hanging fruit (RLHF, system messages, context window size, plugins, etc) is being released seemingly every few days.

globalise83 · 2 years ago

AI generates a comprehensive set of unit tests with correct and incorrect inputs, then we run the tests to ensure that they all pass.

tomashubelbauer · 2 years ago

I was curious if this would be smart enough to generate a regex for any four letter word so I copied the tagline of the site and highlighted all four letter words in it. (I have deleted the previous highlights of course.) It generated three regexes that just had a union of those words and one which started off good-ish by looking for any word of length of three or four, but then tacked on some random suffix and in the end this most promising regex turned out to not even match anything in the source text. As a suggestion to the authors of this tool I'd propose to add a step where any generated regexes that don't match anything in the input text are removed from the results.
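That filtering step could look roughly like this. The input text is the site description quoted at the top of the thread, used as a stand-in for the tagline; the candidate patterns and the helper name keep_matching are hypothetical.

```python
import re

def keep_matching(candidates, text):
    """Drop candidates that fail to compile or that match nothing in text."""
    kept = []
    for pattern in candidates:
        try:
            compiled = re.compile(pattern)
        except re.error:
            continue  # not even a valid regex
        if compiled.search(text):
            kept.append(pattern)
    return kept

# Stand-in for the tagline; its four-letter words are "tool" and "that".
text = "Regex.ai is an AI-powered tool that generates regular expressions."

# Hypothetical generator output.
candidates = [
    r"tool|that",                     # union of the highlighted words
    r"\b\w{3,4}\b@example\.com",      # plausible start, random suffix, matches nothing
    r"\b[A-Za-z]{4}\b",               # any four-letter word
    r"(",                             # invalid pattern
]
kept = keep_matching(candidates, text)
print(kept)
```

A stricter version would also require each surviving pattern to cover every highlighted span, not just to match something somewhere.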