CLI tools hidden in the Python standard library

Speaking of hidden Python tools, I'm a big fan of re.Scanner[0]. It's a regex-based tokenizer[1] in the `re` module, that for reasons is completely missing from any official documentation.

You give it a pattern for each token type, and a function to be called on each match, and you get back a list of processed tokens.

Importantly, it processes the list in one pass and ensures the matches are contiguous, where a naive `re.findall` with capture groups will ignore unmatched characters. You also get a reference to the running scanner, so you can record the location of the match for reporting errors.

    import re
    scanner = re.Scanner([
      (r"[0-9]+",       lambda scanner, token:("INTEGER", int(token))),
      (r"[a-z_]+",      lambda scanner, token:("IDENTIFIER", token)),
      (r"[,.]+",        lambda scanner, token:("PUNCTUATION", token)),
      (r"\s+", None), # None == skip token.
    ])

    results, remainder = scanner.scan("45 pigeons, 23 cows, 11 spiders.")
    assert not remainder
    print(results)

    [('INTEGER', 45),
     ('IDENTIFIER', 'pigeons'),
     ('PUNCTUATION', ','),
     ('INTEGER', 23),
     ('IDENTIFIER', 'cows'),
     ('PUNCTUATION', ','),
     ('INTEGER', 11),
     ('IDENTIFIER', 'spiders'),
     ('PUNCTUATION', '.')]
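
To make the error-reporting point concrete, here is a minimal sketch. It assumes the callback can read the current match object from `scanner.match`, which is what CPython's undocumented implementation sets before calling the callback, so it may change. `tag` is a hypothetical helper, not part of `re`.

```python
import re

def tag(kind, convert=str):
    """Build a callback that records the token kind, value, and start offset."""
    def action(scanner, token):
        # scanner.match is the re.Match for the current token (undocumented).
        return (kind, convert(token), scanner.match.start())
    return action

scanner = re.Scanner([
    (r"[0-9]+", tag("INTEGER", int)),
    (r"[a-z_]+", tag("IDENTIFIER")),
    (r"[,.]+", tag("PUNCTUATION")),
    (r"\s+", None),  # None == skip token.
])

text = "45 pigeons, 23 cows?"
tokens, remainder = scanner.scan(text)
# The scan stops at the first character that no pattern matches.
error_offset = len(text) - len(remainder)
print(tokens)
if remainder:
    print(f"unexpected {remainder[:1]!r} at offset {error_offset}")
```

Because the scan is contiguous, whatever is left in `remainder` starts exactly where tokenizing failed.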

[0]: https://stackoverflow.com/a/693818/252218

[1]: https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization

Timon3 · 3 years ago

It seems like the discussion of whether to document it died in April of 2003: https://mail.python.org/pipermail/python-dev/2003-April/0350...

Bummer, it's a cool feature, but I don't feel safe relying on undocumented features.

bombolo · 3 years ago

Especially now that they are on a crusade against their own stdlib.

They regret including most modules… it seems they regret making python altogether instead of sticking with C? :D

burntsushi · 3 years ago

You'll be able to do this soon with the Rust regex crate as well. Well, by using one of its dependencies. Once regex 1.9 is out, you'll be able to do this with regex-automata:

    use regex_automata::{
        meta::Regex,
        util::iter::Searcher,
        Anchored, Input,
    };
    
    #[derive(Clone, Copy, Debug)]
    enum Token {
        Integer,
        Identifier,
        Punctuation,
    }
    
    fn main() {
        let re = Regex::new_many(&[
            r"[0-9]+",
            r"[a-z_]+",
            r"[,.]+",
            r"\s+",
        ]).unwrap();
        let hay = "45 pigeons, 23 cows, 11 spiders";
        let input = Input::new(hay).anchored(Anchored::Yes);
        let mut it = Searcher::new(input).into_matches_iter(|input| {
            Ok(re.search(input))
        }).infallible();
        for m in &mut it {
            let token = match m.pattern().as_usize() {
                0 => Token::Integer,
                1 => Token::Identifier,
                2 => Token::Punctuation,
                3 => continue,
                pid => unreachable!("unrecognized pattern ID: {:?}", pid),
            };
            println!("{:?}: {:?}", token, &hay[m.range()]);
        }
        let remainder = &hay[it.input().get_span()];
        if !remainder.is_empty() {
            println!("did not consume entire haystack");
        }
    }

And its output:

    $ cargo run
       Compiling regex-scanner v0.1.0 (/home/andrew/tmp/scratch/regex-scanner)
        Finished dev [unoptimized + debuginfo] target(s) in 0.31s
         Running `target/debug/regex-scanner`
    Integer: "45"
    Identifier: "pigeons"
    Punctuation: ","
    Integer: "23"
    Identifier: "cows"
    Punctuation: ","
    Integer: "11"
    Identifier: "spiders"

A bit more verbose than the Python, but the library is exposing much lower level components. You have to do a little more stitching to get the `Scanner` behavior. But it does everything the Python does: a single scan (using finite automata and not backtracking like Python), skipping certain token types, and guaranteeing that the entirety of the haystack is consumed.

fastasucan · 3 years ago

'A bit more verbose' = 2.5 times as many characters. Not saying it's good or bad, just saying it's a bit more verbose ;)

n8henrie · 3 years ago

Interesting! Not often using either crate, this example looks like something for which I might usually look to nom. Is there a reason I should consider using regex for this use case instead (if neither is a pre-existing dependency)?

anentropic · 3 years ago

FWIW there is an example of building a tokenizer/scanner using documented features of the re module here: https://docs.python.org/3.11/library/re.html#writing-a-token...

re.Scanner looks more succinct though...!
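
For comparison, a rough sketch of the documented style, using only named groups and `re.finditer`. It is modelled loosely on the docs example rather than copied from it. `TOKEN_SPEC` and `tokenize` are illustrative names, and the `MISMATCH` catch-all is what stops unmatched characters from being silently skipped.

```python
import re

TOKEN_SPEC = [
    ("INTEGER", r"[0-9]+"),
    ("IDENTIFIER", r"[a-z_]+"),
    ("PUNCTUATION", r"[,.]+"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),  # catch-all so no character is silently ignored
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Yield (kind, value) pairs, raising on the first unexpected character."""
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        value = match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(f"unexpected {value!r} at offset {match.start()}")
        yield (kind, int(value) if kind == "INTEGER" else value)

print(list(tokenize("45 pigeons, 23 cows, 11 spiders.")))
```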

czx4f4bd · 3 years ago

I find that even more bizarre tbh. That seems like the perfect place to document the Scanner class.

matsemann · 3 years ago

> completely missing from any official documentation

To be fair, most things are missing from the official documentation. When I learned Kotlin, I read through their official docs, and knew about most language features in a day. When I learned Python, I constantly got surprised by things I hadn't seen come up in the docs. For instance, decorators were (still are?) not mentioned at all in the official tutorial.

roelschroeven · 3 years ago

The tutorial is not supposed to cover all language features: "This tutorial does not attempt to be comprehensive and cover every single feature, or even every commonly used feature. Instead, it introduces many of Python’s most noteworthy features, and will give you a good idea of the language’s flavor and style."

But then I don't know how you're supposed to learn the features that are not in the tutorial. You can have a look at the table of contents of the standard library documentation for modules that might interest you, but that doesn't cover language features. Those are documented in The Python Language Reference, but that document is not really suited for learning from.

There are lots of websites and YouTube channels and so on, but you have to find them and filter out the not-so-good ones, which is not easy, especially for a beginner. I think there is room for some kind of official advanced tutorial to cover the gap.

BoppreH · 3 years ago

Decorators seem to be documented now:

https://docs.python.org/3/glossary.html#term-decorator

https://docs.python.org/3/reference/compound_stmts.html#func...

fastasucan · 3 years ago

I agree. Wish Python could have been better with the documentation. It's a bit absurd that things feel more clear and simple reading Rust documentation than Python documentation for me, given that Python actually is a lot more simple and clear (for me).

emmelaich · 3 years ago

For a long time, MatchObject was missing too.

There now -- https://docs.python.org/3/library/re.html#match-objects

fastasucan · 3 years ago

Woah, that's a pretty cool feature! I always feel a bit dirty trying to do anything like that manually (usually involving a string.split(",")[0][:2] etc, just asking to break).

anitil · 3 years ago

Amazing! I had just written a regex tokenizer in python the other day, this would have been great!

cricalix · 3 years ago

Curious cat - had you considered using Antlr4 and a Python visitor/listener? (Were you aware of antlr?). Depending what you're trying to do with a regex tokenizer, it might be suitable.

sam2426679 · 3 years ago

I believe this is used internally by json.loads, which is not very surprising in hindsight.

carlossouza · 3 years ago

This is one of the best ChatGPT use cases: creating complex regex patterns

nmstoker · 3 years ago

It's excellent at both producing them and explaining what it has generated.

And they're not just regurgitated things from the web, I've had novel ones generated fine (obviously you still need to test them carefully)

extasia · 3 years ago

I agree. I made a little code formatter for aligning variable assignments on adjacent lines and used ChatGPT for a lot of help with the regex

tl_donson · 3 years ago

this one use case justified copilot as an expense for me.

The SQLite project provides a simple command-line program named sqlite3 (or sqlite3.exe on Windows) that allows the user to manually enter and execute SQL statements against an SQLite database or against a ZIP archive.

There is one problem with those: Security.

Using modules (even if they are in the standard library) on the command line lets malicious code in the current dir take over your machine:

https://twitter.com/marekgibney/status/1598706464583028736

cxr · 3 years ago

I find the entire premise of the post to be pretty baffling.

> Seth pointed out this is useful if you are on Windows and don't have the gzip utility installed.

Okay, so instead of installing gzip (or just using the decompressors that aren't the official gzip utility but that do support the format and already ship with Windows by default[1]), you install Python...?

Even if the premise weren't muddy from the start, there is a language runtime and ubiquitous cross-platform API already available on all major desktops that has a really good, industrial strength sandbox/containerization strategy that more than adequately mitigates the security issues raised here so you can e.g. rot13 without fear all day to your delight: the browser.

1. <https://news.ycombinator.com/item?id=36099528>

jerpint · 3 years ago

Most people who read this blog already have python installed

yunohn · 3 years ago

You can definitely be in situations where you have python but not gzip.

formerly_proven · 3 years ago

iirc this is one of the things earmarked for a hypothetical Python 4, making -P the default. It's also one of the many relatively well-known (security) issues in Python that don't get addressed for a surprising amount of time. Others in the same vein would be stuff like stderr being block-buffered when not using a TTY, no randomized hashes for the longest time, loading DLLs preferably from the working directory, std* using 7-bit ASCII in a number of circumstances and many more.

patrec · 3 years ago

> stderr being block-buffered

As opposed to line buffered, I assume? That sounds annoying, but why is it a security problem?

> no randomized hashes

I'm not up to date, but I think last I looked, I had the impression that randomized hashes didn't seem like they would fundamentally prevent collision attacks, just require more sophistication. Is that not the case?

TeMPOraL · 3 years ago

> loading DLLs preferably from the working directory

That is a feature, though, isn't it?

jonnycomputer · 3 years ago

-P?

johannes1234321 · 3 years ago

Doing that requires placement of files right in the directory where the user is likely to run that module.

Seems to be a quite rare vector for exploitation.

Sure, on a multiuser system I might trick some other user into running such a command in /tmp and prepare that directory accordingly, but other vectors seem more esoteric.

mg · 3 years ago

There are thousands of tools (shellscripts, makefiles ...) which execute "python -m":

https://www.gnod.com/search/?engines=af&nw=1&q=python%20-m

Even in Google's own repos. Starting any of those (no matter where they are stored) in a hostile repo would let the code in the repo take over the machine.

jonnycomputer · 3 years ago

Only if running with -m, or with -c and an import of a package whose name matches the malicious package. It doesn't happen when running your own script (located in another directory) that imports that package, even if you are running it in that directory.

lizard · 3 years ago

It's notable, perhaps, that the

  if __name__ == "__main__":

block allows you to do this for a _module_, i.e. a single *.py file. If you want to do this for a package, add a `__main__.py`

You can also use either of these throughout your code so that you can have

  python -m foo
  python -m foo.bar
  python -m foo.bar.baz

each doing different, but hopefully somewhat related, things.
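
A throwaway sketch of that layout, built in a temp directory so it can be run as-is. The names `foo` and `foo.bar` are taken from the example above, while the file contents and the `run` helper are made up for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "foo"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    # Executed by `python -m foo` (the package entry point).
    (pkg / "__main__.py").write_text('print("running package foo")\n')
    # The guarded block runs for `python -m foo.bar`, but not on plain import.
    (pkg / "bar.py").write_text(
        "def greet():\n"
        '    return "running module foo.bar"\n'
        "\n"
        'if __name__ == "__main__":\n'
        "    print(greet())\n"
    )

    def run(module):
        """Run `python -m <module>` from the temp dir and return its stdout."""
        result = subprocess.run(
            [sys.executable, "-m", module],
            cwd=tmp, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    package_output = run("foo")
    module_output = run("foo.bar")

print(package_output)
print(module_output)
```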

BiteCode_dev · 3 years ago

It also uses the file as a module, not a script, which means relative imports suddenly work: the root dir and the cwd are the same, and it is added to sys.path.

This prevents a ton of import problems, albeit for the price of more verbose typing, especially since you don't have completion on dotted path.

It is my favorite way of running my projects.

Unfortunately it means you can't use "-m pdb", and that's a big loss.

quietbritishjim · 3 years ago

> which means suddenly relative imports works

Not just relative imports but also (properly formed) absolute imports. For example, if you have a directory my_pkg/ with files mod1.py and mod2.py, then

    # In my_pkg/mod2.py
    import my_pkg.mod1

will work if you run `python -m my_pkg.mod2` but will fail if you run `python my_pkg/mod2.py`

However, the script syntax does work properly with absolute imports if you set the environment variable `PYTHONPATH=.` (I don't know about relative imports - I don't use those). That would presumably allow pdb to work (but, shame on me, I've never tried it).
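
The underlying difference is what Python puts first on `sys.path`: the current working directory for `-m`, and the script's own directory for a script path. Here is a small sketch that reproduces both outcomes, with `my_pkg`, `mod1` and `mod2` as above. The temp-dir layout, the `NAME` constant and the `run` helper are illustrative assumptions.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "my_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "mod1.py").write_text('NAME = "mod1"\n')
    (pkg / "mod2.py").write_text(
        "import my_pkg.mod1\n"
        'print("imported", my_pkg.mod1.NAME)\n'
    )

    # Drop variables that would change sys.path and hide the difference.
    env = {k: v for k, v in os.environ.items()
           if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}

    def run(*args):
        """Run a child interpreter from the project root (the temp dir)."""
        return subprocess.run(
            [sys.executable, *args],
            cwd=tmp, env=env, capture_output=True, text=True,
        )

    as_module = run("-m", "my_pkg.mod2")              # cwd is on sys.path
    as_script = run(str(Path("my_pkg") / "mod2.py"))  # my_pkg/ is on sys.path

print(as_module.returncode, as_module.stdout.strip())
print(as_script.returncode, as_script.stderr.strip().splitlines()[-1])
```

The second run fails with `ModuleNotFoundError` because the directory containing `my_pkg` is no longer on the path.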

scruple · 3 years ago

I also develop and run projects this way. I really, really enjoy it. It's a very pleasant experience, on both the development side and execution side.

I'm relatively new to Python (used it for ~1 year in 2007/2008, again briefly in 2014 -- which is when I believe I picked this module trick up -- and then didn't touch it again until March of this year). It's made an impression on my team and we're all having a good time developing code this way. I do wonder, though, what other shortcomings might exist with this approach.

sharikous · 3 years ago

You can use pdb

Just

python -m pdb -m module

jxramos · 3 years ago

I've always been curious how that mechanism works exactly. What is it about that invocation technique that satisfies the relative imports? I think it changes the Python path somehow, in a way related to the module being run, something like appending the base dir where the module is saved to the PYTHONPATH?

andreareina · 3 years ago

I've taken to adding a --pdb option to my scripts.

polyrand · 3 years ago

Python 3.12 will include a SQLite CLI/REPL in the standard library too[0][1]. This is useful because most operating systems have sqlite3 and python3, but are missing the SQLite CLI.

[0]: https://github.com/python/cpython/blob/3fb7c608e5764559a718c...

[1]: https://docs.python.org/3.12/library/sqlite3.html#command-li...

agumonkey · 3 years ago

slightly related, emacs also includes an sqlite client / view now.. I find it funny to see everybody chasing the same need, and unsurprising since.. it's always good to have sqlite close to you.

maleldil · 3 years ago

It's weird that they're adding code to the stdlib without type annotations.

RGBCube · 3 years ago

Lots of code in the stdlib does not have type annotations. Though I think most popular modules in the std either are annotated or have stubs somewhere.

chmaynard · 3 years ago

> This is useful because most operating systems have sqlite3 and python3, but are missing the SQLite CLI.

Not sure what you mean -- sqlite3 is the SQLite CLI.

Yes, I meant the DLL/.so file. The library is almost always present in the OS, the CLI is not.

They likely mean libsqlite3, i.e. the .DLL/.so.

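Not sure why my comment is being downvoted. From the sqlite.org website:

> The SQLite project provides a simple command-line program named sqlite3 (or sqlite3.exe on Windows) that allows the user to manually enter and execute SQL statements against an SQLite database or against a ZIP archive.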

dfc · 3 years ago

I am not someone who complains about "stop piping cats" but using '-e' with grep is a lot quicker and easier to read. This:

    grep -v 'test/' | grep -v 'tests/' | grep -v idlelib | grep -v turtledemo

Becomes:

    grep -ve 'test/' -e 'tests/' -e idlelib -e turtledemo

To be fair, the advantage of doing separate greps is working iteratively and drilling down on what you want.

kevincox · 3 years ago

You can iteratively add the -e flags just the same.

knodi123 · 3 years ago

not this? grep -ve 'test/|tests/|idlelib|turtledemo'

xorcist · 3 years ago

That would be -vE, or you have to escape the pipe symbols.

simonw · 3 years ago

Thanks, that's a useful tip.

st0le · 3 years ago

Also since ripgrep is already installed "grep -v" can be replaced with "rg -v"

submeta · 3 years ago

Nice! I use this on winboxes:

zipfile

Decompress a zip file:

    python -m zipfile -e archive.zip /path/to/extract/to

Compress a directory into a zip file:

    python -m zipfile -c new_archive.zip /path/to/directory
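
If you'd rather do the same round trip from a script, here is a rough equivalent using the `zipfile` API. It is a sketch, not a reimplementation of the CLI; the file names and the temp directory are placeholders.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "directory"
    source.mkdir()
    (source / "hello.txt").write_text("hello\n")

    # Compress: store every file under its path relative to the parent dir.
    archive = root / "new_archive.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.rglob("*")):
            zf.write(path, path.relative_to(root).as_posix())

    # Decompress into a separate target directory.
    target = root / "extracted"
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        zf.extractall(target)

    restored = (target / "directory" / "hello.txt").read_text()

print(names)
print(restored, end="")
```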

throwawaymobule · 3 years ago

is there an advantage to this over Windows' builtin zip feature? I don't know if it is usable from CLI.

CJefferson · 3 years ago

I use http.server all the time, particularly as modern browsers disable a bunch of functionality if you open file URLs. Had no idea there was so much other stuff here!

trotro · 3 years ago

Same here, it's also by far the most convenient way I've found to share files between devices on my network.
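
Sharing across devices works because `python -m http.server` binds to all interfaces by default (`--bind` restricts it). Below is a minimal programmatic sketch of the same server. It is deliberately bound to loopback with an OS-chosen port so it is safe to run anywhere, and the file name and contents are placeholders.

```python
import http.server
import tempfile
import threading
import urllib.request
from functools import partial
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hello.txt").write_text("hello from http.server\n")
    handler = partial(http.server.SimpleHTTPRequestHandler, directory=tmp)

    # Port 0 asks the OS for a free port; ("", 8000) would mirror the CLI default.
    with http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler) as server:
        port = server.server_address[1]
        threading.Thread(target=server.serve_forever, daemon=True).start()

        # Bypass any configured proxy so the request really goes to loopback.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(f"http://127.0.0.1:{port}/hello.txt") as response:
            body = response.read().decode()

        server.shutdown()

print(body, end="")
```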

bradrn · 3 years ago

Wait, how does that work? As far as I can see from the documentation, it can only serve on localhost, which to my understanding is only accessible from the single device it was launched on.

seanw444 · 3 years ago

Same for me. Although I just discovered `qrcp` which has been quite handy, and I'm sure there are many tools like it.

LtWorf · 3 years ago

I wrote this, a while ago https://ltworf.github.io/weborf/qweborf.html

On plasma it also installs a right click shortcut to share directory from dolphin.

vs4vijay · 3 years ago

FYI: you can use https://file.pizza/ for sending the file outside the network.

pinkcan · 3 years ago

www is aliased in my .zshrc for that reason:

www='python -m http.server'

icar · 3 years ago

I use miniserve (`cargo install miniserve`). You also have available `npx serve`.

HumblyTossed · 3 years ago

Oh and the Rust brigade have arrived... Was only a matter of time.

thrdbndndn · 3 years ago

Chrome also often takes forever to load even the simple HTMLs if it's a local file.

I suspect it's related to relative path resource but never figured it out.

9dev · 3 years ago

> Pretty-print JSON:
>
>     echo '{"foo": "bar", "baz": [1, 2, 3]}' | python -m json.tool

This is even more fun on macOS if you combine it with the pbpaste/pbcopy utils:

  alias json_pretty="pbpaste | python -m json.tool | pbcopy"

That command will pretty-print any JSON in your clipboard, and write it back to the clipboard, so you can paste it somewhere else formatted!
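
Inside a script, roughly the same formatting is available from the `json` module directly. This sketch assumes the 4-space indent that `json.tool` uses by default; `raw` and `pretty` are just illustrative names.

```python
import json

raw = '{"foo": "bar", "baz": [1, 2, 3]}'
# Parse, then re-serialize with indentation; key order is preserved.
pretty = json.dumps(json.loads(raw), indent=4)
print(pretty)
```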

TRiG_Ireland · 3 years ago

    alias json-format="python -m json.tool | pygmentize -l javascript"

I have a similar mapping to use `json.tool` inside of vim buffers. Very useful tool that's gotten a ton of mileage from me over the last ~decade.