Speaking of hidden Python tools, I'm a big fan of re.Scanner[0]. It's a regex-based tokenizer[1] in the `re` module that, for some reason, is completely missing from the official documentation.
You give it a pattern for each token type, and a function to be called on each match, and you get back a list of processed tokens.
Importantly, it processes the input in one pass and ensures the matches are contiguous, whereas a naive `re.findall` with capture groups will ignore unmatched characters. You also get a reference to the running scanner, so you can record the location of the match for reporting errors.
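For anyone who hasn't seen it, here is a minimal sketch of how that looks. Since `re.Scanner` is undocumented, this relies on how the class behaves in CPython today and could change. The token names and the `make_token` helper are made up for illustration.

```python
import re


def make_token(kind):
    """Build a callback that tags the matched text with a token kind (hypothetical helper)."""
    def action(scanner, text):
        # scanner.match is the current match object, so the offset is available.
        return (kind, text, scanner.match.start())
    return action


scanner = re.Scanner([
    (r"[0-9]+", make_token("INTEGER")),
    (r"[a-z_]+", make_token("IDENTIFIER")),
    (r"[,.]+", make_token("PUNCTUATION")),
    (r"\s+", None),  # None as the action means "match it, but emit nothing"
])

tokens, remainder = scanner.scan("45 pigeons, 23 cows, 11 spiders")
for token in tokens:
    print(token)

# Scanning stops at the first character no pattern matches; the rest is returned.
partial, leftover = scanner.scan("12 $x")

# A plain findall skips over the "$" without complaint.
skipped = re.findall(r"[0-9]+|[a-z_]+", "12 $x")
```

An empty `remainder` means the whole input was tokenized. A non-empty one tells you exactly where scanning gave up.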
You'll be able to do this soon with the Rust regex crate as well. Well, by using one of its dependencies. Once regex 1.9 is out, you'll be able to do this with regex-automata:
    use regex_automata::{
        meta::Regex,
        util::iter::Searcher,
        Anchored, Input,
    };

    #[derive(Clone, Copy, Debug)]
    enum Token {
        Integer,
        Identifier,
        Punctuation,
    }

    fn main() {
        let re = Regex::new_many(&[
            r"[0-9]+",
            r"[a-z_]+",
            r"[,.]+",
            r"\s+",
        ]).unwrap();
        let hay = "45 pigeons, 23 cows, 11 spiders";
        let input = Input::new(hay).anchored(Anchored::Yes);
        let mut it = Searcher::new(input).into_matches_iter(|input| {
            Ok(re.search(input))
        }).infallible();
        for m in &mut it {
            let token = match m.pattern().as_usize() {
                0 => Token::Integer,
                1 => Token::Identifier,
                2 => Token::Punctuation,
                3 => continue,
                pid => unreachable!("unrecognized pattern ID: {:?}", pid),
            };
            println!("{:?}: {:?}", token, &hay[m.range()]);
        }
        let remainder = &hay[it.input().get_span()];
        if !remainder.is_empty() {
            println!("did not consume entire haystack");
        }
    }
A bit more verbose than the Python, but the library is exposing much lower-level components, so you have to do a little more stitching to get the `Scanner` behavior. But it does everything the Python does: a single scan (using finite automata, not backtracking like Python), skipping certain token types, and a check that the entirety of the haystack is consumed.
Interesting! Not often using either crate, this example looks like something for which I might usually look to nom. Is there a reason I should consider using regex for this use case instead (if neither is a pre-existing dependency)?
> completely missing from any official documentation
To be fair, most things are missing from the official documentation. When I learned Kotlin, I read through their official docs, and knew about most language features in a day. When I learned Python, I constantly got surprised by things I hadn't seen come up in the docs. For instance, decorators were (still are?) not mentioned at all in the official tutorial.
The tutorial is not supposed to cover all language features: "This tutorial does not attempt to be comprehensive and cover every single feature, or even every commonly used feature. Instead, it introduces many of Python’s most noteworthy features, and will give you a good idea of the language’s flavor and style."
But then I don't know how you're supposed to learn the features that are not in the tutorial. You can have a look at the table of contents of the standard library documentation for modules that might interest you, but that doesn't cover language features. Those are documented in The Python Language Reference, but that document is not really suited for learning from.
There are lots of websites and YouTube channels and so on, but you have to find them and filter out the not-so-good ones, which is not easy, especially for a beginner. I think there is room for some kind of official advanced tutorial to cover the gap.
I agree. Wish Python could have been better with the documentation. It's a bit absurd that things feel more clear and simple reading Rust documentation than Python documentation for me, given that Python actually is a lot more simple and clear (for me).
Woah, that's a pretty cool feature! I always feel a bit dirty trying to do anything like that manually (usually involving a string.split(",")[0][:2] etc, just asking to break).
Curious cat - had you considered using Antlr4 and a Python visitor/listener? (Were you aware of antlr?). Depending on what you're trying to do with a regex tokenizer, it might be suitable.
It also uses the file as a module, not a script, which means relative imports suddenly work, the root dir and the cwd are the same, and the cwd is added to sys.path.
This prevents a ton of import problems, albeit for the price of more verbose typing, especially since you don't have completion on dotted paths.
It is my favorite way of running my projects.
Unfortunately it means you can't use "-m pdb", and that's a big loss.
Not just relative imports but also (properly formed) absolute imports. For example, if you have a directory my_pkg/ with files mod1.py and mod2.py, then
    # In my_pkg/mod2.py
    import my_pkg.mod1
will work if you run `python -m my_pkg.mod2` but will fail if you run `python my_pkg/mod2.py`.
However, the script syntax does work properly with absolute imports if you set the environment variable `PYTHONPATH=.` (I don't know about relative imports - I don't use those). That would presumably allow pdb to work (but, shame on me, I've never tried it).
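Here is a small sketch that reproduces all three cases in a throwaway directory, so nobody has to take it on faith. The `run` helper, the `VALUE` constant and the empty `__init__.py` are my own additions for illustration; the package layout otherwise follows the example above.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run(root, args, pythonpath=None):
    """Run the current interpreter in `root` with a controlled PYTHONPATH (hypothetical helper)."""
    env = {k: v for k, v in os.environ.items()
           if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}
    if pythonpath is not None:
        env["PYTHONPATH"] = pythonpath
    return subprocess.run([sys.executable, *args], cwd=root, env=env,
                          capture_output=True, text=True)


with tempfile.TemporaryDirectory() as root:
    pkg = Path(root, "my_pkg")
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")  # assumption: a regular package
    (pkg / "mod1.py").write_text("VALUE = 42\n")
    (pkg / "mod2.py").write_text("import my_pkg.mod1\nprint(my_pkg.mod1.VALUE)\n")
    script = os.path.join("my_pkg", "mod2.py")

    as_module = run(root, ["-m", "my_pkg.mod2"])            # python -m my_pkg.mod2
    as_script = run(root, [script])                         # python my_pkg/mod2.py
    with_pythonpath = run(root, [script], pythonpath=".")   # PYTHONPATH=. python my_pkg/mod2.py

print("module:    ", as_module.returncode, as_module.stdout.strip())
print("script:    ", as_script.returncode, as_script.stderr.strip().splitlines()[-1])
print("PYTHONPATH:", with_pythonpath.returncode, with_pythonpath.stdout.strip())
```

The plain script run fails because the first `sys.path` entry is the `my_pkg/` directory itself, so there is no top-level `my_pkg` to import.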
I also develop and run projects this way. I really, really enjoy it. It's a very pleasant experience, on both the development side and execution side.
I'm relatively new to Python (used it for ~1 year in 2007/2008, again briefly in 2014 -- which is when I believe I picked this module trick up -- and then didn't touch it again until March of this year). It's made an impression on my team and we're all having a good time developing code this way. I do wonder, though, what other shortcomings might exist with this approach.
I've always been curious how that mechanism works exactly. What is it about that invocation technique that satisfies the relative imports? I think it changes the Python path somehow, in a way related to the module being run, something like appending the base dir where the module is saved to the PYTHONPATH?
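One way to poke at this yourself, as a rough sketch: run the same file both ways and have it report `sys.path[0]` and its `__spec__`. As far as I understand it, `-m` puts the current directory first on `sys.path` and gives the module a spec with its full dotted name, which is what relative imports resolve against. A plain script gets its own directory first and no spec at all. The `probe.py` file and `probe` helper are invented for this demo.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical probe module: reports what the interpreter set up for it.
PROBE = (
    "import json, os, sys\n"
    "print(json.dumps({\n"
    "    'path0': os.path.realpath(sys.path[0]),\n"
    "    'spec': __spec__.name if __spec__ else None,\n"
    "}))\n"
)

with tempfile.TemporaryDirectory() as tmp:
    root = os.path.realpath(tmp)
    pkg = Path(root, "my_pkg")
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "probe.py").write_text(PROBE)
    pkg_dir = str(pkg)
    env = {k: v for k, v in os.environ.items()
           if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}

    def probe(*args):
        out = subprocess.run([sys.executable, *args], cwd=root, env=env,
                             capture_output=True, text=True, check=True)
        return json.loads(out.stdout)

    as_module = probe("-m", "my_pkg.probe")
    as_script = probe(os.path.join("my_pkg", "probe.py"))

print("python -m my_pkg.probe   ->", as_module)
print("python my_pkg/probe.py   ->", as_script)
```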
Python 3.12 will include a SQLite CLI/REPL in the standard library too[0][1]. This is useful because most operating systems have sqlite3 and python3, but are missing the SQLite CLI.
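A quick way to try it, as a sketch: going by the 3.12 docs, the CLI takes an optional database filename and an optional SQL statement. The version guard is there because older interpreters don't ship it, and the query is just a placeholder.

```python
import subprocess
import sys

result = None
if sys.version_info >= (3, 12):
    # Equivalent to: python -m sqlite3 :memory: "SELECT 1 + 1"
    result = subprocess.run(
        [sys.executable, "-m", "sqlite3", ":memory:", "SELECT 1 + 1"],
        capture_output=True, text=True,
    )
    print(result.stdout)
else:
    print("python -m sqlite3 needs Python 3.12 or newer")
```

Leaving out the SQL argument should drop you into the interactive shell instead.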
Slightly related: Emacs also includes an SQLite client/viewer now. I find it funny to see everybody chasing the same need, and unsurprising, since it's always good to have SQLite close to you.
Lots of code in the stdlib does not have type annotations. Though I think most popular modules in the std either are annotated or have stubs somewhere.
Not sure why my comment is being downvoted. From the sqlite.org website:
> The SQLite project provides a simple command-line program named sqlite3 (or sqlite3.exe on Windows) that allows the user to manually enter and execute SQL statements against an SQLite database or against a ZIP archive.
I use http.server all the time, particularly as modern browsers disable a bunch of functionality if you open file URLs. Had no idea there was so much other stuff here!
Wait, how does that work? As far as I can see from the documentation, it can only serve on localhost, which to my understanding is only accessible from the single device it was launched on.
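On the localhost question: the docs say `python -m http.server` binds to all interfaces by default, and `--bind` is what restricts it to one address. The same choice shows up if you start the server from code. This is only a sketch: `QuietHandler` and the `hello.txt` file are made up, and it binds to loopback on purpose so it is safe to run anywhere.

```python
import functools
import tempfile
import threading
import urllib.request
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


class QuietHandler(SimpleHTTPRequestHandler):
    """Same as the stock handler, minus the per-request log lines."""
    def log_message(self, format, *args):
        pass


with tempfile.TemporaryDirectory() as root:
    Path(root, "hello.txt").write_text("hello\n")
    handler = functools.partial(QuietHandler, directory=root)

    # ("127.0.0.1", 0): loopback only, port chosen by the OS.
    # Binding to "0.0.0.0" instead would accept connections from other devices.
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Bypass any proxy settings so the request really goes to loopback.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(f"http://127.0.0.1:{port}/hello.txt", timeout=5) as resp:
            body = resp.read().decode()
    finally:
        server.shutdown()
        server.server_close()

print(body)
```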
I find the entire premise of the post to be pretty baffling.
> Seth pointed out this is useful if you are on Windows and don't have the gzip utility installed.
Okay, so instead of installing gzip (or just using the decompressors that aren't the official gzip utility but that do support the format and already ship with Windows by default[1]), you install Python...?
Even if the premise weren't muddy from the start, there is a language runtime and ubiquitous cross-platform API already available on all major desktops that has a really good, industrial strength sandbox/containerization strategy that more than adequately mitigates the security issues raised here so you can e.g. rot13 without fear all day to your delight: the browser.
iirc this is one of the things earmarked for a hypothetical Python 4, making -P the default. It's also one of the many relatively well-known (security) issues in Python that don't get addressed for a surprising amount of time. Others in the same vein would be stuff like stderr being block-buffered when not using a TTY, no randomized hashes for the longest time, loading DLLs preferably from the working directory, std* using 7-bit ASCII in a number of circumstances and many more.
As opposed to line buffered, I assume? That sounds annoying, but why is it a security problem?
> no randomized hashes
I'm not up to date, but I think last I looked, I had the impression that randomized hashes didn't seem like they would fundamentally prevent collision attacks, just require more sophistication. Is that not the case?
Doing that requires placement of files right in the directory where the user is likely to run that module.
Seems to be a quite rare vector for exploitation.
Sure, on a multiuser system I might trick some other user into running such a command in /tmp and prepare that directory accordingly, but other vectors seem more esoteric.
Even in Google's own repos. Starting any of those (no matter where they are stored) in a hostile repo would let the code in the repo take over the machine.
That applies if you run with -m, or with -c and an import of a package whose name matches the malicious package. It doesn't happen when you run your own script (located in another directory) that imports that package, even if you run it from that directory.
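A harmless way to see those cases side by side, as a sketch: drop a fake `calendar.py` into a scratch directory and run things from there. The choice of `calendar` and the `run_in` helper are arbitrary; the `-P` case is only exercised on Python 3.11+, where that flag exists.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in(directory, *args):
    """Run the current interpreter with `directory` as the cwd (hypothetical helper)."""
    env = {k: v for k, v in os.environ.items()
           if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}
    return subprocess.run([sys.executable, *args], cwd=directory, env=env,
                          capture_output=True, text=True)


with tempfile.TemporaryDirectory() as hostile_dir, \
        tempfile.TemporaryDirectory() as my_dir:
    # Stand-in for malicious code sitting in the current directory.
    Path(hostile_dir, "calendar.py").write_text('print("hijacked")\n')
    own_script = Path(my_dir, "tool.py")
    own_script.write_text('import calendar\nprint("ok")\n')

    with_m = run_in(hostile_dir, "-m", "calendar")          # local file wins
    with_c = run_in(hostile_dir, "-c", "import calendar")   # local file wins
    own = run_in(hostile_dir, str(own_script))              # stdlib wins
    with_p = (run_in(hostile_dir, "-P", "-m", "calendar")   # stdlib wins
              if sys.version_info >= (3, 11) else None)

print("-m:        ", with_m.stdout.strip())
print("-c:        ", with_c.stdout.strip())
print("own script:", own.stdout.strip())
```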
[0]: https://stackoverflow.com/a/693818/252218
[1]: https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization
Bummer, it's a cool feature, but I don't feel safe relying on undocumented features.
They regret including most modules… it seems they regret making python altogether instead of sticking with C? :D
re.Scanner looks more succinct though...!
https://docs.python.org/3/glossary.html#term-decorator
https://docs.python.org/3/reference/compound_stmts.html#func...
There now -- https://docs.python.org/3/library/re.html#match-objects
And they're not just regurgitated things from the web; I've had novel ones generated fine (obviously you still need to test them carefully).
You can also use either of these throughout your code so that you can have
each doing different, but hopefully somewhat related, things.
Just
    python -m pdb -m module
[0]: https://github.com/python/cpython/blob/3fb7c608e5764559a718c...
[1]: https://docs.python.org/3.12/library/sqlite3.html#command-li...
Not sure what you mean -- sqlite3 is the SQLite CLI.
zipfile
Decompress a zip file:
Compress a directory into a zip file:
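For reference, a sketch of both operations using the `-c` (create) and `-e` (extract) options that `python -m zipfile` documents, driven from Python so it runs the same everywhere. The `docs` directory, `docs.zip` and `out` are placeholder names.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as work:
    source = Path(work, "docs")
    source.mkdir()
    (source / "readme.txt").write_text("hello zip\n")
    archive = Path(work, "docs.zip")
    out_dir = Path(work, "out")

    # Compress a directory: python -m zipfile -c docs.zip docs
    created = subprocess.run(
        [sys.executable, "-m", "zipfile", "-c", str(archive), "docs"], cwd=work)

    # Decompress it: python -m zipfile -e docs.zip out
    extracted = subprocess.run(
        [sys.executable, "-m", "zipfile", "-e", str(archive), str(out_dir)], cwd=work)

    restored = (out_dir / "docs" / "readme.txt").read_text()

print(restored)
```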
On plasma it also installs a right click shortcut to share directory from dolphin.
www='python -m http.server'
I suspect it's related to relative path resource but never figured it out.
This is even more fun on MacOS if you combine it with the pbpaste/pbcopy utils:
That command will pretty-print any JSON in your clipboard, and write it back to the clipboard, so you can paste it somewhere else formatted!
Using modules (even if they are in the standard library) on the command line lets malicious code in the current dir take over your machine:
https://twitter.com/marekgibney/status/1598706464583028736
1. <https://news.ycombinator.com/item?id=36099528>
That is a feature, though, isn't it?
https://www.gnod.com/search/?engines=af&nw=1&q=python%20-m