This looks highly reminiscent (though not exactly the same, pedants) of why people used to get excited about using SAX instead of DOM for XML parsing.
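For anyone who missed that era: DOM materializes the whole document as an in-memory tree, while SAX streams events through callbacks, so memory use stays flat. A minimal stdlib sketch (the file name is a placeholder):

import xml.sax

class TagCounter(xml.sax.ContentHandler):
    # Handles elements as they stream past; the tree is never built.
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1

handler = TagCounter()
xml.sax.parse("big.xml", handler)  # vs. xml.dom.minidom.parse("big.xml"),
print(handler.count)               # which allocates every node up front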
1. Inefficient parser implementation. It's just... very easy to allocate way too much memory if you don't think about large-scale documents, and very difficult to measure. A common problem with many (but not all) JSON parsers. (See the tracemalloc sketch after this list for one way to measure it.)
2. CPython's in-memory representation is large compared to compiled languages. So e.g. a 4-digit integer is 5-6 bytes in JSON, 8 in Rust if you use i64, and 28ish in CPython. An empty dictionary is 64 bytes. (The sys.getsizeof snippet below shows where those numbers come from.)
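One way to actually see point 1: the stdlib's tracemalloc reports the allocation peak during a parse, which is often a large multiple of the file size (the file name below is a placeholder):

import json
import tracemalloc

tracemalloc.start()
with open("big.json") as f:  # placeholder path
    data = json.load(f)  # parse the whole document in one go
current, peak = tracemalloc.get_traced_memory()
print(f"peak during parse: {peak / 2**20:.1f} MiB")
tracemalloc.stop()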
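And the per-object overhead from point 2 is easy to check directly (numbers from a recent 64-bit CPython; they vary a bit across versions):

>>> import sys
>>> sys.getsizeof(1234)  # cf. 5-6 bytes as JSON text, 8 as a Rust i64
28
>>> sys.getsizeof({})  # before you've stored anything in it
64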
>>> from dataclasses import dataclass
>>> @dataclass
... class C: pass
...
>>> C().x = 1
>>> @dataclass(slots=True)
... class D: pass
...
>>> D().x = 1
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    D().x = 1
    ^^^^^
AttributeError: 'D' object has no attribute 'x' and no __dict__ for setting new attributes
Most of the time this is not a thing you actually need to do.

The linked-from-original-article ijson article was the inspiration for the talk: https://pythonspeed.com/articles/json-memory-streaming/
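For anyone who skips the link: ijson parses incrementally and hands you one item at a time, so peak memory stays flat regardless of file size. A minimal sketch (the path and the handler are made up):

import ijson

with open("big.json", "rb") as f:  # placeholder path
    # "item" matches each element of a top-level JSON array;
    # only one element's objects are alive at a time.
    for record in ijson.items(f, "item"):
        handle(record)  # hypothetical per-record handler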
Poireau does this, IIRC, by putting the allocations it samples at distinctive memory addresses, so free() can recognize a sampled pointer on sight.
Sciagraph (https://sciagraph.com), a profiler I created, uses allocation size instead. If an allocation is chosen for sampling, the profiler makes sure its size is at least 16KiB; free() then assumes that any allocation of 16KiB or larger was sampled. That assumption can produce false positives, but it means free() needs nothing beyond a malloc_usable_size() call, which matters when you're calling free() on lots and lots of small allocations. A previous iteration used alignment as a heuristic, so that's another option.
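A toy version of that size heuristic, in Python purely for illustration (the real logic lives in a malloc/free interposer; only the 16KiB threshold comes from the description above, everything else is invented for the sketch):

SAMPLED_MIN = 16 * 1024  # blocks at least this big are assumed sampled

_usable_size = {}  # stand-in for malloc_usable_size()
sampled_live_bytes = 0

def traced_malloc(size, should_sample):
    global sampled_live_bytes
    if should_sample:
        size = max(size, SAMPLED_MIN)  # round up so free() can spot it later
        sampled_live_bytes += size
    block = bytearray(size)  # stand-in for the real allocation
    _usable_size[id(block)] = size
    return block

def traced_free(block):
    global sampled_live_bytes
    size = _usable_size.pop(id(block))
    if size >= SAMPLED_MIN:  # cheap check; occasionally a false positive
        sampled_live_bytes -= size
    # the real free() would happen here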
Our JSON parser, jiter (https://github.com/pydantic/jiter), already supports iterative parsing, so it's "just" a matter of solving the lifetimes in pydantic-core to validate as we parse.
This should make pydantic around 3x faster at parsing JSON and significantly reduce the memory overhead.
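To illustrate the validate-as-you-parse idea (this sketch uses ijson and today's pydantic API purely as an analogy; it is not how jiter or pydantic-core will actually wire it up):

import ijson
from pydantic import BaseModel

class User(BaseModel):  # hypothetical model
    id: int
    name: str

with open("users.json", "rb") as f:  # placeholder path
    for item in ijson.items(f, "item"):
        user = User.model_validate(item)  # validate one element at a time
        ...  # only one record's objects are alive at once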