1. SAM 2 was trained on 256 A100 GPUs for 108 hours (SAM 1 was 68 hrs on the same cluster). That's 256 × 108 ≈ 27.6k GPU-hours; taking the upper-end ~$2/hr A100 price off gpulist, SAM 2 cost ~$55k to train (napkin math right after this list) - surprisingly cheap for adding video understanding?
2. new dataset: the new SA-V dataset is "only" ~51k videos, with careful attention to scene/object/geographic diversity, including that of the annotators. I wonder if LAION or Datacomp (AFAICT the only other real players in the open image data space) can reach this standard...
3. bootstrapped annotation: similar to SAM 1, a three-phase approach: 16k initial annotations across 1.4k videos were then expanded by 63k and then 197k more with SAM 1/SAM 2 assistance, with annotation time accelerating dramatically (~89% faster per frame than with SAM 1 alone) by the end
4. memory attention: SAM 2 is a transformer with memory across frames! Special "object pointer" tokens are stored alongside per-frame spatial memories in a "memory bank" - a FIFO queue of recent frames plus the prompted frames. Has this been explored in language models? whoa? (toy sketch at the bottom)
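Napkin math for point 1 - the only assumed input is gpulist's ~$2/hr A100 rate, everything else is from the paper:

```python
# Back-of-envelope SAM 2 training cost (my estimate, not Meta's accounting).
gpus, hours, usd_per_gpu_hour = 256, 108, 2.00  # ~$2/hr is the assumed rate
gpu_hours = gpus * hours                        # 27,648 GPU-hours
print(f"{gpu_hours:,} GPU-hours -> ~${gpu_hours * usd_per_gpu_hour:,.0f}")
# 27,648 GPU-hours -> ~$55,296
```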
(written up in https://x.com/swyx/status/1818074658299855262)
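On point 4, here's a toy sketch of how I picture the memory bank working - this is NOT the official SAM 2 code; `MemoryBank`, `MemoryAttention`, `max_recent`, and all shapes/names are my own guesses at the idea (pinned prompted frames + a FIFO of recent frames, with the current frame cross-attending to spatial memories and object pointer tokens):

```python
# Toy sketch of a FIFO memory bank with object pointer tokens.
# All class/parameter names here are hypothetical, not from the SAM 2 repo.
from collections import deque

import torch
import torch.nn as nn


class MemoryBank:
    def __init__(self, max_recent: int = 6):
        self.recent = deque(maxlen=max_recent)  # FIFO: oldest frames fall off
        self.prompted = []                      # prompted frames stay pinned

    def add(self, memory: torch.Tensor, object_ptr: torch.Tensor,
            prompted: bool = False) -> None:
        (self.prompted if prompted else self.recent).append((memory, object_ptr))

    def tokens(self) -> torch.Tensor:
        # Flatten spatial memories + object pointer tokens into one sequence.
        parts = []
        for mem, ptr in list(self.prompted) + list(self.recent):
            parts.append(mem)                # (H*W, D) spatial memory tokens
            parts.append(ptr.unsqueeze(0))   # (1, D) object pointer token
        return torch.cat(parts, dim=0)


class MemoryAttention(nn.Module):
    """Current-frame features cross-attend to everything in the bank."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, bank: MemoryBank) -> torch.Tensor:
        mem = bank.tokens().unsqueeze(0)     # (1, M, D) memory as keys/values
        q = frame_feats.unsqueeze(0)         # (1, H*W, D) current frame as queries
        out, _ = self.attn(q, mem, mem)
        return (q + out).squeeze(0)          # residual, memory-conditioned feats


# Usage: per frame, condition on the bank, then push this frame's memory.
bank, attn = MemoryBank(max_recent=6), MemoryAttention()
for t in range(10):
    feats = torch.randn(16 * 16, 256)        # stand-in for image-encoder output
    if bank.recent or bank.prompted:
        feats = attn(feats, bank)
    # Frame 0 plays the role of the user-prompted frame and stays pinned.
    bank.add(memory=feats.detach(), object_ptr=feats.mean(0), prompted=(t == 0))
```

The interesting design choice (if I'm reading the paper right) is that prompted frames never get evicted, so the model can always attend back to the frame where the user actually clicked, while ordinary frames age out of the FIFO.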