There is a finite set of features you need to add to make this an "excellent" video editor. They will take time to build, and most people who eventually use this project may never use them at all.
The best way to guarantee ROI is to do things that users need immediately, like captions. Salesforce isn't in the finite set of things a video editor needs, so that request will be rejected.
["Plane" in the following refers to the image/sensor plane of the camera.] My understanding is that with both an in-plane and a plane-normal component of translation, together with enough 3D rotation to anchor/calibrate the gyro, the 3D accelerometer's absolute scale can be transferred to the Structure-from-Motion feature/anchor points of a static, fixed-to-earth scene (a swaying skyscraper doesn't count, and neither does being on a train). In-plane translation alone just gives you parallax, which tells you the _distance ratios_ of two objects that shift against one another as you translate in-plane; once you add plane-normal translation, an absolute translation adds to both objects' distances, letting you recover absolute scale and not just distance ratios. Of course you'd hope for suitably good features that are dense enough in the scene. You would start out with optical flow or something similar to get a baseline for gyro calibration and for zeroing the (translational/linear) velocity, and then have a decent shot at using the SfM point features with very little "brute force" in the SfM alignment/solution process.
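To make the scale-transfer step concrete, here is a minimal sketch, not a full visual-inertial pipeline. It assumes things the comment above does not spell out: that the accelerometer readings have already been gravity-compensated and double-integrated into metric displacements, that those displacements are expressed in the same frame as the SfM camera displacements, and that both cover the same time intervals. The function name `estimate_metric_scale` and the sample numbers are hypothetical. Given those assumptions, the unknown SfM scale is just a one-parameter least-squares fit.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    """Plain 3D dot product."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def estimate_metric_scale(sfm_disp: Sequence[Vec3], imu_disp: Sequence[Vec3]) -> float:
    """Least-squares scale s minimizing sum_i |imu_i - s * sfm_i|^2.

    sfm_disp: camera displacements from SfM, in arbitrary (unscaled) units.
    imu_disp: displacements over the same intervals from the accelerometer,
              in metres (assumed already integrated and frame-aligned).
    """
    if len(sfm_disp) != len(imu_disp):
        raise ValueError("need one IMU displacement per SfM displacement")
    num = sum(dot(m, u) for u, m in zip(sfm_disp, imu_disp))
    den = sum(dot(u, u) for u in sfm_disp)
    if den == 0.0:
        raise ValueError("SfM displacements are all zero; scale is unobservable")
    return num / den


# Hypothetical data: motion with in-plane (x) and plane-normal (z) components.
sfm_disp = [(1.0, 0.0, 0.0), (0.0, 0.0, 2.0), (0.5, 0.0, 0.5)]
true_scale = 0.25  # metres per SfM unit, used only to synthesize the IMU side
imu_disp = [(true_scale * x, true_scale * y, true_scale * z) for x, y, z in sfm_disp]

scale = estimate_metric_scale(sfm_disp, imu_disp)

# Any SfM depth can then be converted to metres with the same factor.
sfm_depths = [4.0, 10.0]
metric_depths = [scale * d for d in sfm_depths]
print(scale, metric_depths)
```

With real data the IMU side is noisy and drifts quickly, so a fit like this would only be trusted over short windows, and typically after the gyro calibration and velocity zeroing mentioned above.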
AR interactivity lets you direct the camera operator to collect appropriately dense coverage of the scene/area before conditions change (illumination, plant growth, people moving furniture, people moving back in to actively use/occupy the space). After that, the software can refine the entire reconstruction as a background task, ideally with the interactive AR compute resources redirected to that refinement while it runs. Once sufficient refinement has been done, one could quickly lock back onto the now-static scene and render the captured version anchored to real-time camera feedback from the real location, which would practically eliminate the traditionally annoying drift and tracking artifacts, at least in places with enough light for the camera to track non-blurry views of the reference features despite the obviously interactive motion.
I feel like for basic interactions like dragging, it is better if the user does it by hand. AI can handle complicated workflows like removing silences, quickly removing unwanted background elements, etc.
*This is text-based, but I hope you get the point.
Most of the community wants auto captions and color grading, so that goes first.