SenseTime publishes technical blog on NEO-unify, exploring a native unified multimodal architecture

36Kr

2026.03.06 08:42

36Kr has learned that SenseTime, in collaboration with Nanyang Technological University, has released a preview version of NEO-unify, an end-to-end native architecture that abandons traditional visual encoders and variational autoencoders (VAEs) and learns directly from pixels and text. In image reconstruction tasks, it approaches the performance of the Flux VAE, and it scores 3.32 on image editing benchmarks. According to the research, the architecture allows understanding and generation to reinforce each other, and its training data efficiency is superior to that of existing solutions.