This year’s Bay Area Robotics Symposium (BARS) brought together leading researchers and students from the University of California, Berkeley, the University of California, Davis, the University of California, Santa Cruz and Stanford University. The short talks touched upon many aspects of robotics across models, learning methods, scenarios and embodiments.
My observations focus solely on the software aspects of robotics. It is important to remember that academic research does not, of course, cover every aspect of how robots are used in the real world. And among the topics presented, it is not always obvious at this moment which technologies will be game changing, which will remain features, and which will disappear into oblivion.
What is clear is the unrelenting speed of development. The term Vision Language Model (VLM) emerged and gained prominence in 2022, and Vision Language Action (VLA) model was coined more recently by Google DeepMind in July 2023. If, like Sleeping Beauty, a roboticist had gone to sleep in 2020 and woken up just in time for the conference, she would not recognize today's world and would indeed consider it a fairy tale.
Three themes stood out across research groups
1️⃣ Vision Language Action Models Are Becoming the New Robotics Stack
Next-generation VLAs now coordinate perception, reasoning, and control. They can plan, adjust, and correct themselves at inference time using techniques like test-time action sampling (RoboMonkey), memory retrieval (MemER), and affordance reasoning (LITEN).
Implication: Robots are shifting from fixed pipelines to inference-time intelligence.
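To make the idea of test-time action sampling concrete, here is a minimal Python sketch. It is not the RoboMonkey method itself; the policy, verifier, and observation format are all placeholder assumptions, and only the sample-then-score-then-select structure is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_candidate_actions(observation, num_samples=16, action_dim=7):
    """Stand-in for a VLA policy head: draw several candidate actions.

    A real policy would condition on the image/language observation; here we
    simply sample from a Gaussian to keep the sketch self-contained.
    """
    return rng.normal(loc=0.0, scale=1.0, size=(num_samples, action_dim))

def score_actions(observation, candidates):
    """Stand-in for a learned verifier or value model.

    It assigns each candidate a scalar score (higher = more likely to succeed);
    here we fake it with a distance-to-zero heuristic.
    """
    return -np.linalg.norm(candidates, axis=1)

def select_action(observation):
    """Test-time action sampling: sample many actions, keep the best-scored one."""
    candidates = sample_candidate_actions(observation)
    scores = score_actions(observation, candidates)
    return candidates[np.argmax(scores)]

action = select_action(observation={"image": None, "instruction": "pick up the cup"})
print(action)
```

The design point is that the extra compute is spent at inference time: the same frozen policy becomes more reliable simply by proposing several actions and letting a verifier pick among them.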
2️⃣ Synthetic and Transformed Human Data Is Replacing Robot Teleoperation
A major shift is underway: instead of collecting thousands of robot demonstrations, researchers are turning human data into robot-friendly training material. This includes retargeted human motion (LeVerB), edited human videos (Masquerade), and fast, mocap-free data collection (TWIST2 or Real2Render2Real).
Implication: Robots no longer need endless teleoperation—they can learn from people, videos, and synthetic versions of both. The key constraint is no longer data collection, but data transformation and representation.
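As a rough illustration of what "data transformation" can mean here, the sketch below retargets a human wrist trajectory into a robot workspace and converts it into (state, action) pairs for imitation learning. The scaling, offset, and delta-action formulation are illustrative assumptions, not the actual pipelines of LeVerB, Masquerade, TWIST2, or Real2Render2Real.

```python
import numpy as np

def retarget_wrist_trajectory(human_wrist_xyz, scale=0.8,
                              robot_base_offset=np.array([0.3, 0.0, 0.1])):
    """Map a human wrist trajectory of shape (T, 3) into the robot's workspace.

    A real retargeting pipeline solves for joint angles under the robot's
    kinematics and limits; this sketch only rescales and shifts Cartesian
    positions to illustrate turning human motion into robot targets.
    """
    return human_wrist_xyz * scale + robot_base_offset

def to_training_pairs(ee_targets):
    """Turn consecutive end-effector targets into (state, action) pairs,
    where the action is the Cartesian delta to the next target."""
    states = ee_targets[:-1]
    actions = ee_targets[1:] - ee_targets[:-1]
    return states, actions

# Toy "human demonstration": 50 timesteps of wrist positions in meters.
human_demo = np.cumsum(
    np.random.default_rng(1).normal(scale=0.01, size=(50, 3)), axis=0)
states, actions = to_training_pairs(retarget_wrist_trajectory(human_demo))
print(states.shape, actions.shape)  # (49, 3) (49, 3)
```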
3️⃣ Hybrid Simulation Is the Only Path Through the Reality Gap
Physics engines provide controllability; video models provide realism; world models provide long-horizon prediction. The emerging direction synthesizes all three.
Implication: Future training pipelines will depend on composite simulators, not a single dominant tool.
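A composite simulator along these lines might expose an interface like the following sketch, where a physics engine supplies controllable dynamics, a learned renderer supplies realistic observations, and a world model supplies cheap long-horizon imagination. All three components here are toy stand-ins, used only to show how the pieces could be composed.

```python
import numpy as np

rng = np.random.default_rng(2)

class PhysicsEngine:
    """Stand-in for a rigid-body simulator: exact, controllable dynamics."""
    def step(self, state, action):
        return state + 0.05 * action  # toy integrator

class LearnedRenderer:
    """Stand-in for a video model that turns states into realistic frames."""
    def render(self, state):
        return rng.normal(loc=state.mean(), scale=0.1, size=(64, 64, 3))

class WorldModel:
    """Stand-in for a learned dynamics model used for long-horizon prediction."""
    def rollout(self, state, actions):
        states = [state]
        for a in actions:
            states.append(states[-1] + 0.05 * a + rng.normal(scale=0.01, size=state.shape))
        return np.stack(states)

class CompositeSimulator:
    """Hybrid simulation: physics for dynamics, a learned renderer for realism,
    and a world model for long-horizon lookahead."""
    def __init__(self):
        self.physics = PhysicsEngine()
        self.renderer = LearnedRenderer()
        self.world_model = WorldModel()

    def step(self, state, action):
        next_state = self.physics.step(state, action)
        frame = self.renderer.render(next_state)
        return next_state, frame

    def imagine(self, state, planned_actions):
        return self.world_model.rollout(state, planned_actions)

sim = CompositeSimulator()
state = np.zeros(6)
state, frame = sim.step(state, np.ones(6))
future = sim.imagine(state, [np.ones(6)] * 10)
print(frame.shape, future.shape)  # (64, 64, 3) (11, 6)
```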