Monday, December 15, 2025

Three Themes in Robotics Research in 2025

This year’s Bay Area Robotics Symposium (BARS) brought together leading researchers and students from the University of California, Berkeley, the University of California, Davis, the University of California, Santa Cruz, and Stanford University. The short talks touched on many aspects of robotics, spanning models, learning methods, scenarios, and embodiments.

My observations focus solely on the software side of robotics. It is important to remember that academic research does not, of course, cover every aspect of how robots are used in the real world. And among these topics, it is not always obvious at the time which technologies will be game changing, which will remain features, and which will fade into oblivion.

What is clear is the unrelenting speed of development. The term Vision Language Model (VLM) emerged and gained prominence in 2022, and the term Vision Language Action (VLA) model was coined more recently by Google DeepMind in July 2023. If, like Sleeping Beauty, a roboticist had gone to sleep in 2020 and woken up just in time for the conference, she would not recognize today’s world and would indeed consider it a fairy tale.




๐˜›๐˜ฉ๐˜ณ๐˜ฆ๐˜ฆ ๐˜ต๐˜ฉ๐˜ฆ๐˜ฎ๐˜ฆ๐˜ด ๐˜ด๐˜ต๐˜ฐ๐˜ฐ๐˜ฅ ๐˜ฐ๐˜ถ๐˜ต ๐˜ข๐˜ค๐˜ณ๐˜ฐ๐˜ด๐˜ด ๐˜ณ๐˜ฆ๐˜ด๐˜ฆ๐˜ข๐˜ณ๐˜ค๐˜ฉ ๐˜จ๐˜ณ๐˜ฐ๐˜ถ๐˜ฑ๐˜ด

1️⃣ Vision Language Action Models Are Becoming the New Robotics Stack

Next-generation VLAs now coordinate perception, reasoning, and control. They can plan, adjust, and correct themselves at inference time using techniques like test-time action sampling (RoboMonkey), memory retrieval (MemER), and affordance reasoning (LITEN).

Implication: Robots are shifting from fixed pipelines to inference-time intelligence.
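To make the test-time action sampling idea concrete, here is a minimal best-of-N sketch in the spirit of RoboMonkey: sample several candidate actions from the policy, score them with a learned verifier, and execute the best one. The functions sample_actions and score_action are hypothetical stand-ins for a VLA policy head and a verifier, not the actual RoboMonkey interfaces.

```python
# Minimal sketch of test-time action sampling (best-of-N).
# The policy and verifier below are toy stand-ins for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def sample_actions(observation: np.ndarray, num_samples: int) -> np.ndarray:
    """Stand-in for a VLA policy head: draw several candidate 7-DoF actions."""
    mean = np.tanh(observation[:7])  # pretend the policy conditions on the observation
    return mean + 0.1 * rng.standard_normal((num_samples, 7))

def score_action(observation: np.ndarray, action: np.ndarray) -> float:
    """Stand-in for a learned verifier / reward model that ranks candidates."""
    return -float(np.linalg.norm(action - np.tanh(observation[:7])))

def select_action(observation: np.ndarray, num_samples: int = 16) -> np.ndarray:
    """Best-of-N selection at inference time: sample, score, keep the top candidate."""
    candidates = sample_actions(observation, num_samples)
    scores = [score_action(observation, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

obs = rng.standard_normal(32)  # dummy observation embedding
print(select_action(obs))
```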



2️⃣ Synthetic and Transformed Human Data Is Replacing Robot Teleoperation

A major shift is underway: instead of collecting thousands of robot demonstrations, researchers are turning human data into robot-friendly training material. This includes retargeted human motion (LeVerB), edited human videos (Masquerade), and fast, mocap-free data collection (TWIST2 or Real2Render2Real).

Implication: Robots no longer need endless teleoperation—they can learn from people, videos, and synthetic versions of both. The key constraint is no longer data collection, but data transformation and representation.
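As a rough illustration of the data-transformation step, the sketch below maps human joint angles into a robot’s joint space and turns the result into behaviour-cloning pairs. The joint mapping, limits, and data format are invented for illustration and do not correspond to LeVerB, Masquerade, TWIST2, or Real2Render2Real specifically.

```python
# Minimal sketch of turning retargeted human motion into robot training pairs.
# The joint mapping and limits are hypothetical, not any specific robot or dataset.
import numpy as np

ROBOT_JOINT_LIMITS = np.array([  # assumed (low, high) limits for a 5-joint arm, radians
    [-2.9, 2.9], [-1.8, 1.8], [-2.9, 2.9], [-3.0, 0.1], [-2.9, 2.9]
])

def retarget_frame(human_angles: np.ndarray) -> np.ndarray:
    """Map human joint angles to the robot's joint space: linear scale + clip to limits."""
    scale = 0.8  # crude morphology correction (assumption)
    robot_angles = scale * human_angles[: len(ROBOT_JOINT_LIMITS)]
    return np.clip(robot_angles, ROBOT_JOINT_LIMITS[:, 0], ROBOT_JOINT_LIMITS[:, 1])

def trajectory_to_pairs(human_trajectory: np.ndarray):
    """Convert a human motion trajectory into behaviour-cloning (state, action) pairs."""
    robot_traj = np.stack([retarget_frame(f) for f in human_trajectory])
    # Current joint state -> next joint state as the action target.
    return list(zip(robot_traj[:-1], robot_traj[1:]))

human_motion = np.cumsum(0.05 * np.random.default_rng(1).standard_normal((50, 8)), axis=0)
pairs = trajectory_to_pairs(human_motion)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```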



3️⃣ ๐—›๐˜†๐—ฏ๐—ฟ๐—ถ๐—ฑ ๐—ฆ๐—ถ๐—บ๐˜‚๐—น๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—œ๐˜€ ๐˜๐—ต๐—ฒ ๐—ข๐—ป๐—น๐˜† ๐—ฃ๐—ฎ๐˜๐—ต ๐—ง๐—ต๐—ฟ๐—ผ๐˜‚๐—ด๐—ต ๐˜๐—ต๐—ฒ ๐—ฅ๐—ฒ๐—ฎ๐—น๐—ถ๐˜๐˜† ๐—š๐—ฎ๐—ฝ

Physics engines provide controllability; video models provide realism; world models provide long-horizon prediction. The emerging direction synthesizes all three.

Implication: Future training pipelines will depend on composite simulators, not a single dominant tool.
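One way to picture such a composite simulator is an analytic physics step with a learned correction layered on top, so rollouts stay controllable while gaining realism. Both components in the sketch below are toy stand-ins, not any particular engine, video model, or world model.

```python
# Minimal sketch of a composite simulator: analytic physics step + learned residual.
# Both functions are toy stand-ins for illustration only.
import numpy as np

rng = np.random.default_rng(2)

def physics_step(state: np.ndarray, action: np.ndarray, dt: float = 0.02) -> np.ndarray:
    """Toy analytic dynamics: damped integration of the commanded action (physics-engine stand-in)."""
    return 0.99 * state + dt * action

def learned_residual(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Stand-in for a learned world model that predicts what the analytic step gets wrong."""
    return 0.01 * np.tanh(state * action)  # fixed fake correction for illustration

def hybrid_step(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Composite simulator step: controllable physics core plus learned correction for realism."""
    return physics_step(state, action) + learned_residual(state, action)

state = np.zeros(4)
for _ in range(100):  # roll out a short horizon
    action = rng.uniform(-1.0, 1.0, size=4)
    state = hybrid_step(state, action)
print(state)
```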




This post was first published on LinkedIn in November 2025.
