Until recently, most robots operated in highly controlled, pre-mapped spaces. Photo: Yucel Yilmaz

Droids walk the talk at a good clip

Wednesday, 28 February, 2024 - 14:00

The Star Wars droid, with its stiff walk and awkward manner, seems like a good fit for the strangely human-yet-not-quite-human interactions we have with modern chatbots.

But to actually interact in the real world, our droid will need more than just good manners. It will need to understand the world and what it can do in it.

There are many examples of humanoid robots that appear to fit the bill. The best known is the Atlas robot developed by Boston Dynamics. It’s a large humanoid often seen in YouTube videos performing backflips or leaping through obstacle courses.

Elsewhere, Tesla’s Optimus humanoid robot has been seen moving objects and waving to audiences, while Agility Robotics has a bipedal robot that moves parcels in warehouses.

However, many of these robots are operating in highly controlled, pre-mapped spaces.

If we want a C-3PO robot that can land on a moon and evade capture by an evil empire, we will need a robot that can operate anywhere.

Just say it

The amazing thing about chatbots is that they can be asked anything and will usually return a reasonable-sounding result.

This is the holy grail for an artificial intelligence system, as it demonstrates an ability to cope with the unpredictability of the real world. The problem is that text is not always a good way to describe something; sometimes a picture is much better.

Fortunately, researchers have solved this problem with an AI technique called CLIP (Contrastive Language–Image Pre-training). If you’ve ever asked an AI to make a picture for you, then it’s probably using the CLIP system in some way.

The amazing thing about CLIP is that it can work in reverse: instead of making a picture from text, it can produce a description of a picture. This description is called an ‘embedding’, a list of numbers that sits one step before text and can be more easily used by AI systems. With these image embeddings you can now compare voice commands such as ‘find Obi-Wan Kenobi’ with images of what your robot sees.

Using CLIP, a robot can understand what is in any picture coming from its camera.
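To make that concrete, here is a minimal sketch of the comparison step, using the publicly released CLIP weights through the Hugging Face transformers library. The camera frame file name and the candidate descriptions are made up for illustration:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the publicly released CLIP weights
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# One frame from the robot's camera (file name is illustrative)
image = Image.open("camera_frame.jpg")

# Candidate descriptions of what the robot might be seeing
texts = [
    "a photo of Obi-Wan Kenobi",
    "a photo of a golden droid",
    "an empty corridor",
]

# CLIP embeds the image and each text into the same space and scores them
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image.softmax(dim=-1)

for text, score in zip(texts, scores[0]):
    print(f"{score:.2f}  {text}")
```

The highest-scoring description is CLIP’s best guess at what the camera is looking at, and that single trick is what lets a voice command be matched against a live camera feed.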

Translation

Even if a robot can understand language and what it sees, it still needs to be able to move.

An interesting property of chatbots and the large language models they are built upon is that they can translate between different languages such as Japanese and English. Recent work by Google, named RT-1 (Robotics Transformer 1), asked ‘what if movement were just another language?’. The result was surprising.

The RT-1 system was able to translate an instruction like ‘pick up rice chips from top drawer onto counter’ into the necessary robot movements to perform that task. The system was focused on pick-and-place tasks in a kitchen-style environment but was able to achieve up to 67 per cent accuracy, which is a remarkable result.
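The core trick is treating motion as a vocabulary. Below is a rough sketch of that idea: RT-1 itself chops each continuous action dimension into 256 bins so a movement becomes a short ‘sentence’ of discrete tokens, though the action dimensions and ranges here are made up for illustration:

```python
import numpy as np

# RT-1-style discretisation: each action dimension becomes one of
# N_BINS tokens, so a language-style model can predict movements the
# same way it predicts words. The bin count matches the RT-1 paper;
# the dimension names and ranges below are illustrative.
N_BINS = 256

def action_to_tokens(action, low, high):
    """Discretise each action dimension into one of N_BINS tokens."""
    scaled = (np.clip(action, low, high) - low) / (high - low)  # 0..1
    return (scaled * (N_BINS - 1)).round().astype(int)          # 0..255

def tokens_to_action(tokens, low, high):
    """Invert the discretisation (exact up to the bin width)."""
    return low + tokens / (N_BINS - 1) * (high - low)

# Hypothetical 3-dimensional action: move x, move y, close gripper
low = np.array([-0.5, -0.5, 0.0])
high = np.array([0.5, 0.5, 1.0])
action = np.array([0.12, -0.30, 1.0])

tokens = action_to_tokens(action, low, high)
print(tokens)                               # the movement as "words"
print(tokens_to_action(tokens, low, high))  # roughly the original action
```

Once movements look like words, the same machinery that translates Japanese into English can ‘translate’ an instruction into a stream of motor commands.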

Ok Robot

An exciting new robot demonstration spread through the internet last month.

The New York University-based team showcased a robot that could do a quick scan of any bedroom and then accept open-vocabulary commands.

The ‘Ok Robot’ system was able to start anywhere in a bedroom and complete tasks such as ‘move the soda can to the box’ with up to 82 per cent accuracy.

The best thing about this system is that it was an engineering solution, combining existing AI systems and robotics algorithms rather than training everything from scratch.
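As a toy illustration of that glue-code pattern, the sketch below matches an open-vocabulary command against objects recorded during a room scan and hands the winner to a notional navigation-and-grasp step. The bag-of-words embed() is a stand-in for a real encoder such as CLIP, and none of this is Ok Robot’s actual code:

```python
from collections import Counter
import math

def embed(text):
    """Stand-in text encoder: a simple bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

# Objects found during the scan: label -> position in the room (metres)
scanned = {
    "soda can": (1.2, 0.4),
    "cardboard box": (3.0, 1.1),
    "bedside lamp": (0.2, 2.5),
}

def run_command(command):
    query = embed(command)
    # Resolve the object to grab (a real system would also resolve the
    # destination and call separate navigation and grasping modules)
    target = max(scanned, key=lambda label: cosine(embed(label), query))
    print(f"navigate to {scanned[target]}, then grasp the {target}")

run_command("move the soda can to the box")
```

The point is the architecture, not the parts: swap in a real encoder, a navigation stack and a grasping module, and the same few lines of glue hold the whole system together.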

Modern droids

So, is it possible to build a C-3PO with modern tools?

In the movies, C-3PO and R2-D2 act like fully conscious robot humans and are able to think and plan. Our modern robots aren’t that sophisticated, but demonstrations like Ok Robot take us closer to having household droids that can understand us and do what we ask.

• John Vial has a PhD in robotics and has spent the past several years leading teams in major Perth businesses focused on AI and robotics