AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors

International Conference on Learning Representations (ICLR) 2025

Ruoxuan Feng1, Jiangyu Hu2,3, Wenke Xia1, Tianci Gao1, Ao Shen1, Yuhao Sun3, Bin Fang3, Di Hu1
1Renmin University of China     2Wuhan University of Science and Technology
3Beijing University of Posts and Telecommunications

Overview



Tactile perception is crucial for humans to understand the physical world. Over the years, various visuo-tactile sensors have been designed to endow robots with human-like tactile perception abilities. However, the low standardization of visuo-tactile sensors has hindered the development of powerful tactile perception systems. In this work, we present TacQuad, an aligned multi-modal multi-sensor tactile dataset that enables the explicit integration of data from multiple sensors. Building on this foundation and other open-source tactile datasets, we propose learning unified representations from both static and dynamic perspectives to accommodate a range of tasks. We introduce AnyTouch, a unified static-dynamic multi-sensor tactile representation learning framework with a multi-level architecture that enables comprehensive static and real-world dynamic tactile perception.

TacQuad: Aligned Multi-Modal Multi-Sensor Tactile Dataset



TacQuad is an aligned multi-modal multi-sensor tactile dataset collected from 4 types of visuo-tactile sensors (GelSight Mini, DIGIT, DuraGel, and Tac3D). It offers a more comprehensive solution to the low standardization of visuo-tactile sensors by providing multi-sensor aligned data with text and visual images. This explicitly enables models to learn semantic-level tactile attributes and sensor-agnostic features, forming a unified multi-sensor representation space in a data-driven manner. The dataset includes two subsets of paired data at different levels of alignment (a rough loading sketch follows the list):

  • Fine-grained spatio-temporal aligned data: This portion of the data was collected by pressing the same location on the same object at the same speed with each of the four sensors. It contains a total of 17,524 contact frames from 25 objects and can be used for fine-grained tasks such as cross-sensor generation.
  • Coarse-grained spatial aligned data: This portion of the data was collected by hand, with the four sensors pressing the same location on the same object, although temporal alignment is not guaranteed. It contains 55,082 contact frames from 99 objects, covering both indoor and outdoor scenes, and can be used for the cross-sensor matching task.
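
As a rough illustration of how the aligned subsets might be consumed, the sketch below pairs contact frames from different sensors that correspond to the same press. The directory layout, sensor folder names, and file extension are assumptions for illustration, not the released TacQuad format.

```python
# Hypothetical sketch: pairing aligned contact frames across the four TacQuad
# sensors. Directory layout and file naming are assumed, not the released format.
from pathlib import Path
from itertools import combinations

SENSORS = ["gelsight_mini", "digit", "duragel", "tac3d"]

def aligned_pairs(root: str, subset: str = "fine_grained"):
    """Yield (sensor_a, frame_a, sensor_b, frame_b) for frames capturing the
    same press on the same object with two different sensors."""
    base = Path(root) / subset
    for press_dir in sorted(base.iterdir()):               # one directory per press
        frames = {s: sorted((press_dir / s).glob("*.png")) for s in SENSORS}
        for sensor_a, sensor_b in combinations(SENSORS, 2):
            for frame_a, frame_b in zip(frames[sensor_a], frames[sensor_b]):
                yield sensor_a, frame_a, sensor_b, frame_b
```

Pairs produced this way can serve as positives for the cross-sensor matching task described below.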

AnyTouch Model



AnyTouch is a unified static-dynamic multi-sensor tactile representation learning framework that takes both tactile images and videos as input. Through a multi-level architecture, it learns fine-grained pixel-level details for refined tasks as well as semantic-level, sensor-agnostic features for understanding tactile properties and building a unified representation space:

  • Masked Image/Video Modeling: To enhance the fine-grained perception capabilities of the tactile representation model, we employ a masked autoencoder, compelling the model to capture pixel-level details across multiple sensors. We randomly mask the tokens of both tactile images and videos and use a decoder to reconstruct the static images and dynamic videos. We also introduce an additional task of predicting the next frame while reconstructing the dynamic video (a minimal masking sketch follows this list).
  • Multi-Modal Aligning: We use multi-modal alignment to bind data from various sensors with their paired modalities, enabling more comprehensive semantic-level perception and reducing perceptual differences between sensors. We select the text modality, which describes tactile attributes consistently across datasets, as the anchor for aligning touch, vision, and text. The module is also compatible with missing modalities.
  • Cross-Sensor Matching: To fully exploit the multi-sensor aligned data and build a unified space by clustering multi-sensor tactile representations of the same object, we introduce a novel cross-sensor matching task: the model must determine whether two tactile images or videos were collected from the same position on the same object. By clustering representations of the same tactile information from different sensors while performing multi-modal alignment, we enhance the learning of sensor-agnostic features and form a unified multi-sensor representation space (a sketch of the alignment and matching objectives appears after the masking sketch below).
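
A minimal sketch of the random token masking step, in the spirit of a standard masked autoencoder; the tensor shapes and the 75% mask ratio are illustrative assumptions rather than the exact AnyTouch configuration.

```python
# Minimal sketch of random token masking for a masked autoencoder.
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: (B, N, D) patch tokens from a tactile image or video clip.
    Returns the visible tokens and the indices needed to restore token order."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)       # one score per token
    ids_shuffle = noise.argsort(dim=1)                    # random permutation
    ids_restore = ids_shuffle.argsort(dim=1)              # inverse permutation
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_restore                           # decoder fills in masked tokens
```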
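
The semantic-level objectives can be sketched as a symmetric contrastive loss against the text anchor plus a binary cross-sensor matching head. The embedding size, temperature, and two-layer head below are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of the semantic-level objectives: contrastive touch-text
# alignment plus a binary cross-sensor matching head.
import torch
import torch.nn.functional as F

def align_loss(touch_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized touch and text embeddings of shape (B, D)."""
    touch_emb = F.normalize(touch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = touch_emb @ text_emb.t() / temperature
    targets = torch.arange(touch_emb.size(0), device=touch_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

class CrossSensorMatchHead(torch.nn.Module):
    """Predicts whether two tactile embeddings come from the same press on the
    same object, even when captured by different sensors."""
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2 * dim, dim), torch.nn.GELU(), torch.nn.Linear(dim, 1))

    def forward(self, emb_a, emb_b):
        return self.mlp(torch.cat([emb_a, emb_b], dim=-1)).squeeze(-1)   # logits

def match_loss(head, emb_a, emb_b, same_press):
    """same_press: float tensor of 0/1 labels for each embedding pair."""
    return F.binary_cross_entropy_with_logits(head(emb_a, emb_b), same_press)
```

Under this scheme, touch-vision alignment would use the same contrastive form against the text anchor, and positive matching pairs would be drawn from the aligned TacQuad subsets.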

Experiments

Sensor Transferability

We incorporate data from GelSight, GelSlim, DIGIT, and GelSight Mini into the training of AnyTouch to obtain four different models, and compare them across four downstream tasks. We observe performance improvements across the three unseen datasets, with greater gains for unseen sensors than for seen ones. This suggests that knowledge from the GelSlim, DIGIT, and GelSight Mini data can transfer to GelSight and other sensors.

Multi-Sensor Representation Space

We extract one aligned contact frame per sensor for each of the 30 touches in the unused fine-grained subset of TacQuad, then use t-SNE to visualize the tactile representations. With our cross-sensor matching task, the representations from different sensors mix thoroughly in a shared multi-sensor space and clearly cluster by the object's tactile information. This indicates that our model extracts sensor-agnostic features, enabling generalization to unseen sensors.
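
For reference, a rough sketch of such a visualization with scikit-learn and matplotlib; the function name, input shapes, and t-SNE hyperparameters are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of a t-SNE plot over aligned multi-sensor tactile representations.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_multisensor_tsne(features: np.ndarray, sensor_ids, object_ids):
    """features: (num_frames, D) tactile encoder outputs; sensor_ids / object_ids
    give the source sensor and touched object for each frame."""
    coords = TSNE(n_components=2, perplexity=10, init="pca").fit_transform(features)
    obj_color = {o: j for j, o in enumerate(sorted(set(object_ids)))}
    markers = "osv^"
    # Color by object and mark by sensor: sensor-agnostic features should form
    # object clusters in which the different sensor markers are mixed together.
    for k, s in enumerate(sorted(set(sensor_ids))):
        idx = [i for i, sid in enumerate(sensor_ids) if sid == s]
        plt.scatter(coords[idx, 0], coords[idx, 1],
                    c=[obj_color[object_ids[i]] for i in idx],
                    marker=markers[k % len(markers)], label=str(s))
    plt.legend(title="sensor")
    plt.show()
```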

Static and Dynamic Perception

To validate that unified multi-sensor representations help transfer knowledge from multi-sensor data to both seen and unseen sensors, we compare AnyTouch with existing multi-sensor models on two datasets from seen sensors and two from unseen sensors. As shown in the tables, AnyTouch outperforms existing methods on all four datasets, demonstrating its static perception capability on both seen and unseen sensors.

To test the dynamic perception capability of our method in real-world object manipulation, we conduct a fine-grained pouring task: the robot arm must rely entirely on tactile feedback to pour out 60g of small beads from a cylinder initially containing 100g. We conduct 10 real-world test runs and report the mean error. The results demonstrate the importance of learning unified multi-sensor representations from both static and dynamic perspectives for completing a range of tasks, including real-world manipulation.

BibTeX

@inproceedings{feng2025learning,
    title={AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors},
    author={Ruoxuan Feng and Jiangyu Hu and Wenke Xia and Tianci Gao and Ao Shen and Yuhao Sun and Bin Fang and Di Hu},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=XToAemis1h}
}