AnyTouch 2: General Optical Tactile Representation Learning For Dynamic Tactile Perception

Ruoxuan Feng1,2,3, Yuxuan Zhou4, Siyu Mei4, Dongzhan Zhou5, Pengwei Wang3, Shaowei Cui6,3, Bin Fang7,3, Guocai Yao3,8, Di Hu1,2,3
1 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 2 Beijing Key Laboratory of Research on Large Models and Intelligent Governance; 3 Beijing Academy of Artificial Intelligence; 4 Beijing Jiaotong University; 5 Shanghai Artificial Intelligence Laboratory; 6 Institute of Automation, Chinese Academy of Sciences; 7 Beijing University of Posts and Telecommunications; 8 State Key Laboratory of Multimedia Information Processing, Peking University

Overview

The rise of high-resolution optical tactile sensors is ushering robotics into an era of dynamic tactile perception, where robots can sense temporal variations in contact, force, and material interaction for increasingly complex real-world tasks. In stark contrast, existing tactile datasets and models are fundamentally unable to support this revolution: they remain limited to static, object-level properties, leaving the rich temporal dynamics of touch largely unexplored. Today, we bridge this gap with an entirely new dynamic data ecosystem and corresponding datasets, along with a general-purpose model that comprehensively covers tactile perception abilities, especially dynamic tactile perception.

Tactile Dynamic Pyramid & ToucHD Dataset

To establish a systematic paradigm for dynamic tactile perception, we introduce a tactile dynamic pyramid that organizes tactile data into five tiers based on the complexity of the perception capabilities they support (a small code sketch of this taxonomy follows the list):

  • Tier 5 (Press-Only): Collected by simply pressing the sensor against objects, either handheld or with a robot arm. It mainly supports recognition of object-level attributes. (Touch and Go, ObjectFolder, VisGel, TVL, etc.)
  • Tier 4 (Random Sliding & Rotation): Collected by pressing the sensor against objects, followed by random sliding and rotation. It enables perception of surface-related dynamics but lacks task relevance. (YCB-Slide, TacQuad, etc.)
  • Tier 3 (Specific Action): Collected by controlling the sensor to press and slide along the object surface following specific predefined actions. It can facilitate action-level tactile understanding.
  • Tier 2 (Manipulation Data): Collected during real object manipulation tasks using a robot arm or a UMI device. It is essential for learning real-world manipulation skills.
  • Tier 1 (Force Data): Collected by a robot arm equipped with a force sensor. It enables reasoning about force–deformation relationships and supports fine-grained, force-sensitive manipulation tasks. (e.g., FeelAnyForce)
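As a concrete handle on this taxonomy, the snippet below encodes the five tiers as a small Python structure. The enum name, member names, and the example dataset mapping are our own illustrative shorthand, not an official API of any released code.

```python
from enum import IntEnum

class TactileTier(IntEnum):
    """Tiers of the tactile dynamic pyramid (lower number = richer dynamics)."""
    FORCE = 1            # touch-force pairs from a force-sensor-equipped robot arm
    MANIPULATION = 2     # data recorded during real manipulation tasks (robot arm / UMI)
    SPECIFIC_ACTION = 3  # predefined press-and-slide actions along object surfaces
    RANDOM_MOTION = 4    # press followed by random sliding and rotation
    PRESS_ONLY = 5       # static presses for object-level attribute recognition

# Illustrative mapping of datasets to the tier they mainly occupy.
DATASET_TIERS = {
    "Touch and Go": TactileTier.PRESS_ONLY,
    "ObjectFolder": TactileTier.PRESS_ONLY,
    "YCB-Slide": TactileTier.RANDOM_MOTION,
    "TacQuad": TactileTier.RANDOM_MOTION,
    "FeelAnyForce": TactileTier.FORCE,
    "ToucHD-Sim": TactileTier.SPECIFIC_ACTION,
    "ToucHD-Mani": TactileTier.MANIPULATION,
    "ToucHD-Force": TactileTier.FORCE,
}
```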
Most existing tactile datasets reside in Tiers 4 and 5, offering insufficient support for advanced dynamic perception tasks such as dexterous manipulation, while higher-tier data remain scarce. To address this gap, we present ToucHD, a large-scale Tactile Hierarchical Dynamic dataset with 2,426,174 contact samples, designed to enrich higher-tier dynamic tactile data. Compared with existing tactile datasets, ToucHD offers advantages in scale, sensor diversity, label diversity, and dynamic diversity. The dataset comprises three subsets corresponding to the three highest tiers of the pyramid (an illustrative sample layout is sketched after the list):
  • Simulated Atomic Action Data (Sim). We collect 1,118,896 multi-sensor contact frames from five optical tactile sensors, each performing six atomic actions (sliding left/right/up/down and rotating clockwise/counterclockwise) on 1,043 3D objects.
  • Real-World Manipulation Data (Mani). We modify FastUMI by equipping its two grippers with different tactile sensors and collect 584,842 contact frames from 46 carefully designed manipulation tasks, while simultaneously recording the interaction videos.
  • Touch-Force Paired Data (Force). We collect 722,436 touch–force pairs using five carefully selected tactile sensors and 71 distinct indenters. Under programmatic control, each indenter performs sliding motions in four directions—forward, backward, left, and right—across the sensor surface, while a wrist-mounted force sensor records 3D contact force sequences.
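To make the composition of these subsets concrete, the sketch below shows one hypothetical way a sample from each subset could be organized. The field names, shapes, and types are assumptions for illustration only, not the released data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SimSample:
    """Sim: simulated atomic-action data (hypothetical layout)."""
    sensor: str            # one of the five simulated optical tactile sensors
    object_id: str         # one of the 1,043 3D objects
    action: str            # e.g. "slide_left", ..., "rotate_ccw" (six atomic actions)
    frames: np.ndarray     # (T, H, W, 3) tactile video clip

@dataclass
class ManiSample:
    """Mani: real-world manipulation data from the modified FastUMI (hypothetical layout)."""
    task: str                      # one of the 46 manipulation tasks
    left_frames: np.ndarray        # (T, H, W, 3) tactile video, left gripper sensor
    right_frames: np.ndarray       # (T, H, W, 3) tactile video, right gripper sensor
    interaction_video: np.ndarray  # (T, H, W, 3) synchronized interaction video

@dataclass
class ForceSample:
    """Force: touch-force paired data (hypothetical layout)."""
    sensor: str            # one of the five selected tactile sensors
    indenter_id: int       # one of the 71 indenters
    direction: str         # sliding direction: "forward" | "backward" | "left" | "right"
    frames: np.ndarray     # (T, H, W, 3) tactile video
    forces: np.ndarray     # (T, 3) contact force sequence from the wrist-mounted sensor
```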

AnyTouch 2 Model

Building on this dynamic tactile data ecosystem, we introduce AnyTouch 2, a general tactile representation learning framework with comprehensive multi-level dynamic perception capabilities (a combined training-objective sketch follows the list):

  • Pixel-Level Dynamic Details. We employ a video masked autoencoder to reconstruct masked video frames and frame differences, enabling the model to capture the fine-grained temporal variations essential for dynamic perception.

  • Semantic-Level Tactile Features. We employ multi-modal alignment, object matching, and cross-sensor matching to capture both static object-level and dynamic action-aware semantic features, effectively bridging low-level tactile signals with high-level perceptual understanding.

  • Dynamic Physical Properties. We introduce force and delta-force prediction tasks, built on ToucHD, to explicitly model the physical properties underlying tactile interactions. This design yields a comprehensive, physically grounded representation spanning all tiers of the tactile dynamic pyramid, supports dexterous manipulation, and sets the model apart from prior methods.
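As a rough illustration of how these three levels of supervision can be combined, the sketch below wires a masked reconstruction loss over frames and frame differences, a contrastive alignment loss, and force / delta-force regression into a single training objective. The module names, tensor shapes, temperature, and equal loss weights are our assumptions, and the object-matching and cross-sensor-matching terms are omitted for brevity; the actual architecture and loss balancing of AnyTouch 2 may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnyTouch2ObjectiveSketch(nn.Module):
    """Illustrative multi-objective head on top of a tactile video encoder."""

    def __init__(self, encoder: nn.Module, dim: int = 768, patch_dim: int = 768):
        super().__init__()
        self.encoder = encoder                          # e.g. a video ViT backbone
        self.frame_decoder = nn.Linear(dim, patch_dim)  # reconstructs masked frame patches
        self.diff_decoder = nn.Linear(dim, patch_dim)   # reconstructs frame-difference patches
        self.proj = nn.Linear(dim, 256)                 # embedding for semantic alignment
        self.force_head = nn.Linear(dim, 3)             # predicts 3D contact force
        self.dforce_head = nn.Linear(dim, 3)            # predicts change in force over the clip

    def forward(self, video, mask, frame_targets, diff_targets,
                paired_emb, force_gt, dforce_gt):
        # Encoder returns visible patch tokens and a clip-level representation.
        tokens, clip_repr = self.encoder(video, mask)   # (B, N_vis, dim), (B, dim)

        # 1) Pixel level: reconstruct masked frame patches and their temporal differences.
        loss_frame = F.mse_loss(self.frame_decoder(tokens), frame_targets)
        loss_diff = F.mse_loss(self.diff_decoder(tokens), diff_targets)

        # 2) Semantic level: contrastively align clip embeddings with paired
        #    text/vision embeddings (paired_emb assumed to be (B, 256)).
        z = F.normalize(self.proj(clip_repr), dim=-1)
        t = F.normalize(paired_emb, dim=-1)
        logits = z @ t.t() / 0.07                       # temperature is a placeholder
        labels = torch.arange(z.size(0), device=z.device)
        loss_align = 0.5 * (F.cross_entropy(logits, labels)
                            + F.cross_entropy(logits.t(), labels))

        # 3) Physical level: regress force and delta force from the clip representation.
        loss_force = F.mse_loss(self.force_head(clip_repr), force_gt)
        loss_dforce = F.mse_loss(self.dforce_head(clip_repr), dforce_gt)

        # Equal weighting here is purely illustrative.
        return loss_frame + loss_diff + loss_align + loss_force + loss_dforce
```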

Experiments

Online Real-World Manipulation

Demo videos: Tactile Grasping (Tier 5) · Whiteboard Wiping (Tier 4 & 3) · USB Insertion (Tier 2) · Chip Moving (Tier 1)

We design four challenging real-world manipulation tasks that explicitly span the tactile dynamic pyramid: Tactile Grasping (Tier 5), Whiteboard Wiping (Tiers 4 & 3), USB Insertion (Tier 2), and Chip Moving (Tier 1). Together they cover every tier of the pyramid, from object-level property recognition to force-sensitive precision manipulation.

AnyTouch 2 achieves the strongest Tier-1 dynamic perception capability, outperforming all baselines across all four real-world tasks. This also marks a successful effort to incorporate a UMI device equipped with tactile sensors into model training.

Sensor Generalization

Demo videos: Tactile Grasping (DIGIT) · Whiteboard Wiping (DIGIT) · USB Insertion (DIGIT)

AnyTouch 2 integrates multiple optical tactile sensors, such as GelSight, GelSight Mini, DIGIT, GelSlim, and Duragel, and exhibits sensor generalization capability across various downstream tasks.

Offline Benchmark Evaluation

AnyTouch 2 demonstrates superior performance across all tactile perception tasks, covering both object-level static attribute understanding and physical-level dynamic perception.

BibTeX

Coming Soon