Description
TrueTrack is an independent research effort focused on advancing machine learning approaches for particle track reconstruction, with a particular emphasis on the TrackML dataset and related high-energy physics challenges.
The project brings together model design, data representation, and training systems into a unified framework aimed at solving hit-to-track association under realistic detector conditions.
Model design
At its core, TrueTrack explores more expressive and physically grounded Transformer-based architectures. Rather than framing track reconstruction as a fixed-class classification problem, which fits poorly because the number of particles varies from event to event, the project treats it as a relational learning task between detector hits.
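To make the relational framing concrete, the sketch below builds pairwise training targets from truth labels: a pair of hits is positive when both come from the same particle. This is an illustration only, not TrueTrack's actual target construction. The function name `pairwise_targets` is hypothetical; the one detail taken from TrackML is that a `particle_id` of 0 in the truth files marks a hit with no associated particle.

```python
from itertools import combinations

NOISE_ID = 0  # TrackML truth: particle_id 0 means no associated particle


def pairwise_targets(particle_ids):
    """Return {(i, j): 0 or 1} for every unordered hit pair with i < j.

    A pair is positive only when both hits share the same non-noise
    particle ID. Hypothetical helper, for illustration only.
    """
    targets = {}
    for i, j in combinations(range(len(particle_ids)), 2):
        same_track = (
            particle_ids[i] == particle_ids[j]
            and particle_ids[i] != NOISE_ID
        )
        targets[(i, j)] = int(same_track)
    return targets


# Five hits: two from particle 7, one from particle 9, two noise hits.
ids = [7, 7, 0, 9, 0]
targets = pairwise_targets(ids)
```

Note that two noise hits never form a positive pair, even though their labels match. The number of targets grows quadratically with the number of hits, which is one reason a relational formulation benefits from processing events in smaller pieces.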
A central direction is the use of specialised attention mechanisms. These include separate attention heads dedicated to track association, as well as explicit modelling of noise through dedicated "noise heads". Such architectural choices aim to disentangle structured particle trajectories from detector noise, allowing the model to learn both coherent track patterns and stochastic background signals in a principled manner.
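The text does not specify how a noise head is built. One possible reading, sketched here purely as an assumption, is to give each query hit an extra "noise slot" in the attention softmax, so that probability mass can flow to noise instead of to another hit. The function `attend_with_noise_slot` and the `noise_logit` parameter are hypothetical; in a trained model the noise logit would be learned.

```python
import math


def attend_with_noise_slot(query, keys, noise_logit):
    """Softmax over scaled dot-product scores plus one extra noise slot.

    Returns (weights over keys, probability assigned to the noise slot).
    Assumed design for illustration, not TrueTrack's actual noise head.
    """
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    logits.append(noise_logit)  # the additional noise slot
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs[:-1], probs[-1]


# One query hit, two candidate hits, and a neutral noise logit.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0]]
weights, p_noise = attend_with_noise_slot(query, keys, noise_logit=0.0)
```

In this toy case the first key aligns with the query and receives most of the weight, while the orthogonal key and the noise slot share the remainder equally. A hit whose scores against all other hits are low would instead place most of its mass on the noise slot.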
Working with full TrackML data
A key principle of TrueTrack is operating on the full, unreduced TrackML dataset. Many existing approaches simplify the problem by filtering noise or reducing event complexity. In contrast, TrueTrack retains all detector effects, including noise hits.
This decision reflects the goal of developing models that remain robust under realistic conditions, where ambiguity, noise, and combinatorial complexity are inherent to the data rather than artefacts to be removed.
Data representation and sharding
To make large-scale events tractable, TrueTrack introduces a structured data representation based on a global detector occupancy perspective. Hits are mapped into a consistent detector space, enabling the study of co-occurrence patterns and relational structure across events.
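As a rough illustration of an occupancy view, the sketch below counts hits per detector module across several events. It assumes that "detector space" can be approximated by the module address, using the `volume_id`, `layer_id`, and `module_id` columns found in TrackML hit files. The function name `module_occupancy` is hypothetical, and the project's actual mapping may differ.

```python
from collections import Counter


def module_occupancy(events):
    """Count hits per detector module across events.

    Each event is an iterable of hit dicts carrying the keys
    volume_id, layer_id, and module_id. Hypothetical helper.
    """
    counts = Counter()
    for hits in events:
        for hit in hits:
            address = (hit["volume_id"], hit["layer_id"], hit["module_id"])
            counts[address] += 1
    return counts


# Two tiny made-up events sharing one module address.
event_a = [
    {"volume_id": 8, "layer_id": 2, "module_id": 1},
    {"volume_id": 8, "layer_id": 2, "module_id": 5},
]
event_b = [
    {"volume_id": 8, "layer_id": 2, "module_id": 1},
]
occupancy = module_occupancy([event_a, event_b])
```

Because the module address is the same in every event, counts accumulated this way are directly comparable across events.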
Building on this, events are decomposed into overlapping subsets, or shards, which serve as the fundamental units for training. Sharding allows the model to process manageable portions of the detector while still capturing local geometric and topological relationships between hits.
This representation also supports flexible experimentation, including different levels of locality, overlap, and feature construction.
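The sharding scheme itself is not specified here. As one plausible instance, the sketch below cuts an event into azimuthal wedges and widens each wedge by a margin, so that hits near a boundary appear in both neighbouring shards. The function `phi_shards` and its parameters `n_shards` and `overlap` are assumptions made for illustration.

```python
import math


def phi_shards(hits, n_shards, overlap):
    """Split hits into azimuthal wedges widened by `overlap` radians per side.

    hits: iterable of (hit_id, x, y, z) tuples.
    Returns a list of n_shards lists of hit IDs. Hypothetical helper.
    """
    two_pi = 2 * math.pi
    width = two_pi / n_shards
    shards = [[] for _ in range(n_shards)]
    for hit_id, x, y, _z in hits:
        phi = math.atan2(y, x) % two_pi  # azimuth mapped to [0, 2*pi)
        for s in range(n_shards):
            centre = (s + 0.5) * width
            # Shortest angular distance, correct across the 0 / 2*pi seam.
            delta = abs((phi - centre + math.pi) % two_pi - math.pi)
            if delta <= width / 2 + overlap:
                shards[s].append(hit_id)
    return shards


# Three made-up hits, four wedges, 0.1 rad of overlap on each side.
hits = [(1, 1.0, 0.05, 0.0), (2, 0.0, 1.0, 0.0), (3, -1.0, 0.0, 0.0)]
shards = phi_shards(hits, n_shards=4, overlap=0.1)
```

Here `n_shards` controls locality and `overlap` controls how much context neighbouring shards share. A track that crosses a wedge boundary is then visible in full to at least one shard, provided the overlap is wide enough.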
Scalable training pipeline
TrueTrack places strong emphasis on practical scalability. The training pipeline is designed to run on moderate hardware, avoiding dependence on large-memory systems or high-end GPUs.
This is achieved through a fully streaming pipeline, in which data is processed incrementally rather than loaded into memory all at once. Combined with the sharding approach, this significantly reduces memory pressure and enables training on complex, high-density events without requiring specialised infrastructure.
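A minimal sketch of the streaming idea, using only the Python standard library: a generator reads CSV rows lazily and yields small batches, so memory use is bounded by the batch size rather than the file size. The function `stream_hit_batches` is hypothetical and the column names in the sample are illustrative; the real pipeline is not described in this detail.

```python
import csv
import io
from itertools import islice


def stream_hit_batches(lines, batch_size):
    """Lazily yield lists of at most batch_size hit rows (as dicts).

    `lines` is any iterable of CSV text lines, such as an open file
    handle, so the full file is never held in memory at once.
    """
    reader = csv.DictReader(lines)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# An in-memory stand-in for a hits file with three rows.
sample = io.StringIO(
    "hit_id,x,y,z\n"
    "1,0.1,0.2,0.3\n"
    "2,1.0,1.1,1.2\n"
    "3,2.0,2.1,2.2\n"
)
batches = list(stream_hit_batches(sample, batch_size=2))
```

In real use the `list(...)` call would be replaced by a loop that consumes one batch at a time, which is what keeps memory use flat. The same pattern composes with sharding: each batch or event can be cut into shards and passed to training before the next one is read.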
The overarching goal is to shift the primary bottleneck away from hardware constraints and toward model quality, data representation, and learning efficiency.
Outlook
TrueTrack sits at the intersection of machine learning systems design and physics-driven modelling. By combining expressive architectures, realistic data assumptions, and scalable infrastructure, the project aims to provide a foundation for next-generation approaches to particle track reconstruction.