FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
Research demo for:
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length, arXiv 2025
This demo uses the FlexTok tokenizer (EPFL-VILAB/flextok_d18_d28_dfn) to autoencode a given RGB input, running on an NVIDIA A100-SXM4-80GB MIG 3g.40gb. The FlexTok encoder produces a 1D sequence of discrete tokens ordered in a coarse-to-fine manner. We show reconstructions from truncated subsequences, using the first 1, 2, 4, 8, ..., 256 tokens. As you will see, the first tokens capture high-level semantic content, while subsequent ones add fine-grained detail.
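For reference, a coarse-to-fine reconstruction loop like the one shown in this demo could look roughly as follows. This is a minimal sketch, not the demo's actual code: the `FlexTokFromHub` wrapper, the `tokenize`/`detokenize` method names, and the token tensor shapes are assumptions based on the FlexTok repository and may differ in detail.

```python
# Minimal sketch of coarse-to-fine reconstruction with FlexTok.
# Assumes the FlexTokFromHub wrapper and tokenize()/detokenize() methods
# from the FlexTok repository; exact names, arguments, and shapes may differ.
import torch
from PIL import Image
from torchvision import transforms

from flextok.flextok_wrapper import FlexTokFromHub  # assumed import path

device = "cuda" if torch.cuda.is_available() else "cpu"
model = FlexTokFromHub.from_pretrained("EPFL-VILAB/flextok_d18_d28_dfn").to(device).eval()

# Load a 256x256 RGB image and normalize it to [-1, 1].
to_tensor = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])
img = to_tensor(Image.open("input.jpg").convert("RGB")).unsqueeze(0).to(device)

with torch.no_grad():
    # 1D sequence of discrete tokens, ordered coarse-to-fine.
    tokens = model.tokenize(img)

    # Reconstruct from truncated prefixes of 1, 2, 4, ..., 256 tokens.
    reconstructions = {}
    for k in [2**i for i in range(9)]:  # 1, 2, 4, ..., 256
        truncated = [seq[:, :k] for seq in tokens]  # assumes shape [B, num_tokens]
        reconstructions[k] = model.detokenize(truncated)
```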
The FlexTok decoder is a rectified flow model. The following settings control the seed of the initial noise, the number of denoising timesteps, the guidance scale, and whether to perform Adaptive Projected Guidance (we recommend enabling it).
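The sketch below illustrates how these decoding settings might be passed. The keyword names (`timesteps`, `guidance_scale`, `perform_norm_guidance`) and the seeding mechanism are assumptions; consult the FlexTok code for the exact interface.

```python
# Sketch of the rectified-flow decoding settings exposed in this demo.
# Keyword names are assumptions and may differ from the actual FlexTok API.
import torch

seed = 0
torch.manual_seed(seed)  # controls the initial noise of the flow model

with torch.no_grad():
    reconstruction = model.detokenize(
        truncated,                   # token prefix from the tokenizer (see above)
        timesteps=25,                # number of denoising timesteps
        guidance_scale=7.5,          # guidance strength
        perform_norm_guidance=True,  # Adaptive Projected Guidance (recommended)
    )
```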
This FlexTok model operates at 256x256 resolution. You can optionally super-resolve the reconstructions to 1024x1024 using Aura-SR for sharper details, without changing the underlying reconstructed image too much.
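A sketch of this optional 4x upscaling step (256x256 to 1024x1024) using the `aura-sr` package is shown below; the `fal/AuraSR-v2` checkpoint name and the `from_pretrained`/`upscale_4x` calls reflect that package's documented usage, but treat them as assumptions rather than the demo's exact code.

```python
# Sketch of optional 4x super-resolution (256x256 -> 1024x1024) with Aura-SR.
# Assumes the aura-sr package's AuraSR.from_pretrained / upscale_4x interface
# and the fal/AuraSR-v2 checkpoint; verify against the package documentation.
from aura_sr import AuraSR
from PIL import Image

aura_sr = AuraSR.from_pretrained("fal/AuraSR-v2")

reconstruction_256 = Image.open("reconstruction.png").convert("RGB")  # 256x256 output
reconstruction_1024 = aura_sr.upscale_4x(reconstruction_256)          # 1024x1024
reconstruction_1024.save("reconstruction_sr.png")
```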