FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
Research demo for:
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length, arXiv 2025
This demo uses the FlexTok tokenizer (EPFL-VILAB/flextok_d18_d28_dfn) to autoencode a given RGB input, running on an NVIDIA A100-SXM4-80GB MIG 3g.40gb. The FlexTok encoder produces a 1D sequence of discrete tokens ordered in a coarse-to-fine manner. We show reconstructions from truncated subsequences, using the first 1, 2, 4, 8, ..., 256 tokens. As you will see, the first tokens capture high-level semantic content, while subsequent ones add fine-grained detail.
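For reference, a coarse-to-fine reconstruction loop like the one shown in this demo could look roughly as follows. This is a minimal sketch, not the demo's actual code: the `FlexTokFromHub` wrapper, the `tokenize`/`detokenize` method names, and the token tensor shapes are assumptions based on the FlexTok repository and may differ in detail.

```python
# Minimal sketch of coarse-to-fine reconstruction with FlexTok.
# Assumes the FlexTokFromHub wrapper and tokenize()/detokenize() methods
# from the FlexTok repository; exact names, arguments, and shapes may differ.
import torch
from PIL import Image
from torchvision import transforms

from flextok.flextok_wrapper import FlexTokFromHub  # assumed import path

device = "cuda" if torch.cuda.is_available() else "cpu"
model = FlexTokFromHub.from_pretrained("EPFL-VILAB/flextok_d18_d28_dfn").to(device).eval()

# Load a 256x256 RGB image and normalize it to [-1, 1].
to_tensor = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])
img = to_tensor(Image.open("input.jpg").convert("RGB")).unsqueeze(0).to(device)

with torch.no_grad():
    # 1D sequence of discrete tokens, ordered coarse-to-fine.
    tokens = model.tokenize(img)

    # Reconstruct from truncated prefixes of 1, 2, 4, ..., 256 tokens.
    reconstructions = {}
    for k in [2**i for i in range(9)]:  # 1, 2, 4, ..., 256
        truncated = [seq[:, :k] for seq in tokens]  # assumes shape [B, num_tokens]
        reconstructions[k] = model.detokenize(truncated)
```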
The FlexTok decoder is a rectified flow model. The following settings control the seed of the initial noise, the number of denoising timesteps, the guidance scale, and whether to perform Adaptive Projected Guidance (we recommend enabling it).
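The sketch below illustrates how these decoding settings might be passed. The keyword names (`timesteps`, `guidance_scale`, `perform_norm_guidance`) and the seeding mechanism are assumptions; consult the FlexTok code for the exact interface.

```python
# Sketch of the rectified-flow decoding settings exposed in this demo.
# Keyword names are assumptions and may differ from the actual FlexTok API.
import torch

seed = 0
torch.manual_seed(seed)  # controls the initial noise of the flow model

with torch.no_grad():
    reconstruction = model.detokenize(
        truncated,                   # token prefix from the tokenizer (see above)
        timesteps=25,                # number of denoising timesteps
        guidance_scale=7.5,          # guidance strength
        perform_norm_guidance=True,  # Adaptive Projected Guidance (recommended)
    )
```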
This FlexTok model operates at 256x256 resolution. You can optionally super-resolve the reconstructions to 1024x1024 using Aura-SR for sharper details, without changing the underlying reconstructed image too much.
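A sketch of this optional 4x upscaling step (256x256 to 1024x1024) using the `aura-sr` package is shown below; the `fal/AuraSR-v2` checkpoint name and the `from_pretrained`/`upscale_4x` calls reflect that package's documented usage, but treat them as assumptions rather than the demo's exact code.

```python
# Sketch of optional 4x super-resolution (256x256 -> 1024x1024) with Aura-SR.
# Assumes the aura-sr package's AuraSR.from_pretrained / upscale_4x interface
# and the fal/AuraSR-v2 checkpoint; verify against the package documentation.
from aura_sr import AuraSR
from PIL import Image

aura_sr = AuraSR.from_pretrained("fal/AuraSR-v2")

reconstruction_256 = Image.open("reconstruction.png").convert("RGB")  # 256x256 output
reconstruction_1024 = aura_sr.upscale_4x(reconstruction_256)          # 1024x1024
reconstruction_1024.save("reconstruction_sr.png")
```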