Human visual reaction time is ~300 ms. If the training data is collected by a human operator, each action sequence/chunk should span a duration of the same order of magnitude as that [1].
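A rough sketch of the arithmetic behind this note: converting the ~300 ms figure into a number of control steps per chunk. The control frequencies (10, 30, 50 Hz) and the helper names `chunk_steps` and `same_order_of_magnitude` are assumptions for illustration, not values from any dataset or library.

```python
import math

REACTION_TIME_S = 0.3  # ~300 ms, the figure quoted in the note


def chunk_steps(duration_s: float, control_hz: float) -> int:
    """Number of control steps covering duration_s at control_hz (at least 1)."""
    return max(1, round(duration_s * control_hz))


def same_order_of_magnitude(a: float, b: float) -> bool:
    """True if a and b differ by less than a factor of 10."""
    return abs(math.log10(a / b)) < 1.0


if __name__ == "__main__":
    # Hypothetical control rates; substitute the rate of the actual robot/dataset.
    for hz in (10, 30, 50):
        steps = chunk_steps(REACTION_TIME_S, hz)
        print(f"{hz} Hz -> ~{steps} steps per 300 ms chunk")

    # Example check: is a 16-step chunk at 10 Hz (1.6 s) in the same range?
    print(same_order_of_magnitude(16 / 10, REACTION_TIME_S))
```

This only gives a starting range for the chunk length; the actual horizon would still need to be tuned per task.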
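Diffusion Policy works well in both joint space and action space (here taken to mean end-effector poses). However, working in action space requires a good IK (inverse kinematics) solver to turn the predicted poses into joint commands.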
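To make the IK requirement concrete, here is a toy sketch for a 2-link planar arm, assuming the policy outputs an end-effector position (x, y) that must be converted to joint angles. The arm, link lengths, and function names are hypothetical; a real robot would need a full 6-DoF solver that handles joint limits and singularities.

```python
import math


def fk_2link(q1: float, q2: float, l1: float = 1.0, l2: float = 1.0):
    """Forward kinematics: joint angles (rad) -> end-effector (x, y)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def ik_2link(x: float, y: float, l1: float = 1.0, l2: float = 1.0):
    """Analytic IK: returns the solution with q2 >= 0, or None if unreachable."""
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        return None  # target lies outside the reachable workspace
    q2 = math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


if __name__ == "__main__":
    target = (1.0, 1.0)  # hypothetical end-effector action from the policy
    joints = ik_2link(*target)
    print("joint angles:", joints)
    print("round trip:", fk_2link(*joints))
    print("unreachable target:", ik_2link(3.0, 0.0))
```

The unreachable case shows why IK quality matters: a policy trained on end-effector actions can emit targets the solver cannot satisfy, and the controller has to handle that gracefully.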
Created repository.
Draft plan:
huggingface/lerobot: start from the training and evaluation scripts there. Maybe reproduce lerobot/diffusion_pusht if feasible.