One of the things we wanted to get right early was model routing. Not every question needs Taipei 3.1 with Extended Thinking. Sending "what's the capital of France?" to our most powerful model is wasteful: it adds latency, increases cost, and produces the same answer Majuli would give in 0.3 seconds.
The routing classifier
We built a lightweight classifier that runs before the main model inference. It takes the last 2-3 messages of conversation context and outputs a routing score from 0 to 1, representing estimated task complexity. The classifier runs in under 20ms on CPU and adds no perceptible latency to the user.
Features the classifier uses: prompt length, presence of code, mathematical notation, multi-step instructions, and a small vocabulary of complexity signals ("prove", "architecture", "refactor", "debug") versus simplicity signals ("what is", "list", "translate").
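To make the feature list concrete, here is a rough Python sketch of how such features might be extracted. It is an illustration under assumptions, not the production classifier: the function names, the regular expressions, and the "two or more sequencing cues" rule for multi-step detection are all hypothetical, and the signal vocabularies contain only the examples quoted above.

```python
import re

# Hypothetical vocabularies, seeded only with the examples from the post.
COMPLEXITY_SIGNALS = ("prove", "architecture", "refactor", "debug")
SIMPLICITY_SIGNALS = ("what is", "list", "translate")

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this listing clean


def count_signals(text, signals):
    """Count whole-word (or whole-phrase) occurrences of each signal."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(signal) + r"\b", lowered))
        for signal in signals
    )


def extract_features(messages):
    """Build a feature dict from the last three messages of a conversation."""
    text = "\n".join(messages[-3:])
    lowered = text.lower()

    # Assumed heuristic: numbered steps or sequencing words hint at multi-step work.
    step_cues = re.findall(
        r"^\s*\d+[.)]\s|\b(?:first|then|next|finally)\b", lowered, re.MULTILINE
    )

    return {
        "prompt_length": len(text.split()),  # length in whitespace-separated words
        "has_code": FENCE in text
        or bool(re.search(r"\bdef\s+\w+\s*\(", text)),
        "has_math": bool(
            re.search(r"\\frac|\\sum|\\int|\d\s*[\^=+*/]\s*\d", text)
        ),
        "multi_step": len(step_cues) >= 2,
        "complexity_hits": count_signals(text, COMPLEXITY_SIGNALS),
        "simplicity_hits": count_signals(text, SIMPLICITY_SIGNALS),
    }
```

A real classifier would feed a dict like this into a small trained model to produce the 0 to 1 score. The sketch stops at feature extraction because the post does not describe the model itself.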
Routing logic
Score < 0.3: Route to Majuli 3.1 (fast, concise).
Score 0.3-0.7: Route to Suzhou 3.1 (balanced).
Score > 0.7: Route to Taipei 3.1 (reasoning).
Extended Thinking is activated for scores > 0.9.
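The thresholds translate almost directly into code. The sketch below is a hypothetical Python rendering: the `Route` type and `route` function are names invented for illustration, and it assumes the middle band includes both endpoints (0.3 and 0.7), which is how the ranges above read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Where a message goes, and whether Extended Thinking is switched on."""
    model: str
    extended_thinking: bool = False


def route(score: float) -> Route:
    """Map a routing score in [0, 1] to a model tier."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"routing score out of range: {score}")
    if score < 0.3:
        return Route("Majuli 3.1")   # fast, concise
    if score <= 0.7:
        return Route("Suzhou 3.1")   # balanced
    # Everything above 0.7 goes to the reasoning model;
    # Extended Thinking only kicks in above 0.9.
    return Route("Taipei 3.1", extended_thinking=score > 0.9)
```

For example, `route(0.8)` yields Taipei 3.1 without Extended Thinking, while `route(0.95)` yields Taipei 3.1 with it enabled.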
In Auto mode (the default), routing is invisible. Users on Pro and Max plans can override it by choosing a specific model in the selector. We log routing decisions and use them as a training signal to improve the classifier over time.
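One way to picture the override and logging step is the Python sketch below. Everything in it is an assumption made for illustration: the `resolve_model` function, the log record fields, and the idea that a selection from a plan without override rights is simply ignored.

```python
import json
import time

# Plans that may override Auto routing, per the post.
OVERRIDE_PLANS = frozenset({"Pro", "Max"})


def resolve_model(auto_model, score, plan, requested=None):
    """Pick the model to serve and build a log record for the decision.

    auto_model: the model the classifier's score mapped to
    requested:  the model chosen in the selector, or None in Auto mode
    """
    overridden = requested is not None and plan in OVERRIDE_PLANS
    served = requested if overridden else auto_model
    record = {
        "ts": time.time(),
        "plan": plan,
        "score": score,
        "auto_model": auto_model,
        "served_model": served,
        "overridden": overridden,  # a disagreement with Auto is a useful label
    }
    return served, record


served, record = resolve_model("Majuli 3.1", 0.2, "Pro", requested="Taipei 3.1")
log_line = json.dumps(record)  # one JSON object per decision
```

Recording both the automatic choice and the model actually served keeps manual overrides available as labels, since they mark cases where a user disagreed with the classifier.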
Results
Auto routing reduces average inference cost per message by 42% compared to always routing to Taipei. User satisfaction scores are identical between Auto and manual Taipei selection, which tells us the routing is working as intended.
The most interesting failure mode we found during testing: the classifier over-indexed on prompt length and would route long but simple messages to Taipei. We fixed this by adding a specificity feature that penalizes verbose but semantically simple prompts.
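The post does not spell out how the specificity feature is computed, so the Python sketch below shows just one plausible proxy rather than the actual feature: the share of words in a prompt that are distinct content words. The stopword list and the `specificity` function are hypothetical. The intuition is that a long, padded, repetitive prompt scores low, while a dense technical request scores high, so the value can be used to discount raw length.

```python
import re

# Hypothetical filler and stopword list, kept tiny for illustration.
STOPWORDS = frozenset({
    "a", "an", "the", "is", "are", "of", "to", "in", "and", "or", "it",
    "that", "this", "for", "please", "me", "you", "can", "i", "so",
    "just", "really", "very", "what",
})


def specificity(text):
    """Fraction of all words that are distinct, non-stopword tokens (0 to 1)."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    if not words:
        return 0.0
    distinct_content = {w for w in words if w not in STOPWORDS}
    return len(distinct_content) / len(words)


verbose = "please please can you just really tell me what is the capital of france " * 5
dense = "Refactor the parser to stream tokens and debug the off-by-one in the lexer"

# The padded prompt is far longer but carries much less distinct content.
assert specificity(verbose) < specificity(dense)
```

Multiplying prompt length by a score like this would shrink the length signal for verbose but simple messages, which is the kind of penalty described above.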