Pure Go implementation of OpenAI's tiktoken tokenizer

  • Last update: Apr 12, 2023
  • Comments: 2


Tokenizer

This is a pure go port of OpenAI's tokenizer.


Usage

package main

import (
    "fmt"
    "github.com/tiktoken-go/tokenizer"
)

func main() {
    enc, err := tokenizer.Get(tokenizer.Cl100kBase)
    if err != nil {
        panic("oh oh")
    }

    // this should print a list of token ids
    ids, _, _ := enc.Encode("supercalifragilistic")
    fmt.Println(ids)

    // this should print the original string back
    text, _ := enc.Decode(ids)
    fmt.Println(text)
}
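
The blanks in the example above drop Encode's second return value and the errors. A short sketch that keeps both, assuming (from the ([]uint, []string, error) signature) that the []string value holds the text of each token:

package main

import (
    "fmt"

    "github.com/tiktoken-go/tokenizer"
)

func main() {
    enc, err := tokenizer.Get(tokenizer.Cl100kBase)
    if err != nil {
        panic(err)
    }

    // Encode returns the token ids, a []string (assumed here to be the
    // per-token text) and an error.
    ids, tokens, err := enc.Encode("supercalifragilistic")
    if err != nil {
        panic(err)
    }
    fmt.Println(ids)
    fmt.Println(tokens)

    // Decode returns the reconstructed text and an error.
    text, err := enc.Decode(ids)
    if err != nil {
        panic(err)
    }
    fmt.Println(text)
}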

Alternatively, you can use the included command-line tool:

> tokenizer -h

Usage of tokenizer:
  -decode string
        tokens to decode
  -encode string
        text to encode
  -token string
        text to calculate token

> tokenizer -encode supercalifragilistic

Todo

  • port code
  • cl100k_base encoding
  • r50k_base encoding
  • p50k_base encoding
  • p50k_edit encoding
  • tests
  • handle special tokens
  • gpt-2 model

Caveats

This library embeds OpenAI's vocabularies, which are not small (roughly 4 MB), as Go maps. This is different from how the Python version of tiktoken works: it downloads the dictionaries at runtime and stores them in a cache folder.

However, since the dictionaries are compiled in during the go build process, performance and start-up times should be better than downloading and loading them at runtime.
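
For illustration, the embed-as-Go-map approach described above roughly amounts to a generated source file per encoding that is compiled into the binary. The file layout, identifiers and ids below are made up for the sketch, not the library's actual generated code:

// Illustrative sketch only; the real generated file and names differ.
package main

import "fmt"

// In the library, a map like this is generated at build time from the
// upstream vocabulary data, so nothing is downloaded or cached at runtime.
var cl100kBaseVocab = map[string]uint{
    "!":  0, // illustrative entries only
    "\"": 1,
    // ... roughly 100k more generated entries in the real table
}

func main() {
    // A lookup is just a map access on data already in the binary.
    fmt.Println(cl100kBaseVocab["!"])
}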

Alternatives

Here is a list of other libraries that do something similar.

Download

tokenizer.zip

Comments(2)

  • 1

    Potential count inconsistency

    I'm using this repo mainly to count the tokens. I have one feature request and one issue. I'll start with the issue:

    Should I expect the same output from this library and https://platform.openai.com/tokenizer? I'm seeing different numbers and tokens. If yes, for which models?

    The feature request is to have a GetTokensCount function. It'd also be helpful if the existing functions had annotations so I can better understand what they do, particularly around the return values. Things like ([]uint, []string, error) aren't super helpful, since I'm not sure what the semantic meaning of these return values is without reading the code. (See the sketch after these comments.)

  • 2

    🛂 fixed usage example in readme

    • Indents in the import block were not consistent
    • () was missing in the main function
    • token was unused

    The last two points prevented the example from compiling.
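
Regarding the token-count request in the first comment: a count can already be derived from Encode. A minimal sketch, where countTokens is a hypothetical helper name (not part of the library) and the Encode signature is the one shown in the usage example above:

package main

import (
    "fmt"

    "github.com/tiktoken-go/tokenizer"
)

// countTokens is a hypothetical helper, not a library function: it encodes
// the text and returns the length of the resulting id slice.
func countTokens(text string) (int, error) {
    enc, err := tokenizer.Get(tokenizer.Cl100kBase)
    if err != nil {
        return 0, err
    }
    ids, _, err := enc.Encode(text)
    if err != nil {
        return 0, err
    }
    return len(ids), nil
}

func main() {
    n, err := countTokens("supercalifragilistic")
    if err != nil {
        panic(err)
    }
    fmt.Println(n) // number of tokens under the cl100k_base encoding
}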