AI Chat: API Usage

Our chat frontend at https://ki-chat.uni-mainz.de not only supports interactive chat, but also serves OpenAI-compatible API endpoints for seamless integration into your own applications. This page provides basic instructions for making use of that feature.

Getting Started

To use the API, you'll need to obtain an API key from the KI-Chat@JGU interface:

  1. Log in to the KI-Chat@JGU platform at https://ki-chat.uni-mainz.de
  2. Click your avatar image in the upper right corner and select Settings.
  3. In the window that opens, navigate to Account.
  4. Under API keys, click Create New Key.
  5. Use the copy button to copy the generated key and store it securely.

Important: Please make sure to keep your API key confidential.

API Endpoints

The API endpoints under https://ki-chat.uni-mainz.de/api are compatible with the OpenAI-API standard as described in the official reference (https://platform.openai.com/docs/api-reference), but not all endpoints and features are necessarily supported.

Currently, we offer the endpoints /models, /chat/completions, and /embeddings. All endpoints require authentication using your personal API key as a Bearer token in the Authorization header (Authorization: Bearer API_KEY).
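
Because the endpoints follow the OpenAI API standard, existing OpenAI-compatible client libraries can usually be pointed directly at this base URL. As a minimal sketch, assuming the official openai Python package is installed (the model name is taken from the examples below):

from openai import OpenAI

# Configure the client against the KI-Chat@JGU API.
# Replace API_KEY with your personal key from the Settings page.
client = OpenAI(
    base_url="https://ki-chat.uni-mainz.de/api",
    api_key="API_KEY",
)

# Simple connectivity check: request a short completion.
response = client.chat.completions.create(
    model="GPT OSS 120B",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)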

In the following, you'll find basic instructions for each endpoint, with example requests using the command-line tool curl.

/models

Retrieve a list of all available models.

Request Example:

curl -H "Authorization: Bearer API_KEY" https://ki-chat.uni-mainz.de/api/models

OpenAI API reference: https://platform.openai.com/docs/api-reference/models/list
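
The same request can also be made from Python with the requests library; in the standard OpenAI response format the model identifiers are expected under the "data" key (a sketch, not an exhaustive description of the response):

import requests

# Retrieve the model list and print the model identifiers.
response = requests.get(
    "https://ki-chat.uni-mainz.de/api/models",
    headers={"Authorization": "Bearer API_KEY"},
)
response.raise_for_status()

for model in response.json()["data"]:
    print(model["id"])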

/chat/completions

Generate a chat message, given the current message history.

Request Example:

curl -X POST https://ki-chat.uni-mainz.de/api/chat/completions \
  -H "Authorization: Bearer API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GPT OSS 120B",
    "messages": [
      {"role": "user", "content": "Why is the sky blue?"}
    ]
  }'

Note: Use the "stream" parameter to control whether you receive the complete answer in a single response or as a stream of delta chunks ("stream": true), as in the sketch below.
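
A minimal streaming sketch in Python, assuming the official openai package and the standard OpenAI chunk format:

from openai import OpenAI

client = OpenAI(
    base_url="https://ki-chat.uni-mainz.de/api",
    api_key="API_KEY",
)

# Request a streamed completion and print the delta chunks as they arrive.
stream = client.chat.completions.create(
    model="GPT OSS 120B",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()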

OpenAI API reference: https://platform.openai.com/docs/api-reference/chat/create

/embeddings

Generate embeddings for text passages using the embedding model bge-m3.

Request Example:

curl -X POST https://ki-chat.uni-mainz.de/api/embeddings \
  -H "Authorization: Bearer API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-m3",
    "input": ["Open WebUI is great!", "Let's generate embeddings."]
  }'

Note: For optimal performance, batch multiple chunks into a single request, as shown in the sketch below.
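
A minimal sketch of such a batched request in Python, assuming the official openai package (the vectors are expected under data[].embedding in the standard OpenAI format):

from openai import OpenAI

client = OpenAI(
    base_url="https://ki-chat.uni-mainz.de/api",
    api_key="API_KEY",
)

# Embed several text chunks in a single request instead of one request per chunk.
chunks = ["Open WebUI is great!", "Let's generate embeddings."]
response = client.embeddings.create(model="bge-m3", input=chunks)

for chunk, item in zip(chunks, response.data):
    print(chunk, "->", len(item.embedding), "dimensions")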

OpenAI API reference: https://platform.openai.com/docs/api-reference/embeddings

Usage Constraints

To ensure fair usage of our system, the following constraints are currently enforced:

Context Size Limits 

  • 65536 input tokens maximum context window
  • 8192 output tokens maximum for non-reasoning models
  • 16384 output tokens maximum for reasoning models

Rate Limits 

  1. Max. 2 parallel /chat/completions requests
  2. Max. 1 parallel /embeddings request (please batch chunks together)
  3. Max. 1 API request per second on average over prolonged durations
  4. Per 5 minutes: Max. 200,000 weighted tokens, calculated as 4 × output tokens + 1 × input tokens (cache miss) + 0.1 × input tokens (cache hit)

Staying within these limits should prevent HTTP 429 (Too Many Requests) responses.
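
If a request does hit a limit anyway, retrying with exponential backoff is usually sufficient. A sketch in Python using the requests library (retry counts and wait times are arbitrary example values):

import time
import requests

def post_with_retry(url, payload, api_key, max_retries=5):
    """POST a JSON payload, retrying with exponential backoff on HTTP 429."""
    headers = {"Authorization": f"Bearer {api_key}"}
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Back off before retrying: 1 s, 2 s, 4 s, ...
        time.sleep(2 ** attempt)
    raise RuntimeError("Still rate-limited after several retries")

result = post_with_retry(
    "https://ki-chat.uni-mainz.de/api/chat/completions",
    {"model": "GPT OSS 120B", "messages": [{"role": "user", "content": "Hello!"}]},
    "API_KEY",
)
print(result["choices"][0]["message"]["content"])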

Specific Features

GPT-OSS Reasoning Effort

The GPT-OSS model supports the reasoning_effort parameter to control the number of reasoning tokens generated:

{
  "model": "GPT OSS 120B",
  "messages": [{"role": "user", "content": "Explain quantum physics simply"}],
  "reasoning_effort": "medium"
}

Available values:

"low" Minimal reasoning tokens (faster response)
"medium" Balanced reasoning (default)
"high" Extensive reasoning (more thorough but slower)

Function Calling

All of our models except Gemma3 27B support native function calling via the /chat/completions API.

Learn more: https://platform.openai.com/docs/guides/function-calling

Request Example:

curl -X POST https://ki-chat.uni-mainz.de/api/chat/completions \
  -H "Authorization: Bearer API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GPT OSS 120B",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather like in Boston today?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather in a given location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"]
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'

However, implementing the full flow, including feeding tool results back to the model, is beyond the scope of these instructions.
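
As a partial illustration, the following Python sketch (using the requests library; the response shape assumed here is the standard OpenAI format with tool calls under choices[0].message.tool_calls) sends a shortened version of the weather example and prints the tool call requested by the model:

import json
import requests

payload = {
    "model": "GPT OSS 120B",
    "messages": [{"role": "user", "content": "What is the weather like in Boston today?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }],
    "tool_choice": "auto",
}

response = requests.post(
    "https://ki-chat.uni-mainz.de/api/chat/completions",
    headers={"Authorization": "Bearer API_KEY"},
    json=payload,
)
response.raise_for_status()

# If the model decided to call the function, the call appears under "tool_calls".
message = response.json()["choices"][0]["message"]
for tool_call in message.get("tool_calls", []):
    print(tool_call["function"]["name"])
    print(json.loads(tool_call["function"]["arguments"]))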

Multimodal Image Processing

To send images to multimodal models like Gemma3 27B, encode the image in base64 and include it in the message content. The following Python code serves as an example:


import requests
import base64

# API Config
api_url = "https://ki-chat.uni-mainz.de/api"
api_key = "API_KEY"

# Path of image to send
image_path = "test.jpg"

# Base64-encode the image
with open(image_path, "rb") as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode('utf-8')

# API-Request Payload
payload = {
    "model": "Gemma3 27B",
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "What do you see?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{encoded_image}"
                }
            }
        ]}
    ]
}

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

response = requests.post(
    f"{api_url}/chat/completions",
    headers=headers,
    json=payload
)

print(response.json())

Fill In the Middle (FIM)

Qwen3 Coder 30B is the only model that supports classic “fill in the middle,” i.e., autocompletion based on surrounding text (usually code). To use this feature, the prefix and suffix text must be surrounded by special tokens in the query:

`<|fim_prefix|>PREFIX<|fim_suffix|>SUFFIX<|fim_middle|>`

This format must be used exactly as shown and sent as the content of a user message.

Example Request:


```bash
curl -X POST https://ki-chat.uni-mainz.de/api/chat/completions \
-H "Authorization: Bearer API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3 Coder 30B",
"messages": [
{"role": "user", "content": "<|fim_prefix|>def approximate_pi()<|fim_suffix|>    return pi<|fim_middle|>"}
],
"max_tokens": 512
}'
```
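
For programmatic use, the FIM prompt can be assembled from prefix and suffix strings. A sketch in Python using the requests library (fill_in_the_middle is a hypothetical helper, not part of the API):

import requests

def fill_in_the_middle(prefix, suffix, api_key):
    """Ask Qwen3 Coder 30B to complete the text between prefix and suffix."""
    prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
    response = requests.post(
        "https://ki-chat.uni-mainz.de/api/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "Qwen3 Coder 30B",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 512,
        },
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

print(fill_in_the_middle("def approximate_pi()", "    return pi", "API_KEY"))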