6 posts tagged with "LLM Observability"

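Background

I had tried the LiteLLM + Langfuse combination. LiteLLM hit me with an incorrectly implemented encryption bug that made it impossible to configure the Langfuse callback. Langfuse depends on ClickHouse, which, without tuning, is unstable on low-end hardware: it not only caused CPU lockups, but after a crash the database was corrupted and would no longer run.

Then I switched to Maxim AI's Bifrost, but it had a high proxy failure rate (~6%). One search in Local Deep Research takes 4 LLM requests, and a single failed request fails the whole research run. In other words, a run has roughly a 22% chance of failing (1 − 0.94⁴ ≈ 0.22, assuming independent failures), which is quite annoying.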
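The 22% figure follows from treating the 4 requests as independent, each failing with probability 0.06. A minimal sketch of that arithmetic (the function name `chainFailureRate` is made up for illustration):

```typescript
// Probability that at least one of n independent requests fails,
// given a per-request failure rate p.
function chainFailureRate(p: number, n: number): number {
  return 1 - Math.pow(1 - p, n);
}

// 6% per-request failure rate, 4 requests per research run.
const rate = chainFailureRate(0.06, 4);
console.log(`${(rate * 100).toFixed(1)}%`); // about 21.9%, i.e. roughly 22%
```

The same formula shows how quickly this gets worse for longer chains: the more requests a single task needs, the more a small per-request failure rate hurts.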

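Another minor issue with Bifrost is that it is littered with paid features, mostly multi-tenant / multi-user management. I don't need them for self-hosting, but the UI nagging me to buy a license is still an eyesore.

In the end I decided to switch to yet another LLM observability solution.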

TensorZero

https://github.com/tensorzero/tensorzero
- 10.8k ⭐

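It uses a decoupled frontend/backend architecture: the gateway alone is enough to act as an LLM gateway, and ClickHouse only has to be configured if you want the observability (LLM Observability) part.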

services:
  gateway:
    image: docker.io/tensorzero/gateway:2026.1.2
    command: --config-file /app/config/tensorzero.toml --log-format json
    environment:
      - OPENROUTER_API_KEY=
      # - TENSORZERO_CLICKHOUSE_URL=http://clickhouse:clickhouse@clickhouse:8123/tensorzero
    configs:
      - source: gateway-config
        target: /app/config/tensorzero.toml
    ports:
      - "3000:3000"
    restart: always
    
  ui:
    image: docker.io/tensorzero/ui:2026.1.2
    environment:
      - TENSORZERO_GATEWAY_URL=http://gateway:3000
    ports:
      - "4000:4000"
    restart: always

ClickHouse

TensorZero still relies on ClickHouse as its observability database, but having learned my lesson, this time I did some homework on the configuration.

As an example, when using M-type CPUs, we recommend provisioning 100GB of memory per 25 CPU cores. ¹

Fortune 500 B2B SaaS ¹

Storage
- Monthly new data volume: 30TB
- Total storage (compressed): 540TB
- Data retention: 18 months
- Disk per node: 25TB

CPU
- Concurrency: 200+ concurrent queries
- Number of replicas (including HA pair): 44
- vCPU per node: 62
- Total vCPU: 2700

Memory
- Total RAM: 11TB
- RAM per replica: 256GB
- RAM-to-vCPU ratio: 4 GB:1
- RAM-to-disk ratio: 1:50


For small amounts of data (up to ~200 GB compressed), it is best to use as much memory as the volume of data.²

The recommended amount of RAM is 32 GB or more.²

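Everything suggests that ClickHouse is designed for large-scale business workloads, so a small Homelab user really does need extra configuration.

So I simply picked up someone else's configuration from the internet³:

config.xml and users.xml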

<!-- config.xml -->
<!-- These settings should allow to run clickhouse in nodes with 4GB/8GB RAM -->
<clickhouse>
  <!-- disable some optional components/tables -->
  <mysql_port remove="1" />
  <postgresql_port remove="1" />  
  <query_thread_log remove="1" />
  <opentelemetry_span_log remove="1" />
  <processors_profile_log remove="1" />   

  <!-- disable mlock, allowing binary pages to be unloaded from RAM, relying on Linux defaults -->
  <mlock_executable>false</mlock_executable> 

  <!-- decrease the cache sizes -->
  <mark_cache_size>268435456</mark_cache_size> <!-- 256 MB -->
  <index_mark_cache_size>67108864</index_mark_cache_size> <!-- 64 MB -->
  <uncompressed_cache_size>16777216</uncompressed_cache_size> <!-- 16 MB -->

  <!-- control the concurrency -->
  <max_thread_pool_size>2000</max_thread_pool_size>
  <max_connections>64</max_connections>
  <max_concurrent_queries>8</max_concurrent_queries>
  <max_server_memory_usage_to_ram_ratio>0.75</max_server_memory_usage_to_ram_ratio> <!-- 75% of the RAM, leave more for the system -->
  <max_server_memory_usage>0</max_server_memory_usage> <!-- We leave the overcommiter to manage available ram for queries-->

  <!-- reconfigure the main pool to limit the merges (those can create problems if the insert pressure is high) -->
  <background_pool_size>2</background_pool_size>
  <background_merges_mutations_concurrency_ratio>2</background_merges_mutations_concurrency_ratio>
  <merge_tree>
    <merge_max_block_size>1024</merge_max_block_size>
    <max_bytes_to_merge_at_max_space_in_pool>1073741824</max_bytes_to_merge_at_max_space_in_pool> <!-- 1 GB max part-->
    <number_of_free_entries_in_pool_to_lower_max_size_of_merge>2</number_of_free_entries_in_pool_to_lower_max_size_of_merge>
    <number_of_free_entries_in_pool_to_execute_mutation>2</number_of_free_entries_in_pool_to_execute_mutation>
    <number_of_free_entries_in_pool_to_execute_optimize_entire_partition>2</number_of_free_entries_in_pool_to_execute_optimize_entire_partition>
    <!-- Reduces memory usage during merges in system.metric_log table (enabled by default) by setting min_bytes_for_wide_part and vertical_merge_algorithm_min_bytes_to_activate to 128MB -->
    <min_bytes_for_wide_part>134217728</min_bytes_for_wide_part>
    <vertical_merge_algorithm_min_bytes_to_activate>134217728</vertical_merge_algorithm_min_bytes_to_activate>
  </merge_tree>

  <!-- shrink all pools to minimum-->
  <background_buffer_flush_schedule_pool_size>1</background_buffer_flush_schedule_pool_size>
  <background_merges_mutations_scheduling_policy>round_robin</background_merges_mutations_scheduling_policy>
  <background_move_pool_size>1</background_move_pool_size>
  <background_fetches_pool_size>1</background_fetches_pool_size>
  <background_common_pool_size>2</background_common_pool_size>
  <background_schedule_pool_size>8</background_schedule_pool_size>
  <background_message_broker_schedule_pool_size>1</background_message_broker_schedule_pool_size>
  <background_distributed_schedule_pool_size>1</background_distributed_schedule_pool_size>
  <tables_loader_foreground_pool_size>0</tables_loader_foreground_pool_size>
  <tables_loader_background_pool_size>0</tables_loader_background_pool_size>   
</clickhouse>

<!-- users.xml -->
<clickhouse>
  <profiles>
    <default>
      <max_threads>2</max_threads>
      <max_block_size>8192</max_block_size>
      <queue_max_wait_ms>1000</queue_max_wait_ms>
      <max_execution_time>600</max_execution_time>
      <input_format_parallel_parsing>0</input_format_parallel_parsing>
      <output_format_parallel_formatting>0</output_format_parallel_formatting>
      <max_bytes_before_external_group_by>3221225472</max_bytes_before_external_group_by> <!-- 3 GB -->
      <max_bytes_before_external_sort>3221225472</max_bytes_before_external_sort> <!-- 3 GB -->
    </default>
  </profiles>
</clickhouse>

However, setting max_concurrent_queries to 8 was too low for my use case and TensorZero started throwing errors, so I raised it to 100. (The default is 0, meaning no limit.)
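The raw byte values in the config above are hard to read at a glance. A small sketch that converts them to MiB so they can be checked against the inline comments (the setting names and values are copied from the XML; the helper `toMiB` is only for illustration):

```typescript
const MiB = 1024 * 1024;

// Byte-valued settings copied from config.xml and users.xml above.
const settings: Record<string, number> = {
  mark_cache_size: 268435456,
  index_mark_cache_size: 67108864,
  uncompressed_cache_size: 16777216,
  max_bytes_to_merge_at_max_space_in_pool: 1073741824,
  min_bytes_for_wide_part: 134217728,
  max_bytes_before_external_group_by: 3221225472,
};

function toMiB(bytes: number): number {
  return bytes / MiB;
}

for (const [name, bytes] of Object.entries(settings)) {
  console.log(`${name}: ${toMiB(bytes)} MiB`);
}
```

On an 8 GB machine, a 3 GiB external group-by threshold is a large share of the memory left to ClickHouse after the 0.75 server ratio, so it may be worth lowering on smaller nodes.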

Basic usage

The annoying thing about TensorZero is that its official documentation takes every opportunity to push its own API:

curl -X POST "http://localhost:3000/inference" \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_haiku",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "Write a haiku about TensorZero."
        }
      ]
    }
  }'

or else its own special function_name design (I will explain this later):

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3000/openai/v1",
});

const response = await client.chat.completions.create({
  model: "tensorzero::function_name::generate_haiku",
  messages: [
    {
      role: "user",
      content: "Write a haiku about TensorZero.",
    },
  ],
});

console.log(JSON.stringify(response, null, 2));

But TensorZero itself is OpenAI API compatible:

const url = "http://awesome-tensorzero-service:3000/openai/v1/chat/completions"
const headers = {
  "Authorization": `Bearer ANY`,
  "Content-Type": "application/json"
}
const payload = {
  "model": "tensorzero::model_name::openrouter::openai/gpt-oss-20b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hi" },
  ],
  "modalities": ["text"],
}

const response = await fetch(url, {
  method: "POST",
  headers,
  body: JSON.stringify(payload),
});

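As long as the API key for a given provider is set in an environment variable (for example OPENROUTER_API_KEY), you can call the corresponding model directly using this format:

tensorzero::model_name::<PROVIDER>::<MODEL_NAME>

PROVIDER must be one that TensorZero implements; see the official documentation for the full list⁴.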
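As a small illustration of the naming format, here is a sketch of a helper that assembles the model string. The helper name `tensorzeroModelName` is hypothetical; only the `tensorzero::model_name::` prefix and the example provider/model come from the text above.

```typescript
// Build the model string passed in the OpenAI-compatible "model" field.
function tensorzeroModelName(provider: string, modelName: string): string {
  return `tensorzero::model_name::${provider}::${modelName}`;
}

// The example used earlier in this post: OpenRouter serving openai/gpt-oss-20b.
console.log(tensorzeroModelName("openrouter", "openai/gpt-oss-20b"));
// tensorzero::model_name::openrouter::openai/gpt-oss-20b
```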

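Missing Reasoning

After using it for a while, the first problem I noticed was the missing Reasoning: it did not show up in the observability records, and it was not returned even when the client request carried reasoning parameters.

When I ran into the problem, the developers had only just triaged it:

Add reasoning support to OpenRouter model provider · Issue #5703 · tensorzero/tensorzero

A few days later, at the time of writing (2026-01-22), it appears to have been resolved.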

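Advanced usage

info

Apart from the basic usage, I have not actually tested the other features myself; this is purely my understanding from reading the documentation.

info

Everything below uses the OpenAI API as the example, since it is the de facto industry standard. I have no interest in studying TensorZero's own API format.

The first way, mentioned above, is to call a model directly through a built-in provider:

Custom models

The second way is to configure a custom model in the configuration file:

A custom model can have multiple providers/routes that act as fallbacks: when one provider fails, the next configured route is used automatically. For example: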

[models.my_gpt_5]
routing = ["my_openai_provider", "my_azure_provider"]

[models.my_gpt_5.providers.my_openai_provider]
type = "openai"
model_name = "gpt-5"

[models.my_gpt_5.providers.my_azure_provider]
type = "azure"
deployment_id = "gpt-5"
endpoint = "https://your-resource.openai.azure.com"
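The fallback behavior described above can be pictured as "try each route in order until one succeeds". The following is only a conceptual sketch of that idea, not TensorZero's actual implementation; the `Provider` type, `callWithFallback`, and the two fake providers are all hypothetical.

```typescript
// A provider is reduced to a name plus a function that may throw on failure.
type Provider = { name: string; call: (prompt: string) => string };

// Try providers in routing order; return the first successful result.
function callWithFallback(
  routing: Provider[],
  prompt: string,
): { provider: string; output: string } {
  const errors: string[] = [];
  for (const provider of routing) {
    try {
      return { provider: provider.name, output: provider.call(prompt) };
    } catch (err) {
      // Remember the failure and move on to the next route.
      errors.push(`${provider.name}: ${String(err)}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}

// Fake providers standing in for the two routes in the TOML above.
const myOpenaiProvider: Provider = {
  name: "my_openai_provider",
  call: () => {
    throw new Error("simulated outage");
  },
};
const myAzureProvider: Provider = {
  name: "my_azure_provider",
  call: (prompt) => `echo: ${prompt}`,
};

const result = callWithFallback([myOpenaiProvider, myAzureProvider], "Hi");
console.log(result.provider); // my_azure_provider
```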

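Custom functions

The third way is to configure a custom function in the configuration file:

A function can use a standard provider and model, or a custom model as described above. Besides fallbacks, functions can also be used to set up experiments (Experimentation) for A/B testing.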
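To make the A/B testing idea concrete, here is a generic sketch of splitting traffic between two variants by weight. This is an assumption about how such an experiment could work in general, not a description of TensorZero's sampling logic; `Variant`, `pickVariant`, and the variant names are hypothetical.

```typescript
type Variant = { name: string; weight: number };

// Pick a variant given a random number r in [0, 1).
// Each variant receives a share of traffic proportional to its weight.
function pickVariant(variants: Variant[], r: number): string {
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  let threshold = r * total;
  for (const v of variants) {
    if (threshold < v.weight) return v.name;
    threshold -= v.weight;
  }
  // Guard against floating-point edge cases.
  return variants[variants.length - 1].name;
}

const variants: Variant[] = [
  { name: "prompt_a", weight: 0.9 },
  { name: "prompt_b", weight: 0.1 },
];

console.log(pickVariant(variants, Math.random()));
```

Passing the random number in as a parameter keeps the function deterministic and easy to test.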

1. Sizing and hardware recommendations | ClickHouse Docs. Retrieved 2026-01-22, from https://clickhouse.com/docs/guides/sizing-and-hardware-recommendations
2. OSS usage recommendations | ClickHouse Docs. Retrieved 2026-01-22, from https://clickhouse.com/docs/operations/tips
3. Configure ClickHouse® for low memory environments | Altinity® Knowledge Base for ClickHouse®. Retrieved 2026-01-22, from https://kb.altinity.com/altinity-kb-setup-and-maintenance/configure_clickhouse_for_low_mem_envs/
4. Overview - TensorZero Docs. Retrieved 2026-01-22, from https://www.tensorzero.com/docs/integrations/model-providers

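Bifrost pitfall notes

January 18, 2026

Wei Ji

Engineering zombie with shiny object syndrome

I tried Bifrost for about nine days and saw a high request failure rate (~6%):

A few of the failures can be mitigated by adjusting application-level timeout settings, but most are problems in the implementation itself. The next thing I plan to try is TensorZero, but before switching I want to record the Bifrost issues.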

`i/o timeout`

This bug appears to have been on file for a month, but it has not been fixed yet:

[Bug]: Getting I/O timeout · Issue #1157 · maximhq/bifrost

WebSocket errors

The Logs page is supposed to refresh automatically, but for some reason it does not work, and the backend log is full of error messages:

{"level":"error","time":"2026-01-18T01:03:28Z","message":"websocket upgrade error: websocket: request origin not allowed by FastHTTPUpgrader.CheckOrigin"}

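LLM observability notes

January 15, 2026

Wei Ji

Engineering zombie with shiny object syndrome

I observed something interesting today, so here is a note.

The original request: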

 You will have follow up questions. First, identify if your knowledge is outdated (high chance). Today: 2026-01-15. Generate 3 high-quality internet search questions to exactly answer: Which text structure is better for LLM prompts?
- Markdown based
- XML based
Consider the case with a task have multiple output, output blocks with XML tags seems is more reliable approach?

Language Rules:
Process the task in English (sort questions, search with English, search ranking...). 
Only final response (the task is doing Summary) with Chinese (zh_TW).


Format: One question per line, e.g. 
 Q: question1 
 Q: question2

The LLM's answer:

以下是三個高品質的搜尋問題，適用於查詢 Markdown 與 XML 在 LLM 提示語中的可靠性比較：

Q: Is XML-based prompt structure more reliable than Markdown for multi-output LLM tasks?  
Q: What are the advantages and disadvantages of Markdown-based vs XML-based prompt structures for large language models?  
Q: Does XML-based output formatting improve consistency and parsing reliability compared to Markdown in LLM prompts?

However, the OpenAI API also exposes the reasoning process (Reasoning):

We need to respond: "You will have follow up questions. First, identify if your knowledge is outdated (high chance). Today: 2026-01-15. Generate 3 high-quality internet search questions to exactly answer: Which text structure is better for LLM prompts? - Markdown based - XML based Consider the case with a task have multiple output, output blocks with XML tags seems is more reliable approach? Language Rules: Process the task in English (sort questions, search with English, search ranking...). Only final response (the task is doing Summary) with Chinese (zh_TW). Format: One question per line, e.g. Q: question1 Q: question2"

We need to produce 3 search questions. They must be high-quality, for search queries. Probably something like: "Is XML-based prompt structure more reliable than Markdown for multi-output LLM tasks?" etc.

We need to output only the final response with Chinese (zh_TW). That means the final output should be Chinese? But the questions themselves are in English. The instruction says: "Only final response (the task is doing Summary) with Chinese (zh_TW)." Wait. The user wants 3 high-quality internet search questions to exactly answer the query. But then the system says only final response with Chinese. That seems contradictory. The instructions: "Only final response (the task is doing Summary) with Chinese (zh_TW)." So the final response should be in Chinese. But the format: "One question per line, e.g. Q: question1 Q: question2". That format presumably remains. But the content is in English? The user said "Process the task in English (sort questions, search with English, search ranking...). Only final response (the task is doing Summary) with Chinese (zh_TW)."

So we need to produce the final response in Chinese. But the questions themselves? The example shows "Q: question1" etc, presumably in English. But the final response should be in Chinese. Maybe we can write the questions in Chinese? But the instruction says "Process the task in English". That might refer to sorting, search queries, etc. But the final output should be Chinese. So maybe we produce the questions in Chinese? Or we produce them in English but the final response is Chinese? The example is in English.

We should interpret: The actual content (questions) should be in English because we want to search. But the final response (the entire output) should be in Chinese. So we can say:

Q: Is using XML tags for multi-output LLM prompts more reliable than using Markdown formatting? 
Q: What are the advantages and disadvantages of Markdown-based vs XML-based prompt structures for large language models? 
Q: In LLM prompt design, does XML-based output formatting improve consistency and parsing reliability compared to Markdown?

But we must output this in Chinese. Maybe we can translate the entire set into Chinese? But then the search queries would be in Chinese, which might not be as effective for search. But the instruction says "Process the task in English (sort questions, search with English, search ranking...). Only final response (the task is doing Summary) with Chinese (zh_TW)." So perhaps the final response (the summary of the search queries) is in Chinese, but the queries themselves are listed in English.

Thus we produce:

Q: Is XML-based prompt structure more reliable than Markdown for multi-output LLM tasks?

But maybe we need to phrase them as search queries. So maybe like:

Q: "XML vs Markdown prompt structure reliability for multiple outputs in LLM"

Q: "Advantages of XML tags in LLM prompt design over Markdown formatting"

Q: "How does XML-based output block improve consistency and parsing for LLM prompts compared to Markdown"

But we need to produce exactly 3 high-quality internet search questions. We'll produce them as:

Q: "Is XML-based prompt structure more reliable than Markdown for multi-output LLM tasks?"
Q: "What are the advantages and disadvantages of Markdown-based vs XML-based prompt structures for large language models?"
Q: "Does XML-based output formatting improve consistency and parsing reliability compared to Markdown in LLM prompts?"

Then we output in Chinese: "以下是三個高品質的搜尋問題..." But we need to follow the format: one question per line, prefixed with Q:. That format likely stays in English. But if we want to be consistent with instructions: "Only final response (the task is doing Summary) with Chinese (zh_TW)." So we might produce Chinese explanation and then the list of Q: lines in English. But the output must be Chinese. So maybe we can put the questions in Chinese too. But then they'd not be English search queries. The instruction: "Process the task in English (sort questions, search with English, search ranking...). Only final response (the task is doing Summary) with Chinese (zh_TW)." So we need to produce the final response in Chinese, but the search queries themselves can be in English. But we need to output them as part of the final response. The final response should be in Chinese, but we can include English lines. It's ambiguous.

Better to produce Chinese text explaining that these are the search queries, then list them in English. That keeps final response Chinese. So:

以下是三個高品質的搜尋問題，適用於查詢 Markdown 與 XML 在 LLM 提示語中的可靠性比較：

Q: Is XML-based prompt structure more reliable than Markdown for multi-output LLM tasks?
Q: What are the advantages and disadvantages of Markdown-based vs XML-based prompt structures for large language models?
Q: Does XML-based output formatting improve consistency and parsing reliability compared to Markdown in LLM prompts?

We can do that. That satisfies the instructions.

But then the system says only final response with Chinese. That seems contradictory.

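This suggests that the current prompt may not be intuitive enough for the LLM. Reasoning content is basically invisible in ordinary LLM applications, but it appears to be an important tool for debugging prompts.

LLM observability notes

January 8, 2026

Wei Ji

Engineering zombie with shiny object syndrome

Since self-hosting LLM observability tooling (2025-10-05), I have tried several LLM tools and gathered some observations. I did not post them at the time because I felt they lacked rigorous verification. Now I have arguably moved on to the next stage (?), and I probably have no plan to run rigorous experiments, so I will just publish those notes as a filler post.

LLM observability

Here is a brief explanation of what "LLM observability" is, for readers who have no idea what is going on. Allow me to reuse some cards I made earlier: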

info

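A "character card" is a data package used by role-playing LLM applications. It is essentially text containing character settings, world-building, dialogue templates, and so on.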

AnythingLLM

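AnythingLLM is a very popular LLM application. But as I said earlier, an LLM is only a component; to make it useful, the key is whether the application fetches the right data or information to feed it. So "the application's ability to connect to external sources" is the focus of my testing.

This is my test question. I know the correct answer, and I know it is not easy; I just want to observe how these LLM applications handle the task:

Search for the history of this project for me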

https://github.com/ill-inc/biomes-game

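The result, of course, did not meet my expectations, so I opened my observability tool to check what had just happened.

We can see that AnythingLLM triggered a "web scraper tool" based on my input:

And then we can see that it simply scraped a pile of junk from GitHub and started bluffing its way through an answer to that question.

info

The correct answer requires using ill-inc to find the company Global Illumination, Inc., then finding the news that OpenAI acquired it in 2023, and finally finding on the project's documentation site that after the acquisition the project was released by the team and is no longer maintained. I know this question is hard, so I later stopped using it for testing, also because I found other characteristics that matter more for LLM software.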

Perplexica

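Perplexica is an LLM application that bills itself as an open-source alternative to Perplexity AI, which in turn is a SaaS (Software as a Service) that sells itself on factual accuracy.

On the same question it performed just as poorly as AnythingLLM. But beyond the external-connection mechanism, I noticed another, more suspicious behavior:

It sets the role of every message in the conversation history to assistant! In the OpenAI API design there are at least three roles: user, assistant, and system.

As a result the LLM cannot actually tell who said what in the history, because it is all the assistant talking to itself.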
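To illustrate the difference, here is a sketch comparing a properly attributed history with one where every message has been collapsed into the assistant role. The message contents are invented for illustration, and `distinctRoles` is a hypothetical helper of the kind one might run over logged requests in an observability tool.

```typescript
type Role = "system" | "user" | "assistant";
type ChatMessage = { role: Role; content: string };

// List the distinct roles that appear in a conversation history.
function distinctRoles(history: ChatMessage[]): Role[] {
  return Array.from(new Set(history.map((m) => m.role)));
}

// A history where each speaker keeps its own role.
const attributed: ChatMessage[] = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "What is this project about?" },
  { role: "assistant", content: "It is a sandbox game." },
];

// The same exchange with every message labeled as assistant.
const collapsed: ChatMessage[] = attributed.map((m) => ({
  role: "assistant" as Role,
  content: m.content,
}));

console.log(distinctRoles(attributed)); // [ 'system', 'user', 'assistant' ]
console.log(distinctRoles(collapsed)); // [ 'assistant' ]
```

A history that contains user input but reports only one role is a quick signal that attribution has been lost before the request ever reaches the model.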

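In fact, when I use Google Gemini, this happens a lot: the LLM produces some conclusion X, and later in the conversation says "as you said, X", putting words in my mouth. I strongly suspect this is caused by similar role-blind prompt engineering.

Local Deep Research

Local Deep Research is a project with only 3.8k stars (as of 2026-01-08), but I love it for its rigor.

I noticed this trait by accident: I had misconfigured the search engine, so it could not use one at the time:

But look at how it answered the user:

It knows to report to the user that it currently has no external connection and no additional reference material, instead of trying to bluff.
This was even for "What is the speed of light?", a simple question most LLMs can handle. It also knows to self-critique that the answer has no supporting references.
I was using the ordinary openai/gpt-oss-20b, which inherits GPT's talent for confident nonsense; it is not a specially tuned LLM.

We can see that, unlike other LLM software, it does not just stuff the user input and its own generated data into the conversation history defined by the OpenAI API. Instead it properly splits the task into multiple requests:

Another example, from when I was studying the VR ecosystem, comparing contradictory statements across multiple sources:

So I can simply hand it topics that need verification and genuinely reduce my cognitive load, instead of constantly worrying that the LLM is bluffing me: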

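Conclusion

Prompt engineering and a sound software architecture can in fact handle the LLM "hallucination problem" effectively, yet the LLM software on the market today is heading in exactly the opposite direction:

A chat-style interface is by nature an environment where context is extremely uncontrolled, which makes it easy for the LLM to produce garbage.
Rigorous prompts could make the LLM respond more objectively, but most software prefers to please consumers, endlessly flattering and bluffing the user.
A one-shot style of software could provide a more stable service, but everything has to be a chatbot to fool investors and users.

Even setting aside closed-source SaaS offerings (ChatGPT, Gemini...), crowd-pleasing implementations like AnythingLLM or Perplexica win more applause (stars), while Local Deep Research, which honestly solves the problem, is left out in the cold.

What I see through LLM observability tools is not an evil LLM, but rather an industry's malicious business decisions and people's collective madness.

Afterword

An engineer once asked me: since I want to observe prompt engineering, why not just read the code? After all, those self-hosted applications are open source.

Because compared with exploring a possibly unfamiliar programming language, understanding the source structure, and finally locating the prompt templates and related implementation, observing the interface directly is much simpler:

When I send one command, how many times does the application call the LLM to handle it?
- A complex question handled with a single call means the architecture of the implementation is flawed.
What input does each call receive?
- "Garbage in, garbage out": if the LLM gives a bad answer, nine times out of ten the application gave it bad input.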
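The first question above ("how many calls per command?") is easy to answer once requests are logged. A minimal sketch, assuming each logged call carries an identifier for the user command that triggered it; the `LoggedCall` shape and the sample records are hypothetical and do not correspond to any specific tool's export format.

```typescript
// One logged LLM request, tagged with the user command (trace) it belongs to.
type LoggedCall = { traceId: string; prompt: string };

// Count how many LLM calls each user command triggered.
function callsPerTrace(calls: LoggedCall[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const call of calls) {
    counts.set(call.traceId, (counts.get(call.traceId) ?? 0) + 1);
  }
  return counts;
}

// Invented sample: one command answered in a single call, another split into three.
const logged: LoggedCall[] = [
  { traceId: "chat-app", prompt: "whole conversation in one request" },
  { traceId: "research-app", prompt: "generate search questions" },
  { traceId: "research-app", prompt: "summarize search results" },
  { traceId: "research-app", prompt: "write final report" },
];

const counts = callsPerTrace(logged);
console.log(counts.get("chat-app"), counts.get("research-app")); // 1 3
```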

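Background

I later learned this was a bug: LiteLLM writes the data to the database in plaintext but then tries to decrypt it with the salt¹. By then I had already spent hours troubleshooting and going in circles with an LLM, which sent my blood pressure through the roof.

Last week (2025-09-28) I struggled for a while and still could not integrate LiteLLM with LangFuse.

Main text

By configuring the callback purely through parameters instead of in the GUI, I managed to integrate LiteLLM with LangFuse and deploy it to my Homelab. I then ran AnythingLLM to see how well prompts are intercepted. The whole LLM call chain now looks like this:

Application → LiteLLM → LangFuse | OpenRouter → DeepInfra → DeepSeek:DeepSeek V3.1 model

Application

https://github.com/Mintplex-Labs/anything-llm
- 48.7k ⭐

This is any application that calls an OpenAI-compatible API as a client. I used AnythingLLM for the experiment, but it can also be a very simple HTTP call, for example: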

const url = "http://web.litellm.arachne/v1/chat/completions"
const headers = { "Authorization": `Bearer ${env.LLM_GATEWAY_API_KEY}`, "Content-Type": "application/json" }
const payload = {
  "model": "deepseek/deepseek-chat-v3.1:free",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "modalities": ["text"],
}

const response = await fetch(url, {
  method: "POST",
  headers,
  body: JSON.stringify(payload),
});

LLM Gateway

https://github.com/BerriAI/litellm
- 28.6k ⭐

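I use LiteLLM as my LLM gateway. It acts as a proxy server sitting between the application and the real LLM API.

What is it good for? It can store the real API keys issued by third-party services (such as OpenAI or Google) and hand out gateway-generated API keys to other people in an organization or team, so each person's traffic and quota can be managed and controlled separately. It also lets you centralize and unify multiple LLM sources in one place, among other benefits.

My purpose, though, is simply to record context and prompts, so I can analyze whether a problem lies in the application's prompt engineering or in the performance of the LLM itself.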

LLM Observability

https://github.com/langfuse/langfuse
- 15.9k ⭐

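Just as Grafana Loki stores logs and provides observability, LLM observability focuses on the inputs and outputs of the LLM API; these records become an important basis for tuning RAG or fine-tuning parameters. Most "free-to-use LLMs" are doing the same thing, using free riders as tools to produce data for them, so why not keep a copy locally first?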

OpenRouter

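OpenRouter also acts as a gateway. Besides load-balancing across multiple providers to improve availability, as an LLM API proxy sitting between you and the provider it works a bit like a VPN, making it harder for the provider to single out a user.

LLM Provider

OpenAI, Google, and Anthropic are all LLM providers: you pay them, and they give you LLM service.

DeepInfra differs in that it positions itself purely as "LLM infrastructure". For example, DeepSeek:DeepSeek V3.1 is an open-weight LLM. If you have the hardware, you can download the model and run it on your own GPUs, but that may cost too much, so you outsource. DeepInfra fills that role.

Conclusion

One LLM provider may host many models, and one gateway can proxy many LLM providers. Models should not be closed-source and irreplaceable, and neither should LLM providers be unique and irreplaceable, because that leads to vendor lock-in. Diversity is life's way out and the road to freedom, even if it may be an elaborate way of shooting yourself in the foot.

As for closed-source tools? As long as one supports a third-party LLM API, I will use observability tools to dig out its prompts and take a look.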

[Bug]: LangFuse callback failed to config in the LiteLLM Proxy Admin UI Panel · Issue #14854 · BerriAI/litellm. Retrieved 2025-10-01, from https://github.com/BerriAI/litellm/issues/14854 ↩

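2025-09-27 A weekend of (technical) tinkering

October 1, 2025

Wei Ji

Engineering zombie with shiny object syndrome

info

This filler post has no real narrative structure; I am just jotting down some weekend experiences.

Skill tree / tech tree

I recently watched a video about Factorio¹:

But because of my "open-source covenant" I cannot just play Factorio.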

Details

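Title: Open-source covenant (vow)
Number: Vision No. 00014
Level: action-level vision

Statement: I want to use only open-source software.

Description: As Marx criticized, when land is held by landlords and the means of production are held by business owners, inequality between people arises and a privileged class emerges. Yet that is the nature of the material world: every product has a production cost, and it is only natural that those who produce it, or plan its production, expect a return. When we look at "human knowledge", however, a different model applies. When we learn "Newton's second law" in basic education, the textbook does not say "Newton © Second Law™", because this knowledge is treated as a public good, and the cost of passing knowledge on is far lower than that of a product. There is no such thing as "here are 100 copies of Newton's second law, so only 100 people may use it".

Software companies hire workers to produce software, sell it, and go to great lengths to lock in consumers through all kinds of design and scheming. This is the norm under capitalism. Weaker economies that cannot develop their own software can only be the exploited party in the international trade order, because the key production inputs they need to raise productivity can only come from the first world.

Open-source software introduces new rules within this structure, turning "software as a commodity" into "human knowledge". Unlike the "malice" of companies that try to hold users hostage (commercial software), it is "goodwill" from the first world: surplus capacity that the first world voluntarily offers to all of humanity. Open-source software can ease inequality between people and the inequality caused by international trade.

I consider myself to have been born into a family that was not well off, and I have used pirated software. But ever since I grabbed the olive branch of open source, I have believed it is a path on which one can live fairly and honorably even when poor.

Level calculation:

Scale of input: carrying out the open-source covenant, action level.
Scale of impact: affects an individual, action level.
Combined level: action level.

So... why not turn the relationships in my HomeLab into a Factorio-style tech tree! After some searching, someone really has built something similar: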

https://github.com/andrico1234/beautiful-skill-tree
- 383 ⭐

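But after installing it I could not import it properly with Vite, probably because the package.json format is not supported; after all, the project's main code has not been updated in five years.

Later I cloned it and tried patching it myself. Once it ran, the result was not that great.

Background

I recently started playing with LLMs for work, but when an agent does something dumb I cannot tell whether it is the limit of the model's ability or the fault of the prompt engineering, so I figured I needed some observability.

At first I only found Langfuse. It can indeed record requests and responses, but it relies on its SDK wrapping the calls to other LLM APIs. With agent tools such as Cline, there is no easy way to inject the Langfuse SDK; what I need is something that can act as a proxy server between the tool and the LLM API.

Next I found LiteLLM. It can indeed act as a proxy between the tool and the LLM API and supports the OpenAI-compatible interface (currently the de facto industry standard for LLM APIs), but the problem is that it cannot record detailed request information, so I have no way of knowing

LiteLLM x Langfuse

Later I found that a callback can be configured in LiteLLM to integrate Langfuse², so I decided to give it a try over the weekend.

But LiteLLM kept complaining: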

Error decrypting value for key: LANGFUSE_SECRET_KEY, Did your master_key/salt key change recently?

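But I had not even set up any persistence, and LANGFUSE_SECRET_KEY was being added for the first time, so how could there be a "salt changed" problem?

I later learned this was a bug: LiteLLM writes the data to the database in plaintext but then tries to decrypt it with the salt³. By then I had already spent hours troubleshooting and going in circles with an LLM, which sent my blood pressure through the roof.

LLM acts dumb → try to build LLM API observability → hit a problem → try to troubleshoot with an LLM → LLM keeps acting dumb, aaaaargh (#╯O皿O)╯┻━┻

Alternatives

Fine, this road is blocked, so surely I can take another one? I then tried these options one after another: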

https://github.com/Helicone/helicone
- 4.5k ⭐
https://github.com/langwatch/langwatch
- 2.5k ⭐
https://github.com/theopenco/llmgateway
- 504 ⭐
https://github.com/lunary-ai/lunary
- 1.4k ⭐
https://github.com/comet-ml/opik
- 14.2k ⭐
https://github.com/Arize-ai/phoenix
- 7.1k ⭐
https://github.com/tensorzero/tensorzero
- 10.4k ⭐

Helicone

It has an "All in one" Docker image⁴:

docker pull helicone/helicone-all-in-one:v2025.08.08

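The problem is that my uplink is a wireless (mobile) network. With registry rate limiting on top of pulling large images, error messages like this are routine for me: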

Error: copying system image from manifest list: writing blob: storing blob to file "/var/tmp/container_images_storage2694550315/6": happened during read: Digest did not match, expected sha256:ceaab54d6be3e54965b66b4b01da68ebc45a6f4efeebb16cad2270a35281db35, got sha256:8aa0a595a4e118e8633d634f1b061f8824856dfbfdef65801198a647dfc00ab1

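I do still have ways to pull it, but the data cannot get into my local mirror; that is another story. In short, solutions with bloated images are not viable for me.

The official documentation had nothing beyond "All in one", so I browsed its GitHub and found: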

  migrations:
    build:
      context: ../
      dockerfile: docker/dockerfiles/dockerfile_migrations
  jawn:
    container_name: helicone-jawn
    build:
      context: ../
      dockerfile: valhalla/dockerfile
  web:
    container_name: helicone-web
    build:
      context: ../
      dockerfile: docker/dockerfiles/dockerfile_web
  worker-openai-proxy:
    image: worker-local
    build:
      context: ../worker
      dockerfile: docker/dockerfiles/dockerfile_worker
  worker-helicone-api:
    image: worker-local
    build:
      context: ../worker
      dockerfile: docker/dockerfiles/dockerfile_worker

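No prebuilt images!? (#╯O皿O)╯┻━┻

※ The account on Docker Hub and the Helm chart for K8s actually held clues; I just did not think of them at the time.

LangWatch

Four images averaging over 1 GB each, the largest at 2 GB. After a few failed downloads I gave up.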

LLM Gateway

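It ran successfully after downloading, but there were quite a few problems:

No built-in OpenRouter
It can be added through Custom Providers, but the UI only allows adding one (not sure if this is a bug)
Adding, deleting, then adding again causes an ID collision error; this is probably a bug
Confirmed that it can proxy requests out, but no records are kept
- The settings include parameters like POSTHOG_HOST
- Logging contains Google Log API information
- My guess is that persisting observability logs may not be part of its implementation at all and depends on external (non-OSS, commercial) services.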

Lunary

The Docker setup is available only with Lunary Enterprise Edition⁵

Okay then.

Opik

Because ghcr.io/comet-ml/opik/opik-guardrails-backend is too big, I could never finish pulling it and gave up.

Phoenix

The official documentation has a mysterious frontend and backend⁶:

  backend:
    build:
      context: ./backend
      dockerfile: Dockerfile
  frontend:
    build: frontend

This time I checked the Docker Hub account first; I did not see matching images, so I skipped it.

TensorZero

Its configuration is not very intuitive:

# A function defines the task we're tackling (e.g. generating a haiku)...
[functions.generate_haiku]
type = "chat"

# ... and a variant is one of many implementations we can use to tackle it (a choice of prompt, model, etc.).
# Since we only have one variant for this function, the gateway will always use it.
[functions.generate_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"

And it has a strange API that does not follow OpenAI's:

curl -X POST "http://localhost:3000/inference" \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_haiku",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "Write a haiku about artificial intelligence."
        }
      ]
    }
  }'

Johnny Decimal

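While waiting for images to download, I might as well use the time for something. I have been thinking about using Johnny Decimal to structure the data on my drives for a long time, so let's look into it!

So while the images downloaded I chatted with an LLM, and we ended up with this: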

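00-09	Governance & Strategy	High-level decisions, goal setting, company culture, and regulatory compliance.
Governance & Charter	Articles of incorporation, shareholder agreements, organizational structure documents
Board & Meetings	Board member list, meeting agendas and minutes, resolutions
Executive Strategy	Annual/long-term strategic goals, annual business plans, execution reports
Enterprise Risk Management	Enterprise risk management framework, internal review reports (not finance-specific). All non-financial compliance documents should move to 64 Compliance (Finance & Legal). 04 should focus on the macro risk framework (such as natural disasters and major systemic risks).
10-19	Core Operations & Delivery	Production, delivery, and maintenance of products/services (including hardware).
SOPs & Work Instructions	Standard operating procedure documents and work guides for each department
Production Management	Production scheduling, material requirements planning (MRP), output reports
Quality Control & QA	Product/service inspection standards, nonconformance reports, ISO certification documents
Logistics & Warehousing	Shipping orders, inventory management, transport arrangement documents
Equipment Maint. & Calib.	Periodic maintenance records and calibration reports for plant or major production equipment
20-29	Market & Customer Acquisition	External communication, brand building, revenue generation, sales.
Marketing Strategy & Plans	Annual marketing plan, budget, brand guidelines
Market Research & Analysis	Competitor analysis, consumer research reports, market trend reports
Content & Collateral	Product catalogs, brochures, website content, advertising asset library
Sales Operations	Sales forecasts, performance reports, commission plans, sales tools
Customer Relationship Management (CRM)	Key customer files, customer satisfaction surveys, sales process documents
30-39	IT & Infrastructure	Digital and physical supporting systems, networks, databases.
Infrastructure	Servers, network equipment, server room layout
Systems & Applications	Internal software, ERP, CRM system data
Helpdesk & Support	Service requests, troubleshooting documents, SLA
Information Security	Security policies, intrusion detection reports, backup plans
IT Projects	Ongoing major system upgrades, new system rollout documents
40-49	Innovation & Development	Creation and R&D of future products/services.
R&D Strategy & Projects	R&D budget, long-term technology roadmap, project kickoff documents
Concepts & Specifications	Product requirements documents (PRD), design specifications, feature blueprints
Prototypes & Testing	Test plans, experiment results, prototype design documents
Intellectual Property (IP)	Patent applications, trademark documents, technical document protection
Launch & Lifecycle	Launch plans, post-release evaluations, product retirement documents
50-59	Supply Chain & External Resources	Raw material procurement, supplier relationships, inventory management.
SCM Strategy & Mgt	Supply chain risk assessment, long-term supply strategy
Supplier Management	Supplier contracts, performance evaluations, approved supplier list
Direct Material Procurement	Purchase orders (PO), requests for quotation (RFQ), material specifications
Indirect Material Procurement	Non-production consumables, service procurement documents (such as consulting services)
Import/Export & Customs	Customs declarations, trade regulations, certificates of origin
60-69	Finance, Accounting & Legal	Cash flow management, budget, assets, compliance.
Accounting & General Ledger	Daily transactions, chart of accounts, accounting policies
Budgeting & Planning	Annual budget documents, forecasting models
Taxation & Audit	Tax filings, internal/external audit reports
Legal & Contracts	Legal counsel documents, standard contract templates
Compliance	Industry regulations, privacy policies, government reports
Assets & Fixed Capital	
70-79	Human Capital	Employee health, training, benefits, emotional support.
Recruitment & Onboarding	Job descriptions, interview guides, onboarding documents
Payroll & Benefits	Payroll calculation, health/labor insurance documents, benefit plans
Learning & Development (L&D)	Training course materials, performance review documents
Policies & Manuals	Employee handbook, code of conduct, leave rules
80-89	Measurement & Performance	KPI tracking, data analysis, feedback mechanisms.
90-99	General & Personal	General documents that cannot be classified, personal data.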

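And this:

SOPs	Household SOPs and rules (e.g. dishwashing SOP, laundry guide, emergency checklist)
Production Mgt	Daily scheduling and execution (e.g. daily/weekly chore roster, task tracking)
QC & QA	Service quality checks and standards (e.g. cleaning checklist, meal satisfaction feedback)
Logistics	Inventory and storage management (e.g. food and consumables stock, storage space map)
Equipment Maint.	Hardware maintenance records (e.g. appliance warranty cards, periodic house inspections, repair records)
Accounting & General Ledger	Household ledger, monthly/annual income and expense statements, transaction records.
Budgeting & Planning	Annual budget, emergency fund targets, planning for large expenses.
Investment portfolio & plans	Investment accounts, investment strategy, performance records.
Insurance & risk protection	Medical, life, and property insurance policies, claim documents.
Tax & legal documents	Annual tax filings, legal contracts, wills, powers of attorney.
Real estate assets	Property ownership documents, mortgage records, property tax bills, lease agreements.
Movable property & net worth	Vehicles, valuable collections, list of major furniture, household net worth sheet.
Debt & loan management	Mortgage, car loan, other borrowings, repayment records.
Member recruitment & onboarding	Marriage certificate, birth documents, SOP for new members joining the household (onboarding), background information.
Basic health management	Family medical history, vaccination records, health check reports.
Emotional & psychological support	Emotional feedback mechanism (questionnaire), conflict resolution SOP, counseling records.
Education & personal development	Study plans, course enrollment, skill tree tracking, résumé documents.
Member lifecycle & records	Identity documents, summary of personal responsibilities, documents for members leaving the household (e.g. becoming independent as an adult, passing away).
Benefits & leisure plans	Travel planning, leisure activity list, annual vacation schedule.
Crisis & emergency preparedness	First-aid kit checklist, emergency contact list, disaster response plan.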

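Did something strange get mixed in? Oh, that is yet another story, about the "family run as a company".

As I went on, I felt I should handle this in a spreadsheet. Ah, time to install an office suite on my Homelab.

Web Office

I had already tried setting up Collabora Online, but it is a headless server that needs to be integrated with something like Nextcloud, and I am not very interested in "All in One" NAS-style services, so I shelved it.

This time let's try ONLYOFFICE, which means yet another chunky image to download.

After downloading, I found:

You're headless too!? (#╯O皿O)╯┻━┻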

1. E X P A N D I N G to THE STARS in Factorio - YouTube. Retrieved 2025-10-01, from https://youtu.be/hYcAsNCH7Zc?si=y-izHtLKDbKaoEqk
2. Open Source Observability for LiteLLM Proxy - Langfuse. Retrieved 2025-10-01, from https://langfuse.com/integrations/gateways/litellm
3. [Bug]: LangFuse callback failed to config in the LiteLLM Proxy Admin UI Panel · Issue #14854 · BerriAI/litellm. Retrieved 2025-10-01, from https://github.com/BerriAI/litellm/issues/14854
4. Docker - Helicone OSS LLM Observability. Retrieved 2025-10-01, from https://docs.helicone.ai/getting-started/self-host/docker
5. Docker - Lunary API Reference. Retrieved 2025-10-01, from https://docs.lunary.ai/docs/more/self-hosting/docker
6. Docker | Arize Phoenix. Retrieved 2025-10-01, from https://arize.com/docs/phoenix/self-hosting/deployment-options/docker
