Technology Enthusiast Weekly (Issue 390): Large Language Models Are Useless Without Training Data

📅 2026-05-14 00:47:18 👤 DouWen Editorial 💬 8 comments 👁 6

This newsletter records tech content worth sharing each week and is published on Fridays.

This magazine is open source and welcomes contributions. There is also a "Who's Hiring" service for posting programmer job openings. For partnership inquiries, please get in touch by email ([email protected]).

Cover Image

A colorful covered walkway in a residential community in Rizhao, Shandong Province, with a coffee shop set among the trees at the entrance. (via)

Without Corpus, Large Models Are Idiots

If we conducted a survey asking people "Do you think large models possess intelligence?", I believe most would answer yes.

Even at this early stage of AI, large models can already replace many forms of human intellectual labor, which is truly remarkable.

However, we must not forget the reality: large models are not magic, and certainly not "silicon-based intelligent entities" with autonomous intelligence. They are language models built on statistical patterns, and everything they do comes down to mathematical computation.

The best evidence for this is that if you ask them to solve problems they haven't been trained on—that is, problems where no statistical patterns exist—they cannot solve them at all.

This is what I want to share today: an experiment.

Two foreign researchers selected five mainstream large models: GPT-5.2, O4-mini, Gemini 3 Pro, Qwen3-235B, and Kimi K2.

They asked these models to write programs in five obscure programming languages: Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare.
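To give a sense of how unusual these languages are: Brainfuck has only eight single-character commands, all operating on a tape of byte cells. The interpreter below is a rough sketch in Python and is not part of the experiment. The function name `run_brainfuck`, the 30,000-cell tape, the 8-bit wraparound, and the treatment of end-of-input as zero are conventional choices assumed here for illustration.

```python
def run_brainfuck(program: str, input_data: str = "") -> str:
    """Interpret a Brainfuck program and return everything it prints."""
    # Pre-compute matching bracket positions so loops can jump directly.
    jump = {}
    stack = []
    for pos, ch in enumerate(program):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                raise ValueError("unmatched ']'")
            start = stack.pop()
            jump[start] = pos
            jump[pos] = start
    if stack:
        raise ValueError("unmatched '['")

    tape = [0] * 30000      # assumed tape size (a common convention)
    ptr = 0                 # data pointer
    pc = 0                  # program counter
    inp = iter(input_data)
    out = []
    while pc < len(program):
        ch = program[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256   # assumed 8-bit wraparound
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            tape[ptr] = ord(next(inp, "\0")) % 256  # end of input reads as 0
        elif ch == "[" and tape[ptr] == 0:
            pc = jump[pc]   # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jump[pc]   # go back to the start of the loop
        # Any other character is a comment and is ignored.
        pc += 1
    return "".join(out)


# 8 * 8 = 64, plus 1 gives 65, the character code of "A".
print(run_brainfuck("++++++++[>++++++++<-]>+."))  # prints "A"
```

Even printing a single letter takes a loop and pointer bookkeeping, which hints at why programs in such languages are rare online and hard to write.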

What these niche languages have in common is that very little has been written about them online, so there is almost nothing to train large models on. So what do you think the results were?

The experimental results can be summarized in one sentence: the large models performed terribly.

The average accuracy of these five large models was a mere 3.8%, meaning they answered fewer than 4 out of every 100 questions correctly. By comparison, their accuracy on Python problems reaches 90%.

More embarrassingly, the few questions they did answer correctly were all at the beginner level. At the higher difficulty levels (elementary, intermediate, and advanced), all five large models scored 0%.

This experiment shows that a large model's performance (its level of intelligence) is determined primarily by its training material. The more corpus is available, the better the model performs: Python material is everywhere, so large models are extremely good at solving Python problems. The less corpus is available, the worse the model performs, to the point where it is almost an idiot with little practical use.

So, a curious question arises: if a niche language has no corpus, but there is a very detailed "User Manual," could we teach a large model by having it learn from this manual? Would it then be able to program in this niche language?

MAI-Image-2

This week, Microsoft released its own image generation model MAI-Image-2.

This model produces very high-quality images; some reviews claim it is currently second only to Google's nano-banana-2.

Microsoft has opened a website called MAI Playground (shown below) where you can now generate images for free.

I tested it, and the image texture is indeed very good and highly realistic. For example, a dog riding a bicycle in the sea.

However, it has many usage restrictions: (1) Controversial or potentially offensive images are refused; (2) The free daily quota is 15 images, with a 30-second interval between generations; (3) It only generates images with a 1:1 aspect ratio, and other aspect ratios are not supported; (4) It offers no image editing or processing, only text-to-image.

If you need to generate high-quality images from text, you can try it.

Tech News

1. Playable Magazine Cover

Red Bull released a physical gaming magazine called "GamePop."

Its cover features a playable "Tetris" game—the world's first book with a playable game on its cover.

The secret is a very thin flexible circuit board embedded in the cover.

This board is equipped with 180 RGB LEDs, 7 capacitive touch buttons, and a 32-bit ARM chip.

It also includes a rechargeable battery that can be charged via USB Type-C.

Unfortunately, this cover is a limited edition and not for public sale. It is officially licensed by Tetris, and only 150 copies were released worldwide, each with its own unique number.

2. Paid Live Customer Service

Enterprises dislike providing live phone support because it is expensive; they would rather move customers to automated service.

HP came up with an idea to push users toward machine customer service.

When users call HP's customer service, a voice prompt tells them to visit the official website and find answers themselves. Anyone who insists on speaking to a live person has to wait on the line for 15 minutes.

Hanging up and calling again restarts the 15-minute wait. At the 5th, 10th, and 13th minutes, the system reminds callers that they can visit the website or get in touch by email.

Although this approach is terrible, it may become standard in the future: free support means only AI or automated service, and a live agent costs extra.

3. Frisbee Throwing Techniques

How should you throw a frisbee to make it both fast and far?

An American physicist ran experiments with dozens of students, who threw frisbees with different grips and at different angles. He measured flight speed and torque and published the results in a paper.

He found that placing the thumb about 3 centimeters from the outer edge of the frisbee yields the best results for both average spin rate and initial velocity.

He also discovered a linear correlation between spin rate and initial velocity—the higher the spin rate, the higher the initial velocity.

So, the next time you play frisbee, place your thumb correctly, put enough force into it, and throw it with a backhand motion to get the best results.

Articles

1. The Slow Collapse of MkDocs (English)

MkDocs is a well-known documentation site generator, but serious conflicts among its main contributors led to infighting and a fragmented project. This article reviews what happened.

2. Large Models Predicting Coffee Cooling (English)

The author asked various large models to provide formulas for coffee cooling time, then measured actual cooling times and created a ranking.
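For context, the textbook starting point for this kind of estimate is Newton's law of cooling, in which the temperature gap between the coffee and its surroundings decays exponentially. The sketch below is only an illustration under that assumption. The linked article's own formulas are not reproduced here, and the function names, the cooling constant `k_per_min`, and the sample temperatures are all made up for the example.

```python
import math


def temperature_at(t_min, start_c, ambient_c, k_per_min):
    """Newton's law of cooling: T(t) = T_env + (T0 - T_env) * exp(-k * t)."""
    return ambient_c + (start_c - ambient_c) * math.exp(-k_per_min * t_min)


def minutes_to_reach(target_c, start_c, ambient_c, k_per_min):
    """Invert the formula above to get the time needed to cool to target_c."""
    if not (ambient_c < target_c <= start_c):
        raise ValueError("target must lie between ambient and start temperature")
    return math.log((start_c - ambient_c) / (target_c - ambient_c)) / k_per_min


# Hypothetical numbers: 90 °C coffee in a 20 °C room, k = 0.05 per minute.
# The gap to room temperature halves (70 -> 35), so the time is ln(2) / k.
t = minutes_to_reach(55, 90, 20, 0.05)
print(f"{t:.1f} minutes")  # prints "13.9 minutes"
```

In practice the constant `k_per_min` depends on the cup, the volume, and the airflow, so it has to be fitted from measurements, which is presumably why comparing predictions with measured cooling times is informative.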

3. Your Next App Is Probably Headless (English)

If we all end up using smartphones through AI assistants, apps will no longer need a user interface (that is, they become headless) and will only need to expose data interfaces to the assistants.

4. A Method for Front-end Data Compression on the Web (English)

This article explains how to compress data on the front end by encoding it into an image using canvas (drawing surface).

5. Ruby Is the Best Language for Building AI Applications (English)

The author wrote an AI agent in three languages (Python, JavaScript, and Ruby) and, after comparing them, concluded that Ruby is the most convenient for writing AI applications.

6. Ancient Roman Concrete Architecture (English)

The Romans discovered concrete and learned how to use it to cast buildings. As a result, Roman buildings had the largest interior spaces of any ancient civilization and were extremely durable, surviving to this day.

Tools

1. proxychains-rs

A Rust implementation of proxychains4 that routes a specific process through a proxy chain. (contributed by @tianrking)

2. Flare Stack Blog

A blog system built on Cloudflare Workers that integrates services such as D1, R2, KV, and Workflows. (contributed by @du2333)

3. Tunelo

Exposes local services to the public internet with a single command. It needs only a single 4MB binary and uses the QUIC protocol. (contributed by @jiweiyuan)

4. ReadAny

An e-book reader with desktop and Android versions, featuring built-in AI functionality, read-aloud, and multi-device synchronization. (contributed by @codedogQBY)

5. RaTeX

A pure Rust implementation of a KaTeX-compatible math rendering engine that natively parses and typesets LaTeX formulas and supports various environments. (contributed by @erweixin)

6. Work Review

An open-source Win/Mac desktop application that runs in the background and records the applications used and websites visited each day, making it easy to compile a personal work log. (@w

📝 This article is from DouWen (www.douwen.me). Please retain the source when reposting.

💬 Comments (8)

Alex_Chen 2026-05-13 08:30

Absolutely spot on. I've been frustrated with LLM limitations for months, and this finally articulates why. Without quality training data, these models are just expensive pattern-matching machines. Great issue.

Sarah_Dev 2026-05-13 18:22

Question: does this mean smaller, well-trained models might outperform larger ones with mediocre training data? Would love to see a deep dive on that tradeoff.

Marcus_2024 2026-05-13 14:20

This is obvious though? Of course models need data. The real question is how we ensure the data quality and ethical sourcing going forward.

Jordan_K 2026-05-13 14:11

The title made me laugh out loud — "without corpus, large models are idiots" 😂 But seriously, this frames the problem perfectly. We're throwing computing power at garbage-in-garbage-out problems.

Dr. Patricia Wu 2026-05-13 17:48

Excellent framing. In my research group, we've found that curating training datasets takes 10x longer than building model architecture. The bottleneck is data, not compute. This newsletter gets it.

TechNewbie88 2026-05-13 20:56

Can someone explain what "corpus" means? Is it just the training data or is there more to it? Asking for a friend who's learning about AI.

RyanB_Notes 2026-05-13 10:33

Adding to this: the open source community's biggest contribution might not be the models themselves, but collaborative datasets. Seeing more initiatives like Common Crawl makes me optimistic.

Lisa_Skeptical 2026-05-13 17:03

Bold claim but I'm not entirely convinced. Haven't we seen fine-tuned smaller models punch way above their weight? The relationship between data quality and model size seems more nuanced than the title suggests, no?