---
title: "DeepSeek OCR paper ignites the internet! Andrej Karpathy: I really like it; Musk: 99% of the future is photons"
description: "The DeepSeek OCR paper has sparked heated discussions, with AI expert Andrej Karpathy expressing his strong appreciation for it, considering it an excellent OCR model. He explored the potential of pix"
type: "news"
locale: "en"
url: "https://longbridge.com/en/news/261984305.md"
published_at: "2025-10-21T05:55:39.000Z"
---

# DeepSeek OCR paper ignites the internet! Andrej Karpathy: I really like it; Musk: 99% of the future is photons

> The DeepSeek OCR paper has sparked heated discussions, with AI expert Andrej Karpathy expressing his strong appreciation for it, considering it an excellent OCR model. He explored the potential of pixels as input for LLMs, arguing that pixels may be superior to text, and presented four main reasons in support: higher information compression efficiency, greater versatility, the advantages of bidirectional attention, and his criticism of tokenizers.

Just now, AI guru Andrej Karpathy expressed his strong liking for the DeepSeek-OCR paper:

> I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe slightly worse than dots), and yes, data collection and so on, but that doesn't matter anyway. The more interesting part for me (as a computer-vision person at heart, temporarily masquerading as a natural-language person) is whether pixels are better inputs to LLMs than text. Are text tokens wasteful and bad as input? Still unsure about the situation; check out my article from yesterday: DeepSeek's bombshell: 10x compression rate, 97% decoding accuracy! Contextual optical compression makes its debut.

Karpathy believes that, aside from the model itself, the DeepSeek paper raises a more thought-provoking question: is pixel input superior to text input for LLMs? Are text tokens both wasteful and poor?
He further speculates that perhaps all LLM inputs should be images: even pure text content should be rendered as an image before being fed to the model. Karpathy gave four core reasons supporting this idea:

**1\. Higher information compression efficiency**

Rendering text as images achieves higher information compression, which means shorter context windows and higher operational efficiency.

**2\. A more universal information stream**

Pixels are a far more universal information stream than text. They can represent plain text, easily capture bold and colored text, and even arbitrary charts and photos.

**3\. Powerful bidirectional attention by default**

Pixel input can naturally and easily be processed with bidirectional attention by default, which is more powerful than autoregressive attention.

**4\. Complete elimination of the tokenizer**

Karpathy does not hide his disdain for tokenizers. In his view, the tokenizer is an ugly, separate, non-end-to-end stage. It imports all the ugliness of Unicode and byte encodings, inherits a great deal of historical baggage, and introduces security and jailbreak risks (e.g., continuation bytes).

He gave an example: a tokenizer can cause two characters that look completely identical to the human eye to be represented internally as two entirely different tokens. A smiley emoji, as the model sees it, is just a strange token rather than an actual smiley made of pixels, which prevents the model from leveraging the transfer-learning advantages its visual appearance would bring. The tokenizer must disappear, he emphasized.

Karpathy summarized that OCR is just one of many vision-to-text tasks, and that traditional text-to-text tasks can be entirely recast as vision-to-text tasks, but not the reverse. The future interaction model he envisions: the user's input (the message) is an image, while the decoder's output (the assistant's response) remains text.
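Karpathy's third point contrasts bidirectional attention with autoregressive (causal) attention, and the difference comes down to the attention mask. Here is a minimal NumPy sketch (illustrative only, not any particular model's implementation): with a causal mask, each position attends only to itself and earlier positions, while bidirectional attention lets every position see the whole sequence.

```python
import numpy as np

def attention_weights(scores: np.ndarray, causal: bool) -> np.ndarray:
    """Softmax over attention scores, optionally with a causal mask."""
    s = scores.astype(float).copy()
    if causal:
        # Mask out future positions: position i may only attend to 0..i.
        n = s.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        s[future] = -np.inf
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                    # uniform scores, 4-token sequence
bi = attention_weights(scores, causal=False)  # every row attends everywhere
ar = attention_weights(scores, causal=True)   # row i attends only to 0..i
```

With uniform scores, every bidirectional row is spread evenly over all four positions, while the causal rows put zero weight on future positions, which is exactly the constraint autoregressive training imposes on text tokens.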
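His homoglyph example is easy to reproduce. The Latin letter "a" and the Cyrillic letter "а" render identically in most fonts, yet they are distinct Unicode code points, so any text tokenizer necessarily assigns them different tokens, while a pixel-based model would see (nearly) the same glyph. A quick check in plain Python:

```python
# Latin "a" (U+0061) vs Cyrillic "а" (U+0430): visually identical in most
# fonts, but different code points, hence different tokens for any text
# tokenizer, even though their rendered pixels look the same.
latin, cyrillic = "a", "\u0430"
print(latin == cyrillic)                     # False
print(hex(ord(latin)), hex(ord(cyrillic)))   # 0x61 0x430
```

The same mismatch underlies homoglyph spoofing attacks on domain names, which is part of the security angle Karpathy alludes to.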
It is currently unclear how to realistically output pixels, or whether doing so is even necessary.

## Core Controversy: Bidirectional Attention and Image Patching

Regarding Karpathy's viewpoint, AI scholar Yoav Goldberg raised two questions:

1. Why is it said that images can easily get bidirectional attention while text cannot?
2. Tokenization aside, isn't splitting the input image into patches a similar, and possibly uglier, processing step?

Karpathy responded that, in principle, nothing prevents text from using bidirectional attention; it is for efficiency that text is usually trained autoregressively. He envisions adding a fine-tuning stage mid-training to handle conditioning information (such as the user's input message, since those tokens never need to be generated by the model), though he is unsure whether anyone has done this in practice. In theory, the entire context window could be bidirectionally encoded just to predict the next token, but that would prevent parallelized training.

Finally, he added that perhaps this aspect (bidirectional attention) is not strictly a fundamental difference between pixels and tokens; rather, pixels are usually encoded, while tokens are typically decoded (borrowing terminology from the original Transformer paper).

## Musk: 99% of the Future is Photons

At the end of this discussion, Elon Musk appeared in the comments section with a more futuristic judgment:

> In the long run, over 99% of the input and output of AI models will be photons. Nothing else can be scaled.

Musk's comment was not made lightly. He followed it up with a hardcore cosmological explanation of why he believes photons are the ultimate scaling medium. In simple terms, the vast majority of particles in the universe are photons, and the main source of these photons is the Cosmic Microwave Background (CMB).
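The magnitude behind that claim, detailed in the next paragraph, is easy to sanity-check: multiplying the standard CMB photon density by the comoving volume of the observable universe reproduces the quoted figure of roughly 1.5 × 10⁸⁹ photons.

```python
import math

# Back-of-envelope check of the figures quoted in the article:
# ~410 CMB photons per cubic centimetre, observable-universe radius
# ~46.5 billion light-years (1 light-year ≈ 9.461e17 cm).
density = 410.0                      # photons per cm^3
radius_cm = 46.5e9 * 9.461e17        # comoving radius in cm
volume_cm3 = (4.0 / 3.0) * math.pi * radius_cm**3
cmb_photons = density * volume_cm3
print(f"{cmb_photons:.1e}")          # ~1.5e+89
```

This counts only CMB photons; as the article notes, starlight and other sources add comparatively little to the total.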
According to the calculation, the photon density of the CMB is about 410 photons per cubic centimeter. Multiplying this density by the vast volume of the observable universe (radius about 46.5 billion light-years) yields an astonishing number of photons contributed by the CMB alone: approximately 1.5 × 10⁸⁹. By contrast, the photons emitted by all stars (starlight) and other sources (the neutrino background, black-hole radiation, and so on) are negligible.

The underlying physical fact is that photons enjoy an unmatched advantage in sheer magnitude, which may be the fundamental logic behind Musk's belief that the future input and output of AI will be dominated by photons.

AI Cambrian, original title: "DeepSeek OCR Paper Goes Viral! Andrej Karpathy: I really like it; Musk: 99% of the future will be photons."

Risk Warning and Disclaimer

Markets carry risk; invest with caution. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article suit their particular circumstances. Any investment made on this basis is at their own risk.

### Related Stocks

- [DPSK.NA - DeepSeek](https://longbridge.com/en/quote/DPSK.NA.md)

## Related News & Research

| Title | Description | URL |
|-------|-------------|-----|
| DeepSeek gray-tests next-generation model; Nomura: lower training and inference costs may ease profitability pressure | DeepSeek is gray-testing its next-generation model, with the V4 model expected in mid-month. The model brings significant gains in context length and core capabilities; Nomura believes V4 will drive the commercialization of AI applications through underlying architectural innovation rather than disrupting the existing value chain. V4's release is expected to significantly lower training and inference… | [Link](https://longbridge.com/en/news/275723457.md) |
| High-Flyer Quant gained 57% last year, ranking second among China's 10-billion-yuan-scale quant funds! | High-Flyer Quant ranked second among China's 10-billion-yuan-scale quant funds in 2025 with an average return of 56.6%, a strong performance rooted in its successful 2024 pivot to a pure long-only strategy. Its revenue is estimated to exceed USD 700 million, giving DeepSeek, the AI company founded and incubated by High-Flyer founder Liang Wenfeng, a… | [Link](https://longbridge.com/en/news/272259241.md) |
| SSY Group (2005) issues profit warning, expects last year's profit to fall 45–60% | SSY Group issued a profit warning, expecting profit attributable to shareholders for the year ended last December to fall 45–60% year-on-year; it recorded profit attributable to shareholders of HKD 1.061 billion in FY2024. | [Link](https://longbridge.com/en/news/276431030.md) |
| DeepSeek paper proposes new framework to cut AI training energy needs; reportedly debuting as early as Lunar New Year | DeepSeek released a new framework, "manifold-constrained hyper-connections," aimed at improving the scalability of AI systems and reducing the compute and energy required for training. The paper, co-authored by founder Liang Wenfeng and 18 researchers, has been published on arXiv and Hugging Face. The new architecture, through strict infrastructure opti… | [Link](https://longbridge.com/en/news/271295221.md) |
| OpenAI warns Congress about DeepSeek's distillation tactics | OpenAI has warned US lawmakers about its Chinese competitor DeepSeek, saying the company may be using advanced distillation techniques to extract outputs from US AI systems, raising competition and national-security concerns. In a memo to the House Select Committee on China, OpenAI… | [Link](https://longbridge.com/en/news/275935776.md) |

---

> **Disclaimer**: This article is for reference only and does not constitute any investment advice.