Stuff Goes Bad: Erlang in Anger

Posted by jopen 9 years ago | 65K views | Tags: Erlang development, Erlang

Stuff Goes Bad: Erlang in Anger, by Fred Hébert and Heroku, is a dictionary of code snippets and practices that help developers debug production systems built in Erlang.

Introduction

On Running Software

There’s something rather unique in Erlang in how it approaches failure compared to most other programming languages. There’s this common way of thinking where the language, programming environment, and methodology do everything possible to prevent errors. Something going wrong at run-time is something that needs to be prevented, and if it cannot be prevented, then it’s out of scope for whatever solution people have been thinking about.
The program is written once, and after that, it’s off to production, whatever may happen there. If there are errors, new versions will need to be shipped.
Erlang, on the other hand, takes the approach that failures will happen no matter what, whether they’re developer-, operator-, or hardware-related. It is rarely practical or even possible to get rid of all errors in a program or a system [1]. If you can deal with some errors rather than preventing them at all cost, then most undefined behaviours of a program can go in that "deal with it" approach.

This is where the "Let it Crash" [2] idea comes from: because you can now deal with failure, and because the cost of weeding out all of the complex bugs from a system before it hits production is often prohibitive, programmers should only deal with the errors they know how to handle, and leave the rest for another process (a supervisor) or the virtual machine to deal with.
Given that most bugs are transient [3], simply restarting processes back to a state known to be stable when encountering an error can be a surprisingly good strategy.
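To make the restart idea concrete, here is a minimal sketch of a supervisor bringing a crashed worker back to a known-stable state. It is only an illustration and does not come from the book: the module name `crash_demo`, the registered name `counter`, and the `bump/0`, `value/0`, and `crash/0` helpers are all hypothetical. The worker is a plain `proc_lib` process to keep the example short; a real system would more likely use a `gen_server`.

```erlang
-module(crash_demo).
-behaviour(supervisor).

%% All names in this module are hypothetical, chosen for illustration only.
-export([start_link/0, init/1]).
-export([start_worker/0, worker_loop/1, bump/0, value/0, crash/0]).

%% Start the supervisor, which in turn starts the worker.
start_link() ->
    supervisor:start_link({local, crash_demo_sup}, ?MODULE, []).

%% Supervisor callback: restart the worker whenever it dies,
%% giving up after more than 5 restarts within 10 seconds.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Child = #{id => counter,
              start => {?MODULE, start_worker, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

%% Every (re)start begins from the known-stable state: a counter at 0.
start_worker() ->
    Pid = proc_lib:spawn_link(?MODULE, worker_loop, [0]),
    true = register(counter, Pid),
    {ok, Pid}.

worker_loop(N) ->
    receive
        bump ->
            worker_loop(N + 1);
        {value, From, Ref} ->
            From ! {Ref, N},
            worker_loop(N);
        crash ->
            %% Stand-in for an unexpected bug: no local handling, just die.
            exit(simulated_bug)
    end.

%% Client helpers. Sending to `counter` raises badarg during the short
%% window in which the worker is dead and not yet restarted.
bump() -> counter ! bump, ok.

crash() -> counter ! crash, ok.

value() ->
    Ref = make_ref(),
    counter ! {value, self(), Ref},
    receive
        {Ref, N} -> N
    after 1000 ->
        timeout
    end.
```

After `crash_demo:start_link()`, two calls to `crash_demo:bump()` make `crash_demo:value()` return 2. Calling `crash_demo:crash()` kills the worker, and once the supervisor has restarted it a moment later, `crash_demo:value()` returns 0 again. The worker contains no error-handling code of its own; recovery is entirely the supervisor's job.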
Erlang is a programming environment where the approach taken is equivalent to the human body’s immune system, whereas most other languages only care about hygiene to make sure no germ enters the body. Both forms appear extremely important to me. Almost every environment offers varying degrees of hygiene. Nearly no other environment offers the immune system where errors at run time can be dealt with and seen as survivable.
Because the system doesn’t collapse the first time something bad touches it, Erlang/OTP also allows you to be a doctor. You can go in the system, pry it open right there in production, carefully observe everything inside as it runs, and even try to fix it interactively. To continue with the analogy, Erlang allows you to perform extensive tests to diagnose the problem and various degrees of surgery (even very invasive surgery), without the patients needing to sit down or interrupt their daily activities.
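As a rough illustration of what "observing everything inside as it runs" can look like, the sketch below gathers a few vital signs from a live node using only built-in functions from the `erlang` module. The module name `node_probe` and its two functions are hypothetical, not taken from the book, which relies on the recon library for production-safe versions of this kind of inspection.

```erlang
-module(node_probe).

%% Hypothetical helper module, for illustration only.
-export([snapshot/0, longest_mailboxes/1]).

%% A handful of node-wide metrics, all read from the running VM.
snapshot() ->
    #{process_count  => erlang:system_info(process_count),
      run_queue      => erlang:statistics(run_queue),
      total_memory   => erlang:memory(total),
      process_memory => erlang:memory(processes)}.

%% The N processes with the longest message queues, as {Length, Pid}
%% pairs, longest first. Processes that die while being inspected
%% return `undefined` from process_info/2 and are skipped.
longest_mailboxes(N) ->
    Lens = [{Len, Pid}
            || Pid <- erlang:processes(),
               {message_queue_len, Len} <-
                   [erlang:process_info(Pid, message_queue_len)]],
    lists:sublist(lists:reverse(lists:sort(Lens)), N).
```

Both functions can be called from a shell attached to the node, for example `node_probe:snapshot().` or `node_probe:longest_mailboxes(5).`. Walking every process is an assumption that suits small nodes; on a node with a very large number of processes it can be expensive, which is one reason to prefer a library written with production use in mind.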
This book intends to be a little guide about how to be the Erlang medic in a time of war. It is first and foremost a collection of tips and tricks to help understand where failures come from, and a dictionary of different code snippets and practices that helped developers debug production systems that were built in Erlang.

Who is this for?

This book is not for beginners. There is a gap left between most tutorials, books, training sessions, and actually being able to operate, diagnose, and debug running systems once they’ve made it to production. There’s a fumbling phase implicit to a programmer’s learning of a new language and environment where they just have to figure out how to get out of the guidelines and step into the real world, with the community that goes with it.
This book assumes that the reader is proficient in basic Erlang and the OTP framework. Erlang/OTP features are explained as I see fit (usually when I consider them tricky), and it is expected that a reader who feels confused by usual Erlang/OTP material will have an idea of where to look for explanations if necessary [4]. What is not necessarily assumed is that the reader knows how to debug Erlang software, dive into an existing code base, diagnose issues, or has an idea of the best practices about deploying Erlang in a production environment [5].

How To Read This Book

This book is divided into two parts. Part I focuses on how to write applications. It includes how to dive into a code base (Chapter 1), general tips on writing open source Erlang software (Chapter 2), and how to plan for overload in your system design (Chapter 3).
Part II focuses on being an Erlang medic and concerns existing, living systems. It contains instructions on how to connect to a running node (Chapter 4), and the basic runtime metrics available (Chapter 5). It also explains how to perform a system autopsy using a crash dump (Chapter 6), how to identify and fix memory leaks (Chapter 7), and how to find runaway CPU usage (Chapter 8). The final chapter contains instructions on how to trace Erlang function calls in production using recon [6] to understand issues before they bring the system down (Chapter 9).
Each chapter is followed up by a few optional exercises in the form of questions or hands-on things to try if you feel like making sure you understood everything, or if you want to push things further.

[1]. Life-critical systems are usually excluded from this category.
[2]. Erlang people now seem to favour "let it fail", given it makes people far less nervous.
[3]. 131 out of 132 bugs are transient bugs (they’re non-deterministic and go away when you look at them, and trying again may solve the problem entirely), according to Jim Gray in Why Do Computers Stop and What Can Be Done About It?
[4]. I do recommend visiting Learn You Some Erlang or the regular Erlang Documentation if a free resource is required.
[5]. Running Erlang in a screen or tmux session is not a deployment strategy.
[6]. http://ferd.github.io/recon/, a library used to make the text lighter, and with generally production-safe functions.

Project homepage: http://www.baiduhome.net/lib/view/home/1419994976000
