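Go - Scheduler
These notes are a little disorganized: they mostly follow the order in which I read the code. This is not a book, so I will leave them as they are.
PS: I don't recommend that newcomers to Golang start by reading the scheduler code; in my opinion this part of the code is quite messy.
Scheduling
Basic data structures
The data structure for a goroutine in the runtime:
// stack describes a Go execution stack; its bounds are [lo, hi)
// In terms of the traditional memory layout, Go stacks are actually allocated in what C would call the heap area
// which is why they can be larger than the stack size set by ulimit -s (up to 1GB)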
type stack struct {
lo uintptr
hi uintptr
}
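// gobuf holds the saved execution context of a g
type gobuf struct {
sp uintptr // sp register
pc uintptr // pc register
g guintptr // pointer to the g
ctxt unsafe.Pointer // context pointer (for example a closure's funcval); it needs special handling by the GC
ret sys.Uintreg
lr uintptr // link register, used on ARM; can be ignored here
bp uintptr // only used when GOEXPERIMENT=framepointer is enabled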
}
type g struct {
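// A simple structure; its lo and hi members describe the lower and upper bounds of the stack memory
stack stack
// The stack-growth prologue of a function compares the sp register with stackguard0
// If sp is smaller than stackguard0 (the stack grows toward lower addresses), stack copying and scheduling are triggered
// Normally stackguard0 = stack.lo + StackGuard
// When the goroutine needs to be rescheduled, stackguard0 is set to StackPreempt
// to trigger a preemption
stackguard0 uintptr
// stackguard1 is the value compared against in the C stack-growth prologue
// On the g0 and gsignal stacks its value is stack.lo+StackGuard
// On other stacks it is ~0 (bitwise NOT of 0), which triggers a call to morestackc (and a crash)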
stackguard1 uintptr
_panic *_panic
_defer *_defer
m *m // the m this g is currently bound to
sched gobuf // the goroutine's saved context
syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc
syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc
stktopsp uintptr // expected sp at top of stack, to check in traceback
param unsafe.Pointer // argument passed in on wakeup
atomicstatus uint32
stackLock uint32 // sigprof/scang lock; TODO: fold in to atomicstatus
goid int64 // goroutine id
waitsince int64 // approximate time at which the g became blocked
waitreason string // if status==Gwaiting
schedlink guintptr
preempt bool // preemption flag; when it is true, stackguard0 equals stackPreempt
throwsplit bool // must not split stack
raceignore int8 // ignore race detection events
sysblocktraced bool // StartTrace has emitted EvGoInSyscall about this goroutine
sysexitticks int64 // cputicks when the syscall returned, used for tracing
traceseq uint64 // trace event sequencer
tracelastp puintptr // last P emitted an event for this goroutine
lockedm muintptr // if LockOSThread has been called, this g is bound to a particular m
sig uint32
writebuf []byte
sigcode0 uintptr
sigcode1 uintptr
sigpc uintptr
gopc uintptr // pc of the go statement that created this goroutine
startpc uintptr // pc of the goroutine function
racectx uintptr
waiting *sudog // sudog structures this g is waiting on (that have a valid elem ptr); in lock order
cgoCtxt []uintptr // cgo traceback context
labels unsafe.Pointer // profiler labels
timer *timer // cached timer for time.Sleep
selectDone uint32 // whether this g is taking part in a select and whether someone has already won the race
}
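When a g blocks or has to wait for something, it is wrapped in a sudog structure. A single g may be wrapped in several sudogs, each placed on a different wait queue:
// sudog represents a g in a wait list, for example when sending to or receiving from a channel
// sudog is needed because the relationship between g's and synchronization objects is many-to-many
// A g can be on several wait queues, so one g may be wrapped in multiple sudogs
// Several g's can also wait on the same synchronization object
// so a single synchronization object can have many sudogs
// sudogs are allocated from a special pool; use acquireSudog and releaseSudog to allocate and free them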
type sudog struct {
// The following fields are protected by the hchan.lock of the channel this sudog is blocked on
// shrinkstack depends on
// this for sudogs involved in channel ops.
g *g
// isSelect indicates that the g is taking part in a select
// so g.selectDone must be updated with CAS to win the wake-up race
isSelect bool
next *sudog
prev *sudog
elem unsafe.Pointer // data element (may point to stack)
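// The following fields are never accessed concurrently
// For channels, waitlink is only accessed by g
// For semaphores, all fields (including the ones above) are only accessed while holding a semaRoot lock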
acquiretime int64
releasetime int64
ticket uint32
parent *sudog // semaRoot binary tree
waitlink *sudog // g.waiting list or semaRoot
waittail *sudog // semaRoot
c *hchan // channel
}
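The many-to-many relationship is easiest to see with select. In the minimal sketch below, the goroutine calling the hypothetical helper waitEither blocks on two channels at once; based on the description above, it would be queued on each channel through its own sudog while it waits. The helper name and the channels are assumptions made for illustration only.

```go
package main

import "fmt"

// waitEither blocks until a value arrives on either channel and reports
// which channel delivered it.
func waitEither(a, b <-chan int) (string, int) {
	select {
	case v := <-a:
		return "a", v
	case v := <-b:
		return "b", v
	}
}

func main() {
	a := make(chan int)
	b := make(chan int)
	go func() { b <- 42 }() // only b ever becomes ready
	name, v := waitEither(a, b)
	fmt.Println(name, v)
}
```

Whichever channel becomes ready first wins the race; the g is then removed from the wait queue of the other channel.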
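The runtime's representation of a thread. Each m corresponds to one OS thread (comparable to a pthread), which in turn maps to exactly one kernel thread (task_struct):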
type m struct {
g0 *g // the goroutine used to run scheduling code
morebuf gobuf // gobuf arg to morestack
divmod uint32 // div/mod denominator for arm - known to liblink
// Fields not known to debuggers.
procid uint64 // for debuggers, but offset not hard-coded
gsignal *g // signal-handling g
goSigStack gsignalStack // Go-allocated signal handling stack
sigmask sigset // storage for saved signal mask
tls [6]uintptr // thread-local storage (for x86 extern register)
mstartfn func()
curg *g // the user goroutine currently running
caughtsig guintptr // goroutine running during fatal signal
p puintptr // attached p for executing go code (nil if not executing go code)
nextp puintptr
id int64
mallocing int32
throwing int32
preemptoff string // if this is not the empty string, keep curg running on this m
locks int32
softfloat int32
dying int32
profilehz int32
helpgc int32
spinning bool // m is out of work and is actively looking for work
blocked bool // m is blocked on a note
inwb bool // m is executing a write barrier
newSigstack bool // minit on C thread called sigaltstack
printlock int8
incgo bool // m is executing a cgo call
freeWait uint32 // if == 0, safe to free g0 and delete m (atomic)
fastrand [2]uint32
needextram bool
traceback uint8
ncgocall uint64 // total number of cgo calls
ncgo int32 // number of cgo calls currently in progress
cgoCallersUse uint32 // if non-zero, cgoCallers in use temporarily
cgoCallers *cgoCallers // cgo traceback if crashing in cgo call
park note
alllink *m // on allm
schedlink muintptr
mcache *mcache
lockedg guintptr
createstack [32]uintptr // stack that created this thread.
freglo [16]uint32 // d[i] lsb and f[i]
freghi [16]uint32 // d[i] msb and f[i+16]
fflag uint32 // floating point compare flags
lockedExt uint32 // tracking for external LockOSThread
lockedInt uint32 // tracking for internal lockOSThread
nextwaitm muintptr // next m waiting for the lock
waitunlockf unsafe.Pointer // todo go func(*g, unsafe.pointer) bool
waitlock unsafe.Pointer
waittraceev byte
waittraceskip int
startingtrace bool
syscalltick uint32
thread uintptr // thread handle
freelink *m // on sched.freem
// these are here because they are too large to be on the stack
// of low-level NOSPLIT functions.
libcall libcall
libcallpc uintptr // for cpu profiler
libcallsp uintptr
libcallg guintptr
syscall libcall // stores syscall parameters on windows
mOS
}
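p is an abstract data structure that can be thought of as an abstraction of a processor. It represents the context in which work is executed; an m must acquire a p before it can run Go code: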
type p struct {
lock mutex
id int32
status uint32 // one of pidle/prunning/...
link puintptr
schedtick uint32 // incremented on every scheduler call
syscalltick uint32 // incremented on every system call
sysmontick sysmontick // the last tick observed by sysmon
m muintptr // back-link to the associated m (nil if the p is idle)
mcache *mcache
racectx uintptr
deferpool [5][]*_defer // pool of available defer structs of different sizes (see panic.go)
deferpoolbuf [5][32]*_defer
// Cache of goroutine ids, amortizes accesses to runtime·sched.goidgen.
goidcache uint64
goidcacheend uint64
// Queue of runnable goroutines. Accessed without a lock
runqhead uint32
runqtail uint32
runq [256]guintptr
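// runnext, if non-nil, is a runnable G
// that was readied by the current G
// and has a higher priority than the G's in runq
// If the current G still has time left in its time slice, this G should run next
// Once it runs, it inherits the time remaining in the current G's time slice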
// If a set of goroutines is locked in a
// communicate-and-wait pattern, this schedules that set as a
// unit and eliminates the (potentially large) scheduling
// latency that otherwise arises from adding the ready'd
// goroutines to the end of the run queue.
runnext guintptr
// Available G's (status == Gdead)
gfree *g
gfreecnt int32
sudogcache []*sudog
sudogbuf [128]*sudog
tracebuf traceBufPtr
// traceSweep indicates the sweep events should be traced.
// This is used to defer the sweep start event until a span
// has actually been swept.
traceSweep bool
// traceSwept and traceReclaimed track the number of bytes
// swept and reclaimed by sweeping in the current sweep loop.
traceSwept, traceReclaimed uintptr
palloc persistentAlloc // per-P to avoid mutex
// Per-P GC state
gcAssistTime int64 // Nanoseconds in assistAlloc
gcFractionalMarkTime int64 // Nanoseconds in fractional mark worker
gcBgMarkWorker guintptr
gcMarkWorkerMode gcMarkWorkerMode
// Start time of the current mark worker, in nanoseconds
gcMarkWorkerStartTime int64
// gcw is this P's GC work buffer cache. The work buffer is
// filled by write barriers, drained by mutator assists, and
// disposed on certain GC state transitions.
gcw gcWork
// wbBuf is this P's GC write barrier buffer.
//
// TODO: Consider caching this in the running G.
wbBuf wbBuf
runSafePointFn uint32 // if 1, run sched.safePointFn at next safe point
pad [sys.CacheLineSize]byte
}
The global scheduler. There is only one instance of type schedt in the whole program:
type schedt struct {
// The two fields below are accessed atomically. They are kept at the top of the struct so that they are aligned on 32-bit systems
goidgen uint64
lastpoll uint64
lock mutex
// When changing nmidle, nmidlelocked, nmsys or nmfreed,
// be sure to call checkdead
midle muintptr // idle m's waiting for work
nmidle int32 // number of idle m's waiting for work
nmidlelocked int32 // number of locked m's waiting for work
mnext int64 // number of m's that have been created, also used as the ID of the next m
maxmcount int32 // maximum number of m's allowed
nmsys int32 // number of system m's not counted for deadlock
nmfreed int64 // cumulative number of freed m's
ngsys uint32 // number of system goroutines; updated atomically
pidle puintptr // idle p's
npidle uint32
nmspinning uint32 // See "Worker thread parking/unparking" comment in proc.go.
// Global queue of runnable g's
runqhead guintptr
runqtail guintptr
runqsize int32
// Global cache of dead G's
gflock mutex
gfreeStack *g
gfreeNoStack *g
ngfree int32
// Central cache of sudog structs
sudoglock mutex
sudogcache *sudog
// Central pool of available defer structs of different sizes
deferlock mutex
deferpool [5]*_defer
// freem is the list of m's that are waiting to be freed once their m.exited flag has been set
// The list is linked through m.freelink
freem *m
gcwaiting uint32 // gc is waiting to run
stopwait int32
stopnote note
sysmonwait uint32
sysmonnote note
// safepointFn should be called on each P at the next GC
// safepoint if p.runSafePointFn is set.
safePointFn func(*p)
safePointWait int32
safePointNote note
profilehz int32 // cpu profiling rate
procresizetime int64 // time in nanoseconds of the last change to gomaxprocs
totaltime int64 // ∫gomaxprocs dt up to procresizetime
}
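The relationship between g, p and m
Go implements the so-called M:N model, and the goroutines that run user code can all be treated as peers. Leaving g0 and gsignal aside, scheduling can be summarized as binding an m to a p and then, on that m, looping over the scheduling function (runtime.schedule) to find a runnable g and execute it. The diagram below shows where those g's can come from once an m is bound to a p: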
+--------------+
| binded +-------------------------------------+
+-------+------+ |
+------------------------------------+ | v +------------------------------------+
| | | +------------------------------------+ | |
| +------------------+ | | | | | +------------------+ |
| | Local Run Queue | | | | +------------------+ | | | Global Run Queue | |
| other P +-+-+-+-+-+-+-+-+--+ | | | | Local Run Queue | | | schedt +--+-+-+-+-+-+-+---+ |
| |G|G|G|G|G|G|G| | | | P +-+-+-+-+-+-+-+-+--+ | | |G|G|G|G|G|G| |
| +-+-+-+-+-+-+-+ | | | |G|G|G|G|G|G|G| | | +-+-+-+-+-+-+ |
| ^ | | | +-+-+-+-+-+-+-+ | | ^ |
+----------------+-------------------+ | | ^ | +----------------+-------------------+
| | +----------------+-------------------+ |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| v | |
+------+-------+ .-. +----------------+ | |
| steal +----------------------------( M )-----+ runqget +-----------------+ |
+--------------+ `-' +----------------+ |
| |
| +-----------+-----+
+---------------------------------------------------------------------------+ globrunqget |
| +-----------------+
|
|
|
|
|
|
+----------+--------+
| get netpoll g |
+----------+--------+
|
|
|
|
|
+--------------+--------------------+
| | |
| | |
| netpoll v |
| +-+-+-+-+ |
| |G|G|G|G| |
| +-+-+-+-+ |
| |
+-----------------------------------+
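This diagram shows the rough relationship between g, p and m. The m is the executing entity and corresponds to an operating system thread. It obtains runnable g's from the local queue of the p it is bound to, from the global queue in sched, and from netpoll; if it still finds nothing, it steals from other p's.
How p is initialized
At program startup the following calls are made in order: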
graph TD
runtime.schedinit --> runtime.procresize
procresize initializes the global p array, links the p's into a list and puts them on the pidle list of the global scheduler sched:
for i := nprocs - 1; i >= 0; i-- {
p := allp[i]
// ...
// set the status of p
p.status = _Pidle
// at initialization every p's runq is empty, so this branch is always taken
if runqempty(p) {
// put p on the pidle list of the global scheduler
pidleput(p)
} else {
// ...
}
}
pidleput is also quite simple and needs little explanation:
func pidleput(_p_ *p) {
if !runqempty(_p_) {
throw("pidleput: P has non-empty run queue")
}
// plain linked-list operation
_p_.link = sched.pidle
sched.pidle.set(_p_)
// pidle count + 1
atomic.Xadd(&sched.npidle, 1)
}
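The idle list above is an intrusive, singly linked LIFO list: the link lives inside each p. A minimal sketch of the same idea follows; the types proc and idleList and their methods are hypothetical stand-ins for p, sched.pidle and sched.npidle, and the sketch ignores the locking and atomics the runtime needs.

```go
package main

import "fmt"

// proc is a hypothetical stand-in for the runtime's p.
type proc struct {
	id   int
	link *proc // next idle proc, plays the role of p.link
}

// idleList is a hypothetical stand-in for sched.pidle and sched.npidle.
type idleList struct {
	head  *proc
	count int
}

// put pushes p onto the front of the list, like pidleput.
func (l *idleList) put(p *proc) {
	p.link = l.head
	l.head = p
	l.count++
}

// get pops the most recently added proc, or returns nil if the list is empty.
func (l *idleList) get() *proc {
	p := l.head
	if p != nil {
		l.head = p.link
		p.link = nil
		l.count--
	}
	return p
}

func main() {
	var idle idleList
	// Same iteration order as the loop in procresize: highest index first.
	for i := 3; i >= 0; i-- {
		idle.put(&proc{id: i})
	}
	n := idle.count
	first := idle.get()
	fmt.Println(n, first.id)
}
```

Because entries are pushed from the highest index down, the entry with the lowest id ends up at the head of the list and is handed out first.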
All p's are initialized at program startup; their number only changes afterwards if runtime.GOMAXPROCS is called manually.
func GOMAXPROCS(n int) int {
lock(&sched.lock)
ret := int(gomaxprocs)
unlock(&sched.lock)
if n <= 0 || n == ret {
return ret
}
stopTheWorld("GOMAXPROCS")
// newprocs will be processed by startTheWorld
newprocs = int32(n)
startTheWorld()
return ret
}
startTheWorld calls procresize.
How a g is created
In user code one usually writes:
go func() {
// do the stuff
}()
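This is actually translated into a call to runtime.newproc; the special go syntax is only syntactic sugar. If you want to implement something similar in another language, you only need to implement what the compiler emits after translation. The flow is: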
graph TD
runtime.newproc --> runtime.newproc1
newproc itself does not do much:
func newproc(siz int32, fn *funcval) {
// add is pointer arithmetic that skips over the function pointer
// to find the start address of the arguments on the stack
argp := add(unsafe.Pointer(&fn), sys.PtrSize)
pc := getcallerpc()
systemstack(func() {
newproc1(fn, (*uint8)(argp), siz, pc)
})
}
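// funcval is a variable-size structure whose first member is the function pointer
// the add above skips over the fn argument to reach what follows it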
type funcval struct {
fn uintptr
// variable-size, fn-specific data here
}
getcallerpc and getcallersp appear frequently in the runtime; the comment in the source explains them clearly:
// For example:
//
// func f(arg1, arg2, arg3 int) {
// pc := getcallerpc()
// sp := getcallersp(unsafe.Pointer(&arg1))
//}
//
// These two lines find the PC and SP immediately following
// the call to f (where f will return).
//
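getcallerpc returns the address of the instruction that follows the call, that is, the instruction that runs next once the callee returns.
systemstack is also used a lot in the runtime. It makes the m switch to g0 to run scheduling-related functions. What g0 is will be covered in the section on m.
The flow of newproc1 is also fairly simple: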
graph TD
newproc1 --> newg
newg[gfget] --> nil{is nil?}
nil -->|yes|C[malg]
nil -->|no|E[init stack]
C --> D[set g status=> idle->dead]
D --> allgadd
E --> G[set g status=> dead-> runnable]
allgadd --> G
G --> runqput
The code with the uninteresting details removed:
func newproc1(fn *funcval, argp *uint8, narg int32, callerpc uintptr) {
_g_ := getg()
if fn == nil {
_g_.m.throwing = -1 // do not dump full stacks
throw("go of nil func value")
}
_g_.m.locks++ // disable preemption because it can be holding p in a local var
siz := narg
siz = (siz + 7) &^ 7
_p_ := _g_.m.p.ptr()
newg := gfget(_p_)
if newg == nil {
newg = malg(_StackMin)
casgstatus(newg, _Gidle, _Gdead)
allgadd(newg) // publishes with a g->status of Gdead so GC scanner doesn't look at uninitialized stack.
}
totalSize := 4*sys.RegSize + uintptr(siz) + sys.MinFrameSize // extra space in case of reads slightly beyond frame
totalSize += -totalSize & (sys.SpAlign - 1) // align to spAlign
sp := newg.stack.hi - totalSize
spArg := sp
// initialize the g and its gobuf context
// including the various saved registers
memclrNoHeapPointers(unsafe.Pointer(&newg.sched), unsafe.Sizeof(newg.sched))
newg.sched.sp = sp
newg.stktopsp = sp
newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
newg.sched.g = guintptr(unsafe.Pointer(newg))
gostartcallfn(&newg.sched, fn)
newg.gopc = callerpc
newg.startpc = fn.fn
if _g_.m.curg != nil {
newg.labels = _g_.m.curg.labels
}
casgstatus(newg, _Gdead, _Grunnable)
newg.goid = int64(_p_.goidcache)
_p_.goidcache++
runqput(_p_, newg, true)
if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 && mainStarted {
wakep()
}
_g_.m.locks--
if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in case we've cleared it in newstack
_g_.stackguard0 = stackPreempt
}
}
So the result of go func is a call to runqput, which puts the g on a run queue. Before the g is queued, however, two small things happen:
newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
gostartcallfn
// adjust Gobuf as if it executed a call to fn
// and then did an immediate gosave.
func gostartcallfn(gobuf *gobuf, fv *funcval) {
var fn unsafe.Pointer
if fv != nil {
fn = unsafe.Pointer(fv.fn)
} else {
fn = unsafe.Pointer(funcPC(nilfunc))
}
gostartcall(gobuf, fn, unsafe.Pointer(fv))
}
// adjust Gobuf as if it executed a call to fn with context ctxt
// and then did an immediate gosave.
func gostartcall(buf *gobuf, fn, ctxt unsafe.Pointer) {
sp := buf.sp
if sys.RegSize > sys.PtrSize {
sp -= sys.PtrSize
*(*uintptr)(unsafe.Pointer(sp)) = 0
}
sp -= sys.PtrSize
*(*uintptr)(unsafe.Pointer(sp)) = buf.pc // note: buf.pc here is actually the pc of goexit
buf.sp = sp
buf.pc = uintptr(fn)
buf.ctxt = ctxt
}
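gostartcall takes the goexit address that newproc1 stored in buf.pc and places it at the top of the goroutine's stack, then resets buf.pc to the entry of the goroutine's function. The purpose is that when any goroutine function finishes, its RET instruction pops the goexit address stored at sp into the pc register. The effect is that every goroutine function, on return, continues into runtime.goexit, which does some cleanup and then enters schedule again.
The scheduling loop is covered in more detail in the later discussion of the m's schedule.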
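One way to see this fake return address from user code is to walk the stack of a goroutine: on the Go versions I am aware of, the outermost frame reported by runtime.Callers is runtime.goexit, even though nothing ever called it. The sketch below assumes only the public runtime.Callers and runtime.CallersFrames APIs; the helper name bottomFrame is made up.

```go
package main

import (
	"fmt"
	"runtime"
)

// bottomFrame returns the name of the outermost function on the
// calling goroutine's stack, as reported by runtime.Callers.
func bottomFrame() string {
	pcs := make([]uintptr, 64)
	n := runtime.Callers(0, pcs)
	frames := runtime.CallersFrames(pcs[:n])
	name := ""
	for {
		f, more := frames.Next()
		if f.Function != "" {
			name = f.Function // remember the last named frame seen
		}
		if !more {
			break
		}
	}
	return name
}

func main() {
	done := make(chan string)
	go func() { done <- bottomFrame() }()
	fmt.Println(<-done)
}
```

The same frame shows up at the bottom of goroutine stack traces printed on panic, which is a direct consequence of the return address planted by gostartcall.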
runqput
Because the g is put on a runq rather than executed directly, user code cannot decide when it actually starts running. Now look at runqput:
// runqput tries to put g on the local runnable queue.
// If next is false, runqput adds g to the tail of the runnable queue.
// If next is true, runqput puts g in the _p_.runnext slot.
// If the run queue is full, runnext puts g on the global queue.
// Executed only by the owner P.
func runqput(_p_ *p, gp *g, next bool) {
if randomizeScheduler && next && fastrand()%2 == 0 {
next = false
}
if next {
retryNext:
oldnext := _p_.runnext
if !_p_.runnext.cas(oldnext, guintptr(unsafe.Pointer(gp))) {
goto retryNext
}
if oldnext == 0 {
return
}
// kick the old runnext out to the regular runq
gp = oldnext.ptr()
}
retry:
h := atomic.Load(&_p_.runqhead) // load-acquire, synchronize with consumers
t := _p_.runqtail
if t-h < uint32(len(_p_.runq)) {
_p_.runq[t%uint32(len(_p_.runq))].set(gp)
atomic.Store(&_p_.runqtail, t+1) // store-release, makes the item available for consumption
return
}
if runqputslow(_p_, gp, h, t) {
return
}
// the queue is not full now, so the put above will succeed
goto retry
}
runqputslow
// This is the slow path: it moves a batch of g's from the local queue (plus the current one) to the global queue in one go
// Executed only by the owner P
func runqputslow(_p_ *p, gp *g, h, t uint32) bool {
var batch [len(_p_.runq)/2 + 1]*g
// first, grab a batch of g's from the local queue
n := t - h
n = n / 2
if n != uint32(len(_p_.runq)/2) {
throw("runqputslow: queue is not full")
}
for i := uint32(0); i < n; i++ {
batch[i] = _p_.runq[(h+i)%uint32(len(_p_.runq))].ptr()
}
if !atomic.Cas(&_p_.runqhead, h, h+n) { // cas-release, commits consume
return false
}
batch[n] = gp
if randomizeScheduler {
for i := uint32(1); i <= n; i++ {
j := fastrandn(i + 1)
batch[i], batch[j] = batch[j], batch[i]
}
}
// link the goroutines into a list
for i := uint32(0); i < n; i++ {
batch[i].schedlink.set(batch[i+1])
}
// put the list on the global queue
lock(&sched.lock)
globrunqputbatch(batch[0], batch[n], int32(n+1))
unlock(&sched.lock)
return true
}
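Operating on the global sched requires taking the global sched.lock, and contention on a global lock is expensive, hence the name slow. When a p and its g's interact on an m, the code always runs on a single thread, so in many cases no lock is needed.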
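The fast and slow paths can be modelled in a few lines. The following is a simplified, single-threaded sketch, not the runtime's code: localQueue, put and size are hypothetical names, g's are plain ints, the capacity is 8 instead of 256, and the atomics, the runnext slot and the random shuffle are left out.

```go
package main

import "fmt"

const runqCap = 8 // the real runtime uses 256

// localQueue is a simplified, single-threaded model of p.runq.
// head and tail only ever grow; the slot index is taken modulo the capacity.
type localQueue struct {
	head, tail uint32
	buf        [runqCap]int
}

// put appends g to the local queue. When the queue is full it moves the
// older half of the queue plus g to the global slice, like runqputslow.
func (q *localQueue) put(g int, global *[]int) {
	if q.tail-q.head < runqCap {
		q.buf[q.tail%runqCap] = g
		q.tail++
		return
	}
	n := (q.tail - q.head) / 2
	for i := uint32(0); i < n; i++ {
		*global = append(*global, q.buf[(q.head+i)%runqCap])
	}
	q.head += n
	*global = append(*global, g)
}

// size reports how many items are currently in the local queue.
func (q *localQueue) size() int { return int(q.tail - q.head) }

func main() {
	var q localQueue
	var global []int
	for g := 1; g <= runqCap+1; g++ {
		q.put(g, &global)
	}
	fmt.Println("local:", q.size(), "global:", global)
}
```

With a capacity of 8, the ninth put finds the queue full, so items 1 to 4 and the new item 9 move to the global slice while 5 to 8 stay local.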
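How m works
There are three kinds of threads in the runtime: the main thread, the thread that runs sysmon, and ordinary user threads. The main thread is represented in the runtime by the global variable runtime.m0. User threads are the ordinary threads: they bind to a p and run the work in g's. Although there are three kinds, the runtime only ever has a single instance of each of the first two; only user threads have many instances.
The main thread m0
The main thread runs runtime.main, which executes linearly without jumps: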
graph TD
runtime.main --> A[init max stack size]
A --> B[systemstack execute -> newm -> sysmon]
B --> runtime.lockOsThread
runtime.lockOsThread --> runtime.init
runtime.init --> runtime.gcenable
runtime.gcenable --> main.init
main.init --> main.main
The sysmon thread
sysmon is started in runtime.main, but note that sysmon does not run on m0, because:
systemstack(func() {
newm(sysmon, nil)
})
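this creates a new m. That m differs from ordinary threads in that it does not need to bind a p in order to run; it is detached from the rest of the scheduling system.
sysmon is an infinite loop that is mainly responsible for the following:
- checkdead: check whether all goroutines are deadlocked; if so, call runtime.throw to force the program to exit. sysmon only does this once, at startup
- inject the results returned by netpoll into the global sched run queue
- retake p's that have been blocked in a syscall for a long time, and preempt g's that have been running for too long
- release span memory that has been idle for more than 5 minutes
Flowchart: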
graph TD
sysmon --> usleep
usleep --> checkdead
checkdead --> |every 10ms|C[netpollinited && lastpoll != 0]
C --> |yes|netpoll
netpoll --> injectglist
injectglist --> retake
C --> |no|retake
retake --> A[check forcegc needed]
A --> B[scavenge heap once in a while]
B --> usleep
// sysmon runs without a P, so write barriers are not allowed
//
//go:nowritebarrierrec
func sysmon() {
lock(&sched.lock)
sched.nmsys++
checkdead()
unlock(&sched.lock)
// If a heap span goes unused for 5 minutes after a garbage collection,
// hand it back to the operating system
scavengelimit := int64(5 * 60 * 1e9)
if debug.scavenge > 0 {
// Scavenge-a-lot for testing.
forcegcperiod = 10 * 1e6
scavengelimit = 20 * 1e6
}
lastscavenge := nanotime()
nscavenge := 0
lasttrace := int64(0)
idle := 0 // how many cycles in succession we had not wokeup somebody
delay := uint32(0)
for {
if idle == 0 { // start with a 20us sleep
delay = 20
} else if idle > 50 { // start doubling the sleep after 1ms...
delay *= 2
}
if delay > 10*1000 { // up to 10ms
delay = 10 * 1000
}
usleep(delay)
if debug.schedtrace <= 0 && (sched.gcwaiting != 0 || atomic.Load(&sched.npidle) == uint32(gomaxprocs)) {
lock(&sched.lock)
if atomic.Load(&sched.gcwaiting) != 0 || atomic.Load(&sched.npidle) == uint32(gomaxprocs) {
atomic.Store(&sched.sysmonwait, 1)
unlock(&sched.lock)
// Make wake-up period small enough
// for the sampling to be correct.
maxsleep := forcegcperiod / 2
if scavengelimit < forcegcperiod {
maxsleep = scavengelimit / 2
}
shouldRelax := true
if osRelaxMinNS > 0 {
next := timeSleepUntil()
now := nanotime()
if next-now < osRelaxMinNS {
shouldRelax = false
}
}
if shouldRelax {
osRelax(true)
}
notetsleep(&sched.sysmonnote, maxsleep)
if shouldRelax {
osRelax(false)
}
lock(&sched.lock)
atomic.Store(&sched.sysmonwait, 0)
noteclear(&sched.sysmonnote)
idle = 0
delay = 20
}
unlock(&sched.lock)
}
// trigger libc interceptors if needed
if *cgo_yield != nil {
asmcgocall(*cgo_yield, nil)
}
// poll the network if it has not been polled for more than 10ms
lastpoll := int64(atomic.Load64(&sched.lastpoll))
now := nanotime()
if netpollinited() && lastpoll != 0 && lastpoll+10*1000*1000 < now {
atomic.Cas64(&sched.lastpoll, uint64(lastpoll), uint64(now))
gp := netpoll(false) // non-blocking, returns a list of goroutines
if gp != nil {
// Need to decrement number of idle locked M's
// (pretending that one more is running) before injectglist.
// Otherwise it can lead to the following situation:
// injectglist grabs all P's but before it starts M's to run the P's,
// another M returns from syscall, finishes running its G,
// observes that there is no work to do and no other running M's
// and reports deadlock.
incidlelocked(-1)
injectglist(gp)
incidlelocked(1)
}
}
// retake P's that are blocked in syscalls
// and preempt long-running G's
if retake(now) != 0 {
idle = 0
} else {
idle++
}
// check whether a forced GC is needed (once every two minutes)
if t := (gcTrigger{kind: gcTriggerTime, now: now}); t.test() && atomic.Load(&forcegc.idle) != 0 {
lock(&forcegc.lock)
forcegc.idle = 0
forcegc.g.schedlink = 0
injectglist(forcegc.g)
unlock(&forcegc.lock)
}
// scavenge the heap once in a while
if lastscavenge+scavengelimit/2 < now {
mheap_.scavenge(int32(nscavenge), uint64(now), uint64(scavengelimit))
lastscavenge = now
nscavenge++
}
if debug.schedtrace > 0 && lasttrace+int64(debug.schedtrace)*1000000 <= now {
lasttrace = now
schedtrace(debug.scheddetail > 0)
}
}
}
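The sleep interval at the top of the sysmon loop is a small backoff scheme: 20us while there is work, doubling after more than 50 idle rounds, capped at 10ms. The sketch below isolates just that computation; nextDelay is a hypothetical helper, not a runtime function.

```go
package main

import "fmt"

// nextDelay mirrors the sleep computation at the top of the sysmon loop.
// idle is the number of consecutive rounds in which retake did nothing;
// delay is the previous sleep in microseconds.
func nextDelay(idle int, delay uint32) uint32 {
	if idle == 0 {
		delay = 20
	} else if idle > 50 {
		delay *= 2
	}
	if delay > 10*1000 {
		delay = 10 * 1000
	}
	return delay
}

func main() {
	delay := uint32(0)
	for idle := 0; idle <= 60; idle++ {
		delay = nextDelay(idle, delay)
		if idle == 0 || idle == 50 || idle == 51 || idle == 60 {
			fmt.Printf("idle=%d delay=%dus\n", idle, delay)
		}
	}
}
```

The first 50 idle rounds at 20us add up to about 1ms, which is what the "start doubling the sleep after 1ms" comment in the source refers to.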
checkdead
// Check for a deadlock situation
// The check is based on the number of running M's; if it is 0, we have a deadlock
// sched.lock must be held while checking
func checkdead() {
// For -buildmode=c-shared or -buildmode=c-archive
// it is OK if no goroutine is running, because the program calling the library is presumably running
if islibrary || isarchive {
return
}
// If we are dying because of a signal caught on an already idle thread,
// freezetheworld will cause all running threads to block.
// And runtime will essentially enter into deadlock state,
// except that there is a thread that will call exit soon.
if panicking > 0 {
return
}
run := mcount() - sched.nmidle - sched.nmidlelocked - sched.nmsys
if run > 0 {
return
}
if run < 0 {
print("runtime: checkdead: nmidle=", sched.nmidle, " nmidlelocked=", sched.nmidlelocked, " mcount=", mcount(), " nmsys=", sched.nmsys, "\n")
throw("checkdead: inconsistent counts")
}
grunning := 0
lock(&allglock)
for i := 0; i < len(allgs); i++ {
gp := allgs[i]
if isSystemGoroutine(gp) {
continue
}
s := readgstatus(gp)
switch s &^ _Gscan {
case _Gwaiting:
grunning++
case _Grunnable,
_Grunning,
_Gsyscall:
unlock(&allglock)
print("runtime: checkdead: find g ", gp.goid, " in status ", s, "\n")
throw("checkdead: runnable g")
}
}
unlock(&allglock)
if grunning == 0 { // possible if main goroutine calls runtime·Goexit()
throw("no goroutines (main called runtime.Goexit) - deadlock!")
}
// Maybe jump time forward for playground.
gp := timejump()
if gp != nil {
casgstatus(gp, _Gwaiting, _Grunnable)
globrunqput(gp)
_p_ := pidleget()
if _p_ == nil {
throw("checkdead: no p for timer")
}
mp := mget()
if mp == nil {
// There should always be a free M since
// nothing is running.
throw("checkdead: no m for timer")
}
mp.nextp.set(_p_)
notewakeup(&mp.park)
return
}
getg().m.throwing = -1 // do not dump full stacks
throw("all goroutines are asleep - deadlock!")
}
retake
// forcePreemptNS is the time slice given to a G before it is
// preempted.
const forcePreemptNS = 10 * 1000 * 1000 // 10ms
func retake(now int64) uint32 {
n := 0
// Prevent allp slice changes. This lock will be completely
// uncontended unless we're already stopping the world.
lock(&allpLock)
// We can't use a range loop over allp because we may
// temporarily drop the allpLock. Hence, we need to re-fetch
// allp each time around the loop.
for i := 0; i < len(allp); i++ {
_p_ := allp[i]
if _p_ == nil {
// this can happen when procresize has modified allp
// but has not yet created the new P's
continue
}
pd := &_p_.sysmontick
s := _p_.status
if s == _Psyscall {
// retake the P from the syscall if it has been there for more than one sysmon tick (at least 20us)
t := int64(_p_.syscalltick)
if int64(pd.syscalltick) != t {
pd.syscalltick = uint32(t)
pd.syscallwhen = now
continue
}
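// On the one hand we don't want to retake the P if there is no other work to do
// but on the other hand we want to retake it eventually, because otherwise it keeps the sysmon thread from going into deep sleep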
if runqempty(_p_) && atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) > 0 && pd.syscallwhen+10*1000*1000 > now {
continue
}
// drop allpLock so that we can take sched.lock
unlock(&allpLock)
// Need to decrement number of idle locked M's
// (pretending that one more is running) before the CAS.
// Otherwise the M from which we retake can exit the syscall,
// increment nmidle and report deadlock.
incidlelocked(-1)
if atomic.Cas(&_p_.status, s, _Pidle) {
if trace.enabled {
traceGoSysBlock(_p_)
traceProcStop(_p_)
}
n++
_p_.syscalltick++
handoffp(_p_)
}
incidlelocked(1)
lock(&allpLock)
} else if s == _Prunning {
// preempt the G if it has been running for too long
t := int64(_p_.schedtick)
if int64(pd.schedtick) != t {
pd.schedtick = uint32(t)
pd.schedwhen = now
continue
}
if pd.schedwhen+forcePreemptNS > now {
continue
}
preemptone(_p_)
}
}
unlock(&allpLock)
return uint32(n)
}
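The _Prunning branch of retake boils down to a small state machine per P: remember the last schedtick seen and when it was first seen, and request preemption once the same tick has been observed for a full time slice. A minimal sketch of that logic follows; tickState and shouldPreempt are hypothetical names, and the real code calls preemptone instead of returning a bool.

```go
package main

import "fmt"

const forcePreemptNS = 10 * 1000 * 1000 // 10ms, same value as in the runtime

// tickState is a hypothetical stand-in for the schedtick/schedwhen pair
// that sysmon keeps per P in sysmontick.
type tickState struct {
	schedtick uint32
	schedwhen int64
}

// shouldPreempt models the _Prunning branch of retake. curTick is the P's
// current schedtick and now is the current time in nanoseconds.
func shouldPreempt(pd *tickState, curTick uint32, now int64) bool {
	if pd.schedtick != curTick {
		// schedtick advanced since the last check: restart the clock.
		pd.schedtick = curTick
		pd.schedwhen = now
		return false
	}
	// Same tick as last time: preempt once a full time slice has elapsed.
	return pd.schedwhen+forcePreemptNS <= now
}

func main() {
	pd := &tickState{}
	ms := int64(1000 * 1000)
	fmt.Println(shouldPreempt(pd, 1, 0))     // new tick observed
	fmt.Println(shouldPreempt(pd, 1, 5*ms))  // same tick, 5ms elapsed
	fmt.Println(shouldPreempt(pd, 1, 12*ms)) // same tick, 12ms elapsed
	fmt.Println(shouldPreempt(pd, 2, 13*ms)) // tick advanced again
}
```

Note that the time slice is measured from the moment sysmon first notices a tick, not from the moment the G actually started running, so the effective slice can be somewhat longer than 10ms.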
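Ordinary threads
An ordinary thread is the M of the G/P/M model; an M corresponds to an operating system thread.
Thread creation
As seen above when the sysmon thread was created, the function that creates threads is newm.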
graph TD
newm --> newm1
newm1 --> newosproc
newosproc --> clone
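Eventually this reaches clone, the Linux system call used to create threads. The code contains large chunks of cgo-related logic that we don't care about here; with that logic removed, it looks like this:
// Create a new m. On startup it will call fn, or else the schedule function
// fn must be static and must not be a heap-allocated closure.
// The m may run with m.p == nil, so write barriers are not allowed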
//go:nowritebarrierrec
func newm(fn func(), _p_ *p) {
mp := allocm(_p_, fn)
mp.nextp.set(_p_)
mp.sigmask = initSigmask
newm1(mp)
}
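The p passed in is assigned to the m's nextp field. When the m goes on to run the scheduler, nextp is taken out and the real binding is performed (in effect: nextp is set to nil, the p it held is assigned to m.p, and the m is assigned to p.m).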
func newm1(mp *m) {
execLock.rlock() // Prevent process clone.
newosproc(mp, unsafe.Pointer(mp.g0.stack.hi))
execLock.runlock()
}
func newosproc(mp *m, stk unsafe.Pointer) {
// Disable signals during clone, so that the new thread starts
// with signals disabled. It will enable them in minit.
var oset sigset
sigprocmask(_SIG_SETMASK, &sigset_all, &oset)
ret := clone(cloneFlags, stk, unsafe.Pointer(mp), unsafe.Pointer(mp.g0), unsafe.Pointer(funcPC(mstart)))
sigprocmask(_SIG_SETMASK, &oset, nil)
if ret < 0 {
print("runtime: failed to create new OS thread (have ", mcount(), " already; errno=", -ret, ")\n")
if ret == -_EAGAIN {
println("runtime: may need to increase max user processes (ulimit -u)")
}
throw("newosproc")
}
}
Workflow
Idle m's are first put on the midle list of the global scheduler; when an m is needed, it is taken from there first:
//go:nowritebarrierrec
// Try to get an m from the midle list
// sched.lock must be held
// May run during STW, so write barriers are not allowed
func mget() *m {
mp := sched.midle.ptr()
if mp != nil {
sched.midle = mp.schedlink
sched.nmidle--
}
return mp
}
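If none is available, newm (described above) is called to create a new thread. Created threads are generally not destroyed: even when that many m's are no longer needed, they are simply parked on midle.
When are threads created? We can trace the callers of newm: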
graph TD
main --> |sysmon|newm
startTheWorld --> startTheWorldWithSema
gcMarkTermination --> startTheWorldWithSema
gcStart--> startTheWorldWithSema
startTheWorldWithSema --> |helpgc|newm
startTheWorldWithSema --> |run p|newm
startm --> mget
mget --> |if no free m|newm
startTemplateThread --> |templateThread|newm
LockOsThread --> startTemplateThread
main --> |iscgo|startTemplateThread
handoffp --> startm
wakep --> startm
injectglist --> startm
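Basically, m's are created on demand: if an m is needed and sched.midle has no idle one, a new one is created.
A newly created thread must bind to a p before it starts running, and it can be stripped of its p while running. For example, the retake flow shown earlier sets a g's stackguard0 to stackPreempt; the next time the g enters newstack, it checks for this preemption mark and, if it is present, gives up the CPU. This is what is known as cooperative preemption.
The core of what a worker thread runs really comes down to two things: schedule() and findrunnable().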
schedule
graph TD
schedule --> A[schedtick%61 == 0]
A --> |yes|globrunqget
A --> |no|runqget
globrunqget --> C[gp == nil]
C --> |no|execute
C --> |yes|runqget
runqget --> B[gp == nil]
B --> |no|execute
B --> |yes|findrunnable
findrunnable --> execute
// One round of the scheduler: find a runnable goroutine and execute it
// The scheduling function loops and never returns
func schedule() {
_g_ := getg()
if _g_.m.locks != 0 {
throw("schedule: holding locks")
}
if _g_.m.lockedg != 0 {
stoplockedm()
execute(_g_.m.lockedg.ptr(), false) // Never returns.
}
// A g executing a cgo call must not be scheduled away
// because the cgo call is using the m's g0 stack
if _g_.m.incgo {
throw("schedule: in cgo")
}
top:
if sched.gcwaiting != 0 {
gcstopm()
goto top
}
if _g_.m.p.ptr().runSafePointFn != 0 {
runSafePointFn()
}
var gp *g
var inheritTime bool
if trace.enabled || trace.shutdown {
gp = traceReader()
if gp != nil {
casgstatus(gp, _Gwaiting, _Grunnable)
traceGoUnpark(gp, 0)
}
}
if gp == nil && gcBlackenEnabled != 0 {
gp = gcController.findRunnableGCWorker(_g_.m.p.ptr())
}
if gp == nil {
// Check the global runq once in a while to ensure fairness
// Otherwise two goroutines could completely occupy the local runq
// by constantly respawning each other
if _g_.m.p.ptr().schedtick%61 == 0 && sched.runqsize > 0 {
lock(&sched.lock)
gp = globrunqget(_g_.m.p.ptr(), 1)
unlock(&sched.lock)
}
}
if gp == nil {
gp, inheritTime = runqget(_g_.m.p.ptr())
if gp != nil && _g_.m.spinning {
throw("schedule: spinning with local work")
}
}
if gp == nil {
gp, inheritTime = findrunnable() // blocks until a goroutine is found
}
// This thread is about to run a goroutine and will no longer be spinning,
// so if it was marked as spinning we need to reset that state
// and possibly start a new spinning M
if _g_.m.spinning {
resetspinning()
}
if gp.lockedm != 0 {
// Hands off own p to the locked m,
// then blocks waiting for a new p.
startlockedm(gp)
goto top
}
execute(gp, inheritTime)
}
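Ignoring GC workers, tracing and locked threads, the order in which schedule looks for work can be sketched as a tiny pure function. Everything below is a hypothetical model: pick and pop are made-up names, queues are string slices, and the last branch merely stands in for findrunnable, which in reality also steals from other P's and polls the network.

```go
package main

import "fmt"

// pick models the lookup order in schedule: on every 61st tick the global
// queue is tried first, otherwise the local queue, then the global queue as
// a stand-in for findrunnable. It returns the chosen g and its source.
func pick(tick uint32, local, global *[]string) (string, string) {
	if tick%61 == 0 && len(*global) > 0 {
		return pop(global), "global"
	}
	if len(*local) > 0 {
		return pop(local), "local"
	}
	if len(*global) > 0 {
		return pop(global), "findrunnable"
	}
	return "", "none"
}

// pop removes and returns the first element of a non-empty queue.
func pop(q *[]string) string {
	g := (*q)[0]
	*q = (*q)[1:]
	return g
}

func main() {
	local := []string{"l1", "l2"}
	global := []string{"g1", "g2"}
	for _, tick := range []uint32{1, 61, 2, 3} {
		g, src := pick(tick, &local, &global)
		fmt.Println(tick, g, src)
	}
}
```

The periodic look at the global queue is what keeps a busy local queue from starving goroutines that were pushed to the global queue.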
The so-called scheduling loop of an m is simply the loop in the diagram below, executed over and over:
graph TD
schedule --> execute
execute --> gogo
gogo --> goexit
goexit --> goexit1
goexit1 --> goexit0
goexit0 --> schedule
execute
// Schedules gp to run on the current M.
// If inheritTime is true, gp inherits the remaining time in the
// current time slice. Otherwise, it starts a new time slice.
// Never returns.
//
// Write barriers are allowed because this is called immediately after
// acquiring a P in several places.
//
//go:yeswritebarrierrec
func execute(gp *g, inheritTime bool) {
_g_ := getg() // this may be the m's g0
casgstatus(gp, _Grunnable, _Grunning)
gp.waitsince = 0
gp.preempt = false
gp.stackguard0 = gp.stack.lo + _StackGuard
if !inheritTime {
_g_.m.p.ptr().schedtick++
}
_g_.m.curg = gp // make gp the m's current g
gp.m = _g_.m // point gp back at the m, completing the two-way link
gogo(&gp.sched)
}
This one is fairly simple: it binds the g and the m together, then calls gogo to run the function of the bound g.
gogo
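runtime.gogo is written in assembly. Its job is to run the func() of go func(). As the code shows, it mainly moves the contents of the g's gobuf into registers and then continues execution from the instruction address stored in gobuf.pc.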
// void gogo(Gobuf*)
// restore state from Gobuf; longjmp
TEXT runtime·gogo(SB), NOSPLIT, $16-8
MOVQ buf+0(FP), BX // gobuf
MOVQ gobuf_g(BX), DX
MOVQ 0(DX), CX // make sure g != nil
get_tls(CX)
MOVQ DX, g(CX)
MOVQ gobuf_sp(BX), SP // restore SP
MOVQ gobuf_ret(BX), AX
MOVQ gobuf_ctxt(BX), DX
MOVQ gobuf_bp(BX), BP
MOVQ $0, gobuf_sp(BX) // clear to help garbage collector
MOVQ $0, gobuf_ret(BX)
MOVQ $0, gobuf_ctxt(BX)
MOVQ $0, gobuf_bp(BX)
MOVQ gobuf_pc(BX), BX
JMP BX
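Of course, a few things here differ from ordinary hand-written assembly and look a little odd. Take gobuf_sp(BX): in standard plan9 assembly gobuf_sp would be just a symbol and would not denote an offset, yet here the name stands in for the field's offset. How does that work?
This is a privilege of the runtime that relies on help from the toolchain. First, look at the comment at the start of the gobuf struct definition in the runtime:
// The offsets of sp, pc, and g are known to (hard-coded in) libmach.
So the field offsets are treated as known constants. As far as I can tell, names such as gobuf_sp are defined as those offsets in the generated header go_asm.h, which the runtime's assembly files include, so the name is replaced by the offset when the file is assembled.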
Goexit
Goexit :
// Goexit terminates the goroutine that calls it. No other goroutine is affected.
// Goexit runs all deferred calls before terminating the goroutine. Because Goexit
// is not a panic, any recover calls in those deferred functions will return nil.
//
// Calling Goexit from the main goroutine terminates that goroutine
// without func main returning. Since func main has not returned,
// the program continues execution of other goroutines.
// If all other goroutines exit, the program crashes.
func Goexit() {
// Run all deferred functions for the current goroutine.
// This code is similar to gopanic, see that implementation
// for detailed comments.
gp := getg()
for {
d := gp._defer
if d == nil {
break
}
if d.started {
if d._panic != nil {
d._panic.aborted = true
d._panic = nil
}
d.fn = nil
gp._defer = d.link
freedefer(d)
continue
}
d.started = true
reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
if gp._defer != d {
throw("bad defer entry in Goexit")
}
d._panic = nil
d.fn = nil
gp._defer = d.link
freedefer(d)
// Note: we ignore recovers here because Goexit isn't a panic
}
goexit1()
}
// Finishes execution of the current goroutine.
func goexit1() {
if raceenabled {
racegoend()
}
if trace.enabled {
traceGoEnd()
}
mcall(goexit0)
}
// The top-most function running on a goroutine
// returns to goexit+PCQuantum.
TEXT runtime·goexit(SB),NOSPLIT,$0-0
BYTE $0x90 // NOP
CALL runtime·goexit1(SB) // does not return
// traceback from goexit1 must hit code range of goexit
BYTE $0x90 // NOP
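The behaviour described in the Goexit doc comment can be observed from user code. The sketch below uses only runtime.Goexit and recover from the standard library; the helper name runWithGoexit is made up for this example.

```go
package main

import (
	"fmt"
	"runtime"
)

// runWithGoexit starts a goroutine that calls runtime.Goexit and reports
// what its deferred function observed.
func runWithGoexit() (deferRan bool, recovered interface{}, reachedEnd bool) {
	done := make(chan struct{})
	go func() {
		defer close(done) // runs last, after the defer below
		defer func() {
			deferRan = true
			recovered = recover() // Goexit is not a panic, so this stays nil
		}()
		runtime.Goexit()
		reachedEnd = true // not executed: Goexit terminates the goroutine
	}()
	<-done
	return deferRan, recovered, reachedEnd
}

func main() {
	deferRan, recovered, reachedEnd := runWithGoexit()
	fmt.Println("defer ran:", deferRan)
	fmt.Println("recover value:", recovered)
	fmt.Println("code after Goexit ran:", reachedEnd)
}
```

The deferred calls run inside the loop in Goexit shown above, and only afterwards does goexit1 switch to g0 through mcall.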
mcall:
// func mcall(fn func(*g))
// Switch to m->g0's stack, call fn(g).
// Fn must never return. It should gogo(&g->sched)
// to keep running g.
TEXT runtime·mcall(SB), NOSPLIT, $0-8
MOVQ fn+0(FP), DI
get_tls(CX)
MOVQ g(CX), AX // save state in g->sched
MOVQ 0(SP), BX // caller's PC
MOVQ BX, (g_sched+gobuf_pc)(AX)
LEAQ fn+0(FP), BX // caller's SP
MOVQ BX, (g_sched+gobuf_sp)(AX)
MOVQ AX, (g_sched+gobuf_g)(AX)
MOVQ BP, (g_sched+gobuf_bp)(AX)
// switch to m->g0 & its stack, call fn
MOVQ g(CX), BX
MOVQ g_m(BX), BX
MOVQ m_g0(BX), SI
CMPQ SI, AX // if g == m->g0 call badmcall
JNE 3(PC)
MOVQ $runtime·badmcall(SB), AX
JMP AX
MOVQ SI, g(CX) // g = m->g0
MOVQ (g_sched+gobuf_sp)(SI), SP // sp = m->g0->sched.sp
PUSHQ AX
MOVQ DI, DX
MOVQ 0(DI), DI
CALL DI
POPQ AX
MOVQ $runtime·badmcall2(SB), AX
JMP AX
RET
wakep
// Tries to add one more P to execute G's.
// Called when a G is made runnable (newproc, ready).
func wakep() {
// be conservative about spinning threads
if !atomic.Cas(&sched.nmspinning, 0, 1) {
return
}
startm(nil, true)
}
// Schedules some M to run the p (creates an M if necessary).
// If p==nil, tries to get an idle P, if no idle P's does nothing.
// May run with m.p==nil, so write barriers are not allowed.
// If spinning is set, the caller has incremented nmspinning and startm will
// either decrement nmspinning or set m.spinning in the newly started M.
//go:nowritebarrierrec
func startm(_p_ *p, spinning bool) {
lock(&sched.lock)
if _p_ == nil {
_p_ = pidleget()
if _p_ == nil {
unlock(&sched.lock)
if spinning {
// The caller incremented nmspinning, but there are no idle Ps,
// so it's okay to just undo the increment and give up.
if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 {
throw("startm: negative nmspinning")
}
}
return
}
}
mp := mget()
unlock(&sched.lock)
if mp == nil {
var fn func()
if spinning {
// The caller incremented nmspinning, so set m.spinning in the new M.
fn = mspinning
}
newm(fn, _p_)
return
}
if mp.spinning {
throw("startm: m is spinning")
}
if mp.nextp != 0 {
throw("startm: m has p")
}
if spinning && !runqempty(_p_) {
throw("startm: p has runnable gs")
}
// The caller incremented nmspinning, so set m.spinning in the new M.
mp.spinning = spinning
mp.nextp.set(_p_)
notewakeup(&mp.park)
}
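The guard at the top of wakep is a plain compare-and-swap from 0 to 1. A rough user-space sketch with sync/atomic, where tryWake and countWinners are hypothetical names and the counter is never reset, shows that only one of many concurrent callers gets past it:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// tryWake mimics the guard in wakep: only the caller that flips
// nmspinning from 0 to 1 would go on to call startm.
func tryWake(nmspinning *uint32) bool {
	return atomic.CompareAndSwapUint32(nmspinning, 0, 1)
}

// countWinners runs n concurrent tryWake calls on one counter and
// returns how many of them passed the guard.
func countWinners(n int) int32 {
	var nmspinning uint32
	var winners int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if tryWake(&nmspinning) {
				atomic.AddInt32(&winners, 1)
			}
		}()
	}
	wg.Wait()
	return winners
}

func main() {
	fmt.Println("callers that passed the guard:", countWinners(100))
}
```

In the runtime the counter goes back down once the spinning M finds work or gives up, so the guard only limits how many threads are woken to spin at the same time.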
Parking a goroutine
// Puts the current goroutine into a waiting state and calls unlockf.
// If unlockf returns false, the goroutine is resumed.
// unlockf must not access this G's stack, as it may be moved between
// the call to gopark and the call to unlockf.
func gopark(unlockf func(*g, unsafe.Pointer) bool, lock unsafe.Pointer, reason string, traceEv byte, traceskip int) {
mp := acquirem()
gp := mp.curg
status := readgstatus(gp)
if status != _Grunning && status != _Gscanrunning {
throw("gopark: bad g status")
}
mp.waitlock = lock
mp.waitunlockf = *(*unsafe.Pointer)(unsafe.Pointer(&unlockf))
gp.waitreason = reason
mp.waittraceev = traceEv
mp.waittraceskip = traceskip
releasem(mp)
// can't do anything that might move the G between Ms here.
mcall(park_m)
}
func goready(gp *g, traceskip int) {
systemstack(func() {
ready(gp, traceskip, true)
})
}
// Mark gp ready to run.
func ready(gp *g, traceskip int, next bool) {
if trace.enabled {
traceGoUnpark(gp, traceskip)
}
status := readgstatus(gp)
// Mark runnable.
_g_ := getg()
_g_.m.locks++ // disable preemption because it can be holding p in a local var
if status&^_Gscan != _Gwaiting {
dumpgstatus(gp)
throw("bad g->status in ready")
}
// status is Gwaiting or Gscanwaiting, make Grunnable and put on runq
casgstatus(gp, _Gwaiting, _Grunnable)
runqput(_g_.m.p.ptr(), gp, next)
if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 {
wakep()
}
_g_.m.locks--
if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in Case we've cleared it in newstack
_g_.stackguard0 = stackPreempt
}
}
func notesleep(n *note) {
gp := getg()
if gp != gp.m.g0 {
throw("notesleep not on g0")
}
ns := int64(-1)
if *cgo_yield != nil {
// Sleep for an arbitrary-but-moderate interval to poll libc interceptors.
ns = 10e6
}
for atomic.Load(key32(&n.key)) == 0 {
gp.m.blocked = true
futexsleep(key32(&n.key), 0, ns)
if *cgo_yield != nil {
asmcgocall(*cgo_yield, nil)
}
gp.m.blocked = false
}
}
// One-time notifications.
func noteclear(n *note) {
n.key = 0
}
func notewakeup(n *note) {
old := atomic.Xchg(key32(&n.key), 1)
if old != 0 {
print("notewakeup - double wakeup (", old, ")\n")
throw("notewakeup - double wakeup")
}
futexwakeup(key32(&n.key), 1)
}
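The note functions above implement a one-time notification on top of a futex. As a loose analogy only, the sketch below rebuilds the same contract in user space, with a channel standing in for the futex; the note type and its methods here are hypothetical and unrelated to the runtime's implementation.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// note is a hypothetical user-space analogue of the runtime's note:
// key goes from 0 to 1 exactly once, and ch plays the role of the futex.
type note struct {
	key uint32
	ch  chan struct{}
}

func newNote() *note { return &note{ch: make(chan struct{})} }

// wakeup sets key to 1 and releases sleepers. It returns false on a
// double wakeup, the case where the runtime would throw.
func (n *note) wakeup() bool {
	if atomic.SwapUint32(&n.key, 1) != 0 {
		return false
	}
	close(n.ch)
	return true
}

// sleep blocks until wakeup has been called.
func (n *note) sleep() {
	for atomic.LoadUint32(&n.key) == 0 {
		<-n.ch
	}
}

func main() {
	n := newNote()
	first := make(chan bool, 1)
	go func() { first <- n.wakeup() }()
	n.sleep()
	fmt.Println("first wakeup accepted:", <-first)
	fmt.Println("second wakeup accepted:", n.wakeup())
}
```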
findrunnable
findrunnable is fairly complex; the flowchart below leaves out the GC-related parts for now:
graph TD
runqget --> A[gp == nil]
A --> |no|return
A --> |yes|globrunqget
globrunqget --> B[gp == nil]
B --> |no| return
B --> |yes| C[netpollinited && lastpoll != 0]
C --> |yes|netpoll
netpoll --> K[gp == nil]
K --> |no|return
K --> |yes|runqsteal
C --> |no|runqsteal
runqsteal --> D[gp == nil]
D --> |no|return
D --> |yes|E[globrunqget]
E --> F[gp == nil]
F --> |no| return
F --> |yes| G[check all p's runq]
G --> H[runq is empty]
H --> |no|runqget
H --> |yes|I[netpoll]
I --> J[gp == nil]
J --> |no| return
J --> |yes| stopm
stopm --> runqget
// Finds a runnable goroutine to execute.
// Tries to steal g's from other Ps, take one from the global queue, or poll the network.
func findrunnable() (gp *g, inheritTime bool) {
_g_ := getg()
// The conditions here and in handoffp must agree: if
// findrunnable would return a G to run, handoffp must start
// an M.
top:
_p_ := _g_.m.p.ptr()
if sched.gcwaiting != 0 {
gcstopm()
goto top
}
if _p_.runSafePointFn != 0 {
runSafePointFn()
}
if fingwait && fingwake {
if gp := wakefing(); gp != nil {
ready(gp, 0, true)
}
}
if *cgo_yield != nil {
asmcgocall(*cgo_yield, nil)
}
// local runq
if gp, inheritTime := runqget(_p_); gp != nil {
return gp, inheritTime
}
// global runq
if sched.runqsize != 0 {
lock(&sched.lock)
gp := globrunqget(_p_, 0)
unlock(&sched.lock)
if gp != nil {
return gp, false
}
}
// Poll network.
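// This netpoll is only an optimization before we resort to work-stealing.
// We can safely skip it if there are no netpoll waiters or a thread is already blocked in netpoll.
// If there is any kind of logical race with that blocked thread (e.g. it has already returned from netpoll but has not set lastpoll yet),
// this thread will do the blocking netpoll below anyway.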
if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Load64(&sched.lastpoll) != 0 {
if gp := netpoll(false); gp != nil { // non-blocking
// netpoll returns a list of goroutines linked through schedlink
injectglist(gp.schedlink.ptr())
casgstatus(gp, _Gwaiting, _Grunnable)
if trace.enabled {
traceGoUnpark(gp, 0)
}
return gp, false
}
}
// steal g's from other Ps
procs := uint32(gomaxprocs)
if atomic.Load(&sched.npidle) == procs-1 {
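// Either GOMAXPROCS=1 or every P except ours is already idle.
// New work can come from returning syscalls/cgocalls, the network, or timers.
// None of those submit to a local runq, so there is no point in stealing.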
goto stop
}
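// If the number of spinning Ms >= the number of busy Ps, block.
// This is necessary to prevent excessive CPU consumption
// when GOMAXPROCS is much greater than 1 but the program's parallelism is low.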
if !_g_.m.spinning && 2*atomic.Load(&sched.nmspinning) >= procs-atomic.Load(&sched.npidle) {
goto stop
}
if !_g_.m.spinning {
_g_.m.spinning = true
atomic.Xadd(&sched.nmspinning, 1)
}
for i := 0; i < 4; i++ {
for enum := stealOrder.start(fastrand()); !enum.done(); enum.next() {
if sched.gcwaiting != 0 {
goto top
}
stealRunNextG := i > 2 // first look for ready queues with more than 1 g
if gp := runqsteal(_p_, allp[enum.position()], stealRunNextG); gp != nil {
return gp, false
}
}
}
stop:
// We have nothing to do. If we are in the GC mark phase and can safely scan and blacken objects,
// run idle-time marking instead of giving up the P.
if gcBlackenEnabled != 0 && _p_.gcBgMarkWorker != 0 && gcMarkWorkAvailable(_p_) {
_p_.gcMarkWorkerMode = gcMarkWorkerIdleMode
gp := _p_.gcBgMarkWorker.ptr()
casgstatus(gp, _Gwaiting, _Grunnable)
if trace.enabled {
traceGoUnpark(gp, 0)
}
return gp, false
}
// Before we drop our P, make a snapshot of the allp slice,
// which can change underfoot once we no longer block
// safe-points. We don't need to snapshot the contents because
// everything up to cap(allp) is immutable.
allpSnapshot := allp
// return the P and block
lock(&sched.lock)
if sched.gcwaiting != 0 || _p_.runSafePointFn != 0 {
unlock(&sched.lock)
goto top
}
if sched.runqsize != 0 {
gp := globrunqget(_p_, 0)
unlock(&sched.lock)
return gp, false
}
if releasep() != _p_ {
throw("findrunnable: wrong p")
}
pidleput(_p_)
unlock(&sched.lock)
// Delicate dance: thread transitions from spinning to non-spinning state,
// potentially concurrently with submission of new goroutines. We must
// drop nmspinning first and then check all per-P queues again (with
// #StoreLoad memory barrier in between). If we do it the other way around,
// another thread can submit a goroutine after we've checked all run queues
// but before we drop nmspinning; as the result nobody will unpark a thread
// to run the goroutine.
// If we discover new work below, we need to restore m.spinning as a signal
// for resetspinning to unpark a new worker thread (because there can be more
// than one starving goroutine). However, if after discovering new work
// we also observe no idle Ps, it is OK to just park the current thread:
// the system is fully loaded so no spinning threads are required.
// Also see "Worker thread parking/unparking" comment at the top of the file.
wasSpinning := _g_.m.spinning
if _g_.m.spinning {
_g_.m.spinning = false
if int32(atomic.Xadd(&sched.nmspinning, -1)) < 0 {
throw("findrunnable: negative nmspinning")
}
}
// check all runqs once again
for _, _p_ := range allpSnapshot {
if !runqempty(_p_) {
lock(&sched.lock)
_p_ = pidleget()
unlock(&sched.lock)
if _p_ != nil {
acquirep(_p_)
if wasSpinning {
_g_.m.spinning = true
atomic.Xadd(&sched.nmspinning, 1)
}
goto top
}
break
}
}
// check for idle-priority GC work again
if gcBlackenEnabled != 0 && gcMarkWorkAvailable(nil) {
lock(&sched.lock)
_p_ = pidleget()
if _p_ != nil && _p_.gcBgMarkWorker == 0 {
pidleput(_p_)
_p_ = nil
}
unlock(&sched.lock)
if _p_ != nil {
acquirep(_p_)
if wasSpinning {
_g_.m.spinning = true
atomic.Xadd(&sched.nmspinning, 1)
}
// Go back to idle GC check.
goto stop
}
}
// poll network
if netpollinited() && atomic.Load(&netpollWaiters) > 0 && atomic.Xchg64(&sched.lastpoll, 0) != 0 {
if _g_.m.p != 0 {
throw("findrunnable: netpoll with p")
}
if _g_.m.spinning {
throw("findrunnable: netpoll with spinning")
}
gp := netpoll(true) // block until it returns
atomic.Store64(&sched.lastpoll, uint64(nanotime()))
if gp != nil {
lock(&sched.lock)
_p_ = pidleget()
unlock(&sched.lock)
if _p_ != nil {
acquirep(_p_)
injectglist(gp.schedlink.ptr())
casgstatus(gp, _Gwaiting, _Grunnable)
if trace.enabled {
traceGoUnpark(gp, 0)
}
return gp, false
}
injectglist(gp)
}
}
stopm()
goto top
}
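The search order in findrunnable is easier to keep in mind with a toy model. The sketch below is a heavily simplified, single-threaded illustration: toyP, toySched and findRunnable are hypothetical names, goroutines are plain ints, netpoll and GC are left out, and stealing "half, rounded up" is an assumption made for this example rather than a claim about runqsteal.

```go
package main

import "fmt"

// toyP is a hypothetical, simplified P: just a local run queue of goroutine ids.
type toyP struct{ runq []int }

// toySched holds the global run queue and all Ps.
type toySched struct {
	global []int
	allp   []*toyP
}

// pop removes and returns the head of q, if any.
func pop(q *[]int) (int, bool) {
	if len(*q) == 0 {
		return 0, false
	}
	g := (*q)[0]
	*q = (*q)[1:]
	return g, true
}

// findRunnable looks for work in the same order as the flowchart:
// local queue, global queue, then stealing from another P.
func (s *toySched) findRunnable(self *toyP) (int, string, bool) {
	if g, ok := pop(&self.runq); ok {
		return g, "local", true
	}
	if g, ok := pop(&s.global); ok {
		return g, "global", true
	}
	for _, p := range s.allp {
		if p == self || len(p.runq) == 0 {
			continue
		}
		n := (len(p.runq) + 1) / 2 // assumed policy: take half, rounded up
		stolen := p.runq[:n]
		p.runq = p.runq[n:]
		self.runq = append(self.runq, stolen[1:]...)
		return stolen[0], "steal", true
	}
	return 0, "", false // the real code would poll the network and then stopm
}

func main() {
	p0 := &toyP{runq: []int{1}}
	p1 := &toyP{runq: []int{10, 11, 12, 13}}
	s := &toySched{global: []int{5}, allp: []*toyP{p0, p1}}
	for i := 0; i < 4; i++ {
		g, from, ok := s.findRunnable(p0)
		fmt.Println(g, from, ok)
	}
}
```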
Unbinding m and p
handoffp
graph TD
mexit --> A[is m0?]
A --> |yes|B[handoffp]
A --> |no| C[iterate allm]
C --> |m found|handoffp
C --> |m not found| throw
forEachP --> |p status == syscall| handoffp
stoplockedm --> handoffp
entersyscallblock --> entersyscallblock_handoff
entersyscallblock_handoff --> handoffp
retake --> |p status == syscall| handoffp
Eventually the p is put back on the global pidle list:
// Hands off P from syscall or locked M.
// Always runs without a P, so write barriers are not allowed.
//go:nowritebarrierrec
func handoffp(_p_ *p) {
// handoffp must start an M in any situation where
// findrunnable would return a G to run on _p_.
// if it has local work, start it straight away
if !runqempty(_p_) || sched.runqsize != 0 {
startm(_p_, false)
return
}
// if it has GC work, start it straight away
if gcBlackenEnabled != 0 && gcMarkWorkAvailable(_p_) {
startm(_p_, false)
return
}
// no local work, check that there are no spinning/idle M's,
// otherwise our help is not required
if atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) == 0 && atomic.Cas(&sched.nmspinning, 0, 1) { // TODO: fast atomic
startm(_p_, true)
return
}
lock(&sched.lock)
if sched.gcwaiting != 0 {
_p_.status = _Pgcstop
sched.stopwait--
if sched.stopwait == 0 {
notewakeup(&sched.stopnote)
}
unlock(&sched.lock)
return
}
if _p_.runSafePointFn != 0 && atomic.Cas(&_p_.runSafePointFn, 1, 0) {
sched.safePointFn(_p_)
sched.safePointWait--
if sched.safePointWait == 0 {
notewakeup(&sched.safePointNote)
}
}
if sched.runqsize != 0 {
unlock(&sched.lock)
startm(_p_, false)
return
}
// If this is the last running P and nobody is polling network,
// need to wakeup another M to poll network.
if sched.npidle == uint32(gomaxprocs-1) && atomic.Load64(&sched.lastpoll) != 0 {
unlock(&sched.lock)
startm(_p_, false)
return
}
pidleput(_p_)
unlock(&sched.lock)
}
g state transitions
graph LR
start{newg} --> Gidle
Gidle --> |oneNewExtraM|Gdead
Gidle --> |newproc1|Gdead
Gdead --> |newproc1|Grunnable
Gdead --> |needm|Gsyscall
Gscanrunning --> |scang|Grunning
Grunnable --> |execute|Grunning
Gany --> |casgcopystack|Gcopystack
Gcopystack --> |todotodo|Grunning
Gsyscall --> |dropm|Gdead
Gsyscall --> |exitsyscall0|Grunnable
Gsyscall --> |exitsyscall|Grunning
Grunning --> |goschedImpl|Grunnable
Grunning --> |goexit0|Gdead
Grunning --> |newstack|Gcopystack
Grunning --> |reentersyscall|Gsyscall
Grunning --> |entersyscallblock|Gsyscall
Grunning --> |markroot|Gwaiting
Grunning --> |gcAssistAlloc1|Gwaiting
Grunning --> |park_m|Gwaiting
Grunning --> |gcMarkTermination|Gwaiting
Grunning --> |gcBgMarkWorker|Gwaiting
Grunning --> |newstack|Gwaiting
Gwaiting --> |gcMarkTermination|Grunning
Gwaiting --> |gcBgMarkWorker|Grunning
Gwaiting --> |markroot|Grunning
Gwaiting --> |gcAssistAlloc1|Grunning
Gwaiting --> |newstack|Grunning
Gwaiting --> |findRunnableGCWorker|Grunnable
Gwaiting --> |ready|Grunnable
Gwaiting --> |findrunnable|Grunnable
Gwaiting --> |injectglist|Grunnable
Gwaiting --> |schedule|Grunnable
Gwaiting --> |park_m|Grunnable
Gwaiting --> |procresize|Grunnable
Gwaiting --> |checkdead|Grunnable
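Gany in the diagram stands for any state. There are many state changes during GC; if you only care about transitions in the normal case, you can ignore markroot, gcMark and the like for now.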
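Each arrow in the diagram is a compare-and-swap on the g's status word. The following minimal sketch walks the common non-GC path; toyG and casStatus are hypothetical, the status numbers are arbitrary, and unlike the runtime's casgstatus this version simply returns false on a mismatch instead of spinning or throwing.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical status values: the names follow the diagram, the numbers are arbitrary.
const (
	gIdle uint32 = iota
	gRunnable
	gRunning
	gSyscall
	gWaiting
	gDead
)

// toyG carries only the status word of a goroutine.
type toyG struct{ status uint32 }

// casStatus moves g from old to next only if it is currently in old.
func (g *toyG) casStatus(old, next uint32) bool {
	return atomic.CompareAndSwapUint32(&g.status, old, next)
}

func main() {
	g := &toyG{status: gDead}
	fmt.Println("newproc1 Gdead->Grunnable:", g.casStatus(gDead, gRunnable))
	fmt.Println("execute Grunnable->Grunning:", g.casStatus(gRunnable, gRunning))
	fmt.Println("park_m Grunning->Gwaiting:", g.casStatus(gRunning, gWaiting))
	fmt.Println("ready Gwaiting->Grunnable:", g.casStatus(gWaiting, gRunnable))
	// The g is Grunnable now, so a transition that expects Gwaiting fails.
	fmt.Println("stale Gwaiting->Grunning:", g.casStatus(gWaiting, gRunning))
}
```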
p state transitions
graph LR
Pidle --> |acquirep1|Prunning
Psyscall --> |retake|Pidle
Psyscall --> |entersyscall_gcwait|Pgcstop
Psyscall --> |exitsyscallfast|Prunning
Pany --> |gcstopm|Pgcstop
Pany --> |forEachP|Pidle
Pany --> |releasep|Pidle
Pany --> |handoffp|Pgcstop
Pany --> |procresize release current p use allp 0|Pidle
Pany --> |procresize when init|Pgcstop
Pany --> |procresize when free old p| Pdead
Pany --> |procresize after resize use current p|Prunning
Pany --> |reentersyscall|Psyscall
Pany --> |stopTheWorldWithSema|Pgcstop
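Preemption flow
A function executes on its goroutine's stack, and that stack can overflow while the function runs. As we saw earlier, a function that uses the stack compares stackguard0 with the SP register. Because the stack grows from high addresses toward low addresses, SP <= stackguard0 means the stack has grown to the point where it is about to overflow.
So where is this comparison done? It is inserted by the compiler. Let's look at the compiled output of a function; this listing comes from go-internals: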
0x0000 TEXT "".main(SB), $24-0
;; stack-split prologue
0x0000 MOVQ (TLS), CX
0x0009 CMPQ SP, 16(CX)
0x000d JLS 58
0x000f SUBQ $24, SP
0x0013 MOVQ BP, 16(SP)
0x0018 LEAQ 16(SP), BP
;; ...omitted FUNCDATA stuff...
0x001d MOVQ $137438953482, AX
0x0027 MOVQ AX, (SP)
;; ...omitted PCDATA stuff...
0x002b CALL "".add(SB)
0x0030 MOVQ 16(SP), BP
0x0035 ADDQ $24, SP
0x0039 RET
;; stack-split epilogue
0x003a NOP
;; ...omitted PCDATA stuff...
0x003a CALL runtime.morestack_noctxt(SB)
0x003f JMP 0
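The instructions inserted at the start of the function compare the stackguard in the g struct with the SP register. JLS (jump if lower or same, unsigned) jumps when SP <= 16(CX).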
;; stack-split prologue
0x0000 MOVQ (TLS), CX
0x0009 CMPQ SP, 16(CX)
0x000d JLS 58
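Here CX holds the starting address of g, and 16(CX) refers to the location 16 bytes into the g struct. Looking back at the definition of g, 16 bytes is exactly the size of the first member, stack (two 8-byte words), so offset 16 is where stackguard0 lives.
58 in hexadecimal is 0x3a, the address of the stack-split epilogue: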
;; stack-split epilogue
0x003a NOP
;; ...omitted PCDATA stuff...
0x003a CALL runtime.morestack_noctxt(SB)
0x003f JMP 0
morestack_noctxt:
// morestack but not preserving ctxt.
TEXT runtime·morestack_noctxt(SB),NOSPLIT,$0
MOVL $0, DX
JMP runtime·morestack(SB)
morestack:
TEXT runtime·morestack(SB),NOSPLIT,$0-0
// Cannot grow scheduler stack (m->g0).
get_tls(CX)
MOVQ g(CX), BX
MOVQ g_m(BX), BX
MOVQ m_g0(BX), SI
CMPQ g(CX), SI
JNE 3(PC)
CALL runtime·badmorestackg0(SB)
INT $3
// Cannot grow signal stack (m->gsignal).
MOVQ m_gsignal(BX), SI
CMPQ g(CX), SI
JNE 3(PC)
CALL runtime·badmorestackgsignal(SB)
INT $3
// Called from f.
// Set m->morebuf to f's caller.
MOVQ 8(SP), AX // f's caller's PC
MOVQ AX, (m_morebuf+gobuf_pc)(BX)
LEAQ 16(SP), AX // f's caller's SP
MOVQ AX, (m_morebuf+gobuf_sp)(BX)
get_tls(CX)
MOVQ g(CX), SI
MOVQ SI, (m_morebuf+gobuf_g)(BX)
// Set g->sched to context in f.
MOVQ 0(SP), AX // f's PC
MOVQ AX, (g_sched+gobuf_pc)(SI)
MOVQ SI, (g_sched+gobuf_g)(SI)
LEAQ 8(SP), AX // f's SP
MOVQ AX, (g_sched+gobuf_sp)(SI)
MOVQ BP, (g_sched+gobuf_bp)(SI)
MOVQ DX, (g_sched+gobuf_ctxt)(SI)
// Call newstack on m->g0's stack.
MOVQ m_g0(BX), BX
MOVQ BX, g(CX)
MOVQ (g_sched+gobuf_sp)(BX), SP
CALL runtime·newstack(SB)
MOVQ $0, 0x1003 // crash if newstack returns
RET
newstack:
// Called from runtime·morestack when more stack is needed.
// Allocate larger stack and relocate to new stack.
// Stack growth is multiplicative, for constant amortized cost.
//
// g->atomicstatus will be Grunning or Gscanrunning upon entry.
// If the GC is trying to stop this g then it will set preemptscan to true.
//
// This must be nowritebarrierrec because it can be called as part of
// stack growth from other nowritebarrierrec functions, but the
// compiler doesn't check this.
//
//go:nowritebarrierrec
func newstack() {
thisg := getg()
// TODO: double check all gp. shouldn't be getg().
if thisg.m.morebuf.g.ptr().stackguard0 == stackFork {
throw("stack growth after fork")
}
if thisg.m.morebuf.g.ptr() != thisg.m.curg {
print("runtime: newstack called from g=", hex(thisg.m.morebuf.g), "\n"+"\tm=", thisg.m, " m->curg=", thisg.m.curg, " m->g0=", thisg.m.g0, " m->gsignal=", thisg.m.gsignal, "\n")
morebuf := thisg.m.morebuf
traceback(morebuf.pc, morebuf.sp, morebuf.lr, morebuf.g.ptr())
throw("runtime: wrong goroutine in newstack")
}
gp := thisg.m.curg
if thisg.m.curg.throwsplit {
// Update syscallsp, syscallpc in case traceback uses them.
morebuf := thisg.m.morebuf
gp.syscallsp = morebuf.sp
gp.syscallpc = morebuf.pc
pcname, pcoff := "(unknown)", uintptr(0)
f := findfunc(gp.sched.pc)
if f.valid() {
pcname = funcname(f)
pcoff = gp.sched.pc - f.entry
}
print("runtime: newstack at ", pcname, "+", hex(pcoff),
" sp=", hex(gp.sched.sp), " stack=[", hex(gp.stack.lo), ", ", hex(gp.stack.hi), "]\n",
"\tmorebuf={pc:", hex(morebuf.pc), " sp:", hex(morebuf.sp), " lr:", hex(morebuf.lr), "}\n",
"\tsched={pc:", hex(gp.sched.pc), " sp:", hex(gp.sched.sp), " lr:", hex(gp.sched.lr), " ctxt:", gp.sched.ctxt, "}\n")
thisg.m.traceback = 2 // Include runtime frames
traceback(morebuf.pc, morebuf.sp, morebuf.lr, gp)
throw("runtime: stack split at bad time")
}
morebuf := thisg.m.morebuf
thisg.m.morebuf.pc = 0
thisg.m.morebuf.lr = 0
thisg.m.morebuf.sp = 0
thisg.m.morebuf.g = 0
// NOTE: stackguard0 may change underfoot, if another thread
// is about to try to preempt gp. Read it just once and use that same
// value now and below.
preempt := atomic.Loaduintptr(&gp.stackguard0) == stackPreempt
// Be conservative about where we preempt.
// We are interested in preempting user Go code, not runtime code.
// If we're holding locks, mallocing, or preemption is disabled, don't
// preempt.
// This check is very early in newstack so that even the status change
// from Grunning to Gwaiting and back doesn't happen in this case.
// That status change by itself can be viewed as a small preemption,
// because the GC might change Gwaiting to Gscanwaiting, and then
// this goroutine has to wait for the GC to finish before continuing.
// If the GC is in some way dependent on this goroutine (for example,
// it needs a lock held by the goroutine), that small preemption turns
// into a real deadlock.
if preempt {
if thisg.m.locks != 0 || thisg.m.mallocing != 0 || thisg.m.preemptoff != "" || thisg.m.p.ptr().status != _Prunning {
// Let the goroutine keep running for now.
// gp->preempt is set, so it will be preempted next time.
gp.stackguard0 = gp.stack.lo + _StackGuard
gogo(&gp.sched) // never return
}
}
if gp.stack.lo == 0 {
throw("missing stack in newstack")
}
sp := gp.sched.sp
if sys.ArchFamily == sys.AMD64 || sys.ArchFamily == sys.I386 {
// The call to morestack cost a word.
sp -= sys.PtrSize
}
if stackDebug >= 1 || sp < gp.stack.lo {
print("runtime: newstack sp=", hex(sp), " stack=[", hex(gp.stack.lo), ", ", hex(gp.stack.hi), "]\n",
"\tmorebuf={pc:", hex(morebuf.pc), " sp:", hex(morebuf.sp), " lr:", hex(morebuf.lr), "}\n",
"\tsched={pc:", hex(gp.sched.pc), " sp:", hex(gp.sched.sp), " lr:", hex(gp.sched.lr), " ctxt:", gp.sched.ctxt, "}\n")
}
if sp < gp.stack.lo {
print("runtime: gp=", gp, ", gp->status=", hex(readgstatus(gp)), "\n ")
print("runtime: split stack overflow: ", hex(sp), " < ", hex(gp.stack.lo), "\n")
throw("runtime: split stack overflow")
}
if preempt {
if gp == thisg.m.g0 {
throw("runtime: preempt g0")
}
if thisg.m.p == 0 && thisg.m.locks == 0 {
throw("runtime: g is running but p is not")
}
// Synchronize with scang.
casgstatus(gp, _Grunning, _Gwaiting)
if gp.preemptscan {
for !castogscanstatus(gp, _Gwaiting, _Gscanwaiting) {
// Likely to be racing with the GC as
// it sees a _Gwaiting and does the
// stack scan. If so, gcworkdone will
// be set and gcphasework will simply
// return.
}
if !gp.gcscandone {
// gcw is safe because we're on the
// system stack.
gcw := &gp.m.p.ptr().gcw
scanstack(gp, gcw)
if gcBlackenPromptly {
gcw.dispose()
}
gp.gcscandone = true
}
gp.preemptscan = false
gp.preempt = false
casfrom_Gscanstatus(gp, _Gscanwaiting, _Gwaiting)
// This clears gcscanvalid.
casgstatus(gp, _Gwaiting, _Grunning)
gp.stackguard0 = gp.stack.lo + _StackGuard
gogo(&gp.sched) // never return
}
// Act like goroutine called runtime.Gosched.
casgstatus(gp, _Gwaiting, _Grunning)
gopreempt_m(gp) // never return
}
// Allocate a bigger segment and move the stack.
oldsize := gp.stack.hi - gp.stack.lo
newsize := oldsize * 2
if newsize > maxstacksize {
print("runtime: goroutine stack exceeds ", maxstacksize, "-byte limit\n")
throw("stack overflow")
}
// The goroutine must be executing in order to call newstack,
// so it must be Grunning (or Gscanrunning).
casgstatus(gp, _Grunning, _Gcopystack)
// The concurrent GC will not scan the stack while we are doing the copy since
// the gp is in a Gcopystack status.
copystack(gp, newsize, true)
if stackDebug >= 1 {
print("stack grow done\n")
}
casgstatus(gp, _Gcopystack, _Grunning)
gogo(&gp.sched)
}
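The growth step at the end of newstack doubles the stack and throws once the doubled size would exceed maxstacksize. A small arithmetic sketch shows how few doublings that allows. The starting size of 2048 bytes and the limit of 1000*1000*1000 bytes are assumptions about a 64-bit build of that era, and growths is a hypothetical helper.

```go
package main

import "fmt"

// growths counts how many times a stack of size start can be doubled
// before the doubled size would exceed max, mirroring the
// "newsize > maxstacksize" check in newstack.
func growths(start, max uint64) (n int, final uint64) {
	size := start
	for size*2 <= max {
		size *= 2
		n++
	}
	return n, size
}

func main() {
	// Assumed values: 2 KiB initial stack, 1e9-byte limit.
	n, final := growths(2048, 1000*1000*1000)
	fmt.Println("successful doublings:", n)
	fmt.Println("largest stack size in bytes:", final)
}
```

Because growth is multiplicative, the total copying cost stays proportional to the final stack size, which is the "constant amortized cost" mentioned in the comment above newstack.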
To summarize the flow:
graph TD
start[entering func] --> cmp[sp < stackguard0]
cmp --> |yes| morestack_noctxt
cmp --> |no|final[execute func]
morestack_noctxt --> morestack
morestack --> newstack
newstack --> preempt
Preemption always happens in newstack, but the preemption mark itself is set elsewhere in the Go source.
Let's see where stackPreempt gets assigned to stackguard0:
graph LR
unlock --> |in case cleared in newstack|restorePreempt
ready --> |in case cleared in newstack|restorePreempt
startTheWorldWithSema --> |in case cleared in newstack|restorePreempt
allocm --> |in case cleared in newstack|restorePreempt
exitsyscall --> |in case cleared in newstack|restorePreempt
newproc1--> |in case cleared in newstack|restorePreempt
releasem --> |in case cleared in newstack|restorePreempt
scang --> setPreempt
reentersyscall --> setPreempt
entersyscallblock --> setPreempt
preemptone--> setPreempt
enlistWorker --> preemptone
retake --> preemptone
preemptall --> preemptone
freezetheworld --> preemptall
stopTheWorldWithSema --> preemptall
forEachP --> preemptall
startpanic_m --> freezetheworld
gcMarkDone --> forEachP
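As you can see, only GC and retake actually preempt a g; there are no other entry points. The other places merely restore a preemption mark that may have been cleared in newstack.
entersyscall and entersyscallblock are special cases, though. Both set the preemption mark in their implementation, but that path is never actually taken: the syscall runs on the m's g0 stack, and if it were preempted while executing, the runtime would throw directly with no way to recover.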
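The reason a preemption mark works at all is that the function prologue's comparison is reused: writing a huge sentinel into stackguard0 makes "SP <= stackguard0" true for any real stack pointer, so the next function call falls into morestack and then newstack. The sketch below models only that comparison; the constant values and the name prologueTakesSlowPath are assumptions for illustration.

```go
package main

import "fmt"

// Assumed constants for illustration only.
const (
	stackGuard   = 880                // hypothetical guard size in bytes
	stackPreempt = ^uintptr(0) - 1313 // sentinel larger than any real SP
)

// prologueTakesSlowPath mirrors the CMPQ/JLS pair in the prologue:
// the jump to morestack is taken when sp <= stackguard0.
func prologueTakesSlowPath(sp, stackguard0 uintptr) bool {
	return sp <= stackguard0
}

func main() {
	lo := uintptr(0x1000) // pretend lower bound of the stack
	sp := lo + 4096       // plenty of room left above the guard

	fmt.Println("normal guard:", prologueTakesSlowPath(sp, lo+stackGuard))
	fmt.Println("preempt mark:", prologueTakesSlowPath(sp, stackPreempt))
}
```

newstack then tells the two cases apart by checking whether stackguard0 equals stackPreempt, as in the preempt variable shown in its source above.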
Source: http://xargin.com/go-scheduler/