
proto: optimize global (un)marshal lock using RWMutex #1004

Closed
wants to merge 2 commits

Conversation

@TennyZhuang commented Dec 25, 2019

Signed-off-by: TennyZhuang <[email protected]>

This PR uses an RWMutex to optimize getMarshalInfo and getUnmarshalInfo. For these functions, only n calls (one per message type) hit the write path, while the remaining m - n calls (where m is the number of messages) hit the read path, which is the ideal case for an RWMutex instead of a Mutex.
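
For illustration, here is a minimal sketch of that read-mostly locking pattern, including the write-lock re-check discussed later in the thread. The names (typeInfo, infoMu, infoMap, computeInfo) are hypothetical placeholders, not the actual package internals:

package rwcache

import (
    "reflect"
    "sync"
)

// typeInfo is a hypothetical placeholder for the per-type marshal metadata.
type typeInfo struct{}

var (
    infoMu  sync.RWMutex
    infoMap = map[reflect.Type]*typeInfo{}
)

// getInfo returns the cached info for t, computing it at most once per type.
// The common case (a type that has already been seen) takes only the read
// lock; the write lock is taken roughly once per distinct message type.
func getInfo(t reflect.Type) *typeInfo {
    infoMu.RLock()
    info, ok := infoMap[t]
    infoMu.RUnlock()
    if ok {
        return info
    }

    infoMu.Lock()
    defer infoMu.Unlock()
    // Re-check under the write lock: another goroutine may have inserted
    // the entry between RUnlock and Lock.
    if info, ok := infoMap[t]; ok {
        return info
    }
    info = computeInfo(t) // expensive one-time initialization
    infoMap[t] = info
    return info
}

// computeInfo stands in for the expensive per-type computation.
func computeInfo(t reflect.Type) *typeInfo {
    return &typeInfo{}
}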

This optimization introduces a huge improvement in our scenario.

We have 1000 workers and 1 controller, and the workers and controller keep heartbeats over gRPC. They also exchange job info with each other.

The messages look like this:

message Job {
    uint64 job_id = 1;
    // Some other info
}

message Heartbeat {
    repeated Job jobs = 1;
}

There are about 10000 jobs in every Heartbeat, and the heartbeat QPS at the controller is about 1000.

The controller handles a Heartbeat in about 10ms and the network latency is about 10ms, but the client can take up to about 30s to finish an RPC call.

We used the Go pprof block profile, and it seems that almost all blocking is caused by a single global Mutex in the protobuf package.
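
For reference, one way to collect such a block profile (a minimal sketch, not necessarily the exact setup we used) is to set a block profile rate and expose the net/http/pprof endpoints:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers
    "runtime"
)

func main() {
    // Record every blocking event (mutex waits, channel waits) so lock
    // contention shows up in the block profile. Rate 1 has measurable
    // overhead; use a larger sampling rate in production.
    runtime.SetBlockProfileRate(1)

    // The profile can then be inspected with:
    //   go tool pprof http://localhost:6060/debug/pprof/block
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}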

[pprof block profile screenshot]

After the optimization, in our use case, the RPC call from the client takes only about 30ms, as expected.

@TennyZhuang
Author

@dsnet PTAL

@puellanivis (Collaborator) left a comment

The most computationally taxing thing here is acquiring the write lock. Rechecking your critical conditions adds insignificant overhead, and it prevents the overhead of redundant reinitialization.

Eliding the recheck of the critical conditions not only goes against good locking hygiene, it is at best a micro-optimization, which I guarantee is not saving more than a few nanoseconds over an entire run of the program. The code under the write lock is not hot-loop code that needs to be micro-optimized.

proto/table_marshal.go (review thread, outdated)
proto/table_unmarshal.go (review thread, outdated)
Signed-off-by: TennyZhuang <[email protected]>
@TennyZhuang
Author

OK, I've added a double check.

@puellanivis

@TennyZhuang
Author

A simple reproduction can be found at https://github.com/TennyZhuang/protobuf-lock-reproduce

The PR also resolves #888.

@puellanivis (Collaborator) left a comment

Looks good now.

@TennyZhuang
Author

Any more reviewers?

@dsnet
Member

dsnet commented Jan 8, 2020

getMarshalInfo is expected to be costly, but is only computed once per type in the program. After a program has been running for some time, it is no longer called. This PR is suggesting a solution in a situation where the root problem itself is not well understood.

@TennyZhuang
Author

TennyZhuang commented Jan 8, 2020

@dsnet In our scenario, I added a counter log at the beginning of getMarshalInfo and saw about 1000000 calls in 1 second. Is there some other bug causing the function to be called so many times?
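
(The counter was a temporary local patch; a hypothetical sketch of that kind of instrumentation, with made-up names, is below. A similar atomic counter at the top of getMarshalInfo gives the per-second call count mentioned above.)

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

// calls counts invocations of the instrumented function.
var calls uint64

func instrumentedGetMarshalInfo() {
    atomic.AddUint64(&calls, 1)
    // ... original getMarshalInfo body ...
}

func main() {
    // Simulate a caller hammering the function, then report the count.
    go func() {
        for {
            instrumentedGetMarshalInfo()
        }
    }()
    time.Sleep(time.Second)
    fmt.Println("calls in 1s:", atomic.LoadUint64(&calls))
}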

@dsnet
Member

dsnet commented Jan 8, 2020

Is there some other bug causing the function to be called so many times?

Possibly? That's the question that is more interesting to figure out. This code here:

u := atomicLoadMarshalInfo(&a.marshal)
if u == nil {
    // Get marshal information from type of message.
    t := reflect.ValueOf(msg).Type()
    if t.Kind() != reflect.Ptr {
        panic(fmt.Sprintf("cannot handle non-pointer message type %v", t))
    }
    u = getMarshalInfo(t.Elem())
    // Store it in the cache for later users.
    // a.marshal = u, but atomically.
    atomicStoreMarshalInfo(&a.marshal, u)
}
return u

atomically caches the computed marshalInfo, so it shouldn't happen again and again.

@dsnet
Member

dsnet commented Jan 8, 2020

Have you used the most recent protoc-gen-go to generate your .pb.go files? The pprof graph you show shouldn't happen with newly generated .pb.go files.

@TennyZhuang
Author

I reproduced it in https://github.com/TennyZhuang/protobuf-lock-reproduce (very high latency), using protoc-gen-go 1.3.2.

I will try to look into it later.

@TennyZhuang
Author

TennyZhuang commented Jan 8, 2020

Sorry, the reproduction demo is not correct; I will try to create a correct reproduction case.

@TennyZhuang
Author

Sorry, this is caused by a gogo/protobuf bug: gogo/protobuf#656

@dsnet
Member

dsnet commented Jan 12, 2020

Got it. I'm going to close this then. In v2, we use a sync.Map as the cache for the equivalent code, which scales much better as it reaches steady state.
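
For comparison, here is a rough sketch of a sync.Map-based type cache in the spirit of what is described above; the names are illustrative, not the actual v2 internals:

package cache

import (
    "reflect"
    "sync"
)

// messageInfo is a hypothetical stand-in for the per-type metadata.
type messageInfo struct{}

// infos caches one messageInfo per message type. Once the cache reaches
// steady state, lookups for already-seen types do not contend on a lock.
var infos sync.Map // reflect.Type -> *messageInfo

func infoFor(t reflect.Type) *messageInfo {
    if v, ok := infos.Load(t); ok {
        return v.(*messageInfo)
    }
    // LoadOrStore keeps the first value stored for t, so concurrent
    // callers converge on a single cached entry.
    v, _ := infos.LoadOrStore(t, newMessageInfo(t))
    return v.(*messageInfo)
}

// newMessageInfo stands in for the expensive per-type computation.
func newMessageInfo(t reflect.Type) *messageInfo {
    return &messageInfo{}
}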

@dsnet closed this Jan 12, 2020
@golang locked and limited the conversation to collaborators Jul 23, 2020