brpc Start耗时10秒问题定位分析

一、问题背景

近日在分析类生产环境在进行在线升级时,IO归零时间长问题时,经过分析定位把问题锁定在brpc Start的时候的问题,发现进程启动时调用brpc Start函数时耗时超过10秒导致进程启动慢了。

问题原因

brpc的Start代码中这个日志中的butil::my_hostname() 可能触发 DNS反向解析,而某些环境下的DNS服务器响应缓慢会导致阻塞。默认先查 /etc/hosts ,如果/etc/hosts中没有,会再查 DNS。

二、问题解决过程

锁定是brpc Start的问题后,给brpc代码中添加日志,以确定到底是Start中哪个步骤慢:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
int Server::StartInternal(const butil::EndPoint& endpoint,

                          const PortRange& port_range,

                          const ServerOptions *opt) {

    butil::Timer total_timer;

    total_timer.start();

    std::unique_ptr<Server, RevertServerStatus> revert_server(this);

    if (_failed_to_set_max_concurrency_of_method) {

        _failed_to_set_max_concurrency_of_method = false;

        LOG(ERROR) << "previous call to MaxConcurrencyOf() was failed, "

            "fix it before starting server";

        return -1;

    }

    butil::Timer phase_timer;

    phase_timer.start();

    if (InitializeOnce() != 0) {

        LOG(ERROR) << "Fail to initialize Server[" << version() << ']';

        return -1;

    }

    LOG(INFO) << "InitializeOnce took " << phase_timer.m_elapsed() << "ms";

    const Status st = status();

    if (st != READY) {

        if (st == RUNNING) {

            LOG(ERROR) << "Server[" << version() << "] is already running on "

                       << _listen_addr;

        } else {

            LOG(ERROR) << "Can't start Server[" << version()

                       << "] which is " << status_str(status());

        }

        return -1;

    }

    if (opt) {

        _options = *opt;

    } else {

        // Always reset to default options explicitly since `_options'

        // may be the options for the last run or even bad options

        _options = ServerOptions();

    }

    if (!_options.h2_settings.IsValid(true/*log_error*/)) {

        LOG(ERROR) << "Invalid h2_settings";

        return -1;

    }

    if (_options.use_rdma) {

#if BRPC_WITH_RDMA

        if (!OptionsAvailableOverRdma(&_options)) {

            return -1;

        }

        rdma::GlobalRdmaInitializeOrDie();

#else

        LOG(WARNING) << "Cannot use rdma since brpc does not compile with rdma";

        return -1;

#endif

    }

    if (_options.http_master_service) {

        // Check requirements for http_master_service:

        //  has "default_method" & request/response have no fields

        const google::protobuf::ServiceDescriptor* sd =

            _options.http_master_service->GetDescriptor();

        const google::protobuf::MethodDescriptor* md =

            sd->FindMethodByName("default_method");

        if (md == NULL) {

            LOG(ERROR) << "http_master_service must have a method named `default_method'";

            return -1;

        }

        if (md->input_type()->field_count() != 0) {

            LOG(ERROR) << "The request type of http_master_service must have "

                "no fields, actually " << md->input_type()->field_count();

            return -1;

        }

        if (md->output_type()->field_count() != 0) {

            LOG(ERROR) << "The response type of http_master_service must have "

                "no fields, actually " << md->output_type()->field_count();

            return -1;

        }

    }

    // CAUTION:

    //   Following code may run multiple times if this server is started and

    //   stopped more than once. Reuse or delete previous resources!

    phase_timer.start();

    if (_options.session_local_data_factory) {

        if (_session_local_data_pool == NULL) {

            _session_local_data_pool =

                new (std::nothrow) SimpleDataPool(_options.session_local_data_factory);

            if (NULL == _session_local_data_pool) {

                LOG(ERROR) << "Fail to new SimpleDataPool";

                return -1;

            }

        } else {

            _session_local_data_pool->Reset(_options.session_local_data_factory);

        }

        _session_local_data_pool->Reserve(_options.reserved_session_local_data);

    }

    // Init _keytable_pool always. If the server was stopped before, the pool

    // should be destroyed in Join().

    _keytable_pool = new bthread_keytable_pool_t;

    if (bthread_keytable_pool_init(_keytable_pool) != 0) {

        LOG(ERROR) << "Fail to init _keytable_pool";

        delete _keytable_pool;

        _keytable_pool = NULL;

        return -1;

    }

    if (_options.thread_local_data_factory) {

        _tl_options.thread_local_data_factory = _options.thread_local_data_factory;

        if (bthread_key_create2(&_tl_options.tls_key, DestroyServerTLS,

                                _options.thread_local_data_factory) != 0) {

            LOG(ERROR) << "Fail to create thread-local key";

            return -1;

        }

        if (_options.reserved_thread_local_data) {

            bthread_keytable_pool_reserve(_keytable_pool,

                                          _options.reserved_thread_local_data,

                                          _tl_options.tls_key,

                                          CreateServerTLS,

                                          _options.thread_local_data_factory);

        }

    } else {

        _tl_options = ThreadLocalOptions();

    }

    if (_options.bthread_init_count != 0 &&

        _options.bthread_init_fn != NULL) {

        // Create some special bthreads to call the init functions. The

        // bthreads will not quit until all bthreads finish the init function.

        BthreadInitArgs* init_args

            = new BthreadInitArgs[_options.bthread_init_count];

        size_t ncreated = 0;

        for (size_t i = 0; i < _options.bthread_init_count; ++i, ++ncreated) {

            init_args[i].bthread_init_fn = _options.bthread_init_fn;

            init_args[i].bthread_init_args = _options.bthread_init_args;

            init_args[i].result = false;

            init_args[i].done = false;

            init_args[i].stop = false;

            bthread_attr_t tmp = BTHREAD_ATTR_NORMAL;

            tmp.keytable_pool = _keytable_pool;

            if (bthread_start_background(

                    &init_args[i].th, &tmp, BthreadInitEntry, &init_args[i]) != 0) {

                break;

            }

        }

        // Wait until all created bthreads finish the init function.

        for (size_t i = 0; i < ncreated; ++i) {

            while (!init_args[i].done) {

                bthread_usleep(1000);

            }

        }

        // Stop and join created bthreads.

        for (size_t i = 0; i < ncreated; ++i) {

            init_args[i].stop = true;

        }

        for (size_t i = 0; i < ncreated; ++i) {

            bthread_join(init_args[i].th, NULL);

        }

        size_t num_failed_result = 0;

        for (size_t i = 0; i < ncreated; ++i) {

            if (!init_args[i].result) {

                ++num_failed_result;

            }

        }

        delete [] init_args;

        if (ncreated != _options.bthread_init_count) {

            LOG(ERROR) << "Fail to create "

                       << _options.bthread_init_count - ncreated << " bthreads";

            return -1;

        }

        if (num_failed_result != 0) {

            LOG(ERROR) << num_failed_result << " bthread_init_fn failed";

            return -1;

        }

    }

    LOG(INFO) << "Initial bthread etc... took " << phase_timer.m_elapsed() << "ms";

    phase_timer.start();

    // Free last SSL contexts

    FreeSSLContexts();

    if (_options.has_ssl_options()) {

        CertInfo& default_cert = _options.mutable_ssl_options()->default_cert;

        if (default_cert.certificate.empty()) {

            LOG(ERROR) << "default_cert is empty";

            return -1;

        }

        if (AddCertificate(default_cert) != 0) {

            return -1;

        }

        _default_ssl_ctx = _ssl_ctx_map.begin()->second.ctx;

        const std::vector<CertInfo>& certs = _options.mutable_ssl_options()->certs;

        for (size_t i = 0; i < certs.size(); ++i) {

            if (AddCertificate(certs[i]) != 0) {

                return -1;

            }

        }

    }

    _concurrency = 0;

    if (_options.has_builtin_services &&

        _builtin_service_count <= 0 &&

        AddBuiltinServices() != 0) {

        LOG(ERROR) << "Fail to add builtin services";

        return -1;

    }

    // If a server is started/stopped for mutiple times and one of the options

    // sets has_builtin_service to true, builtin services will be enabled for

    // any later re-start. Check this case and report to user.

    if (!_options.has_builtin_services && _builtin_service_count > 0) {

        LOG(ERROR) << "A server started/stopped for multiple times must be "

            "consistent on ServerOptions.has_builtin_services";

        return -1;

    }

    LOG(INFO) << "SSL init took " << phase_timer.m_elapsed() << "ms";

    // Prepare all restful maps

    for (ServiceMap::const_iterator it = _fullname_service_map.begin();

         it != _fullname_service_map.end(); ++it) {

        if (it->second.restful_map) {

            it->second.restful_map->PrepareForFinding();

        }

    }

    if (_global_restful_map) {

        _global_restful_map->PrepareForFinding();

    }

    if (_options.num_threads > 0) {

        if (FLAGS_usercode_in_pthread) {

            _options.num_threads += FLAGS_usercode_backup_threads;

        }

        if (_options.num_threads < BTHREAD_MIN_CONCURRENCY) {

            _options.num_threads = BTHREAD_MIN_CONCURRENCY;

        }

        bthread_setconcurrency(_options.num_threads);

    }

    for (MethodMap::iterator it = _method_map.begin();

        it != _method_map.end(); ++it) {

        if (it->second.is_builtin_service) {

            it->second.status->SetConcurrencyLimiter(NULL);

        } else {

            const AdaptiveMaxConcurrency* amc = &it->second.max_concurrency;

            if (amc->type() == AdaptiveMaxConcurrency::UNLIMITED()) {

                amc = &_options.method_max_concurrency;

            }

            ConcurrencyLimiter* cl = NULL;

            if (!CreateConcurrencyLimiter(*amc, &cl)) {

                LOG(ERROR) << "Fail to create ConcurrencyLimiter for method";

                return -1;

            }

            it->second.status->SetConcurrencyLimiter(cl);

        }

    }

    phase_timer.start();

    // Create listening ports

    if (port_range.min_port > port_range.max_port) {

        LOG(ERROR) << "Invalid port_range=[" << port_range.min_port << '-'

                   << port_range.max_port << ']';

        return -1;

    }

    if (butil::is_endpoint_extended(endpoint) &&

            (port_range.min_port != endpoint.port || port_range.max_port != endpoint.port)) {

        LOG(ERROR) << "Only IPv4 address supports port range feature";

        return -1;

    }

    _listen_addr = endpoint;

    for (int port = port_range.min_port; port <= port_range.max_port; ++port) {

        _listen_addr.port = port;

        butil::fd_guard sockfd(tcp_listen(_listen_addr));

        if (sockfd < 0) {

            if (port != port_range.max_port) { // not the last port, try next

                continue;

            }

            if (port_range.min_port != port_range.max_port) {

                LOG(ERROR) << "Fail to listen " << _listen_addr.ip

                           << ":[" << port_range.min_port << '-'

                           << port_range.max_port << ']';

            } else {

                LOG(ERROR) << "Fail to listen " << _listen_addr;

            }

            return -1;

        }

        if (_listen_addr.port == 0) {

            // port=0 makes kernel dynamically select a port from

            // https://en.wikipedia.org/wiki/Ephemeral_port

            _listen_addr.port = get_port_from_fd(sockfd);

            if (_listen_addr.port <= 0) {

                LOG(ERROR) << "Fail to get port from fd=" << sockfd;

                return -1;

            }

        }

        if (_am == NULL) {

            _am = BuildAcceptor();

            if (NULL == _am) {

                LOG(ERROR) << "Fail to build acceptor";

                return -1;

            }

            _am->_use_rdma = _options.use_rdma;

        }

        // Set `_status' to RUNNING before accepting connections

        // to prevent requests being rejected as ELOGOFF

        _status = RUNNING;

        time(&_last_start_time);

        GenerateVersionIfNeeded();

        g_running_server_count.fetch_add(1, butil::memory_order_relaxed);

        // Pass ownership of `sockfd' to `_am'

        if (_am->StartAccept(sockfd, _options.idle_timeout_sec,

                             _default_ssl_ctx) != 0) {

            LOG(ERROR) << "Fail to start acceptor";

            return -1;

        }

        sockfd.release();

        break; // stop trying

    }

    LOG(INFO) << "Main port bind took " << phase_timer.m_elapsed() << "ms";

    phase_timer.start();

    if (_options.internal_port >= 0 && _options.has_builtin_services) {

        if (_options.internal_port  == _listen_addr.port) {

            LOG(ERROR) << "ServerOptions.internal_port=" << _options.internal_port

                       << " is same with port=" << _listen_addr.port << " to Start()";

            return -1;

        }

        if (_options.internal_port == 0) {

            LOG(ERROR) << "ServerOptions.internal_port cannot be 0, which"

                " allocates a dynamic and probabaly unfiltered port,"

                " against the purpose of \"being internal\".";

            return -1;

        }

        if (butil::is_endpoint_extended(endpoint)) {

            LOG(ERROR) << "internal_port is available in IPv4 address only";

            return -1;

        }

        butil::EndPoint internal_point = _listen_addr;

        internal_point.port = _options.internal_port;

        butil::fd_guard sockfd(tcp_listen(internal_point));

        if (sockfd < 0) {

            LOG(ERROR) << "Fail to listen " << internal_point << " (internal)";

            return -1;

        }

        if (NULL == _internal_am) {

            _internal_am = BuildAcceptor();

            if (NULL == _internal_am) {

                LOG(ERROR) << "Fail to build internal acceptor";

                return -1;

            }

        }

        // Pass ownership of `sockfd' to `_internal_am'

        if (_internal_am->StartAccept(sockfd, _options.idle_timeout_sec,

                                      _default_ssl_ctx) != 0) {

            LOG(ERROR) << "Fail to start internal_acceptor";

            return -1;

        }

        sockfd.release();

        LOG(INFO) << "Internal port:"<< internal_point <<" bind took " << phase_timer.m_elapsed() << "ms";

    }

    LOG(INFO) << "Internal port bind finish " << phase_timer.m_elapsed() << "ms";

    PutPidFileIfNeeded();

    // Launch _derivative_thread.

    CHECK_EQ(INVALID_BTHREAD, _derivative_thread);

    if (bthread_start_background(&_derivative_thread, NULL,

                                 UpdateDerivedVars, this) != 0) {

        LOG(ERROR) << "Fail to create _derivative_thread";

        return -1;

    }

    // Print tips to server launcher.

    if (butil::is_endpoint_extended(_listen_addr)) {

        const char* builtin_msg = _options.has_builtin_services ? " with builtin service" : "";

        LOG(INFO) << "Server[" << version() << "] is serving on " << _listen_addr

                  << builtin_msg << '.';

        //TODO add TrackMe support

    } else {

        int http_port = _listen_addr.port;

        std::ostringstream server_info;

        server_info << "Server[" << version() << "] is serving on port="

                    << _listen_addr.port;

        if (_options.internal_port >= 0 && _options.has_builtin_services) {

            http_port = _options.internal_port;

            server_info << " and internal_port=" << _options.internal_port;

        }

        LOG(INFO) << server_info.str() << '.';

        if (_options.has_builtin_services) {

            LOG(INFO) << "Check out http://" << butil::my_hostname() << ':'

                    << http_port << " in web browser.";

        } else {

            LOG(WARNING) << "Builtin services are disabled according to "

                "ServerOptions.has_builtin_services";

        }

        // For trackme reporting

        SetTrackMeAddress(butil::EndPoint(butil::my_ip(), http_port));

    }

    revert_server.release();

    total_timer.stop();

    LOG(INFO) << "Total StartInternal took " << total_timer.m_elapsed() << "ms";

    return 0;

}

重新编译二进制后,运行并分析运行日志(其实没有加上述日志也能看出来耗时在哪,加上上述日志后更明显和明确问题在哪):

1
2
3
4
5
6
7
8
9
10
11
Apr 02 13:51:27 gz-kwe8-az1-eds-ssd-55e125e160e36 nohup[3426698]: I0402 13:51:27.787414 3426752 /root/code/gitcode/lava-engine/src/brpc/src/brpc/socket.cpp:2341] Checking Socket{id=0 addr=55.125.33.193:9210} (0x1c33e000)

Apr 02 13:51:27 gz-kwe8-az1-eds-ssd-55e125e160e36 nohup[3426698]: I0402 13:51:27.787891 3426749 /root/code/gitcode/lava-engine/src/brpc/src/brpc/socket.cpp:2341] Checking Socket{id=2 addr=55.125.33.195:9210} (0x1c33e480)

Apr 02 13:51:28 gz-kwe8-az1-eds-ssd-55e125e160e36 nohup[3426698]: I0402 13:51:28.788787 3426749 /root/code/gitcode/lava-engine/src/brpc/src/brpc/socket.cpp:2341] Checking Socket{id=115 addr=55.125.33.194:9399} (0x3db4c480)

Apr 02 13:51:28 gz-kwe8-az1-eds-ssd-55e125e160e36 nohup[3426698]: I0402 13:51:28.788787 3426752 /root/code/gitcode/lava-engine/src/brpc/src/brpc/socket.cpp:2341] Checking Socket{id=113 addr=55.125.33.193:9399} (0x3db4c000)

Apr 02 13:51:28 gz-kwe8-az1-eds-ssd-55e125e160e36 nohup[3426698]: I0402 13:51:28.872195 3426698 /root/code/gitcode/lava-engine/src/brpc/src/brpc/server.cpp:1113] Server[cstor::lava::kv::km::GCServerServiceImpl+cstor::lava::cluster::master::LavaMasterServiceImpl+cstor::lava::kv::km::SlotServerRegisterService+cstor::lava::common::hb::NMServiceImpl+cstor::lava::kv::km::KMClusterS>

Apr 02 13:51:38 gz-kwe8-az1-eds-ssd-55e125e160e36 nohup[3426698]: I0402 13:51:38.882870 3426698 /root/code/gitcode/lava-engine/src/brpc/src/brpc/server.cpp:1116] Check out http://gz-kwe8-az1-eds-ssd-55e125e160e36:9001 in web browser.

可以最后两条日志之间是隔了10秒钟,看代码其实之间就两三行代码,然后关键就是有一行是打印日志

1
2
3
4
5

LOG(INFO) << "Check out http://" << butil::my_hostname() << ':'

                    << http_port << " in web browser.";

butil::my_hostname(),这个函数可能触发 DNS反向解析,而某些环境下的DNS服务器响应缓慢会导致阻塞。默认先查 /etc/hosts ,如果/etc/hosts中没有,会再查 DNS。

DNS 配置检查

  • 查看 /etc/nsswitch.conf:
1
2
3

grep hosts /etc/nsswitch.conf

若配置为 hosts: files dns,表示会先查 /etc/hosts 再查 DNS。

  • 检查 /etc/hosts:
1
2
3

grep 55.125.33.193 /etc/hosts

若无对应条目,则可能触发 DNS 查询。

网络策略限制

DNS 端口阻断:

1
2
3
4
5

telnet 8.8.8.8 53  # 测试外部 DNS 可达性

telnet 内网DNS_IP 53  # 测试内网 DNS

MyAddressInfo 结构体构造函数逻辑

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

MyAddressInfo() {

    my_ip = IP_ANY;

    // Step1: 调用 gethostname() 获取主机名

    if (gethostname(my_hostname, sizeof(my_hostname) < 0) {

        my_hostname[0] = '\0';

    }

    // Step2: 尝试将主机名解析为IP(关键耗时点)

    else if (hostname2ip(my_hostname, &my_ip) != 0) {

        my_ip = IP_ANY;

    }

    // Step3: 将IP转为字符串

    my_ip_str = ip2str(my_ip);

}

关键函数行为

  • gethostname()

直接读取内核维护的主机名(如 gz-kwe8-az1-eds-ssd-55e125e160e35),耗时 0ms。

  • hostname2ip()

内部调用 getaddrinfo(),遵循 /etc/nsswitch.conf 的 hosts 配置,典型顺序为:

1
2
3

hosts: files dns  # 先查 /etc/hosts,失败再查DNS

现象解释

  1. 在 /etc/hosts 添加条目后,应用程序启动变快

原因:hostname2ip() 优先查询 /etc/hosts,成功解析到 IP,无需触发 DNS 查询。

执行路径:
my_hostname -> 查/etc/hosts -> 直接返回IP(耗时 0ms)

三、问题总结

排查的过程发现有两台不通的服务器比较典型,两台服务器的DNS服务器都不通:

一台ping是这个结果,有一定卡顿,这台启动就会耗时10秒

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

ping 114.114.114.114

PING 114.114.114.114 (114.114.114.114) 56(84) bytes of data.

From 55.125.163.254 icmp_seq=6 Destination Net Unreachable

From 55.125.163.254 icmp_seq=13 Destination Net Unreachable

From 55.125.163.254 icmp_seq=22 Destination Net Unreachable

From 55.125.163.254 icmp_seq=27 Destination Net Unreachable

From 55.125.163.254 icmp_seq=33 Destination Net Unreachable

From 55.125.163.254 icmp_seq=34 Destination Net Unreachable

另外一台结果是

1
2
3
4
5

 ping 114.114.114.114

ping: connect: Network is unreachable

该服务器立即返回该结果,且启动时很快,不存在耗时10秒问题。

使用strace命令进行分析:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
strace -e trace=open,read,connect getent hosts 10.24.171.3

read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\\\2\0\0\0\0\0"..., 832) = 832

read(3, "multi on\n", 4096)             = 9

read(3, "", 4096)                       = 0

read(3, "# Generated by NetworkManager\nse"..., 4096) = 101

read(3, "", 4096)                       = 0

connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)

connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)

read(3, "#\n# /etc/nsswitch.conf\n#\n# An ex"..., 4096) = 1516

read(3, "", 4096)                       = 0

read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0003\0\0\0\0\0\0"..., 832) = 832

read(3, "127.0.0.1   localhost localhost."..., 4096) = 312

read(3, "", 4096)                       = 0

read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320!\0\0\0\0\0\0"..., 832) = 832

read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220C\0\0\0\0\0\0"..., 832) = 832

connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("114.114.114.114")}, 16) = -1 ENETUNREACH (Network is unreachable)

connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.100.153")}, 16) = -1 ENETUNREACH (Network is unreachable)

connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("114.114.114.114")}, 16) = -1 ENETUNREACH (Network is unreachable)

connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.100.153")}, 16) = -1 ENETUNREACH (Network is unreachable)

read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\00004\0\0\0\0\0\0"..., 832) = 832

read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P|\0\0\0\0\0\0"..., 832) = 832

read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3402\0\0\0\0\0\0"..., 832) = 832

10.24.171.3     NMA04-303-H-21-A2P2-SEV-SC5212M7-02U05

10.24.171.129   NMA04-303-H-21-A2P2-SEV-SC5212M7-02U05

10.8.96.127     NMA04-303-H-21-A2P2-SEV-SC5212M7-02U05

+++ exited with 0 +++

使用该命令可以发现也很快返回结果,通过上述分析可知,其实两台服务器都是没有配置/etc/hosts,所以第一步在hosts配置文件中没有找到,然后触发DNS,但是第一台直接返回失败,就触发第3步,直接使用系统维护的hostname。

网络不通但 my_hostname() 快速返回的机制

  • gethostname() 不依赖网络

  • MyAddressInfo 构造函数中的 gethostname() 仅读取内核维护的主机名(如 NMA04-303-H-21-A2P2-SEV-SC5212M7-02U05),是纯本地操作,耗时 0ms。

  • hostname2ip() 走 /etc/hosts

若 /etc/hosts 中存在主机名到 IP 的正向映射(如 NMA04-… 10.24.171.3),解析直接命中文件,无需触发 DNS 查询。

/etc/nsswitch.conf 配置中 hosts 项包含 myhostname:

1
2
3

hosts: files dns myhostname

myhostname 模块的行为:

当 files(即 /etc/hosts)和 dns 均未找到匹配项时,myhostname 会尝试将系统主机名与本地 IP 关联。

自动生成解析规则:若系统主机名为 NMA04-303-H-21-A2P2-SEV-SC5212M7-02U05,且该机器有 IP 10.24.171.3,myhostname 会自动创建如下映射:

1
2
3

10.24.171.3     NMA04-303-H-21-A2P2-SEV-SC5212M7-02U05

可以调整顺序解决该问题

1
2
3
4
5

grep hosts /etc/nsswitch.conf

hosts:      files dns myhostname

1
2
3
4
5
6
7
8
9
10
11
12
13
time getent hosts 55.125.33.196

55.125.33.196   gz-kwe8-az1-eds-ssd-55e125e160e38

55.125.65.132   gz-kwe8-az1-eds-ssd-55e125e160e38

55.125.160.38   gz-kwe8-az1-eds-ssd-55e125e160e38

real 0m10.011s

user 0m0.000s

sys 0m0.001s

改为

1
hosts:      files myhostname dns

再执行发现

1
2
3
4
5
6
7
8
9
10
time getent hosts 55.125.33.196

55.125.33.196   gz-kwe8-az1-eds-ssd-55e125e160e38

55.125.65.132   gz-kwe8-az1-eds-ssd-55e125e160e38

55.125.160.38   gz-kwe8-az1-eds-ssd-55e125e160e38
real 0m0.001s
user 0m0.000s
sys 0m0.001s

耗时果然降到毫秒级别了。

最后,再简单总结一下:

  • 先看nsswitch.conf文件
1
grep hosts /etc/nsswitch.conf

如果结果是:

1
hosts:      files dns myhostname

表示获取ip-hostname路径是:

(1)/etc/hosts

(2)DNS服务

(3)使用myhostname

使用myhostname会尝试将系统主机名与本地 IP 关联。

自动生成解析规则:若系统主机名为 NMA04-303-H-21-A2P2-SEV-SC5212M7-02U05,且该机器有 IP 10.24.171.3,myhostname 会自动创建如下映射

1
10.24.171.3     NMA04-303-H-21-A2P2-SEV-SC5212M7-02U05

顺序是依次的,如果第一个方式就成功了就不会走到后面的方式。

其实出现说的10秒的原因是因为:

(1) /etc/hosts中没有配相应的ip-hostname

(2) DNS服务器无效(如果直接是Network is unreachable,直接就返回了,不会导致10秒问题,如果是Destination Net Unreachable,则需要等超时)

出现问题的环境ping DNS服务器出现的是:Destination Net Unreachable会等超时,造成10秒耗时。

服务器类型 网络状态 DNS 行为 解决方案
无默认路由 Network is unreachable 立即失败,快速回退 保持现状
有路由但不可达 Destination Net Unreachable 等待超时(10s) 删除路由、减少 DNS 超时或调整顺序或禁用DNS

最终结论:网络层配置差异是导致两台服务器 DNS 解析速度不同的根本原因。通过调整路由或 DNS 策略,可统一行为并消除延迟。

如果你觉得本文对你有帮助,欢迎打赏