Dial tcp: lookup i/o timeout

Feb 3, 2024

For confidentiality, I’ve replaced the real URLs and IP addresses in this post with placeholders.

behaviors

The Scroll bridge history API was upgraded to a new version that uses Redis to cache data. The Redis client we use is https://github.com/redis/go-redis. After the upgrade we ran into an interesting symptom.

After deploying to Sepolia, everything seemed to work. But when we called the bridge history v2 API:

curl 'http://xx.sepolia.scroll.tech/api/txsbyhashes' --data-raw '{"txs":["0x40a5ee05a8ef54363d7b13203d4174f665e7aec1bacfe234151b1cf8990c3929"]}' | jq

the logs filled up with ERROR entries:

dial tcp: lookup xx.amazonaws.com: i/o timeout

At first we suspected a bug in the bridge history v2 codebase.

So we debugged the code locally on a Mac, connected to the Sepolia Redis. There were no issues; it worked fine.

So I wrote a small program to test on the Sepolia VM:

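package main

import (
	"context"
	"crypto/tls"
	"fmt"

	"github.com/redis/go-redis/v9"
	"gorm.io/driver/postgres"
	"gorm.io/gorm"
)

func main() {
	// Addr, Username, Password and dsn are left blank for confidentiality.
	opts := &redis.Options{
		Addr:      "",
		Username:  "",
		Password:  "",
		TLSConfig: &tls.Config{MinVersion: tls.VersionTLS12},
	}

	redisClient := redis.NewClient(opts)
	ctx := context.Background()

	// Write a key and read it back; errors are printed rather than fatal
	// so that the Postgres check below still runs.
	err := redisClient.Set(ctx, "key", "key", 0).Err()
	if err != nil {
		fmt.Printf("redis set error:%v\n", err)
	}

	val, err := redisClient.Get(ctx, "key").Result()
	if err != nil {
		fmt.Printf("redis get error:%v\n", err)
	}
	fmt.Println("key", val)

	// Open a Postgres connection through gorm and ping it.
	dsn := ""
	db, err := gorm.Open(postgres.Open(dsn), &gorm.Config{})
	if err != nil {
		panic(err)
	}

	sqlDB, err := db.DB()
	if err != nil {
		panic(err)
	}

	if err = sqlDB.Ping(); err != nil {
		panic(err)
	}
	fmt.Println("ping db success")
}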

go build -o test_redis main.go && ./test_redis

The result:
key: key
ping db success

Weird, it still works. So I suspected the problem was in the Docker environment.

So I started a Docker container from ubuntu:22.04:

docker run -it -d -v /home/ubuntu/xx/:/test ubuntu:22.04 /bin/bash

Because the host’s /home/ubuntu/xx/ is mounted at /test in the container, I could just run ./test_redis there. The error reproduced. But another interesting thing appeared:

dial tcp: lookup xx.amazonaws.com: i/o timeout
dial tcp: lookup xx.amazonaws.com: i/o timeout
key
ping db success

Weird things:

Why does the VM work fine, but the container doesn’t?
Why does Postgres work fine, but Redis doesn’t?

Q1: Why does the VM work fine, but the container doesn’t?

The error log dial tcp: lookup xx.amazonaws.com: i/o timeout shows that DNS resolution is failing.

DNS resolution uses the nameservers listed in /etc/resolv.conf to resolve the hostname.

cat /etc/resolv.conf

search xx.compute.internal
nameserver 1.1.0.2
nameserver 2.2.0.2
nameserver 1.1.1.1

We can use dig to query each nameserver directly.

dig @1.1.0.2 xx.amazonaws.com

;; communications error to 1.1.0.2#53: timed out
;; communications error to 1.1.0.2#53: timed out
;; communications error to 1.1.0.2#53: timed out

; <<>> DiG 9.18.18-0ubuntu0.22.04.1-Ubuntu <<>> @1.1.0.2 xx.amazonaws.com
; (1 server found)
;; global options: +cmd
;; no servers could be reached

dig @2.2.0.2 xx.amazonaws.com

; <<>> DiG 9.18.18-0ubuntu0.22.04.1-Ubuntu <<>> @2.2.0.2 xx.amazonaws.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11019
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
; xx.amazonaws.com. IN A

;; ANSWER SECTION:
xx.amazonaws.com. 15 IN CNAME xx.amazonaws.com.
xx.amazonaws.com. 15 IN A 2.2.12.170

;; Query time: 0 msec
;; SERVER: 2.2.0.2#53(2.2.0.2) (UDP)
;; WHEN: Fri Jan 05 18:10:15 UTC 2024
;; MSG SIZE  rcvd: 129

So 1.1.0.2 is not a working DNS resolver from this host.

After discussing this with SRE, it turned out that 1.1.0.2 is the DNS resolver of another VPC. So we adjusted the order of the DNS resolvers:

nameserver 2.2.0.2 ⬆️
nameserver 1.1.0.2 ⬇️
nameserver 1.1.1.1

After the adjustment, test_redis works in the Docker container.

Another question: where does the content of /etc/resolv.conf in a Docker container come from?

Answer:

A Docker container’s /etc/resolv.conf takes its nameservers from the dns setting in the Docker daemon config, /etc/docker/daemon.json:

ubuntu@ip-xx:~/$ cat /etc/docker/daemon.json
{
  "default-address-pool":[
    {
     "base":"172.10.0.1/18",
     "size":24
    }
  ],
  "dns" : [
    "1.1.0.2",
    "2.2.0.2",
    "1.1.1.1"
  ],
  "log-driver":"json-file",
  "log-opts": {"max-size":"100m", "max-file":"3"}
}

Q2: Why does Postgres work fine, but Redis doesn’t?

Here is the output again:

dial tcp: lookup xx.amazonaws.com: i/o timeout
dial tcp: lookup xx.amazonaws.com: i/o timeout
key
ping db success

Redis fails, but Postgres works. What is the reason? Answering this requires digging into the codebases.

After reading the code, I wrote another snippet to reproduce the symptom:

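netDialer := &net.Dialer{
	Timeout:   5 * time.Second, // the timeout discussed below
	KeepAlive: 5 * time.Minute,
}
cc, err := tls.DialWithDialer(netDialer, "tcp", "xx.amazonaws.com:6379", &tls.Config{MinVersion: tls.VersionTLS12})
if err != nil {
	panic(err)
}
defer cc.Close() // also keeps cc "used", which the Go compiler requires
fmt.Println("works")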

Without changing /etc/resolv.conf, this produces:

panic: dial tcp: lookup xx.amazonaws.com: i/o timeout

But if you comment out Timeout: 5 * time.Second, it works.

https://github.com/golang/go/blob/master/src/net/dial.go#L72

// Timeout is the maximum amount of time a dial will wait for
// a connect to complete. If Deadline is also set, it may fail
// earlier.
//
// The default is no timeout.
//
// When using TCP and dialing a host name with multiple IP
// addresses, the timeout may be divided between them.
//
// With or without a timeout, the operating system may impose
// its own earlier timeout. For instance, TCP timeouts are
// often around 3 minutes.
Timeout time.Duration

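So we can find the root cause: tls.DialWithDialer connects to Redis by domain name, and the dial Timeout also covers resolving that name. Resolution takes more than 5 seconds because the first nameserver never answers (the resolver’s default per-nameserver timeout is itself 5 seconds), so the dial returns a timeout error before the second nameserver is tried. With no timeout, the DNS resolver tries the first nameserver (which fails) and then the second one, which answers.

Why does go-redis fail, but Postgres works?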

go-redis codebase

https://github.com/redis/go-redis/blob/v9.3.1/options.go#L161

go-redis is more defensive: if the developer doesn’t set the dial timeout (DialTimeout), it defaults to 5 seconds. So if the DNS resolvers are listed in the wrong order, the lookup of xx.amazonaws.com fails with the i/o timeout error.

postgres codebase

https://github.com/jackc/pgx/blob/master/pgconn/config.go#L278

pgx, the Postgres driver, doesn’t set a Timeout unless the connection DSN contains connect_timeout.

Our DSN is postgres://xx:yy@zz.amazonaws.com:5432/db. It has no connect_timeout parameter, so the Timeout is not set.

So Postgres moves on to the next DNS resolver when the first one fails, at the cost of a slower connection startup.

conclusion

/etc/resolv.conf must list working DNS resolvers, in the right order. If it contains an unreachable resolver, you will run into weird problems like this one.

Written by George Hao
