Dial tcp: lookup i/o timeout
For data confidentiality, I’ve mocked out the real URLs and IPs in this post.
behaviors
The Scroll bridge history API was upgraded to a new version (v2). In this version, we use Redis to cache data. The Redis client we use is https://github.com/redis/go-redis. The upgrade showed an interesting symptom.
When deployed to Sepolia, it seemed to work well. But when we called the bridge history v2 API:
curl 'http://xx.sepolia.scroll.tech/api/txsbyhashes' --data-raw '{"txs":["0x40a5ee05a8ef54363d7b13203d4174f665e7aec1bacfe234151b1cf8990c3929"]}' | jq
There were many ERROR logs:
dial tcp: lookup xx.amazonaws.com: i/o timeout
So there might be a bug in the bridge history v2 codebase.
We debugged the code on a local Mac, with Redis pointed at the Sepolia Redis instance. There were no issues; it worked very well.
So I wrote a small program to test on the Sepolia VM.
package main

import (
	"context"
	"crypto/tls"
	"fmt"

	"github.com/redis/go-redis/v9"
	"gorm.io/driver/postgres"
	"gorm.io/gorm"
)

func main() {
	// Redis: connection details are blanked out for confidentiality.
	opts := &redis.Options{
		Addr:      "",
		Username:  "",
		Password:  "",
		TLSConfig: &tls.Config{MinVersion: tls.VersionTLS12},
	}
	redisClient := redis.NewClient(opts)
	ctx := context.Background()
	err := redisClient.Set(ctx, "key", "key", 0).Err()
	if err != nil {
		fmt.Printf("redis set error:%v\n", err)
	}
	val, err := redisClient.Get(ctx, "key").Result()
	if err != nil {
		fmt.Printf("redis get error:%v\n", err)
	}
	fmt.Println("key", val)

	// Postgres: the DSN is blanked out as well.
	dsn := ""
	db, err := gorm.Open(postgres.Open(dsn), &gorm.Config{})
	if err != nil {
		panic(err)
	}
	sqlDB, err := db.DB()
	if err != nil {
		panic(err)
	}
	if err = sqlDB.Ping(); err != nil {
		panic(err)
	}
	fmt.Println("ping db success")
}
go build -o test_redis main.go && ./test_redis
The result:
key: key
ping db success
Weird, it still works. So I suspected the issue was in the Docker environment.
So I ran a Docker container from ubuntu:22.04:
docker run -it -d -v /home/ubuntu/xx/:/test ubuntu:22.04 /bin/bash
This maps the host’s /home/ubuntu/xx/ to the container’s /test, so I could just run ./test_redis inside the container. The error was reproduced. But another interesting thing appeared:
dial tcp: lookup xx.amazonaws.com: i/o timeout
dial tcp: lookup xx.amazonaws.com: i/o timeout
key
ping db success
Two weird things:
- Why does the VM work fine, but the container doesn’t?
- Why does Postgres work fine, but Redis doesn’t?
Q1: Why does the VM work fine, but the container doesn’t?
The error log dial tcp: lookup xx.amazonaws.com: i/o timeout shows that DNS resolution is failing.
DNS resolution uses the nameservers listed in /etc/resolv.conf to resolve the host name.
cat /etc/resolv.conf
search xx.compute.internal
nameserver 1.1.0.2
nameserver 2.2.0.2
nameserver 1.1.1.1
We can use dig to query each nameserver directly.
dig @1.1.0.2 xx.amazonaws.com
;; communications error to 1.1.0.2#53: timed out
;; communications error to 1.1.0.2#53: timed out
;; communications error to 1.1.0.2#53: timed out
; <<>> DiG 9.18.18-0ubuntu0.22.04.1-Ubuntu <<>> @1.1.0.2 xx.amazonaws.com
; (1 server found)
;; global options: +cmd
;; no servers could be reached
dig @2.2.0.2 xx.amazonaws.com
; <<>> DiG 9.18.18-0ubuntu0.22.04.1-Ubuntu <<>> @2.2.0.2 xx.amazonaws.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11019
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
; xx.amazonaws.com. IN A
;; ANSWER SECTION:
xx.amazonaws.com. 15 IN CNAME xx.amazonaws.com.
xx.amazonaws.com. 15 IN A 2.2.12.170
;; Query time: 0 msec
;; SERVER: 2.2.0.2#53(2.2.0.2) (UDP)
;; WHEN: Fri Jan 05 18:10:15 UTC 2024
;; MSG SIZE rcvd: 129
We can see that 1.1.0.2 never answers, so it is not a usable DNS resolver from this machine.
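The same per-server check can be scripted in Go, which is handy when dig is not installed in the container. This is only a minimal sketch, not code from the bridge history service: resolverFor and the timeout values are names and numbers I made up for illustration, and the nameservers and host are the mocked ones from this post.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolverFor returns a resolver that sends every query to the given
// nameserver (host:port) instead of the ones listed in /etc/resolv.conf.
// resolverFor is a hypothetical helper, not part of any library.
func resolverFor(server string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true, // use Go's built-in resolver so that Dial is honored
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, server)
		},
	}
}

func main() {
	host := "xx.amazonaws.com" // mocked host name from this post
	for _, server := range []string{"1.1.0.2:53", "2.2.0.2:53"} {
		// Bound each lookup so a silent nameserver cannot hang the program.
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		addrs, err := resolverFor(server).LookupHost(ctx, host)
		cancel()
		fmt.Printf("%s -> addrs=%v err=%v\n", server, addrs, err)
	}
}
```

A nameserver that never answers should show up as an error for its line, while a healthy one prints the resolved addresses.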
After discussing with SRE, it turned out that 1.1.0.2 is the DNS resolver of another VPC. So we adjusted the resolver order:
nameserver 2.2.0.2 ⬆️
nameserver 1.1.0.2 ⬇️
nameserver 1.1.1.1
After the adjustment, test_redis works in the Docker container.
Another question: where does the content of /etc/resolv.conf come from in Docker?
Answer:
Docker containers inherit the dns setting from the Docker daemon’s config, /etc/docker/daemon.json:
ubuntu@ip-xx:~/$ cat /etc/docker/daemon.json
{
  "default-address-pool": [
    {
      "base": "172.10.0.1/18",
      "size": 24
    }
  ],
  "dns": [
    "1.1.0.2",
    "2.2.0.2",
    "1.1.1.1"
  ],
  "log-driver": "json-file",
  "log-opts": {"max-size": "100m", "max-file": "3"}
}
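To see which nameservers a host will hand to its containers, and in which order, the dns list can be read straight from the daemon config. The sketch below is my own illustration: daemonConfig and dnsServers are hypothetical names, and the JSON is a trimmed copy of the mocked config above rather than a file read from disk.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// daemonConfig models only the "dns" key of /etc/docker/daemon.json;
// encoding/json ignores every other key.
type daemonConfig struct {
	DNS []string `json:"dns"`
}

// dnsServers returns the nameservers in the order the config lists them.
// dnsServers is a hypothetical helper for this post.
func dnsServers(raw []byte) ([]string, error) {
	var cfg daemonConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	return cfg.DNS, nil
}

func main() {
	// Trimmed copy of the mocked daemon.json shown above.
	raw := []byte(`{
  "dns": ["1.1.0.2", "2.2.0.2", "1.1.1.1"],
  "log-driver": "json-file"
}`)
	servers, err := dnsServers(raw)
	if err != nil {
		panic(err)
	}
	for i, s := range servers {
		fmt.Printf("nameserver #%d: %s\n", i+1, s)
	}
}
```

The order matters here because the first entry is the one that gets tried first, which is exactly what went wrong with 1.1.0.2.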
Q2: Why does Postgres work fine, but Redis doesn’t?
Look at the output again:
dial tcp: lookup xx.amazonaws.com: i/o timeout
dial tcp: lookup xx.amazonaws.com: i/o timeout
key
ping db success
Redis fails, but Postgres works. What’s the reason? This requires digging into the codebase.
After reading the code, I wrote another small program to reproduce the symptom.
netDialer := &net.Dialer{
	Timeout:   5 * time.Second,
	KeepAlive: 5 * time.Minute,
}
cc, err := tls.DialWithDialer(netDialer, "tcp", "xx.amazonaws.com:6379", &tls.Config{MinVersion: tls.VersionTLS12})
if err != nil {
	panic(err)
}
defer cc.Close()
fmt.Println("works")
Without changing /etc/resolv.conf, this gives the same result:
panic: dial tcp: lookup xx.amazonaws.com: i/o timeout
But if you comment out the line Timeout: 5 * time.Second, it works. The documentation of the Timeout field explains why:
https://github.com/golang/go/blob/master/src/net/dial.go#L72
// Timeout is the maximum amount of time a dial will wait for
// a connect to complete. If Deadline is also set, it may fail
// earlier.
//
// The default is no timeout.
//
// When using TCP and dialing a host name with multiple IP
// addresses, the timeout may be divided between them.
//
// With or without a timeout, the operating system may impose
// its own earlier timeout. For instance, TCP timeouts are
// often around 3 minutes.
Timeout time.Duration
So we can find the root cause: tls.DialWithDialer connects to Redis by domain name, and the 5-second Timeout covers the whole dial, including DNS resolution. Resolving the domain name takes more than 5 seconds because the first nameserver never answers, so the dial returns a timeout error. With no timeout, the DNS resolver tries dns1 (which fails) and then tries dns2, which succeeds.
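The interaction between the overall dial timeout and the per-nameserver fallback can be sketched without any network access. The code below is a simulation under my own assumptions, not Go's actual resolver: fakeServer, lookup, demo and the 50 ms per-server timeout are invented for illustration, and the real timeouts are seconds rather than milliseconds.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// perServerTimeout stands in for the resolver's per-nameserver timeout.
// Real resolvers wait seconds; it is shortened here to keep the demo fast.
const perServerTimeout = 50 * time.Millisecond

// fakeServer simulates a nameserver. An empty answer means the server
// never replies, like 1.1.0.2 in this post.
type fakeServer struct {
	addr   string
	answer string
}

// query waits for the server's reply, giving up after perServerTimeout
// or when the caller's context ends, whichever comes first.
func (s fakeServer) query(ctx context.Context) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, perServerTimeout)
	defer cancel()
	if s.answer == "" {
		<-ctx.Done() // silent server: block until a timeout fires
		return "", ctx.Err()
	}
	return s.answer, nil
}

// lookup tries the servers in order and stops early once the caller's
// overall deadline (the dial timeout) has passed.
func lookup(ctx context.Context, servers []fakeServer) (string, error) {
	for _, s := range servers {
		ans, err := s.query(ctx)
		if err == nil {
			return ans, nil
		}
		if ctx.Err() != nil {
			return "", errors.New("lookup: i/o timeout")
		}
	}
	return "", errors.New("lookup: no servers could be reached")
}

// demo runs lookup with a dial timeout in milliseconds; 0 means no timeout.
func demo(dialTimeoutMs int) (string, error) {
	servers := []fakeServer{
		{addr: "1.1.0.2"},                       // never answers
		{addr: "2.2.0.2", answer: "2.2.12.170"}, // answers immediately
	}
	ctx := context.Background()
	if dialTimeoutMs > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, time.Duration(dialTimeoutMs)*time.Millisecond)
		defer cancel()
	}
	return lookup(ctx, servers)
}

func main() {
	ans, err := demo(20) // dial timeout shorter than the per-server timeout
	fmt.Printf("with timeout:    answer=%q err=%v\n", ans, err)
	ans, err = demo(0) // no dial timeout
	fmt.Printf("without timeout: answer=%q err=%v\n", ans, err)
}
```

With a dial timeout shorter than the time spent waiting on the dead first server, the lookup gives up before the second server is ever asked. Without a dial timeout, the first server is abandoned after its own timeout and the second one answers.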
Why does go-redis fail, but Postgres works?
go-redis codebase
https://github.com/redis/go-redis/blob/v9.3.1/options.go#L161
go-redis is more defensive. If the developer doesn’t set a dial timeout, it uses a default of 5 seconds. So if the DNS list is in the wrong order, the lookup times out with the lookup xx.amazonaws.com: i/o timeout error.
postgres codebase
https://github.com/jackc/pgx/blob/master/pgconn/config.go#L278
The Postgres driver doesn’t set a dial Timeout unless the Postgres connection DSN contains connect_timeout.
The DSN is postgres://xx:yy@zz.amazonaws.com:5432/db. It has no connect_timeout parameter, so the Timeout is not set.
So Postgres moves on to the next DNS resolver when the first one fails, at the cost of a slower startup for the Postgres driver.
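For comparison, this is how connect_timeout could be added to the DSN. It is a minimal sketch using only net/url: withConnectTimeout is a hypothetical helper and the value of 5 seconds is my own choice to mirror the go-redis default. Based on the reasoning above, I would expect Postgres with this DSN to hit the same lookup timeout as Redis while the resolver order is wrong, but I did not test that.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// withConnectTimeout returns dsn with the connect_timeout query parameter
// set to the given number of seconds. It is a hypothetical helper.
func withConnectTimeout(dsn string, seconds int) (string, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("connect_timeout", strconv.Itoa(seconds))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// Mocked DSN from this post.
	dsn, err := withConnectTimeout("postgres://xx:yy@zz.amazonaws.com:5432/db", 5)
	if err != nil {
		panic(err)
	}
	fmt.Println(dsn) // postgres://xx:yy@zz.amazonaws.com:5432/db?connect_timeout=5
}
```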
conclusion
/etc/resolv.conf needs to list working DNS resolvers, in the right order. If it contains a resolver that does not answer, you will run into weird problems like this one.