说明

最近看到一个项目 build-your-own-x。一直以来都在使用 qBittorrent、BitComet 这样的种子软件,但一直没有学习过它们的底层原理,所以想借助这篇文章,一边翻译一边学习,最终做出一个自己的种子下载器。

Building a BitTorrent client from the ground up in Go

What is the complete path between visiting thepiratebay and sublimating an mp3 file from thin air? In this post, we’ll implement enough of the BitTorrent protocol to download Debian. Look at the Source code or skip to the last bit.

访问 thepiratebay 和凭空生成一个 mp3 文件之间的完整路径是什么?在本篇文章中,我们将实现 BitTorrent 协议中足以下载 Debian 的那一部分。你可以查看源代码,或者直接跳到最后一部分。

This post is also available in Russian, Korean, and Chinese.

本文章还有俄语、韩语和汉语版本。

BitTorrent is a protocol for downloading and distributing files across the Internet. In contrast with the traditional client/server relationship, in which downloaders connect to a central server (for example: watching a movie on Netflix, or loading the web page you’re reading now), participants in the BitTorrent network, called peers, download pieces of files from each other—this is what makes it a peer-to-peer protocol. We’ll investigate how this works, and build our own client that can find peers and exchange data between them.

BitTorrent 是一种在互联网上下载和分发文件的协议。在传统的客户端/服务器模式中,下载者连接到一台中央服务器(比如在 Netflix 上看电影,或者加载你现在正在阅读的这个网页);与之相比,BitTorrent 网络中的参与者(称为 peer,对等节点)彼此之间互相下载文件的片段,这正是它被称为点对点(peer-to-peer)协议的原因。我们将研究其工作原理,并构建一个能找到对等节点并与之交换数据的客户端。

client-server-p2p

The protocol evolved organically over the past 20 years, and various people and organizations added extensions for features like encryption, private torrents, and new ways of finding peers. We’ll be implementing the original spec from 2001 to keep this a weekend-sized project.

在过去的20年中,这一协议在自然演化中逐步发展,不同的个人和组织为其增加了诸如加密、私密种子和新的节点发现方式等功能的扩展。为了确保这个项目能在一个周末内完成,我们将采用2001年的原始规范进行实现。

I’ll be using a Debian ISO file as my guinea pig because it’s big, but not huge, at 350MB. As a popular Linux distribution, there will be lots of fast and cooperative peers for us to connect to. And we’ll avoid the legal and ethical issues related to downloading pirated content.

我将使用Debian的ISO文件作为实验对象,因为它的大小适中,只有350MB。作为一种受欢迎的Linux发行版,我们将能够连接到许多快速而合作的节点。同时,我们也将避免与下载盗版内容相关的法律和道德问题。

Finding peers

Here’s a problem: we want to download a file with BitTorrent, but it’s a peer-to-peer protocol and we have no idea where to find peers to download it from. This is a lot like moving to a new city and trying to make friends—maybe we’ll hit up a local pub or a meetup group! Centralized locations like these are the big idea behind trackers, which are central servers that introduce peers to each other. They’re just web servers running over HTTP, and you can find Debian’s at http://bttracker.debian.org:6969/

这里有一个问题:我们想用 BitTorrent 下载一个文件,但它是点对点协议,我们并不知道去哪里找到可以下载的对等节点。这就像搬到一个新城市想交朋友一样,也许我们会去当地的酒吧或者参加聚会!像这样的集中式场所正是 tracker(跟踪器)背后的核心思想:它们是把对等节点互相介绍给彼此的中央服务器。它们只是运行在 HTTP 上的 Web 服务器,你可以在 http://bttracker.debian.org:6969/ 找到 Debian 的 tracker。

bt_trackers

Of course, these central servers are liable to get raided by the feds if they facilitate peers exchanging illegal content. You may remember reading about trackers like TorrentSpy, Popcorn Time, and KickassTorrents getting seized and shut down. New methods cut out the middleman by making even peer discovery a distributed process. We won’t be implementing them, but if you’re interested, some terms you can research are DHT, PEX, and magnet links.

当然,如果这些中央服务器促进用户交换非法内容,它们就有可能被联邦政府突袭。你可能记得读过像 TorrentSpy、Popcorn Time 和 KickassTorrents 这样的跟踪器被查封和关闭的新闻。新方法甚至将发现 peer 也变成了一个分布式过程,从而省去了中间环节。我们不会实现它们,但如果您感兴趣,可以研究一下DHT、PEX和磁力链接。

Parsing a .torrent file

A .torrent file describes the contents of a torrentable file and information for connecting to a tracker. It’s all we need in order to kickstart the process of downloading a torrent. Debian’s .torrent file looks like this:

.torrent 文件描述了可下载文件的内容以及连接到跟踪器的信息。我们只需要它就可以启动下载 torrent 的过程。Debian 的 .torrent 文件看起来像这样:

d8:announce41:http://bttracker.debian.org:6969/announce7:comment35:"Debian CD from cdimage.debian.org"13:creation datei1573903810e9:httpseedsl145:https://cdimage.debian.org/cdimage/release/10.2.0//srv/cdbuilder.debian.org/dst/deb-cd/weekly-builds/amd64/iso-cd/debian-10.2.0-amd64-netinst.iso145:https://cdimage.debian.org/cdimage/archive/10.2.0//srv/cdbuilder.debian.org/dst/deb-cd/weekly-builds/amd64/iso-cd/debian-10.2.0-amd64-netinst.isoe4:infod6:lengthi351272960e4:name31:debian-10.2.0-amd64-netinst.iso12:piece lengthi262144e6:pieces26800:�����PS�^�� (binary blob of the hashes of each piece)ee

That mess is encoded in a format called Bencode (pronounced bee-encode), and we’ll need to decode it.

这一堆乱糟糟的内容是用一种叫做 Bencode(读作 bee-encode)的格式编码的,我们需要对它进行解码。

Bencode can encode roughly the same types of structures as JSON—strings, integers, lists, and dictionaries. Bencoded data is not as human-readable/writable as JSON, but it can efficiently handle binary data and it’s really simple to parse from a stream. Strings come with a length prefix, and look like 4:spam. Integers go between start and end markers, so 7 would encode to i7e. Lists and dictionaries work in a similar way: l4:spami7ee represents [‘spam’, 7], while d4:spami7ee means {spam: 7}.

Bencode 能编码的结构类型与 JSON 大致相同:字符串、整数、列表和字典。Bencode 数据不像 JSON 那样方便人类读写,但它能高效地处理二进制数据,而且从数据流中解析起来也非常简单。字符串带有长度前缀,形如 4:spam;整数放在起始标记和结束标记之间,所以 7 会被编码为 i7e;列表和字典的工作方式类似:l4:spami7ee 表示 ['spam', 7],而 d4:spami7ee 表示 {spam: 7}。
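
为了直观感受这几种编码形式,下面是一个很小的演示程序(这段代码不是原文的一部分,只是我自己加的示意),手工拼出上面提到的几个例子:

package main

import (
	"fmt"
	"strconv"
)

// bencodeString encodes a string as <length>:<data>, e.g. "spam" -> "4:spam"
func bencodeString(s string) string {
	return strconv.Itoa(len(s)) + ":" + s
}

// bencodeInt encodes an integer between 'i' and 'e', e.g. 7 -> "i7e"
func bencodeInt(n int) string {
	return "i" + strconv.Itoa(n) + "e"
}

func main() {
	fmt.Println(bencodeString("spam"))                             // 4:spam
	fmt.Println(bencodeInt(7))                                     // i7e
	fmt.Println("l" + bencodeString("spam") + bencodeInt(7) + "e") // l4:spami7ee -> ["spam", 7]
	fmt.Println("d" + bencodeString("spam") + bencodeInt(7) + "e") // d4:spami7ee -> {"spam": 7}
}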

In a prettier format, our .torrent file looks like this:

如果换一种更易读的排版,我们的 .torrent 文件看起来是这样的:

d
8:announce
41:http://bttracker.debian.org:6969/announce
7:comment
35:"Debian CD from cdimage.debian.org"
13:creation date
i1573903810e
4:info
d
6:length
i351272960e
4:name
31:debian-10.2.0-amd64-netinst.iso
12:piece length
i262144e
6:pieces
26800:�����PS�^�� (binary blob of the hashes of each piece)
e
e

In this file, we can spot the URL of the tracker, the creation date (as a Unix timestamp), the name and size of the file, and a big binary blob containing the SHA-1 hashes of each piece, which are equally-sized parts of the file we want to download. The exact size of a piece varies between torrents, but they are usually somewhere between 256KB and 1MB. This means that a large file might be made up of thousands of pieces. We’ll download these pieces from our peers, check them against the hashes from our torrent file, assemble them together, and boom, we’ve got a file!

在这个文件中,我们可以找到跟踪器的URL、创建日期(以Unix时间戳形式)、文件的名称和大小,以及一个包含SHA-1哈希值的大二进制数据块。这些哈希值对应于我们要下载的文件的等大小部分。每个部分的确切大小因种子文件而异,但通常在256KB到1MB之间。这意味着一个大文件可能由成千上万个部分组成。我们将从其他的用户那里下载这些部分,然后与种子文件中的哈希值进行校验,将它们组装在一起,最终得到完整的文件!

bt_pieces

This mechanism allows us to verify the integrity of each piece as we go. It makes BitTorrent resistant to accidental corruption or intentional torrent poisoning. Unless an attacker is capable of breaking SHA-1 with a preimage attack, we will get exactly the content we asked for.

通过这一机制,我们可以验证每个片段的完整性。这使得BitTorrent对于意外损坏或故意的种子污染具有一定的抵抗能力。除非攻击者能够通过预像攻击破解SHA-1,否则我们将会得到我们所请求的确切内容。

It would be really fun to write a bencode parser, but parsing isn’t our focus today. But I found Fredrik Lundh’s 50 line parser to be especially illuminating. For this project, I used github.com/jackpal/bencode-go:

编写一个bencode解析器确实会很有趣,但是解析并不是我们今天的重点。不过我发现Fredrik Lundh的50行解析器特别有启发性。在这个项目中,我使用了github.com/jackpal/bencode-go。

import (
	"io"

	"github.com/jackpal/bencode-go"
)

type bencodeInfo struct {
	Pieces      string `bencode:"pieces"`
	PieceLength int    `bencode:"piece length"`
	Length      int    `bencode:"length"`
	Name        string `bencode:"name"`
}

type bencodeTorrent struct {
	Announce string      `bencode:"announce"`
	Info     bencodeInfo `bencode:"info"`
}

// Open parses a torrent file
func Open(r io.Reader) (*bencodeTorrent, error) {
	bto := bencodeTorrent{}
	err := bencode.Unmarshal(r, &bto)
	if err != nil {
		return nil, err
	}
	return &bto, nil
}

Because I like to keep my structures relatively flat, and I like to keep my application structs separate from my serialization structs, I exported a different, flatter struct named TorrentFile and wrote a few helper functions to convert between the two.

因为我喜欢保持我的结构相对扁平化,并且喜欢将我的应用程序结构与序列化结构分开,所以我导出了一个不同的、更加扁平的结构体,命名为TorrentFile,并编写了一些辅助函数来在这两者之间进行转换。

Notably, I split pieces (previously a string) into a slice of hashes (each [20]byte) so that I can easily access individual hashes later. I also computed the SHA-1 hash of the entire bencoded info dict (the one which contained the name, size, and piece hashes). We know this as the infohash and it uniquely identifies files when we talk to trackers and peers. More on this later.

值得注意的是,我将之前作为字符串的pieces拆分成了一个哈希切片(每个哈希为[20]byte),这样我可以轻松地在以后访问单个哈希。我还计算了整个bencoded info字典(包含名称、大小和分块哈希)的SHA-1哈希值。我们称之为infohash,它在我们与跟踪器和其他用户通信时唯一地标识文件。关于这个后面会详细介绍。

torrent-info-hash

type TorrentFile struct {
	Announce    string
	InfoHash    [20]byte
	PieceHashes [][20]byte
	PieceLength int
	Length      int
	Name        string
}

func (bto bencodeTorrent) toTorrentFile() (TorrentFile, error) {
	// …
}
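
原文没有给出 toTorrentFile 的具体实现。下面是一个示意性的草图(方法名 hash、splitPieceHashes 是我为了说明自己起的,并非原文定义),展示如何把 pieces 字符串切成一个个 [20]byte 的哈希,以及如何把 info 字典重新 bencode 后求 SHA-1 得到 infohash:

import (
	"bytes"
	"crypto/sha1"
	"fmt"

	"github.com/jackpal/bencode-go"
)

// hash re-bencodes the info dict and takes its SHA-1: this is the infohash
func (i *bencodeInfo) hash() ([20]byte, error) {
	var buf bytes.Buffer
	err := bencode.Marshal(&buf, *i)
	if err != nil {
		return [20]byte{}, err
	}
	return sha1.Sum(buf.Bytes()), nil
}

// splitPieceHashes cuts the pieces string into a slice of 20-byte SHA-1 hashes
func (i *bencodeInfo) splitPieceHashes() ([][20]byte, error) {
	hashLen := 20 // length of a SHA-1 hash
	buf := []byte(i.Pieces)
	if len(buf)%hashLen != 0 {
		return nil, fmt.Errorf("received malformed pieces of length %d", len(buf))
	}
	numHashes := len(buf) / hashLen
	hashes := make([][20]byte, numHashes)
	for idx := 0; idx < numHashes; idx++ {
		copy(hashes[idx][:], buf[idx*hashLen:(idx+1)*hashLen])
	}
	return hashes, nil
}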

Retrieving peers from the tracker

Now that we have information about the file and its tracker, let’s talk to the tracker to announce our presence as a peer and to retrieve a list of other peers. We just need to make a GET request to the announce URL supplied in the .torrent file, with a few query parameters:

既然我们已经获取了关于文件和跟踪器的信息,让我们与跟踪器通信,向其宣告我们作为一个对等节点的存在,并获取其他对等节点的列表。我们只需要向.torrent文件提供的announce URL发起一个GET请求,并附带一些查询参数:

func (t *TorrentFile) buildTrackerURL(peerID [20]byte, port uint16) (string, error) {
	base, err := url.Parse(t.Announce)
	if err != nil {
		return "", err
	}
	params := url.Values{
		"info_hash":  []string{string(t.InfoHash[:])},
		"peer_id":    []string{string(peerID[:])},
		"port":       []string{strconv.Itoa(int(port))},
		"uploaded":   []string{"0"},
		"downloaded": []string{"0"},
		"compact":    []string{"1"},
		"left":       []string{strconv.Itoa(t.Length)},
	}
	base.RawQuery = params.Encode()
	return base.String(), nil
}

The important ones:(其中一些重要的参数有:)

  • info_hash: Identifies the file we’re trying to download. It’s the infohash we calculated earlier from the bencoded info dict. The tracker will use this to figure out which peers to show us.
  • peer_id: A 20 byte name to identify ourselves to trackers and peers. We’ll just generate 20 random bytes for this. Real BitTorrent clients have IDs like -TR2940-k8hj0wgej6ch which identify the client software and version—in this case, TR2940 stands for Transmission client 2.94.
  • info_hash: 标识我们要下载的文件,即我们之前从bencoded info字典计算得到的infohash。跟踪器将使用此信息确定要显示给我们的对等节点。
  • peer_id: 用于向跟踪器和其他对等节点标识我们自己的20字节名称。我们将为此生成20个随机字节。真实的BitTorrent客户端具有像-TR2940-k8hj0wgej6ch这样的ID,用于标识客户端软件和版本——在本例中,TR2940代表Transmission客户端2.94版本。
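
顺带一提,peer_id 的生成非常简单:20 个随机字节即可。下面是一个小示意(函数名 randomPeerID 是我自己起的,并非原文代码):

import "crypto/rand"

// randomPeerID generates the 20 random bytes we use as our peer_id
func randomPeerID() ([20]byte, error) {
	var peerID [20]byte
	_, err := rand.Read(peerID[:])
	if err != nil {
		return [20]byte{}, err
	}
	return peerID, nil
}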

blog.jse.li__torrent__info-hash-peer-id

Parsing the tracker response

We get back a bencoded response:

d
8:interval
i900e
5:peers
252:(another long binary blob)
e

Interval tells us how often we’re supposed to connect to the tracker again to refresh our list of peers. A value of 900 means we should reconnect every 15 minutes (900 seconds).

Interval告诉我们应该多久连接一次跟踪器以刷新我们的对等节点列表。值900告诉我们应该每隔15分钟(900秒)刷新一次。

Peers is another long binary blob containing the IP addresses of each peer. It’s made out of groups of six bytes. The first four bytes in each group represent the peer’s IP address—each byte represents a number in the IP. The last two bytes represent the port, as a big-endian uint16. Big-endian, or network order, means that we can interpret a group of bytes as an integer by just squishing them together left to right. For example, the bytes 0x1A, 0xE1 make 0x1AE1, or 6881 in decimal.

Peers是另一个包含每个对等节点的IP地址的长二进制数据块。它由六字节一组构成。每个组中的前四个字节代表对等节点的IP地址,其中每个字节表示IP地址中的一个数字。最后两个字节以大端字节序表示端口,作为一个无符号的16位整数(uint16)。大端序(或网络字节序)意味着我们可以通过将一组字节从左到右拼接在一起来将其解释为一个整数。例如,字节0x1A和0xE1可以组成0x1AE1,即十进制的6881。

blog.jse.li__torrent__address

// Peer encodes connection information for a peer
type Peer struct {
	IP   net.IP
	Port uint16
}

// Unmarshal parses peer IP addresses and ports from a buffer
func Unmarshal(peersBin []byte) ([]Peer, error) {
	const peerSize = 6 // 4 for IP, 2 for port
	numPeers := len(peersBin) / peerSize
	if len(peersBin)%peerSize != 0 {
		err := fmt.Errorf("Received malformed peers")
		return nil, err
	}
	peers := make([]Peer, numPeers)
	for i := 0; i < numPeers; i++ {
		offset := i * peerSize
		peers[i].IP = net.IP(peersBin[offset : offset+4])
		peers[i].Port = binary.BigEndian.Uint16(peersBin[offset+4 : offset+6])
	}
	return peers, nil
}
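
原文没有展示向 tracker 发起 HTTP 请求的那部分代码。下面是一个示意性的写法(类型名 bencodeTrackerResp 和函数名 requestPeers 是我自己起的,并且为了简化,假设前面的代码都在同一个包里):先用 buildTrackerURL 构造 URL,发起 GET 请求,再用 bencode 解码响应,最后用上面的 Unmarshal 解析出 peer 列表:

import (
	"net/http"
	"time"

	"github.com/jackpal/bencode-go"
)

// bencodeTrackerResp mirrors the two fields we care about in the tracker's reply
type bencodeTrackerResp struct {
	Interval int    `bencode:"interval"`
	Peers    string `bencode:"peers"`
}

// requestPeers announces our presence to the tracker and parses the peers it returns
func (t *TorrentFile) requestPeers(peerID [20]byte, port uint16) ([]Peer, error) {
	trackerURL, err := t.buildTrackerURL(peerID, port)
	if err != nil {
		return nil, err
	}

	c := &http.Client{Timeout: 15 * time.Second}
	resp, err := c.Get(trackerURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	trackerResp := bencodeTrackerResp{}
	err = bencode.Unmarshal(resp.Body, &trackerResp)
	if err != nil {
		return nil, err
	}
	return Unmarshal([]byte(trackerResp.Peers))
}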

Downloading from peers

Now that we have a list of peers, it’s time to connect with them and start downloading pieces! We can break down the process into a few steps. For each peer, we want to:

既然我们有了对等节点的列表,现在是时候与它们建立连接并开始下载部分文件了!我们可以将这个过程分解为几个步骤。对于每个对等节点,我们希望:

  1. Start a TCP connection with the peer. This is like starting a phone call.
  2. Complete a two-way BitTorrent handshake. “Hello?” “Hello.”
  3. Exchange messages to download pieces. “I’d like piece #231 please.”

Start a TCP connection

conn, err := net.DialTimeout("tcp", peer.String(), 3*time.Second)
if err != nil {
	return nil, err
}

I set a timeout so that I don’t waste too much time on peers that aren’t going to let me connect. For the most part, it’s a pretty standard TCP connection.

我设置了一个超时时间,这样我就不会浪费太多时间在不让我连接的对等节点上。大部分情况下,这是一个相当标准的TCP连接。

Complete the handshake

We’ve just set up a connection with a peer, but we want to do a handshake to validate our assumptions that the peer:

我们刚刚与一个对等节点建立了连接,但我们希望进行握手来验证以下假设,即对等节点:

  1. can communicate using the BitTorrent protocol
  2. is able to understand and respond to our messages
  3. has the file that we want, or at least knows what we’re talking about

blog.jse.li__torrent__handshake

My father told me that the secret to a good handshake is a firm grip and eye contact. The secret to a good BitTorrent handshake is that it’s made up of five parts:

父亲告诉我,一个好的握手的秘诀是握手要坚定,同时要有眼神接触。而一个良好的BitTorrent握手的秘诀在于它由五个部分组成:

  1. The length of the protocol identifier, which is always 19 (0x13 in hex)

  2. The protocol identifier, called the pstr which is always BitTorrent protocol

  3. Eight reserved bytes, all set to 0. We’d flip some of them to 1 to indicate that we support certain extensions. But we don’t, so we’ll keep them at 0.

  4. The infohash that we calculated earlier to identify which file we want

  5. The Peer ID that we made up to identify ourselves

  1. 协议标识符的长度,始终为19(十六进制为0x13)。

  2. 协议标识符,称为pstr,始终为"BitTorrent protocol"。

  3. 保留的8个字节,全部设置为0。如果我们支持某些扩展,我们会将其中一些设置为1来指示。但因为我们不支持任何扩展,所以我们将它们保持为0。

  4. 我们之前计算的infohash,用于标识我们想要下载的文件。

  5. 我们编造的对等节点标识(Peer ID),用于标识我们自己。

Put together, a handshake string might look like this:

\x13BitTorrent protocol\x00\x00\x00\x00\x00\x00\x00\x00\x86\xd4\xc8\x00\x24\xa4\x69\xbe\x4c\x50\xbc\x5a\x10\x2c\xf7\x17\x80\x31\x00\x74-TR2940-k8hj0wgej6ch

After we send a handshake to our peer, we should receive a handshake back in the same format. The infohash we get back should match the one we sent so that we know that we’re talking about the same file. If everything goes as planned, we’re good to go. If not, we can sever the connection because there’s something wrong. “Hello?” “这是谁? 你想要什么?” “Okay, wow, wrong number.”

在我们向对等节点发送握手后,我们应该以相同的格式收到握手回复。回复中的infohash应该与我们发送的一致,这样我们就知道我们正在讨论同一个文件。如果一切按计划进行,我们就可以开始了。如果有问题,我们可以终止连接,因为出现了错误。“你好?” “这是谁?你想要什么?” “哇,好吧,打错电话了。”

In our code, let’s make a struct to represent a handshake, and write a few methods for serializing and reading them:

在我们的代码中,让我们创建一个结构体来表示握手,并编写一些方法来进行序列化和读取:

// A Handshake is a special message that a peer uses to identify itself
type Handshake struct {
	Pstr     string
	InfoHash [20]byte
	PeerID   [20]byte
}

// Serialize serializes the handshake to a buffer
func (h *Handshake) Serialize() []byte {
	buf := make([]byte, len(h.Pstr)+49)
	buf[0] = byte(len(h.Pstr))
	curr := 1
	curr += copy(buf[curr:], h.Pstr)
	curr += copy(buf[curr:], make([]byte, 8)) // 8 reserved bytes
	curr += copy(buf[curr:], h.InfoHash[:])
	curr += copy(buf[curr:], h.PeerID[:])
	return buf
}

// Read parses a handshake from a stream
func Read(r io.Reader) (*Handshake, error) {
	// Do Serialize(), but backwards
	// ...
}
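
原文有意省略了 Read 的实现(『把 Serialize 倒过来做』)。如果想对照检查自己的实现,下面是一个可能的写法(仅作示意,不保证与原作者的代码一字不差):

// Read parses a handshake from a stream: one way the elided body might look
func Read(r io.Reader) (*Handshake, error) {
	lengthBuf := make([]byte, 1)
	_, err := io.ReadFull(r, lengthBuf)
	if err != nil {
		return nil, err
	}
	pstrlen := int(lengthBuf[0])
	if pstrlen == 0 {
		return nil, fmt.Errorf("pstrlen cannot be 0")
	}

	// The rest of the handshake is pstr + 8 reserved bytes + infohash + peer ID
	handshakeBuf := make([]byte, pstrlen+48)
	_, err = io.ReadFull(r, handshakeBuf)
	if err != nil {
		return nil, err
	}

	var infoHash, peerID [20]byte
	copy(infoHash[:], handshakeBuf[pstrlen+8:pstrlen+28])
	copy(peerID[:], handshakeBuf[pstrlen+28:])

	h := Handshake{
		Pstr:     string(handshakeBuf[0:pstrlen]),
		InfoHash: infoHash,
		PeerID:   peerID,
	}
	return &h, nil
}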

Send and receive messages

Once we’ve completed the initial handshake, we can send and receive messages. Well, not quite—if the other peer isn’t ready to accept messages, we can’t send any until they tell us they’re ready. In this state, we’re considered choked by the other peer. They’ll send us an unchoke message to let us know that we can begin asking them for data. By default, we assume that we’re choked until proven otherwise.

一旦完成初始握手,我们就可以收发消息了。嗯,其实还不行:如果对方还没准备好接收消息,在他们通知我们之前,我们什么也不能发。在这种状态下,我们被对方『阻塞』(choked)了。对方会发送一条 unchoke(解除阻塞)消息,告诉我们可以开始向他们请求数据。默认情况下,我们假定自己处于被阻塞状态,除非对方表明并非如此。

Once we’ve been unchoked, we can then begin sending requests for pieces, and they can send us messages back containing pieces.

一旦我们被解除阻塞,我们就可以开始发送对部分文件的请求,而对方则可以回复我们包含部分文件的消息。

blog.jse.li__torrent__choke

Interpreting messages

A message has a length, an ID and a payload. On the wire, it looks like:

每个消息都包含长度、ID和负载。在传输过程中,它的格式如下所示:

blog.jse.li__torrent__message

A message starts with a length indicator which tells us how many bytes long the message will be. It’s a 32-bit integer, meaning it’s made out of four bytes smooshed together in big-endian order. The next byte, the ID, tells us which type of message we’re receiving—for example, a 2 byte means “interested.” Finally, the optional payload fills out the remaining length of the message.

一条消息以一个长度指示开始,它告诉我们这条消息有多少个字节长。这是一个32位整数,由四个按大端序拼在一起的字节组成。接下来的一个字节是 ID,它告诉我们收到的是哪种类型的消息,例如,值为 2 的字节表示『interested(感兴趣)』。最后,可选的负载(payload)占据消息剩余的长度。

type messageID uint8

const (
	MsgChoke         messageID = 0
	MsgUnchoke       messageID = 1
	MsgInterested    messageID = 2
	MsgNotInterested messageID = 3
	MsgHave          messageID = 4
	MsgBitfield      messageID = 5
	MsgRequest       messageID = 6
	MsgPiece         messageID = 7
	MsgCancel        messageID = 8
)

// Message stores ID and payload of a message
type Message struct {
	ID      messageID
	Payload []byte
}

// Serialize serializes a message into a buffer of the form
// <length prefix><message ID><payload>
// Interprets `nil` as a keep-alive message
func (m *Message) Serialize() []byte {
	if m == nil {
		return make([]byte, 4)
	}
	length := uint32(len(m.Payload) + 1) // +1 for id
	buf := make([]byte, 4+length)
	binary.BigEndian.PutUint32(buf[0:4], length)
	buf[4] = byte(m.ID)
	copy(buf[5:], m.Payload)
	return buf
}

To read a message from a stream, we just follow the format of a message. We read four bytes and interpret them as a uint32 to get the length of the message. Then, we read that number of bytes to get the ID (the first byte) and the payload (the remaining bytes).

要从流中读取一条消息,我们只需按照消息的格式进行操作。我们先读取四个字节,并将它们解释为uint32类型,以获取消息的长度。然后,我们读取相应数量的字节来获取ID(第一个字节)和有效载荷(剩余字节)。

// Read parses a message from a stream. Returns `nil` on keep-alive message
func Read(r io.Reader) (*Message, error) {
	lengthBuf := make([]byte, 4)
	_, err := io.ReadFull(r, lengthBuf)
	if err != nil {
		return nil, err
	}
	length := binary.BigEndian.Uint32(lengthBuf)

	// keep-alive message
	if length == 0 {
		return nil, nil
	}

	messageBuf := make([]byte, length)
	_, err = io.ReadFull(r, messageBuf)
	if err != nil {
		return nil, err
	}

	m := Message{
		ID:      messageID(messageBuf[0]),
		Payload: messageBuf[1:],
	}

	return &m, nil
}

Bitfields

One of the most interesting types of message is the bitfield, which is a data structure that peers use to efficiently encode which pieces they are able to send us. A bitfield looks like a byte array, and to check which pieces they have, we just need to look at the positions of the bits set to 1. You can think of it like the digital equivalent of a coffee shop loyalty card. We start with a blank card of all 0, and flip bits to 1 to mark their positions as “stamped.”

最有趣的消息类型之一是位域(bitfield),它是一种数据结构,用于对等方有效地编码他们能够向我们发送的片段。位域看起来像一个字节数组,要检查他们拥有哪些片段,我们只需查看被设为1的位的位置。你可以将其视为咖啡店会员卡的数字等效物。我们从一张全为0的空白卡开始,通过将位设为1来标记它们的位置为“已盖章”。

blog.jse.li__torrent__bitfield

By working with bits instead of bytes, this data structure is super compact. We can stuff information about eight pieces in the space of a single byte—the size of a bool. The tradeoff is that accessing values becomes a little more tricky. The smallest unit of memory that computers can address are bytes, so to get to our bits, we have to do some bitwise manipulation:

通过按位而不是按字节来存储,这种数据结构非常紧凑:我们可以把八个片段的信息塞进一个字节(也就是一个 bool 所占的空间)里。代价是访问这些值会变得稍微麻烦一些。计算机能寻址的最小内存单位是字节,所以要取出某一位,我们需要做一些位运算:

// A Bitfield represents the pieces that a peer has
type Bitfield []byte

// HasPiece tells if a bitfield has a particular index set
func (bf Bitfield) HasPiece(index int) bool {
	byteIndex := index / 8
	offset := index % 8
	return bf[byteIndex]>>(7-offset)&1 != 0
}

// SetPiece sets a bit in the bitfield
func (bf Bitfield) SetPiece(index int) {
	byteIndex := index / 8
	offset := index % 8
	bf[byteIndex] |= 1 << (7 - offset)
}

Putting it all together

We now have all the tools we need to download a torrent: we have a list of peers obtained from the tracker, and we can communicate with them by dialing a TCP connection, initiating a handshake, and sending and receiving messages. Our last big problems are handling the concurrency involved in talking to multiple peers at once, and managing the state of our peers as we interact with them. These are both classically Hard problems.

我们现在拥有下载种子所需的所有工具:我们从追踪器获取了对等方列表,并且可以通过拨号TCP连接与它们通信,发起握手,并发送和接收消息。我们最后面临的两个大问题是同时与多个对等方通信时处理并发,以及在与他们互动时管理对等方的状态。这两个问题都属于经典的难题。

Managing concurrency: channels as queues

In Go, we share memory by communicating, and we can think of a Go channel as a cheap thread-safe queue.

在Go语言中,我们通过通信来共享内存,并且可以将Go通道视为一种廉价的线程安全队列。

We’ll set up two channels to synchronize our concurrent workers: one for dishing out work (pieces to download) between peers, and another for collecting downloaded pieces. As downloaded pieces come in through the results channel, we can copy them into a buffer to start assembling our complete file.

我们将设置两个通道来同步我们的并发工作器:一个用于在对等方之间分配工作(要下载的片段),另一个用于收集已下载的片段。当下载的片段通过结果通道进入时,我们可以将它们复制到缓冲区中,以开始组装完整的文件。

// Init queues for workers to retrieve work and send results
workQueue := make(chan *pieceWork, len(t.PieceHashes))
results := make(chan *pieceResult)
for index, hash := range t.PieceHashes {
	length := t.calculatePieceSize(index)
	workQueue <- &pieceWork{index, hash, length}
}

// Start workers
for _, peer := range t.Peers {
	go t.startDownloadWorker(peer, workQueue, results)
}

// Collect results into a buffer until full
buf := make([]byte, t.Length)
donePieces := 0
for donePieces < len(t.PieceHashes) {
	res := <-results
	begin, end := t.calculateBoundsForPiece(res.index)
	copy(buf[begin:end], res.buf)
	donePieces++
}
close(workQueue)
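
上面的代码用到了几个原文没有展示的类型和辅助函数:Torrent、pieceWork、pieceResult、calculatePieceSize 和 calculateBoundsForPiece。下面给出一份按照上文用法推断出来的示意定义(不一定和原作者的完整实现完全一致):

// Torrent holds the data required to download a torrent from a list of peers
type Torrent struct {
	Peers       []peers.Peer
	PeerID      [20]byte
	InfoHash    [20]byte
	PieceHashes [][20]byte
	PieceLength int
	Length      int
	Name        string
}

// pieceWork describes one piece a worker should download
type pieceWork struct {
	index  int
	hash   [20]byte
	length int
}

// pieceResult carries a downloaded and verified piece back to the collector
type pieceResult struct {
	index int
	buf   []byte
}

// calculateBoundsForPiece returns the byte range a piece covers within the whole file
func (t *Torrent) calculateBoundsForPiece(index int) (begin int, end int) {
	begin = index * t.PieceLength
	end = begin + t.PieceLength
	if end > t.Length {
		end = t.Length
	}
	return begin, end
}

// calculatePieceSize accounts for the last piece, which is usually shorter than the rest
func (t *Torrent) calculatePieceSize(index int) int {
	begin, end := t.calculateBoundsForPiece(index)
	return end - begin
}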

We’ll spawn a worker goroutine for each peer we’ve received from the tracker. It’ll connect and handshake with the peer, and then start retrieving work from the workQueue, attempting to download it, and sending downloaded pieces back through the results channel.

对于从追踪器接收到的每个对等方,我们将生成一个工作协程(goroutine)。它将与对等方建立连接并进行握手,然后开始从工作队列中获取任务,尝试下载它,并通过结果通道发送已下载的片段。

blog.jse.li__torrent__download

func (t *Torrent) startDownloadWorker(peer peers.Peer, workQueue chan *pieceWork, results chan *pieceResult) {
	c, err := client.New(peer, t.PeerID, t.InfoHash)
	if err != nil {
		log.Printf("Could not handshake with %s. Disconnecting\n", peer.IP)
		return
	}
	defer c.Conn.Close()
	log.Printf("Completed handshake with %s\n", peer.IP)

	c.SendUnchoke()
	c.SendInterested()

	for pw := range workQueue {
		if !c.Bitfield.HasPiece(pw.index) {
			workQueue <- pw // Put piece back on the queue
			continue
		}

		// Download the piece
		buf, err := attemptDownloadPiece(c, pw)
		if err != nil {
			log.Println("Exiting", err)
			workQueue <- pw // Put piece back on the queue
			return
		}

		err = checkIntegrity(pw, buf)
		if err != nil {
			log.Printf("Piece #%d failed integrity check\n", pw.index)
			workQueue <- pw // Put piece back on the queue
			continue
		}

		c.SendHave(pw.index)
		results <- &pieceResult{pw.index, buf}
	}
}
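
其中 checkIntegrity 只是把下载到的缓冲区做一次 SHA-1,再与种子文件里对应分块的哈希进行比较。一个示意写法如下(非原文代码):

import (
	"bytes"
	"crypto/sha1"
	"fmt"
)

// checkIntegrity compares the SHA-1 of a downloaded piece against the expected hash
func checkIntegrity(pw *pieceWork, buf []byte) error {
	hash := sha1.Sum(buf)
	if !bytes.Equal(hash[:], pw.hash[:]) {
		return fmt.Errorf("index %d failed integrity check", pw.index)
	}
	return nil
}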

Managing state

We’ll keep track of each peer in a struct, and modify that struct as we read messages. It’ll include data like how much we’ve downloaded from the peer, how much we’ve requested from them, and whether we’re choked. If we wanted to scale this further, we could formalize this as a finite state machine. But a struct and a switch are good enough for now.

我们将在一个结构体中跟踪每个对等方,并在读取消息时修改该结构体。结构体将包括我们从对等方下载了多少数据,我们向他们请求了多少数据,以及我们是否被阻塞的信息。如果我们希望进一步扩展,我们可以将其形式化为有限状态机。但是目前来说,一个结构体和一个 switch 语句已经足够了。

type pieceProgress struct {
	index      int
	client     *client.Client
	buf        []byte
	downloaded int
	requested  int
	backlog    int
}

func (state *pieceProgress) readMessage() error {
	msg, err := state.client.Read() // this call blocks
	if err != nil {
		return err
	}
	if msg == nil { // keep-alive
		return nil
	}
	switch msg.ID {
	case message.MsgUnchoke:
		state.client.Choked = false
	case message.MsgChoke:
		state.client.Choked = true
	case message.MsgHave:
		index, err := message.ParseHave(msg)
		if err != nil {
			return err
		}
		state.client.Bitfield.SetPiece(index)
	case message.MsgPiece:
		n, err := message.ParsePiece(state.index, state.buf, msg)
		if err != nil {
			return err
		}
		state.downloaded += n
		state.backlog--
	}
	return nil
}
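
readMessage 里用到的 message.ParseHave 和 message.ParsePiece 原文没有给出。它们分别从 have 消息里取出 piece 索引,以及把 piece 消息携带的数据块拷贝进缓冲区。下面是一个按协议格式推断出来的示意实现(需要 "encoding/binary" 和 "fmt",不保证与原作者的代码一致):

// ParseHave extracts the piece index from a have message
func ParseHave(msg *Message) (int, error) {
	if msg.ID != MsgHave {
		return 0, fmt.Errorf("expected HAVE (ID %d), got ID %d", MsgHave, msg.ID)
	}
	if len(msg.Payload) != 4 {
		return 0, fmt.Errorf("expected payload length 4, got length %d", len(msg.Payload))
	}
	return int(binary.BigEndian.Uint32(msg.Payload)), nil
}

// ParsePiece copies the block carried by a piece message into buf and returns how many bytes were written
func ParsePiece(index int, buf []byte, msg *Message) (int, error) {
	if msg.ID != MsgPiece {
		return 0, fmt.Errorf("expected PIECE (ID %d), got ID %d", MsgPiece, msg.ID)
	}
	if len(msg.Payload) < 8 {
		return 0, fmt.Errorf("payload too short: %d", len(msg.Payload))
	}
	parsedIndex := int(binary.BigEndian.Uint32(msg.Payload[0:4]))
	if parsedIndex != index {
		return 0, fmt.Errorf("expected index %d, got %d", index, parsedIndex)
	}
	begin := int(binary.BigEndian.Uint32(msg.Payload[4:8]))
	if begin >= len(buf) {
		return 0, fmt.Errorf("begin offset too high: %d >= %d", begin, len(buf))
	}
	data := msg.Payload[8:]
	if begin+len(data) > len(buf) {
		return 0, fmt.Errorf("data too long (%d) for offset %d with buffer length %d", len(data), begin, len(buf))
	}
	copy(buf[begin:], data)
	return len(data), nil
}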

Time to make requests!

Files, pieces, and piece hashes aren’t the full story—we can go further by breaking down pieces into blocks. A block is a part of a piece, and we can fully define a block by the index of the piece it’s part of, its byte offset within the piece, and its length. When we make requests for data from peers, we are actually requesting blocks. A block is usually 16KB large, meaning that a single 256 KB piece might actually require 16 requests.

文件、片段和片段哈希并不是全部的故事——我们可以通过将片段进一步拆分为块来进行更深入的处理。一个块是片段的一部分,我们可以通过它所属的片段索引、在片段内的字节偏移和长度来完全定义一个块。当我们向对等方请求数据时,实际上是在请求块。一个块通常为16KB大小,这意味着单个256KB的片段可能需要16次请求才能完全获取。

A peer is supposed to sever the connection if they receive a request for a block larger than 16KB. However, based on my experience, they’re often perfectly happy to satisfy requests up to 128KB. I only got moderate gains in overall speed with larger block sizes, so it’s probably better to stick with the spec.

如果对等方接收到一个大于16KB的块请求,它应该断开连接。然而,根据我的经验,它们通常很乐意满足多达128KB的请求。使用更大的块大小只能带来适度的整体速度提升,所以遵循规范可能更好一些。
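
具体到消息格式,请求一个块就是发送一条 ID 为 6(request)的消息,负载是三个大端序的 uint32:piece 索引、块在该 piece 内的字节偏移、以及块长度。后文用到的 c.SendRequest 大致就是把这样一条消息序列化后写入连接。下面是一个构造请求消息的示意(函数名 FormatRequest 是我自己起的):

// FormatRequest builds a request message for one block of a piece
func FormatRequest(index, begin, length int) *Message {
	payload := make([]byte, 12)
	binary.BigEndian.PutUint32(payload[0:4], uint32(index))
	binary.BigEndian.PutUint32(payload[4:8], uint32(begin))
	binary.BigEndian.PutUint32(payload[8:12], uint32(length))
	return &Message{ID: MsgRequest, Payload: payload}
}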

Pipelining

Network round-trips are expensive, and requesting each block one by one will absolutely tank the performance of our download. Therefore, it’s important to pipeline our requests such that we keep up a constant pressure of some number of unfulfilled requests. This can increase the throughput of our connection by an order of magnitude.

网络往返时间很昂贵,逐个请求每个块将严重降低下载性能。因此,重要的是通过流水线方式发出请求,以保持一定数量的未满足请求的持续压力。这可以将连接的吞吐量提高一个数量级

blog.jse.li__torrent__pipelining

Classically, BitTorrent clients kept a queue of five pipelined requests, and that’s the value I’ll be using. I found that increasing it can up to double the speed of a download. Newer clients use an adaptive queue size to better accommodate modern network speeds and conditions. This is definitely a parameter worth tweaking, and it’s pretty low hanging fruit for future performance optimization.

经典的 BitTorrent 客户端通常保持一个含有 5 个流水线请求的队列,这也是我将使用的值。我发现增大这个值最多可以把下载速度提高到原来的两倍。较新的客户端使用自适应的队列大小,以更好地适应现代网络的速度和状况。这绝对是一个值得调整的参数,也是未来性能优化中比较容易摘到的果实(low hanging fruit)。

// MaxBlockSize is the largest number of bytes a request can ask for
const MaxBlockSize = 16384

// MaxBacklog is the number of unfulfilled requests a client can have in its pipeline
const MaxBacklog = 5

func attemptDownloadPiece(c *client.Client, pw *pieceWork) ([]byte, error) {
	state := pieceProgress{
		index:  pw.index,
		client: c,
		buf:    make([]byte, pw.length),
	}

	// Setting a deadline helps get unresponsive peers unstuck.
	// 30 seconds is more than enough time to download a 262 KB piece
	c.Conn.SetDeadline(time.Now().Add(30 * time.Second))
	defer c.Conn.SetDeadline(time.Time{}) // Disable the deadline

	for state.downloaded < pw.length {
		// If unchoked, send requests until we have enough unfulfilled requests
		if !state.client.Choked {
			for state.backlog < MaxBacklog && state.requested < pw.length {
				blockSize := MaxBlockSize
				// Last block might be shorter than the typical block
				if pw.length-state.requested < blockSize {
					blockSize = pw.length - state.requested
				}

				err := c.SendRequest(pw.index, state.requested, blockSize)
				if err != nil {
					return nil, err
				}
				state.backlog++
				state.requested += blockSize
			}
		}

		err := state.readMessage()
		if err != nil {
			return nil, err
		}
	}

	return state.buf, nil
}

main.go

This is a short one. We’re almost there.

这一部分很短,我们马上就要完成了。

package main

import (
	"log"
	"os"

	"github.com/veggiedefender/torrent-client/torrentfile"
)

func main() {
	inPath := os.Args[1]
	outPath := os.Args[2]

	tf, err := torrentfile.Open(inPath)
	if err != nil {
		log.Fatal(err)
	}

	err = tf.DownloadToFile(outPath)
	if err != nil {
		log.Fatal(err)
	}
}

This isn’t the full story

For brevity, I included only a few of the important snippets of code. Notably, I left out all the glue code, parsing, unit tests, and the boring parts that build character. View my full implementation if you’re interested.

为了简洁起见,我只放了几个重要的代码片段。值得注意的是,我省略了所有的胶水代码、解析代码、单元测试,以及那些无聊但『磨练心性』的部分。如果你有兴趣,可以查看我的完整实现。
