🧱 Chunk is a download manager for slow and unstable servers

  • By Eduardo Cuducos
  • Last update: Dec 15, 2022
  • Comments: 14

Chunk

Chunk is a download tool for slow and unstable servers.

Usage

CLI

Install it with go install github.com/cuducos/chunk, then:

$ chunk <URLs>

Use --help for detailed instructions.

API

The Download method returns a channel with DownloadStatus statuses. This channel is closed once all downloads are finished, but the user is in charge of handling errors.

Simplest use case

d := chunk.DefaultDownloader()
ch := d.Download(urls)

Customizing some options

d := chunk.DefaultDownloader()
d.MaxRetries = 42
ch := d.Download(urls)

Customizing everything

d := chunk.Downloader{...}
ch := d.Download(urls)
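
Since error handling is left to the caller, a consumer loop could look like the sketch below. To stay self-contained it does not import chunk: status is a hypothetical stand-in for DownloadStatus (whose fields are not described here), and sampleStatuses is a made-up source that plays the role of the channel returned by Download.

```go
package main

import (
	"errors"
	"fmt"
)

// status is a hypothetical stand-in for chunk.DownloadStatus; the field
// names are assumptions made for this example only.
type status struct {
	URL string
	Err error
}

// sampleStatuses plays the role of the channel returned by Download: it
// emits a few statuses and is closed once everything is finished.
func sampleStatuses() <-chan status {
	ch := make(chan status, 3)
	ch <- status{URL: "https://example.com/a.zip"}
	ch <- status{URL: "https://example.com/b.zip", Err: errors.New("timeout")}
	ch <- status{URL: "https://example.com/a.zip"}
	close(ch)
	return ch
}

// drain reads every status until the channel is closed and returns the
// URLs that reported an error at least once.
func drain(ch <-chan status) []string {
	seen := map[string]bool{}
	var failed []string
	for s := range ch {
		if s.Err != nil && !seen[s.URL] {
			seen[s.URL] = true
			failed = append(failed, s.URL)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed:", drain(sampleStatuses()))
}
```

Ranging over the channel ends on its own when the channel is closed, so the loop needs no extra "done" signal.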

How?

It uses HTTP range requests, retries per HTTP request (not per file), prevents re-downloading the same content range and supports wait time to give servers time to recover.

Download using HTTP range requests

In order to complete downloads from slow and unstable servers, the download should be done in “chunks” using HTTP range requests. This does not rely on long-standing HTTP connections, and it makes it easier to decide how long is too long to wait for a response.
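
As an illustration of a single range request (not chunk's actual code), the sketch below asks a local test server for bytes 4 to 7 of a small payload. fetchRange and demo are hypothetical names, and the 5-second client timeout is an arbitrary choice.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// fetchRange downloads bytes start..end (inclusive) of url using an HTTP
// Range header and expects a 206 Partial Content response.
func fetchRange(c *http.Client, url string, start, end int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("server ignored the range request: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// demo serves a 16-byte payload from a local test server and fetches a
// 4-byte slice of it.
func demo() (string, error) {
	payload := "0123456789abcdef"
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// http.ServeContent honours Range headers.
		http.ServeContent(w, r, "payload.txt", time.Time{}, strings.NewReader(payload))
	}))
	defer srv.Close()

	c := &http.Client{Timeout: 5 * time.Second}
	b, err := fetchRange(c, srv.URL, 4, 7)
	return string(b), err
}

func main() {
	s, err := demo()
	fmt.Println(s, err)
}
```

Because each request covers only a small range, a short client timeout is enough to tell a stalled request from a slow one.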

Retries by chunk, not by file

In order to be quicker and avoid rework, the primary way to handle failure is to retry that “chunk” (content range), not the whole file.
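
A minimal sketch of the idea, assuming a retry budget per chunk (the names retry, downloadChunks and simulate are hypothetical, and the chunks are fetched sequentially to keep the example short):

```go
package main

import (
	"errors"
	"fmt"
)

// retry calls fn up to attempts times and returns nil on the first success.
func retry(attempts int, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}

// downloadChunks fetches each chunk index with its own retry budget, so one
// flaky chunk never forces the others to be downloaded again.
func downloadChunks(n, maxRetries int, fetch func(i int) error) error {
	for i := 0; i < n; i++ {
		i := i
		if err := retry(maxRetries, func() error { return fetch(i) }); err != nil {
			return fmt.Errorf("chunk #%d: %w", i, err)
		}
	}
	return nil
}

// simulate downloads 3 chunks where chunk 1 fails twice before succeeding,
// and returns how many times each chunk was requested.
func simulate() ([]int, error) {
	calls := make([]int, 3)
	err := downloadChunks(3, 5, func(i int) error {
		calls[i]++
		if i == 1 && calls[i] < 3 {
			return errors.New("temporary failure")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := simulate()
	fmt.Println(calls, err)
}
```

In the simulation only the failing chunk is requested more than once; the other two are fetched exactly one time each.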

Control of which chunks are already downloaded

In order to avoid re-starting from the beginning in case of non-handled errors, chunk knows which ranges from each file were already downloaded; so, when restarted, it only downloads what is really needed to complete the downloads.
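
One way to picture this bookkeeping: split the file into fixed-size ranges and skip the ones already marked as done. The sketch below is an assumption about the general approach, not chunk's implementation; byteRange and pending are hypothetical names.

```go
package main

import "fmt"

// byteRange is an inclusive range, as used in an HTTP Range header.
type byteRange struct{ Start, End int64 }

// pending splits a file of size bytes into chunks of chunkSize bytes and
// returns only the ranges whose chunk index is not marked as done.
func pending(size, chunkSize int64, done map[int]bool) []byteRange {
	var out []byteRange
	if chunkSize <= 0 {
		return out
	}
	for i, start := 0, int64(0); start < size; i, start = i+1, start+chunkSize {
		end := start + chunkSize - 1
		if end >= size {
			end = size - 1 // the last chunk may be shorter
		}
		if !done[i] {
			out = append(out, byteRange{start, end})
		}
	}
	return out
}

func main() {
	// 20 bytes in chunks of 8 give the ranges 0-7, 8-15 and 16-19.
	// With chunk 1 already done, only the first and last remain.
	fmt.Println(pending(20, 8, map[int]bool{1: true})) // prints [{0 7} {16 19}]
}
```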

Detect server failures and give it a break

In order to avoid unnecessary stress on the server, chunk relies not only on HTTP responses but also on other signs that the connection is stale. It can recover from a stale connection and give the server some time to recover from stress.
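
A rough sketch of how a stale connection can be detected without any HTTP response, assuming a per-request deadline plus a pause between attempts. All names and durations below are hypothetical, and the "server" is simulated with timers.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// attempt runs fn with a deadline; a request that hangs past timeout counts
// as a failure even though no HTTP response ever arrived.
func attempt(timeout time.Duration, fn func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return fn(ctx)
}

// withBreaks retries fn, sleeping wait between attempts to give the server
// some time to recover.
func withBreaks(attempts int, timeout, wait time.Duration, fn func(ctx context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(wait)
		}
		if err = attempt(timeout, fn); err == nil {
			return nil
		}
	}
	return err
}

// simulate models a server that hangs on the first request and answers
// quickly afterwards; it returns how many requests were made.
func simulate() (int, error) {
	calls := 0
	err := withBreaks(3, 50*time.Millisecond, 5*time.Millisecond, func(ctx context.Context) error {
		calls++
		delay := time.Millisecond
		if calls == 1 {
			delay = time.Second // stale connection: far longer than the deadline
		}
		select {
		case <-time.After(delay):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	})
	return calls, err
}

func main() {
	calls, err := simulate()
	fmt.Println(calls, err)
}
```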

Why?

The idea of the project emerged as it was difficult for Minha Receita to handle the download of 37 files that add up to just approx. 5 GB. Most of the download solutions out there (e.g. got) seem to be prepared for downloading large files, not for downloading from slow and unstable servers, which is the case at hand.

Comments(14)

  • 1

    Adds get file size method

    Hey, I've seen that you guys are intending to use the Content-Length header from the response as a way to get the file size. As I'm learning Go, here is my small contribution to #3.

    Am I on the right path, @cuducos?
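
    For readers following along, a minimal sketch of reading a file size from the Content-Length response header of a HEAD request. This is not the code from this contribution; fileSize and demo are hypothetical names, and the server is a local test server.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// fileSize sends a HEAD request and returns the size reported by the
// Content-Length response header.
func fileSize(c *http.Client, url string) (int64, error) {
	resp, err := c.Head(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	if resp.ContentLength < 0 {
		return 0, errors.New("server did not report a content length")
	}
	return resp.ContentLength, nil
}

// demo asks a local test server for the size of a 16-byte payload.
func demo() (int64, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.ServeContent(w, r, "payload.txt", time.Time{}, strings.NewReader("0123456789abcdef"))
	}))
	defer srv.Close()
	return fileSize(&http.Client{Timeout: 5 * time.Second}, srv.URL)
}

func main() {
	size, err := demo()
	fmt.Println(size, err)
}
```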

  • 2

    Adds cache to allow downloads to pause and restart

    The suggested cache works as follows:

    chunk, once integrated with this cache system, will create a directory .cache and some files inside it:

    • .cache/.chunk_size with just the size of the chunk, e.g. 8192;
    • one file per download, e.g. .cache/a.zip and .cache/b.zip if we're downloading a.zip and b.zip.

    These last files will contain a sequence of 0 and 1, meaning which chunks are downloaded (1) or not (0). For example, if a file needs four chunks, and the second one is done, the cache file's contents will be 0100.

    Every time a new cache is created, it checks for these files, and if the chunk size matches, it loads the info (which chunks are done/pending) from the cache.
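
    The 0/1 sequence described above can be sketched as a pair of helpers. encode and decode are hypothetical names, and reading or writing the files under .cache is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// encode turns the per-chunk state into a sequence of 0s and 1s,
// where 1 means the chunk is already downloaded.
func encode(done []bool) string {
	var b strings.Builder
	for _, d := range done {
		if d {
			b.WriteByte('1')
		} else {
			b.WriteByte('0')
		}
	}
	return b.String()
}

// decode parses the sequence back, rejecting any character other than 0 or 1.
func decode(s string) ([]bool, error) {
	done := make([]bool, len(s))
	for i, r := range s {
		switch r {
		case '1':
			done[i] = true
		case '0':
			// pending chunk, nothing to mark
		default:
			return nil, fmt.Errorf("invalid character %q at position %d", r, i)
		}
	}
	return done, nil
}

func main() {
	// Four chunks, only the second one is done.
	fmt.Println(encode([]bool{false, true, false, false})) // prints 0100
}
```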

  • 3

    Add target directory

    Once we start a download we must be able to tell the Downloader where to save the files. This should be passed via CLI too (e.g. in the structure described in #12) and can have a meaningful default set to the cwd.

  • 4

    Create the parallel download semaphore/workers limit

    Implement a maximum of parallel downloads per domain/subdomain.

    1. Identify each different subdomain in all the URLs
    2. Create a fixed number of channels for each subdomain, kept in a map[string]chan
    3. Spin up a worker reading from each of these channels
    4. Process the downloads in these workers
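
    The four steps above can be sketched as follows, assuming one channel and one worker per host (the proposal leaves the exact number open). byHost, run and simulate are hypothetical names, and the download itself is replaced by a callback.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"sync"
)

// byHost groups URLs by host, which includes the subdomain.
func byHost(urls []string) (map[string][]string, error) {
	groups := map[string][]string{}
	for _, raw := range urls {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		groups[u.Host] = append(groups[u.Host], raw)
	}
	return groups, nil
}

// run creates one queue and one worker per host: URLs of the same host are
// processed one at a time, while different hosts proceed in parallel.
func run(urls []string, download func(u string)) error {
	groups, err := byHost(urls)
	if err != nil {
		return err
	}
	queues := map[string]chan string{}
	for host, list := range groups {
		q := make(chan string, len(list)) // buffered so filling never blocks
		for _, u := range list {
			q <- u
		}
		close(q)
		queues[host] = q
	}
	var wg sync.WaitGroup
	for _, q := range queues {
		wg.Add(1)
		go func(q <-chan string) {
			defer wg.Done()
			for u := range q {
				download(u)
			}
		}(q)
	}
	wg.Wait()
	return nil
}

// simulate records which URLs were processed, sorted for a stable result.
func simulate() ([]string, error) {
	urls := []string{
		"https://a.example.com/1.zip",
		"https://b.example.com/1.zip",
		"https://a.example.com/2.zip",
	}
	var mu sync.Mutex
	var got []string
	err := run(urls, func(u string) {
		mu.Lock()
		defer mu.Unlock()
		got = append(got, u)
	})
	sort.Strings(got)
	return got, err
}

func main() {
	fmt.Println(simulate())
}
```
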
  • 5

    Adds test for ZIP archive download

    UPDATE

    I was trying to replicate here a problem reported on the Windows operating system when I prototyped this project: after downloading, the ZIP file wouldn't be recognized as valid. That was happening only on Windows, and my guess was some platform specificity, such as the EOL used, so I think adding a test like that might help (since we run Windows on the CI as well).

  • 6

    Parallel downloads

    Implements the functionality of downloading chunks in parallel. It limits the number of concurrent connections (active downloads) per server through the HTTP transport.
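
    A minimal sketch of limiting connections per server at the transport level, using the standard library's MaxConnsPerHost field. newClient and perHostLimit are hypothetical helpers, and the values are arbitrary; this is not necessarily how this change configures its client.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newClient returns an HTTP client whose transport caps the number of
// simultaneous connections to any single host.
func newClient(maxPerHost int, timeout time.Duration) *http.Client {
	t := http.DefaultTransport.(*http.Transport).Clone()
	t.MaxConnsPerHost = maxPerHost
	t.MaxIdleConnsPerHost = maxPerHost
	return &http.Client{Transport: t, Timeout: timeout}
}

// perHostLimit reports the configured cap, or 0 (no limit) when the client
// does not use an *http.Transport.
func perHostLimit(c *http.Client) int {
	if t, ok := c.Transport.(*http.Transport); ok {
		return t.MaxConnsPerHost
	}
	return 0
}

func main() {
	c := newClient(4, 30*time.Second)
	fmt.Println(perHostLimit(c)) // prints 4
}
```

    With such a transport, any number of goroutines can share the client: requests beyond the cap wait for a free connection instead of opening a new one.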

  • 7

    Make Chunk directory configurable

    #38 added ~/.chunk as the default directory for saving progress files. There's a TODO to make it configurable:

    https://github.com/cuducos/chunk/blob/02999cdd7fd49644210e074e0e2ae3473e2af2c1/progress.go#L16-L28

    Maybe reading from an environment variable like CHUNK_DIR?
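
    A possible shape for that lookup, as a sketch only: progressDir is a hypothetical function that takes the environment lookup as a parameter (so it is easy to test) and falls back to ~/.chunk when CHUNK_DIR is unset or empty.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// progressDir returns the directory named by the CHUNK_DIR environment
// variable, falling back to ~/.chunk when it is unset or empty.
// getenv is injected so callers can pass os.Getenv or a fake.
func progressDir(getenv func(string) string) (string, error) {
	if dir := getenv("CHUNK_DIR"); dir != "" {
		return dir, nil
	}
	home, err := os.UserHomeDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(home, ".chunk"), nil
}

func main() {
	dir, err := progressDir(os.Getenv)
	fmt.Println(dir, err)
}
```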

  • 8

    Should we export the HTTP client to make it easier for users to customize it?

    https://github.com/cuducos/chunk/blob/53b8bf9bbc188b664d5adafe5b700e370f396886/downloader.go#L59

    Just asking, I don't have any use case in mind.

    Maybe someone wants to use cookies? Or authenticating before the download? Not sure. I think it's difficult because of the way we set it up: the HTTP client handles the heavy lifting of parallelism, but we might have advanced users… idk.

  • 9

    Fix downloading of binary (?) files

    It looks like our simple tests of downloading string-based contents from an HTTP server are OK, but downloading a binary such as a ZIP archive seems to result in a corrupted downloaded file.

    See #28 for a failing test case.

  • 10

    Adds contribution guide

    I'm sorry @vmesel. We are so happy to have you here, but we weren't expecting community contributions that early — this is not a problem. We're excited about it.

    The point is we missed the basics of being a good host, so hope this tiny cheatsheet helps you get started : )

  • 11

    Error on `go install` (following the README)

    $ docker run --rm -it golang:1.19-bullseye /bin/bash
    root@6ebcdedd1767:/go# go install github.com/cuducos/chunk@latest
    go: downloading github.com/cuducos/chunk v1.0.0
    go: downloading github.com/avast/retry-go v3.0.0+incompatible
    package github.com/cuducos/chunk is not a main package
    
  • 12

    Cannot continue stopped download on Windows

    As described in #44:

    Tried to restart download, and the following error was reported:

    (base) PS C:\Users\mauricio\chunk_teste> ..\chunk-v1.0.0-windows-amd64.exe https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip
    2022/12/26 18:52:46 could not creat a progress file: error loading existing progress file: error decoding progress file C:\Users\mauricio\.chunk\c811d2999ff5d6a15340c98b44fd8126-Estabelecimentos0.zip: unexpected EOF
    (base) PS C:\Users\mauricio\chunk_teste>
    
  • 13

    Error unzipping large file downloaded with `chunk` in Windows

    I'm using Windows 10, with Powershell (with base conda environment automatically activated).

    Tried to download the biggest file (Estabelecimentos0.zip). Had the following error:

    (base) PS C:\Users\mauricio\chunk_teste> ..\chunk-v1.0.0-windows-amd64.exe https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip --force-restart
    Downloading 622.4MB of 878.1MB  70.88%  1.4MB/s2022/12/26 18:51:31 error downloadinf chunk #90073: error downloading https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip: All attempts fail:
    #1: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
    #2: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
    #3: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
    #4: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
    #5: request to https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip ended due to timeout: context deadline exceeded
    (base) PS C:\Users\mauricio\chunk_teste>
    

    Tried to restart download, and the following error was reported:

    (base) PS C:\Users\mauricio\chunk_teste> ..\chunk-v1.0.0-windows-amd64.exe https://dadosabertos.rfb.gov.br/CNPJ/Estabelecimentos0.zip
    2022/12/26 18:52:46 could not creat a progress file: error loading existing progress file: error decoding progress file C:\Users\mauricio\.chunk\c811d2999ff5d6a15340c98b44fd8126-Estabelecimentos0.zip: unexpected EOF
    (base) PS C:\Users\mauricio\chunk_teste>
    

    With the flag --force-restart the download worked, but it started from the beginning of the file. Once again, after over 500 MB downloaded, the prior timeout error occurred. I can't restart without the --force-restart flag.

    The zip file, however, is downloaded and, when I try to unzip it (using 7-zip) it reports a data error, but saves the content (a csv file). But this file cannot be loaded in pandas or even in a spreadsheet software. In a text editor (Notepad++) it shows coherent data for the first lines (about 4.000.000), but after that it's clearly cluttered.

    With a smaller file (Empresas1.zip), it worked correctly. The file was downloaded, unzipped and opened in Pandas (4.494.859 lines)

  • 14

    Two chunk instances downloading the same file

    Do we care about the same file being downloaded simultaneously by two chunk instances? Asking because I believe this can be a pre-release feature.

    Example:

    $ ./chunk HTTP://a.b/c &
    $ ./chunk HTTP://a.b/c &
    

    The result of this sequence of operations is unknown.

    We could deal with it after https://github.com/cuducos/chunk/issues/8, by augmenting the infrastructure in place.

    cc/ @cuducos