Chunk
Chunk is a download tool for slow and unstable servers.
Usage
CLI
Install it with `go install github.com/cuducos/chunk`, then:

```console
$ chunk <URLs>
```

Use `--help` for detailed instructions.
API
The `Download` method returns a channel of `DownloadStatus` statuses. This channel is closed once all downloads are finished, but the user is in charge of handling errors.
Simplest use case
```go
d := chunk.DefaultDownloader()
ch := d.Download(urls)
```
Customizing some options
```go
d := chunk.DefaultDownloader()
d.MaxRetries = 42
ch := d.Download(urls)
```
Customizing everything
```go
d := chunk.Downloader{...}
ch := d.Download(urls)
```
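Whichever variant is used, the caller then ranges over the channel. A minimal sketch, assuming `DownloadStatus` exposes an `Error` field (an assumption for illustration, not necessarily the exact API):

```go
for status := range ch {
	if status.Error != nil {
		// handle the failed download; the channel keeps emitting
		// statuses for the remaining URLs
		log.Printf("download failed: %v", status.Error)
	}
}
```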
How?
It uses HTTP range requests, retries per HTTP request (not per file), prevents re-downloading the same content range and supports wait time to give servers time to recover.
Download using HTTP range requests
In order to complete downloads from slow and unstable servers, the download should be done in “chunks” using HTTP range requests. This avoids relying on long-standing HTTP connections and makes it predictable how long is too long to wait for a response.
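For illustration, a single content range can be fetched with Go's standard library roughly like this; it is a sketch of the technique, not chunk's actual implementation:

```go
// downloadRange fetches bytes [start, end] of url with an HTTP range request.
func downloadRange(url string, start, end int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("no range support: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```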
Retries by chunk, not by file
In order to be quicker and avoid rework, the primary way to handle failure is to retry that “chunk” (content range), not the whole file.
Control of which chunks are already downloaded
In order to avoid re-starting from the beginning in case of non-handled errors, `chunk` knows which ranges of each file were already downloaded; so, when restarted, it only downloads what is really needed to complete the downloads.
Detect server failures and give it a break
In order to avoid unnecessary stress on the server, `chunk` relies not only on HTTP responses but also on other signs that the connection is stale; it can recover from that, giving the server some time to recover from stress.
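Combining the retry-per-chunk and wait-time ideas, a hypothetical retry loop for a single content range could look like this (`downloadRange` is the sketch above; the retry and wait values are illustrative, not chunk's defaults):

```go
// Retry one content range, pausing between attempts to give the server
// a break; a failure here never forces the whole file to restart.
func downloadChunkWithRetries(url string, start, end int64, maxRetries int, wait time.Duration) ([]byte, error) {
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		data, err := downloadRange(url, start, end)
		if err == nil {
			return data, nil
		}
		lastErr = err
		time.Sleep(wait)
	}
	return nil, fmt.Errorf("range %d-%d failed after %d retries: %w", start, end, maxRetries, lastErr)
}
```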
Why?
The idea of the project emerged as it was difficult for Minha Receita to handle the download of 37 files that add up to just approximately 5 GB. Most of the download solutions out there (e.g. `got`) seem to be prepared for downloading large files, not for downloading from slow and unstable servers, which is the case at hand.
Adds get file size method
Hey, I've seen that you guys are intending to use the `Content-Length` header from the response as a way to get the file size. As I'm learning Go, here is my small contribution to #3.
Am I on the right path, @cuducos?
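For reference, a minimal sketch of that approach using only the standard library (the name `getFileSize` is illustrative, not necessarily the PR's actual identifier):

```go
// getFileSize asks the server for the file size via a HEAD request,
// without downloading the body.
func getFileSize(url string) (int64, error) {
	resp, err := http.Head(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.ContentLength < 0 {
		return 0, fmt.Errorf("no Content-Length header for %s", url)
	}
	return resp.ContentLength, nil
}
```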
Adds cache to allow downloads to pause and restart
The suggested cache works as follows: `chunk`, once integrated with this cache system, will create a directory `.cache` and some files inside it:

- `.cache/.chunk_size` with just the size of the chunk, e.g. `8192`;
- one file per download, so for `a.zip` and `b.zip` there will be `.cache/a.zip` and `.cache/b.zip`.

These last files will contain a sequence of `0` and `1`, meaning which chunks are downloaded (`1`) or not (`0`). For example, if a file needs four chunks, and the second one is done, the cache file's contents will be `0100`.

Every time a new cache is created, it checks for these files, and if the chunk size matches, it loads the info (which chunks are done/pending) from the cache. A sketch of this format follows below.
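A hypothetical sketch of reading and writing that progress file, assuming one ASCII `0`/`1` per chunk as described (the names `progress`, `save`, and `load` are illustrative, not the PR's identifiers):

```go
// progress records which chunks of a file are already downloaded.
type progress []bool

// save writes the bitmap as a string of '0' and '1', e.g. "0100".
func (p progress) save(path string) error {
	buf := make([]byte, len(p))
	for i, done := range p {
		if done {
			buf[i] = '1'
		} else {
			buf[i] = '0'
		}
	}
	return os.WriteFile(path, buf, 0o644)
}

// load restores the bitmap from a previous run's cache file.
func load(path string) (progress, error) {
	buf, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	p := make(progress, len(buf))
	for i, b := range buf {
		p[i] = b == '1'
	}
	return p, nil
}
```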
Add target directory
Once we start a download, we must be able to tell the `Downloader` where to save the files. This should be passed via the CLI too (e.g. in the structure described in #12) and can have a meaningful default set to the `cwd`.
Create the parallel download semaphore/workers limit
Implement a maximum of parallel downloads per domain/subdomain, possibly with a `map[string]chan`, as sketched below.
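A minimal sketch of that `map[string]chan` idea, assuming a fixed per-host limit (all names here are illustrative):

```go
// hostLimiter caps concurrent downloads per host, using one buffered
// channel per host as a counting semaphore.
type hostLimiter struct {
	mu    sync.Mutex
	limit int
	slots map[string]chan struct{}
}

func (h *hostLimiter) acquire(host string) {
	h.mu.Lock()
	sem, ok := h.slots[host]
	if !ok {
		sem = make(chan struct{}, h.limit)
		h.slots[host] = sem
	}
	h.mu.Unlock()
	sem <- struct{}{} // blocks while the host is at its limit
}

func (h *hostLimiter) release(host string) {
	h.mu.Lock()
	sem := h.slots[host]
	h.mu.Unlock()
	<-sem
}
```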
Adds test for ZIP archive download
UPDATE
I was trying to replicate here a problem reported on the Windows operating system when I prototyped this project: after downloading, the ZIP file wouldn't be recognized as valid. That was happening only on Windows, and my guess was some platform specificity, such as EOL handling, so I think adding a test like that might help (since we run Windows on the CI as well).
Parallel downloads
Implements the functionality of downloading chunks in parallel. It limits the number of concurrent connections (active downloads) per server through the HTTP transport.
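With the standard library, capping connections per server at the transport level looks roughly like this (the limit of 4 is an example value, not necessarily what this PR uses):

```go
client := &http.Client{
	Transport: &http.Transport{
		// the transport enforces the cap, so callers can fire requests
		// concurrently without extra bookkeeping
		MaxConnsPerHost: 4,
	},
}
```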
Make Chunk directory configurable
#38 added `~/.chunk` as the default directory for saving progress files. There's a `TODO` to make it configurable:
https://github.com/cuducos/chunk/blob/02999cdd7fd49644210e074e0e2ae3473e2af2c1/progress.go#L16-L28
Maybe reading from an environment variable like `CHUNK_DIR`?
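If we go the environment variable route, a sketch could be (the name `progressDir` is hypothetical):

```go
// progressDir prefers CHUNK_DIR when set, falling back to ~/.chunk.
func progressDir() (string, error) {
	if dir := os.Getenv("CHUNK_DIR"); dir != "" {
		return dir, nil
	}
	home, err := os.UserHomeDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(home, ".chunk"), nil
}
```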
Should we export the HTTP client to make it easier for users to customize it?
https://github.com/cuducos/chunk/blob/53b8bf9bbc188b664d5adafe5b700e370f396886/downloader.go#L59
Just asking, I don't have any use case in mind.
Maybe someone wants to use cookies? Or authenticating before the download? Not sure. I think it's difficult because of the way we set it up: the HTTP client handles the heavy lifting of parallelism, but we might have advanced users… idk.
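For example, if the client were exported, a cookie setup might look like this sketch (the `HTTPClient` field is hypothetical; exporting it is exactly what this issue is asking about):

```go
// Hypothetical: attach a cookie jar to an exported HTTP client.
jar, err := cookiejar.New(nil)
if err != nil {
	log.Fatal(err)
}
d := chunk.DefaultDownloader()
d.HTTPClient.Jar = jar // HTTPClient does not exist in the current API
ch := d.Download(urls)
```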
Fix downloading of binary (?) files
It looks like our simple tests of downloading string-based contents from an HTTP server are OK, but downloading a binary such as a ZIP archive seems to result in a corrupted downloaded file.
See #28 for a failing test case.
Adds contribution guide
I'm sorry, @vmesel. We are so happy to have you here; we just weren't expecting community contributions this early, which is not a problem at all. We're excited about it.
The point is we missed the basics of being a good host, so I hope this tiny cheatsheet helps you get started : )
Error on `go install` (following the README)
Cannot continue stopped download on Windows
As described in #44:
Error unzipping large file downloaded with `chunk` in Windows
I'm using Windows 10, with PowerShell (with the `base` conda environment automatically activated).
Tried to download the biggest file (`Estabelecimentos0.zip`). Had the following error:
Tried to restart the download, and the following error was reported:
With the flag `--force-restart` the download worked, however from the beginning of the file. Once again, after over 500 MB downloaded, the prior timeout error occurred. Can't restart without the `--force-restart` flag.
The `zip` file, however, is downloaded and, when I try to unzip it (using `7-zip`), it reports a data error, but saves the content (a `csv` file). But this file cannot be loaded in pandas or even in spreadsheet software. In a text editor (`Notepad++`) it shows coherent data for the first lines (about 4,000,000), but after that it's clearly cluttered.
With a smaller file (`Empresas1.zip`), it worked correctly. The file was downloaded, unzipped, and opened in pandas (4,494,859 lines).
Two chunk instances downloading the same file
Do we care about the same file being downloaded simultaneously by two chunk instances? Asking because I believe this can be a pre-release feature.
Example:
The result of this sequence of operations is unknown.
We could deal with it after https://github.com/cuducos/chunk/issues/8, by augmenting the infrastructure in place.
cc/ @cuducos