
ineptr2 includes a file-based caching system that avoids redundant API calls, supports resuming interrupted downloads, and lets you download data without holding it all in memory. This article explains how it works.

Enabling the cache

Caching is off by default. Enable it at construction or at any point during a session:

# At construction
ine <- INEClient$new(use_cache = TRUE)

# Or later
ine$use_cache <- TRUE

The default cache directory is the system’s user cache folder for R packages (tools::R_user_dir("ineptr2", "cache")). You can override it:

ine <- INEClient$new(use_cache = TRUE, cache_dir = "my_cache")
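
To see where the default location resolves on a given machine, the base R call quoted above can be run on its own. This is a small sketch: only tools::R_user_dir() comes from the text, and the variable name is illustrative.

```r
# Resolve the per-user cache directory that R assigns to a package.
# tools::R_user_dir() is available in R >= 4.0.0.
default_cache_dir <- tools::R_user_dir("ineptr2", which = "cache")

# The path is only computed here; nothing is created on disk.
print(default_cache_dir)
print(dir.exists(default_cache_dir))
```

The exact path depends on the operating system and on environment variables such as R_USER_CACHE_DIR, so no expected output is shown.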

What gets cached

The cache has multiple layers, each serving a different purpose:

| Layer | File pattern | Format | Purpose |
| --- | --- | --- | --- |
| Chunks | ine_{indicator}_{lang}_chunks/chunk_0001.json | JSON | Raw API responses, one file per request |
| Processed data | ine_{indicator}_{lang}_data.rds | RDS | Tidy data frame ready to use |
| Metadata | ine_{indicator}_{lang}_meta.json | JSON | Indicator properties (name, dimensions, dates) |
| Catalog | ine_catalog_{lang}.xml | XML | Full INE indicator catalog |

Every cached file name includes the language, and every indicator-specific file also includes the indicator code, so switching ine$lang between "PT" and "EN" maintains separate caches.
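
The naming scheme can be illustrated with a small helper that builds the file paths from the patterns listed above. The function cache_paths() and its return structure are hypothetical and are not part of the package's API; only the file patterns come from this article.

```r
# Hypothetical helper: builds cache paths following the documented patterns.
cache_paths <- function(cache_dir, indicator, lang) {
  prefix <- sprintf("ine_%s_%s", indicator, lang)
  list(
    chunks  = file.path(cache_dir, paste0(prefix, "_chunks")),
    data    = file.path(cache_dir, paste0(prefix, "_data.rds")),
    meta    = file.path(cache_dir, paste0(prefix, "_meta.json")),
    catalog = file.path(cache_dir, sprintf("ine_catalog_%s.xml", lang))
  )
}

pt <- cache_paths("my_cache", "0008273", "PT")
en <- cache_paths("my_cache", "0008273", "EN")

# No path is shared between the two languages.
length(intersect(unlist(pt), unlist(en)))
#> [1] 0
```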

Chunk cache and manifests

A download is split into chunks, each fetched with its own HTTP request. Each chunk is saved as an individual JSON file inside a directory, with a manifest file alongside it:

my_cache/
  ine_0008273_EN_chunks/
    chunk_0001.json
    chunk_0002.json
    chunk_0003.json
  ine_0008273_EN_manifest.json

The manifest is a JSON file that tracks the download state:

{
  "indicator": "0008273",
  "lang": "EN",
  "total_chunks": 3,
  "urls": ["https://...chunk1", "https://...chunk2", "https://...chunk3"],
  "complete": false
}

The complete flag is set to true only after every chunk has been downloaded and validated. This is the mechanism that enables resume support.

Resuming interrupted downloads

If a download is interrupted (by a network timeout, a session crash, or simply closing R), the manifest and any completed chunks remain on disk. When you call download_data() again, ineptr2:

  1. Reads the existing manifest
  2. Checks which chunks are already cached and valid (non-empty, valid JSON)
  3. Skips those and downloads only the remaining chunks

# Session 1: starts downloading, gets interrupted at chunk 40 of 120
ine$download_data("0008206")

# Session 2 (later): resumes from chunk 41
ine$download_data("0008206")
#> Resuming download: 40/120 chunks cached
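
The three resume steps can be sketched in base R as follows. The helper pending_chunks() is hypothetical, and it simplifies the validity check to "file exists and is non-empty"; the JSON validation described above would need a JSON parser and is left out here.

```r
# Hypothetical helper: returns the indices of chunks that still need downloading.
pending_chunks <- function(chunk_dir, total_chunks) {
  paths <- file.path(chunk_dir, sprintf("chunk_%04d.json", seq_len(total_chunks)))
  sizes <- file.size(paths)            # NA when a file is missing
  cached <- !is.na(sizes) & sizes > 0  # simplified validity check
  which(!cached)
}

# Simulate an interrupted download: 2 of 5 chunks made it to disk.
chunk_dir <- file.path(tempdir(), "ine_demo_EN_chunks")
dir.create(chunk_dir, showWarnings = FALSE)
writeLines('{"ok": true}', file.path(chunk_dir, "chunk_0001.json"))
writeLines('{"ok": true}', file.path(chunk_dir, "chunk_0002.json"))
invisible(file.create(file.path(chunk_dir, "chunk_0003.json")))  # empty, so invalid

todo <- pending_chunks(chunk_dir, total_chunks = 5)
todo
#> [1] 3 4 5
```

Only the chunks in todo would be requested again; chunks 1 and 2 are skipped.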

Each chunk is first written to a temporary .part file and only renamed to its final .json name after the JSON is validated. This prevents half-written files from corrupting the cache.
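
A minimal sketch of this write-then-rename pattern is shown below. The helper write_chunk() and its is_valid argument are hypothetical; the default check only looks at the first non-blank character and is a crude stand-in for real JSON validation.

```r
# Hypothetical helper: write to "<path>.part", validate, then rename.
write_chunk <- function(text, path,
                        is_valid = function(x) substr(trimws(x), 1, 1) %in% c("{", "[")) {
  part <- paste0(path, ".part")
  writeLines(text, part)
  if (!is_valid(text)) {
    unlink(part)           # discard the bad temporary file
    return(FALSE)
  }
  file.rename(part, path)  # the final name appears only after validation
}

out_dir <- file.path(tempdir(), "ine_part_demo")
dir.create(out_dir, showWarnings = FALSE)

ok  <- write_chunk('{"value": 1}', file.path(out_dir, "chunk_0001.json"))
bad <- write_chunk("<html>error</html>", file.path(out_dir, "chunk_0002.json"))

c(ok, bad)
#> [1]  TRUE FALSE
list.files(out_dir)
#> [1] "chunk_0001.json"
```

Because the final file name only ever refers to validated content, a reader of the cache never has to guess whether a chunk file is complete.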

Processed data cache

After chunks are downloaded and assembled, get_data() processes them into a tidy data frame and caches the result as an .rds file. This cache also stores which dimension filters were used.

On subsequent calls, the cache is reused only if the new request is equal to or a subset of what was previously cached. For example:

# First call: fetches and caches data for three regions
ine$get_data("0008273", dim2 = c("11", "15", "17"))

# Second call: served from cache (equal to the cached filters)
ine$get_data("0008273", dim2 = c("11", "15", "17"))

# Third call: served from cache (subset of the cached filters)
ine$get_data("0008273", dim2 = c("11", "17"))

# Fourth call: cache miss — "20" was not in the original request
ine$get_data("0008273", dim2 = c("11", "20"))

This prevents a changed filter from silently returning incomplete data.
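
The "equal to or a subset of" rule can be sketched as a small predicate. The helper filters_covered() is hypothetical, as is the convention that a dimension absent from a filter list means "unfiltered" (all values); the package may represent filters differently.

```r
# Hypothetical helper: TRUE when the cached filters cover the requested ones.
# Filters are named lists such as list(dim2 = c("11", "15")).
filters_covered <- function(cached, requested) {
  for (d in names(cached)) {
    # The cache was restricted on this dimension, so the request must be
    # restricted too, and only to values that were cached.
    if (is.null(requested[[d]])) return(FALSE)
    if (!all(requested[[d]] %in% cached[[d]])) return(FALSE)
  }
  TRUE
}

cached_filters <- list(dim2 = c("11", "15", "17"))

filters_covered(cached_filters, list(dim2 = c("11", "15", "17")))  # equal
#> [1] TRUE
filters_covered(cached_filters, list(dim2 = c("11", "17")))        # subset
#> [1] TRUE
filters_covered(cached_filters, list(dim2 = c("11", "20")))        # "20" not cached
#> [1] FALSE
```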

Cache invalidation

The chunk cache is automatically invalidated when dimension filters change between downloads. ineptr2 detects this by comparing the API URLs in the manifest against the URLs generated by the new request — different filters produce different URLs.

# Downloads with one set of filters
ine$download_data("0008273", dim2 = c("11", "15"))

# Different filters: chunk cache is cleared and download starts from scratch
ine$download_data("0008273", dim2 = c("11", "15", "17"))
#> Dimension filters changed. Clearing chunk cache.
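
The URL comparison can be sketched in a couple of lines. The helper chunks_reusable() is hypothetical, the URLs are placeholders rather than real API endpoints, and the sketch assumes the URLs are compared in order.

```r
# Hypothetical helper: cached chunks are kept only if the URL lists match exactly.
chunks_reusable <- function(manifest_urls, new_urls) {
  identical(manifest_urls, new_urls)
}

# Placeholder URLs standing in for the ones stored in the manifest.
old_urls <- c("https://example.org/api?dim2=11", "https://example.org/api?dim2=15")
new_urls <- c(old_urls, "https://example.org/api?dim2=17")

chunks_reusable(old_urls, old_urls)  # same filters: keep the chunks
#> [1] TRUE
chunks_reusable(old_urls, new_urls)  # filters changed: clear and restart
#> [1] FALSE
```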

As a general rule, if you are unsure which dimensions you will need and disk space is not a concern, download the full indicator once and then work from the cache.

You can also manually clear the cache:

# Clear cache for one indicator
ine$clear_cache("0008273")

# Clear everything
ine$clear_cache()

Inspecting the cache

Use list_cached() to see what’s currently stored:

ine$list_cached()
#>   indicator has_metadata has_data chunks_downloaded chunks_total download_complete
#> 1   0008273         TRUE     TRUE                 3            3              TRUE
#> 2   0008206         TRUE    FALSE                40          120             FALSE

This is useful for checking the state of partial downloads or deciding what to clear.
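
As an illustration, partial downloads can be picked out with ordinary data frame subsetting. The sketch below assumes list_cached() returns a data frame with the column names printed above, and it uses a hand-built copy of that output so it runs without the package.

```r
# Stand-in for the result of ine$list_cached(), copied from the output above.
cached <- data.frame(
  indicator         = c("0008273", "0008206"),
  has_metadata      = c(TRUE, TRUE),
  has_data          = c(TRUE, FALSE),
  chunks_downloaded = c(3L, 40L),
  chunks_total      = c(3L, 120L),
  download_complete = c(TRUE, FALSE),
  stringsAsFactors  = FALSE
)

# Indicators whose download can still be resumed, with chunks left to fetch.
partial <- cached[!cached$download_complete, ]
partial$remaining <- partial$chunks_total - partial$chunks_downloaded
partial[, c("indicator", "remaining")]
```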

download_data() vs get_data()

download_data() always writes to the file cache, even when use_cache = FALSE. This is by design — its purpose is to populate the cache for later use via load_raw_data():

ine <- INEClient$new()  # use_cache defaults to FALSE

# Downloads to cache without loading into memory
ine$download_data("0008206")

# Later: load the raw cached data
raw <- ine$load_raw_data("0008206")

get_data() respects the use_cache setting: when enabled, it checks the processed data cache before fetching; when disabled, it always hits the API.
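
The behaviour of get_data() resembles a read-through cache, which can be sketched as below. Everything in this block is hypothetical: get_cached() and fake_fetch() are illustrative names, fake_fetch() stands in for the API call, and the sketch assumes nothing is written to the processed cache when caching is disabled.

```r
# Hypothetical read-through helper; `fetch` stands in for the API call.
get_cached <- function(cache_file, fetch, use_cache = FALSE) {
  if (use_cache && file.exists(cache_file)) {
    return(readRDS(cache_file))  # cache hit: no API call
  }
  result <- fetch()              # cache miss, or caching disabled
  if (use_cache) saveRDS(result, cache_file)
  result
}

calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1            # count how often the "API" is hit
  data.frame(value = 1:3)
}

cache_file <- tempfile(fileext = ".rds")
invisible(get_cached(cache_file, fake_fetch, use_cache = TRUE))   # fetches and stores
invisible(get_cached(cache_file, fake_fetch, use_cache = TRUE))   # served from disk
invisible(get_cached(cache_file, fake_fetch, use_cache = FALSE))  # fetches again
calls
#> [1] 2
```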