Why does my dataset show more records than my source file?

June 5, 2026

Rémy
Community Manager

You notice that your published dataset contains more records than your source file. For example, your file has 22,000 rows, but the dataset shows 24,000+. Clearing the cache and republishing does not fix the issue.

Root cause

This is a known behaviour with the XLSX extractor on FTP sources.

When a new file is uploaded to the FTP folder under a different name than the previous one, the platform does not recognise it as a replacement; it treats it as an additional file. As a result, records from both the old and new files accumulate in the dataset instead of the old ones being replaced.

The XLSX extractor does not detect deleted files in an FTP folder. Unlike CSV, it has no mechanism to automatically flush previous data when a new file with a different name is added.
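
To make the accumulation concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not the platform's actual code: the function names and the row counts (2,500 and 22,000) are made up for illustration.

```python
# Toy model of the behaviour described above (hypothetical, not platform code).
# The dataset remembers every filename it has imported; a file is only
# replaced when a new file arrives under the same name.

def deposit(files_seen: dict[str, int], filename: str, row_count: int) -> dict[str, int]:
    """Return the extractor's state after a file is uploaded to the FTP folder."""
    updated = dict(files_seen)
    # Same name: the old rows are overwritten. New name: the rows are added.
    updated[filename] = row_count
    return updated


def dataset_record_count(files_seen: dict[str, int]) -> int:
    """Total records in the dataset: the sum over every file ever imported."""
    return sum(files_seen.values())


# Different filenames: the old rows stay, even if the old file was deleted.
state = deposit({}, "export_2026-05-12.xlsx", 2_500)
state = deposit(state, "export_2026-05-26.xlsx", 22_000)
print(dataset_record_count(state))  # 24500

# Same filename: the old rows are flushed and replaced.
state = deposit({}, "export_weekly.xlsx", 2_500)
state = deposit(state, "export_weekly.xlsx", 22_000)
print(dataset_record_count(state))  # 22000
```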

Solutions

✅ Recommended — Keep the same filename

The cleanest fix is to always use the same filename when replacing a file on the FTP. When the platform detects an update to a known file (same name), it correctly flushes the old data before importing the new version.

Example:

✅ export_weekly.xlsx → replaced each week by a new export_weekly.xlsx
❌ export_2026-05-12.xlsx → replaced by export_2026-05-26.xlsx (different name = accumulation)
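
If your export is uploaded by a script, one way to guarantee a stable name is to fix the remote filename in the upload step, whatever the local file is called. The sketch below uses Python's standard `ftplib`; the host, credentials, folder and local filename in the usage comment are placeholders, and the helper name `upload_with_fixed_name` is ours, so adapt it to your own setup.

```python
from ftplib import FTP
from pathlib import Path

# The remote name never changes, so the platform sees an update to a known file.
FIXED_REMOTE_NAME = "export_weekly.xlsx"


def upload_with_fixed_name(ftp, local_path, remote_name=FIXED_REMOTE_NAME):
    """Upload local_path to the FTP server's current folder under remote_name.

    `ftp` is an already-connected ftplib.FTP instance (or any object with a
    compatible storbinary method). STOR overwrites an existing remote file
    of the same name on most servers.
    """
    with Path(local_path).open("rb") as fh:
        ftp.storbinary(f"STOR {remote_name}", fh)
    return remote_name


# Usage (placeholder host, credentials and paths, replace with your own):
#
# with FTP("ftp.example.com") as ftp:
#     ftp.login("my_user", "my_password")
#     ftp.cwd("/exports")
#     upload_with_fixed_name(ftp, "export_2026-05-26.xlsx")
```

The dated filename can still be kept locally for archiving; only the copy on the FTP needs the constant name.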

🔧 Workaround — Unpublish / Republish manually

If renaming files is not an option, you can force a full reset:

1. Go to the dataset in edit mode.
2. Click Unpublish.
3. Immediately click Publish again.

This forces the platform to re-read the FTP folder from scratch and discard accumulated records.

⚠️ Note: this workaround is manual and needs to be repeated after each file update. It is not compatible with automated scheduling if filenames change every time.

🔄 Alternative — Switch to CSV format

CSV files behave differently: the platform supports cache clearing before republication, which avoids accumulation. However, this requires converting your source files to CSV, and the operational benefit over the manual unpublish/republish approach is limited.

What about the scheduler?

If your dataset uses a scheduled republication, the scheduler will not resolve the accumulation issue on its own unless filenames are kept consistent. It triggers a normal republication, which does not flush previously accumulated records from differently-named files.

Planned improvement

This behaviour is expected to evolve as part of an ongoing platform rework. Future versions of the XLSX extractor will handle deleted or renamed files more gracefully.

Have a question or a different use case? Drop a comment below!