peppol_sync.py
Overview
The peppol_sync.py script is a command-line tool designed to synchronize business card data from the PEPPOL directory. It downloads a large XML file, splits it into smaller, more manageable XML files organized by country, and stores them in the extracts/ directory.
Command-line Usage
usage: peppol_sync.py [-h] [-V] [-F] [-C] [-K] [-T TMP] [-M MAX] {sync,check,download,huge}
Synchronize PEPPOL export into git-managed files
positional arguments:
{sync,check,download,huge}
Action to perform
options:
-h, --help show this help message and exit
-V, --verbose Enable verbose output
-F, --force Force re-download of XML file even if it exists
-C, --nocleanup Do not delete existing XML files in extracts/ before starting (default: delete)
-K, --keep-tmp Keep temporary files after processing (default: delete)
-T, --tmp TMP Temporary directory (default: tmp)
-M, --max MAX Maximum number of bytes per output file (default: 1000000)
Actions
The first argument to the script must be one of the following actions:
- sync: This is the main action. It performs the entire synchronization process, which includes downloading the XML file (if necessary) and splitting it into country-specific files.
- check: This action checks the configuration and prints the temporary and extracts directories.
- download: This action only downloads the PEPPOL business card XML file and saves it to the temporary directory.
- huge: This action lists the largest XML files found in the extracts/ directory.
Options
- -h, --help: Shows the help message and exits.
- -V, --verbose: Enables verbose output, providing more detailed information about the script's execution.
- -F, --force: Forces the script to re-download the main XML file, even if a local copy already exists.
- -C, --nocleanup: By default, the script deletes all existing XML files in the extracts/ directory before starting a new sync. This flag prevents the cleanup, preserving the existing files.
- -K, --keep-tmp: Prevents the script from deleting temporary files (like the downloaded XML) after processing is complete.
- -T, --tmp TMP: Specifies the temporary directory to use for downloading files. Defaults to tmp.
- -M, --max MAX: Sets the maximum size in bytes for each output XML file. When a file exceeds this size, a new one is created. Defaults to 2000000 (2MB).
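The actions and options above map naturally onto an argparse parser. The following is a minimal sketch reconstructed from the help text only, not the script's actual source; the function name build_parser is hypothetical. The default for --max is taken from the help output (1000000), while the prose gives 2000000, so treat that value as an assumption.

```python
import argparse


def build_parser():
    """Hypothetical reconstruction of the CLI described in the help text."""
    parser = argparse.ArgumentParser(
        prog="peppol_sync.py",
        description="Synchronize PEPPOL export into git-managed files",
    )
    parser.add_argument(
        "action",
        choices=["sync", "check", "download", "huge"],
        help="Action to perform",
    )
    parser.add_argument("-V", "--verbose", action="store_true",
                        help="Enable verbose output")
    parser.add_argument("-F", "--force", action="store_true",
                        help="Force re-download of XML file even if it exists")
    parser.add_argument("-C", "--nocleanup", action="store_true",
                        help="Do not delete existing XML files in extracts/")
    parser.add_argument("-K", "--keep-tmp", action="store_true",
                        help="Keep temporary files after processing")
    parser.add_argument("-T", "--tmp", default="tmp",
                        help="Temporary directory")
    # Default copied from the help output; the real script may differ.
    parser.add_argument("-M", "--max", type=int, default=1000000,
                        help="Maximum number of bytes per output file")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["sync", "-F", "-M", "500000"])
print(args.action, args.force, args.keep_tmp, args.tmp, args.max)
```

Because the action is declared with choices, argparse rejects any first argument other than the four listed actions before the script does any work.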
Functionality
The PeppolSync class handles the entire workflow:
- Download Phase (download_xml() at line 70)
  - Streams XML from https://directory.peppol.eu/export/businesscards
  - Saves to tmp/directory-export-business-cards.xml
  - Shows progress every 100MB
  - Skips download if file exists (override with -F)
- Processing Phase (process_xml() at line 153)
  - Uses text-based chunking (1MB chunks) for memory efficiency
  - Parses business cards with lxml.etree for fast XML handling
  - Extracts country code from <entity countrycode="XX">
  - Extracts registration date from <regdate> for statistics
  - Writes pretty-printed XML to country directories
- File Splitting Logic (lines 228-250)
  - Splits files when they exceed max_bytes (default: 2MB)
  - Sequential naming: business-cards.000001.xml, business-cards.000002.xml, etc.
  - Each country has its own directory: extracts/BE/, extracts/NO/, etc.
  - Automatically creates header and footer tags for valid XML
- Report Generation (generate_report() at line 269)
  - Creates extracts/report.md with country statistics
  - Shows file count, card count, and size per country
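The download phase described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the body of download_xml(): the helper names stream_to_file and download_export, the 1MB read size, and the progress message format are all hypothetical. Only the URL, the target file name, the 100MB progress interval, and the skip-if-exists behavior come from the description.

```python
import io
import tempfile
import urllib.request
from pathlib import Path

EXPORT_URL = "https://directory.peppol.eu/export/businesscards"


def stream_to_file(source, target, chunk_size=1024 * 1024,
                   report_every=100 * 1024 * 1024):
    """Copy a binary file-like object to target in chunks; return bytes written."""
    total = 0
    next_report = report_every
    with Path(target).open("wb") as out:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            total += len(chunk)
            if total >= next_report:
                print(f"{total:,} bytes downloaded")
                next_report += report_every
    return total


def download_export(target, force=False):
    """Hypothetical wrapper: skip the download when the file already exists."""
    target = Path(target)
    if target.exists() and not force:
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(EXPORT_URL) as response:
        stream_to_file(response, target)
    return target


# Offline demonstration: an in-memory stream stands in for the HTTP response.
with tempfile.TemporaryDirectory() as tmp:
    demo_target = Path(tmp) / "directory-export-business-cards.xml"
    written = stream_to_file(io.BytesIO(b"x" * 2500), demo_target,
                             chunk_size=1000, report_every=1000)
    size_on_disk = demo_target.stat().st_size
```

Reading in fixed-size chunks keeps memory use bounded by chunk_size regardless of how large the export is.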
Running the sync tool
# Full sync: download + process XML (recommended for first run)
python3 peppol_sync.py sync
# Force re-download even if file exists
python3 peppol_sync.py sync -F
# Keep temporary files for debugging
python3 peppol_sync.py sync -K
# Don't delete existing extracts before processing
python3 peppol_sync.py sync -C
# Verbose output
python3 peppol_sync.py sync -V
Utility commands
# Download XML only (no processing)
python3 peppol_sync.py download
# Check configuration
python3 peppol_sync.py check
# Show largest output files
python3 peppol_sync.py huge -n 20
# Custom max file size (default: 2MB)
python3 peppol_sync.py sync -M 1000000
Dependencies
# Install required Python package
pip install lxml
Key Implementation Details
Memory-Efficient XML Processing
The script processes multi-GB XML files without loading everything into memory:
- Reads in 1MB text chunks
- Splits on </businesscard> delimiter
- Parses individual cards with lxml
- Uses streaming writes to output files
Country Code Extraction
Located in extract_country_from_etree() (line 130):
entity = element.find(".//entity")
if entity is not None:
    return entity.get("countrycode")
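For a self-contained illustration of the same lookup, the sketch below uses the standard library's xml.etree.ElementTree in place of lxml; find() and get() behave the same way for this case. The function names and the sample card are made up for the example, and the placement of <regdate> inside <entity> is an assumption.

```python
import xml.etree.ElementTree as ET

# Illustrative card only; real export entries carry more fields.
SAMPLE_CARD = (
    "<businesscard>"
    '<entity countrycode="BE">'
    "<regdate>2020-01-15</regdate>"
    "</entity>"
    "</businesscard>"
)


def extract_country(card_xml):
    """Return the countrycode attribute of the first <entity>, or None."""
    element = ET.fromstring(card_xml)
    entity = element.find(".//entity")
    if entity is not None:
        return entity.get("countrycode")
    return None


def extract_regdate(card_xml):
    """Return the text of the first <regdate>, or None if it is absent."""
    return ET.fromstring(card_xml).findtext(".//regdate")


country = extract_country(SAMPLE_CARD)
regdate = extract_regdate(SAMPLE_CARD)
missing = extract_country("<businesscard/>")
```

Returning None for cards without an <entity> element lets the caller decide how to file them, for example under a fallback directory.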
File Rotation
When a country file exceeds max_bytes:
- Writes </root> footer to close current file
- Increments sequence number in self.file_stats[country]['sequence']
- Opens new file with updated sequence
- Writes XML header to new file
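The rotation steps above can be sketched with a small per-country writer. This is an illustration, not the script's implementation: the class name CountryWriter, the exact header text, and the choice to check the size just before writing the next card are assumptions. The file naming pattern and the </root> footer follow the description.

```python
import tempfile
from pathlib import Path

XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n<root>\n'  # assumed header
XML_FOOTER = "</root>\n"


class CountryWriter:
    """Hypothetical helper: writes cards for one country and rotates files."""

    def __init__(self, extracts_dir, country, max_bytes):
        self.directory = Path(extracts_dir) / country
        self.directory.mkdir(parents=True, exist_ok=True)
        self.max_bytes = max_bytes
        self.sequence = 0
        self.handle = None
        self.size = 0
        self._open_next()

    def _open_next(self):
        # Increment the sequence, open the new file, and write its header.
        self.sequence += 1
        path = self.directory / f"business-cards.{self.sequence:06d}.xml"
        self.handle = path.open("w", encoding="utf-8", newline="\n")
        self.handle.write(XML_HEADER)
        self.size = len(XML_HEADER.encode("utf-8"))

    def _close_current(self):
        # The footer keeps every finished file well-formed.
        self.handle.write(XML_FOOTER)
        self.handle.close()

    def write_card(self, card_xml):
        # Rotate only when the current file has already exceeded the limit.
        if self.size > self.max_bytes:
            self._close_current()
            self._open_next()
        self.handle.write(card_xml)
        self.size += len(card_xml.encode("utf-8"))

    def close(self):
        self._close_current()


with tempfile.TemporaryDirectory() as tmp:
    writer = CountryWriter(tmp, "BE", max_bytes=100)
    for i in range(5):
        writer.write_card(f'<businesscard id="{i}"/>\n')
    writer.close()
    files = sorted((Path(tmp) / "BE").glob("*.xml"))
    names = [f.name for f in files]
    contents = [f.read_text(encoding="utf-8") for f in files]
```

Checking the size before the next write, instead of right after the previous one, avoids leaving an empty trailing file when the last card happens to push a file over the limit.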
Cleanup Behavior
- Temporary files (tmp/): Deleted after processing by default (keep with -K)
- Extract files (extracts/**/*.xml): Deleted before each sync by default (preserve with -C)
- Log file (extracts/peppol_sync.log): Overwritten on each run
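The first two cleanup rules can be sketched with pathlib and shutil. The function names clean_extracts and clean_tmp are hypothetical, and the real script may organize this differently; the sketch only mirrors the documented defaults and the effect of -C and -K.

```python
import shutil
import tempfile
from pathlib import Path


def clean_extracts(extracts_dir, nocleanup=False):
    """Delete extracts/**/*.xml unless nocleanup is set; return files removed."""
    if nocleanup:
        return 0
    # Collect matches first so files are not deleted while the tree is walked.
    xml_files = list(Path(extracts_dir).glob("**/*.xml"))
    for xml_file in xml_files:
        xml_file.unlink()
    return len(xml_files)


def clean_tmp(tmp_dir, keep_tmp=False):
    """Remove the temporary directory unless keep_tmp is set."""
    if keep_tmp:
        return False
    shutil.rmtree(tmp_dir, ignore_errors=True)
    return True


# Demonstration in a throwaway directory tree.
with tempfile.TemporaryDirectory() as base:
    extracts = Path(base) / "extracts"
    tmp_dir = Path(base) / "tmp"
    (extracts / "BE").mkdir(parents=True)
    tmp_dir.mkdir()
    (extracts / "BE" / "business-cards.000001.xml").write_text("<root/>")
    (extracts / "report.md").write_text("# report")
    (tmp_dir / "directory-export-business-cards.xml").write_text("<root/>")

    preserved = clean_extracts(extracts, nocleanup=True)  # behaves like -C
    removed = clean_extracts(extracts)                    # default behavior
    report_kept = (extracts / "report.md").exists()
    tmp_removed = clean_tmp(tmp_dir)                      # default behavior
    tmp_gone = not tmp_dir.exists()
```

Only files matching *.xml are removed from the extracts tree, so non-XML files such as report.md survive the cleanup in this sketch.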