What is the limit of the encoding base for Unicode strings, as opposed to base64 having base = 64?


This is related to code golf in general, but applicable elsewhere. People commonly use base64 encoding to store large amounts of binary data in source code.

Assuming all programming languages are happy to read Unicode source code, what is the maximum n for which we can reliably devise a baseN encoding?

Reliability here means being able to encode/decode the data: every single combination of input bytes must be encodable and then decodable back. The encoded form itself is free from this rule.

The main goal is to minimize the character count, regardless of the byte count.

Would base2147483647 (32-bit) be possible?
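For concreteness, any n is mechanically workable: treat the input bytes as one big integer and rewrite it in base n. Here is a minimal Python sketch, assuming a hypothetical alphabet string of n distinct characters (picking those characters safely is what this question is really about):

    def encode(data: bytes, alphabet: str) -> str:
        n = len(alphabet)
        # Prefix a 0x01 sentinel so leading zero bytes survive the round trip.
        value = int.from_bytes(b"\x01" + data, "big")
        digits = []
        while value:
            value, d = divmod(value, n)
            digits.append(alphabet[d])
        return "".join(reversed(digits))

    def decode(text: str, alphabet: str) -> bytes:
        n = len(alphabet)
        value = 0
        for ch in text:
            value = value * n + alphabet.index(ch)
        raw = value.to_bytes((value.bit_length() + 7) // 8, "big")
        return raw[1:]  # strip the 0x01 sentinel

    # Example with a 3-character toy alphabet:
    assert decode(encode(b"\x00\x00hi", "abc"), "abc") == b"\x00\x00hi"

Each character carries log2(n) bits, so the encoded length is roughly 8 * len(data) / log2(n) characters; base2147483647 would spend about 31 bits per character.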

Also, because I know this may vary from browser to browser, and we may have problems copy-pasting code from codegolf answers into our editors, copy-paste-ability is a factor here. I know there are Unicode ranges of characters that are not displayed.

Note: I know that for binary data, base64 expands the data, but here character count is the main factor.

It depends on how reliable you want the encoding to be. Character encodings are designed with trade-offs, and in general the more characters allowed, the less universally accepted it is, i.e. the less reliable. Base64 isn't immune to this. RFC 3548, published in 2003, mentions that case sensitivity may be an issue, and that the characters + and / may be problematic in some scenarios. It describes base32 (no lowercase) and base16 (hex digits) as potentially safer alternatives.
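For reference, all three of those encodings are exposed by Python's standard base64 module (the sample bytes below are an arbitrary illustration):

    import base64

    data = b"\x00\xffgolf"
    print(base64.b64encode(data))  # b'AP9nb2xm'         - case-sensitive, may emit + and /
    print(base64.b32encode(data))  # b'AD7WO33MMY======' - single case, safer to transport
    print(base64.b16encode(data))  # b'00FF676F6C66'     - hex digits, most universally accepted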

It does not get better with Unicode. Adding that many characters introduces many more possible points of failure. Depending on how stringent your requirements are, you might end up with different values of n. I'll cover a few possibilities from large n to small n, adding a requirement each time.

  • 1,114,112: code points. This is the number of possible code points defined by the Unicode standard.
  • 1,112,064: valid UTF. This excludes the surrogates, which cannot stand on their own.
  • 1,111,998: valid for exchange between processes. Unicode reserves 66 code points as permanent non-characters for internal use only. Theoretically, this is the maximum n you could justifiably expect in a copy-paste scenario, though as you noted, in practice many other Unicode strings will fail that exercise.
  • 120,503: printable characters only, depending on your definition. Here I've defined it as all characters outside of the Other and Separator general categories. Also, starting with this bullet point, n is subject to change in future versions of Unicode (a brute-force sketch after this list shows how to count these).
  • 103,595: NFKD normalized Unicode. Unfortunately, many processes automatically normalize Unicode input to a standardized form. If a process along the way used NFKC or NFKD, some information may have been lost. For more reliability, the encoding should define a normalization form, with NFKD being better for increasing the character count.
  • 101,684: no combining characters. These are "characters" which shouldn't stand on their own, such as accents, and which are meant to be combined with another base character. Some processes might panic if they are left standing alone, or if there are too many combining characters on a single base character. I've now also excluded the Mark category.
  • 85: Ascii85, aka Base85. Okay, you want your ASCII back. It's no longer Unicode, but I felt like mentioning it because it's a lesser-known ASCII-only encoding. It's used in Adobe's PostScript and PDF formats, and has a 5:4 encoded data size increase, rather than base64's 4:3 ratio (see the snippet below).
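If you want to reproduce the printable/NFKD/no-marks tiers yourself, a brute-force pass over all code points with Python's unicodedata module is enough. Note that the exact totals will drift with the Unicode version bundled with your interpreter, which is precisely the versioning caveat above, so don't expect them to match these figures digit for digit:

    import unicodedata

    printable = nfkd_stable = no_marks = 0

    for cp in range(0x110000):              # all 1,114,112 code points
        ch = chr(cp)
        cat = unicodedata.category(ch)
        if cat[0] in "CZ":                  # drop Other (incl. surrogates and
            continue                        # noncharacters) and Separator
        printable += 1
        if unicodedata.normalize("NFKD", ch) != ch:
            continue                        # would be altered by NFKD
        nfkd_stable += 1
        if cat[0] != "M":                   # finally drop combining marks
            no_marks += 1

    print(printable, nfkd_stable, no_marks)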
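Ascii85 also happens to be in Python's standard library (since 3.4), should you want to check the 5:4 ratio yourself:

    import base64

    data = bytes(range(16))
    enc = base64.a85encode(data)          # Ascii85 alphabet, as in PostScript/PDF
    print(enc, len(enc) / len(data))      # 20 characters for 16 bytes -> 5:4
    assert base64.a85decode(enc) == data  # round-trips losslessly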
