Encoding & Optimization for Multilingual SMS

When sending SMS messages, whether through the Digital Engagement Platform, API, or SFTP, understanding how character encoding works is essential for managing message length, segmentation, and cost. Messages that include characters outside the standard SMS alphabet, such as certain accented letters, emojis, or non-Latin scripts, require a different encoding, which reduces the number of characters allowed per segment. Even a single non-standard character can trigger a switch in encoding mode, causing the message to split into multiple segments and increasing delivery costs.

This section explains how encoding affects message size and when and why encoding switches occur, and it provides practical recommendations to help you craft message text that minimizes segmentation and optimizes delivery efficiency.

SMS Character Encoding

SMS technology has strict encoding standards that determine how characters are represented and transmitted. When composing messages in Messangi—either through the platform UI or APIs—the character set you use affects:

  • The encoding type (GSM-7 or Unicode).
  • The message length limit.
  • The cost per message, since messages exceeding a certain length are split into multiple segments.

Understanding how encoding works and which characters trigger Unicode helps you keep messages within a single segment and avoid unexpected billing or delivery behavior.

GSM-7 Encoding

The GSM-7 character set is a 7-bit encoding standard designed for SMS. When your message contains only characters covered by GSM-7 (including its “extended” subset), the system uses GSM-7.

For a message fully encoded in GSM-7, the maximum length is 160 characters (one segment).

If a GSM-7 message exceeds 160 characters, the system splits it into multiple segments (concatenated SMS). In that case, each segment typically carries up to 153 characters, because a concatenation header occupies the space of 7 characters in every segment.

The following table lists the characters supported by the GSM-7 character set. These are recognized as standard SMS characters, although their display on the recipient’s mobile phone may vary depending on the handset’s compatibility and font support.

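  • Uppercase letters: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  • Lowercase letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
  • Digits: 0 1 2 3 4 5 6 7 8 9
  • Accented and special letters: Ä Å Æ Ç É Ñ Ö Ø Ü ß à ä å æ è é ì ñ ò ö ø ù ü
  • Greek capitals: Δ Φ Γ Λ Ω Π Ψ Σ Θ Ξ
  • Punctuation and symbols: @ £ $ ¥ ¤ § ¡ ¿ _ ! " # % & ' ( ) * + , - . / : ; < = > ?
  • Whitespace and control: space (SP), line feed (LF), carriage return (CR), escape (ESC, used to introduce the extended characters)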

The characters listed below are included in the GSM-7 character set but require an escape sequence when used in an SMS message, which means each counts as two characters toward the total message length:

^, {, }, \, [, ], ~, |, €

Unicode (UCS-2) Encoding

If the message contains any character not supported by GSM-7, the system switches the encoding to Unicode (UCS-2).

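Under Unicode, a single-segment message holds at most 70 characters. Each UCS-2 character takes 16 bits (two bytes) instead of 7 bits, so the same 140-byte payload that carries 160 GSM-7 characters carries only 70 Unicode characters. Characters outside the Basic Multilingual Plane, which includes most emojis, are typically sent as a pair of 16-bit units and therefore count as two characters.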

When concatenation occurs under Unicode, each subsequent segment can typically carry approximately 67 characters (due to metadata overhead).

Thus, a single non-GSM-7 character triggers the lower capacity limit and may substantially increase segments and cost.
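As a rough illustration of the rules above, the following Python sketch decides which encoding a message would need and counts the units it consumes. The function name `sms_units` and the two constants are illustrative and not part of any Messangi API. The character sets follow the standard GSM-7 default alphabet and the escape characters from its extension table (including the euro sign); a real gateway may apply additional rules.

```python
# GSM-7 default alphabet (ESC itself is omitted because it is not typed directly).
GSM7_BASIC = frozenset(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extended characters: each one is sent as ESC + character, so it costs two units.
GSM7_EXTENDED = frozenset("^{}\\[]~|€")


def sms_units(text):
    """Return (encoding, units) for a message body.

    Illustrative helper: 'units' are 7-bit characters for GSM-7
    and 16-bit code units for UCS-2.
    """
    if all(ch in GSM7_BASIC or ch in GSM7_EXTENDED for ch in text):
        units = sum(2 if ch in GSM7_EXTENDED else 1 for ch in text)
        return "GSM-7", units
    # Any other character forces Unicode for the whole message.
    # Characters outside the BMP (most emojis) take two 16-bit units.
    units = len(text.encode("utf-16-be")) // 2
    return "UCS-2", units
```

For instance, `sms_units("[OK]")` counts the two brackets twice, while a single "ó" anywhere in the text moves the whole message to UCS-2.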

GSM-7 vs Unicode (UCS-2)

The choice between GSM-7 and Unicode encoding has a direct impact on message segmentation, delivery, and billing. The following table summarizes the key differences between both encoding standards:

Encoding Type | Character Set | Max Characters (Single SMS) | Characters per Segment (Concatenated) | Common Use Cases
GSM-7 | Basic Latin alphabet, digits, selected punctuation marks, and a small set of accented letters. | 160 | 153 | Standard English or Latin-based text that stays within the GSM-7 character set.
Unicode (UCS-2) | Supports all global scripts, symbols, and emojis. | 70 | 67 | Messages that include non-Latin characters, accented letters outside GSM-7, emojis, or complex symbols.

In practice, the encoding is automatically determined by the characters used in the message. As soon as one non-GSM-7 character is detected, the system switches to Unicode encoding for the entire message.
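The limits in the comparison above translate into a simple segment calculation. The sketch below is a minimal version that assumes you already know the encoding and the number of units in the message; `segment_count` and `LIMITS` are hypothetical names used only for illustration. A real gateway may produce one extra segment in edge cases, for example when an escaped character would otherwise be split across two segments.

```python
import math

# (single-segment limit, per-segment limit when concatenated)
LIMITS = {
    "GSM-7": (160, 153),
    "UCS-2": (70, 67),
}


def segment_count(units, encoding):
    """Estimate how many segments a message of `units` characters needs."""
    single, multi = LIMITS[encoding]
    if units <= single:
        return 1  # fits in one segment, no concatenation header needed
    # Every segment of a concatenated message carries the header,
    # so all of them use the smaller per-segment limit.
    return math.ceil(units / multi)
```

With these limits, a 161-character GSM-7 message already needs 2 segments, and a 71-character Unicode message does too.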

The following table lists characters frequently responsible for switching to Unicode or consuming extra units in GSM-7, along with recommended replacements when your goal is to stay within one segment.

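Character | Description | Recommended Replacement | Example
á, â, ã | Accented “a” variations outside GSM-7 | a | “más” → “mas”
ê, ë | Accented “e” variations outside GSM-7 | e | “você” → “voce”
í, î, ï | Accented “i” variations outside GSM-7 | i | “país” → “pais”
ó, ô, õ | Accented “o” variations outside GSM-7 | o | “avión” → “avion”
ú, û | Accented “u” variations outside GSM-7 | u | “tú” → “tu”
ç | Lowercase cedilla “c” | c | “façade” → “facade”
€ | Euro symbol (GSM-7 extended character; counts as two characters) | EUR | “€50” → “EUR 50”
–, — | En dash / Em dash | - | Replace “–” with “-”
‘, ’, “, ” | Curly quotation marks | ' , " | Replace “smart quotes” with straight ones
… | Ellipsis | ... | Use three dots
Emojis / symbols | Pictograms or icons | Remove or use text equivalents | “✔ Confirmed” → “Confirmed”

The letters à, ä, å, è, é, ì, ñ, ò, ö, ù, and ü belong to the standard GSM-7 default alphabet, so they do not trigger Unicode on their own and normally do not need replacing. If you are unsure how a particular route or carrier handles them, send a test message first.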

Using plain equivalents may alter the tone or even the meaning of the text, especially in Spanish. Weigh readability against cost and segment optimization before replacing characters.
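The replacements can be applied automatically before a message is submitted. The following sketch is one possible approach, not a platform feature: `REPLACEMENTS` and `simplify` are hypothetical names, and the map covers only characters that fall outside the standard GSM-7 alphabet. The euro sign is left alone here because it is a GSM-7 extended character and "EUR" is longer than the two units it costs.

```python
# Map characters outside GSM-7 to plain equivalents (illustrative subset).
REPLACEMENTS = {
    "\u00e1": "a", "\u00e2": "a", "\u00e3": "a",  # á â ã
    "\u00ea": "e", "\u00eb": "e",                 # ê ë
    "\u00ed": "i", "\u00ee": "i", "\u00ef": "i",  # í î ï
    "\u00f3": "o", "\u00f4": "o", "\u00f5": "o",  # ó ô õ
    "\u00fa": "u", "\u00fb": "u",                 # ú û
    "\u00e7": "c",                                # ç
    "\u2013": "-", "\u2014": "-",                 # en dash, em dash
    "\u2018": "'", "\u2019": "'",                 # curly single quotes
    "\u201c": '"', "\u201d": '"',                 # curly double quotes
    "\u2026": "...",                              # ellipsis
}
_TABLE = str.maketrans(REPLACEMENTS)


def simplify(text):
    """Replace common Unicode-triggering characters with GSM-7 text."""
    return text.translate(_TABLE)
```

Letters that GSM-7 already supports, such as "ñ", pass through unchanged, so the result should still be checked for any remaining non-GSM-7 characters (emojis, for example) before sending.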

Conclusion

  • If your message is encoded as GSM-7 and contains 160 characters or fewer, it is sent as one segment.

  • If it exceeds 160 characters in GSM-7, it is sent as multiple segments of up to 153 characters each. Concatenation metadata is added to link the parts together; it takes the space of 7 characters in every segment, reducing the available capacity from 160 to 153 characters. For example, a 180-character GSM-7 message is sent as 2 segments (153 + 27 = 180).

  • If the message includes any non-GSM-7 character, the encoding switches to Unicode (UCS-2). A single Unicode segment holds up to 70 characters. In a concatenated Unicode message, each segment holds 67 usable characters, since the space of 3 characters is reserved for concatenation information. Compared with GSM-7, the per-segment maximum drops from 160 to 70 for a single segment and from 153 to 67 for concatenated segments.

  • The impact on cost is direct: each additional segment is a billable unit.

  • Example: A 140-character message that uses only GSM-7 characters will be sent as 1 segment. If the same message includes an emoji or any unsupported character, the encoding switches to Unicode. Because each Unicode segment holds 67 usable characters, the 140-character message will require 3 segments (67 + 67 + 6) — effectively tripling the delivery cost.

  • For high-volume SMS campaigns, even small increases in per-message segments can significantly affect the budget.
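To make the budget effect concrete, the sketch below multiplies out the 140-character example for a campaign. The price per segment and the audience size are made-up figures for illustration only, and `campaign_cost` is a hypothetical helper; actual rates depend on your contract and destination.

```python
from decimal import Decimal


def campaign_cost(recipients, segments_per_message, price_per_segment):
    """Total cost when every segment is billed as one unit."""
    return recipients * segments_per_message * price_per_segment


PRICE = Decimal("0.01")   # hypothetical price per segment
RECIPIENTS = 100_000      # hypothetical audience size

gsm7_cost = campaign_cost(RECIPIENTS, 1, PRICE)  # 140 characters, GSM-7
ucs2_cost = campaign_cost(RECIPIENTS, 3, PRICE)  # same text with one emoji

print(f"GSM-7: {gsm7_cost}  Unicode: {ucs2_cost}")
```

Under these assumed figures, a single emoji in the template adds two billable segments per recipient across the whole campaign.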

By understanding which characters trigger Unicode encoding and following best practices for content creation, you can optimize SMS delivery, maintaining clarity in communication while controlling costs and ensuring efficient message distribution.