Data Encoding and Encryption Using Python Libraries
Overview
Teaching: 20 min
Exercises: 20 min
Questions
How do we encode data on a digital computer?
How do we encrypt information using AES?
How do we serialize and deserialize data structures using JSON?
Objectives
numpy: Working with Arrays
NumPy is a module implementing fundamental aspects of scientific computing for Python.
It provides programmers with several tools, among which is the NumPy array.
A NumPy array is an N-dimensional array: a container of elements of the same type, usually numbers.
A NumPy array is similar to the list covered earlier, but more efficient for scientific computations.
To create a NumPy array, import the numpy module and then call the numpy.array function:
import numpy
arr1 = numpy.array([1,2,3,4,5])
arr2 = numpy.array((6,7,8,9))
arr3 = numpy.array(10,11,12,13)
arr1
arr2
arr3
array([1, 2, 3, 4, 5])
array([6, 7, 8, 9])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: only 2 non-keyword arguments accepted
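Notice the error for arr3. You can call numpy.array with a list ([]) or a tuple (()), but not by just listing the numbers as separate arguments. The exact error type and message depend on the NumPy version.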
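As a brief illustration of the properties described above, the sketch below builds a few arrays and inspects them. The variable names matrix, mixed, and doubled are made up for this example; ndim, shape, and dtype are standard attributes of a NumPy array.

```python
import numpy

# A one-dimensional array built from a list
arr1 = numpy.array([1, 2, 3, 4, 5])

# A two-dimensional array built from a nested list
matrix = numpy.array([[1, 2, 3], [4, 5, 6]])
print(matrix.ndim)    # number of dimensions
print(matrix.shape)   # size along each dimension

# All elements share one type: mixing an integer and a real number
# makes NumPy store both as floating-point values
mixed = numpy.array([1, 2.5])
print(mixed.dtype)

# Arithmetic is applied element by element, without an explicit loop
doubled = arr1 * 2
print(doubled)
```

With a plain Python list, arr1 * 2 would repeat the list instead of doubling each number, which is one practical difference between the two containers.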
Working with Binary Data
In the previous episode, we introduced two kinds of data types in Python: numerical (integers and real numbers) and string types. However, these human-centric concepts are eventually represented in the computer as a series of binary bits (zeros and ones). How these data are represented as bits and interpreted from bits is a matter of encoding. Modern computers store data in units of bytes: a byte is an eight-bit integer, whose value can be anywhere between 0 and 255 (inclusive).
What Does My Data Look Like Under the Computer’s Hood?
Here are some examples of how different types of data are encoded in a computer:
A larger integer consists of multiple bytes. For example, a 64-bit integer consists of an ordered sequence of eight bytes.
A standard real number in Python is the so-called IEEE 754 double precision data type, which under the hood is a 64-bit object containing one bit for the sign, 11 bits for the exponent, and the remaining 52 bits for the mantissa (significant digits).
An ASCII string is a sequence of bytes in which only the 7 lowest bits can be nonzero (therefore, only 128 distinct characters are possible). ASCII is today’s universal representation for basic Latin characters (unaccented letters A-Z and a-z, digits 0-9, and standard punctuation marks). See the figure below for the standard ASCII encoding. For example, the letters A and a are ASCII characters 65 and 97, respectively.
The Unicode standard greatly extends the character set of the ASCII standard by accommodating characters from virtually all known languages (Western and Eastern European, Russian, Chinese, Korean, Japanese, Thai, Arabic, Hindi, Bengali, and many more) and common symbols and graphics (math operators, arrows, and, more recently, emojis). Unicode has now become a universal standard in computing; therefore, Python 3 uses Unicode for its string data type. The Unicode standard continues to evolve; as of 2020 there are more than 140,000 characters supported. How can an ever-growing number of Unicode characters be represented? There are many standard encodings, including UTF-8, UTF-16, and UTF-32. UTF-8 is backward compatible with ASCII: bytes valued 0 through 127 map to the same characters as in ASCII. Characters with higher numbers are encoded in special ways using multiple bytes. UTF-32 is the most straightforward encoding but the least space efficient (requiring exactly four bytes per character).
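To make these examples concrete, here is a small sketch that looks at the bytes behind an integer, a real number, and a few characters. The sample values (258, -2.5, and the three characters) are arbitrary choices for illustration, and the field widths used to split the real number (1, 11, and 52 bits) are those of the IEEE 754 double-precision layout.

```python
import struct

# A 64-bit integer is an ordered sequence of eight bytes
int_bytes = (258).to_bytes(8, byteorder="big")
print(int_bytes.hex())

# A Python float is an IEEE 754 double: pack it into 8 bytes,
# then split the 64 bits into sign, exponent, and mantissa
bits = int.from_bytes(struct.pack(">d", -2.5), byteorder="big")
sign = bits >> 63
exponent = (bits >> 52) & 0x7FF
mantissa = bits & ((1 << 52) - 1)
print(sign, exponent, mantissa)

# ASCII codes of the letters A and a
print(ord("A"), ord("a"))

# The same character can take a different number of bytes in each encoding
samples = ["A", "\u00e9", "\u20ac"]   # A, e with acute accent, euro sign
for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf32_len = len(ch.encode("utf-32-be"))
    print(hex(ord(ch)), utf8_len, utf32_len)
```

The loop shows the trade-off mentioned above: UTF-8 uses one to four bytes depending on the character, while UTF-32 always uses four.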
Why do we need to be concerned with the low-level matters when discussing encryption? Encryption works on the low-level representation of the data. The encryption algorithm per se does not know how to encrypt a human phrase such as “Go Monarch!” unless the data has been represented in the binary form as an array of bytes.
In the first half of our lesson module, we will consider how text (string) data is represented as a string of bytes for the purpose of encryption and decryption. In the second half, we will consider how arbitrarily long binary data (integers, to be specific) can be encoded efficiently for data transmission.
As mentioned earlier, Python strings use Unicode characters and therefore require mapping to a string of bytes.
In Python, the bytes datatype contains a sequence of bytes.
In many ways this is analogous to str, which contains a sequence of Unicode characters.
The difference is that bytes can store arbitrary data, whereas str must strictly conform to the Unicode standard.
Conversion is quite simple with Python 3:
S = "Hello world"
B = bytes(S, encoding="utf-8")
print(B)
There are alternative ways which you may see elsewhere:
B_alt1 = S.encode("utf-8")
import codecs
B_alt2 = codecs.encode(S, "utf-8")
In the last code snippet, the "utf-8" argument is optional in the encode method/function calls, since UTF-8 is already the default encoding.
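The conversion also works in the other direction. The sketch below decodes the bytes back into a string and illustrates the difference between bytes and str noted above. The variable names word, raw, and decoded_ok are made up for this example.

```python
S = "Hello world"
B = S.encode("utf-8")

# Indexing or listing a bytes object yields integers in the range 0..255
print(list(B[:5]))

# Decoding reverses the encoding
assert B.decode("utf-8") == S

# Outside ASCII, one character may need more than one byte
word = "na\u00efve"   # five characters; the third is an i with a diaeresis
print(len(word), len(word.encode("utf-8")))

# bytes can hold arbitrary data, but not every byte sequence is valid UTF-8
raw = bytes([0xFF, 0xFE])
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)
```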
Encoding vs. Encryption
Encoding is a means of representing data, as we understand it, in terms of bits (or bytes) on the computer. Encryption, on the other hand, is a process of obfuscating or hiding information to prevent its disclosure to unauthorized parties. (Some authors would include encryption as a part of encoding; but in this lesson module we will differentiate the two to make it clear that the goal of encoding, in contrast to encryption, is not to hide the information from unauthorized parties.) One big contrast is that encoding does not involve a secret key.
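A quick way to see the contrast is to encode a message and then undo the encoding without any secret. Base64 is used here only as a familiar example of an encoding from the Python standard library; it is not otherwise part of this lesson.

```python
import base64

message = "Go Monarch!".encode("utf-8")

# Encoding: a public, fixed mapping; no key is involved
encoded = base64.b64encode(message)
print(encoded)

# Anyone can reverse it without knowing any secret
recovered = base64.b64decode(encoded)
print(recovered.decode("utf-8"))
```

The encoded bytes may look scrambled, but they offer no protection: reversing them requires only knowing which encoding was used.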
Equivalence of Data Representations
To conclude the discussions in the preceding sections, the four objects below represent the same data (UTF-8 string-to-byte encoding scheme is implied here):
- a Unicode string: 'Hello world'
- a byte string: b'Hello world'
- a hex string: 48656c6c6f20776f726c64 (optionally prefixed with 0x)
- a long integer: 87521618088882671231069284
Now that we know how to convert between the different representations, we can convert a string message to a form suitable for encryption.
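The four representations listed above can be reached from one another with a few built-in calls. This is a minimal sketch of one possible route; the variable names are arbitrary.

```python
S = "Hello world"

B = S.encode("utf-8")   # Unicode string -> byte string
H = B.hex()             # byte string    -> hex string
N = int(H, 16)          # hex string     -> long integer

print(B)
print(H)
print(N)

# Going back: integer -> hex string -> byte string -> Unicode string
H_back = format(N, "x")
if len(H_back) % 2 == 1:
    # bytes.fromhex() needs two hex digits per byte
    H_back = "0" + H_back
B_back = bytes.fromhex(H_back)
S_back = B_back.decode("utf-8")
print(S_back)
```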
Data Representation Conversion: C/C++ vs. Python
Those who program in C may not need to think as much about this kind of encoding, because a C string is essentially a string of bytes, and conversions between data types are often done silently in C by casting pointer types (e.g. casting char * to void *). Even a single character (char) in C is basically an integer, no different from int other than the number of bits! Not so in Python: data types are strictly adhered to, despite the dynamic nature of the language. Explicit data conversions described in this lesson must therefore be performed with great care, as they carry the risk of corrupting data. An example: it is absurd to perform arithmetic (additions, subtractions, multiplications, divisions) on the long integer above, 87521618088882671231069284, without respecting the fact that the number was meant to represent a text string, not an ordinary numerical value.
AES encryption
AES refers to the Advanced Encryption Standard, which was developed in the late 1990s and adopted by the U.S. National Institute of Standards and Technology (NIST) as the encryption standard in 2001. It is widely used worldwide today. AES is a symmetric-key encryption algorithm, meaning that the same key is used for encryption and decryption. This key must therefore be kept secret. The key can be 128, 192, or 256 bits long; the longer the key, the stronger the encryption.
In a nutshell, AES uses a sequence of reversible bit-scrambling operations involving the secret key and a carefully crafted byte-for-byte mapping called the “S-box” (short for substitution box).
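The lesson does not prescribe how a key should be chosen. One common approach, sketched here as an assumption rather than as part of the provided material, is to draw the key from a cryptographically secure random source such as the secrets module in the Python standard library. The names KEY_BITS, key_bytes, and key_int are hypothetical.

```python
import secrets

KEY_BITS = 128   # AES also allows 192 or 256

# Draw KEY_BITS / 8 random bytes from the operating system's secure source
key_bytes = secrets.token_bytes(KEY_BITS // 8)

# The same key viewed as an integer, for code that expects an integer key
key_int = int.from_bytes(key_bytes, byteorder="big")

print(len(key_bytes), "bytes")
print(hex(key_int))
```

Because the key is random, the printed value differs on every run; in practice it would be stored or shared securely rather than printed.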
How to encrypt a message in Python
We conveniently provide you with a module called aes that implements the AES algorithm in Python.
What we want to do here is encrypt and decrypt a message using the provided module.
The following code snippet shows how to do this:
import codecs
# Assumes the provided module file is aes.py and defines a class named AES
from aes import AES
# The master key (a secret) must be no longer than 128 bits (16 bytes):
master_key = 0x5e413c
# Initialize "E", the object that performs the encryption / decryption:
E = AES(master_key)
# The plaintext must fit in a single 16-byte block,
# so the string must be at most 16 ASCII characters
text_string = 'Idea Fusion'
# Encode the text to bytes, then to a hex string
plaintext_string = codecs.encode(text_string.encode(), 'hex')
# Convert the hex string to an integer for encryption
plaintext = int(plaintext_string, 16)
print('The plaintext in decimal is:', plaintext)
# Do the encryption
ciphertext = E.encrypt(plaintext)
print('The ciphertext in hexadecimal is:', hex(ciphertext))  # up to 16 bytes
# Do the decryption
decry = E.decrypt(ciphertext)
print('The decrypted text in decimal is:', decry)
# Get the hex digits as a string, dropping the "0x" prefix
hex_str = hex(decry)[2:]
# Convert the hex digits back to bytes
decode_hex = bytes.fromhex(hex_str)
# Decode the bytes into ASCII text
decode_text = codecs.decode(decode_hex, 'ascii')
print('The decrypted text is:', decode_text)
What is done in the above listing is:
- The message to encrypt is first converted to a hexadecimal string, then to an integer.
- The resulting integer is then encrypted using E.encrypt.
- Next, the encrypted message is decrypted using E.decrypt.
- The result from decrypt is converted back to a human-readable string.
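The first and last steps above (text to integer and back) can be tried without the AES module. The sketch below wraps them in two hypothetical helper functions, text_to_int and int_to_text, which are not part of the provided module. It uses int.from_bytes and int.to_bytes instead of going through a hex string, and it checks that the text fits in one 16-byte AES block.

```python
BLOCK_BYTES = 16   # AES encrypts one 128-bit (16-byte) block at a time


def text_to_int(text):
    """Encode text as UTF-8 and read the bytes as one big-endian integer."""
    data = text.encode("utf-8")
    if len(data) > BLOCK_BYTES:
        raise ValueError("text does not fit in a single 16-byte block")
    return int.from_bytes(data, byteorder="big")


def int_to_text(number):
    """Turn the integer back into bytes, then decode the bytes as UTF-8."""
    length = (number.bit_length() + 7) // 8
    return number.to_bytes(length, byteorder="big").decode("utf-8")


plaintext = text_to_int("Idea Fusion")
print(plaintext)
print(int_to_text(plaintext))

# If the first byte is below 0x10, hex() yields an odd number of digits,
# which bytes.fromhex() would reject; to_bytes() does not have this problem
tricky = text_to_int("\tIdea")
odd_digits = len(hex(tricky)[2:]) % 2 == 1
print(odd_digits)
print(repr(int_to_text(tricky)))
```

One limitation remains in this sketch: leading zero bytes in the text are lost in the integer form, so a message starting with a NUL character would not round-trip.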
JSON
JSON is a text-based data exchange format derived from JavaScript.
It is used as a common format to serialize and deserialize data in applications that communicate with each other through the Internet.
Applications can be written in any language and run in various environments.
JSON provides a way to standardize the data format so that the applications can understand each other.
Syntax
JSON defines two data structures: objects and arrays.
An object in JSON is a set of name-value pairs, similar to a Python dictionary.
An array is a list of values, just as in Python.
Objects are enclosed in braces {}; their name-value pairs are separated by a comma , and the name and value in each pair are separated by a colon :.
Names in name-value pairs are strings.
The value in a name-value pair can be of any JSON type, including another object or an array.
Arrays are enclosed in brackets [], and the values they contain are separated by a comma ,.
Array values may be of different types, including another array or an object.
With this syntax, objects can contain other objects or arrays, and arrays can also contain other arrays or objects.
Example of JSON data
{
"Name" : "John",
"LastName" : "Doe",
"Age" : 24,
"Classes" : ["Literature", "Algebra", "Computer science"],
"Grades" : {"Literature" : "A", "Algebra" : 3.7, "Computer science" : 2.7},
"PhoneNumbers" : [{"home" : "(111) 111-1111"}, {"Mobile": 2222222222}]
}
JSON and Python
Python offers a module to handle JSON data called json.
To use this module you need to import it:
import json
Parsing JSON to Python:
Let’s say you have JSON data in the form of a string.
To access this data in your Python script, you first need to parse it into a Python variable and then use it from there:
import json
# JSON data as a string
person = '{"Name" : "John", "LastName" : "Doe", "Age" : 24, "Classes" :["Literature", "Algebra", "Computer science"], "Grades" : {"Literature" :"A", "Algebra" : 3.7, "Computer science" : 2.7}, "PhoneNumbers" : [{"home": "(111) 111-1111"}, {"Mobile": 2222222222}]}'
# Parse person to python
pyPerson = json.loads(person)
# Access the data
print(pyPerson["Name"])
print(pyPerson["Grades"])
John
{'Literature': 'A', 'Algebra': 3.7, 'Computer science': 2.7}
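Once parsed, JSON objects and arrays behave like ordinary Python dictionaries and lists, so nested values are reached by chaining keys and indices. The sketch below uses a shortened record; the "Enrolled" and "Advisor" fields are made up here to show how JSON true and null are mapped.

```python
import json

text = ('{"Name": "John", "Age": 24, "Enrolled": true, "Advisor": null, '
        '"Classes": ["Literature", "Algebra"], "Grades": {"Algebra": 3.7}}')
data = json.loads(text)

# JSON types map onto Python types
print(type(data).__name__)              # JSON object -> dict
print(type(data["Classes"]).__name__)   # JSON array  -> list
print(data["Enrolled"], data["Advisor"])

# Nested values are reached by chaining keys and indices
first_class = data["Classes"][0]
algebra_grade = data["Grades"]["Algebra"]
print(first_class, algebra_grade)

# JSON strings must use double quotes; this text is not valid JSON
try:
    json.loads("{'Name': 'John'}")
    parsed = True
except json.JSONDecodeError:
    parsed = False
print(parsed)
```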
Convert Python to JSON
To convert Python data to JSON, you can use the json.dumps function:
import json
# A python variable
cities = [{"name" : "Norfolk", "population" : 242628},
          {"name" : "Virginia Beach", "population" : 442707},
          {"name" : "Portsmouth", "population" : 95684}]
# Convert to JSON string
jsonCities = json.dumps(cities)
print(jsonCities)
[{"name": "Norfolk", "population": 242628}, {"name": "Virginia Beach", "population": 442707}, {"name": "Portsmouth", "population": 95684}]
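Two further points may be useful when serializing. The sketch below, which goes slightly beyond the example above, shows the indent and sort_keys options of json.dumps and a round trip back to Python. It also shows that a bytes value cannot be serialized directly; sending its hex string instead is one possible workaround, given here as an illustration only.

```python
import json

cities = [{"name": "Norfolk", "population": 242628},
          {"name": "Virginia Beach", "population": 442707}]

# indent and sort_keys make the output easier for people to read
pretty = json.dumps(cities, indent=2, sort_keys=True)
print(pretty)

# Parsing the JSON text gives back an equal Python structure
restored = json.loads(pretty)
print(restored == cities)

# JSON has no type for raw bytes, so json.dumps rejects them
try:
    json.dumps({"data": b"Hello world"})
    serializable = True
except TypeError:
    serializable = False
print(serializable)

# One workaround: send the bytes as a hex string and convert back on receipt
payload = json.dumps({"data": b"Hello world".hex()})
received = bytes.fromhex(json.loads(payload)["data"])
print(received)
```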
Key Points