Defining Bytecode
- Bytecode is used in Java, Python, and other languages
- However, not all bytecode is created equal
- Bytecode is a generic term for an intermediate language used by compilers and interpreters
- For example, Java bytecode contains information about primitive data types
- On the other hand, Python bytecode does not contain this information
- As a result, the Python virtual machine (PVM) is slower than the Java virtual machine (JVM)
- Specifically, the bytecode in the PVM takes longer to execute than the bytecode in the JVM
Differentiating between Compilers and Interpreters
-
An interpreter is a program that executes a given language to receive some desired output
- The program is a function, the language is our input, and some expected outcome is our output
- This program is typically machine code
- This language can be bytecode, other machine code, etc.
-
This language can be high-level or low-level
source language, input --> | interpreter | --> output
-
A compiler is a program that translates a source language into a destination language
- The program is a function, the source language is our input, and the destination language is our output
- This program can be written in many different languages
- This source language is usually some higher-level language
-
This destination language is usually some lower-level equivalent
source language --> | compiler | --> destination language
- Roughly, a compiler is nothing more than a language translator
- Roughly, an interpreter is just a CPU emulator
- A CPU is an interpreter of machine code
Compilation and Interpretation of CPython
-
A CPython compiler translates source code into CPython bytecode
- The source code is a .py file
- The bytecode is a .pyc file
- The CPython compiler is written in Python and C
- The bytecode is cached in a pycache folder
- The program will run this bytecode unless any changes are made to the program
-
A CPython interpreter executes bytecode in a Python virtual machine (PVM)
- The interpreter is a precompiled C program
- Meaning, the interpreter is machine code
- The bytecode is read in to the interpreter similar to how a text file is read in C
- Meaning, the Python program is never actually converted into machine code
- Instead, the machine code (i.e. interpreter) executes the Python program (as bytecode)
- Thus, the machine code (i.e. interpreter) returns the desired output of the program
Comparing the PVM and JVM
- During run-time, the bytecode is interpreted by a JVM interpreter within the JVM
- Before interpretation, a JIT compiler compiles the bytecode into machine code within the JVM
- Unlike Python, Java is able to do this because Java is statically-typed
- Therefore, type checking has already happened during compile-time
-
Returning to the illustration of an interpreter, the JVM interpreter looks like the following:
bytecode, input --> | JVM interpreter | --> output
- Since the data types of input are known, a JIT compiler can be used at run-time
-
Therefore, the JVM compilation process looks like the following:
bytecode --> | JIT compiler | --> machine code
-
With the addition of the JIT compiler at run-time, the JVM interpreter now looks like this:
machine code, input --> | JVM interpreter | --> output
- Bytecode is platform dependent
- Machine code is platform dependent
- Specifically, there is different machine code for different processors
- This is why the JIT compiler exists within the JVM and can't be compiled beforehand
Highlighting Differences in Compile and Run Time
-
At compile time:
- Language syntax is checked
- Data types are checked for statically-typed languages
-
At run time:
- Computations such as addition, division, etc.
- Data types are checked for dynamically-typed languages
Summarizing Static and Dynamic Languages
- Generally, a dynamically-typed language executes many common programming behaviors at runtime
- A statically-typed language is able to execute these behaviors at compile time
- This is because statically-typed languages give the compiler much more information (e.g. variable types, etc.)
- Specifically, the compiler has information about the structure of the program and its data
- With this information, the compiler will be able to optimize both memory access and computations
- As a result, statically-typed languages are generally faster than dynamically-typed languages
Challenges of Writing Compilers for Python
- Essentially, the bytecode of a statically-typed language will run faster compared to the bytecode of a dynamically-typed language
-
This is because bytecode of statically-typed languages still need to determine information like variable types
- Statically-typed languages have already done this
- Dynamically-typed languages need to do this because a user could pass a variable as a list, integer, etc. at runtime
-
To effectively compile dynamically-typed languages:
- Enforce a static structure of data
- Infer the types of all variables, classes, etc.
- A compiler of a dynamically-typed language could enforce the above conditions
- However, implementing these additional checks and inferences leads to larger bytecode
- Meaning, running the bytecode becomes slower
References
- Python Essential Reference
- Python in a Nuteshell
- Understanding Differences between Compilers
- Python as an Interpreted Language
- Differences between Python and Java Bytecode
- Compilation Strategies of Python Implementation
- Compiled and Interpreted Languages
- Why isn't there a Python Compiler?
- How Python Bytecode Runs in CPython
- Describing the JIT Compiler
- Difference between Compilers and Interpreters
- Confusion between Compilers and Interpreters
- Lecture Notes about Compilers
- Definitions of Compilers and Interpreters
- Interpreters and Machine Code
- Runtime and Compile Time
- Challenges of Compilers for Dynamically Typed Languages