An APL workspace as (Relational) DATABASE, and RDBMS, Codd's Rules

These notes are based on articles by Soop and King and on Vaughan-Nichols' interpretation of Codd's Rules. The first part gives a general discussion, following Soop, of using an APL workspace as a database. The second deals with a Relational Data Base Management System (RDBMS) and Codd's rules. The third part discusses using dimensions to represent database attributes.

To view the APL functions and other sections that include APL characters, use the MYSYS version by entering the following from a MYSYS or MYS2 workspace: LESSON'x19'

Use the links below to select a section of this note, or you may scroll through the whole text.

An APL ws as a database, parsing and tokens

The following uses information from APL84, p303,
"Can an APL workspace be used as a data base?; Karl Soop; IBM Nordic Laboratory, Lidingö, Sweden."

Larger workspaces make using APL as a DATA BASE practical for some applications. The workspace size now depends on internal memory. The practicality of using a workspace depends on the available hardware and the application. Many techniques are available. Some major advantages of an APL ws are:

  1. Ability to accommodate many different structures.
  2. Random access to a much higher degree than possible even with database systems that are said to support "direct access".
  3. Ability to store data together with manipulative functions.

Aspects of no. 3 above:

An application can be designed by defining all data as APL variables. This quickly leads to the point where the complexity of the manipulative functions impedes further expansion. The real limit is the size of the symbol table.

A more thoughtful design homogenizes the data and arranges them so that there are fewer variables. A technique described below represents an entire data base by two variables: V, a vocabulary; and TR, a ternary data relation array.

Traditional file and data-base systems often store data verbatim. The files are often rigidly structured into records and fields, whose lengths and roles are not easily changed.

Systems based on relational support tend to encode the data. Each data word is encoded into a "token" of relatively small and fixed size. Tokenizing data may be worthwhile for compaction only.

Arrays can represent relations: a matrix of numeric tokens of shape J K can represent a relation of degree K with J tuples. The name of the matrix is the name of the relation.
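
As a minimal sketch of this idea (in Python rather than APL, with made-up data and the hypothetical names V, tokenize and PARTS), a relation of degree 3 with 4 tuples can be held as a 4 by 3 matrix of integer tokens, each token being the 1-based position of a word in a vocabulary:

```python
# Hypothetical sample data: (part, colour, city) tuples.
rows = [
    ("bolt", "red", "oslo"),
    ("nut", "red", "lund"),
    ("bolt", "blue", "lund"),
    ("cam", "blue", "oslo"),
]

V = []  # vocabulary: token n stands for V[n - 1]

def tokenize(word):
    """Return the 1-based token of word, adding it to V if it is new."""
    if word not in V:
        V.append(word)
    return V.index(word) + 1

# The relation: J rows (tuples) by K columns (the degree).
PARTS = [[tokenize(w) for w in row] for row in rows]
J, K = len(PARTS), len(PARTS[0])

print(PARTS)   # [[1, 2, 3], [4, 2, 5], [1, 6, 5], [7, 6, 3]]
print(J, K)    # 4 3

# Selection works on tokens alone: all tuples whose colour is "red".
red = tokenize("red")
red_rows = [r for r in PARTS if r[1] == red]
print(red_rows)  # [[1, 2, 3], [4, 2, 5]]
```

The words themselves are stored once, in V; the relation holds only small integers.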

Facts and issues about relational databases:

Tokens are codes for data. Many schemes are possible. The atomic vector (av) is one such scheme: it stores characters and fns as 8-bit binary numbers.

The token scheme for data depends on the range of different values. APL, in common with most computer environments, has an underlying coding scheme. It is well to examine the details of that scheme and the data before choosing a tokenizing scheme.

Some basics:

 1 byte = 8 bits = 1 character (256 variations in av); dr = 82
 1 small integer = 2 bytes (2*16-1); dr = 163  (-32768 0 32767)
 1 large integer = 8 bytes (2*32-1); dr = 645
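
The one-byte and two-byte figures above can be checked with a few lines of Python, used here only as a neutral illustration. The sketch assumes that the bracketed expressions are meant to be read as APL, which evaluates right to left, so that 2*16-1 means 2 raised to the power 15.

```python
import struct

# One character per byte: 256 possible values.
char_values = len(bytes(range(256)))
print(char_values)  # 256

# A 2-byte signed integer spans -32768 .. 32767.
assert struct.calcsize("<h") == 2
small_int_range = (-2 ** 15, 2 ** 15 - 1)
for limit in small_int_range:
    # Both limits survive a round trip through 2 bytes.
    assert struct.unpack("<h", struct.pack("<h", limit))[0] == limit
print(small_int_range)  # (-32768, 32767)

# APL reading of 2*16-1: the subtraction is done first, then the power.
apl_2_star_16_minus_1 = 2 ** (16 - 1)
print(apl_2_star_16_minus_1)  # 32768
```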

 The mapping between data item and token must be 1 to 1
 Conversion must be reliable both ways
 The upper limit must not be less than the no. of data items
 Ideally there should be no upper limit on the no. of data items
 Conversion should be efficient (especially for short words)

 Data items can be tokenized in a variety of ways:
 a. The index no. of a list of data items
 b. Direct coding from an alphabet
 c. Preferred tokens which are a combination of a. & b.
 
A version of "a", the index of a list, can be built from functions that use the power of SS. It should be noted that numbers less than 65535 (one less than 2 to the power 16) occupy only 2 bytes and are space-efficient tokens.

The functions 'ID' and 'WD', which use SS, have been developed and are used for compacting data, etc. (originally for PCS data analysis).

 ra ID w;A
  ra ID w; Loc. w in a, append to a if not found; amsv91/2/4
 Aa  w'',(uc(w' ')/w),''
 (0r(A SS w)/+\A SS '')/0  a,'',a,',1','w'  Aa  LC

 ra WD w
 r1(w=+\a SS '')/a  Word in a at location w; amsv91/2/4
 
ID determines the token of a word; if the word is not found, it is appended to the list and the token is determined from the extended list. It should be noted that the list is a segmented string with '' separators. The separator character may be revised in the functions as desired, but should be a character which will not appear otherwise in the list.

WD retrieves a word from the list given its token. These fns have not been used extensively and are probably subject to improvement. Since two bytes are used to store each token, the range of the tkn can be extended by using the negative as well as the positive 2-byte integers (i.e. by subtracting 2*15).
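
The behaviour described for ID and WD can be sketched in Python as follows. This is an illustration of the idea, not a translation of the APL functions: the separator "|", the names ident, wd, pack16 and unpack16, and the choice to return the updated list rather than change it in place are all assumptions of the sketch.

```python
from array import array

SEP = "|"  # assumed separator; any character that never occurs in a word

def ident(seg, word):
    """Return (seg, token); append word to seg if it is not already present."""
    words = seg.split(SEP)[1:]          # seg looks like "|NOW|IS|THE"
    if word in words:
        return seg, words.index(word) + 1
    return seg + SEP + word, len(words) + 1

def wd(seg, token):
    """Return the word stored at 1-based position token."""
    return seg.split(SEP)[token]

OFFSET = 2 ** 15  # 32768

def pack16(tokens):
    """Store tokens 1..65535 as signed 2-byte integers by subtracting 2**15."""
    return array("h", [t - OFFSET for t in tokens])

def unpack16(packed):
    """Recover the original tokens by adding the offset back."""
    return [v + OFFSET for v in packed]

seg = ""
tokens = []
for w in "NOW IS THE TIME FOR NOW".split():
    seg, t = ident(seg, w)
    tokens.append(t)

print(seg)          # |NOW|IS|THE|TIME|FOR
print(tokens)       # [1, 2, 3, 4, 5, 1]
print(wd(seg, 4))   # TIME
```

The repeated word NOW receives its earlier token, so the list grows only when a new word is met.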

An improvement in ID and WD might be coding two-character words directly as characters. This would tend to reduce the list size but would require tests for data representation with dr.

Lists can be arranged as simple arrays, and indexing of the first dimension can provide a token. This arrangement is not as space efficient as the one noted above but is a more common method of building table-driven systems. There are many versions of this approach, including the various translation fns of PCSIF.

Many schemes can be used to code and compact lists. A simple compaction of text is to make it all one case, as shown below using the fn uc. The list is parsed with SStoMAT to make it an array. Blanks are used as separators.

 SStoMAT ' ',uc 'now is the TIME for Washington and WAR'
 NOW IS THE TIME FOR WASHINGTON AND WAR
 8 10
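
A Python sketch of the same steps, assuming (from the output shown above) that uc upper-cases the text and that SStoMAT splits it on blanks and pads each word to a common width. The names ss_to_mat and row_of are hypothetical.

```python
def ss_to_mat(text):
    """Upper-case text, split it on blanks, and pad the words to equal width."""
    words = text.upper().split()
    width = max((len(w) for w in words), default=0)
    return [w.ljust(width) for w in words]

def row_of(mat, word):
    """1-based row number of word in mat; the row number serves as the token."""
    return mat.index(word.upper().ljust(len(mat[0]))) + 1

mat = ss_to_mat("now is the TIME for Washington and WAR")
for r in mat:
    print(repr(r))
print(len(mat), len(mat[0]))   # 8 10
print(row_of(mat, "time"))     # 4
```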

 ra AT w
 ra[w;]  retrieval from array a at loc. w; amsv91/2/3

 ra ROW w;A;j;x
  rowno. of w in list a (must be entered in char.); ams920921
 L1:j1Aa  x1A  (0r(((,A)SS jw)/xj)j)/0
 a,'',a,',[1]j''',w,''''  L1
 

A design for a token scheme

The preferred token scheme described in Soop's paper and below uses numbers in the range from 1 to 2*31 which are coded according to a character set alf and vocabulary VOC.

The scheme consists of coding up to 4 characters directly and indexing strings longer than 4 characters to a vocabulary consisting of 4-character codes and addresses.

The AMS version of the preferred token scheme variables and functions described in Soop's paper follow:

The character set is named alf (a set which includes all characters available from a Zenith SupersPort 286 using the APL unified keyboard; blank is no. 1). It is as follows:

  1234567890-=qwertyuiop[]asdfghjkl;'zxcvbnm,./\
  !@#$%^&*()_+QWERTYUIOP{}ASDFGHJKL:"ZXCVBNM<>?|~
  
  

 The vocabulary VOC, an n by 2 matrix, a few rows of which appear as:

 145043621 177680201
 216924417         1
 144685242         2
 184684441 272648601

 The above vocabulary (and tokens) were produced by:
      TOKEN 'the transportation problem'
 145243201 3 4
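
The figures above can be reproduced with a short Python sketch. It rests on assumptions inferred from the token values listed in these notes rather than from the APL source: each group of up to 4 characters is read as a base-200 number whose digits are 1-based positions in alf, and a word longer than 4 characters is stored as a chain of VOC rows, each holding a 4-character code and either the code of the final group or the row number of the rest of the word. Only the unshifted row of alf is used, and the names ALF, code4 and token are hypothetical.

```python
# Unshifted row of alf only; blank is character no. 1.
ALF = " 1234567890-=qwertyuiop[]asdfghjkl;'zxcvbnm,./\\"
BASE = 200

def code4(chunk):
    """Encode up to 4 characters as one integer, base 200, blank-padded."""
    n = 0
    for ch in chunk.ljust(4):
        n = n * BASE + ALF.index(ch) + 1   # 1-based position in ALF
    return n

def token(word, voc):
    """Token for word; words over 4 characters are chained through voc."""
    chunks = [code4(word[i:i + 4]) for i in range(0, len(word), 4)]
    link = chunks.pop()                    # code of the last group
    while chunks:
        row = (chunks.pop(), link)         # (4-character code, code or address)
        if row not in voc:
            voc.append(row)
        link = voc.index(row) + 1          # 1-based row address
    return link

VOC = []
tokens = [token(w, VOC) for w in "the transportation problem".split()]
print(tokens)
for row in VOC:
    print(row)
```

Run as written, this gives the tokens 145243201 3 4 and the four VOC rows listed above: 'transportation' is stored as tran, spor, tati, on, built from the end of the word backwards.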

 PTOKEN w
 Ptkn prs w  tokens for w; amsv89/1/12

 rtkn w;a;I;J;K;L;N
  tokens fm word array w; ref APL84, p303, Soop; ams890112
 r200alf((j1w),4)w
 (^/~I(w 0 4 w).' ')/0  a(I/r),[1.5]tkn Iw
 Ja i VOC  NJ>1VOC  LK=K1++/^\a.aNa
 J[N/J](1VOC)+K-(+\~L)[K]  r[I/r]J  VOCVOC,[1]La

 rprs w
  rwords fm text w; ams890112
 (2=w)/'wTCNL MATtoSS w,[w]TCNL'
 rSStoMAT 1(~w SS ' ')/w'',w,' '

 Conversion from tokens to words uses the fns wd and cfp:

 rwd w;I;B
 rcfp w  words fm tokens; amsv89/1/12
 r(' ',alf,'')[1+(4200)r]  r(1 4 1r) 3 1 2 r
 I(2(w), 1 1),(1+1r)  B,~^I0,^\' '=r  rr,' '
 rB/(I[1],/I[2 3])r  0w  r,r

 rcfp w;J
 r(,w).+,0  c token array fm mix; amsv89/1/12
 (^/~Jw1VOC)/0
 r[J/w;]VOC[J/w;1]  rr,Jcfp VOC[J/w;2]

      testTOKEN 'president BUSH is on his way to WASHINGTON'
 7 706694878 169080201 177680201 248845401 121043801 144880201 8

      exb , wd test
 president BUSH is on his way to WASHINGTON.
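
The reverse conversion can be sketched in Python under an assumption inferred from the values in these notes: a token no larger than the number of VOC rows is a row address, and anything else is a base-200 code of 4 characters. The VOC rows used are the four listed earlier, only the unshifted row of alf is covered, and the names decode4 and word are hypothetical.

```python
ALF = " 1234567890-=qwertyuiop[]asdfghjkl;'zxcvbnm,./\\"
BASE = 200

# The four vocabulary rows listed earlier in these notes.
VOC = [
    (145043621, 177680201),
    (216924417, 1),
    (144685242, 2),
    (184684441, 272648601),
]

def decode4(code):
    """Turn one base-200 code back into its 4 characters."""
    chars = []
    for _ in range(4):
        code, digit = divmod(code, BASE)
        chars.append(ALF[digit - 1])       # digits are 1-based positions
    return "".join(reversed(chars))

def word(tok, voc):
    """Follow row addresses through voc until a direct code is reached."""
    parts = []
    while tok <= len(voc):
        head, tok = voc[tok - 1]
        parts.append(decode4(head))
    parts.append(decode4(tok))
    return "".join(parts).rstrip()

words = [word(t, VOC) for t in (145243201, 3, 4)]
print(words)   # ['the', 'transportation', 'problem']
```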
 
End of discussion based on Soop's paper.

________________________________________________________________

Notes on a Relational Data Base Management System (Codd's Rules)

Ref: BYTE, Dec 1990, p320 "A RDBMS must meet Codd's rules, or it isn't a relational database manager." article by Stephen J. Vaughan-Nichols

"The fundamental rule of an RDBMS is that all information must be manageable entirely through relational means. On the logical level, then, everything in a relational database must be represented by values in tables.... In a true RDBMS the database description, or catalog, must be contained in tables and controlled by the data-manipulation language."

"Null values represent missing or inapplicable data... nulls are [not] the same thing as empty or blank fields or the concept of zero." The nearest APL analogues are the empty numeric vector and the empty character vector ''; 0 can also be used. These can be used to create empty arrays, e.g. an array of shape 1 0 is an empty one-row array.

SQL, or "Structured Query Language", is the most popular relational language. It "identifies primary record keys by the combination of their unique identities and by not being null."
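
As a rough illustration of two points quoted above (a catalog held in tables, and primary keys that are unique and not null), here is a minimal sketch using SQLite through Python's standard sqlite3 module. The table and column names are made up for the example, and SQLite is only a convenient stand-in, not one of the products discussed in the article.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staff (emp_no TEXT PRIMARY KEY NOT NULL, name TEXT)")
con.execute("INSERT INTO staff VALUES ('E1', 'Ada')")

# The catalog is itself a table and is read with an ordinary SELECT.
catalog = con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(catalog)                    # [('staff',)]

def try_insert(emp_no, name):
    """Attempt an insert and report whether the key was accepted."""
    try:
        con.execute("INSERT INTO staff VALUES (?, ?)", (emp_no, name))
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

print(try_insert("E2", "Bob"))    # accepted: new, non-null key
print(try_insert("E1", "Cy"))     # rejected: duplicate key
print(try_insert(None, "Di"))     # rejected: null key
```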

In theory an RDBMS must meet the rules set out by Edgar F. Codd, who introduced the relational model at IBM in 1970 and published the rules in 1985.

"An RDBMS is a fundamental improvement over most DBMS's because you can add, delete, or change data throughout an entire database by treating it as a single set. Ordinary database managers require record-by-record updates that can drastically slow performance."

Ingres, Oracle, R:base 3.0, and IBM's mainframe-driven DB2 all attempt to meet the demands of the RDBMS model.
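
The difference between set-at-a-time and record-by-record updating described above can be sketched with SQLite through Python's standard sqlite3 module. The table, its columns and its contents are made up for the illustration, and no claim is made here about relative speed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stock (item TEXT, region TEXT, price REAL)")
con.executemany(
    "INSERT INTO stock VALUES (?, ?, ?)",
    [("bolt", "east", 10.0), ("nut", "east", 4.0), ("cam", "west", 50.0)],
)

# Set-at-a-time: one statement changes every qualifying row.
cur = con.execute("UPDATE stock SET price = price * 2 WHERE region = 'east'")
changed = cur.rowcount
print(changed)   # 2

# Record-by-record: fetch the rows, then update them one at a time.
west = con.execute(
    "SELECT rowid, price FROM stock WHERE region = 'west'"
).fetchall()
for rowid, price in west:
    con.execute("UPDATE stock SET price = ? WHERE rowid = ?", (price * 2, rowid))

prices = [p for (p,) in con.execute("SELECT price FROM stock ORDER BY item")]
print(prices)    # [20.0, 100.0, 8.0]
```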

________________________________________________________________

APL workspaces, Multidimensional Data Bases, and DIMENSIONS to represent Data Base Attributes

Ref: King, Dan M. (Gulf Canada Ltd, Toronto); 'Using DIMENSIONS to represent database Attributes'; 1985 ACM 0-89791-157-1/85/005/0151

For some applications a particular data element may be related to several locators. In the example given by King, the inventory quantity was categorized by Type, Year, Month, Region, Terminal, Activity, and Product. Each of these was represented by a separate array dimension.

More than three dimensions are difficult for application developers to visualize. This, however, was overcome by making a simple list of the data object's dimensions and what each dimension means. A transpose can then be used to reorder the dimensions, with the list reordered to match.
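
A minimal Python sketch of keeping such a list beside the data and reordering both together. Three of King's attributes are used to keep it short; the values, the dictionary-of-index-tuples representation and the name transpose are assumptions of the sketch, standing in for an APL array and its dyadic transpose.

```python
# The list of dimensions, in the order the data is indexed.
dims = ["Year", "Region", "Product"]

# Hypothetical quantities, keyed by (year, region, product) index.
data = {
    (y, r, p): 100 * y + 10 * r + p
    for y in range(2)
    for r in range(3)
    for p in range(2)
}

def transpose(dims, data, order):
    """Reorder the dimension list and every index tuple in the same way."""
    new_dims = [dims[i] for i in order]
    new_data = {tuple(key[i] for i in order): v for key, v in data.items()}
    return new_dims, new_data

new_dims, new_data = transpose(dims, data, [2, 0, 1])
print(new_dims)                # ['Product', 'Year', 'Region']
print(data[(0, 2, 1)])         # 21
print(new_data[(1, 0, 2)])     # 21, the same item under the new ordering
```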

Using dimensions to represent attributes gives the following benefits:

The implementation described by King used only three dimensions and the remaining were carried by the data item in an ordered set forming a vector. The actual arrangement was determined on the basis of which attributes were most frequently used. A number of special functions were used to deal with posting, retrieving, etc. A reporting function produced the desired views as required.
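
One way to picture this arrangement is sketched below in Python. Which three attributes King chose as dimensions is not stated in these notes, so the split used here is an assumption, as are the sample data and the names post and retrieve.

```python
from collections import defaultdict

ARRAY_DIMS = ("Year", "Month", "Product")                 # assumed choice
ITEM_ATTRS = ("Type", "Region", "Terminal", "Activity")   # carried with each item

# (year, month, product) -> list of (attribute vector, quantity)
store = defaultdict(list)

def post(year, month, product, attrs, quantity):
    """Record a quantity under three dimensions plus an ordered attribute vector."""
    store[(year, month, product)].append((tuple(attrs), quantity))

def retrieve(year, month, product, **wanted):
    """Total the quantities in one cell whose attributes match those wanted."""
    total = 0
    for attrs, quantity in store.get((year, month, product), []):
        named = dict(zip(ITEM_ATTRS, attrs))
        if all(named[k] == v for k, v in wanted.items()):
            total += quantity
    return total

post(1985, 5, "diesel", ("actual", "east", "T1", "receipt"), 400)
post(1985, 5, "diesel", ("actual", "west", "T7", "receipt"), 250)
post(1985, 5, "diesel", ("plan", "east", "T1", "receipt"), 500)

print(retrieve(1985, 5, "diesel"))                                 # 1150
print(retrieve(1985, 5, "diesel", Type="actual"))                  # 650
print(retrieve(1985, 5, "diesel", Type="actual", Region="east"))   # 400
```

The three most frequently used attributes select a cell directly; the rest are matched by scanning the items in that cell.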

The use of a pure dimensional system would depend on the size of the database and the frequency of use. Undoubtedly such a system would be easy to manipulate using the APL primitives.

Links:

  1. An APL ws as a database, parsing and tokens
  2. Facts and issues about a relational database
  3. A design of a token scheme
  4. RDBMS (Relational Database Management System) and Codd's Rules.
  5. Using Dimensions as database attributes

End to date, ams 971204, END MYSYS version, ams950123