We slightly edited this XML instance for the sake of readability.

Similarly, we annotated symbols (element symb) that often serve as abbreviations (e.g. the symbol standing for the word at), but that can be used as punctuation marks as well. Smileys, which are typical of SMSs, are annotated too (element binet). First names (element prenom), family names (element nom), numbers (element numero), as well as email addresses (element mail), URLs (element web) and postal addresses (element addresse) are marked as well, in order to facilitate the anonymisation of the messages. Also, parts of text messages that are in another language are marked by the bloc lang element, along with the language recognized (attribute langue).

Different kinds of errors are encoded with different element types. Typos are marked by the element coquille. Syntactical errors are annotated as well. Also, the accord element indicates an agreement error, e.g. Dit-lui for dis-lui (tell him). We also annotated each error involving casing (element majus).

This means that only one element [...]. For instance, a form meaning as soon as possible is annotated as an English block of text (element bloc lang) in our database, but not as an abbreviation. Sometimes, the attribute comment is used to document such situations. Last, there were some forms we could not transcribe. We tagged with forme inconnue and element inconnu each such word-form and symbol.

3. The database

The texto4science database takes the form of a tar file composed of three files: the database of SMSs, the survey, as well as a set of tools that facilitate the treatment of both databases.

The Database of SMSs

At the time of writing, we treated a total of 7 [...] text messages. We received 5. [...] Those SMSs are part of the database, but did not receive any annotation apart from the indication that they are written in a language other than French.

Table 1: Counts of the 21 annotation types in our database. Types listed: ponc, rire, ortho, coquille, abrev, symb, synt, forme inconnue, majus, nom, binet, element inconnu (28), accord, numero (21), negat [...].

Clearly, the SMSs we received are characterized by a high rate of missing punctuation (ponc), a large number of spelling errors (ortho) and a lot of abbreviations. Slightly more than 35 [...] annotations are present in the current version of the database, that is, roughly 5 annotations per message.
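Per-type annotation counts and the average number of annotations per message can be recomputed directly from an XML file. The following is a minimal sketch in Python, not the project's own tooling. It assumes a hypothetical layout in which each message is an `sms` element whose descendant elements are the annotations (abrev, ortho, binet, ...); the sample data are invented and the actual file may be organized differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: the real schema of the texto4science file may differ.
SAMPLE = """<corpus>
  <sms id="1">slt <abrev>pr</abrev> dem1 <binet>:)</binet></sms>
  <sms id="2"><ortho>sa</ortho> va <abrev>ke</abrev> <abrev>pr</abrev> toi</sms>
</corpus>"""


def annotation_stats(xml_text, message_tag="sms"):
    """Return (counts per annotation element, mean annotations per message)."""
    root = ET.fromstring(xml_text)
    messages = root.findall(f".//{message_tag}")
    counts = Counter()
    for msg in messages:
        # Every descendant element of a message is treated as one annotation.
        for elem in msg.iter():
            if elem is not msg:
                counts[elem.tag] += 1
    mean = sum(counts.values()) / len(messages) if messages else 0.0
    return counts, mean


counts, mean = annotation_stats(SAMPLE)
print(counts.most_common(), mean)
```

Counting every descendant element is a deliberate simplification: if the real markup nests structural elements inside a message, those would have to be filtered out by name.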
The main characteristics of the database are reported in Table 2. [...] shorter texts, due in part to the length limitation applied by [...].

Table 2: [...]. Columns: SMS, normalized. Rows: number of tokens (90 [...]), number of types (11 [...], 9 [...]), number of hapax (7 [...], 5 [...]), char, word [...].

As we have shown, smileys (element binet) have been annotated. We tagged a total of 98 different smileys in our corpus; this is much less than the ones observed [...]. The 3 most frequent ones are: [...]. In the same vein, we annotated laughing marks in our database (element rire), for a total of [...] different forms, many being variants of the same form (e.g. hahahaha, hahah), others being more surprising [...].

Perhaps one characteristic of our database is the high [...] of the number of annotations per message. It is interesting to note that only [...] of them (3. [...]) [...]. On the other hand, one message received no less than 86 annotations. It is reported in Figure 4, along with its transcription. As we can see, this is a rather long message [...]. In the current version of the [...].

One possible use of an annotated corpus consists in compiling a dictionary dedicated to SMSs. Some are already available, such as www. [...] out of an SMS corpus. Building such a dictionary from our database is simply a matter of querying abrev elements. As an illustration of this, we collected thanks to an [...]. This is reported in Table 3.

Table 3: [...]. Columns: form, freq, top-3 abbreviations.
que: [...] ke [...], q 13 [...]
pour: [...] pr [...]
avec: 92; ak 48, aek 22, ac 11

Table 4 provides the [...] (availability) and disponible (available), or the form 1, which stands both for une (a, feminine) and un (a, masculine). The most ambiguous abbreviation we found is txt, which is used for various morphologically derived forms of the word texte (text): [...]. We also noticed a great variability in common expressions [...]. Similarly, we found 6 different ways of writing the word demain (tomorrow): [...].

Table 4: [...]
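Querying abrev elements to compile a dictionary can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' extraction script: the sample XML is invented, and the `sms` wrapper and especially the `norm` attribute (taken here to hold the standard form of an abbreviation) are hypothetical names, since the actual attributes are not shown in this section.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical sample; `norm` (the standard form) is an assumed attribute name.
SAMPLE = """<corpus>
  <sms><abrev norm="avec">ak</abrev> toi</sms>
  <sms><abrev norm="avec">ak</abrev> <abrev norm="pour">pr</abrev></sms>
  <sms><abrev norm="avec">aek</abrev> moi</sms>
</corpus>"""


def abbreviation_dictionary(xml_text, top=3):
    """Map each standard form to its `top` most frequent abbreviations."""
    variants = defaultdict(Counter)
    for elem in ET.fromstring(xml_text).iter("abrev"):
        standard = elem.get("norm")
        written = (elem.text or "").strip().lower()
        if standard and written:
            variants[standard][written] += 1
    return {form: c.most_common(top) for form, c in variants.items()}


dictionary = abbreviation_dictionary(SAMPLE)
print(dictionary)
```

Inverting the mapping (from an abbreviation to all the standard forms it was used for) would be one way to surface ambiguous abbreviations such as txt.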
The survey

Volunteers who gave their SMSs to the texto4science project were invited to fill in a web form containing 23 questions. The answers provided are organized into an XML file which is part of the database. For obvious reasons, some information has been withdrawn from the database, such as phone numbers. Still, it is possible to cross this database with the one of SMSs, since contributors have been serialized similarly. However, we noticed that a third of the responders did provide a phone number different from [...].

The questions of the web form are grouped into 4 main categories:

[...]
Usage of SMSs: the way responders are using the SMS [...] the most messages are addressed, etc.
Abilities in writing SMSs: familiarity of responders with abbreviations and other codes frequent in text messages; their use of such idioms in their production; their tendency to mix several languages in a single SMS.
Technical device: kind of device subjects are mostly texting from (touchtone pad, QWERTY keyboard, tablets, etc.).

Table 5: Five questions regarding the use of SMSs. See [...]. Each responder could [...] the most appropriate option and a score of 5 to the least appropriate one. Options not relevant were marked as such. For [...], rank 1 is the number of responders that gave an option the first rank, and avg. [...].
Since when are you using SMSs? [...]
How many SMSs a week? [...]
Whom are you writing SMSs to? Options: fam, friend, lover, col, compet, other. resp. [...]; rank 1: 38, 12, 2, 4 [...]; avg. [...]
Why are you using SMSs? Options: tel, cost, info, app, contact, chat. resp. [...]; rank 1: 58, 55, 36, 47, 56 [...]; avg. [...]
Where are you composing or reading your SMSs? [...]

This analysis is articulated along the four dimensions aforementioned.

[...] database of SMSs. The responders are aged 27 on average; the youngest person was 12 years old, the oldest [...]. Most responders [...]. A few responders live outside Canada. We analyzed the distribution [...]. Clearly, many responders are located in Montreal, and a significant part are located downtown. This underlines the difficulty we had in motivating people to donate their SMSs to the project. We are currently [...].

Tu tdoute ke sa ma vrm fait de koi paske jtai pas recrit depuis alors ke shu kk1 ki oublie facilement dhabitude.. Avant kon passe Par-dessus jvx juste te dire ke si sa ma autant fait de koi c paske oui je le sais ke c vrm un probleme sam mnuit vrm d fois cpour sa ke jvx Vrm le regler, pis ski ma fait chier c ktu mdise sa au moment ou jtai dit ke javais fais d efort dernierement pis ksa sameliorait.. Mais cte Soir la jdevais pas feeler en partant, jc pas pk jy ai tellement pense.. Javais vrm envie pleurer pis dparler a kk [...]

Figure 4: The message with the highest number of annotations in the texto4science database. Forms in bold are typical of Quebec French.

The kind of device used for texting is likely evolving fast, and the impact this evolution has on the quality of the SMSs produced deserves some investigation.

[...] week only less than 5 messages received and sent. Only [...]. This is certainly related to [...] were judged irrelevant. [...] is rather high: [...]. All those figures contribute to indicate that responders in great part are sending SMSs to their friends or lovers, which is not entirely surprising.
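The resp., rank 1 and avg. rows of Table 5 can be derived from raw answers along the following lines. This is only a sketch with invented data. The input format (one dict per responder, mapping an option to a score from 1, most appropriate, to 5, least appropriate, with irrelevant options simply absent) is an assumption, and so is the reading of resp. as the number of responders who scored the option.

```python
from statistics import mean

# Invented answers: option -> score (1 = most appropriate, 5 = least).
# Options a responder marked as not relevant are simply absent.
ANSWERS = [
    {"friend": 1, "fam": 2, "lover": 3},
    {"friend": 1, "lover": 2},
    {"fam": 1, "friend": 2, "col": 4},
]


def rank_statistics(answers):
    """Per option: number of scorers, number of first ranks, average score."""
    scores = {}
    for answer in answers:
        for option, score in answer.items():
            scores.setdefault(option, []).append(score)
    return {
        option: {
            "resp": len(values),
            "rank1": values.count(1),
            "avg": round(mean(values), 2),
        }
        for option, values in scores.items()
    }


stats = rank_statistics(ANSWERS)
for option, row in stats.items():
    print(option, row)
```

Averaging only over the responders who scored an option keeps irrelevant options from dragging the average down, which is one plausible way to handle the "not relevant" marks.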
When asked about their motivations for using the SMS technology, most responders mentioned they are using it mainly as a replacement of emails and telephone calls (tel).

Code switching is a common practice among responders: [...] to another from time to time. Although English is the language they switch to most frequently, other languages are being used as well, among which Spanish, German, and Arabic are the most popular ones.

First, it is noticeable that our responders are not making a great use of dictionary facilities: [...] or only occasionally. We have noticed that several responders mentioned that typing accents on tablets is difficult (often, it requires switching the keyboard), and that typing with a QWERTY keyboard reduces the use [...].

Table [...]: Technical aspects of SMS writing. Which keyboard are you using? [...]

Discussion

We have presented an overview of the texto4science project and its database, which is freely available for download at URL: [...]. Facilities for navigating online through the database are currently being built and will be available as well.

We are currently [...]. First, we are developing two [...] to standard language dictionary. The second one normalizes SMSs according to a statistical translation engine we trained on the texto4science database. This statistical engine is hybridized with rules that are [...]. Also, we [...] appointments. Therefore we developed a system for recognizing appointments in SMSs and extracting their pertinent information.

[...] First, we want to extend the markup [...]. This would ease the comparison of the texto4science database with other databases for [...]. Preliminary investigations are indicating [...] specific annotations. Second, we are still receiving SMSs [...]. It is our intention to update the current database with those new messages, possibly by semi-[...] annotated. [...] annotation choices we would like to correct in future versions [...].

[...] who kindly provided us, at no cost, the platform we used for collecting our SMSs.
